TaxWiki.KOM Hyponymy Corpora - Manually annotated Subsets of the Wikipedia Category Graph in different languages
The TaxWiki.KOM hyponymy corpora were created to evaluate approaches that extract or recognize taxonomic relations (hyponymy) using the Wikipedia category graph. The corpora were extracted in five languages: English, German, Spanish, Arabic and Russian. For each language, 1000 Wikipedia articles and categories were randomly selected. The corpora were extracted using Wikipedia's export page and the following methodology:
After building the corpus for each language, we labelled it manually with the relevant relations (hyponymy and non-hyponymy). Table 2 below summarizes the size of each corpus and the distribution of hyponymy and non-hyponymy links.
Language | English | Spanish | German | Arabic | Russian |
---|---|---|---|---|---|
Number of hyponymy links | 1293 | 786 | 808 | 1135 | 2545 |
Number of non-hyponymy links | 3043 | 1388 | 1597 | 1604 | 3572 |
Number of labeled links | 4336 | 2405 | 2405 | 2739 | 6297 |
Format
The corpora are in CSV format. Each line contains two categories and a value, separated by commas. The first category is the subcategory and the second is the supercategory. The value is either 1 or 0 and gives the class of the link: hyponymy (1) or non-hyponymy (0). Some examples:
Girl group,Musical group,1
History,Humanities,0
The first line of the file contains a short description of the columns:
(Article/Category)-From,Category-To,Label: ISA / NOT-ISA
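The format described above can be read with a few lines of Python. The following is a minimal sketch, not part of the corpus distribution: the function name `read_links` is hypothetical, and it assumes that every data line has exactly three comma-separated fields and that the header line is always present. The sample rows are the examples given in this section.

```python
import csv
import io
from collections import Counter


def read_links(handle):
    """Yield (source, target, is_hyponymy) tuples from a corpus CSV stream."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header line that describes the columns
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if len(row) != 3:
            # Assumption: exactly three fields per line (no unquoted commas in names).
            raise ValueError(f"expected 3 fields, got {len(row)}: {row!r}")
        source, target, label = (field.strip() for field in row)
        if label not in ("0", "1"):
            raise ValueError(f"unexpected label {label!r}")
        yield source, target, label == "1"  # 1 = hyponymy, 0 = non-hyponymy


# In-memory sample built from the header and example rows in this section.
sample = io.StringIO(
    "(Article/Category)-From,Category-To,Label: ISA / NOT-ISA\n"
    "Girl group,Musical group,1\n"
    "History,Humanities,0\n"
)

links = list(read_links(sample))
counts = Counter(is_hyponymy for _, _, is_hyponymy in links)
print("hyponymy:", counts[True], "non-hyponymy:", counts[False])
```

To read a downloaded corpus file instead of the in-memory sample, pass a handle opened with `open(path, newline="", encoding="utf-8")`; the UTF-8 encoding is an assumption and may need adjusting for a given language file.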
Download
Please find the corpus here.
Contact
For any feedback or questions please contact Renato Domínguez García