TaxWikiML.KOM - Hyponymy Corpora

TaxWiki.KOM Hyponymy Corpora - Manually annotated Subsets of the Wikipedia Category Graph in different languages

The TaxWiki.KOM hyponymy corpora were created to evaluate approaches that extract or recognize taxonomic relations (hyponymy) using the Wikipedia category graph. The corpora cover 5 languages: English, German, Spanish, Arabic and Russian. For each language, 1000 Wikipedia articles, together with their categories, were selected at random. The corpora were extracted using Wikipedia's export page and the following methodology:

  1. Get a random article a using the "Random page" link and add all links of a to its categories to the corpus (links are pairs of categories in the category graph).
  2. Choose a random category c of a and add all links from c to its super categories c_(s,i) to the corpus. As the corpora contain 1000 articles, we filter out categories that have more than 100 super categories in order to have enough articles and categories from different domains.
  3. Randomly choose a super category c_(s,j) among the c_(s,i) and add all links of c_(s,j) to the corpus.
  4. Repeat step 3, moving towards the top of the category graph, until the root category or an already visited category is reached.
  5. Go to step 1 until the corpus has 1000 articles.
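
The steps above can be sketched in Python as a random upward walk over the category graph. This is only an illustration under stated assumptions, not the extraction code used for the corpora: the function name `sample_corpus`, the two dictionaries standing in for the Wikipedia export data, the toy graph, and the reading of the 100-super-category filter (such a category is skipped and the walk stops there) are all hypothetical.

```python
import random


def sample_corpus(article_cats, super_cats, n_articles, root, max_supers=100, seed=0):
    """Collect (from, to) links by walking up a category graph.

    article_cats: article -> list of its categories (hypothetical input)
    super_cats:   category -> list of its super categories (hypothetical input)
    """
    rng = random.Random(seed)
    links = set()
    visited = set()
    # Steps 1 and 5: draw random articles until the target size is reached.
    articles = sorted(article_cats)
    chosen = rng.sample(articles, min(n_articles, len(articles)))
    for article in chosen:
        cats = article_cats[article]
        links.update((article, c) for c in cats)
        if not cats:
            continue
        # Step 2: pick one random category of the article.
        current = rng.choice(cats)
        # Steps 3 and 4: move up until the root or a visited category is reached.
        while current != root and current not in visited:
            visited.add(current)
            supers = super_cats.get(current, [])
            # Assumed reading of the filter: skip overly connected categories.
            if not supers or len(supers) > max_supers:
                break
            links.update((current, s) for s in supers)
            current = rng.choice(supers)
    return links, chosen


# Toy graph, invented for illustration only.
toy_article_cats = {"Spice Girls": ["Girl group"]}
toy_super_cats = {
    "Girl group": ["Musical group"],
    "Musical group": ["Music"],
    "Music": ["Root"],
}
toy_links, toy_articles = sample_corpus(
    toy_article_cats, toy_super_cats, n_articles=1, root="Root"
)
```

The resulting links are unlabeled; in the corpora, each link was then labeled manually.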

After building the corpus for each language, we labelled it manually with the relevant relations (hyponymy and non-hyponymy). The following table summarizes the size of the different corpora and the distribution of hyponymy and non-hyponymy links.

Number of hyponymy links        1293    786    808   1135   2545
Number of non-hyponymy links    3043   1388   1597   1604   3572
Number of labeled links         4336   2405   2405   2739   6297


The corpora are in CSV format. Each line contains two categories and a value, separated by commas. The first category is the subcategory and the second is the supercategory. The value is either 1 or 0 and gives the class of the link: hyponymy (1) or non-hyponymy (0). For example:

Girl group,Musical group,1


The first line in the file is a header with a short description of the columns:

(Article/Category)-From,Category-To,Label: ISA / NOT-ISA
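
A file in this format could be read roughly as follows. This is a minimal sketch: the function name `read_links` is hypothetical, the sample data is just the header and example line shown above, and it assumes that label 1 means ISA (as the example line suggests) and that any names containing commas are quoted.

```python
import csv
import io


def read_links(handle):
    """Return (subcategory, supercategory, is_hyponymy) tuples from a corpus file."""
    reader = csv.reader(handle)
    next(reader, None)  # skip the header line describing the columns
    links = []
    for row in reader:
        if len(row) != 3:
            continue  # ignore blank or malformed lines
        sub, sup, label = row
        # Assumption: 1 marks a hyponymy (ISA) link, 0 a non-hyponymy link.
        links.append((sub.strip(), sup.strip(), int(label) == 1))
    return links


# In-memory sample built from the header and example line of this page.
sample = io.StringIO(
    "(Article/Category)-From,Category-To,Label: ISA / NOT-ISA\n"
    "Girl group,Musical group,1\n"
)
links = read_links(sample)
n_hyponymy = sum(1 for _, _, is_a in links if is_a)
```

For a real corpus file, the same function can be called on a handle opened with `open(path, newline="", encoding="utf-8")`, where the path and the encoding are assumptions.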


Please find the corpus here.


For any feedback or questions please contact Renato Domínguez García