Schm280 - A Bilingual Corpus of Word Pairs Based on WordSim353
The Schm280 corpus was created to evaluate a new measure of semantic similarity across different languages.
During the state-of-the-art analysis we looked for a corpus consisting of English word pairs that are annotated with a value for their semantic relatedness and translated into German.
Since we did not find an appropriate corpus, we took the existing WordSim353 corpus by Finkelstein et al., which consists of 353 English word pairs annotated with their semantic relatedness (between 0 and 10), and translated it into German. The detailed translation process and the results are described below. Translation and clean-up led to 280 word pairs annotated with their semantic relatedness. This value can be interpreted as
- the semantic relatedness between the two English words
- the semantic relatedness between the two German words
- the semantic relatedness between one English and one German word
Process of translation
To create the corpus, we first had the WordSim353 corpus translated with the help of twelve volunteers. Each word pair was translated by three subjects, and each of those three was asked to use a different online translator. The German web pages dict.leo.org, dict.cc and Pons were used as online translators. By using several translators we made the translations independent of any single translator. To keep the volunteers concentrated on the translation task, we divided the set of word pairs into four parts and let each volunteer translate only one part, which took less time than translating all word pairs.
In general, we asked for consistent translations: if a word appeared several times with the same meaning, the same translated word was to be used in all of those pairs. The subjects were further asked to translate each word in its context (the other word of the pair) to achieve consistency between the German and the English word pair. Finally, we asked for translations that are not highly specific, but only as specific as needed in the given context.
After the translation process we cleaned the data by correcting misspellings and by standardizing different spellings of words. If a translation appeared in both singular and plural, the singular form was used. No further semantic standardization was applied.
For 165 of the 353 pairs (46.7%) all three volunteers chose the same translation pair. In 280 of the 353 cases (79.3%) at least two of the three subjects chose the same translation pair. Consequently, for the remaining 73 pairs (20.7%) all three subjects chose a different translation pair.
Whether translations from different subjects count as the same was determined by their syntactic similarity, not their semantic similarity. Synonyms were therefore not considered the same translation.
We decided to use only pairs for which the same translation was chosen by a two-thirds majority, that is, by at least two of the three subjects. Applying the criteria explained above, our corpus Schm280 contains 280 English word pairs with their German translations and a relatedness value that was determined by humans. Because words in a context represent concepts, and those concepts and their relatedness are language independent, we can assume that the relatedness of the English words can be transferred to their well-translated German equivalents.
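The two-thirds majority rule described above can be sketched in a few lines of Python. This is only an illustration and not the procedure actually used to build the corpus: the function name majority_translation, the min_votes parameter and the sample translations are assumptions made up for the example, and "same translation" is approximated by exact string equality in line with the syntactic comparison mentioned above.

```python
from collections import Counter

def majority_translation(candidates, min_votes=2):
    """Return the translation pair chosen by at least `min_votes`
    translators, or None if no pair reaches that many votes.

    `candidates` is a list of (german_word_1, german_word_2) tuples,
    one per translator. Pairs are compared by exact string equality,
    so synonyms count as different translations.
    """
    pair, votes = Counter(candidates).most_common(1)[0]
    return pair if votes >= min_votes else None

# Hypothetical translations from three volunteers per English pair
# (invented for illustration, not taken from the corpus).
translations = {
    ("tiger", "cat"): [("Tiger", "Katze"), ("Tiger", "Katze"), ("Tiger", "Katze")],
    ("stock", "market"): [("Aktie", "Markt"), ("Aktie", "Markt"), ("Bestand", "Markt")],
    ("bank", "money"): [("Bank", "Geld"), ("Ufer", "Geld"), ("Kreditinstitut", "Geld")],
}

# Keep only the English pairs whose translation reached the majority.
kept = {}
for english_pair, candidates in translations.items():
    chosen = majority_translation(candidates)
    if chosen is not None:
        kept[english_pair] = chosen
```

In this sketch the first two pairs are kept and the third is dropped, because its three translations all differ. Applied to the real translations, the same rule is what reduces the 353 pairs to 280.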
The corpus is provided as a tab-separated text file in which each line corresponds to one pair of words. Columns 1 and 2 contain the original English word pair, followed by the two translated German words and the value of the semantic relatedness between 0 and 10.
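As a minimal sketch of how such a file could be read, the following Python code parses tab-separated lines into records. It assumes a five-column layout (two English words, two German words, relatedness value), which is inferred from the description above and not verified against the file; the names PairEntry and read_pairs and the sample rows are made up for illustration.

```python
import csv
import io
from typing import NamedTuple

class PairEntry(NamedTuple):
    en1: str
    en2: str
    de1: str
    de2: str
    relatedness: float  # expected to lie between 0 and 10

def read_pairs(handle):
    """Parse tab-separated lines from an open text handle.

    Lines that do not have exactly five columns, or whose last column
    is not a number (for example a header line), are skipped.
    """
    entries = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 5:
            continue
        try:
            score = float(row[4])
        except ValueError:
            continue
        entries.append(PairEntry(row[0], row[1], row[2], row[3], score))
    return entries

# Illustrative sample rows, not copied from the corpus.
SAMPLE = "tiger\tcat\tTiger\tKatze\t7.5\n\nbook\tpaper\tBuch\tPapier\t7.0\n"
entries = read_pairs(io.StringIO(SAMPLE))
```

For the real file, the io.StringIO object would be replaced by a handle from open(path, encoding="utf-8", newline=""), where both the path and the encoding are assumptions to be checked against the downloaded file.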
Please find the corpus here.
For any feedback or questions, please contact Sebastian Schmidt.