Kommunikationsnetze
Multi-label Text Classification Using Semantic Features and Dimensionality Reduction with Autoencoders
Key: ARJ17-1
Author: Wael Alkhatib, Christoph Rensing, Johannes Silberbauer
Date: June 2017
Kind: In proceedings
Publisher: Springer, Cham
Book title: International Conference on Language, Data and Knowledge
Keywords: semantics; feature selection; dimensionality reduction; text classification; semantic relations; autoencoders.
Abstract: Feature selection is of vital concern in text classification to reduce the high dimensionality of the feature space. The wide range of statistical techniques that have been proposed for weighting and selecting features suffer from the loss of semantic relationships among concepts and ignore dependencies and ordering between adjacent words. In this work we propose two techniques for incorporating semantics in feature selection. Furthermore, we use autoencoders to transform the features into a reduced feature space in order to analyse the performance penalty of feature extraction. Our intensive experiments, using the EURlex dataset, showed that semantic-based feature selection techniques significantly outperform the Bag-of-Words (BOW) frequency-based feature selection method with term frequency/inverse document frequency (TF-IDF) for feature weighting. In addition, after an aggressive dimensionality reduction of the original features by a factor of 10, the autoencoders are still capable of producing better features compared to BOW with TF-IDF.
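For orientation, the BOW baseline with TF-IDF weighting that the abstract compares against can be sketched roughly as follows. This is a minimal illustration under assumptions, not the authors' implementation: the function names `tfidf` and `select_top_k`, the length-normalised term frequency, the unsmoothed logarithmic IDF, and the toy documents are all hypothetical choices made here. The semantic selection techniques and the autoencoder from the paper are not reproduced.

```python
import math
from collections import Counter, defaultdict


def tfidf(docs):
    """Return one {term: weight} dict per tokenised document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        weighted.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weighted


def select_top_k(weighted, k):
    """Keep the k terms with the highest TF-IDF weight summed over the corpus."""
    totals = defaultdict(float)
    for weights in weighted:
        for term, weight in weights.items():
            totals[term] += weight
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:k]


# Hypothetical toy corpus, already tokenised.
docs = [
    ["court", "ruling", "trade"],
    ["trade", "tariff", "trade"],
    ["court", "appeal"],
]
weights = tfidf(docs)
vocabulary = select_top_k(weights, 2)
```

A term that occurs in every document receives weight zero under this variant, which is why purely frequency-based selection tends to discard common terms regardless of their meaning.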
The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.