Theses in Progress

Crosslingual Multi-label Text Classification

December 12, 2018 – ,

Multi-label classification is the task of assigning a set of labels from a fixed vocabulary to a data sample, e.g. an image, an audio clip, or a text. Multi-label text classification has been applied to a multitude of tasks, including document indexing, tag suggestion, and sentiment classification. However, many of the applied methods disregard word order, opting to use bag-of-words models or TF-IDF weighting to create document vectors. With the advent of powerful semantic embeddings, such as word2vec and GloVe, we want to investigate how word embeddings and word order can be used to improve multi-label classification. Word embeddings are one of the strongest trends in Natural Language Processing (NLP). They are a technique for learning semantically meaningful representations of words from local co-occurrences in sentences. The relative similarity between the vector representations of two words, as well as word order, can capture meaningful syntactic and semantic regularities.
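
To make the two ideas above concrete, the following minimal Python sketch shows a multi-hot encoding of a label set and how a bag-of-words representation discards word order. The label vocabulary, the example sentences, and the helper names multi_hot and bag_of_words are assumptions chosen for illustration; they are not part of the existing system.

```python
from collections import Counter

# Hypothetical fixed label vocabulary (illustration only).
LABELS = ["sports", "politics", "economy"]


def multi_hot(doc_labels, vocabulary=LABELS):
    """Encode a set of labels as a 0/1 vector over the fixed vocabulary."""
    return [1 if label in doc_labels else 0 for label in vocabulary]


def bag_of_words(text):
    """Count lower-cased whitespace tokens, discarding their positions."""
    return Counter(text.lower().split())


# Two sentences with different meanings but identical word counts.
sentence_a = "the dog bit the man"
sentence_b = "the man bit the dog"

# The bag-of-words vectors are identical, so a classifier that only
# sees them cannot tell the two sentences apart.
same_bag = bag_of_words(sentence_a) == bag_of_words(sentence_b)

# A document may carry several labels at once.
target = multi_hot({"sports", "economy"})
```

Here same_bag is True although the token sequences differ, which is the information loss that a sequence model is meant to avoid.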

Task description

In this work, we aim to develop or adapt machine learning methods (mainly deep learning) to improve multi-label text classification. By considering word order and word vector representations, a new feature space will be constructed. The task will be to extend the current system for multi-label text classification using the gated recurrent unit (GRU), which is one of the most remarkable deep learning structures for sequential data. This includes:

  1. Reviewing the literature and related work on cross-lingual text classification.

  2. Extending the current framework with additional techniques for data representation or new neural network structure to improve the classification performance.

  3. Analysing the feasibility of using deep learning structures with long texts and small datasets.
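
As a rough illustration of the kind of model the task refers to, the sketch below implements the forward pass of a single GRU layer over a sequence of word vectors, followed by one sigmoid output per label. It is a minimal sketch under stated assumptions, not the current system: the parameters are random and untrained, the three-word embedding table is invented, all function names are hypothetical, and the update convention h' = (1 - z) * h + z * candidate is one of two common formulations. A real implementation would rely on a deep learning framework and pretrained embeddings.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def init_params(embed_dim, hidden_dim, num_labels, seed=0):
    """Random, untrained parameters (assumption: for illustration only)."""
    rng = random.Random(seed)
    p = {}
    for gate in ("z", "r", "h"):  # update gate, reset gate, candidate state
        p["W" + gate] = rand_matrix(hidden_dim, embed_dim, rng)
        p["U" + gate] = rand_matrix(hidden_dim, hidden_dim, rng)
        p["b" + gate] = [0.0] * hidden_dim
    p["Wo"] = rand_matrix(num_labels, hidden_dim, rng)
    p["bo"] = [0.0] * num_labels
    return p


def gate(p, name, x, h, activation):
    """activation(W x + U h + b) for the gate called `name`."""
    wx = matvec(p["W" + name], x)
    uh = matvec(p["U" + name], h)
    return [activation(a + b + c) for a, b, c in zip(wx, uh, p["b" + name])]


def gru_step(p, x, h):
    """One GRU update: h' = (1 - z) * h + z * candidate."""
    z = gate(p, "z", x, h, sigmoid)
    r = gate(p, "r", x, h, sigmoid)
    reset_h = [ri * hi for ri, hi in zip(r, h)]
    cand = gate(p, "h", x, reset_h, math.tanh)
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def predict(p, embedded_tokens, threshold=0.5):
    """Read the sequence in order, then score every label independently."""
    h = [0.0] * len(p["bz"])
    for x in embedded_tokens:
        h = gru_step(p, x, h)
    scores = [sigmoid(a + b) for a, b in zip(matvec(p["Wo"], h), p["bo"])]
    return scores, [1 if s >= threshold else 0 for s in scores]


# Invented 2-dimensional "embeddings" (assumption: stand-ins for word2vec/GloVe).
EMBEDDINGS = {"dog": [0.9, 0.1], "bit": [-0.3, 0.8], "man": [0.2, -0.7]}

params = init_params(embed_dim=2, hidden_dim=4, num_labels=3)
forward = [EMBEDDINGS[w] for w in ["dog", "bit", "man"]]
scores_fwd, labels_fwd = predict(params, forward)
scores_rev, labels_rev = predict(params, list(reversed(forward)))
```

Because the hidden state is updated token by token, scores_fwd and scores_rev differ even though both inputs contain the same words. The independent sigmoid per label, instead of a softmax over all labels, is what allows several labels to be assigned to one document.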


The written report must contain an introduction to the topic and provide an overview of related work. Furthermore, the designed and implemented methods must be described and discussed. 


Good programming skills in a high-level language, mainly Python

Helpful: Previous experience in Natural Language Processing and Machine Learning

Beginning and duration

Immediately, duration 3-6 months (depending on the course)

Keywords: NLP, GRU, Machine Learning, Clustering, Deep Learning, Word Embeddings, crosslingual

Research Area(s): Knowledge & Educational Technologies

Tutor: Alkhatib,

Student: Luna Alrawas
