Word embeddings, also known as distributed word representations, are real-valued vectors learned to encode word meanings. They have been widely used in many Natural Language Processing (NLP) tasks, such as text classification, part-of-speech tagging, parsing, and machine translation. Text classification is the task of assigning input texts to categories based on their content, and word embedding methods have been tailored to it to improve performance.

In this research, two task-specific dependency-based word embedding methods are proposed for text classification. In contrast with universal word embedding methods that work for generic tasks, we design task-specific word embedding methods to offer better performance on a specific task. Our methods follow the PPMI matrix factorization framework and derive word contexts from the dependency parse tree. Compared with linear contexts, dependency-based contexts can capture long-range contexts and exclude less informative ones. One example is shown in Fig. 2, where the target word is ‘found’. Guided by the dependency parse tree, its closely related words (e.g. ‘he’, ‘dog’) can be easily identified. In contrast, less related words (e.g. ‘skinny’, ‘fragile’) are gathered by linear contexts.
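The contrast between the two context types can be sketched in plain Python. The parse below is written by hand for a sentence like the Fig. 2 example; in practice the arcs would come from a dependency parser (e.g. spaCy or Stanford CoreNLP), and the window size of 2 is an illustrative choice.

```python
# Hand-written dependency parse for a sentence like the Fig. 2 example.
sentence = ["he", "found", "a", "skinny", "fragile", "dog"]

# (dependent_index, head_index, relation) triples
arcs = [
    (0, 1, "nsubj"),   # he      <- found
    (2, 5, "det"),     # a       <- dog
    (3, 5, "amod"),    # skinny  <- dog
    (4, 5, "amod"),    # fragile <- dog
    (5, 1, "dobj"),    # dog     <- found
]

def linear_contexts(tokens, target, window=2):
    """Words inside a fixed-size window around the target index."""
    lo, hi = max(0, target - window), min(len(tokens), target + window + 1)
    return [tokens[i] for i in range(lo, hi) if i != target]

def dependency_contexts(tokens, arcs, target):
    """Direct neighbours of the target in the dependency tree."""
    ctx = []
    for dep, head, rel in arcs:
        if head == target:
            ctx.append(tokens[dep])   # children of the target
        elif dep == target:
            ctx.append(tokens[head])  # head of the target
    return ctx

print(linear_contexts(sentence, 1))             # ['he', 'a', 'skinny']
print(dependency_contexts(sentence, arcs, 1))   # ['he', 'dog']
```

For the target ‘found’, the linear window picks up the uninformative ‘a’ and ‘skinny’, while the dependency tree directly yields the related words ‘he’ and ‘dog’.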

Firstly, to construct robust and informative contexts, we use dependency relations, which represent a word’s syntactic function, to locate the keywords in a sentence, and we treat the keywords and their neighboring words in the dependency parse tree as contexts.
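One way this keyword-based context construction could look is sketched below. The keyword relation set `KEY_RELS` and the relation-typed context labels are illustrative assumptions, not the paper’s exact design.

```python
# Hypothetical sketch: locate keywords via their dependency relations and
# collect their tree neighbours, typed with the relation, as contexts.
sentence = ["he", "found", "a", "skinny", "fragile", "dog"]
arcs = [  # (dependent_index, head_index, relation), hand-written parse
    (0, 1, "nsubj"), (2, 5, "det"), (3, 5, "amod"),
    (4, 5, "amod"), (5, 1, "dobj"),
]

KEY_RELS = {"nsubj", "dobj"}  # illustrative: treat subjects/objects as keywords

def keyword_contexts(tokens, arcs):
    """Relation-typed contexts from each keyword's dependency-tree neighbours."""
    keywords = {dep for dep, head, rel in arcs if rel in KEY_RELS}
    contexts = {}
    for dep, head, rel in arcs:
        if head in keywords:   # child of a keyword, typed by its relation
            contexts.setdefault(tokens[head], []).append(f"{rel}_{tokens[dep]}")
        if dep in keywords:    # head of a keyword, marked as inverse relation
            contexts.setdefault(tokens[dep], []).append(f"{rel}I_{tokens[head]}")
    return contexts

print(keyword_contexts(sentence, arcs))
```

Typing contexts with the relation (e.g. `nsubjI_found` vs. plain `found`) keeps the syntactic function of each co-occurrence, at the cost of a larger context vocabulary.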

Secondly, to further increase text classification performance, we make our word embeddings learn from word-class as well as word-context co-occurrence statistics: we combine the word-context and word-class mutual information into a single matrix for factorization.
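A minimal sketch of this combined factorization, assuming PPMI weighting and truncated SVD as the factorizer (the counts, dimensions, and concatenation scheme here are toy assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, n_contexts, n_classes, dim = 20, 30, 3, 5

# Toy co-occurrence counts; in practice these come from the corpus
# (word-context) and from the labeled training set (word-class).
word_context = rng.integers(0, 10, (n_words, n_contexts)).astype(float)
word_class = rng.integers(0, 10, (n_words, n_classes)).astype(float)

def ppmi(counts):
    """Positive pointwise mutual information of a co-occurrence matrix."""
    total = counts.sum()
    p_xy = counts / total
    p_x = counts.sum(axis=1, keepdims=True) / total
    p_y = counts.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_xy / (p_x * p_y))
    return np.maximum(pmi, 0.0)  # clip negative/undefined PMI to zero

# Stack the two PPMI blocks column-wise, then keep the top singular vectors.
M = np.hstack([ppmi(word_context), ppmi(word_class)])
U, S, Vt = np.linalg.svd(M, full_matrices=False)
embeddings = U[:, :dim] * np.sqrt(S[:dim])  # one row per word

print(embeddings.shape)  # (20, 5)
```

Because the class columns sit in the same matrix as the context columns, the factorization pulls words with similar class distributions toward each other, which is exactly the signal a downstream classifier benefits from.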

Experimental results show that the proposed methods outperform several state-of-the-art word embedding methods.


Image credits:

The image showing a simple example algorithm framework for text classification is from https://laptrinhx.com/nlp-multiclass-text-classification-machine-learning-model-using-count-vector-bow-tf-idf-2622024659/