MCL Research on Word Embedding
Word embeddings have been widely applied across many NLP tasks. The goal of word embedding is to map words into vector representations that capture both syntactic and semantic information. General-purpose word embeddings are usually obtained by training on a large corpus, such as the full Wikipedia text.
Our first work focuses on post-processing trained word embedding models to make them more representative. The motivations are: (1) Even though current models are trained without assigning any particular order to individual dimensions, the obtained word embeddings usually carry a large mean, and most of the variance lies in the first several principal components. This can lead to the hubness problem, so we analyze these statistics in order to make the embedding space more isotropic. (2) The information carried by the order of the input sequence is lost because of the context-based training scheme. Based on the above analysis, we propose two post-processing methods for word embeddings: Post-processing via Variance Normalization (PVN) and Post-processing via Dynamic Embedding (PDE). The effectiveness of our methods is verified with both intrinsic and extrinsic evaluation methods. For details, please refer to [1].
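To illustrate the kind of statistics involved, the sketch below removes the mean of an embedding matrix and damps the variance of its leading principal components so the space becomes closer to isotropic. This is only a minimal illustration of variance normalization under our own assumptions (the function name, the choice of `d_top`, and the rescaling rule are ours), not the exact PVN algorithm described in [1].

```python
import numpy as np

def variance_normalize(embeddings, d_top=5):
    """Illustrative post-processing: center the embedding matrix and rescale
    the variance of the top d_top principal components down to the level of
    the (d_top+1)-th component. Rows are words, columns are dimensions."""
    # Remove the (typically large) mean of the embedding space.
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)

    # Principal directions and singular values of the centered matrix.
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)

    # Per-component standard deviations; shrink the dominant ones.
    sigma = s / np.sqrt(len(embeddings) - 1)
    scale = np.ones_like(sigma)
    scale[:d_top] = sigma[d_top] / sigma[:d_top]

    # Re-project with the flattened spectrum.
    return (U * (s * scale)) @ Vt

# Example usage on random stand-in vectors (e.g. 10k words, 300 dims).
emb = np.random.randn(10000, 300)
emb_post = variance_normalize(emb, d_top=5)
```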
Word embedding has become very popular during the past several years, but its evaluation has mainly been conducted with intrinsic evaluation methods because of their convenience. In the Natural Language Processing community, however, we care more about the effectiveness of word embeddings on real NLP tasks such as translation, sentiment analysis, and question answering. Our second work focuses on word embedding quality and its relationship with evaluation methods. We discuss the criteria that a good word embedding, as well as a good evaluation method, should satisfy. We also discuss the properties of intrinsic evaluation methods, since different intrinsic evaluators test embeddings from different perspectives. Finally, [...]
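As one concrete example of an intrinsic evaluator, word-similarity tests rank-correlate cosine similarities from the embedding with human judgments. The sketch below assumes a WordSim-353-style list of (word, word, human score) triples and a simple word-to-vector dictionary; the helper name and data format are ours, shown only to make the idea concrete.

```python
import numpy as np
from scipy.stats import spearmanr

def word_similarity_score(embedding, pairs):
    """Spearman correlation between embedding cosine similarities and
    human similarity judgments over a list of (word_a, word_b, score)."""
    model_sims, human_sims = [], []
    for a, b, gold in pairs:
        if a not in embedding or b not in embedding:
            continue  # skip out-of-vocabulary pairs
        va, vb = embedding[a], embedding[b]
        cos = va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb))
        model_sims.append(cos)
        human_sims.append(gold)
    rho, _ = spearmanr(model_sims, human_sims)
    return rho
```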









