MCL Research on Variable-Length Word Embeddings
We propose Variable-Length Word Embeddings, a POS-aware and compute-efficient Word2Vec training framework. Traditional embeddings assign the same dimensionality to every token, even though different parts of speech contribute very differently to sentence meaning. In real text, nouns usually carry the main semantic content, verbs encode actions and relations, while many other categories (e.g., articles, prepositions, conjunctions) are comparatively low-information. This motivates a representation strategy that spends more capacity on important words and less capacity on the rest.
Our core idea is to use POS tags to organize training data and allocate embedding dimensions accordingly. We first POS-tag the entire corpus and split it into three views: a noun-only corpus, a noun+verb corpus, and a full corpus containing all tokens. Instead of training one uniform embedding space, we build embeddings in stages so that nouns become the backbone, verbs are learned relative to that backbone, and the remaining words are learned with minimal capacity.
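The corpus-splitting step above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the function names (`build_views`, `tag`) are ours, and a toy tag lexicon stands in for a real POS tagger (in practice NLTK or spaCy would supply the tags) so the example is self-contained.

```python
# Build the three corpus views (noun-only, noun+verb, full) from a
# POS-tagged corpus. A toy lexicon replaces a real tagger for brevity.
TOY_TAGS = {
    "dog": "NOUN", "ball": "NOUN", "park": "NOUN",
    "chases": "VERB", "runs": "VERB",
    "the": "DET", "a": "DET", "in": "ADP",
}

def tag(sentence):
    """Return (token, POS) pairs; unknown tokens get the fallback tag 'X'."""
    return [(tok, TOY_TAGS.get(tok, "X")) for tok in sentence.split()]

def build_views(sentences):
    """Return the (noun_only, noun_verb, full) corpus views."""
    noun_only, noun_verb, full = [], [], []
    for s in sentences:
        tagged = tag(s)
        noun_only.append([t for t, p in tagged if p == "NOUN"])
        noun_verb.append([t for t, p in tagged if p in ("NOUN", "VERB")])
        full.append([t for t, _ in tagged])
    return noun_only, noun_verb, full

views = build_views(["the dog chases a ball in the park"])
# noun-only view of the sentence: ['dog', 'ball', 'park']
```

Each view can then be fed to a standard Word2Vec trainer; only the token filtering differs between stages.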
We train nouns progressively with increasing dimensionality. Specifically, we learn noun embeddings at 50D, 100D, and 200D on the noun-only corpus. To make training across dimensions stable and efficient, each higher-dimensional model is initialized from the previous lower-dimensional embeddings using Lanczos interpolation (50D → 100D, and 100D → 200D), and then refined on the noun-only corpus. This produces high-capacity noun representations while preserving continuity across stages.
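The Lanczos initialization step can be sketched as below, treating each embedding vector as a 1D signal and resampling it to the higher dimension with a Lanczos-windowed sinc kernel. This is our own minimal implementation under that reading of the method (the paper's exact resampling code and kernel width `a` are assumptions); the resized vectors would then seed the next training stage rather than be used directly.

```python
import numpy as np

def lanczos_kernel(x, a=3):
    """Lanczos-windowed sinc: sinc(x) * sinc(x/a) on |x| < a, else 0."""
    x = np.asarray(x, dtype=float)
    out = np.sinc(x) * np.sinc(x / a)
    out[np.abs(x) >= a] = 0.0
    return out

def lanczos_resize(vec, new_len, a=3):
    """Resample a 1D vector to new_len via Lanczos interpolation."""
    old_len = len(vec)
    scale = old_len / new_len
    out = np.empty(new_len)
    for j in range(new_len):
        x = (j + 0.5) * scale - 0.5          # source coordinate of output sample j
        lo = int(np.floor(x)) - a + 1
        idx = np.arange(lo, lo + 2 * a)      # 2a taps around x
        w = lanczos_kernel(x - idx, a)
        idx = np.clip(idx, 0, old_len - 1)   # clamp taps at the edges
        out[j] = np.dot(w, vec[idx]) / w.sum()  # normalize so constants are preserved
    return out

# Upsample a 50D embedding table to initialize the 100D stage.
emb50 = np.random.default_rng(0).normal(size=(1000, 50))
emb100_init = np.stack([lanczos_resize(v, 100) for v in emb50])
```

The same call with `new_len=200` covers the 100D → 200D stage.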
After obtaining the noun backbone at each dimension, we introduce verbs through a controlled adaptation step. Using the noun+verb corpus, we train verbs on top of the noun space, where noun vectors are soft-frozen (implemented with a reduced update factor) so they remain stable but can still adjust slightly. Verbs, in contrast, are fully trainable and learn to align with noun semantics. We apply this procedure at 50D [...]
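The soft-freezing idea can be sketched as a skip-gram negative-sampling (SGNS) update in which each word carries a per-word update scale. The function name, the scale values (0.1 for nouns, 1.0 for verbs), and the bare-bones SGNS step are all illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sgns_step(in_vecs, out_vecs, center, context, label, lr, update_scale):
    """One SGNS update (label=1 positive pair, 0 negative sample).
    update_scale[w] damps the step for soft-frozen words."""
    v = in_vecs[center].copy()               # copy pre-update values so both
    u = out_vecs[context].copy()             # gradients use the same state
    score = 1.0 / (1.0 + np.exp(-v @ u))     # sigmoid of the dot product
    g = (label - score) * lr                 # shared gradient coefficient
    in_vecs[center] += update_scale[center] * g * u
    out_vecs[context] += update_scale[context] * g * v

rng = np.random.default_rng(0)
in_vecs = rng.normal(0.0, 0.1, size=(4, 50))
out_vecs = rng.normal(0.0, 0.1, size=(4, 50))
# ids 0-1: nouns (soft-frozen), ids 2-3: verbs (fully trainable)
update_scale = np.array([0.1, 0.1, 1.0, 1.0])
sgns_step(in_vecs, out_vecs, center=2, context=0,
          label=1.0, lr=0.025, update_scale=update_scale)
```

Setting a word's scale to 0.0 recovers a hard freeze, so the same machinery covers both regimes.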