Sentence similarity evaluation has a wide range of applications in natural language processing, such as semantic similarity computation, text generation evaluation, and information retrieval. As one of the word-alignment-based methods, Word Mover’s Distance (WMD)[1] formulates text similarity evaluation as a minimum-cost flow problem. It finds the most efficient way to align the information between text sequences through a flow network defined by word-level similarities. By assigning flows to individual words, WMD computes text dissimilarity as the minimum cost of moving words’ flows from one sentence to another based on pre-trained word embeddings.
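To make the baseline concrete, here is a minimal sketch of WMD written as a linear program and solved with SciPy. It is an illustration under stated assumptions, not the reference implementation from [1]: the function names (wmd, frequency_weights, sentence_distance) and the 2-D toy vectors are hypothetical stand-ins for real pre-trained embeddings, and Euclidean distance between embeddings is used as the moving cost.

```python
from collections import Counter

import numpy as np
from scipy.optimize import linprog


def frequency_weights(tokens):
    """Word flow proportional to in-sentence frequency (the baseline scheme)."""
    counts = Counter(tokens)
    words = sorted(counts)
    total = sum(counts.values())
    return words, [counts[w] / total for w in words]


def wmd(emb_a, w_a, emb_b, w_b):
    """Minimum cost of moving flow w_a (at emb_a) onto w_b (at emb_b).

    Both weight vectors must sum to the same total, otherwise the
    transport problem is infeasible.
    """
    emb_a, emb_b = np.asarray(emb_a, float), np.asarray(emb_b, float)
    w_a, w_b = np.asarray(w_a, float), np.asarray(w_b, float)
    n, m = len(w_a), len(w_b)
    # cost[i, j] = Euclidean distance between word i of A and word j of B.
    cost = np.linalg.norm(emb_a[:, None, :] - emb_b[None, :, :], axis=-1)
    # The flow matrix T (n x m) is flattened row-major into n * m variables.
    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        a_eq[i, i * m:(i + 1) * m] = 1.0  # row i of T sums to w_a[i]
    for j in range(m):
        a_eq[n + j, j::m] = 1.0           # column j of T sums to w_b[j]
    b_eq = np.concatenate([w_a, w_b])
    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return float(res.fun)


def sentence_distance(tokens_a, tokens_b, vectors):
    """WMD between two tokenized sentences with frequency-based flow."""
    words_a, w_a = frequency_weights(tokens_a)
    words_b, w_b = frequency_weights(tokens_b)
    return wmd([vectors[w] for w in words_a], w_a,
               [vectors[w] for w in words_b], w_b)


# Hypothetical 2-D "embeddings", chosen only so the result is easy to check.
TOY_VECTORS = {
    "obama": [0.0, 0.0], "president": [0.0, 1.0],
    "speaks": [3.0, 0.0], "greets": [3.0, 1.0],
}
```

With these toy vectors, sentence_distance(["obama", "speaks"], ["president", "greets"], TOY_VECTORS) comes out close to 1.0: each word moves its flow of 0.5 to its nearest counterpart at distance 1. A dense linear program is fine for short sentences; larger inputs would normally call for a dedicated optimal-transport solver.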
However, a naive WMD method does not perform well on sentence similarity evaluation, for two reasons.
– First, WMD assigns word flow based on words’ frequency in a sentence. This frequency-based weighting scheme captures word importance poorly because it ignores the statistics of the whole corpus.
– Second, the distance between words depends solely on the embeddings of isolated words, without considering the contextual and structural information of the input sentences. Since the meaning of a sentence depends on individual words as well as their interactions, aligning individual words alone is insufficient for evaluating sentence similarity.
MCL proposed a new syntax-aware word flow calculation method, Syntax-aware Word Mover’s Distance (SynWMD)[2], for sentence similarity evaluation.
– Words are first represented as a weighted graph built from co-occurrence statistics obtained from dependency parse trees. Then, a PageRank-based algorithm is used to infer word importance.
– The word distance model in WMD is enhanced by the context extracted from dependency parse trees, which is illustrated in Figure 1. The contextual information of words and structural information of sentences are explicitly modeled as additional subtree embeddings.
– As shown in Table 1, we conduct extensive experiments on semantic textual similarity tasks and k-nearest neighbor sentence classification tasks to evaluate the effectiveness of the proposed SynWMD. The code for SynWMD is available at https://github.com/amao0o0/SynWMD.
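The word-importance step in the first item above can be sketched as follows. This is a rough illustration, not the code from the SynWMD repository: the function names, the hand-written (head, dependent) pairs, the undirected graph, and the damping factor of 0.85 are all assumptions made for the example. How the resulting scores are turned into word flow is not specified here; see [2] for the actual scheme.

```python
from collections import defaultdict


def cooccurrence_graph(parsed_sentences):
    """Count how often two words are linked by a dependency edge.

    parsed_sentences: one list of (head, dependent) word pairs per
    sentence, e.g. produced by any dependency parser.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for edges in parsed_sentences:
        for head, dep in edges:
            graph[head][dep] += 1.0
            graph[dep][head] += 1.0  # assumption: treat edges as undirected
    return graph


def weighted_pagerank(graph, damping=0.85, iters=100):
    """Power iteration for PageRank with edge weights as transition mass."""
    nodes = sorted(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    # Every node has at least one edge here, so no total weight is zero.
    out_weight = {v: sum(graph[v].values()) for v in nodes}
    for _ in range(iters):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            for v, w in graph[u].items():
                new_rank[v] += damping * rank[u] * w / out_weight[u]
        rank = new_rank
    return rank


# Hand-written dependency edges for three toy sentences (hypothetical data).
TOY_PARSES = [
    [("sleeps", "cat"), ("cat", "the")],
    [("barks", "dog"), ("dog", "the")],
    [("sings", "bird"), ("bird", "the")],
]

scores = weighted_pagerank(cooccurrence_graph(TOY_PARSES))
```

In this toy corpus, "the" co-occurs with the most distinct words and receives the largest score, followed by the nouns and then the verbs. The scores sum to 1, so they can be compared directly across words.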
Ref:
[1] Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966. PMLR, 2015.
[2] Chengwei Wei, Bin Wang, and C.-C. Jay Kuo. SynWMD: Syntax-aware Word Mover’s Distance for sentence similarity evaluation. arXiv preprint arXiv:2206.10029, 2022.
— by Chengwei Wei