MCL Research on Multi-modal Neural Machine Translation
Our long-term goal is to build intelligent systems that can perceive their visual environment, understand the associated linguistic information, and make accurate translation inferences into another language. However, most multi-modal translation algorithms are not significantly better than an off-the-shelf text-only machine translation (MT) model. How translation models should take advantage of visual context therefore remains an open question. From the perspective of information theory, the mutual information I(X; Y) of two random variables is never greater than I(X, Z; Y), where Z is the additional visual input; this suggests that visual content can, in principle, help translation systems.
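The inequality above follows from the chain rule of mutual information; a short derivation (in standard notation, not drawn from the original text) is:

```latex
I(X, Z; Y) = I(X; Y) + I(Z; Y \mid X) \ge I(X; Y),
```

since the conditional mutual information $I(Z; Y \mid X)$ is always non-negative. The extra visual input $Z$ can thus never reduce the information available about the target $Y$, though the inequality alone does not say how to exploit it.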
Since the standard paradigm of multi-modal translation treats the problem as a supervised learning task, the parallel corpus is usually sufficient to train a good translation model, and the gain from the extra image input is very limited. We argue, however, that text-only unsupervised machine translation (UMT) is fundamentally an ill-posed problem, since there are potentially many ways to associate target with source sentences. Intuitively, since visual content and language are closely related, an image can play the role of a pivot “language” that bridges the two languages without a parallel corpus, making the problem “more well-defined” by reducing it to supervised learning.
We tackle unsupervised translation with a multi-modal framework that comprises two sequence-to-sequence encoder-decoder models and one shared image feature extractor. We employ the Transformer in both the text encoder and decoder of our model and design a novel joint attention mechanism to model the relationships between the language and visual domains.
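One plausible form of the joint attention described above is scaled dot-product attention over the concatenation of text-token and image-region features, so that a single attention distribution spans both modalities. The sketch below illustrates that idea only; the function name, shapes, and the concatenation strategy are our own assumptions, not the paper's exact mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(queries, text_keys, image_keys, d):
    """Attend jointly over text and image features (illustrative sketch).

    queries:    (n_q, d) decoder states
    text_keys:  (n_t, d) text-encoder outputs
    image_keys: (n_i, d) projected image-region features
    Returns a (n_q, d) context vector mixing both modalities.
    """
    # Concatenate both modalities so one softmax covers text and image slots.
    keys = np.concatenate([text_keys, image_keys], axis=0)   # (n_t + n_i, d)
    scores = queries @ keys.T / np.sqrt(d)                   # scaled dot-product
    weights = softmax(scores, axis=-1)                       # rows sum to 1
    return weights @ keys                                    # convex mix of keys

# Toy usage: 2 decoder states attending over 5 text tokens and 3 image regions.
rng = np.random.default_rng(0)
ctx = joint_attention(rng.standard_normal((2, 8)),
                      rng.standard_normal((5, 8)),
                      rng.standard_normal((3, 8)), 8)
print(ctx.shape)  # (2, 8)
```

Because each output row is a convex combination of the key rows, the context vector stays within the range of the encoder features, which is one reason a shared attention over both modalities is a natural way to fuse them.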
Succinctly, our contributions are three-fold:
We formulate the multi-modal MT problem in an unsupervised setting that fits the real [...]