Our long-term goal is to build intelligent systems that can perceive their visual environment, understand the accompanying linguistic information, and translate it accurately into another language. However, most multi-modal translation algorithms are not significantly better than an off-the-shelf text-only machine translation (MT) model. How translation models should take advantage of visual context therefore remains an open question. From the perspective of information theory, the mutual information I(X; Y) between two random variables is never greater than I(X, Z; Y), where Z is the additional visual input. This leads us to believe that visual content can help translation systems.
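
The inequality can be made explicit with the chain rule of mutual information and the non-negativity of conditional mutual information. In the line below, reading X as the source sentence and Y as the target sentence is our assumption; the text above only fixes Z as the visual input.

```latex
I(X, Z; Y) \;=\; I(X; Y) + I(Z; Y \mid X) \;\ge\; I(X; Y),
\qquad \text{since } I(Z; Y \mid X) \ge 0 .
```

Equality holds exactly when Z is conditionally independent of Y given X, that is, when the image carries no information about the target beyond what the source sentence already provides. The bound therefore says that the image cannot hurt in principle, not that a particular model will succeed in exploiting it.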

Because the standard paradigm of multi-modal translation treats the problem as a supervised learning task, the parallel corpus is usually sufficient to train a good translation model, and the gain from the extra image input is very limited. We argue, however, that text-only unsupervised machine translation (UMT) is fundamentally an ill-posed problem, since there are potentially many ways to associate target sentences with source sentences. Intuitively, since visual content and language are closely related, the image can play the role of a pivot “language” that bridges the two languages without a parallel corpus, making the problem “more well-defined” by reducing it to supervised learning.

We tackle unsupervised translation with a multi-modal framework that includes two sequence-to-sequence encoder-decoder models and one shared image feature extractor. We employ the Transformer in both the text encoder and the text decoder of our model, and we design a novel joint attention mechanism to model the relationships between the language and visual domains.
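
To make the idea of attending jointly over text and image more concrete, here is a minimal pure-Python sketch. It is not the formulation from the paper: the function names, the additive combination of the two contexts, and the `image_gate` parameter are all assumptions chosen for illustration. It only shows how a single decoder query could attend to text states and, optionally, to image region features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def joint_attention(query, text_states, image_regions=None, image_gate=1.0):
    """Hypothetical joint attention: text context plus a gated image context.

    Assumption: the two contexts are simply added; a real model may combine
    them differently.
    """
    text_context = attend(query, text_states, text_states)
    if image_regions is None or image_gate == 0.0:
        return text_context  # uni-modal path: no image contribution
    image_context = attend(query, image_regions, image_regions)
    return [t + image_gate * v for t, v in zip(text_context, image_context)]


# Toy inputs: one query, two text states, one image region (all 2-dimensional).
query = [1.0, 0.0]
text_states = [[1.0, 0.0], [0.0, 1.0]]
image_regions = [[0.5, 0.5]]

text_only = joint_attention(query, text_states)
with_image = joint_attention(query, text_states, image_regions)
gated_off = joint_attention(query, text_states, image_regions, image_gate=0.0)
```

Setting `image_gate` to zero, or passing no image regions, reduces the sketch to ordinary text-only attention, so the same module can serve both kinds of input.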

Succinctly, our contributions are three-fold:

  1. We formulate the multi-modal MT problem in an unsupervised setting, which better fits real-world scenarios, and propose an end-to-end Transformer-based multi-modal model.
  2. We present two technical contributions: we successfully train the proposed model with auto-encoding and cycle-consistency losses, and we design a controllable attention module that handles both uni-modal and multi-modal data.
  3. We apply our approach to the Multilingual Multi30K dataset on English/French and English/German translation tasks; the translation output and the attention visualization show that the gain from the extra image is significant in the unsupervised setting.
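
The two training signals named in the second contribution can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the dictionary "translators", the noise function, and the token-mismatch rate (a stand-in for the cross-entropy loss a neural model would use) are hypothetical, and the image input is left out entirely. The point is only the shape of each objective: auto-encoding reconstructs a sentence from a corrupted copy of itself, and cycle-consistency translates to the other language and back, then compares with the original.

```python
def token_error(reference, hypothesis):
    """Fraction of mismatched positions; a stand-in for cross-entropy."""
    length = max(len(reference), len(hypothesis))
    if length == 0:
        return 0.0
    ref = reference + [None] * (length - len(reference))
    hyp = hypothesis + [None] * (length - len(hypothesis))
    return sum(r != h for r, h in zip(ref, hyp)) / length


def auto_encoding_loss(sentence, noise, reconstruct):
    """Corrupt a sentence, reconstruct it in the same language, compare."""
    return token_error(sentence, reconstruct(noise(sentence)))


def cycle_consistency_loss(sentence, forward, backward):
    """Translate to the other language and back, compare with the original."""
    return token_error(sentence, backward(forward(sentence)))


# Hypothetical toy "models": word-by-word dictionary lookups.
EN_FR = {"a": "un", "dog": "chien", "runs": "court"}
FR_EN = {fr: en for en, fr in EN_FR.items()}


def en_to_fr(tokens):
    return [EN_FR.get(t, t) for t in tokens]


def fr_to_en(tokens):
    return [FR_EN.get(t, t) for t in tokens]


def drop_last(tokens):
    """Toy noise function: delete the final token."""
    return tokens[:-1]


def identity(tokens):
    """Toy reconstructor that fails to repair the noise."""
    return list(tokens)


sentence = ["a", "dog", "runs"]
good_cycle = cycle_consistency_loss(sentence, en_to_fr, fr_to_en)
bad_cycle = cycle_consistency_loss(sentence, en_to_fr, identity)
ae_loss = auto_encoding_loss(sentence, drop_last, identity)
```

Neither objective needs a parallel sentence pair: both compare a monolingual sentence with a reconstruction of itself, which is what makes them usable in the unsupervised setting.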


–By Yuanhang Su



Yuanhang Su, Kai Fan, Nguyen Bach, C.C. Jay Kuo, and Fei Huang. Unsupervised Multi-modal Neural Machine Translation. The Conference on Computer Vision and Pattern Recognition (CVPR) 2019, Jun. 18, 2019, Long Beach, CA, USA.