Conneau et al., 2018, Word Translation Without Parallel Data (Summary)

Paper summaries on this blog are motivated by the fact that, after studying a paper, I like to preserve the key information that will help me quickly review its content without delving into the long text of the original publication. I am not going to give a detailed explanation of the paper, and will only elaborate on the parts where I was a bit unclear about the specifics. I hope the reader also finds it useful and, please, provide me with feedback if needed.

Each paper summary will focus on the following 5 key points:

  1. Problem statement
  2. Solution provided
  3. Key results
  4. Data
  5. Personal thoughts (if any)

These points are sorted according to the information value they provide to me. It is a structure which I personally find useful, not one that follows any summarisation guide or another blogger's established practice.

Problem statement

State-of-the-art methods for learning cross-lingual word embeddings rely heavily on parallel corpora or bilingual dictionaries. However, the low availability or poor quality of parallel corpora for many language pairs raises the need for an unsupervised method of learning high-quality cross-lingual embeddings. The results are tested on a number of cross-lingual NLP tasks (word translation, cross-lingual semantic word similarity, sentence translation retrieval).


Solution provided

First of all, the solution provided falls under the category of monolingual mapping approaches. Simply put, these cross-lingual embedding models first train separate monolingual word embeddings in each language and then learn a linear (or other) mapping between the two spaces. The algorithm used for the monolingual embeddings was fastText, trained on Wikipedia corpora. According to the experiments, the factor that affected embedding quality the most was the choice of corpora, not so much the algorithm.

Given a language pair, two sets of word embeddings are learned (they can be of different sizes, and no alignment pre-processing is required), and adversarial training is employed to learn a mapping \boldsymbol{W} between the two sets. As the figure illustrates, if \boldsymbol{X} = \{x_1,x_2,...,x_n\} and \boldsymbol{Y} = \{y_1,y_2,...,y_m\} are the source and target word embedding sets respectively, the discriminator is trained to distinguish the mapped source embeddings \boldsymbol{Wx_i} from the target embeddings \boldsymbol{y_j}, while \boldsymbol{W} is trained to fool it.

At each iteration, a word embedding x from the source set and a word embedding y from the target set are selected at random. Then x is mapped to Wx, which should eventually be indistinguishable from the target distribution.
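The two alternating updates can be made concrete with a toy numpy sketch. This is not the paper's setup (the paper uses a multi-layer perceptron discriminator); here a simple logistic discriminator stands in, and all vectors are random, just to show the discriminator step and the mapping step taking turns:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                  # toy embedding dimension
X = rng.normal(size=(200, d))          # "source" embeddings
Y = rng.normal(size=(200, d))          # "target" embeddings
W = np.eye(d)                          # mapping, initialised at identity
w, b = rng.normal(size=d) * 0.01, 0.0  # toy logistic discriminator params

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.1
for step in range(200):
    xs = X[rng.integers(0, len(X), 32)]           # random source batch
    ys = Y[rng.integers(0, len(Y), 32)]           # random target batch
    # --- discriminator step: predict 1 for mapped source, 0 for target
    v = np.vstack([xs @ W.T, ys])
    t = np.concatenate([np.ones(32), np.zeros(32)])
    p = sigmoid(v @ w + b)
    w -= lr * v.T @ (p - t) / len(t)              # logistic-loss gradient
    b -= lr * np.mean(p - t)
    # --- mapping step: update W so the discriminator calls Wx "target",
    # i.e. descend the gradient of -log(1 - p(source | Wx))
    p_src = sigmoid(xs @ W.T @ w + b)
    W -= lr * np.outer(w, (p_src[:, None] * xs).mean(axis=0))
```

With random Gaussians there is of course nothing meaningful to align; the point is only the alternation of the two gradient steps.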

Not all words are used for the adversarial training: a word-frequency threshold is applied, resulting in the final use of 200k words. Consequently, the mapping \boldsymbol{W} forms a "rotation" of the source word embedding distribution so that it aligns with the target one. The following image depicts the progress so far; the size of each dot represents the frequency of the word in the training corpora.
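Why "rotation"? An orthogonal mapping is an isometry: it preserves Euclidean distances and dot products (and hence cosine similarities), so the geometry of the source space is carried over intact. A quick numpy check, using 300 dimensions as in typical fastText vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
# Build a random orthogonal matrix via QR decomposition of a Gaussian matrix
Q, _ = np.linalg.qr(rng.normal(size=(300, 300)))

x1, x2 = rng.normal(size=300), rng.normal(size=300)
# Orthogonal maps preserve distances and dot products between embeddings
dist_before = np.linalg.norm(x1 - x2)
dist_after = np.linalg.norm(Q @ x1 - Q @ x2)
dot_before = x1 @ x2
dot_after = (Q @ x1) @ (Q @ x2)
```

Both pairs of values agree up to floating-point error, which is exactly the property step 3 below enforces on \boldsymbol{W}.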

This completes the first step, which, however, does not yet compete with supervised methods on NLP tasks. So the cross-lingual space needs refinement. This refinement amounts to choosing some anchor points which serve as input to an algorithm that reduces the source-target word distances even further. The algorithm proceeds in four steps:

  1. Choose the most frequent words. Why? The most frequent words provide higher-quality word embeddings: since a word can appear in more than one context, the monolingual embedding algorithm must have encountered it enough times in each context in order to capture it fully and, thus, produce a good word embedding.

Iteratively, proceed with the rest of the steps:

  2. Obtain a synthetic dictionary from the learned mapping \boldsymbol{W}. This dictionary is built using the Cross-Domain Similarity Local Scaling (CSLS) criterion, which is explained in depth (and can be easily understood) in the original publication. Simply put, this criterion uses a bipartite graph that holds the nearest neighbours of all source and target words. It then scales the word similarities, favouring isolated vectors over heavily clustered ones. Consequently, it deals with the asymmetry of nearest-neighbour rules, which suffer from the hubness problem: some vectors are nearest neighbours of many other vectors, while others are completely isolated. Using CSLS, only the source-target pairs that are mutual nearest neighbours are selected, creating a high-quality synthetic dictionary that provides good anchor points for step 4.
  3. Apply an orthogonality restriction to the matrix \boldsymbol{W}. This is achieved during the adversarial training of the generator, where the update of the matrix \boldsymbol{W} is alternated with the following formula: W \leftarrow (1 + \beta)W - \beta(WW^T)W. The parameter \boldsymbol{\beta} is found experimentally. Orthogonality ensures that \boldsymbol{W} is an isometry of Euclidean space (illustrated as a rotation in the previous image during adversarial training), and it also justifies using the Procrustes solution in step 4.
  4. Apply the Procrustes closed-form solution to further reduce the word embedding distances. Conveniently, this is obtained from the singular value decomposition of YX^T, where \boldsymbol{X}, \boldsymbol{Y} hold the source and target words respectively of the synthetic dictionary extracted in step 2.

The above procedure is illustrated in the following image:
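Before moving on, the closed form of step 4 is worth making concrete. In this toy check (not the paper's pipeline), a synthetic dictionary is generated by rotating source vectors with a hidden orthogonal matrix, which the Procrustes solution then recovers exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 50, 500
X = rng.normal(size=(n, d))                    # source side of the dictionary
R, _ = np.linalg.qr(rng.normal(size=(d, d)))   # hidden "true" rotation
Y = X @ R.T                                    # target side: y_i = R x_i

# Procrustes closed form: W* = U V^T, where U S V^T = SVD(Y X^T)
# (rows are vectors here, so the paper's Y X^T becomes Y.T @ X in this layout)
U, _, Vt = np.linalg.svd(Y.T @ X)
W = U @ Vt
```

Because the recovered W is a product of orthogonal factors, it is itself orthogonal, consistent with the isometry constraint of step 3.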

So far, the mapping between the monolingual word embeddings has been learned. The final task is the translation procedure for the NLP tasks evaluated in the paper. This is simply done word-to-word, in two steps: first, given any source word \boldsymbol{x}, the learned mapping \boldsymbol{W} is applied; then the CSLS criterion chooses the closest translation for each word. This is illustrated in the last image provided by the authors of the paper:
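The CSLS retrieval step can also be sketched in numpy. This follows my reading of the criterion (twice the cosine similarity, minus the mean cosine of each side to its k nearest neighbours on the other side), with random vectors standing in for real embeddings:

```python
import numpy as np

def csls_scores(Wx, Y, k=10):
    """CSLS between mapped source vectors Wx (n, d) and target vectors Y (m, d)."""
    Wx = Wx / np.linalg.norm(Wx, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = Wx @ Y.T                                    # (n, m) cosine matrix
    # r_T: mean similarity of each mapped source word to its k target NNs
    r_t = np.sort(cos, axis=1)[:, -k:].mean(axis=1)   # (n,)
    # r_S: mean similarity of each target word to its k mapped-source NNs
    r_s = np.sort(cos, axis=0)[-k:, :].mean(axis=0)   # (m,)
    # Penalising the hub terms r_T and r_S counters the hubness problem
    return 2 * cos - r_t[:, None] - r_s[None, :]

rng = np.random.default_rng(4)
Wx = rng.normal(size=(30, 8))     # 30 mapped source words (toy)
Y = rng.normal(size=(40, 8))      # 40 target words (toy)
scores = csls_scores(Wx, Y, k=5)
translation = scores.argmax(axis=1)   # best target index per source word
```

Taking the argmax per row gives the word-to-word translation; keeping only mutual nearest neighbours under this score yields the synthetic dictionary of step 2.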


Key results

The results of this cross-lingual embedding learning are evaluated on three cross-lingual NLP tasks, namely word translation, cross-lingual semantic word similarity and sentence translation retrieval. The performance is compared against many benchmarks and a supervised baseline model that uses the aforementioned Procrustes formula. In each task, the proposed model demonstrates outstanding results that are either on par with supervised state-of-the-art models or outperform them.

The model is also applied to low-resource language pairs such as English-Esperanto. The results are quite encouraging, given that the model configuration for this language pair was non-exhaustive.

The most important result (imo) is that Procrustes – CSLS provides strong results and systematically improves performance in all experiments, which I consider to be one of the biggest contributions of this paper to the problem investigated. Consequently, CSLS proves to be a very strong similarity metric for cross-lingual embedding spaces.


Data

For training the monolingual word embeddings, Wikipedia data were used, which seemed to provide a better resource than the WaCky datasets (Baroni et al. 2009).

For different cross-lingual tasks, different data were utilised.

  • Word translation — MUSE library (Facebook Research), providing a dictionary of 100k word pairs
  • Cross-lingual semantic word similarity — SemEval 2017 competition data (Camacho-Collados et al. 2017)
  • Sentence translation retrieval — Europarl corpus

Personal Thoughts

Word polysemy is an important aspect of translation in general, and it most definitely raises big challenges in Machine Translation.

According to the authors, word polysemy is not taken into consideration for the English – Esperanto pair, which explains why the P@1 reported for the word translation task is so low compared to the other pairs. As evidence, they report P@5 for the same pair, which is almost twice as high as P@1. However, what I am not clear about is how word polysemy is taken into account for the rest of the language pairs.

Apart from the Esperanto-English pair, looking at the evaluation results in table 1 of the paper, the proposed model performs relatively poorly (although still on par with the supervised methods) on two more language pairs, English-Russian and English-Chinese (and in reverse). Does this limited performance once again originate from the word polysemy problem? Or should we attribute the variation in performance to the heterogeneity among Russian, Chinese and English, and thus, once again, to the problem of language distance?

I am not sure how distant English is from German, since I have no knowledge of the latter. However, I would be very interested in investigating the proposed model's optimisation for the English – Chinese pair or, to be even more challenging, for the Greek – Chinese language pair.