Conneau et al., 2018, Word Translation Without Parallel Data (Summary)

Paper summarisations in this blog are motivated by the fact that, after studying a paper, I would like to preserve the key information that helps me quickly review the content without delving into the long text of the original publication. I will not give a detailed explanation of the paper, and will only elaborate on the parts where I was a bit unclear about the specifics. I hope the reader also finds it useful, and please provide me with feedback if needed.

Each paper summarisation will focus on the following 5 key points:

  1. Problem statement
  2. Solution provided
  3. Key results
  4. Data
  5. Personal thoughts (if any)

These points are sorted by the information value they provide to me: a structure I personally find useful, without following any summarisation guide or another blogger's established practice.

Problem statement

State-of-the-art methods for learning cross-lingual word embeddings rely heavily on parallel corpora or bilingual dictionaries. However, the low availability or poor quality of parallel corpora for many language pairs raises the need for an unsupervised method that learns high-quality cross-lingual embeddings. The results are tested on a number of cross-lingual NLP tasks (word translation, cross-lingual semantic word similarity, sentence translation retrieval).


Solution provided

First of all, the solution falls under the category of monolingual mapping approaches. Simply put, these cross-lingual embedding models first train separate monolingual word embeddings in each language and then learn a linear (or other) mapping between the two spaces. The algorithm used for the monolingual embeddings was fastText, trained on Wikipedia corpora. According to the experiments, the factor that affected embedding quality the most was the choice of corpora, not so much the algorithm.

Given a language pair, two sets of word embeddings are learned (they can be of different sizes; no alignment pre-processing is required), and adversarial training is employed to learn a mapping \boldsymbol{W} between these two sets. As the figure illustrates, if \boldsymbol{X} = \{x_1,x_2,...,x_n\} and \boldsymbol{Y} = \{y_1,y_2,...,y_m\} are the source and target word embedding sets respectively, the discriminator learns to distinguish the mapped vectors \boldsymbol{Wx_i} from the target vectors \boldsymbol{y_j}, while \boldsymbol{W} is trained to fool it.

At each iteration, a word embedding x from the source set and a word embedding y from the target set are selected randomly. Then x is mapped to Wx, which the discriminator tries to tell apart from y.

Not all words are used for the adversarial training: a word-frequency threshold is applied, resulting in the final usage of 200k words. The mapping \boldsymbol{W} thus comes to form a "rotation" of the source word embedding distribution that aligns it with the target one. The following image depicts the progress so far; the size of each dot represents the frequency of the word in the training corpora.
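The adversarial alternation above can be sketched with a toy numpy implementation: a single-neuron logistic discriminator tries to tell mapped source vectors Wx apart from target vectors y, while W is updated to fool it. Everything below (dimensions, learning rates, the random stand-in embeddings) is an illustrative assumption, not the paper's actual setup, which uses real fastText vectors and a multi-layer discriminator.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 500                           # toy embedding dim / vocab size
X = rng.normal(size=(n, d))              # synthetic "source" embeddings
Y = rng.normal(size=(n, d))              # synthetic "target" embeddings

W = np.eye(d)                            # mapping, initialised to identity
w, b = rng.normal(size=d) * 0.01, 0.0    # linear discriminator parameters
lr_d, lr_w = 0.1, 0.01

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

for step in range(1000):
    x = X[rng.integers(n)]               # sample one source word ...
    y = Y[rng.integers(n)]               # ... and one target word
    wx = W @ x
    # discriminator step: label target vectors 1, mapped vectors 0
    for z, label in ((y, 1.0), (wx, 0.0)):
        p = sigmoid(w @ z + b)
        w -= lr_d * (p - label) * z      # gradient of the logistic loss
        b -= lr_d * (p - label)
    # mapping step: push W so the discriminator believes Wx is a target
    p = sigmoid(w @ (W @ x) + b)
    W -= lr_w * (p - 1.0) * np.outer(w, x)

assert np.isfinite(W).all() and W.shape == (d, d)
```

With a one-neuron discriminator the dynamics are crude, but the alternation is the same as in the paper: a discriminator ascent step on the real-vs-mapped classification, then a gradient step on W against the discriminator's verdict.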

The first step is now complete, but it does not yet compete with the supervised methods on NLP tasks, so the cross-lingual space needs refinement. This refinement amounts to choosing some anchor points that serve as input to the following algorithm, which reduces the source-target word distances even further. The algorithm proceeds in four steps:

  1. Choose the most frequent words. Why? Most frequent words provide higher-quality word embeddings. Since a word can appear in more than one context, the monolingual embedding algorithm needs to have encountered it enough times in each context to fully capture its meaning and thus produce a good embedding.

Then, iteratively proceed with the remaining steps:

  2. Obtain a synthetic dictionary from the learned mapping \boldsymbol{W}. This dictionary is built with the Cross-Domain Similarity Local Scaling (CSLS) criterion, which is explained in depth (and can be easily understood) in the original publication. Simply put, this criterion considers a bipartite graph holding the nearest neighbours of all source and target words, and rescales word similarities to favour distant vectors over heavily clustered ones. It thereby deals with the asymmetry of nearest-neighbour rules, which suffer from the hubness problem: some vectors are nearest neighbours of many other vectors, while others are completely isolated. Using CSLS, only the source-target pairs that are mutually nearest neighbours are selected, creating a high-quality synthetic dictionary that provides good anchor points for step 4.
  3. Apply an orthogonality restriction to the matrix \boldsymbol{W}. This is achieved during the adversarial training of the generator, where the update of the matrix \boldsymbol{W} alternates with the following formula: W \leftarrow (1 + \beta)W - \beta(WW^T)W. The parameter \boldsymbol{\beta} is found experimentally. Orthogonality ensures that \boldsymbol{W} is an isometry of Euclidean space (illustrated as a rotation in the previous image during adversarial training), and it is also what allows the Procrustes algorithm to be used in step 4.
  4. Apply the Procrustes closed-form solution to further reduce the distances between mapped word embeddings. Conveniently, this is obtained from the singular value decomposition of YX^T, where \boldsymbol{X}, \boldsymbol{Y} hold the source and target words, respectively, of the synthetic dictionary extracted in step 2.

The above procedure is illustrated in the following image:

So far, the mapping between the monolingual word embeddings has been learned. The final task is the translation procedure for the NLP tasks evaluated in the paper. Translation is done word-to-word in two steps: first, given any source word \boldsymbol{x}, the learned mapping \boldsymbol{W} is applied; then the CSLS criterion selects the closest target word as its translation. This is illustrated in the last image provided by the authors of the paper:
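Concretely, this two-step translation can be sketched as follows. The three-word vocabularies and random vectors are invented for illustration, and the "learned" W is simulated by constructing a target space that is an exact rotation of the source space.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
src_vocab = ["gato", "perro", "casa"]          # hypothetical source words
tgt_vocab = ["cat", "dog", "house"]            # hypothetical aligned target words

S = rng.normal(size=(3, d))                    # toy source embeddings
W, _ = np.linalg.qr(rng.normal(size=(d, d)))   # stands in for the learned mapping
T = S @ W.T                                    # toy target space, aligned by construction

def normalize(M):
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def translate(word, k=2):
    M, Tn = normalize(S @ W.T), normalize(T)           # mapped sources / targets
    cos = Tn @ M[src_vocab.index(word)]                # step 1: apply W, take cosines
    r_s = np.sort(cos)[-k:].mean()                     # density around the mapped word
    r_t = np.sort(M @ Tn.T, axis=0)[-k:].mean(axis=0)  # density around each target
    return tgt_vocab[int((2 * cos - r_s - r_t).argmax())]  # step 2: CSLS ranking

assert translate("gato") == "cat"
```

The `r_s` and `r_t` terms are what distinguish CSLS from a plain nearest-neighbour lookup: a target that is close to everything (a hub) has its similarity discounted before the argmax is taken.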


Key results

The resulting cross-lingual embeddings are evaluated on three cross-lingual NLP tasks, namely word translation, cross-lingual semantic word similarity and sentence translation retrieval. Performance is compared against many benchmarks and a supervised baseline that uses the aforementioned Procrustes formula. In each task, the proposed model demonstrates outstanding results that are either on par with supervised state-of-the-art models or outperform them.

Their model is also applied to low-resource language pairs such as English-Esperanto. The results are quite encouraging, given that the model configuration for this language pair was not exhaustively tuned.

The most important result (imo) is that Procrustes-CSLS provides strong results and systematically improves performance in all experiments, which I consider one of the paper's main contributions to the problem investigated. CSLS thus proves to be a very strong similarity metric for cross-lingual embedding domains.


Data

For training the monolingual word embeddings, Wikipedia data were used, which seemed to provide a better resource than the WaCky datasets (Baroni et al., 2009).

For different cross-lingual tasks, different data were utilised.

  • Word translation — MUSE library (Facebook research), providing a dictionary of 100k word pairs
  • Cross-lingual semantic word similarity — SemEval 2017 competition data (Camacho-Collados et al. 2017)
  • Sentence translation retrieval — Europarl corpus

Personal Thoughts

Word polysemy is an important aspect of translation in general, and it most definitely raises big challenges in Machine Translation.

According to the authors, word polysemy is not taken into account for the English – Esperanto pair, which explains why the reported P@1 for the word translation task is so low compared to the other pairs; as evidence, the P@5 they report for the same pair is almost twice as high as P@1. What I am not clear about, however, is how word polysemy is taken into account for the rest of the language pairs.

Apart from the Esperanto-English pair, looking at the evaluation results in table 1 of the paper, the proposed model performs relatively poorly (although still on par with the supervised methods) on two more language pairs, English-Russian and English-Chinese (and in reverse). Does this limited performance once again originate from the word polysemy problem? Or should we consider the performance variation related to the heterogeneity among Russian, Chinese and English, and thus attribute it (once again) to language distance?

I am not sure how distant English is from German, since I have no knowledge of the latter. However, I would be very interested in investigating the proposed model's optimisation for the English – Chinese pair, or, to make it even more challenging, the Greek – Chinese pair.


My Chinese teacher assigned me a writing exercise on the topic "公说公有理,婆说婆有理" ("each side insists it is in the right"). Since the subject I chose is something I care a lot about (namely, artificial intelligence), I could not help writing rather verbosely. Please don't judge it too harshly, and don't forget it is just a homework assignment. If I wanted to analyse the topic below in depth, I would certainly write in English! :)


          My parents have been teachers for more than thirty years, so they know the profession deeply; moreover, they have always been interested in new technologies, following the conveniences and state-of-the-art tools that technology brings to society. At the same time, my own field is computer engineering, so I have a fair understanding of most technical matters. And since my parents regularly discuss the various aspects of education at home, I inevitably became interested in the field myself. On whether technology should be integrated into education, my parents and I agree: technology serving as the teacher's assistant raises the quality and efficiency of education. Furthermore, with the growth of the internet, young people can draw on vast resources to learn things they previously could not teach themselves, which makes answering students' questions in class an ever greater challenge for teachers. To my parents, however, artificial intelligence has no clear goal, and this will lead to enormous social problems; to me, on the contrary, the field is the natural next step in the progress of human knowledge.

          One day, while we were discussing the future of AI, my parents began to ask whether the field's most remarkable creation, the robot, might replace teachers in the near future. "Why not? Look, decades ago intelligent machines already began replacing workers, leaving many people unemployed. And in so-called smart supermarkets, shoppers just scan their phones to buy things; suddenly the checkout counter lost its purpose, and jobs disappeared. Nobody can guarantee that robots and intelligent machines will not take teachers' jobs as well." Their point certainly has some merit. But if one considers what the essence of the teaching profession really is, one sees that it differs from the cases my parents cited. Education is, at heart, a process of human interaction. A teacher is not merely a source of knowledge and experience, but also supervises and accompanies the student's learning. Depending on the student's progress, the teacher can encourage, criticise or praise; and when facing difficulties or worries, the student can also find comfort and guidance in the teacher. These are things artificial intelligence simply cannot do in the near future.

          My parents pushed back against my view: "It seems you are not paying close attention to what AI has already achieved. A few days ago we watched a TV programme in which a journalist, at some innovation conference in Dubai, interviewed a robot about the future of AI. In the interview the robot was not only witty and intelligent; its appearance was almost human. Starting from that fact, within a few years this rapidly advancing field could reach the distant goals you just described." The remarkable robot they were referring to is the well-known "Sophia", a robot built to look like a woman. To me, though, anyone who watched that interview witnessed not Sophia's excellence but humanity's capacity for constant innovation. My answer to my parents rests on the mystery of human nature. From ancient to modern philosophy, brilliant thinkers have laboured to explain the phenomena of our world, human self-awareness, and the human spirit and emotions. To this day there is no complete explanation; scientists cannot yet grasp the most basic concepts of our world, let alone teach a robot how to be human. On this view, a robot, however clever and efficient, cannot match an ordinary person's feelings and creativity, still less a teacher capable of inspiring and engaging students.

          When I finished, I could still sense my parents' doubt, and their worry about the social problems AI will bring. The argument paused for a while, and then my mother spoke again: "Fine, suppose that a robot perfect enough to completely replace teachers is truly impossible. How can you deny that one day robots, in some form, will enter education to handle the secondary tasks? Classes used to involve only people, stationery and a blackboard. With the development of computers and the internet, teaching has kept changing, using computers, machines and other media to raise its quality. In the scenario I have in mind, the next step is robots serving as teaching assistants. But those assistants used to be newly hired, inexperienced young teachers, or graduate students and PhD candidates. To raise the quality of education, governments may well let robots replace these people. On the surface the change does more good than harm, so it seems unproblematic. Yet the teaching assistant then becomes a vanishing position, and, as experienced teachers, we believe those junior teachers and graduate students would lose an important entry point into the profession. Their later development would be much harder, because they would never have experienced what a teacher actually does in class, or how to manage students' reactions in the classroom. So a government would be providing a kind of intelligent assistant, while creating a latent problem for education."



         Conversely, as we grow up we become wiser and more independent. So one might say that as our own knowledge develops, we come to rely less and less on other people's views. Yet the most remarkable thing is that the finest, most experienced people I have ever known wore the most ordinary, most humble expressions. It is because they understand very well that, in studying the world, knowledge is a mysterious thing: the more you learn, the more there is to learn, not less. From their perspective, "公说公有理,婆说婆有理" ("each side insists it is in the right") is not an irresolvable dispute, but a necessary condition for integrating different points of view.

Chinese Idioms aka 成语’s

This article is dedicated to Chinese idioms and phrases learned through my experience with the language. Its purpose is mainly self-educational, and it serves as a good record of my progress so far on my way to mastering Chinese. For the readers that stumble upon this page, I would be more than grateful for any clues or corrections to the work so far. Of course, if this material is helpful to you in any way, that is double satisfaction on my side. For all the Chinese language fans out there, here comes the list (which is constantly updated… well, weekly for the most part):

滴水穿石 : Direct translation would be "dripping water can wear a hole in the stone". It is used as a metaphor for the progress one can make with persistent small efforts.

画蛇添足 : Translates as "adding legs to the painted snake". This metaphor is used in situations where something is spoiled by overdoing it, for example when someone takes a redundant, unnecessary action.

附庸风雅 : This idiom refers to people or entities that mingle with art and literature just to show off.

一叶障目 : Direct translation "a leaf is blocking the eye". A metaphor used in situations where someone is distracted by secondary details and phenomena while losing sight of the big picture.

三令五申 :Direct translation would be “three orders and five injunctions”, which means repeated orders and injunctions.

一箭双雕 : Means "shoot two eagles with one arrow". A metaphor that exists in many other languages and means the obvious: achieving two goals with one action or one try.

愚公移山 : "Yu Gong who moved the mountains". Like the idiom right above, it is based on a story, and it means that determination can lead to victory and courage can help us overcome any difficulty.

神不守舍 : "The spirit is not staying at home". This metaphor describes someone whose mind is wandering; simply said, someone absent-minded or out of sorts.

入木三分 : "Enter the wood three points deep". It is connected with the story of a man famous for his calligraphy skills, cultivated by practise so frequent and hardworking that the ink of his brush soaked deep into the wooden board he was practising on. Later, people who tried to carve out his characters found that it was almost impossible to do so. Nowadays the phrase describes something penetrating and profound (like an analysis or a statement).

挨家挨户 : “One house after another household”. It actually means “doing something from door to door”.

照葫芦画瓢 : "Draw a ladle by looking at a gourd". It means that someone is merely copying or imitating.