The State of Neural Machine Translation (NMT) by Philipp Koehn
Technology Overview
First of all, neural machine translation (NMT) is not a drastic step beyond what we have traditionally done in statistical machine translation (SMT). Its main departure is the use of vector representations (“embeddings”, “continuous space representations”) for words and internal states. The structure of the models is simpler than that of phrase-based models: there is no separate language model, translation model, and reordering model, but just a single sequence model that predicts one word at a time.
However, this sequence prediction is conditioned on the entire source sentence and the entire already produced target sequence. This gives these models the capability to take long-distance context into account, which seems to have positive effects on word choice, morphology, and split phrases (e.g., “switch … on”, more prevalent in German and other languages).
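To make this concrete, here is a minimal sketch of such a sequence model (assuming PyTorch; the dimensions, vocabulary sizes, and the absence of attention are simplifications for illustration, not a description of any particular production system):

    import torch
    import torch.nn as nn

    class TinySeq2Seq(nn.Module):
        """Toy encoder-decoder: predicts one target word at a time,
        conditioned on the whole source sentence and the target prefix."""
        def __init__(self, src_vocab=50000, tgt_vocab=50000, dim=512):
            super().__init__()
            self.src_emb = nn.Embedding(src_vocab, dim)   # source word embeddings
            self.tgt_emb = nn.Embedding(tgt_vocab, dim)   # target word embeddings
            self.encoder = nn.GRU(dim, dim, batch_first=True)
            self.decoder = nn.GRU(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, tgt_vocab)          # scores over target words

        def forward(self, src_ids, tgt_prefix_ids):
            # Encode the entire source sentence.
            _, src_state = self.encoder(self.src_emb(src_ids))
            # Condition the decoder on the source encoding and the
            # target words produced so far.
            dec_out, _ = self.decoder(self.tgt_emb(tgt_prefix_ids), src_state)
            # Return scores for the next target word only.
            return self.out(dec_out[:, -1])

    model = TinySeq2Seq()
    src = torch.tensor([[4, 17, 99, 2]])    # toy source word ids
    prefix = torch.tensor([[1, 55]])        # target words produced so far
    next_word_scores = model(src, prefix)   # shape: (1, tgt_vocab)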
The other key benefit is better generalization from the data, e.g., the ability to add to the knowledge about the translation behavior of “cars” from examples that contain “car” or “autos”.
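The intuition can be illustrated with cosine similarity between word vectors (the numbers below are hand-picked for illustration; in a real system the embeddings are learned during training):

    import torch
    import torch.nn.functional as F

    # Hand-picked toy vectors standing in for learned embeddings; after
    # training, related words such as "car", "cars", and "autos" tend to
    # end up close together in the embedding space.
    car   = torch.tensor([0.8, 0.1, 0.3])
    cars  = torch.tensor([0.7, 0.2, 0.3])
    house = torch.tensor([0.1, 0.9, 0.4])

    print(F.cosine_similarity(car, cars, dim=0))    # high: evidence transfers
    print(F.cosine_similarity(car, house, dim=0))   # lower: unrelated word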
Ultimately, the training of the models is similar to that of phrase-based models. It takes a parallel corpus and learns all required model parameters from it. How to best use monolingual target language data (used in traditional language modeling) is not fully worked out at the moment. One successful technique creates a synthetic parallel corpus by back-translating the monolingual data into the source language.
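A sketch of the back-translation idea (the reverse-direction translation model here is a placeholder, not a real API):

    # Back-translation sketch: turn monolingual target-language text into
    # synthetic parallel data by translating it back into the source language.
    # `reverse_model.translate` stands in for a target-to-source MT system.

    def build_synthetic_corpus(monolingual_target_sentences, reverse_model):
        synthetic_pairs = []
        for target_sentence in monolingual_target_sentences:
            # Machine-translated (possibly noisy) source side ...
            source_sentence = reverse_model.translate(target_sentence)
            # ... paired with the human-written target side.
            synthetic_pairs.append((source_sentence, target_sentence))
        return synthetic_pairs

    # The synthetic pairs are then mixed with the genuine parallel corpus,
    # and the forward (source-to-target) model is trained on the combination.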
Claims of Superior Performance
This year, neural translation systems that participated in the WMT evaluation outperformed phrase-based statistical systems by up to 3 BLEU points. Google makes similar claims for its neural system in its September press release. These are significant gains, with upside for future models, but I would be cautious about exaggerated comparisons to human performance. Other statistical systems have made great progress over the last decade as well, and this is another step up the ladder. Claims that near-human performance is just around the corner are misleading and create unrealistic expectations.
Technical Limitations
A lot of the methods that allow customizing and adapting machine translation systems to typical customer data have not yet been developed for neural translation models.
A key problem of neural models is the limited vocabulary due to computational constraints. These models are trained with, say, 50,000-word vocabularies, and any words outside this vocabulary are broken up into word pieces. This is a real problem for deployments with large numbers of brand names or a large specialized vocabulary.
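As a toy illustration of word pieces (the piece inventory below is hand-picked; real systems learn it from data, for example with byte-pair encoding):

    # Toy word-piece segmentation: an out-of-vocabulary word is broken into
    # pieces that are in the (limited) vocabulary.
    PIECES = {"trans", "lat", "ion", "un", "break", "able"}

    def segment(word, pieces):
        """Greedy longest-match segmentation into known pieces."""
        result, start = [], 0
        while start < len(word):
            for end in range(len(word), start, -1):
                if word[start:end] in pieces:
                    result.append(word[start:end])
                    start = end
                    break
            else:
                result.append(word[start])  # fall back to single characters
                start += 1
        return result

    print(segment("translation", PIECES))   # ['trans', 'lat', 'ion']
    print(segment("unbreakable", PIECES))   # ['un', 'break', 'able']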
Domain adaptation techniques have not been thoroughly developed yet beyond a few pilot studies. How to best address the typical case of a lot of foundation data and little client data is currently unclear.
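One of the piloted approaches is continued training: train on the large foundation corpus first, then run a few more passes over the small client corpus, typically with a reduced learning rate. A rough sketch (the training function, corpora, and epoch counts are placeholders, not a specific toolkit's API):

    # Rough sketch of domain adaptation by continued training ("fine-tuning").
    # `train_one_epoch` is a placeholder callable for one pass of regular
    # NMT training; the corpora and epoch counts are illustrative only.

    def adapt(model, foundation_corpus, client_corpus, train_one_epoch):
        # Phase 1: train on the large, general-domain foundation data.
        for _ in range(10):
            train_one_epoch(model, foundation_corpus, learning_rate=1e-3)
        # Phase 2: continue training on the small in-domain client data,
        # with a smaller learning rate and fewer passes so that the
        # general-domain knowledge is not overwritten.
        for _ in range(2):
            train_one_epoch(model, client_corpus, learning_rate=1e-4)
        return model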
Systems we built at Omniscien Technologies involve a lot of data engineering, such as the collection of specialized terminology, rules for reordering, and rule-based handling of dates, currency values, and other unit translations. The XML mark-up scheme that allows the integration of this additional knowledge in phrase-based models does not yet exist for neural systems. At this point, even a “word alignment” between source and translation cannot be clearly established (only approximated).
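To illustrate the approximation: the attention weights of a neural model can be read as a soft alignment, and a hard word alignment can be guessed by linking each target word to the source word it attends to most. A sketch with made-up attention values:

    import numpy as np

    # Made-up attention matrix: rows = target words, columns = source words.
    # In a real system these weights come from the model's attention mechanism.
    source = ["das", "Haus", "ist", "klein"]
    target = ["the", "house", "is", "small"]
    attention = np.array([
        [0.82, 0.08, 0.06, 0.04],
        [0.10, 0.80, 0.06, 0.04],
        [0.05, 0.07, 0.78, 0.10],
        [0.04, 0.05, 0.09, 0.82],
    ])

    # Approximate a hard alignment: for each target word, pick the source
    # word that received the most attention.
    for tgt_word, weights in zip(target, attention):
        print(tgt_word, "->", source[int(np.argmax(weights))])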
There is also some indication from experimentation that neural systems are quite data hungry. They produce considerably worse results with very small data sets, and it is not yet entirely clear how much data is needed for them to surpass phrase-based models.
A serious concern is also that neural systems are very hard to debug. While in phrase-based statistical machine translation systems there is still some hope of tracing back why the system came up with a specific translation (and of remediating the problem), there is little hope for that in neural systems. In addition, the errors produced by neural systems are sometimes quite capricious: the system may just produce output words that look good in context but have little to do with the source sentence.
These are all problems that future research will likely tackle, and I would expect that most of them will be addressed to some degree in the next 1-3 years.
Deployment
The biggest hurdle for larger-scale commercial deployment at this time, however, is the very long training time (weeks), which requires specialized hardware (GPUs that cost $2,000 apiece). On the other hand, decoding speeds seem to be reasonably fast, so once a system is trained, translation speed is acceptable, although not (yet) at the level of RBMT or SMT systems.
There are a lot of open source software tools around that allow one to build systems that match the state of the art. Even the Google system that uses customized hardware is structurally not different from current research systems, such as the neural machine translation systems that my colleagues at the University of Edinburgh have built (which won most language pairs at WMT 2016). Since the models are relatively simple, there is also very little “black magic” in training these systems. However, as always, building the system is only part of the equation. Preparing sufficient good-quality data for the training step, as well as effective pre- and post-processing, has a very significant impact on the ultimate translation quality in SMT and NMT systems alike and should not be underestimated.
Summary
It is clear that Neural Machine Translation (NMT) is yet another step up and a very promising technology. At the same time, as often happens with new technologies, we are currently in the “hype” phase. While NMT might be right for some applications today, it is important to understand the different technologies and their respective advantages and disadvantages, the production processes and improvement cycles, as well as the detailed use cases, before selecting the translation technology for a particular application.