Have questions on Machine Translation, (Deep) Neural Machine Translation or simply practical questions? Check out our Frequently Asked Questions (FAQ) below.
The Omniscien Technologies FAQ Series Covers:
Neural Machine Translation (NMT)
Output words are predicted from the encoding of the full input sentence and all previously produced output words. Do previously translated phrases/segments in the same project predict future segments, or is the encoding/context confined to the segment?
It is currently restricted to single segment translation. Additional segments could be added as additional conditioning context, yes. There have been some efforts to exploit discourse-level context, but these systems are in early stages.
For decoding to occur coding is necessary, but what is the conceptual relationship between coding, decoding and translation?
The goal of encoding is to take the input sentence and produce a representation that is, some would argue, more semantic: one that captures the meaning of words in context. The goal of decoding is then to turn this encoding of the input sentence into an output sentence. The encoding step comes first and the decoding step afterwards.
Typically, the input and output of SMT systems are sentences. This limits SMT systems in modeling discourse phenomena. Could we go beyond sentences with NMT systems?
It seems fairly straightforward to do, because we already have embeddings that are conditioned on everything that came before. So, in general, the good news is yes: it is relatively straightforward to add conditioning context when you make predictions.
Two objectives were mentioned: Adequacy and Fluency. Where does grammar fit into this, and does NMT check grammar rules or does it only see sequences of words?
Grammar is part of fluency. Readers will not think a sentence is fluent when it has grammatical errors.
Some researchers have proposed fully character-level NMT to solve the rare word problem in NMT, especially in the last two months. What is the performance when using character-level NMT?
Character-level models seem to be a decent approach to unknown word translation. In practice, there may be better solutions, e.g., the use of additional terminology dictionaries or special handling of names.
Do you think that sequence-to-sequence gives better results than a character-based model?
Currently, character-based models have not been proven to be superior.
You mentioned that the machine learns “It’s raining cats and dogs” as a whole phrase. How does it learn something more complex like “Sarah looked up, the sky was bleak but there shouldn’t be any cats and dogs coming out of it.”
Well, that is a tough one. Theoretically it is possible that the concept “cats and dogs” will be learned to be similar to “rain”, but that is a lot to ask for.
When comparing SMT to NMT, what is the difference in the time required to train an engine?
SMT systems can be typically trained in 1-2 days on regular CPU servers. NMT seems to require 1-2 weeks on GPUs.
Is monolingual data still useful in NMT as it is for an SMT language model or is parallel data the only data useful at training time?
The current basic model only uses parallel data. There are several methods and extensions to the model that accommodate monolingual data, but consensus best practices have not yet been established. This is an area of active research.
Can you do the “incremental” training of NMT?
Incremental training improves the model continuously when new training data arrives. This fits closely the “online training” method of neural network models. So, in theory, these models can be incrementally updated relatively straightforwardly.
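To make the idea of online updates concrete, here is a minimal sketch. It is not an NMT system: the "model" is a single hypothetical weight trained by stochastic gradient descent, and the two small data sets are invented stand-ins for an original domain and a new domain. It only illustrates that training can continue from existing parameters when new examples arrive.

```python
def sgd_step(weight, x, y, learning_rate=0.05):
    """Update a one-parameter model (y is approximately weight * x) from one example."""
    error = weight * x - y
    return weight - learning_rate * error * x


def train(weight, examples, epochs=50):
    """Process the examples one at a time, as online training does."""
    for _ in range(epochs):
        for x, y in examples:
            weight = sgd_step(weight, x, y)
    return weight


# Initial training data follows y = 2x (a stand-in for the original domain).
base_weight = train(0.0, [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])

# New data follows y = 3x (a stand-in for a new domain). Training continues
# from the existing weight instead of starting again from zero.
adapted_weight = train(base_weight, [(1.0, 3.0), (2.0, 6.0)])
```

In a real system the same pattern would apply to millions of parameters, and how much new data is needed, and how to avoid forgetting the original domain, are open questions.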
How much parallel text do we need when we use NMT systems compared to traditional SMT systems? Is it true that NMT systems need more data than SMT systems?
That seems to be the consensus in the field: at least to get going, you need more data. You can easily build a phrase-based machine translation system from a million words, but this does not seem to be the case for neural machine translation, where you need more data. Having little data is especially bad if your data is slightly off-domain.
The difference in the type of data that goes into an NMT system is that NMT is currently trained only on translated text. For instance, NMT systems are not even trained on any additional target-language text, which is a huge part of traditional statistical … There is no separate language model. When it comes to terminology and certain kinds of additional information, it is currently not entirely clear how to integrate that with NMT.
Do you think that we need more accurate (higher quality) data than is used to train SMT models?
This is an open question at this point and requires more research.
How many companies use NMT on a commercial basis at this point? What’s the general feedback of them so far?
As of today (January ’17) a number of vendors have announced or indicated offerings. We are not in a position to compare or comment on the quality of the different systems.
By when will NMT be commercially deployable on a large scale (re industries)? Will it still be largely based on SMT as a base?
There are many components in SMT that are also used in NMT models, so there will be re-use. How ideas or code will be merged between SMT and NMT remains to be seen.
One of the things that’s made SMT commercially feasible is adaptivity: to adjust a system’s behaviour based on small bits of human input, without retraining on the entire data set (too costly). As I understand that also needed technology innovation. Do current NMT models have this capability, or is adaptive NMT still some ways off / or not feasible with current technology at all?
The learning algorithm that is used to train all these parameters is an online learning algorithm, meaning that it learns by continuously processing new training examples. In practice, that means you give it new sentence pairs and it learns from them. In theory, it should be pretty straightforward to take an existing model, give it some additional training on data that might be different in nature (a different domain, a different style), and so adapt the existing model to the new domain. So, there is definitely some hope for that. Theoretically this could work at the per-sentence level: you have an existing system, a new sentence comes in, you adapt the system, and it then knows about that sentence. It is, however, still in the research phase.
Is self-learning from post-editing easier with NMT than with SMT?
Theoretically, it can be done for both systems. There is proven technology for SMT, but the idea has not been fully explored for NMT.
How are the known terminology issues with SMT being addressed with NMT? Is deep learning supposed to solve them or make them more complicated?
This is currently not fully explored and is still in the research stage.
Is predictability supposed to work even with source text? Would it be possible, using the same approach, to reckon the suitability of a source for MT?
A neural language model (trained on the source language) may be used as an indicator of how similar the source text is to the training data, and hence of how likely it is to be translated correctly. Confidence measures are typically trained with this feature, among others.
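As a rough sketch of how a language model score can act as such an indicator, the example below computes the perplexity of a source sentence. Note the assumption: a tiny add-one-smoothed unigram model stands in for the neural language model, purely to keep the example self-contained. The function names and the training sentences are hypothetical.

```python
import math
from collections import Counter


def train_unigram_lm(sentences):
    """Count words in the (source-language) training sentences."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return counts, sum(counts.values())


def perplexity(sentence, counts, total):
    """Per-word perplexity under an add-one-smoothed unigram model.

    Lower values mean the sentence looks more like the training data.
    """
    vocab_size = len(counts) + 1  # one extra slot reserves mass for unseen words
    words = sentence.lower().split()
    log_prob = sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return math.exp(-log_prob / len(words))


counts, total = train_unigram_lm([
    "the engine translates the sentence",
    "the sentence is short",
    "the engine is fast",
])
in_domain = perplexity("the engine is short", counts, total)
out_of_domain = perplexity("quantum chromodynamics is hard", counts, total)
```

Here `in_domain` comes out lower than `out_of_domain`, which is the signal a confidence measure would combine with its other features.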
What do the black spots represent in the Neural Language Model Slide?
In the webinar on “Introduction to Neural Machine Translation” the circles stand for individual nodes in the layer. The color indicates their value – the darker the higher. In the mapping from the 1-hot-vector to the embedding you can see how white/black nodes get mapped to continuous values.
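The mapping from a 1-hot vector to an embedding can be sketched as a multiplication with an embedding matrix, which amounts to selecting one row. The three-word vocabulary and the matrix values below are made up for illustration; in a real system the values are learned.

```python
VOCAB = ["cat", "dog", "rain"]

# Hypothetical embedding matrix: one row of continuous values per word.
EMBEDDING_MATRIX = [
    [0.2, -0.5, 0.9],   # cat
    [0.3, -0.4, 0.8],   # dog
    [-0.7, 0.6, 0.1],   # rain
]


def one_hot(word):
    """Vector with 1.0 at the word's position and 0.0 elsewhere."""
    return [1.0 if w == word else 0.0 for w in VOCAB]


def embed(one_hot_vector):
    """Multiply the 1-hot vector with the embedding matrix."""
    dims = len(EMBEDDING_MATRIX[0])
    return [
        sum(one_hot_vector[i] * EMBEDDING_MATRIX[i][d] for i in range(len(VOCAB)))
        for d in range(dims)
    ]


dog_vector = embed(one_hot("dog"))
```

The result for "dog" is simply the second row of the matrix, which is why implementations usually do a table lookup rather than an actual multiplication.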
Are there any experiments taking into account the training data (e.g. simple sentences / few words first etc.)?
There is a method called “curriculum learning”, where training starts with simpler, shorter sentences. This has not yet been extensively studied in the context of neural machine translation.
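A minimal sketch of the ordering step is shown below. It assumes that source sentence length is an acceptable proxy for difficulty; the function name and the stage count are hypothetical, and published curriculum methods differ in how they define and schedule difficulty.

```python
def curriculum_stages(sentence_pairs, num_stages=3):
    """Split (source, target) pairs into stages, shortest sources first."""
    if not sentence_pairs:
        return []
    ordered = sorted(sentence_pairs, key=lambda pair: len(pair[0].split()))
    stage_size = -(-len(ordered) // num_stages)  # ceiling division
    return [ordered[i:i + stage_size] for i in range(0, len(ordered), stage_size)]


pairs = [
    ("Ich stehe um 8 Uhr auf .", "I get up at 8 o'clock ."),
    ("Hallo .", "Hello ."),
    ("Das Haus ist klein .", "The house is small ."),
    ("Danke sehr .", "Thank you ."),
]
stages = curriculum_stages(pairs, num_stages=2)
```

Training would then process `stages[0]` before `stages[1]`, so the model sees the short sentences first.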
Do words have fixed embeddings? How is the information in the embedding decided on?
The embeddings are learned together with everything else. The slide only gave the big picture of how everything is connected. Once that picture is drawn, a machine learning method called back-propagation takes the model as, basically, a layout: you add data, run training for a few weeks, and it learns all the parameters of the model, including the embeddings. So the embeddings are not provided separately. There is no fixed embedding for a language that you build once and use forever.
Could you please explain the concept of “Embedding” in more detail?
“Embedding” refers to the representation of a word in a high-dimensional continuous vector space. The hope is that words that are similar get similar representations (i.e., the difference between their vectors is small). There are various ways such embeddings can be learned. In the talk “Introduction to Neural Machine Translation”, we discussed learning them both with word2vec and within the context of a neural machine translation system. See here for recordings of these webinars.
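One common way to measure how close two embeddings are is cosine similarity, sketched below. The three-dimensional vectors are invented for illustration; real embeddings have hundreds of dimensions and are learned from data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings: "cat" and "dog" are placed close together.
EMBEDDINGS = {
    "cat": [0.2, -0.5, 0.9],
    "dog": [0.3, -0.4, 0.8],
    "rain": [-0.7, 0.6, 0.1],
}

cat_dog = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["dog"])
cat_rain = cosine_similarity(EMBEDDINGS["cat"], EMBEDDINGS["rain"])
```

With these toy values, `cat_dog` is close to 1 while `cat_rain` is negative, which is the kind of pattern one hopes learned embeddings show for similar and dissimilar words.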
Do NMT systems have problems with long sentences, as SMT systems do?
Yes, that still appears to be the case. The attention mechanism in neural models is currently not reliable enough.
How much effort is being paid to translations into highly inflected languages?
There has been little effort to tailor neural machine translation specifically to morphologically rich languages, but the general model has been successfully applied to languages such as Russian and Czech.
Could you recommend a brief bibliography on Neural Machine Translation (NMT)?
A fairly comprehensive bibliography for neural machine translation can be found here.
In December ’16, Google (GNMT) also proposed multilingual NMT, training several language pairs at the same time with the same encoder-decoder. What do you think about the issues in multilingual NMT?
From our understanding, this may be successful for low resource languages, but building direct translation models for each language pair generally works better.
Can NMT be applied when the source and target languages are the same? For example, English to English. And will it perform better than translating between two different languages?
Depends on what the task is. Models similar to the neural machine translation model have been applied to chat bot translation, where “translation” is between question and answer. For tasks such as text simplification, grammar checking, or automatic post-editing, neural machine translation systems have been successfully applied.
Are there specific languages that NMT is better suited for than others?
This is too early to tell. We have seen relatively better results with morphologically rich output languages and distant language pairs.
Are there cases where Rule-Based Machine Translation RBMT/SMT is better than NMT?
While NMT technology is evolving quickly, at this moment SMT still offers more control. This includes control over writing style for specific domains through the Language Model (LM), and the ability to override and control terminology, which is key in many professional translations, both at run-time and by training the engine accordingly. Finally, SMT systems still offer more control through their rules, for example for the markup and handling of complex content such as patents or eCommerce product catalogues, and for conversions such as “on-the-fly” conversion of measurements.
Do you think linguists can be involved and work together with the developers in training the machines in order to achieve better results? Is this something that’s happening already perhaps?
There is benefit to linguistic knowledge in preprocessing and preparing language data. We have seen benefits from linguistically motivated preprocessing rules for morphology or reordering in SMT systems. It is not yet clear whether these or similar inputs would also benefit NMT systems. Since NMT systems are data-driven and typically use unannotated text, there is no obvious place where linguistic insight would be helpful. However, as with SMT, post-edited feedback can be used to improve translation quality in NMT.
What will be required to integrate NMT with translation software, CAT and TMS systems such as SDL Studio and Kilgray memoQ?
While we cannot speak for other vendors, integrating with our Language Studio platform will be no different than it is today. We have a well-established API that is fully integrated with a wide range of CAT and TMS systems and is also in use by many enterprises directly.
An engine in Language Studio will be configured for an NMT-specific or SMT-specific workflow, and all the appropriate pre- and post-processing then takes place automatically. From an external integration perspective, the technology used to perform the translation is transparent.
Do you see room for curated language/knowledge bases (such as FrameNet, for example) in the future of Neural MT? If so, which path do you think is a promising one?
This is a bit difficult. There has been some success in enriching the input to NMT systems with added linguistic annotation, so FrameNet annotations may be useful as well. But this is very superficial integration of these knowledge sources. Deeper integration has not been studied yet.
What are some of the common search (decode) algorithms used to find the translation in neural machine translation? (i.e. Does it still employ Beam Search like Phrase-based SMT or is this substituted by the output word prediction model)
Search is greedy and predicts one word at a time. There are beam search algorithms that give some improvements with small beam sizes. These consider the second-best word translation, third best translation, etc. and proceed from there.
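The difference between greedy search and a small beam can be sketched with a toy next-word model. Everything in `TOY_MODEL` is invented: it maps the words produced so far to hypothetical probabilities for the next word, where a real system would obtain them from the decoder. A beam size of 1 is greedy search.

```python
import math

# Hypothetical next-word probabilities, keyed by the words produced so far.
TOY_MODEL = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"house": 0.55, "home": 0.45},
    ("a",): {"house": 0.9, "home": 0.1},
    ("the", "house"): {"</s>": 1.0},
    ("the", "home"): {"</s>": 1.0},
    ("a", "house"): {"</s>": 1.0},
    ("a", "home"): {"</s>": 1.0},
}


def beam_search(model, beam_size, max_len=10):
    """Return the best finished word sequence found with the given beam size."""
    beams = [((), 0.0)]  # (words so far, log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, prob in model[prefix].items():
                candidates.append((prefix + (word,), score + math.log(prob)))
        # Keep only the best few extensions; beam_size=1 is greedy search.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            if prefix[-1] == "</s>":
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    return max(finished, key=lambda c: c[1])[0]


greedy_result = beam_search(TOY_MODEL, beam_size=1)
beam_result = beam_search(TOY_MODEL, beam_size=2)
```

In this toy model, greedy search commits to "the" because it is the most likely first word and ends with "the house" (probability 0.33). A beam of size 2 also keeps "a" alive and finds "a house" (probability 0.36), which is the better overall sequence.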
I am wondering who will be absorbing the cost of implementing NMT? Will it be the end client, the LSPs or translators?
To put this question into perspective, an SMT engine takes about 1 day to train on a multiple-CPU system with about 256GB RAM. Training an NMT system with CPU technology is not viable, as it would take many months. Training using GPU technology still takes about 1 month. As such, the costs of training are notably higher.
Similarly, at translation runtime (decoding), GPU technology is faster than CPU, but costlier. Using a CPU to decode NMT delivers about 600 words per minute, while the same CPU for SMT would deliver 3,000 words per minute, 5 times faster. Using GPUs requires new hardware investments, which at this time are quite costly, but depending on the GPU technology used, NMT decoding on a GPU can be several times faster than SMT.
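The CPU throughput figures above translate into turnaround time as follows. The 90,000-word job size is a hypothetical example; the two words-per-minute rates are the ones quoted above.

```python
NMT_CPU_WORDS_PER_MINUTE = 600    # figure quoted above
SMT_CPU_WORDS_PER_MINUTE = 3000   # figure quoted above


def minutes_to_translate(word_count, words_per_minute):
    """Decoding time in minutes for a job of the given size."""
    return word_count / words_per_minute


job_size = 90_000  # hypothetical job, in words
nmt_minutes = minutes_to_translate(job_size, NMT_CPU_WORDS_PER_MINUTE)
smt_minutes = minutes_to_translate(job_size, SMT_CPU_WORDS_PER_MINUTE)
speed_ratio = SMT_CPU_WORDS_PER_MINUTE / NMT_CPU_WORDS_PER_MINUTE
```

For this job size, NMT on a CPU would need 150 minutes against 30 minutes for SMT, which is the factor of 5 mentioned above.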
On the one hand, the cost will likely come down over the coming years as GPU capacity increases and the systems become more efficient. On the other hand, cost/benefit will depend largely on the use cases and the industry. In some cases, the LSP might absorb some of the cost in return for improved efficiency; in other use cases the customer might need to absorb the added cost for the improved quality.
Is human/BLEU scoring the preferred/only way of performing LQA for NMT output? Are models like DQF and MQM used as part of the research or is the NMT output quality above what these systems can show?
There are many different frameworks for measurement, but BLEU is still the dominant one. Often it is cross-checked against human evaluation. DQF and MQM require a lot more time and investment. In the LSP space, one of the preferred approaches has moved to productivity measurement as a core metric, as this goes straight to the business's bottom line. In this model, what matters is not so much the number of errors made, but how quickly the errors can be fixed and the sentence edited to production quality.
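For readers unfamiliar with BLEU, the sketch below shows its core ingredients: clipped n-gram precision up to 4-grams and a brevity penalty. It is a simplification under stated assumptions: it scores one sentence against one reference, uses whitespace tokenization, and applies no smoothing, whereas BLEU is normally computed over a whole test set. It should not be used to report scores.

```python
import math
from collections import Counter


def ngrams(words, n):
    """Count all n-grams of length n in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Simplified single-reference, sentence-level BLEU without smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each n-gram's count to the number of times the reference has it.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0
        log_precision_sum += math.log(overlap / total)
    # Penalize candidates that are shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "the cat sat on the mat"
exact = bleu("the cat sat on the mat", reference)
close = bleu("the cat sat on the mat today", reference)
unrelated = bleu("completely different words here", reference)
```

An exact match scores 1.0, the near match scores between 0 and 1, and the unrelated sentence scores 0. BLEU only measures overlap with a reference, which is one reason it is usually cross-checked against human judgment.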
In the German-English context, how would an NMT system handle a sentence such as “Ich stehe um 8 Uhr auf.”, where the two parts of a “trennbares Verb” should be identified as belonging together, although distantly separated within the sentence?
The attention mechanism in NMT systems may align an output word to several input words, and they do not have to be next to each other. Given what we have seen so far, NMT systems handle verbs with separable prefixes comparably well, in both the input and the output.
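The following sketch shows how attention weights can concentrate on two input words that are far apart. It assumes a simplified dot-product scoring function, and the two-dimensional encoder and decoder states are invented so that "stehe" and "auf" both match the decoder state. A trained system would learn such states, and may use a different scoring function.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(decoder_state, encoder_states):
    """Dot-product score for each input position, normalized with softmax."""
    scores = [
        sum(q * k for q, k in zip(decoder_state, state))
        for state in encoder_states
    ]
    return softmax(scores)


SOURCE = ["Ich", "stehe", "um", "8", "Uhr", "auf"]

# Hypothetical encoder states, one per source word.
ENCODER_STATES = [
    [0.1, 0.0], [2.0, 0.5], [0.0, 0.2], [0.0, 0.1], [0.1, 0.3], [1.8, 0.6],
]
# Hypothetical decoder state at the step that produces "get up".
DECODER_STATE = [2.0, 0.0]

weights = attention_weights(DECODER_STATE, ENCODER_STATES)
top_two = sorted(range(len(SOURCE)), key=lambda i: weights[i], reverse=True)[:2]
attended_words = [SOURCE[i] for i in sorted(top_two)]
```

With these values, almost all of the attention weight falls on "stehe" and "auf", even though four other words sit between them.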
What is the added value of Neural MT, compared to SMT and what are their drawbacks?
The main benefits are generalization (the ability to learn about the translation of a word from occurrences of similar words) and use of larger context (using the entire input sentence to determine a word's meaning and hence its correct translation), which results in better readability. The main drawbacks are higher computational cost, occasional unpredictable translations, less transparency and no control beyond adding bilingual training data. For the moment, NMT systems are also less customizable, but this will be remedied in the near future as many researchers, including Omniscien researchers, are working to address such issues.
Have the word alignment algorithms been successfully transported to the publicly available cloud engines? What problems exist in creating a proper many-to-many word alignment? (i.e. 0, 1, 2 or more words in the source language can be mapped to 0, 1, 2 or more words in the target language).
NMT does not use word alignment in the same manner that SMT does. NMT translates entire sentences, while SMT translates phrase by phrase. As such, the public NMT systems that we have seen do not make word alignment available and only show the sentence alignment.
In Language Studio, our upcoming NMT offerings will include word alignment information in the same manner as our SMT offerings, where it is available as part of the programmatically accessible workflow and log files. As with our current core SMT platform, which is a hybrid rules, syntax and SMT engine, our NMT offering will also be hybrid to ensure that features such as word alignment are backwards compatible.
Is the ideal training data the same as for SMT?
The training data is basically the same, in that you use bilingual data for both SMT and NMT. However, you don’t use a language model at all, as NMT only learns from bilingual data. There is some research into using monolingual data, but there are no substantial results yet that show improvement. As a result, if you only have a small amount of data, you may get better results from SMT using a combination of bilingual data and target language data for writing style, in conjunction with data manufacturing such as that provided by Language Studio Professional custom engines. In addition, unlike SMT, you do not yet have the ability to override terminology either during engine training or at runtime, which in the case of SMT is often another key factor that improves the quality of the translation. All of these are still areas of research for NMT.
How do you think the reliability/accuracy of NMT will evolve? As with high-frequency trading, might we get to a situation in which humans cannot check the reliability/accuracy of NMT output?
This is a multi-dimensional question. Firstly, as discussed on the webinar, the core NMT systems generally produce very good results in terms of understandability and fluency, but they still lack the level of control available in Statistical Machine Translation (SMT) systems, and their errors are more unpredictable and more difficult to trace. At present SMT is easier to manage and provides support for features such as glossaries, non-translatable terms, rules and format handling that are needed for higher levels of customization and are suited to more complex applications.
In the near term, NMT does indeed deliver higher general fluency, readability and accuracy, and this is important. However, a major feature that is available in Language Studio Professional SMT custom engines is the ability to manage and control terminology and writing style. This means NMT output may be more understandable, but may not match your style, resulting in greater levels of editing prior to publishing. To put this in context, if you compared the writing style of an automotive engineering manual to marketing for an automotive company, they are very different. If your marketing content reads like an engineering manual, you would have to rewrite large portions of the text. At this time, both approaches to MT have benefits and disadvantages. We are working to resolve this and the other outstanding limitations of NMT for our own upcoming NMT product offerings.
Hence at this time, depending on the application, the technology choice might vary.
As to applications where “raw” MT is used (as opposed to post-edited MT), again the application requirements will drive the choice of technology. Where terminology normalization is required, for example for analytics applications, SMT is far better suited since it allows full control over terminology, even at runtime. If fluency is required in the output, some NMT engines will do better than SMT, so again a detailed understanding of the requirements is needed to assist with the choice of technology.