NMT Conference and Webinar Questions and Answers

During conferences and webinars, the Omniscien team is often asked questions about Neural Machine Translation (NMT). We have collected many of the common questions and provide the answers here in an easy reference list. Please keep in mind that technology evolves quickly. Each response was accurate at the time the question was asked, but later advances may have made it outdated. If you have more questions, please feel free to contact us.

The Questions and Answers List

Output words are predicted from the encoding of the full input sentence and all previously produced output words. Do previously translated phrases/segments in the same project predict future segments, or is the encoding/context confined to the segment?

For decoding to occur, encoding is necessary, but what is the conceptual relationship between encoding, decoding and translation?

The goal of encoding is to take the input sentence and produce a representation of it that, some would argue, is more semantic: one that captures the meaning of words in context. The goal of decoding is then to turn this encoding of the input sentence into an output sentence. The encoding step comes first and the decoding step afterwards.

Typically, the input and output of SMT systems are sentences. This limits SMT systems in modeling discourse phenomena. Could we go beyond sentences with NMT systems?

It seems fairly straightforward to do, because we already have embeddings of a kind that are conditioned on everything that came before. So, in general, the good news is yes: it is relatively straightforward to add conditioning context when you make predictions.

Two objectives were mentioned: Adequacy and Fluency. Where does grammar fit into this, and does NMT check grammar rules or only see sequences of words?

Grammar is part of fluency. Readers will not think a sentence is fluent when it has grammatical errors.

Some researchers have proposed fully character-level NMT to solve the rare word problem in NMT, especially in the last two months. How well does character-level NMT perform?

They seem to be a decent approach to unknown word translation. In practice, there may be better solutions, e.g., the use of additional terminology dictionaries or special handling of names.

Do you think that sequence-to-sequence gives better results than a character-based model?

Currently, character-based models have not been proven to be superior.

You mentioned that the machine learns “It’s raining cats and dogs” as a whole phrase. How does it learn something more complex like “Sarah looked up, the sky was bleak but there shouldn’t be any cats and dogs coming out of it”?

Well, that is a tough one. Theoretically it is possible that the concept “cats and dogs” will be learned to be similar to “rain”, but that is a lot to ask for.

When comparing SMT to NMT, what is the difference in the time required to train an engine?

SMT systems can typically be trained in 1-2 days on regular CPU servers. NMT seems to require about 1 week on a single GPU. We use multiple GPUs working together to speed this up. In the past, training was much faster as there was less data involved. Now we train with hundreds of millions, or in some cases even more than 1 billion, bilingual sentences, so it takes more processing power.

Is monolingual data still useful in NMT as it is for an SMT language model or is parallel data the only data useful at training time?

The current basic model only uses parallel data. There are several methods and extensions to the model to accommodate monolingual data, but consensus best practices have not yet been established. This is an area of active research.

At Omniscien we have developed techniques and technologies to leverage monolingual data for data synthesis and data manufacturing that offer significant benefits to translation quality. This is available as part of our Professional Guided Custom MT engines.

Can you do “incremental” training of NMT?

Incremental training improves the model continuously when new training data arrives. This fits closely with the “online training” method of neural network models. At Omniscien we frequently use incremental training to improve an existing engine.
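The idea of continuing to train an existing model on newly arrived data can be illustrated with a very small sketch. The example below is not an NMT engine: it is a two-weight linear model updated one example at a time, and every name and number in it is invented for illustration. It only shows the general pattern of starting from the existing weights instead of retraining from scratch.

```python
def sgd_step(weights, features, target, lr=0.1):
    """One online update for a linear model with squared-error loss."""
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - target
    return [w - lr * error * x for w, x in zip(weights, features)]


def train_online(weights, examples, epochs=1, lr=0.1):
    """Process examples one at a time, updating the weights after each one."""
    for _ in range(epochs):
        for features, target in examples:
            weights = sgd_step(weights, features, target, lr)
    return weights


def mean_squared_error(weights, examples):
    total = 0.0
    for features, target in examples:
        prediction = sum(w * x for w, x in zip(weights, features))
        total += (prediction - target) ** 2
    return total / len(examples)


# Initial training on the original (toy) data, which follows y = 2*x1 + 1*x2.
original_data = [([1.0, 0.0], 2.0), ([0.0, 1.0], 1.0), ([1.0, 1.0], 3.0)]
base_weights = train_online([0.0, 0.0], original_data, epochs=200)

# New data arrives later and follows a slightly different pattern
# (y = 2*x1 + 2*x2), standing in for a new domain or style.
new_data = [([0.0, 1.0], 2.0), ([1.0, 1.0], 4.0)]
error_before = mean_squared_error(base_weights, new_data)

# Incremental training: continue from the existing weights.
adapted_weights = train_online(base_weights, new_data, epochs=300)
error_after = mean_squared_error(adapted_weights, new_data)

print(f"error on new data before: {error_before:.4f}, after: {error_after:.4f}")
```

Note that after adaptation the toy model fits the new data but no longer fits the original data as well, which is one reason the quality of the incoming data matters.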

How much parallel text do we need when we use NMT systems compared to traditional SMT systems? Is it true that NMT systems need more data than SMT systems?

That seems to be the consensus in the field: at least to get going, you need more data. You can easily build a phrase-based (SMT) machine translation system from a million words, but this is not the case for Neural Machine Translation, where you need more data. This is especially true if your data is slightly off-domain; having little data is then really bad.

The type of data that goes into an NMT system also differs. NMT is currently trained only on translated text. For instance, NMT systems are not trained on any additional target-language text, which is a huge part of traditional SMT engines. There is no separate language model.

Do you think that we need more accurate (higher quality) data than is used to train SMT models?

Research has shown that NMT does indeed require higher quality translations as training data when compared to SMT. SMT is helped in part by having a language model in the target language that works as a kind of filter to some of the lower probability choices.

One of the things that’s made SMT commercially feasible is adaptivity: the ability to adjust a system’s behaviour based on small bits of human input, without retraining on the entire data set (too costly). As I understand it, that also needed technology innovation. Do current NMT models have this capability, or is adaptive NMT still some way off, or not feasible with current technology at all?

The learning algorithm that is used to train all these parameters is an online learning algorithm, meaning that it learns by continuously processing new training examples. In practice, that means you give it new sentence pairs and it learns from them.

In theory, it should be pretty straightforward to take an existing model, give it some additional training on data that might be different in nature (different domains, different styles), and so adapt the existing model to the new domain. So there is definitely some hope for that. Theoretically, this could work at a per-sentence level: you have an existing system, new sentences come in, you adapt it, and your system knows about those sentences. While this works in the context of incremental training, it is not yet available as real-time learning; in the real-time context it is still in the research phase.

There are some other important things to consider about real-time learning. NMT engines cannot “unlearn”, so if an input sentence is bad, the engine is going to learn how to do things badly. If the engine learns from a sentence that was just entered in an editor, later translations may be produced incorrectly. In general, we would not recommend this approach, as the data it is learning from has not yet been through quality assurance and been signed off.

Is self-learning from post-editing easier with NMT than with SMT?

Both technologies can be used to progressively learn how to translate better. However, this should generally be done in stages after quality assurance rather than directly from post-edits, as edits made before the quality check may not be correct and the systems would then learn from incorrect data.

Is predictability supposed to work even with source text? Would it be possible, using the same approach, to reckon the suitability of a source for MT?

A neural language model (trained on the source language) can be used as an indicator of how similar a source text is to the training data, and hence how likely it is to be translated correctly. Confidence measures are typically trained with this feature, among others.
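A minimal sketch of this idea follows. To stay self-contained it uses a unigram count model with add-one smoothing as a stand-in for a neural language model, and the training sentences, function names and test sentences are all invented for illustration. The point is only that text resembling the training data receives a higher average log-probability than unfamiliar text.

```python
import math
from collections import Counter


def train_unigram(sentences):
    """Count word frequencies in the (source-language) training data."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    return counts, sum(counts.values())


def avg_log_prob(sentence, counts, total):
    """Average per-word log-probability with add-one smoothing."""
    words = sentence.lower().split()
    vocab_size = len(counts) + 1  # +1 reserves probability mass for unseen words
    log_prob = sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return log_prob / len(words)


# Invented stand-in for the source side of an engine's training data.
training_sentences = [
    "the engine is trained on parallel data",
    "the engine translates the input sentence",
    "the data is split into sentences",
]
counts, total = train_unigram(training_sentences)

familiar = avg_log_prob("the engine is trained on the data", counts, total)
unfamiliar = avg_log_prob("quarterly dividends rose sharply yesterday", counts, total)

print(f"familiar: {familiar:.2f}, unfamiliar: {unfamiliar:.2f}")
```

A score like this would be one input feature among several for a confidence measure, not a confidence measure on its own.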

At Omniscien we use this approach as part of our data synthesis and data creation processes when customizing an engine with a Professional Guided Custom MT Engine.

Do words have fixed embeddings? How is the information in the embedding decided on?

These embeddings are learned along with everything else. Here only the big picture was drawn of how everything is connected, but once you have that picture, a machine learning method called back-propagation treats the model as basically a layout: data is added, training runs for a few weeks, and it learns all the parameters of the model, including the embeddings. So the embeddings are not provided separately. There is no fixed embedding for a language that you build once and use forever.

Could you please explain the concept of “Embedding” in more detail?

“Embedding” refers to the representation of a word in a high-dimensional continuous vector space. The hope is that words that are similar get similar representations (i.e., the difference between their vectors is small). There are various ways such embeddings can be learned. In the talk “Introduction to Neural Machine Translation”, we discussed learning them both with word2vec and within the context of a neural machine translation system. See here for recordings of these webinars.
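To make "similar words get similar vectors" concrete, here is a small sketch using cosine similarity, one common way to compare embedding vectors. The three 4-dimensional vectors are invented by hand for illustration; real embeddings are learned during training and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy vectors; not taken from any trained model.
embeddings = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.0, 0.9, 0.8],
}

sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])

print(f"cat/dog: {sim_cat_dog:.2f}, cat/car: {sim_cat_car:.2f}")
```

With these toy values, "cat" and "dog" point in nearly the same direction while "cat" and "car" do not.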

Do NMT systems have problems with long sentences, as SMT systems do?

Yes, that still appears to be the case. The attention mechanism in neural models is currently not reliable enough. We have found that when sentences exceed 50 words the reliability of the output is decreased. We have developed ways to work around this within the Omniscien workflows.

How much effort is being paid to translations into highly inflected languages?

There has been little effort to tailor neural machine translation specifically for morphologically rich languages, but the general model has been successfully applied to languages such as Russian and Czech. It must be said, however, that NMT overall performs notably better than SMT on heavily inflected languages without any special tailoring.

Could you recommend a brief bibliography on Neural Machine Translation (NMT)?

A fairly comprehensive bibliography for neural machine translation can be found here.

In December ’16 Google GNMT also proposed multilingual NMT, which trains several language pairs at the same time and uses the same encoder-decoder. What do you think about the issues in multilingual NMT?

We have already used this technique for a number of customers with very good results. We also use a related technique to support multiple writing styles, domains, and genres within a single language pair.
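In the multilingual approach published by Google, an artificial token at the start of each source sentence tells the shared encoder-decoder which target language to produce. The sketch below shows only that data-preparation step. The function name and the sample sentences are invented for illustration, and it should not be read as a description of how the Omniscien engines are built.

```python
def add_target_tag(source_sentence, target_lang):
    """Prepend an artificial token naming the desired target language."""
    return f"<2{target_lang}> {source_sentence}"


# Invented sample pairs: (source sentence, target language code, translation).
training_pairs = [
    ("Hello, how are you?", "es", "Hola, ¿cómo estás?"),
    ("Hello, how are you?", "de", "Hallo, wie geht es dir?"),
]

# Every pair is fed to the same model; only the tag differs.
tagged = [(add_target_tag(src, lang), tgt) for src, lang, tgt in training_pairs]

for source, target in tagged:
    print(source, "->", target)
```

The same tagging pattern could in principle mark a writing style or domain instead of a language.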

Can NMT be applied when the source and target languages are the same? For example, English to English. And will it perform better than translating between two different languages?

This depends on what the task is. Models similar to the neural machine translation model have been applied to chat bots, where “translation” is between question and answer. Neural machine translation systems have also been successfully applied to tasks such as text simplification, grammar checking, and automatic post-editing.

Are there specific languages that NMT is better suited for than others?

NMT is performing better than SMT alone in all language pairs. However, we have seen relatively better results with morphologically rich output languages and distant language pairs.

Are there cases where Rule-Based Machine Translation (RBMT) or SMT is better than NMT?

While the NMT technology is evolving quickly, at this moment SMT still offers more control. NMT uses a very different technique to translate and is more difficult to guide for structure, writing style, and glossary vocabulary.

The Omniscien team has worked around this in part with the advanced customization techniques used to synthesize and create data for training. SMT has a language model to guide many features, and this is currently not part of the NMT approach. SMT systems also tend to translate better on very short (1-2 words) and very long (50+ words) text.

Finally, SMT systems still offer more control through their rules, for example for the markup and handling of complex content such as patents or eCommerce product catalogs, and also for conversions such as “on-the-fly” conversion of measurements.

Based on all the above, Omniscien has developed a Hybrid MT model that combines the strengths of both technologies seamlessly to deliver a higher translation quality.

Do you think linguists can be involved and work together with the developers in training the machines in order to achieve better results? Is this something that’s happening already perhaps?

There is a benefit to linguistic knowledge in preprocessing and preparing language data. We have seen benefits from linguistically motivated preprocessing rules for morphology or reordering in SMT systems. It is not yet clear whether these or similar inputs would also benefit NMT systems. Since NMT systems are data-driven and typically use unannotated text, there is no obvious place where linguistic insight would be helpful. However, as with SMT, post-edited feedback can be used to improve translation quality in NMT.

What will be required to integrate NMT with translation software, CAT and TMS systems such as SDL Studio, XTM, MemSource and memoQ?

Omniscien has a well-established API that is fully integrated with a wide range of CAT and TMS systems and is also in use by many enterprises directly.

An engine in Language Studio can be configured using a unique Domain Code (ID number) for an NMT-specific, MT-specific, or hybrid NMT/SMT workflow, and all the appropriate pre- and post-processing then takes place automatically. From an external integration perspective, the technology used to perform the translation is transparent.

Do you see room for curated language/knowledge bases (such as FrameNet, for example) in the future of Neural MT? If so, which path do you think is a promising one?

This is a bit difficult. There has been some success in enriching the input to NMT systems with added linguistic annotation, so FrameNet annotations may be useful as well. But this is a very superficial integration of these knowledge sources. Deeper integration has not been studied yet.

What are some of the common search (decode) algorithms used to find the translation in neural machine translation? (i.e. Does it still employ Beam Search like Phrase-based SMT or is this substituted by the output word prediction model)

Search is greedy and predicts one word at a time. There are beam search algorithms that give some improvements with small beam sizes. These consider the second-best word translation, third-best translation, etc. and proceed from there.
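The difference between greedy search and beam search can be seen in a toy example. In the sketch below, the "model" is a hand-written table of next-word probabilities, invented purely for illustration, and a beam size of 1 is equivalent to greedy search. With these numbers, greedy search commits to the most likely first word and misses the more probable overall output that a beam of size 2 finds.

```python
import math

# Invented toy model: probability of the next word given the words so far.
# A prefix with no entry is treated as a finished output.
TOY_MODEL = {
    (): {"the": 0.6, "a": 0.4},
    ("the",): {"cat": 0.55, "dog": 0.45},
    ("a",): {"cat": 0.9, "dog": 0.1},
}


def beam_search(model, beam_size, max_len=10):
    """Return the best (words, log_probability) found with the given beam size."""
    beams = [((), 0.0)]
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            next_words = model.get(words)
            if not next_words:
                # No continuation: keep the finished hypothesis as it is.
                candidates.append((words, score))
                continue
            for word, prob in next_words.items():
                candidates.append((words + (word,), score + math.log(prob)))
        # Keep only the best `beam_size` hypotheses for the next step.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams[0]


greedy_words, greedy_score = beam_search(TOY_MODEL, beam_size=1)
beam_words, beam_score = beam_search(TOY_MODEL, beam_size=2)

print("greedy:", greedy_words, round(math.exp(greedy_score), 2))
print("beam 2:", beam_words, round(math.exp(beam_score), 2))
```

Here greedy search returns "the cat" (probability 0.33), while the beam of size 2 keeps "a" alive as the second-best first word and ends with "a cat" (probability 0.36).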

I am wondering who will be absorbing the cost of implementing NMT? Will it be the end client, the LSPs or translators?

To put this question into perspective, an SMT engine takes about 1 day to train on a multiple-CPU system with about 256GB RAM. Training an NMT system of this kind with CPU technology is not viable, as it would take many months. Training using GPU technology still takes about 3 days or more for a large engine. As such, the costs of training NMT engines are notably higher.

Similarly, at translation runtime (decoding), GPU technology is faster than CPU, but costlier. Using a CPU to decode NMT delivers about 600 words per minute, while the same CPU for SMT would deliver 3,000 words per minute, 5 times faster. Using a GPU requires new hardware investments, which at this time are quite costly, but depending on the GPU technology used, NMT on a GPU can be several times faster than SMT. Our current speeds are 9,000 words per minute with GPU. With the upcoming release of Language Studio 6.0 our speeds will increase to 40,000-45,000 words per minute.
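To put these throughput figures side by side, the short calculation below converts each words-per-minute rate quoted above into the time needed for a single job. The job size of 1,000,000 words is an assumption chosen only for illustration, and the lower bound of the 40,000-45,000 range is used for Language Studio 6.0.

```python
# Throughput figures quoted in the answer above (words per minute).
WORDS_PER_MINUTE = {
    "NMT on CPU": 600,
    "SMT on CPU": 3_000,
    "NMT on GPU (current)": 9_000,
    "NMT on GPU (Language Studio 6.0, lower bound)": 40_000,
}

JOB_SIZE_WORDS = 1_000_000  # assumed job size, for illustration only


def hours_needed(words, words_per_minute):
    """Hours required to translate `words` at a constant rate."""
    return words / words_per_minute / 60


hours = {
    name: hours_needed(JOB_SIZE_WORDS, wpm)
    for name, wpm in WORDS_PER_MINUTE.items()
}

for name, value in hours.items():
    print(f"{name}: {value:.1f} hours")
```

At these rates the same job takes a little over a day with NMT on a CPU, several hours with SMT on a CPU, and under two hours with NMT on a GPU.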

On the one hand, the cost will likely come down over the coming years as GPU capacity increases and the systems become more efficient. On the other hand, cost/benefit will depend largely on the use cases and the industry. For now, we have kept our prices at the same level for several years while passing this benefit on to our customers.

Is human/BLEU scoring the preferred/only way of performing LQA for NMT output? Are models like DQF and MQM used as part of the research or is the NMT output quality above what these systems can show?

There are many different frameworks for measurement, but BLEU is still the dominant one. Often this is cross-checked by humans. DQF and MQM require a lot more time and investment. In the LSP space, productivity measurement has become one of the preferred core metrics, as it goes straight to the business’s bottom line. In this model, what matters is not so much the number of errors made as how quickly the errors can be fixed and the sentence edited to production quality.
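For readers unfamiliar with BLEU, its core ingredient is clipped n-gram precision: the share of n-grams in the MT output that also appear in a reference translation, where each n-gram is credited at most as often as it occurs in the reference. The sketch below computes only that ingredient for a single sentence. Full BLEU combines several n-gram orders with a brevity penalty over a whole test set, which is omitted here, and the function names and sentences are chosen for illustration.

```python
from collections import Counter


def ngrams(words, n):
    """All consecutive n-word sequences in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Share of candidate n-grams found in the reference, with each n-gram
    counted at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    matched = sum(
        min(count, ref_counts[gram]) for gram, count in cand_counts.items()
    )
    total = sum(cand_counts.values())
    return matched / total if total else 0.0


reference = "the cat is on the mat"
good = clipped_precision("the cat is on a mat", reference)
degenerate = clipped_precision("the the the the the the the", reference)

print(f"good: {good:.2f}, degenerate: {degenerate:.2f}")
```

The clipping is what stops the degenerate output from scoring well: "the" occurs only twice in the reference, so only two of its seven words are credited.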

In the German-English context, how would an NMT system handle a sentence such as “Ich stehe um 8 Uhr auf.”, where the two parts of a “trennbares Verb” should be identified as belonging together, although distantly separated within the sentence?

The attention mechanism in NMT systems may align output words to several input words, and these do not have to be next to each other. Given what we have seen so far, NMT systems handle verbs with separable prefixes comparatively well, in both the input and the output.
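As a rough picture of how one output word can attend to two input words that are far apart, the sketch below turns a set of relevance scores into attention weights with a softmax. The scores are invented by hand for illustration; in a trained system they would be computed by the model from its internal states.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


source_words = ["Ich", "stehe", "um", "8", "Uhr", "auf"]

# Invented relevance scores for one English output word ("get").
scores_for_get = [0.1, 3.0, 0.2, 0.1, 0.3, 2.8]

weights = softmax(scores_for_get)
ranked = sorted(zip(source_words, weights), key=lambda p: p[1], reverse=True)
top_two = [word for word, _ in ranked[:2]]

for word, weight in zip(source_words, weights):
    print(f"{word:>5}: {weight:.2f}")
```

With these toy scores, most of the attention weight falls on “stehe” and “auf”, even though four other words sit between them.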

What is the added value of Neural MT, compared to SMT and what are their drawbacks?

The main benefits are generalization (the ability to learn about the translation of a word from occurrences of similar words) and use of larger context (using the entire input sentence to determine a word's meaning and hence its correct translation), which results in better readability. The main drawbacks are higher computational cost, occasional unpredictable translations, less transparency, and no control beyond adding bilingual training data. For the moment, NMT systems are also less customizable, but this will be remedied in the near future as many researchers, including Omniscien researchers, are working to address such issues.

Have the word alignment algorithms been successfully transported to the publicly available cloud engines? What problems exist in creating a proper many-to-many word alignment? (i.e. 0, 1, 2 or more words in the source language can be mapped to 0, 1, 2 or more words in the target language).

NMT does not use word alignment in the same manner that SMT does. NMT translates entire sentences, while SMT translates phrases. As such, the public NMT systems that we have seen do not have word alignment available and only show the sentence alignment.

In Language Studio, our upcoming NMT offerings will include word alignment information in the same manner as our SMT offerings, where it is available as part of the programmatically accessible workflow and log files. As with our current core SMT platform, which is a hybrid rules, syntax and SMT engine, our NMT offering will also be hybrid to ensure that features such as word alignment are backwards compatible.

How do you think the reliability/accuracy of NMT will evolve? As with high-frequency trading, might we get to a situation in which humans cannot check the reliability/accuracy of NMT output?

This is a multi-dimensional question. Firstly, as discussed in the webinar, the core NMT systems generally produce very good results in terms of understandability and fluency, but they still lack the level of control available in Statistical Machine Translation (SMT) systems, and their errors are more unpredictable and more difficult to trace. At present SMT is easier to manage and provides support for features such as glossaries, non-translatable terms, rules and format handling that are needed for higher levels of customization and for more complex applications.

In the near term, NMT does indeed deliver higher quality in general fluency, readability and accuracy, and this is important. However, a major feature that is available in Language Studio Professional SMT custom engines is the ability to manage and control terminology and writing style. This means NMT output may be more understandable, but may not match your style, resulting in greater levels of editing prior to publishing. To put this in context, if you compared the writing style of an automotive engineering manual to marketing for an automotive company, they are very different. If your marketing content reads like an engineering manual, you would have to rewrite large portions of it. At this time, both approaches to MT have benefits and disadvantages. We are working to resolve this and the other outstanding limitations of NMT for our own upcoming NMT product offerings. Hence at this time, depending on the application, the technology choice might vary.

As to applications where “raw” MT is used (as opposed to post-edited MT), again the application requirements will drive the choice of technology. Where terminology normalization is required, for example for analytics applications, SMT is far better suited since it allows full control over terminology, even at runtime. If fluency is required in the output, some NMT engines will do better than SMT, so again a detailed understanding of the requirements is needed to assist with the choice of technology.