Primer: Different Types Of Machine Translation
If you work in translation or localization in any form, you have almost certainly heard of machine translation, and perhaps have had to work with it. While adoption can be challenging for some, machine translation offers substantial benefits both to the business and to the people doing the work. The most obvious are time and cost savings. For translators and editors, it can remove much of the more tedious work, leaving the more complex tasks to human skills, knowledge, and expertise that a machine alone cannot yet match.
While many have heard of, and perhaps used, machine translation, few are likely to understand the types of machine translation available and which are best suited to a given task. Many translators, project managers, product managers, and marketers find it challenging to evaluate the different types and to choose the one that best suits their needs.
Before we explore these benefits, it is important to point out that machine translation should never be considered a replacement for human translators. Machine translation serves two primary purposes:
- A productivity-enhancing technology that augments human intelligence. For example, it can provide the first draft in the traditional Translation + Edit + Proof (TEP) process, so that a professional editor can review and make corrections rather than translate everything from scratch.
- A content creation technology that can produce large volumes of content that would not be practical from a time and cost perspective with a human translator. For example, LexisNexis is using Language Studio to translate every English language patent produced by the European Union and the United States Patent Offices into multiple other languages. Each language pair contains more than 500 billion words.
Broadly speaking, machine translation fits into two categories:
- Generic Machine Translation – machine translation that is designed to translate anything, anywhere, anytime. Google is one example of this. Translations are not domain-specific and provide no control over vocabulary, writing style, or domain context. Generic machine translation is useful for getting the gist or rough meaning of text in another language, but less useful for publishing content. Generic machine translation also raises a wide range of privacy issues, including giving up rights to your intellectual property to a translation provider such as Google.
- Custom Machine Translation – machine translation that has been trained to more closely meet your needs. Some machine translation providers will provide you with domain-specific data to help customize your own machine translation engine. Omniscien goes one step further with its advanced data gathering and processing tools. Tools such as Data Synthesis and Bilingual Data Mining provide unique advantages by providing millions of additional high-quality bilingual sentences. Irrespective of translation technology, the highest quality translations are always produced by customizing machine translation for a purpose and domain.
The history of machine translation goes back to the 1950s. However, machine translation only really became practical about 15 years ago. Several different approaches were explored along the way. Today only three approaches remain in mainstream use; some older technologies are still around but are being superseded by more modern approaches.
The five most common types of machine translation
Rule-Based Machine Translation (RBMT)
Rule-Based Machine Translation (RBMT) systems were the first commercial machine translation systems. They are based on linguistic and grammatical rules that allow words to be placed in different positions and to take on different meanings depending on the context.
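To make the rule-based idea concrete, here is a minimal sketch, assuming a tiny hand-written English-to-Spanish lexicon and a single reordering rule. The lexicon, the word lists, and the function name translate_rbmt are hypothetical and exist only for illustration; commercial RBMT systems use far larger dictionaries and full grammatical analysis.

```python
# Toy rule-based translation, English -> Spanish.
# LEXICON, ADJECTIVES and NOUNS are hypothetical, hand-written resources.
LEXICON = {"the": "el", "red": "rojo", "car": "coche"}
ADJECTIVES = {"red"}
NOUNS = {"car"}


def translate_rbmt(sentence: str) -> str:
    words = sentence.lower().split()
    # Reordering rule: English "ADJ NOUN" becomes Spanish "NOUN ADJ".
    i = 0
    while i < len(words) - 1:
        if words[i] in ADJECTIVES and words[i + 1] in NOUNS:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    # Lexical transfer: look up each word; unknown words pass through unchanged.
    return " ".join(LEXICON.get(w, w) for w in words)


print(translate_rbmt("The red car"))
```

The reordering rule is what lets the adjective land after the noun in the output, which is the kind of position change the linguistic rules are responsible for.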
Statistical Machine Translation (SMT)
Statistical Machine Translation (SMT) learns how to translate by analyzing existing human translations (known as bilingual text corpora). Most modern SMT systems are phrase-based and assemble translations using overlapping phrases. In phrase-based translation, the aim is to reduce the restrictions of word-based translation by translating whole sequences of words, where the lengths may differ. Matching bilingual phrase patterns and their optimal sequence are selected by comparing them to patterns of monolingual data in the target language.
Omniscien’s Professor Philipp Koehn wrote the leading academic textbook on SMT. It can be purchased on Amazon.com.
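The phrase-based idea can be sketched as follows, assuming a tiny hypothetical German-to-English phrase table with made-up probabilities. The sketch picks the segmentation whose phrase translations have the highest combined probability; it deliberately leaves out the target-language (monolingual) model and reordering that a real SMT decoder would also use.

```python
import math

# Hypothetical phrase table: source phrase -> [(target phrase, probability)].
# In a real system these entries are learned from bilingual corpora.
PHRASE_TABLE = {
    ("guten",): [("good", 0.8)],
    ("morgen",): [("tomorrow", 0.6), ("morning", 0.4)],
    ("guten", "morgen"): [("good morning", 0.9)],
}


def translate_smt(sentence: str, max_len: int = 3) -> str:
    words = tuple(sentence.lower().split())
    n = len(words)
    # best[i] holds (log-probability, translation) for the first i source words.
    best = [(-math.inf, "")] * (n + 1)
    best[0] = (0.0, "")
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            prev_score, prev_text = best[start]
            if prev_score == -math.inf:
                continue  # no translation covers the words before `start`
            for target, prob in PHRASE_TABLE.get(words[start:end], []):
                score = prev_score + math.log(prob)
                if score > best[end][0]:
                    best[end] = (score, (prev_text + " " + target).strip())
    # Returns an empty string if the phrase table cannot cover the sentence.
    return best[n][1]


print(translate_smt("guten morgen"))
```

Translated word by word, "morgen" would come out as "tomorrow", but the two-word phrase entry wins because the whole sequence is scored together. That is the advantage of translating phrases rather than single words.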
Syntax-Based Machine Translation (SBMT)
The idea of Syntax-Based Machine Translation (SBMT) is quite old. It translates syntactic units, i.e. (partial) parse trees of sentences or utterances, rather than single words or strings of words (as in phrase-based SMT). The goal of Syntax-Based Machine Translation techniques is to incorporate an explicit representation of syntax into statistical machine translation systems.
Neural Machine Translation (NMT)
Neural Machine Translation (also known as Neural MT, NMT, Deep Neural Machine Translation, Deep NMT, or DNMT) is a state-of-the-art machine translation approach that uses neural network techniques to predict the likelihood of a sequence of words. This can be a text fragment, a complete sentence, or, with the latest advances, an entire document.
Omniscien’s Professor Philipp Koehn wrote the leading academic textbook on NMT. It can be purchased on Amazon.com.
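The "likelihood of a sequence of words" is commonly written using the chain rule of probability. The notation is an assumption introduced here for illustration: x is the source text, y_1 ... y_T are the target words, and each factor is the network's prediction for the next word given the source and the words produced so far.

```latex
P(y \mid x) = \prod_{t=1}^{T} P\left(y_t \mid y_1, \dots, y_{t-1}, x\right)
```

The translation the system outputs is a sequence y that scores highly under this product.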
Hybrid Machine Translation (HMT)
Hybrid machine translation is a method of machine translation that is characterized by the use of multiple machine translation approaches within a single machine translation system to deliver a higher quality translation. There are several approaches to Hybrid MT such as multi-engine, statistical rule generation, multi-pass, and confidence-based.
Omniscien uses confidence-based hybrid machine translation because it produces notably higher quality translations than any single machine translation technology in isolation. Confidence-based machine translation uses an algorithm to determine the quality of a candidate translation and then switches between different technologies based on user-configured criteria. The latest iteration of confidence-based hybrid machine translation, incorporated into Language Studio 6.0 onwards, allows for a large number of translation candidate sources arranged in layers that waterfall down.
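The layered "waterfall" can be illustrated with a minimal sketch. This is not Language Studio's actual algorithm or API: the Engine type, the waterfall_translate function, the confidence scores, and the thresholds are all hypothetical. It assumes each engine returns a translation together with a confidence score between 0 and 1, and that the user configures a minimum acceptable confidence for each layer.

```python
from typing import Callable, List, Tuple

# Hypothetical engine interface: source text -> (translation, confidence in [0, 1]).
Engine = Callable[[str], Tuple[str, float]]


def waterfall_translate(
    source: str, layers: List[Tuple[Engine, float]]
) -> Tuple[str, float]:
    """Try each layer in order; accept the first candidate that meets its threshold."""
    if not layers:
        raise ValueError("at least one translation layer is required")
    best = None
    for engine, threshold in layers:
        text, confidence = engine(source)
        if confidence >= threshold:
            return text, confidence  # good enough: stop falling through
        if best is None or confidence > best[1]:
            best = (text, confidence)
    # No layer met its threshold: fall back to the most confident candidate seen.
    return best


# Stub engines standing in for real NMT and SMT systems.
def nmt_stub(source: str) -> Tuple[str, float]:
    return ("nmt output", 0.55)


def smt_stub(source: str) -> Tuple[str, float]:
    return ("smt output", 0.80)


print(waterfall_translate("some text", [(nmt_stub, 0.7), (smt_stub, 0.7)]))
```

In this example the first layer's candidate falls below its threshold, so the request falls through to the second layer, whose candidate is accepted.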
What About Data?
All modern machine translation solutions require large volumes of bilingual data (and in some cases monolingual data) in order to train a machine translation engine. Academics, Google, and many others originally thought that, as long as you had large volumes of data, the good data would rise to the top and the bad data would be statistically irrelevant. To the Omniscien team, right from the outset, this just did not make sense. There is an old adage in the computer industry: “Garbage in, garbage out”. Omniscien pioneered the Clean Data Machine Translation Model in 2006. In recent years, with the advent of Neural Machine Translation (NMT), it has become clear to all that clean data is essential.
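As an illustration of what "cleaning" bilingual data can involve, here is a minimal sketch using a few common heuristics: dropping empty pairs, untranslated pairs, pairs with implausible length ratios, and exact duplicates. These heuristics, the function name clean_parallel_corpus, and the max_ratio default are assumptions chosen for illustration, not a description of Omniscien's own data processing.

```python
from typing import Iterable, List, Tuple


def clean_parallel_corpus(
    pairs: Iterable[Tuple[str, str]], max_ratio: float = 3.0
) -> List[Tuple[str, str]]:
    """Keep only sentence pairs that pass a few simple sanity checks."""
    seen = set()
    cleaned = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        if not src or not tgt:
            continue  # one side is empty
        if src == tgt:
            continue  # likely untranslated text copied to both sides
        src_len, tgt_len = len(src.split()), len(tgt.split())
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_ratio:
            continue  # lengths too different to be a plausible translation
        if (src, tgt) in seen:
            continue  # exact duplicate
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


corpus = [
    ("Hello world", "Hallo Welt"),
    ("Hello world", "Hallo Welt"),
    ("", "Leer"),
    ("OK", "OK"),
    ("Hi", "Dies ist ein sehr langer unpassender Satz"),
]
print(clean_parallel_corpus(corpus))
```

Only the first pair survives; the other four are each removed by one of the checks.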
Which Should You Choose?
While NMT is certainly better overall (in most cases) than all the previous technologies, it has a number of weaknesses. There is no single technology that will always perform well under every circumstance thrown at it. For this reason a hybrid confidence-based approach makes a lot more sense. This approach is user-configurable and enables multiple technologies to “compete” with each other to produce the highest-possible quality machine translation output. SMT will typically produce a higher quality translation than NMT in 8-15 percent of cases.
As with all machine learning technologies, the right data will deliver better translation quality. Language Studio includes the tools and technologies to gather, process, and synthesize the data needed for training. With sufficient in-domain data, Neural MT is able to “think” more like the human brain when producing translations. Unlike with SMT, “sufficient” here can mean in excess of 50 million bilingual sentences as just the starting point. Omniscien provides much of this data, and mines and synthesizes millions of new bilingual sentences for each engine.
Our recommendation for the highest-quality machine translation is based on these four principles:
- Customized machine translation will produce an output quality that requires the least amount of human effort in order to publish.
- Clean, high-quality, in-domain data is essential to delivering high-quality machine translation.
- Hybrid confidence-based machine translation will deliver the best translation quality across a variety of business scenarios by leveraging the strengths of all the technologies combined into one final output.
- Data volume is critical, but you don’t have enough and will never have enough. Omniscien’s advanced data creation, bilingual data mining, and data synthesis technologies can produce high-quality bilingual training data to bridge the gap.
For more details and clarity, please contact us.