Clean Data Machine Translation
Omniscien Technologies pioneered the Clean Data Machine Translation Model in 2006. At the time, both industry and academia were claiming that more data was better than lower volumes of cleaner data. Omniscien has built thousands of custom translation engines with this approach. In recent years, with the advent of Neural Machine Translation (NMT), it has become clear to all that clean data is essential.
Even some of the largest companies in the world, with large volumes of corpora collected over many years, do not have enough data to deliver high-quality translations, especially when a variety of writing styles is required (e.g. marketing and technical manuals). Omniscien NMT and SMT engines are trained with tens or even hundreds of millions of sentences of Industry Domain and Language Foundation data. To have a real impact on translation quality and style that is matched to an organization's needs, additional data is needed.
Language Studio provides powerful data creation, data synthesis, and data cleaning technologies that were developed to address this data shortfall and ensure that only clean data is used to create custom machine translation engines. A customized quality development plan is executed for every professional custom translation engine. While a custom translation engine will always be of higher quality when you provide your own data, Language Studio data creation tools have been proven to work well even when few or no corpora are available. We have specific template plans to match all use cases. Even when you have no data at all, we can synthesize millions of sentences of high-quality, clean bilingual data that is in-domain, in-context, and matches your writing style.
Understanding the Clean Data MT Model
To better understand the Clean Data MT Model, it is helpful to explain what Clean Data MT is not. Some MT technology vendors offer rapid customization (“simply upload your corpora and that is it!”). We refer to this model as “upload and pray”. Any customization process that lacks deep, hands-on linguistic knowledge and a deep understanding of the data, the target audience, and the preferred writing style cannot be considered a Clean Data MT Model. High-quality translations can never be achieved by simply uploading your existing data. The Clean Data MT Model cannot be achieved without a deep understanding of the data and the goals of the translation engine that only a human specialist can provide.
Many have experimented with these vendors or with open-source tools such as Moses and found that the quality of their translations is frequently lower than Google's. Google follows a Dirty Data MT Model, but has vast volumes of corpora that are probably unmatched by any other organization. A Dirty Data MT Model is the correct model for Google, as Google's goal is to translate anything, for anyone, anytime. But the significant limitations of the Dirty Data MT Model mean that the output translation quality is generally not suitable for professional translation, due to the lack of control and the inability to address issues.
Other machine translation vendors frequently claim that they offer Clean Data MT, but in reality, they are only removing formatting tags and bad data (“cleaning”). Clean Data MT requires linguistic skills that only human specialists possess to develop and execute a data strategy that is specific to your custom engine, target audience, preferred writing style and preferred terminology. This is the approach taken with every full customization when creating your own custom engine with Language Studio.
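To make the distinction concrete, the narrow kind of "cleaning" described above (removing formatting tags and bad data) can be sketched in a few lines. This is a minimal, illustrative Python sketch and not Language Studio's implementation: the function names `strip_tags` and `basic_clean` and the `max_length_ratio` threshold are assumptions chosen for illustration.

```python
import re

# Matches simple inline formatting tags such as <b>, </i> or <ph id="1"/>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Remove formatting tags and collapse the whitespace left behind."""
    return " ".join(TAG_RE.sub(" ", text).split())


def basic_clean(pairs, max_length_ratio=3.0):
    """Yield (source, target) pairs that survive simple mechanical filters.

    max_length_ratio is a hypothetical threshold, and word counts are taken
    by splitting on whitespace, so this sketch assumes space-delimited
    languages.
    """
    seen = set()
    for source, target in pairs:
        source, target = strip_tags(source), strip_tags(target)
        if not source or not target:
            continue  # drop pairs with an empty side
        src_len, tgt_len = len(source.split()), len(target.split())
        if max(src_len, tgt_len) / min(src_len, tgt_len) > max_length_ratio:
            continue  # drop pairs whose lengths differ implausibly
        if (source, target) in seen:
            continue  # drop exact duplicates
        seen.add((source, target))
        yield source, target
```

Filters like these catch only mechanical noise. They say nothing about whether the surviving sentences match the target audience, preferred writing style, or preferred terminology, which is why this step alone does not amount to Clean Data MT.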
Clean Data MT, and MT in general, requires very large bilingual corpora. One of the major advantages of the Language Studio platform is its extensive collection of bilingual corpora, known as Foundation Data, in more than 600 language pairs. Omniscien is progressively adding a set of 14 Industry Domains to each language pair. When combined with Data Synthesis and other data creation techniques, the data requirements are greatly reduced, and it is even possible to create custom engines when no bilingual corpora are available.
Language Studio uses the Clean Data MT Model as its data approach and has MT technologies at its core, but it is built around a Hybrid Machine Translation Model that takes advantage of the best features of the DNMT, NMT, SMT, RBMT and Syntax approaches.
Common Approaches to Machine Translation Corpora Gathering
There are two common approaches to machine translation corpora gathering:
Dirty Data Machine Translation Model
Clean Data Machine Translation Model