Clean Data Machine Translation

Omniscien Technologies pioneered the Clean Data Machine Translation Model in 2006. At the time, both industry and academia claimed that larger volumes of data were better than smaller volumes of cleaner data. Omniscien has since built thousands of custom translation engines with this approach. In recent years, with the advent of Neural Machine Translation (NMT), it has become clear that clean data is essential.

Even some of the largest companies in the world, with large volumes of corpora collected over many years, still do not have enough corpora to deliver high-quality translations, especially when a variety of writing styles is required (e.g., marketing material and technical manuals). Omniscien NMT and SMT engines are trained with tens or even hundreds of millions of sentences of Industry Domain and Language Foundation data. To have a real impact on translation quality and to match the style to an organization's needs, additional data is needed.

Language Studio provides powerful data creation, data synthesis, and data cleaning technologies that were developed to address this data shortfall and ensure that only clean data is used to create custom machine translation engines. A customized quality development plan is executed for every professional custom translation engine. While a custom translation engine will always be of higher quality when you provide your own data, Language Studio data creation tools have been proven to work well even when few or no corpora are available. We have specific template plans to match all use cases. Even when you have no data at all, we can synthesize millions of high-quality, clean bilingual sentences that are in-domain, in-context, and matched to your writing style.

Understanding the Clean Data MT Model

To better understand the Clean Data MT Model, it is helpful to explain what Clean Data MT is not. Some MT technology vendors offer rapid customization (“simply upload your corpora and that is it!!”). We refer to this model as “upload and pray”. Any customization process that lacks deep, hands-on linguistic knowledge and a deep understanding of the data, the target audience, and the preferred writing style cannot be considered a Clean Data MT Model. High-quality translations can never be achieved by simply uploading your existing data. The Clean Data MT Model depends on an understanding of the data and of the goals of the translation engine that only a human specialist can provide.

Many have experimented with these vendors or with open source tools such as Moses and found that the quality of their translations is frequently lower than Google’s. Google follows a Dirty Data MT Model, but has vast volumes of corpora that are probably unmatched by any other organization. A Dirty Data MT Model is the correct model for Google, as Google’s goal is to translate anything, for anyone, anytime. But the significant limitations of the Dirty Data MT Model mean that the output is generally not suitable for professional translation, because there is little control over the translations and little ability to address issues.

Other machine translation vendors frequently claim that they offer Clean Data MT, but in reality, they are only removing formatting tags and bad data (“cleaning”). Clean Data MT requires linguistic skills that only human specialists possess to develop and execute a data strategy that is specific to your custom engine, target audience, preferred writing style and preferred terminology. This is the approach taken with every full customization when creating your own custom engine with Language Studio.
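To make the distinction concrete, the purely mechanical “cleaning” described above might look something like the following minimal Python sketch. The function names, the length-ratio threshold, and the sample sentences are all illustrative assumptions; none of this is taken from Language Studio or from any vendor's product.

```python
import re
from html import unescape

TAG_RE = re.compile(r"<[^>]+>")   # anything that looks like a formatting tag
WS_RE = re.compile(r"\s+")        # runs of whitespace


def strip_markup(text):
    """Remove formatting tags, decode HTML entities, collapse whitespace."""
    return WS_RE.sub(" ", unescape(TAG_RE.sub(" ", text))).strip()


def mechanical_clean(pairs, max_ratio=3.0):
    """Yield (source, target) pairs that pass basic mechanical checks.

    max_ratio is a hypothetical threshold chosen for illustration only.
    Word counts rely on whitespace, so this check is not meaningful for
    languages written without spaces between words.
    """
    seen = set()
    for src, tgt in pairs:
        src, tgt = strip_markup(src), strip_markup(tgt)
        if not src or not tgt:
            continue                      # one side is empty
        if src == tgt:
            continue                      # segment was copied, not translated
        s_len, t_len = len(src.split()), len(tgt.split())
        if max(s_len, t_len) / min(s_len, t_len) > max_ratio:
            continue                      # lengths too different to be aligned
        if (src, tgt) in seen:
            continue                      # exact duplicate
        seen.add((src, tgt))
        yield src, tgt


# Invented sample data for illustration.
pairs = [
    ("<b>Press</b> the power button.", "Appuyez sur le bouton d'alimentation."),
    ("Press the power button.", "Appuyez sur le bouton d'alimentation."),
    ("Warning", ""),
    ("OK", "OK"),
    ("Yes", "Veuillez consulter le manuel d'utilisation avant de continuer."),
]
cleaned = list(mechanical_clean(pairs))
```

Filters of this kind only remove obvious noise. They cannot judge whether a sentence pair is in-domain, uses the preferred terminology, or matches the desired writing style, which is the part of the Clean Data MT Model that relies on a human specialist.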

Clean Data MT, and MT in general, requires very large bilingual corpora. One of the major advantages of the Language Studio platform is its extensive collection of bilingual corpora, known as Foundation Data, in more than 600 language pairs. A set of 14 Industry Domains is progressively being added to each language pair. When Foundation Data is combined with Data Synthesis and other data creation techniques, the data requirements are greatly reduced, and it is even possible to create custom engines when no bilingual corpora are available.

Language Studio has the Clean Data MT Model at its core as its approach to data, and is built around a Hybrid Machine Translation Model to take advantage of the best features of the DNMT, NMT, SMT, RBMT and Syntax approaches.

Common Approaches to Machine Translation Corpora Gathering

There are two common approaches to machine translation corpora gathering:

Dirty Data Machine Translation Model

Theory:

  • Good data will be more statistically relevant and bad data will be statistically irrelevant.

Data:

  • Gathered from as many sources as possible.
  • Domain of knowledge does not matter.
  • Data quality is not important.
  • Data quantity is important.

Benefits:

  • Broad general coverage of vocabulary in many domains
  • Easy to find corpora
  • Useful for gist level translations
  • No real skill required, simply upload data.

Disadvantages:

  • Little or no control over translations, vocabulary and writing style.
  • Inconsistent terminology.
  • Shallow depth in any one domain frequently resulting in mixed context for ambiguous terms.
  • Domain bias towards where most data was available rather than the desired domain.
  • Less useful for professional translation due to editing effort required
  • Very hard to fix any issues. Primary resolution path is to add more dirty data and hope that it fixes the issue.

Clean Data Machine Translation Model

Theory:

  • Bad or undesirable patterns cannot be learned if they don’t exist in the data.

Data:

  • Gathered from a small number of trusted quality sources.
  • Domain of knowledge must match target.
  • Data quality is VERY important.
  • Data quantity is less important.

Benefits:

  • Higher quality translation
  • Control over translations, vocabulary and writing style.
  • Deep coverage in a specific domain.
  • More suited to professional translation community
  • Issues can be understood and fixed with small amounts of focused corrective data.
  • Consistent terminology.

Disadvantages:

  • More unknown words in early versions of the engines (but these can be quickly fixed).
  • More difficult to find corpora.
  • Low quality translations on out of domain content.
  • High level of skill and effort required in understanding and preparing data.