
What is the difference between “Clean Data” and “Dirty Data” MT?

There are two common approaches to MT corpora gathering:

Dirty Data MT Model: Good data will be more statistically relevant and bad data will be statistically irrelevant.

Clean Data MT Model: Bad or undesirable patterns cannot be learned if they don’t exist in the data.


Dirty Data MT Model, data gathering:

  • Gathered from as many sources as possible.

  • Domain of knowledge does not matter.

  • Data quality is not important.

  • Data quantity is important.


Clean Data MT Model, data gathering:

  • Gathered from a small number of trusted quality sources.

  • Domain of knowledge must match target.

  • Data quality is VERY important.

  • Data quantity is less important.


Dirty Data MT Model, advantages:

  • Broad general coverage of vocabulary in many domains

  • Easy to find corpora

  • Useful for gist-level translations

  • No real skill required, simply upload data.


Clean Data MT Model, advantages:

  • Higher quality translation

  • Control over translations, vocabulary and writing style.

  • Deep coverage in a specific domain.

  • More suited to professional translation community

  • Issues can be understood and fixed with small amounts of focused corrective data.

  • Consistent terminology.


Dirty Data MT Model, disadvantages:

  • Little or no control over translations, vocabulary and writing style.

  • Inconsistent terminology.

  • Shallow depth in any one domain, frequently resulting in mixed context for ambiguous terms.

  • Domain bias towards where most data was available rather than the desired domain.

  • Less useful for professional translation due to editing effort required

  • Very hard to fix any issues. Primary resolution path is to add more dirty data and hope that it fixes the issue.


Clean Data MT Model, disadvantages:

  • More unknown words in early versions of the engines (but these can be quickly fixed).

  • More difficult to find corpora.

  • Low quality translations on out-of-domain content.

  • High level of skill and effort required in understanding and preparing data.
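One of the contrasts above, consistent versus inconsistent terminology, can be checked mechanically on a candidate corpus. The following is a minimal sketch, not a description of how Language Studio works: the function names, the sample English-to-Spanish segments and the glossary are all hypothetical, and simple substring matching stands in for real term extraction.

```python
from collections import Counter, defaultdict


def translation_counts(pairs, glossary):
    """Count which candidate target term appears for each source term.

    pairs: iterable of (source_sentence, target_sentence) strings.
    glossary: dict mapping a source term to a list of candidate target terms.
    Matching is naive, case-insensitive substring search (an assumption
    made to keep the sketch short).
    """
    counts = defaultdict(Counter)
    for source, target in pairs:
        src, tgt = source.lower(), target.lower()
        for term, candidates in glossary.items():
            if term.lower() not in src:
                continue
            for candidate in candidates:
                if candidate.lower() in tgt:
                    counts[term][candidate] += 1
    return counts


def inconsistent_terms(counts):
    """Return the source terms that are rendered with more than one target term."""
    return {term: dict(c) for term, c in counts.items() if len(c) > 1}


# Hypothetical sample corpus: "screen" is translated two different ways.
pairs = [
    ("Tap the screen to continue.", "Toque la pantalla para continuar."),
    ("The screen is locked.", "La pantalla está bloqueada."),
    ("Clean the screen gently.", "Limpie el monitor con cuidado."),
    ("Press the button.", "Pulse el botón."),
]
glossary = {"screen": ["pantalla", "monitor"], "button": ["botón"]}

counts = translation_counts(pairs, glossary)
print(inconsistent_terms(counts))
```

On this sample the report flags only "screen", which appears twice as "pantalla" and once as "monitor". Deciding which rendering is the preferred one for the target audience is a linguistic judgment that the script does not make.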

Omniscien Technologies (formerly Asia Online) pioneered the Clean Data MT Model and has built thousands of custom translation engines with this approach. Even some of the largest companies in the world, which have collected large volumes of corpora over many years, still do not have enough corpora to deliver high quality translations, especially when a variety of writing styles is required (e.g. marketing and technical manuals). Language Studio Advanced Data Generation and Manufacturing Technologies (ADM) were developed to address this data shortfall. A customized quality development plan is executed for every custom translation engine. While a custom translation engine will always be of higher quality when your own data is provided, ADM has been proven to work well, and an engine can be customized even when little or no corpora are available.

Clarifying the Clean Data MT Model

To better understand the Clean Data MT Model, it is helpful to explain what Clean Data MT is not. Some MT technology vendors offer rapid customization: “simply upload your corpora and that is it!!” We refer to this model as “upload and pray”. Any customization process that lacks deep hands-on linguistic knowledge and a deep understanding of the data, the target audience and the preferred writing style cannot be considered a Clean Data MT Model. High quality translations can never be achieved by simply uploading your existing data. The Clean Data MT Model cannot be achieved without a deep understanding of the data and the goals of the translation engine that only a human specialist can provide.

Many have experimented with these vendors or with open source tools such as Moses and found that the quality of their translations is frequently lower than Google's. Google follows a Dirty Data MT Model, but has vast volumes of corpora that are probably unmatched by any other organization. A Dirty Data MT Model is the correct model for Google, as Google's goal is to translate anything, for anyone, anytime. But the significant limitations of the Dirty Data MT Model mean that the output translation quality is generally not suitable for professional translation, due to the lack of control and the inability to address issues.

Other MT vendors frequently claim that they offer Clean Data MT, but in reality they are only removing formatting tags and bad data (“cleaning”). Clean Data MT requires linguistic skills that only human specialists possess, in order to develop and execute a data strategy that is specific to your custom engine, target audience, preferred writing style and preferred terminology. This is the approach taken with every full customization when creating your own custom engine with Language Studio.
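To make the distinction concrete, here is a minimal sketch of the purely mechanical kind of “cleaning” described above: stripping formatting tags and dropping obviously bad segment pairs. It is an illustration under stated assumptions, not any vendor's actual pipeline. The function names, the length-ratio threshold of 3.0 and the sample segments are all hypothetical.

```python
import re

# Matches inline markup such as <b> or </span>.
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove inline markup and collapse runs of whitespace."""
    return " ".join(TAG_RE.sub(" ", text).split())


def basic_clean(pairs, max_length_ratio=3.0):
    """Mechanical cleaning only.

    Strips tags, then drops segment pairs that are empty, untranslated
    (source equals target), badly length-mismatched, or exact duplicates.
    The ratio threshold is an arbitrary illustrative value.
    """
    seen = set()
    cleaned = []
    for source, target in pairs:
        src, tgt = strip_tags(source), strip_tags(target)
        if not src or not tgt:
            continue  # one side is empty
        if src == tgt:
            continue  # likely left untranslated
        ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
        if ratio > max_length_ratio:
            continue  # likely misaligned
        if (src, tgt) in seen:
            continue  # exact duplicate after tag stripping
        seen.add((src, tgt))
        cleaned.append((src, tgt))
    return cleaned


# Hypothetical English-to-Spanish sample segments.
raw = [
    ("<b>Save</b> the file.", "<b>Guarde</b> el archivo."),
    ("Save the file.", "Guarde el archivo."),
    ("Cancel", ""),
    ("OK", "OK"),
    ("Yes", "Este segmento está claramente mal alineado."),
    ("Save the file.", "Salve el fichero."),
]

for pair in basic_clean(raw):
    print(pair)
```

Two pairs survive on this sample, and they translate the same source sentence with different vocabulary. A filter of this kind removes debris, but it has no notion of preferred terminology, target audience or writing style, which is the gap the paragraph above attributes to human specialists.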

Clean Data MT, and MT in general, requires very large bilingual corpora. One of the major advantages of the Language Studio platform is its extensive collection of bilingual corpora, known as Foundation Data, which covers more than 550 language pairs; a set of 12 Foundation Domains is progressively being added to each language pair. When Foundation Data is combined with ADM, the data requirements are greatly reduced and it is even possible to create custom engines when no bilingual corpora are available.

Language Studio has the Clean Data MT Model as its data approach and MT technologies at its core, but it is built around a Hybrid Machine Translation Model to take advantage of the best features of the DNMT, NMT, SMT, RBMT and Syntax approaches.