
What is Do-It-Yourself (DIY) Machine Translation?

Do-It-Yourself (DIY) Machine Translation, also known as “Self Service Machine Translation”, has recently emerged as a popular model for customizing a machine translation engine. In the last 2-3 years, the DIY machine translation space has been populated by many “me too” vendors that offer little beyond what can be downloaded as the open source Moses tools and that are difficult to tell apart in terms of the value they actually add.

DIY machine translation provides only a subset of the customization capabilities of a full customization process. For a more detailed comparison, see the article “Understanding the Difference between Full Customization and Do It Yourself Customization”.

The appeal of the DIY model is that it is perceived to give the user complete control over the customization process. Many DIY MT vendors have marketed heavily on the claim that they empower their users by letting them upload their translation memories into a DIY machine translation system. In reality, however, these systems strip users of the power to control the engine. Simplicity often hides reality.

Ironically, the DIY model is frequently referred to as “upload and pray”: the user has no real control over the engine's quality beyond the data being uploaded, and in most cases does not even know the contents or quality of the translation memories being uploaded. Frequently the data comes from many sources, even third parties. Many Language Service Providers (LSPs) upload translation memories that were passed on to them by their client or another LSP, or even data acquired from organizations such as TAUS. The net result is a melting pot of inconsistent translation memories that are difficult to manage, or that remain unmanaged, and that in combination produce inconsistent machine translation output. This issue is similar to what LSPs face when they use a third-party translation memory or mix translation memories to pre-translate segments as part of a translation project.
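The inconsistency problem described above can be made concrete with a small sketch. The Python below is illustrative only and is not part of Moses, Language Studio, or any DIY product: the function name `find_inconsistent_segments`, the memory names, and the sample segments are all hypothetical, and real translation memories would first need to be parsed from a format such as TMX. It merges several translation memories and reports every source segment that arrives with more than one distinct translation.

```python
from collections import defaultdict


def find_inconsistent_segments(memories):
    """Return source segments that have more than one distinct target.

    `memories` maps a (hypothetical) translation-memory name to a list of
    (source, target) pairs. The result maps each conflicting source segment
    to its competing targets and the memories each target came from.
    """
    targets_by_source = defaultdict(lambda: defaultdict(set))
    for tm_name, pairs in memories.items():
        for source, target in pairs:
            # Normalize case and whitespace on the source so trivial
            # variations are treated as the same segment.
            source_key = " ".join(source.lower().split())
            target_key = " ".join(target.split())
            targets_by_source[source_key][target_key].add(tm_name)

    return {
        source: {target: sorted(names) for target, names in targets.items()}
        for source, targets in targets_by_source.items()
        if len(targets) > 1
    }


# Hypothetical sample data: two memories that disagree on one segment.
memories = {
    "client_tm": [
        ("Press the power button.", "Appuyez sur le bouton d'alimentation."),
        ("Save the file.", "Enregistrez le fichier."),
    ],
    "third_party_tm": [
        ("Press the power button.", "Appuyez sur le bouton marche/arrêt."),
        ("Save the file.", "Enregistrez le fichier."),
    ],
}

conflicts = find_inconsistent_segments(memories)
for source, targets in conflicts.items():
    print(source)
    for target, names in targets.items():
        print("   ", target, "<-", ", ".join(names))
```

A check like this only surfaces exact conflicts between segments. It says nothing about which translation is correct or about terminology that is inconsistent across different sentences, which is where linguistic review is still needed.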

The inherent issue with DIY machine translation is that it implies that the user knows how to do it themselves. Case studies like IOLAR’s demonstrate clearly that high quality machine translation requires considerably more effort, knowledge and skill than simply loading data into a system for training. Achieving a quality level that was usable for efficient post-editing was clearly not the simple task that TAUS and third-party DIY proponents had conveyed. The article “Understanding the Difference between Full Customization and Do It Yourself Customization” has a section that lists some of the skills necessary to build a high quality machine translation engine.

Simply being able to upload data is not user empowerment. Such DIY systems provide a “pretty” user interface that makes it very easy and fast to create low quality custom machine translation engines with only minimal levels of control of the data upon which the custom engine is built. User empowerment comes from having total control of the data and being able to manipulate and refine the data to get the optimal result.

There are many Natural Language Processing (NLP) specialists and computational linguists who have worked extensively with the Moses tools and also have an extensive research background. For most of them, however, the focus has been on algorithmic improvements that offer a small overall gain. While the algorithms are certainly important, the essence of a high quality translation system that requires the least amount of effort to post-edit and delivers the highest Return On Investment (ROI) lies in the data upon which the engine is customized. Data quality and appropriateness are areas into which academia has put far less effort than into algorithms. One of the essential factors in delivering near human quality machine translation is having the correct and most suitable data, which is the focus of the Clean Data SMT model.

There are three common forms of DIY MT:

  1. Downloading the open source Moses SMT system either directly from the site or with a third party installer.

  2. Using a third-party Software-as-a-Service (SaaS) platform that has packaged Moses behind a web portal to make it easy to use. Such services typically let users upload data through the portal and run Moses underneath.

  3. Licensing a packaged version of #2 above and installing it on your own computer systems.

All of these solutions are based on the Moses open source machine translation system, a project that was created by and is maintained under the guidance of Professor Philipp Koehn. Professor Koehn is also the Chief Scientist and one of the shareholders in Omniscien Technologies.

Professor Koehn worked with many other contributors to develop the Moses decoder, including Hieu Hoang, Chris Dyer, Josh Schroeder, Marcello Federico, Richard Zens, and Wade Shen. The Moses decoder has progressively matured and is the de facto benchmark for research in the field. However, Moses was not intended to be a commercial machine translation offering. A strong and innovative commercial machine translation platform requires a considerable amount of additional functionality that is not included in Moses and that goes well beyond a web-based user interface.

When conceiving the idea of Moses, the primary goal was to foster research and advance the state of MT in academia by providing a de facto base from which to innovate from.
Currently the vast majority of interesting MT research and advancements still takes place in academia. Without open source toolkits such as Moses, all the exciting work would be done by the Googles and Microsofts of the world, as is the case in related fields such as information retrieval or automatic speech recognition.
As a platform for academic research, Moses provides a strong foundation. However, Moses was not intended to be a commercial MT offering. There are considerable amounts of additional functionality, beyond providing a web based user interface for Moses, that are not included in Moses that are essential in order to offer a strong and innovative commercial MT platform.

Professor Philipp Koehn,
Chief Scientist, Omniscien Technologies

To develop a full-featured, enterprise-class translation platform that addresses the shortcomings of the DIY machine translation approach, Professor Koehn has worked with the Omniscien Technologies team of linguists, researchers and developers for many years to build a translation and language processing solution that is truly innovative: Language Studio. The innovation goes beyond software into the entire process around the customization of a high quality translation engine. This process has been captured in the Clean Data SMT model and is not available as part of any of the DIY machine translation solutions.

Some DIY vendors are copying some of the features that Language Studio has had for many years, but still only a subset of the overall functionality is provided by any DIY solution in the market today. Omniscien Technologies continues to innovate and add new features and functionality on a regular basis. Working with Professor Philipp Koehn and other world leaders in the field of machine translation research ensures that Language Studio has the latest advances in commercial and enterprise class machine translation technologies.