Custom Machine Translation Engines

Customize your own high-quality MT engine for writing style, vocabulary and domain.

Human Language Technology Enhanced by Artificial Intelligence

Creating a custom machine translation engine is an art. Rapid customization, in which you upload data to train an engine, is best suited to improving an existing high-quality machine translation engine. A high-quality machine translation engine comes from a deep understanding of the data and the goals and purpose of use, together with a strategic approach to creating and applying the right data to obtain optimal machine translation quality.

Anyone can create a custom MT engine. But it takes experience and expertise to create a great custom MT engine that is optimized for purpose. Much like in a kitchen, just throwing ingredients in a pot does not make a great meal. A good recipe, a skilled chef, and the right tools will give a much better result.

The key to creating a high-quality machine translation engine is a combination of the right training data and pre-/post-processing, both optimized for your specific purpose.

With over 14 years of experience creating tens of thousands of custom machine translation engines, the Omniscien team has refined a methodology and a set of powerful tools that together provide unmatched translation quality. Many of the tools and processes are unique to Omniscien and are designed to make it fast and easy to create your own custom machine translation engines.

Omniscien experts fully understand the Anatomy of a Great Custom Machine Translation Engine and work with you to quickly organize and create the data needed to build a high-quality machine translation engine that is optimized to match your specific requirements. 

Language Studio offers two types of custom machine translation engine training. The Standard Custom MT Engine option provides a do-it-yourself approach that allows you to upload your data and simply click the “Train” button. While you can use this approach to customize your own MT engine, it is better suited to updating an existing engine that has been professionally customized. We recommend that the initial customization be performed using Language Studio’s Professional Guided Custom MT Engine option to ensure the highest possible quality.

Writing Style

In addition to customizing for domain and context, one of the major benefits of a Professional Guided MT Engine is the ability to influence writing style.

In the diagram, a Spanish sentence is translated using two different machine translation engines. The first was customized for English Business and reads more like the Economist or Forbes style of writing. The second was customized for English Children’s Books and reads more like Harry Potter.

This example shows the true impact of writing style. If the writing style does not match the purpose, the content must be rewritten, even if the translation is perfectly correct. To apply this to a more real-world scenario: Automotive is a broad and complex domain with an extensive amount of terminology and domain-specific vocabulary, but Automotive marketing needs a different writing style than an Automotive Engineering Technical Manual.

Custom MT Example

Types of Custom MT Engine Training

Language Studio provides powerful tools to create, match, mine and synthesize bilingual sentences that are specific to your context, domain, and purpose. A Standard Custom MT Engine provides rapid customization with your own data and delivers a reasonable translation quality improvement using a fully automated process. The same approach can be used to update an existing engine.

Comparatively, with the additional data created by Language Studio tools, a Professional Guided Custom MT Engine takes translation quality to the next level. Omniscien Professional Guides understand the Anatomy of a Great Custom MT Engine and will guide you step-by-step through the process to deliver a custom MT engine that produces optimal translation quality.

Services

Standard Custom MT Engine

A Standard Custom MT Engine is a simple way to get a basic level of customization quickly, whether you are in a hurry or just want to add your data to Omniscien Industry Domain Data.

Select the Industry Domain Data that you want to include in your engine, upload your translation memories and glossaries, and then submit to train. Processing is fully automated, and a custom engine will be produced in as little as a few hours.
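Before uploading a translation memory, it can help to check what sentence pairs it actually contains. The following is a minimal sketch, assuming a TMX file with `tu`, `tuv`, and `seg` elements as defined by the TMX standard. It uses only the Python standard library and is not Language Studio's importer; the function name `read_tmx_pairs` and the sample data are hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) sentence pairs found in a TMX document."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for unit in root.iter("tu"):
        segments = {}
        for variant in unit.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = variant.get(XML_LANG) or variant.get("lang") or ""
            seg = variant.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segments[lang.lower()] = "".join(seg.itertext()).strip()
        source = segments.get(source_lang.lower())
        target = segments.get(target_lang.lower())
        # Keep only units that have non-empty text in both languages.
        if source and target:
            pairs.append((source, target))
    return pairs


# Hypothetical sample data for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" o-tmf="none" creationtool="example"
          creationtoolversion="1.0"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Check the oil level.</seg></tuv>
      <tuv xml:lang="es"><seg>Compruebe el nivel de aceite.</seg></tuv>
    </tu>
  </body>
</tmx>"""

print(read_tmx_pairs(SAMPLE_TMX, "en", "es"))
```

A quick count of the returned pairs gives a first impression of how much usable bilingual data a file holds before it is submitted for training.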

Professional Guided Custom MT Engine

A Professional Guided Custom MT Engine delivers the highest possible translation quality. Experts who have participated in customizing thousands of machine translation engines will engage with you to understand your goals and then will design a personalized Engine Customization Plan.

The entire process is project managed by the Omniscien team. Only minimal input from the customer is required. Customization takes a little longer than for a Standard Custom MT Engine because it is a semi-automated process that includes detailed data analysis, data synthesis, bilingual data mining, and many other features that fine-tune the data and the engine to meet your business use case.

Compare Standard and Professional Guided Custom MT Engines

Features included in both the Standard and Professional Guided Custom MT Engines

  • Industry Domain Data Selection: Select data for your engine based on industry domain data.
  • Initiate New MT Engine Creation / Initiate MT Engine Improvement: Via REST API and Web Portal.
  • Submit Training Data: Via REST API and Web Portal.
  • Translation Memories: TMX, XLIFF, Excel, Tab Pair.
  • Dictionary and Glossary: TBX, Excel, Tab Pair.
  • Submit to Train: Via REST API and Web Portal.

Features of the Standard Custom MT Engine only

  • Fully Automated Training: All processing is automated upon submission.

Features of the Professional Guided Custom MT Engine only

  • Semi-Automated Training: Processing is automated in stages based on the tasks defined in the Engine Customization Plan.
  • Tuning and Test Set Extraction: Automated extraction of high-quality bilingual data for tuning, testing and measuring machine translation quality.
  • Quality Metrics: Automated BLEU, TER, Edit Distance, and other quality metrics.
  • Human Expert Guidance: An Omniscien expert will work with you and project manage the entire MT engine customization process.
  • Detailed Data Analysis: Automated and human review of your data, of suitable industry domain data that best fits your goals, and of other data that can be used to train your custom machine translation engine.
  • Engine Customization Plan: After detailed data analysis and consultation to understand your goals, a personalized Engine Customization Plan will be developed to ensure that the highest-quality machine translation engine is created.
  • Project Management Portal: Monitor and track the progress of your custom engine as it is being developed and trained.
  • Extract and Translate Glossaries and Terminology: Automated terminology and entity extraction. Semi-automated resolution with specialized tools that suggest bilingual translations and enable rapid human review.
  • Match Bilingual Document Pairs: Analyze and automatically match document pairs and then match sentences within documents to create pairs of bilingual sentences and translation memories.
  • Match and Synthesize Bilingual Data: Bilingual Document Pair Matching, Bilingual Sentence Matching, and Data Synthesis.
  • Bilingual Data Mining: Automatically download bilingual websites, match pages that are translations of each other, and extract the bilingual sentences.
  • Advanced Data Analysis and Processing: Language Identification, Bulk Renaming, Bulk Encoding Conversion, Bulk Format Conversion, Sentence Segmentation, Tokenization, Compare Multiple Document Versions, Editing Effort Metrics, etc.
  • Metric and Unit Conversions: Convert units of measure such as inches to centimeters, miles to kilometers, Fahrenheit to Celsius, etc.
  • Custom Workflows: Pre-/Post-Processing Rules to fine-tune the engine to meet specific use cases and business requirements.
  • Multi-Domain/Style/Genre/Context and Multi-Language MT Engines: Advanced engine assembly that enables multiple languages, domains, styles, and contexts within a single engine.
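
The Metric and Unit Conversions feature listed above is a typical example of a post-processing rule. The following is a minimal sketch of how such a rule could work, assuming a simple regular-expression approach. It is not Language Studio's implementation; the names `CONVERSIONS` and `convert_units` are hypothetical, and the sketch handles single values only, not ranges such as "3-5 miles".

```python
import re

# Hypothetical rule table: source unit -> (conversion function, target unit).
CONVERSIONS = {
    "inches": (lambda v: v * 2.54, "centimeters"),
    "miles": (lambda v: v * 1.609344, "kilometers"),
    "°F": (lambda v: (v - 32) * 5 / 9, "°C"),
}

# A number (optionally negative or decimal) followed by a known unit.
PATTERN = re.compile(r"(-?\d+(?:\.\d+)?)\s*(inches|miles|°F)")


def convert_units(text, digits=1):
    """Rewrite imperial measurements in text as metric values."""

    def replace(match):
        value = float(match.group(1))
        convert, label = CONVERSIONS[match.group(2)]
        # Degree symbols attach directly to the number; words get a space.
        separator = "" if label.startswith("°") else " "
        return f"{round(convert(value), digits)}{separator}{label}"

    return PATTERN.sub(replace, text)


print(convert_units("The pipe is 10 inches long."))
print(convert_units("Water boils at 212°F."))
```

A production rule set would also need to cover singular forms, abbreviations, and locale-specific number formatting.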

Anatomy of a Great Custom MT Engine

Five years ago, a typical custom machine translation engine was based on between 2 and 4 million generic bilingual sentences and between 100,000 and 500,000 domain-specific sentences provided by the user/client. At that time, it was already a challenge for many organizations to gather that many bilingual translation memory sentences. If you did not have any data of your own, finding suitable data was a significant barrier to building your own custom machine translation engine.

With the advent of Neural Machine Translation, the volume of data required has multiplied many times over. Today’s typical custom machine translation engines are based on between 50 and 300 million bilingual sentences. About 80% of this is general language-pair foundation data, which underpins the engine with grammar, vocabulary, and core linguistic knowledge and trains the artificial intelligence algorithms to better understand relationships between words and phrases. The remaining 20% is industry domain data, such as life sciences, information technology, or automotive, that teaches the engine a specific context.

Language Studio provides powerful tools to create, match, mine and synthesize bilingual sentences that are specific to your context, domain, and purpose. A rapid customization can be performed with your own data with some reasonable translation quality improvement. Comparatively, with the additional data created by Language Studio tools, translation quality is taken to the next level.

User/client data plays a different role than in the past. Depending on the volume of data available, it can be used as a base from which to synthesize considerably larger amounts of data. It is often possible to create as much as 10 times the volume of data through various forms of data synthesis. In-domain monolingual data also provides an important base for vocabulary and terminology that can be useful in building a bilingual glossary and for data synthesis.
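
To make the proportions above concrete, the following is a small worked calculation. It assumes the 80%/20% split and the 10-times synthesis ceiling stated in the text; the function names and the 200,000-sentence client corpus are hypothetical figures chosen only for illustration.

```python
def data_mix(total_sentences, foundation_percent=80):
    """Split a corpus size into foundation and industry domain sentence counts."""
    # Integer arithmetic avoids floating-point rounding on large counts.
    foundation = total_sentences * foundation_percent // 100
    return foundation, total_sentences - foundation


def synthesis_upper_bound(client_sentences, factor=10):
    """Largest data volume reachable at the stated synthesis ceiling."""
    return client_sentences * factor


# The lower and upper ends of the typical engine size given in the text.
for total in (50_000_000, 300_000_000):
    foundation, domain = data_mix(total)
    print(f"{total:,} total: {foundation:,} foundation + {domain:,} domain")

# Hypothetical client corpus of 200,000 sentences.
print(f"up to {synthesis_upper_bound(200_000):,} sentences after synthesis")
```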

Bilingual data mining creates a larger set of domain-specific data from external sources that further expands vocabulary choice, writing style, and domain-specific context. The combination of all these tools working together, driven by guidance from the client, produces millions of high-quality, in-domain, context-relevant bilingual sentences.

With a Professional Guided Custom Machine Translation Engine, an Omniscien expert will tailor an Engine Customization Plan to your specific needs. User/client involvement is minimized, with most activities handled by the Omniscien team and automated tools.

MT Engine Data

Image: Example of data distribution and associated quality benefits from a Professional Guided Custom MT Engine.

Learn the Secrets to Creating a High-Quality Custom Machine Translation Engine.

Watch the “Secrets to Customizing a High-Quality Neural Machine Translation Engine” presentation from Omniscien’s Artificial Intelligence, Language Processing and Machine Translation Symposium 2020 conference.

Powerful Tools for Data Creation, Preparation, and Analysis

Convert legacy files into “data gold” and create new high-quality bilingual data quickly and efficiently.

In the process of building thousands of custom machine translation engines for clients, the Omniscien team found many manual tasks to be repetitive and often too large and too error-prone to be performed by humans alone. Over many years, a large number of internal tools were developed to help automate these processes and improve data quality. With the release of Language Studio V6.0, these tools are now available as part of the standard Language Studio platform.

Custom MT Engines

Extract and Create Bilingual Glossaries and Terminology

Language Studio can import and process industry-standard glossary formats such as TBX. Extending a TBX file or creating a new glossary can take a considerable amount of time. To help you create your own bilingual glossaries quickly, Language Studio provides a comprehensive set of tools that extract and resolve bilingual glossary terms. Glossaries can be used as training data for custom machine translation engines or as runtime rules when processing a translation.

Reduce the time to create a glossary for a body of text from hours to minutes. Language professionals can quickly validate and fine-tune the automated results. When creating bilingual glossaries, language professionals need only check and adjust suggested glossary phrases, increasing their productivity from an average of 20 to more than 160 terms per hour.
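
To illustrate the idea of suggesting glossary candidates for human review, the following is a minimal sketch of frequency-based term extraction. It assumes English text and a tiny hand-written stopword list, and it is a deliberately simple stand-in rather than the method used by Language Studio. The names `STOPWORDS` and `candidate_terms` and the sample text are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}


def candidate_terms(text, max_words=3, min_count=2):
    """Return (phrase, count) pairs for repeated phrases, most frequent first."""
    counts = Counter()
    # Split into sentences so phrases never span a sentence boundary.
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = re.findall(r"[a-z][a-z-]*", sentence)
        for size in range(1, max_words + 1):
            for start in range(len(words) - size + 1):
                gram = words[start:start + size]
                # Candidate terms should not start or end with a function word.
                if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                    continue
                counts[" ".join(gram)] += 1
    return [(term, n) for term, n in counts.most_common() if n >= min_count]


# Hypothetical sample text for illustration only.
SAMPLE = (
    "Check the brake fluid. Replace the brake fluid every two years. "
    "Brake fluid absorbs moisture."
)

for term, count in candidate_terms(SAMPLE):
    print(count, term)
```

The output of such a pass is only a list of suggestions; a language professional still decides which candidates are true terms and supplies or confirms the translations.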

Learn More

  • Glossary and Term Extraction
  • Bilingual Glossary Creation

Create and Match Bilingual Data

Turn your legacy language data assets into “data gold”. Many organizations and Language Service Providers (LSPs) have large numbers of files (sometimes even tens of thousands) that have been built using multiple platforms, sometimes with non-optimal management controls and filing systems.

Simply upload files to automatically identify and match their bilingual document pairs and extract high-quality bilingual sentences. Even monolingual data can be leveraged as a driver for synthetic data. These are just a few of the many ways that Language Studio creates new data assets that bolster your own custom machine translation engine.

Learn More

  • Bilingual Document Pair Matching
  • Bilingual Sentence Matching
  • Data Synthesis
  • Bilingual Data Mining

Extract, Analyze, Organize and Normalize Data Assets

Basic organization, preparation, and normalization of data files are often a challenge, especially for legacy or acquired third-party data. Automatically analyze, convert, extract, compare, measure, normalize, bulk rename, and classify.

A wide range of Natural Language Processing (NLP) tools are available for processing, cleaning, repairing, and classifying data so that it can be used in a variety of processes and tools. 

Learn More

  • Language Identification
  • Bulk Renaming
  • Bulk Encoding Conversion
  • Bulk Format Conversion and Text Extraction
  • Advanced Data Cleaning
  • Domain Classification
  • Sentence Segmentation and Joining
  • Automated Quality Metrics
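
As an illustration of one of the automated quality metrics mentioned on this page, the following is a minimal sketch of a word-level edit distance (Levenshtein distance) between a machine translation and a reference or post-edited translation. It is not BLEU or TER, and it is not Language Studio's implementation; the function name `edit_distance` is hypothetical.

```python
def edit_distance(hypothesis, reference):
    """Count the word insertions, deletions and substitutions needed
    to turn the hypothesis into the reference."""
    hyp, ref = hypothesis.split(), reference.split()
    # previous[j] holds the distance between the processed hypothesis
    # words and the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,          # delete the hypothesis word
                current[j - 1] + 1,       # insert the reference word
                previous[j - 1] + cost,   # keep or substitute
            ))
        previous = current
    return previous[-1]


print(edit_distance("the cat sat on mat", "the cat sat on the mat"))
```

Dividing the result by the number of reference words gives a rough per-sentence measure of editing effort.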