Machine Translation (MT) - Custom Machine Translation Engines
Custom machine translation engines adapt the vocabulary, context and writing style to be optimized for your specific use case.
Machine translation should be easy, secure, the default, multilingual, integrated, seamless, and available to the entire organization.
Accuracy through Innovation and Customization
Anyone can create a custom MT engine. But it takes experience and expertise to create a great custom MT engine that is optimized to produce MT output for your specific purpose. Much like in a kitchen, just throwing ingredients in a pot does not make a great meal. A good recipe, a skilled chef, and the right tools will give a much better result.
The key to creating a high-quality machine translation engine is a combination of the data that it is built on and pre-/post-processing that is optimized for your specific purpose. The video in the right-hand bar provides a good overview of what is involved.
With over 14 years of experience creating tens of thousands of custom machine translation engines, the Omniscien team has refined a methodology and set of tools that provide unmatched translation quality. Many of the tools and processes are unique to Omniscien and are designed to make it fast and easy to create your own custom machine translation engines.
Why do you need a custom machine translation engine?
As machine translation becomes more and more accurate, industry-specific machine translation engines that take into account the content of the text being translated have become common. These take you part of the way to better contextual support, but translation quality is much higher when content is processed by a professionally customized machine translation engine.
As with human translation, the highest-quality machine translation is achieved by a specialized translator that has had appropriate domain-specific training. Someone with deep knowledge of football will be able to translate that domain better than someone who specializes in legal, financial or medical content.
A professionally customized machine translation engine will not only be adapted to your industry domain but also to your specific vocabulary and writing styles.
To illustrate the point, let’s examine the automotive domain. If you consider two manufacturers, Honda and Toyota, their cars generally target the same market segment and have similar features. As such, their writing style is quite similar. However, if you then compare Rolls-Royce and Ferrari, the target market segment is very different and there is a very different writing style and choice of vocabulary in their sales and marketing content.
In addition to the differences between each company within the automotive domain, there are different use cases. The writing styles for sales and marketing are totally different to legal or technical manuals. You would not want your sales and marketing content to read like a legal or technical document.
With Language Studio, multiple domains and multiple writing styles can be designed and included in a single custom engine, ensuring the appropriate writing style and translation output quality.
The benefits are very simple. Translation quality and accuracy are improved. If you are involved in human post-editing then there is less effort and time required to publish your content. If you are simply translating to communicate or understand information then the meaning of the translation is more accurate and the message conveyed more clearly. When a machine translation engine is professionally customized, it should be difficult to differentiate between machine translated content and human translation. In many use cases, machine translation post editing may not be needed at all.
Our out-of-the-box machine translation engines are already suitable for many purposes and ready to use immediately. Our pre-adapted specialized Industry Domain machine translation engines offer even greater accuracy.
For total control and the best accuracy, consider customizing with your own data. Our rapid customization technologies can have your machine translation engines adapted to your organization's specific domain with minimal effort.
Can your MT Engine Do This?
One of the many areas that we seek to understand is writing style. Our experienced project managers work with you to understand your requirements. One of the first things to determine is the writing style or styles. Multiple styles can be incorporated directly into a single machine translation engine.
In the example to the right, two writing styles were built into an engine for different purposes. You can see clearly that the output style between “Business” and “Children’s Books” is totally different, although the meaning is the same.
Understanding Custom Machine Translation
The Upload and Pray Approach
Our competitors have a very different, more simplistic approach. They request that you upload your existing translation memories and terminology, if you have any, and then they combine that with their own translation memories to create a custom machine translation engine.
We refer to this approach as “Upload and Pray” because it is a blind approach that does not take into account any of the variables that go to make a great translation and does not consider the suitability of the data being provided for the intended purpose.
Also, the data that an organization has available for training is typically between 10,000 and 500,000 sentences, greatly limiting the scope of coverage and the benefits of MT engine customization.
This is the approach that nearly all of our competitors offer for customized machine translation, including Google Translate with Google AutoML and Microsoft Translator.
When it comes to customizing neural machine translation engines, this approach is less about creating a customized machine translation engine and more about slightly biasing the outcome. If this approach really worked, you would see success stories published everywhere – but you don’t. In addition to data security, data privacy and compliance concerns around sensitive information, this is the simple reason why Google Translate and similar public machine translation engines are not used internally by many organizations – they are simply not suitable for purpose.
Simply uploading training data without any understanding of the data or the goals and how they correlate to each other is foolhardy and will likely result in disappointment.
To be fair, the approach of uploading data to improve an engine does have some merit, but not as the primary approach. This Do-It-Yourself approach is best suited to ongoing small improvements to an already fully customized machine translation engine as detailed in the next section. Small periodic improvements can be very beneficial.
The Language Studio Approach
Modern MT engines are built on billions of sentences of data. While uploading your legacy translation memories and terminology will improve translation, it is a minor improvement overall. The reality is that you don’t have sufficient data (no organization does) to customize an engine to your writing style, context, and vocabulary. You have some data that can influence the output, but a true customization is based around understanding your purpose and use and adapting it specifically to meet that purpose. If this sounds familiar, it is because the basic premise of translating with a machine is no different than human translation. If you don’t guide your human translators properly, then they will not have the necessary information for the correct style, context and purpose. The final deliverable is likely to be sub-standard.
Language Studio developer tools provide a comprehensive set of tools to create, assemble, manage, measure and optimize custom machine translation engines.
|Feature||Upload and Pray||Language Studio|
|Typical Base Data Size||10-50 million bilingual sentences||1-20 billion bilingual sentences|
|Upload Your Bilingual Data||Yes||Yes|
|Expert Guided Training||No||Yes|
|Data Cleaning||None or minimal||Extensive data cleaning|
|Upload Your Monolingual Data||No||Yes|
|Personalized Customization Plan||No||Yes|
|Custom Created Data||No||Yes|
|Terminology Analysis & Extraction||No||Yes|
|Bilingual Terminology Resolution||No||Yes|
|Writing Style Control||No||Yes|
|Multiple Writing Styles in One Engine||No||Yes|
|Multiple Domains in One Engine||No||Yes|
|Data Synthesis and Data Manufacturing||No||Yes - 20-40 million bilingual sentences|
Planning, Understanding and Preparation
The Language Studio approach considers the purpose, the context/domain, the writing style and the overall use case when creating a custom MT engine. We start with a simple meeting to get the basic parameters and then adapt our standardized and proven customization plan to meet your specific requirements.
- Determine language pairs: Multiple language pairs can be contained within a single MT engine.
- Determine domains: Multiple domains can be contained within a single MT engine and switched between dynamically at the time of translation.
- Determine the goals, purpose and audience: This information provides an overall understanding so that the correct data can be utilized for optimal translation quality and accuracy.
- Determine the writing styles (sales, legal, technical, etc.) required: Data needs to be prepared for each style. Multiple writing styles can be contained within a single MT engine and switched between automatically or manually at the time of translation.
- Determine the available data assets: Modern MT systems are trained on hundreds of millions and sometimes billions of bilingual sentences.
  Review data assets:
  - that can be provided by the customer.
  - in the Omniscien data repository.
  - available from other external sources.
  - that can be created/manufactured/generated using Omniscien tools.
- Determine the technology to be used for translation: Neural Machine Translation (NMT) is the most common translation technology. However, sometimes Statistical Machine Translation (SMT) or a hybrid approach of both NMT and SMT is better suited.
- Determine the tools to use for gathering, preparing, processing and cleaning data: Omniscien provides a wide range of powerful tools to create millions of bilingual sentences specific to your needs.
- Determine the pre- and post-processing needed (if any) when translating: Custom rules give that extra edge when you need to fine-tune formats and control data.
- Adapt Base Plan, Prepare and Execute: Our proven standardized approach is adapted and refined to match your goals and purpose.
The Right Data is Essential to Delivering High Quality Translations
Five years ago, a typical custom machine translation engine was based on between 2 and 4 million generic bilingual sentences and between 100,000 and 500,000 domain-specific sentences provided by the user/client. At that time, it was already a challenge for many organizations to gather that many bilingual translation memory sentences. If you did not have any of your own data, finding suitable data was a significant barrier to building your own custom machine translation engine.
With the advent of Neural Machine Translation, the volume of data required has multiplied many times over. Today’s typical custom machine translation engines are based on between 150 million and more than 1 billion bilingual sentences.
This data is made up of approximately:
- 80% Language Pair Foundation Data that acts as a general base underpinning the engine for grammar, vocabulary, and core linguistic knowledge, and that trains the artificial intelligence algorithms to better understand relationships between words and phrases. Some language pairs have more training data available than others, due to the ease of sourcing data and the prominence of the language. Less prominent language pairs may have fewer resources available to create training data. Omniscien aims to translate directly between languages wherever possible. All official EU languages already have at least 500 million bilingual sentences available for training in all directions without pivoting through English. Many have well over 1 billion bilingual sentences.
- 15% Industry Domain Data such as life sciences, information technology or automotive that teaches the engine a specific context.
- <1% Client Data. This is too small overall to make a significant difference, but it does act as a base for manufacturing and synthesizing data. This data usually consists of translation memory files and term base files that can be easily exported from the client's translation management system.
- 4-5% Manufactured and Synthesized Data makes a huge difference in translation quality. A good-sized translation memory exported from a client’s translation management system is a great help, but modern neural machine translation engines need more data than any one company can provide. Even with many translation projects, the data is unlikely to exceed 500,000 sentences for most organizations. Most companies struggle to come up with training data.
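As a rough illustration of the proportions listed above, the following sketch splits a total sentence count into the four categories. The exact shares for client data (0.1%) and manufactured data (4.9%) are assumptions chosen from within the stated ranges so that the shares sum to 100%, and the 500 million total is a hypothetical engine size, not a figure for any particular engine.

```python
# Approximate shares from the list above. The 0.1% and 4.9% values are
# assumptions picked from within the stated "<1%" and "4-5%" ranges.
SHARES = {
    "foundation": 0.80,
    "industry_domain": 0.15,
    "manufactured": 0.049,
    "client": 0.001,
}


def estimate_composition(total_sentences: int) -> dict[str, int]:
    """Split a total bilingual sentence count by the approximate shares."""
    return {
        name: round(total_sentences * share)
        for name, share in SHARES.items()
    }


if __name__ == "__main__":
    # Hypothetical engine trained on 500 million bilingual sentences.
    for name, count in estimate_composition(500_000_000).items():
        print(f"{name:>16}: {count:,}")
```

Under these assumptions, the client's own data contributes about 500,000 sentences, while manufactured and synthesized data contributes roughly 24.5 million, which is why the latter has far more influence on the result.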
The additional training data needed to feed the machine learning algorithms with sufficient data to really make a translation quality difference is created using a set of powerful tools that are able to create specific data for your purpose. Even when no bilingual training data is available from the client, the Language Studio tools can still generate millions of unique sentences.
Language Studio provides powerful tools to create, match, mine and synthesize bilingual sentences that are specific to your context, domain, and purpose. A rapid customization can be performed with your own data alone and gives a reasonable improvement in translation quality. By comparison, with the additional data created by Language Studio tools, translation quality is taken to the next level.
What makes Language Studio Custom MT Engines Better
Language Studio offers a range of custom machine translation engine benefits and features that are unmatched by other solutions.
Whether you have a translation team engaged in human translation and machine translation post-editing or are using machine translation to communicate and share information better, Language Studio provides optimized custom machine translation solutions. Both Statistical Machine Translation and Neural Machine Translation technologies are available, so the most suitable engine can be created for your purpose.
Industry Domain Data
Hundreds of millions of industry domain bilingual sentences to draw from across 14 industry domains. Add pre-processed and selected data to your own custom engines.
Pre-trained Industry Domain MT engines have been created using Industry Domain data. These engines are often referred to as Stock Engines and can be used as a basis for your own customization.
Multiple Domains & Genres in a Single MT Engine
Language Studio custom MT engines have a unique feature that allows the translation genre, domain, and writing style to be specified at the time of translation (e.g. marketing, technical manuals). The resulting translation is then stylized to match. A full range of genres, domains, and styles can be built into a single machine translation engine.
Advanced Data Mining, Data Synthesis, Data Creation and Data Preparation Tools
Data is the fuel that powers high-quality custom MT engines. Few organizations have sufficient data to really make a difference in translation quality and domain context.
Language Studio provides a range of powerful bilingual glossary and sentence creation tools. Automated website mining, data synthesis, client-specific terminology extraction, and content cleaning, matching, and automation are used to create millions of bilingual sentences to quickly build your custom MT engine.
Professional or Do-It-Yourself Machine Translation Customization
Professional Guided Custom MT Training
This is the approach we recommend for your first customization of any domain. Once professionally customized, you can then progressively improve your MT engines with our Do-It-Yourself Standard MT Training.
A Professional Guided Custom MT Engine delivers the highest possible translation quality. Experts who have participated in customizing thousands of machine translation engines will engage with you to understand your goals and then will design a personalized Engine Customization Plan.
The entire process is project managed by the Omniscien team. Only minimal input from the customer is required. Customization takes a little longer than a Standard Custom MT Engine as the customization is a semi-automated process that includes detailed deep data analysis, data synthesis, bilingual data mining, and many other features that fine-tune the data and the engine to meet your business use case.
We will also guide you on the most appropriate translation technology, including the best machine learning approach, such as Neural Machine Translation or Statistical Machine Translation, so that your MT engines deliver the optimal MT output quality and best overall MT performance. The end product is optimized and trained specifically for your purpose.
Quality metrics using a blind test set are provided with each custom engine to compare translation quality against other MT engines, with metrics such as BLEU, F-Measure, Edit Distance and many more.
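To make one of the metrics named above concrete, here is a minimal sketch of a word-level edit distance between an MT output and a reference translation. It uses the standard Levenshtein algorithm over whitespace-separated words. The function names are illustrative, and this is not necessarily the variant or tokenization that Language Studio reports.

```python
def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Levenshtein distance counted in whole words."""
    hyp, ref = hypothesis.split(), reference.split()
    # previous[j] is the distance between the hypothesis words processed
    # so far and the first j reference words.
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            cost = 0 if hyp_word == ref_word else 1
            current.append(min(
                previous[j] + 1,         # delete a hypothesis word
                current[j - 1] + 1,      # insert a reference word
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(hypothesis: str, reference: str) -> float:
    """Edit distance divided by reference length (0.0 means identical)."""
    ref_len = len(reference.split())
    if ref_len == 0:
        return 0.0 if not hypothesis.split() else 1.0
    return word_edit_distance(hypothesis, reference) / ref_len
```

A lower score means fewer word edits are needed to turn the MT output into the reference, which is a rough proxy for post-editing effort when averaged over a blind test set.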
Do-It-Yourself Standard MT Training
A Standard Custom MT Engine is a simple way to quickly get a basic level of customization if you are in a hurry or simply want to add your data to Omniscien Industry Domain Data for a rapid customization.
Simply select the Industry Domain Data that you want to include in your engine, upload your translation memories and glossaries and then submit to train. Processing is fully automated and a custom engine will be produced in as little as a few hours.
This approach is the same as our competitors' "Upload and Pray" approach. We do not recommend this approach for an initial MT engine customization. However, this approach is ideal for "topping up" an already professionally customized MT engine. This is a very fast and simple approach that helps you progressively fine-tune your machine translation quality.
Convert legacy files into “data gold” and create new high-quality bilingual data quickly and efficiently.
In the process of building thousands of custom machine translation engines for clients, the Omniscien team found many manual tasks to be repetitive and often too large and too error-prone to be performed by humans alone. Over many years, a large number of internal tools were developed to help automate these processes and improve data quality. With the release of Language Studio V6.0, these tools are now available as part of the standard Language Studio platform.
Extract and Create Bilingual Glossaries and Terminology
Language Studio can import and process industry-standard glossary formats such as TBX. Extending a TBX file or creating a new glossary can take a considerable amount of time. To quickly create your own bilingual glossaries, Language Studio provides a comprehensive set of tools that set and resolve bilingual glossary terms. Glossaries can be used as training data for custom machine translation engines or as runtime rules when processing a translation.
Reduce the time to create a glossary for a body of text from days to minutes by automatically extracting specific terminology from your content. Language professionals can quickly validate and fine-tune the automated results to reach their terminology compliance goals sooner. When creating bilingual glossaries, language professionals need only check and adjust suggested glossary phrases, increasing their productivity from an average of 20 to more than 160 terms per hour.
- Glossary and Term Extraction
- Bilingual Glossary Creation
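For readers unfamiliar with TBX, the sketch below shows how bilingual term pairs might be read from a TBX-style file using only the Python standard library. The sample fragment is hypothetical and heavily trimmed. Real TBX files carry a header and richer metadata, and newer TBX versions use different element names. The function `extract_term_pairs` is an illustrative helper, not part of Language Studio.

```python
import xml.etree.ElementTree as ET

# Name under which ElementTree exposes the xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical, trimmed TBX-style fragment for illustration only.
SAMPLE_TBX = """<martif type="TBX" xml:lang="en">
  <text><body>
    <termEntry id="t1">
      <langSet xml:lang="en"><tig><term>brake pad</term></tig></langSet>
      <langSet xml:lang="de"><tig><term>Bremsbelag</term></tig></langSet>
    </termEntry>
    <termEntry id="t2">
      <langSet xml:lang="en"><tig><term>gearbox</term></tig></langSet>
      <langSet xml:lang="de"><tig><term>Getriebe</term></tig></langSet>
    </termEntry>
  </body></text>
</martif>"""


def extract_term_pairs(tbx_xml: str, source: str, target: str) -> list[tuple[str, str]]:
    """Return (source term, target term) pairs, one per term entry."""
    root = ET.fromstring(tbx_xml)
    pairs = []
    for entry in root.iter("termEntry"):
        terms = {}
        for lang_set in entry.iter("langSet"):
            term = lang_set.find(".//term")
            if term is not None and term.text:
                # Keeps only the last term seen for each language.
                terms[lang_set.get(XML_LANG)] = term.text.strip()
        # Skip entries that lack either language.
        if source in terms and target in terms:
            pairs.append((terms[source], terms[target]))
    return pairs


if __name__ == "__main__":
    for src, tgt in extract_term_pairs(SAMPLE_TBX, "en", "de"):
        print(f"{src}\t{tgt}")
```

Pairs extracted this way could serve either as glossary training data or as runtime terminology rules, the two uses described above.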
Create and Match Bilingual Data
Turn your legacy language data assets into “data gold”. Many organizations and Language Service Providers (LSPs) have large numbers of files (sometimes even tens of thousands) that have been built using multiple platforms, sometimes with non-optimal management controls and filing systems.
Simply upload files to automatically identify and match their bilingual document pairs and extract high-quality bilingual sentences. Even monolingual data can be leveraged as a driver for synthetic data. These are just a few of the many ways that Language Studio creates new data assets that bolster your own custom machine translation engine.
- Bilingual Document Pair Matching
- Bilingual Sentence Matching
- Data Synthesis
- Bilingual Data Mining
Extract, Analyze, Organize and Normalize Data Assets
Basic organization, preparation, and normalization of data files are often a challenge, especially for legacy or acquired third-party data. Automatically analyze, convert, extract, compare, measure, normalize, bulk rename, and classify.
A wide range of Natural Language Processing (NLP) tools are available for processing, cleaning, repairing, and classifying data so that it can be used in a variety of processes and tools.
- Language Identification
- Bulk Renaming
- Bulk Encoding Conversion
- Bulk Format Conversion and Text Extraction
- Advanced Data Cleaning
- Domain Classification
- Sentence Segmentation and Joining
- Automated Quality Metrics