It would seem a paradox that LSPs can achieve significant ROI from Machine Translation (MT) and keep localisation staff happy. Ever since the early days of the industrial revolution, seeking efficiencies in business has shifted where human effort is applied, and such changes do not come easily. Most staff members presented with machine translation output for post-editing, however, are frustrated mainly by the poor quality of the output they are given. Poor output can go as far as being wrong for the domain altogether, which will likely force an editor to rewrite significant sections of a document and negatively affect both ROI and staff motivation.
For LSPs who really want to apply Machine Translation (MT) as a strategic support tool, it is critically important to adapt organisational processes, add staff who are responsible for MT and have NLP and MT skills, and adopt a solid data management strategy. Statistical Machine Translation (SMT) systems as well as the emerging Neural Machine Translation (NMT) systems both depend heavily on good quality, “right” data to train the system. The old rule of software and system development is still as true as when it was devised: “Garbage in, garbage out”. Any MT system can only be as good as the data it was trained on.
This article discusses the importance of having a Data and Engine Strategy for best-in-class MT output and ROI. It is assumed that the reader is familiar with Language Studio as well as the concepts of Statistical Machine Translation, TM and CAT technology. It is also assumed that the reader has a basic understanding of data taxonomies and translation project needs, and has ideally read and understood the Omniscien Technologies Language Studio Engine Strategy Guide.
What makes a good engine and how do we increase efficiency to maximize the ROI?
Before jumping into the details, it is key to take a step back and remind ourselves of why we apply MT and what makes a good engine. The use of MT comes at a cost: the technology must be sourced and relevant skills, such as dedicated MT staff, must be applied. Consequently, we need to ensure that we initially apply MT to projects that are sizeable and repeatable, with volume that both justifies the initial effort and provides ongoing material for improvement. Not every domain or project is equally suited for MT. In addition, it is important to remind ourselves that engines need to mature, similar to wine, and that only mature engines will allow us to achieve the desired 200%+ ROI outcome. Reaping these benefits requires us to make the up-front effort. Finally, we need to keep domains in mind: there is little benefit in creating an Engineering engine for customer X for a particular language pair and then having the localization teams run Marketing content through that engine. The output will be undesirable and will result in frustrated translators and a lack of the desired return.
Therefore, before we even start discussing how to build a good engine and what strategy to apply, it is crucial to ensure that the processes are solid and that the right projects are selected for MT.
So assuming we have selected the right project, what then? What do we need for a good engine?
Assuming we have taken a conscious decision to apply MT in a given project, we need to do a number of things. But before we delve into the engine assembly details, let’s remind ourselves of what we need (at a high level) for a good engine. Contrary to popular belief, TM, while necessary and important, is not the most important component. The components needed for a good and meaningful engine (for that project and domain, meaning terminology needs to be right, writing style needs to be correct, etc.) are, in order of importance:
- Language Model: i.e. target language text. The language model (LM) is critically important because it defines the writing style. Using the example above, if we train the engine with engineering documentation as LM and then run marketing content through it, the marketing content will read like an engineering document. This is undesirable and will lead to significant post-editing effort, possibly so much that it would be easier to have a human translate from scratch.
- Terminology: when building a PROFESSIONAL engine with Omniscien Technologies Language Studio, we deliberately perform term (and other) analysis on the content that we expect to process and provide a list of terms by frequency as well as term candidates to the LSP to resolve. Resolving these terms is very important because it will a) ensure that the right in-domain and customer terminology is used and b) ensure that it is applied consistently by the machine, something that machines are known to do well and humans do poorly. We also know from experience that this affects efficiency, because humans are much more efficient at correcting grammar mistakes than terminology.
- Translation Memory: Naturally, a good volume of bilingual content is key; without it there is no MT engine. But in terms of producing the right translation versus (just) a translation, terminology and language model are far more important.
- Glossary and Non-Translatable Terms: for the final “polish” these components are naturally important. As an example, in IT, if we were translating for HP, we would need to teach the system that “Inspire” is not only an English word but also an HP product name. Hence the term HP followed by the word “Inspire” must not be translated, while the word inspire in an ordinary sentence must be.
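The non-translatable term example can be illustrated with a minimal sketch. It assumes a simple placeholder approach: protected terms are masked before translation and restored afterwards. The function names, the placeholder format and the term list are hypothetical and are not taken from Language Studio.

```python
import re

# Hypothetical list of protected terms, for illustration only.
NON_TRANSLATABLE = ["HP Inspire"]


def protect_terms(text, terms=NON_TRANSLATABLE):
    """Mask each protected term with a placeholder; return the text and the mapping."""
    mapping = {}
    for i, term in enumerate(terms):
        placeholder = f"__NT{i}__"
        # Case-sensitive whole-word match, so the ordinary verb "inspire" is untouched.
        pattern = re.compile(r"\b" + re.escape(term) + r"\b")
        if pattern.search(text):
            text = pattern.sub(placeholder, text)
            mapping[placeholder] = term
    return text, mapping


def restore_terms(text, mapping):
    """Put the protected terms back after translation."""
    for placeholder, term in mapping.items():
        text = text.replace(placeholder, term)
    return text


masked, mapping = protect_terms("The HP Inspire will inspire you.")
print(masked)
print(restore_terms(masked, mapping))
```

In this sketch only the product name is masked; the lowercase verb remains available for translation.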
Finally, statistical MT needs a lot of data: the more, the better. Strategically applying MT to a project therefore requires the realization that work must be done upfront to gather sufficient volumes of good quality and relevant training data, LM, terminology, etc. These processes must be planned and executed diligently or the result will be sub-par. Also, as said in the introduction, mature engines go through a few iterations, so re-training engines as soon as new TM and LM become available is key. On a large project where a lot of content is produced in a short period of time, that might even occur every few days, which will boost productivity throughout the project.
So what does this have to do with the engine and data strategy?
The first thing to realize is that creating an engine is all about data, and not just any data, but the right data. For example, for term work it is important to perform the automated term analysis and the term resolution on the data that is supposed to be translated (the customer data), since that is the relevant data. Crawling in-domain websites to manufacture LM is a valid strategy if insufficient data is available, but that data should ideally NOT be used for terminology extraction. If it is used, that must be a conscious decision, and the MT team needs to understand that it will be asked to resolve terms that are not necessarily relevant for the customer. The consequence is that the project teams need customer data as early as possible in the project and not just at the time when the translation is supposed to occur. This alone is a significant process change for many LSPs. Getting the data early allows for term extraction and normalisation early in the process.
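To make the idea of a term list by frequency more concrete, here is a minimal sketch of frequency-based term candidate extraction run on customer source text. It is a simplified illustration, not the analysis Language Studio performs: the tokenizer, the small stopword list and the thresholds are all assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a real resource).
STOPWORDS = {"the", "is", "a", "an", "of", "to", "and"}


def term_candidates(sentences, max_len=2, min_count=2):
    """Count 1..max_len word n-grams and return those seen at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                ngram = tokens[i:i + n]
                # Skip candidates that start or end with a stopword.
                if ngram[0] in STOPWORDS or ngram[-1] in STOPWORDS:
                    continue
                counts[" ".join(ngram)] += 1
    return [(term, c) for term, c in counts.most_common() if c >= min_count]


customer_text = [
    "Tighten the torque wrench.",
    "The torque wrench is calibrated.",
]
for term, count in term_candidates(customer_text):
    print(count, term)
```

Running the same function on crawled web text instead of customer text would surface candidates that the customer may never use, which is the risk described above.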
Realizing that having a large volume of good quality data is key means that data needs to be stored and “tagged” such that it can be accessed and assembled into an engine quickly and efficiently when needed. A concept that is important to understand in this context is that Statistical Machine Translation, as the name suggests, is based solely on statistics. However, statistical models can be adjusted, and this is what is used in “layering”. “Layering” is, simply put, similar to making a large, multi-layer cake: cream on the top, followed by different layers of cake and possibly a crunchy bottom layer for structure. With MT engines we apply the same principle by using the project (customer/domain) specific data as the top and most relevant layer and making that content statistically most relevant, followed by other layers of data to provide “body”, since we will not always have sufficient data.
Assume an example where an LSP has two engineering firms (customers A and B) in a closely related engineering domain as customers, working in the same language pairs. Assume also that this LSP translates both technical and marketing documentation for these customers. In this situation the different customer (and domain) specific engines can “feed” off one another, if data policies allow. A marketing engine for customer A could be layered as follows, with statistical relevance adjusted accordingly:
- Customer A marketing data (TM, LM, Terminology, etc)
- Customer B marketing data
- General engineering marketing data available
- General engineering data available
The Customer B marketing engine would in essence be the same, with the first two layers swapped to give the Customer B writing style and terminology a higher weighting.
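The layering described above can be pictured with a small sketch. It assumes that each layer receives a weight that decays with its position in the stack, and that competing translations are scored by weighted counts. The decay factor, layer names, counts and scoring rule are illustrative assumptions and not a description of how Language Studio adjusts its statistical models.

```python
# Layer order from the example above, most relevant first (names are hypothetical).
CUSTOMER_A_LAYERS = [
    "customer_a_marketing",
    "customer_b_marketing",
    "general_engineering_marketing",
    "general_engineering",
]
# Customer B uses the same data with the first two layers swapped.
CUSTOMER_B_LAYERS = [CUSTOMER_A_LAYERS[1], CUSTOMER_A_LAYERS[0]] + CUSTOMER_A_LAYERS[2:]


def layer_weights(layers, decay=0.5):
    """Give each layer a weight that shrinks with depth; weights sum to 1."""
    raw = [decay ** i for i in range(len(layers))]
    total = sum(raw)
    return {name: w / total for name, w in zip(layers, raw)}


def pick_translation(candidate_counts, weights):
    """Score each candidate by its layer-weighted counts and return the best one."""
    scores = {
        candidate: sum(weights.get(layer, 0.0) * count for layer, count in per_layer.items())
        for candidate, per_layer in candidate_counts.items()
    }
    return max(scores, key=scores.get)


# Made-up counts: how often each variant of a term appears in each layer.
COUNTS = {
    "variant_a": {"customer_a_marketing": 3},
    "variant_b": {"customer_b_marketing": 5},
}

print(pick_translation(COUNTS, layer_weights(CUSTOMER_A_LAYERS)))
print(pick_translation(COUNTS, layer_weights(CUSTOMER_B_LAYERS)))
```

Even though variant_b is more frequent in raw counts, the customer A stack prefers variant_a because customer A data carries the highest weight; swapping the top two layers reverses the choice.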
This concept of layering and prioritizing data is important to understand because it provides the basic mechanism for building out good SMT / NMT engines, even if initially limited amounts of customer data are available. The 3rd party or general data provides “body” and over time as more and more data from post-editing is added back in, the engine becomes increasingly precise as it matures and the customer and domain specific content influences the statistical model more and more. Language Studio has been specifically designed to support this process and makes it exceptionally easy to add data to an engine in layers.
In Language Studio, an MT expert assembling an engine can select an Omniscien Technologies INDUSTRY engine as a foundation, then add layers of data from broad domains as discussed in the above example, add data from in-domain competitors where available, and then overlay it all with customer specific data that will have the highest statistical relevance.
However, none of the above is possible if the data used for engine assemblies cannot be found because it is either 1) not uploaded into Language Studio or 2) not included in a taxonomy.
Engine Strategy and Taxonomy
While I apologize for the lengthy introduction, understanding the above concepts is critically important to any MT project. Not understanding and applying them is likely to result in a lot of overhead trying to apply MT but ultimately unsatisfactory results.
While data taxonomies can quickly become complex, a simple initial taxonomy could use Omniscien Technologies industry domains (e.g. IT, Travel, Banking and Finance, etc.), plus tags for domain (e.g. marketing, engineering, …), tags for language (pair), tags for data type (e.g. TM, LM, …) and tags for customer. A simple data taxonomy such as this will ease engine assembly significantly and provide ROI very quickly. It allows the MT expert to quickly identify relevant (and sufficient) data at the start of a project, determine the ideal “layering” strategy and quickly produce an SMT engine. This approach, in combination with processes that ensure that post-edited content is fed back into the system, creates a “feedback” loop that leads to continuous improvement and significantly shortens the time needed to achieve significant ROI. Language Studio supports automated improvement (auto-train) both for professional (high-end and complex) engines and for basic engines, allowing LSPs to update their engines within a matter of hours.
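A simple taxonomy of this kind can be sketched as a set of tagged records that are filtered by tag. The field names follow the tags listed above, while the class, function and catalogue entries are hypothetical and only meant to show how tagging makes data findable at assembly time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataSet:
    """One stored data asset with its taxonomy tags (all values are illustrative)."""
    name: str
    industry: str
    domain: str
    language_pair: str
    data_type: str
    customer: str


def find(datasets, **tags):
    """Return the datasets whose tags match every given key/value pair."""
    return [d for d in datasets if all(getattr(d, k) == v for k, v in tags.items())]


CATALOG = [
    DataSet("a_mkt_tm", "Engineering", "marketing", "en-de", "TM", "customer_a"),
    DataSet("a_mkt_lm", "Engineering", "marketing", "en-de", "LM", "customer_a"),
    DataSet("b_mkt_tm", "Engineering", "marketing", "en-de", "TM", "customer_b"),
    DataSet("a_eng_tm", "Engineering", "engineering", "en-de", "TM", "customer_a"),
]

# Top layer for a customer A marketing engine.
top_layer = find(CATALOG, customer="customer_a", domain="marketing")
# All marketing TM for the language pair, regardless of customer.
marketing_tm = find(CATALOG, domain="marketing", data_type="TM", language_pair="en-de")
print([d.name for d in top_layer])
print([d.name for d in marketing_tm])
```

Data that carries no tags, or that was never added to the catalogue, simply cannot be returned by such a query, which is why the taxonomy needs to be in place before engines are assembled.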
Finally, a solid engine strategy further helps to shorten the time before achieving an ROI. The exact strategy and MT project selection will differ by LSP, but LSPs that do a lot of (recurring) work in specific industry verticals and language pairs should consider building their own “general” engines for these industry-domain-language combinations. Doing so will enable rush projects to use an “off the shelf” custom engine, while larger projects can build upon a solid “foundation” to create custom engines for specific projects.
Omniscien Technologies’ Language Studio provides significant capabilities, but LSPs will only be able to maximise the benefits if they take a strategic approach to MT (see also the MT Maturity Model published by Omniscien Technologies) and turn their data into an asset for building good quality, “right” MT engines. There is admittedly quite a bit of effort involved in changing processes, appointing leaders for MT and implementing a data taxonomy and engine strategy. However, all the LSPs that have implemented these changes have seen significant increases in ROI and localisation staff embracing MT.