Supporting Intelligence Gathering with Statistical Machine Translation (SMT)

Gathering intelligence from a high-volume corpus of multilingual, multi-format content, possibly in near real time, is a complex undertaking. However, recent advances in language processing technology and statistical machine translation (SMT) provide both the capacity (throughput) and the means to produce a rich set of information for analysis.

In industry segments ranging from government to legal, there is an increasing need for effective machine-supported analysis of multilingual, multi-format electronic data. In some cases, such as eDiscovery, batches of gathered data need to be processed quickly and brought into a normalized, single-language format for further analysis. Other uses include monitoring an ongoing, high-volume stream of electronic data.

The Importance of Pre-Processing

More often than not, the data under analysis arrives in a variety of formats, ranging from email to text documents, spreadsheets and presentations. Furthermore, the data is completely unstructured, which makes effective analysis challenging at best.

Pre-processing data is key to “unlocking” the data and providing the basis for effective analysis. Pre-processing deals with the following key tasks:

  1. Normalization – Normalization is the process of bringing data into a uniform format, be it plain text or XML. Once the data has been converted and, optionally, structured, it can be processed and analyzed more effectively by higher-level tools.
  2. Translation – Once the data has been converted into a uniform, machine-readable format, the next important step is moving to a single language, so that analysts can understand the content and machine algorithms can be applied using a single language model.
  3. Tagging – Complex language analysis is performed as part of the statistical machine translation process. This analysis provides valuable metadata and tagging that is highly useful at the analysis stage. For example, so-called “named entities” can be tagged, which enables the analytics process to link related items more effectively at a later stage.
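
The three tasks above can be pictured as a small pipeline. The following Python sketch is illustrative only and does not describe any particular product: the names `Record`, `normalize`, `translate`, `tag` and `preprocess` are hypothetical, the translator is a caller-supplied placeholder rather than a real SMT engine, and the "named entity" tagger is a naive stand-in that simply collects runs of two or more capitalized words. Tagging is shown as a separate step for readability, although in practice it can be a by-product of the translation analysis itself.

```python
import re
from dataclasses import dataclass, field, replace
from email import message_from_string
from typing import Callable, List


@dataclass
class Record:
    """Hypothetical container for one pre-processed item."""
    text: str                                   # normalized plain text
    source_format: str                          # e.g. "email" or "text"
    entities: List[str] = field(default_factory=list)


def normalize(raw: str, source_format: str) -> Record:
    """Step 1: bring the input into a uniform plain-text format."""
    if source_format == "email":
        msg = message_from_string(raw)
        if msg.is_multipart():
            # Keep only the simple (non-nested) parts of the message.
            parts = [p.get_payload() for p in msg.get_payload()
                     if not p.is_multipart()]
            body = "\n".join(parts)
        else:
            body = msg.get_payload()
    else:
        body = raw
    # Collapse all runs of whitespace so downstream tools see one format.
    body = re.sub(r"\s+", " ", body).strip()
    return Record(text=body, source_format=source_format)


def translate(record: Record, translator: Callable[[str], str]) -> Record:
    """Step 2: move to a single language.

    `translator` is a placeholder for whatever MT engine is in use.
    """
    return replace(record, text=translator(record.text))


# Naive assumption: two or more consecutive capitalized words form an entity.
_ENTITY = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def tag(record: Record) -> Record:
    """Step 3: attach simple metadata for the later analysis stage."""
    return replace(record, entities=_ENTITY.findall(record.text))


def preprocess(raw: str, source_format: str,
               translator: Callable[[str], str]) -> Record:
    """Run normalization, translation and tagging in order."""
    return tag(translate(normalize(raw, source_format), translator))


if __name__ == "__main__":
    raw_email = ("From: a@example.com\nSubject: Report\n\n"
                 "Meeting with   Acme Corp\nin New York.\n")
    # Identity "translation": the sample is already in the target language.
    result = preprocess(raw_email, "email", lambda s: s)
    print(result.text)      # Meeting with Acme Corp in New York.
    print(result.entities)  # ['Acme Corp', 'New York']
```

Passing the translator in as a function keeps the sketch independent of any specific engine; a production system would replace both the placeholder translator and the regular-expression tagger with proper language analysis.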

Read more by downloading our white paper “Supporting Effective Intelligence Gathering with Statistical Machine Translation”