Gathering intelligence from high volumes of multi-lingual, multi-format content, possibly in near real time, is a complex undertaking. However, recent advances in language processing technology and statistical machine translation (SMT) provide both the capacity (throughput) and the means to produce a rich set of information for analysis.
In industry segments ranging from government to legal, there is an increasing need for effective machine-supported analysis of multi-lingual, multi-format electronic data. In some cases, such as eDiscovery, gathered batches of data need to be processed quickly and brought into a normalized, single-language format for further analysis; other uses include monitoring an ongoing, high-volume stream of electronic data.
The Importance of Pre-Processing
More often than not, the data under analysis arrives in a variety of formats, ranging from email to text documents, spreadsheets and presentations. Furthermore, the data is completely unstructured, which makes effective analysis challenging at best.
Pre-processing is key to “unlocking” the data and providing the basis for effective analysis. It covers the following key tasks:
- Normalization – Normalization is the process of bringing data into a uniform format, be it plain text or XML. Once converted and optionally structured, the data can be processed and analyzed more effectively by higher-level tools.
- Translation – Once the data has been converted into a uniform, machine-readable format, the next important step is to translate it into a single language, so that analysts can understand the content and machine algorithms can be applied using a single language model.
- Tagging – Complex language analysis is performed as part of the statistical machine translation process. This analysis yields valuable metadata and tagging that is highly useful at the analysis stage. For example, so-called “named entities” can be tagged, which enables the analytics process to link related items more effectively at a later stage.
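To make the order of these three tasks concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function names (`normalize`, `translate`, `tag_entities`, `preprocess`) are hypothetical, the word-for-word glossary only stands in for a real SMT engine, and the capitalized-word pattern is a crude substitute for the language analysis an actual system would perform.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document, dropping the markup."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


@dataclass
class Document:
    text: str
    language: str
    entities: list = field(default_factory=list)


def normalize(raw: str, fmt: str) -> str:
    """Bring an input into whitespace-normalized plain text."""
    if fmt == "html":
        extractor = _TextExtractor()
        extractor.feed(raw)
        extractor.close()
        raw = " ".join(extractor.parts)
    return " ".join(raw.split())


# Toy glossary standing in for a real SMT engine (assumption, not SMT).
TOY_GLOSSARY = {"de": {"herr": "Mr", "frau": "Ms", "trifft": "meets"}}


def translate(text: str, source_lang: str) -> str:
    """Placeholder translation: look up each word, keep unknown words as-is."""
    glossary = TOY_GLOSSARY.get(source_lang, {})
    return " ".join(glossary.get(word.lower(), word) for word in text.split())


def tag_entities(text: str) -> list:
    """Naive named-entity guess: runs of consecutive capitalized words."""
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)


def preprocess(raw: str, fmt: str, source_lang: str) -> Document:
    """Run the three tasks in order: normalize, translate, tag."""
    text = normalize(raw, fmt)
    english = translate(text, source_lang)
    return Document(text=english, language="en", entities=tag_entities(english))
```

Calling `preprocess("<p>Herr  Schmidt trifft Frau Keller in Berlin</p>", "html", "de")` yields the text `Mr Schmidt meets Ms Keller in Berlin` with the entities `Mr Schmidt`, `Ms Keller` and `Berlin`. The point of the sketch is the ordering: tagging runs on the translated, normalized text, so later analysis works against a single language and a single format.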
Read more by downloading our white paper “Supporting Effective Intelligence Gathering with Statistical Machine Translation”