Language Studio
Optical Character Recognition / OCR
Technical Features

210 written languages are supported for Optical Character Recognition. Convert images and Adobe PDF files to editable formats for translation and further processing.


Best-in-class, AI-driven optical character recognition and machine translation deliver:

  • Image conversions to MS Office formats
  • Image tables into Excel
  • PDF conversions into Word
  • Searchable PDFs
  • Translated images and PDFs

Technical Features Overview

210 Written Languages

  • Arabic, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Norwegian, Polish, Portuguese, Russian, Slovak, Slovenian, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian, Vietnamese, and more…


Automated Document Analysis

  • Artificial intelligence and machine learning are applied for accuracy and document layout reconstruction
  • Document layout reconstruction, incl. internal structure and formatting
  • Detection and recreation of balanced columns of text
  • Detection of tables and layout reconstruction ensures that tables (even ones without visible column borders) are processed correctly
  • Accurate font detection and mapping

Powerful Server and API for Integration

  • Enterprise class scalability and processing features.
  • Scale to tens of thousands of pages per hour.
  • Automatically identify the document’s language.
  • Submit batch files for processing via API.
  • Integrate your own applications and workflows seamlessly.
  • Dynamically scale server resources up and down based on demand.

Advanced Image Pre-Processing

Image pre-processing increases recognition accuracy by optimizing the image for OCR. Even low-quality images can yield accurate OCR results after automated image correction steps are applied. Pre-processing features include:

  • Auto-cropping and auto-splitting of dual pages
  • Filtering of color stamps and marks, noise removal, and local contrast improvement
  • Image mirroring, inverting, scaling, cropping and clipping
  • Automated detection of page orientation (90, 180, and 270 degrees)
  • Camera OCR
  • Deskew (up to +/- 20 degrees) and rotate images
  • Automated distortion correction, image despeckling/clean-up, ISO noise reduction
  • Despeckling images in individual blocks
  • Texture filtering and adaptive binarization
  • Adjusting text and background color
  • Text line straightening
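
To make one of the listed steps concrete, here is a minimal sketch of adaptive binarization in plain Python. It is not Language Studio's implementation, which is not described on this page; the function name `adaptive_binarize` and its `radius` and `offset` parameters are assumptions chosen for illustration. Each pixel is compared with the mean of its own neighbourhood rather than with one global threshold, which is what lets text survive uneven lighting.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale intensities


def adaptive_binarize(img: Gray, radius: int = 1, offset: int = 30) -> Gray:
    """Return 1 (ink) where a pixel is darker than its local mean minus `offset`."""
    height, width = len(img), len(img[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Neighbourhood around (x, y), clipped at the image border.
            rows = range(max(0, y - radius), min(height, y + radius + 1))
            cols = range(max(0, x - radius), min(width, x + radius + 1))
            window = [img[r][c] for r in rows for c in cols]
            local_mean = sum(window) / len(window)
            out[y][x] = 1 if img[y][x] < local_mean - offset else 0
    return out


# Hypothetical page: bright left half (background 200, stroke 120),
# dim right half (background 100, stroke 40).
page = [
    [200, 200, 200, 100, 100, 100],
    [200, 120, 200, 100, 40, 100],
    [200, 200, 200, 100, 100, 100],
]
mask = adaptive_binarize(page, radius=1, offset=30)
```

In this toy page no single global threshold works: any value high enough to catch the 120 stroke on the bright side also marks the 100 background on the dim side as ink. The local comparison picks out both strokes and nothing else.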

Unrivalled Photo Processing (Camera OCR)

Digital cameras, smartphones and tablets take pictures with suitable resolution and image quality, but typically have many device-specific and user-introduced distortions that make reading the printed text difficult.

Artificial intelligence identifies images captured by a digital camera and implements special image processing algorithms to eliminate distortion on digital photos, such as blur, curved text lines and other errors caused by insufficient light.

  • Correct image resolution
  • Straighten curved lines
  • Automatic correction of 3D perspective distortion

Speed and Accuracy

A balance between speed and accuracy is achieved by optimizing the configuration to match your requirements.

  • Switch between thorough or fast recognition modes.
  • Consistently outperforms other OCR products for accuracy and document layout reconstruction in independent evaluations
  • Uses the latest artificial intelligence and machine learning
  • Integrated dictionaries are provided for many languages, with support for your own custom dictionaries and character patterns.
  • When converting many pages such as complete document archives or books, developers can leverage the Language Studio’s flexible and scalable architecture
  • By using multi-core CPUs and processing images in parallel on multiple threads, the OCR steps can be performed significantly faster
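
The parallel-processing point above can be sketched from the caller's side as follows. This is an illustrative pattern only: `recognize_pages` and the stand-in `fake_recognize` are hypothetical names, and the real recognition call would be whatever your Language Studio integration exposes. The sketch fans page images out to a pool of worker threads and collects results in the original page order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def recognize_pages(
    pages: Sequence[str],
    recognize: Callable[[str], str],
    workers: int = 4,
) -> List[str]:
    """Apply `recognize` to every page in parallel; results keep page order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, not completion order.
        return list(pool.map(recognize, pages))


def fake_recognize(path: str) -> str:
    """Stand-in for a real OCR call (hypothetical); just echoes the file name."""
    return f"text of {path}"


results = recognize_pages(["p1.png", "p2.png", "p3.png"], fake_recognize, workers=2)
```

Threads suit work that mostly waits on a server or on I/O. If the recognition step were CPU-bound Python code, `concurrent.futures.ProcessPoolExecutor` would be the usual substitute.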

Understanding Core Technical Features

System Requirements

Align the system specification to your workload

For smaller, low-volume deployments, our out-of-the-box single-server configuration should be sufficient for all features. For higher volumes and scalable deployments, the Omniscien team will guide you on the hardware requirements and specifications that match your anticipated workload.

Requirements Summary:

Feature Description
Memory
  • for processing one-page documents — minimum 400 MB RAM, recommended 1 GB RAM
  • for processing multi-page documents — minimum 1 GB RAM, recommended 1.5 GB RAM
  • for parallel processing — 450 MB RAM + 350 MB RAM for each core
  • for parallel processing of documents in Arabic, Chinese, Japanese, or Korean languages — 750 MB RAM + 850 MB RAM for each core
Hard Disk Space
  • 3 GB for Docker installation
  • 100 MB for program operation
  • Additional 15 MB for every page when processing a multi-page document
Other
  • Tmpfs size — 4 GB + 1 GB × (number of cores)
  • Swap size — 4 GB + 1 GB × (number of cores)
Fonts
  • For correct font detection, the fonts contained in documents should be installed.
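
As a worked example of the parallel-processing figures above, the following sketch turns them into a small sizing calculator. The function and field names are assumptions made for illustration. The disk figure covers only program operation plus per-page space and excludes the 3 GB Docker installation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceEstimate:
    ram_mb: int    # RAM for parallel processing
    tmpfs_gb: int  # tmpfs size
    swap_gb: int   # swap size
    disk_mb: int   # working disk space, excluding the 3 GB Docker install


def estimate_resources(cores: int, pages: int = 0, cjk_or_arabic: bool = False) -> ResourceEstimate:
    """Apply the parallel-processing formulas from the requirements summary."""
    if cores < 1 or pages < 0:
        raise ValueError("cores must be >= 1 and pages must be >= 0")
    # 450 MB + 350 MB per core, or 750 MB + 850 MB per core for
    # Arabic, Chinese, Japanese, or Korean documents.
    base_mb, per_core_mb = (750, 850) if cjk_or_arabic else (450, 350)
    return ResourceEstimate(
        ram_mb=base_mb + per_core_mb * cores,
        tmpfs_gb=4 + cores,           # 4 GB + 1 GB per core
        swap_gb=4 + cores,            # 4 GB + 1 GB per core
        disk_mb=100 + 15 * pages,     # 100 MB operation + 15 MB per page
    )


latin = estimate_resources(cores=8, pages=200)
cjk = estimate_resources(cores=8, pages=200, cjk_or_arabic=True)
```

For an 8-core server and a 200-page document this gives 450 + 8 × 350 = 3,250 MB of RAM (7,550 MB for Arabic, Chinese, Japanese, or Korean), 12 GB each of tmpfs and swap, and 100 + 200 × 15 = 3,100 MB of working disk space.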

 

Translation and Natural Language Processing (NLP)

Subtitle Optimized Machine Translation

Get more value from OCR data with NLP and translation

Make your OCR apps smarter. Use Natural Language Processing tools to get more from your data. Easily enable applications to extract context, syntax, parts of speech, key terms, sentiment, meaning, summarize voice content, and even translate your OCR data and documents into other languages.

  • Translate images into another language by automatically converting the image to a Microsoft Office format using OCR and then translating it, keeping the layout, structure and fonts.
  • Extract text, email addresses, URLs, etc.
  • Extract key phrases and terminology
  • Analyze sentiment, syntax, parts of speech, etc.
  • Determine the language of a document
  • Automatically detect and extract tables from images into Excel

Scalability

Scale to Thousands of Pages and Users

During the OCR process, a range of different algorithms are applied. They depend on image quality, document languages, layout complexity and number of pages in the document. Accordingly, such algorithms might require higher memory resources. It is recommended to set up the system in accordance with the outlined memory requirements to optimize the processing speed by allocating adequate system memory. The out-of-the-box single-server configuration is suitable for smaller organizations. The Omniscien team will guide you on deployments that have higher demands.

  • With built-in load balancing, Language Studio can scale servers up on demand to meet even the highest loads.
  • Language Studio’s architecture is designed for high availability and scaling.

RESTful API

Integrate OCR into your Applications

Use the RESTful APIs to power your applications with Language Studio’s AI-based tools.

API capabilities include:
  • Process multiple files concurrently by submitting them via the batch mode API
  • A large array of settings can be configured for processing and for output document format control
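
The batch-mode idea above can be illustrated with a client-side sketch that groups files into fixed-size batches and serializes one request body per batch. This page does not document the actual endpoints or request schema, so the JSON field names `files` and `output_format`, and the helper names, are purely hypothetical placeholders; consult the Language Studio API reference for the real ones.

```python
import json
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def build_batch_requests(files: Sequence[str], output_format: str, batch_size: int) -> List[str]:
    """Serialize one JSON body per batch (field names are hypothetical)."""
    return [
        json.dumps({"files": batch, "output_format": output_format}, sort_keys=True)
        for batch in chunked(files, batch_size)
    ]


bodies = build_batch_requests(["a.tif", "b.tif", "c.pdf"], "docx", batch_size=2)
```

Each resulting body could then be submitted by whatever HTTP client your application already uses.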

Input Image and Document Formats

A wide variety of image formats are supported

Note: Images must be no larger than 32,512 × 32,512 pixels.

File Extension Description
BMP BMP

  • uncompressed black and white
  • 4- and 8-bit — uncompressed Palette
  • 4- and 8-bit — RLE compressed Palette
  • 16-bit — uncompressed, uncompressed Mask
  • 24-bit — uncompressed
  • 32-bit — uncompressed, uncompressed Mask
DCX DCX

  • black and white
  • 2-, 4- and 8-bit palette
  • 24-bit color
GIF GIF

  • black and white — LZW-compressed
  • 2-, 3-, 4-, 5-, 6-, 7-, 8-bit palette — LZW-compressed
JB2 JBIG2

  • black and white
JPG, JPEG, JFIF Joint Photographic Experts Group

  • gray, color
JP2, JPC, J2K JPEG 2000

  • gray — Part 1
  • color — Part 1
PCX PiCture eXchange

  • black and white
  • 2-, 4- and 8-bit palette
  • 24-bit color
PDF PDF

  • Image PDF (scanned PDF)
  • Digitally created PDF (Version 1.7 or earlier)
PNG Portable Network Graphics

  • black and white, gray, color
TIF, TIFF Tagged Image File Format

  • black and white — uncompressed, CCITT3, CCITT4, Packbits, ZIP, LZW
  • gray — uncompressed, Packbits, JPEG, ZIP, LZW
  • 24-bit color — uncompressed, JPEG, ZIP, LZW
  • 1-, 4-, 8-bit palette — uncompressed, Packbits, ZIP, LZW
  • (including multi-page TIFF)
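
A pre-flight check based on the table above might look like the sketch below. It assumes the size note means that neither side may exceed 32,512 pixels, and it expects the caller to supply the pixel dimensions, since reading them requires an imaging library. The names `check_input`, `SUPPORTED_EXTENSIONS`, and `MAX_SIDE_PX` are illustrative, not part of the product.

```python
from pathlib import Path
from typing import List

MAX_SIDE_PX = 32_512  # from the size note; assumed to apply to each side

# Extensions listed in the input format table.
SUPPORTED_EXTENSIONS = {
    "bmp", "dcx", "gif", "jb2", "jpg", "jpeg", "jfif",
    "jp2", "jpc", "j2k", "pcx", "pdf", "png", "tif", "tiff",
}


def check_input(filename: str, width: int, height: int) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if width > MAX_SIDE_PX or height > MAX_SIDE_PX:
        problems.append(f"image is {width} x {height}, larger than {MAX_SIDE_PX} per side")
    return problems


ok = check_input("scan.TIFF", 2480, 3508)
bad_type = check_input("photo.webp", 800, 600)
too_big = check_input("map.png", 40_000, 1_000)
```

The check only looks at the file extension. It does not verify the bit depth or compression variants listed for each format.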

Output Document Formats

Output to a wide range of formats

Language Studio can save the recognized text in the following formats:

File Extension Description
XML* ALTO (Analyzed Layout and Text Object) 

*ALTO 3.0 is an XML Schema that details technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper.

CSV Comma Separated Values

  • Support for various code pages (Windows, DOS, Mac, ISO) and Unicode (UTF-16, UTF-8) encoding
DOCX / DOC Microsoft Word

  • Legacy DOC format
  • DOCX format
EPUB Electronic Publication

  • Open eBook File

FB2 FictionBook 2.0

  • FictionBook is an open XML-based e-book format.
HTML / HTML5 HyperText Markup Language

  • Support for various code pages (Windows, DOS, Mac, ISO) and Unicode (UTF-16, UTF-8) encoding.
  • Includes support for the latest HTML5 standards
ODT OpenDocument Text Document

  • Created by LibreOffice and Apache OpenOffice
PDF Portable Document Format

  • PDF, PDF 2.0, PDF/UA
  • PDF/A-1 (a,b), PDF/A-2 (a,b,u), PDF/A-3 (a,b,u)
  • Support for MRC compression for all PDF formats.
RTF Rich Text Format
TXT Plain Text

  • Support for various code pages (Windows, DOS, Mac, ISO) and Unicode (UTF-16, UTF-8) encoding.
XLSX / XLS Microsoft Excel

  • Legacy XLS format with support for MS Excel 5 and 8 formats.
  • XLSX format
XML Extensible Markup Language

  • Contains the recognized text, with its structure described by XML tags.
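
For the ALTO output listed in the table above, downstream code can recover plain text with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` and the ALTO version 3 namespace. The sample document is hand-written for illustration and is not actual Language Studio output, so the exact namespace and structure the product emits should be checked against a real file.

```python
import xml.etree.ElementTree as ET
from typing import List

# ALTO v3 namespace (assumed to match the "ALTO 3.0" output noted above).
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}

SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine><String CONTENT="Hello"/><SP/><String CONTENT="world"/></TextLine>
    <TextLine><String CONTENT="Second"/><SP/><String CONTENT="line"/></TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def alto_text_lines(xml_text: str) -> List[str]:
    """Join the CONTENT of each String element, one output entry per TextLine."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iterfind(".//alto:TextLine", ALTO_NS):
        words = [s.get("CONTENT", "") for s in line.iterfind("alto:String", ALTO_NS)]
        lines.append(" ".join(words))
    return lines


lines = alto_text_lines(SAMPLE)
```

In ALTO, each recognized word is a `String` element whose text sits in the `CONTENT` attribute, which is why the sketch reads attributes rather than element text.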