Optical Character Recognition (OCR) - Technical Features

210 written languages are supported for Optical Character Recognition. Convert images and Adobe PDF files to editable formats for translation and further processing.

Recognize Me!!

Best-in-class AI-driven optical character recognition and machine translation deliver:

Image conversions to MS Office formats
Image tables into Excel
PDF conversions into Word
Searchable PDFs
Translated images and PDFs

Technical Features Overview

210 Written Languages

Arabic, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hindi, Indonesian, Italian, Japanese, Korean, Malay, Norwegian, Polish, Portuguese, Russian, Slovak, Slovenian, Spanish, Swedish, Tamil, Thai, Turkish, Ukrainian, Vietnamese, and more…

Click to see the full list of supported languages.

Automated Document Analysis

Artificial intelligence and machine learning are applied to improve recognition accuracy
Document layout reconstruction, including internal structure and formatting
Detection and recreation of balanced columns of text
Detection of tables and layout reconstruction ensures that tables (even ones without visible column borders) are processed correctly
Accurate font detection and mapping

Powerful Server and API for Integration

Enterprise-class scalability and processing features.
Scale to tens of thousands of pages per hour.
Automatically identify the document’s language.
Submit batch files for processing via API.
Integrate your own applications and workflows seamlessly.
Dynamically scale server resources up and down based on demand.

Advanced Image Pre-Processing

Image pre-processing increases recognition accuracy by optimizing the image for OCR. Even low-quality images can yield good OCR results once the automated image correction steps have been applied. Pre-processing features include:

Auto-cropping of pages
Filtering of color stamps and marks, noise removal, and local contrast improvement
Image mirroring, inverting, scaling, cropping and clipping
Automated detection of page orientation (90, 180, and 270 degrees)
Automated splitting of double-pages
Camera OCR
Deskew (up to +/- 20 degrees) and rotate images
Automated distortion correction, image despeckling/clean-up, ISO noise reduction
Despeckling images in individual blocks
Texture filtering and adaptive binarization
Adjusting text and background color
Text line straightening
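
To make one of the steps above concrete, here is a minimal sketch of adaptive binarization using a local-mean threshold. It is a generic textbook technique shown for illustration only, not Language Studio's implementation; the function name `adaptive_binarize` and the `radius` and `offset` parameters are assumptions.

```python
def adaptive_binarize(gray, radius=1, offset=10):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    A pixel becomes black (0) when it is darker than the mean of its
    local window minus `offset`; otherwise it becomes white (255).
    Illustrative sketch only; `radius` and `offset` are assumed defaults.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clamp the window to the image borders.
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(window) / len(window)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        result.append(row)
    return result
```

Because the threshold follows the local neighborhood rather than one global value, text can stay separable from the background when lighting varies across the page.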

Unrivalled Photo Processing (Camera OCR)

Digital cameras, smartphones and tablets take pictures with suitable resolution and image quality, but the pictures typically contain device-specific and user-introduced distortions that make the printed text difficult to read.

Artificial intelligence identifies images captured by a digital camera and implements special image processing algorithms to eliminate distortion on digital photos, such as blur, curved text lines and other errors caused by insufficient light.

Correct image resolution
Straighten curved lines
Automatically correct 3D perspective distortion

Speed and Accuracy

A balance between speed and accuracy is achieved by optimizing the configuration to match your requirements.

Switch between thorough and fast recognition modes.
Consistently outperforms other OCR products for accuracy and document layout reconstruction in independent evaluations.
Uses the latest artificial intelligence and machine learning.
Integrated dictionaries are provided for many languages, with support for your own custom dictionaries and character patterns.
When converting many pages, such as complete document archives or books, developers can leverage Language Studio’s flexible and scalable architecture.
By using multi-core CPUs and processing images in parallel on multiple threads, the OCR steps can be performed significantly faster.

Understanding Core Technical Features

System Requirements

Align the system specification to your workload

For smaller deployments with low processing volumes, our out-of-the-box single-server configuration should be sufficient for all features. For higher volumes and scalable deployments, the Omniscien team will guide you on the hardware requirements and specifications that match your anticipated workload.

Requirements Summary:

Feature	Description
Memory (one-page documents)	Minimum 400 MB RAM, recommended 1 GB RAM
Memory (multi-page documents)	Minimum 1 GB RAM, recommended 1.5 GB RAM
Memory (parallel processing)	450 MB RAM + 350 MB RAM for each core
Memory (parallel processing of Arabic, Chinese, Japanese, or Korean documents)	750 MB RAM + 850 MB RAM for each core
Hard Disk Space	3 GB for Docker installation; 100 MB for program operation; an additional 15 MB for every page when processing a multi-page document
Tmpfs size	4 GB + 1 GB × number of cores
Swap size	4 GB + 1 GB × number of cores
Fonts	For correct font detection, the fonts used in the documents should be installed.
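
The parallel-processing figures in the Requirements Summary are simple linear formulas, so they can be turned into a quick sizing helper. The sketch below is only an illustration of that arithmetic: the function name `estimate_resources` and its parameters are assumptions, and it treats the 15 MB per page as applying only to documents with more than one page. The 3 GB Docker installation is left out because it is a one-time cost.

```python
def estimate_resources(cores, cjk_or_arabic=False, pages=1):
    """Estimate resources for parallel OCR from the Requirements Summary.

    Hypothetical helper: RAM and working disk are in MB, tmpfs and swap in GB.
    """
    if cores < 1 or pages < 1:
        raise ValueError("cores and pages must be at least 1")
    # Arabic, Chinese, Japanese, and Korean need a larger base and per-core share.
    base_mb, per_core_mb = (750, 850) if cjk_or_arabic else (450, 350)
    # Assumption: the 15 MB per page only applies to multi-page documents.
    page_mb = 15 * pages if pages > 1 else 0
    return {
        "ram_mb": base_mb + per_core_mb * cores,
        "working_disk_mb": 100 + page_mb,
        "tmpfs_gb": 4 + cores,
        "swap_gb": 4 + cores,
    }
```

For example, `estimate_resources(4)` gives 1,850 MB of RAM and 8 GB each of tmpfs and swap, while the same four cores processing Japanese documents need 4,150 MB of RAM.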

Translation and Natural Language Processing (NLP)

Get more value from OCR data with NLP and translation

Make your OCR apps smarter. Use Natural Language Processing tools to get more from your data. Enable applications to extract context, syntax, parts of speech, key terms, sentiment, and meaning, to summarize voice content, and to translate your OCR data and documents into other languages.

Translate images into another language by automatically converting the image to a Microsoft Office format using OCR and then translating it, keeping the layout, structure and fonts.
Extract text, email addresses, URLs, etc.
Extract key phrases and terminology
Analyze sentiment, syntax, parts of speech, etc.
Determine the language of a document
Automatically detect and extract tables from images into Excel
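
As a rough illustration of the kind of extraction listed above, the sketch below pulls email addresses and URLs out of recognized text with regular expressions. It is a deliberately simple stand-in, not the product's extractor; the patterns and the function name `extract_contacts` are assumptions and will miss unusual address forms.

```python
import re

# Simplified patterns for illustration; real-world addresses can be more varied.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def extract_contacts(text):
    """Return the email addresses and URLs found in OCR output text."""
    emails = EMAIL_RE.findall(text)
    # Strip trailing punctuation that usually belongs to the sentence, not the URL.
    urls = [url.rstrip(".,;:)") for url in URL_RE.findall(text)]
    return {"emails": emails, "urls": urls}
```

Calling `extract_contacts` on a page of recognized text returns a dictionary with two lists, which can then be passed on to downstream indexing or review steps.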

Scalability

Scale to Thousands of Pages and Users

During the OCR process, a range of algorithms is applied, depending on image quality, document languages, layout complexity, and the number of pages in the document. Some of these algorithms require more memory than others, so set up the system in accordance with the outlined memory requirements; allocating adequate system memory keeps processing fast. The out-of-the-box single-server configuration is suitable for smaller organizations. The Omniscien team will guide you on deployments that have higher demands.

With built-in load balancing, Language Studio can scale servers up on demand to meet even the highest loads.
Language Studio’s architecture is designed for high availability and scaling.

RESTful API

Integrate OCR into your Applications

Use the RESTful APIs to power your applications with Language Studio’s AI-based tools.

APIs include:

Process multiple files concurrently by submitting them via the batch mode API
A large array of settings can be configured for processing and for output document format control
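
On the client side, concurrent batch submission could be organized along the following lines. This page does not document the API's endpoints or payloads, so `submit_batch` below is a purely hypothetical placeholder that returns a fake job record instead of calling any service; `chunk`, `submit_all`, and the default batch size are likewise assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [items[i:i + size] for i in range(0, len(items), size)]


def submit_batch(paths):
    """Hypothetical placeholder for a real batch API call.

    It only echoes a fake job record so the sketch runs offline.
    """
    return {"files": list(paths), "status": "submitted"}


def submit_all(paths, batch_size=10, max_workers=4, submit=submit_batch):
    """Submit all files in batches, several batches at a time."""
    batches = chunk(list(paths), batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input batches.
        return list(pool.map(submit, batches))
```

In a real integration, the `submit` argument would be replaced by a function that performs the documented batch request and returns the job identifier reported by the server.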

Input Image and Document Formats

A wide variety of image formats are supported

Note: Images must be no larger than 32,512 × 32,512 pixels.

File Extension	Description
BMP	BMP: black and white (uncompressed); 4- and 8-bit (uncompressed palette, RLE-compressed palette); 16-bit (uncompressed, uncompressed mask); 24-bit (uncompressed); 32-bit (uncompressed, uncompressed mask)
DCX	DCX: black and white; 2-, 4- and 8-bit palette; 24-bit color
GIF	GIF: black and white (LZW-compressed); 2-, 3-, 4-, 5-, 6-, 7-, 8-bit palette (LZW-compressed)
JB2	JBIG2: black and white
JPG, JPEG, JFIF	Joint Photographic Experts Group: gray, color
JP2, JPC, J2K	JPEG 2000: gray (Part 1); color (Part 1)
PCX	PiCture eXchange: black and white; 2-, 4- and 8-bit palette; 24-bit color
PDF	PDF: image PDF (scanned PDF); digitally created PDF (version 1.7 or earlier)
PNG	Portable Network Graphics: black and white, gray, color
TIF, TIFF	Tagged Image File Format, including multi-page TIFF: black and white (uncompressed, CCITT3, CCITT4, Packbits, ZIP, LZW); gray (uncompressed, Packbits, JPEG, ZIP, LZW); 24-bit color (uncompressed, JPEG, ZIP, LZW); 1-, 4-, 8-bit palette (uncompressed, Packbits, ZIP, LZW)
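
Before uploading a file, an application could check it against the extension list and the pixel limit stated above. The sketch below is a client-side convenience under assumed names (`check_input`, `SUPPORTED_INPUT_EXTENSIONS`); it only looks at the file name and the dimensions the caller supplies, and does not inspect bit depth or compression.

```python
from pathlib import Path

MAX_SIDE_PX = 32_512  # per-side limit from the note above
SUPPORTED_INPUT_EXTENSIONS = {
    "bmp", "dcx", "gif", "jb2", "jpg", "jpeg", "jfif",
    "jp2", "jpc", "j2k", "pcx", "pdf", "png", "tif", "tiff",
}


def check_input(filename, width_px, height_px):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in SUPPORTED_INPUT_EXTENSIONS:
        problems.append(f"unsupported extension: {ext or '(none)'}")
    if width_px > MAX_SIDE_PX or height_px > MAX_SIDE_PX:
        problems.append(f"image exceeds {MAX_SIDE_PX} x {MAX_SIDE_PX} pixels")
    return problems
```

Returning a list rather than raising on the first problem lets a caller report every issue with a file at once.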

Output Document Formats

Output to a wide range of formats

Language Studio can save the recognized text in the following formats:

File Extension	Description
XML (ALTO)	ALTO (Analyzed Layout and Text Object) 3.0: an XML Schema that details technical metadata for describing the layout and content of physical text resources, such as pages of a book or a newspaper
CSV	Comma-Separated Values: support for various code pages (Windows, DOS, Mac, ISO) and Unicode (UTF-16, UTF-8) encoding
DOCX / DOC	Microsoft Word: legacy DOC format; DOCX format
EPUB	Electronic Publication: open e-book file format
FB2	FictionBook 2.0: an open XML-based e-book format
HTML / HTML5	Hyper Text Markup Language: support for various code pages (Windows, DOS, Mac, ISO) and Unicode (UTF-16, UTF-8) encoding, including support for the latest HTML5 standards
ODT	OpenDocument Text Document: the format created by LibreOffice and Apache OpenOffice
PDF	Portable Document Format: PDF, PDF 2.0, PDF/UA; PDF/A-1 (a, b), PDF/A-2 (a, b, u), PDF/A-3 (a, b, u); MRC compression is supported for all PDF formats
RTF	Rich Text Format
TXT	Plain Text: support for various code pages (Windows, DOS, Mac, ISO) and Unicode (UTF-16, UTF-8) encoding
XLSX / XLS	Microsoft Excel: legacy XLS format with support for MS Excel 5 and 8 formats; XLSX format
XML	Extensible Markup Language: contains the recognized text, with its structure described by XML tags
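
ALTO output can be consumed with any XML parser. The sketch below reads the text lines from an ALTO 3 document using Python's standard library. It assumes the output uses the standard ALTO v3 namespace and the usual `TextLine` and `String` elements with a `CONTENT` attribute, which should be verified against an actual output file; the function name `alto_lines` is an assumption, and hyphenation elements are ignored.

```python
import xml.etree.ElementTree as ET

# Standard ALTO v3 namespace; assumed to match the generated files.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}


def alto_lines(xml_text):
    """Return the text of each TextLine in an ALTO document, in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for text_line in root.iterfind(".//alto:TextLine", ALTO_NS):
        # Each String element carries one recognized word in CONTENT.
        words = [s.get("CONTENT", "") for s in text_line.iterfind("alto:String", ALTO_NS)]
        lines.append(" ".join(words))
    return lines
```

The same traversal can be extended to read the position attributes on each `String` element when the layout coordinates are needed as well as the text.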