What Is Optical Character Recognition (OCR)?

Optical character recognition (OCR) is a technology that automates data extraction and storage, saving businesses time, cost and other resources.

Optical character recognition (OCR) is sometimes referred to as text recognition. An OCR program extracts and repurposes data from scanned documents, camera images and image-only PDFs. OCR software singles out the letters in an image, assembles them into words and then assembles the words into sentences, making the original content accessible and editable. It also eliminates the need for manual data entry.

OCR systems use a combination of hardware and software to convert physical, printed documents into machine-readable text. Hardware — such as an optical scanner or specialized circuit board — copies or reads text; then, software typically handles the advanced processing.

OCR software can take advantage of artificial intelligence (AI) to implement more advanced methods of intelligent character recognition (ICR), such as identifying languages or styles of handwriting. OCR is most commonly used to turn hard-copy legal or historical documents into PDF documents so that users can edit, format and search them as if they had been created with a word processor.

The history of optical character recognition

In 1974, Ray Kurzweil started Kurzweil Computer Products, Inc., whose omni-font optical character recognition (OCR) product could recognize text printed in virtually any font. He decided that the best application of this technology would be a reading aid for the blind, so he created a reading machine that could read text aloud in a text-to-speech format. In 1980, Kurzweil sold his company to Xerox, which was interested in further commercializing paper-to-computer text conversion.

OCR technology became popular in the early 1990s through its use in digitizing historical newspapers. Since then, the technology has undergone several improvements. Today’s solutions can deliver near-perfect OCR accuracy, and advanced methods are used to automate complex document processing workflows. Before OCR technology was available, the only way to digitize documents was to retype the text manually. Not only was this time-consuming, but it also came with inevitable inaccuracies and typing errors. Today, OCR services are widely available to the public. For example, Google Cloud Vision OCR can be used to scan and store documents on your smartphone.

How does optical character recognition work?

Optical character recognition (OCR) uses a scanner to process the physical form of a document. Once all pages are copied, OCR software converts the document into a two-color or black-and-white version. The scanned-in image or bitmap is analyzed for light and dark areas: the dark areas are identified as characters that need to be recognized, while the light areas are identified as background. The dark areas are then processed to find alphabetic letters or numeric digits. This stage typically involves targeting one character, word or block of text at a time. Characters are then identified using one of two algorithms: pattern recognition or feature detection.
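The light/dark separation described above can be illustrated with a minimal sketch. It assumes the scan is already available as a grid of grayscale values (0 is black, 255 is white) and uses a fixed cutoff of 128; both the function name `binarize` and the cutoff are illustrative assumptions, and real OCR software may choose its threshold differently.

```python
def binarize(gray, threshold=128):
    """Return a grid of 1 (dark, candidate text) and 0 (light, background)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

# A tiny, made-up 3x4 grayscale "scan" (0 = black, 255 = white).
scan = [
    [250, 30, 240, 245],
    [245, 25, 35, 250],
    [255, 20, 250, 248],
]

binary = binarize(scan)
for row in binary:
    # Show dark pixels as "#" and background as "."
    print("".join("#" if v else "." for v in row))
```

Every later step in this sketch-level view (finding characters, matching them) works on the resulting two-color grid rather than on the original grayscale values.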

In pattern recognition, the OCR program is fed examples of text in various fonts and formats, which it then compares against the characters in the scanned document or image file in order to recognize them.

In feature detection, the OCR program applies rules about the features of a specific letter or number to recognize characters in the scanned document. Features include the number of angled lines, crossed lines or curves in a character. For example, the capital letter “A” is stored as two diagonal lines that meet with a horizontal line across the middle. When a character is identified, it is converted into an ASCII code (American Standard Code for Information Interchange) that computer systems use to handle further manipulations.
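As a rough illustration of rule-based feature detection followed by ASCII conversion, the sketch below assumes the stroke counts for a character have already been measured. The rule table `RULES`, its feature order and the function `classify` are hypothetical and cover only four letters; they are not taken from any particular OCR product.

```python
# Hypothetical rule table.
# Key: (diagonal lines, horizontal lines, vertical lines, curves)
RULES = {
    (2, 1, 0, 0): "A",  # two diagonals joined by a horizontal bar
    (0, 3, 1, 0): "E",  # one vertical stem with three horizontal bars
    (0, 1, 2, 0): "H",  # two vertical stems joined by a horizontal bar
    (0, 0, 0, 1): "O",  # a single closed curve
}

def classify(features):
    """Look up a character by its stroke counts; return None if no rule matches."""
    return RULES.get(features)

char = classify((2, 1, 0, 0))
code = ord(char)      # ASCII code of the recognized character
print(char, code)     # A 65
```

A feature tuple that matches no rule returns `None`, which a real system would have to handle, for example by falling back to another recognition method.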

An OCR program also analyzes the structure of a document image. It divides the page into elements such as blocks of text, tables or images. The lines are divided into words and then into characters. Once the characters have been singled out, the program compares them with a set of pattern images. After processing all likely matches, the program presents you with the recognized text.
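One simple way to divide a line into characters is to look for blank columns between glyphs. The sketch below assumes a binarized line image (1 is dark, 0 is background) and that neighboring characters do not touch; the function name `split_columns` is illustrative. It is one possible approach rather than the method any specific OCR engine uses.

```python
def split_columns(binary):
    """Return (start, end) column ranges, end exclusive, that contain dark pixels."""
    width = len(binary[0])
    # Vertical projection: number of dark pixels in each column.
    profile = [sum(row[x] for row in binary) for x in range(width)]
    spans, start = [], None
    for x, count in enumerate(profile):
        if count and start is None:
            start = x                  # a glyph begins
        elif not count and start is not None:
            spans.append((start, x))   # a blank column ends the glyph
            start = None
    if start is not None:
        spans.append((start, width))   # glyph runs to the right edge
    return spans

# A made-up line image holding three glyphs separated by blank columns.
line = [
    [1, 1, 0, 1, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0, 0, 1, 1],
]

spans = split_columns(line)
print(spans)  # [(0, 2), (3, 4), (6, 8)]
```

The same idea applied to rows instead of columns separates a block of text into lines, and wider gaps between spans can be treated as word boundaries.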

Why is OCR important?

Most business workflows involve receiving information from print media. Paper forms, invoices, scanned legal documents, and printed contracts are all part of business processes. These large volumes of paperwork take a lot of time and space to store and manage. Though paperless document management is the way to go, scanning the document into an image creates challenges. The process requires manual intervention and can be tedious and slow.

Moreover, digitizing this document content creates image files with the text hidden within them. Text in images cannot be processed by word processing software in the same way as text documents. OCR technology solves the problem by converting text images into text data that can be analyzed by other business software. You can then use the data to conduct analytics, streamline operations, automate processes, and improve productivity.

The benefits of optical character recognition

The main benefit of optical character recognition (OCR) technology is that it simplifies the data-entry process by making text searching, editing and storage effortless. OCR allows businesses and individuals to store files on their computers, laptops and other devices, ensuring constant access to all documentation.

How does OCR work?

The OCR engine or OCR software works by using the following steps:

Image acquisition

A scanner reads documents and converts them to binary data. The OCR software analyzes the scanned image and classifies the light areas as background and the dark areas as text.

Preprocessing

The OCR software first cleans the image and removes errors to prepare it for reading. These are some of its cleaning techniques:

  • Deskewing, or tilting the scanned document slightly to fix alignment issues that arose during the scan.
  • Despeckling, or removing any digital image spots and smoothing the edges of text images.
  • Cleaning up boxes and lines in the image.
  • Script recognition for multi-language OCR technology.
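Of the cleaning techniques listed above, despeckling is the easiest to sketch. The example below assumes a binarized image (1 is dark, 0 is background) and treats any dark pixel with no dark neighbor in the eight surrounding cells as a speck; the function name `despeckle` and that rule are simplifying assumptions, and production software generally uses more robust filters.

```python
def despeckle(binary):
    """Clear dark pixels that have no dark neighbor in the 8 surrounding cells."""
    h, w = len(binary), len(binary[0])
    out = [row[:] for row in binary]   # work on a copy
    for y in range(h):
        for x in range(w):
            if not binary[y][x]:
                continue
            neighbors = sum(
                binary[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbors == 0:
                out[y][x] = 0          # isolated pixel: treat as noise
    return out

# A made-up image: a small stroke in the middle plus two isolated specks.
noisy = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1],
]

clean = despeckle(noisy)
for row in clean:
    print("".join("#" if v else "." for v in row))
```

The connected stroke survives while the specks in the top-left and bottom-right corners are removed.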

Text recognition

OCR software uses two main types of algorithms for text recognition: pattern matching and feature extraction.

Pattern matching

Pattern matching works by isolating a character image, called a glyph, and comparing it with a stored glyph. Pattern matching works only if the stored glyph has a similar font and scale to the input glyph. This method works well with scanned images of documents that have been typed in a known font.
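A minimal sketch of pattern matching follows. It assumes every glyph has already been isolated and scaled to the same 3x3 grid, and it scores candidates by the fraction of pixels that agree; the `TEMPLATES` table and the scoring rule are illustrative assumptions rather than a description of a specific OCR engine.

```python
# Hypothetical stored glyphs, all on the same 3x3 grid ("1" = dark pixel).
TEMPLATES = {
    "I": ["010", "010", "010"],
    "L": ["100", "100", "111"],
    "T": ["111", "010", "010"],
}

def similarity(a, b):
    """Fraction of pixels on which two equally sized glyphs agree."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total

def match(glyph):
    """Return the template character with the highest pixel agreement."""
    return max(TEMPLATES, key=lambda ch: similarity(glyph, TEMPLATES[ch]))

noisy_t = ["111", "010", "011"]   # a "T" with one stray dark pixel
best = match(noisy_t)
print(best)  # T
```

Because the comparison is pixel by pixel, the sketch shares the limitation described above: an input glyph in a different font or at a different scale would not line up with the stored templates.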

Feature extraction

Feature extraction breaks down or decomposes the glyphs into features such as lines, closed loops, line direction, and line intersections. It then uses these features to find the best match or the nearest neighbor among its various stored glyphs.
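The sketch below illustrates feature extraction with a nearest-neighbor lookup. It assumes three deliberately simple features (closed loops, rows that are entirely dark and columns that are entirely dark) and a three-letter reference set; the names `count_loops`, `extract`, `REFERENCE` and `nearest` are hypothetical, and real systems use far richer features.

```python
def count_loops(glyph):
    """Count enclosed background regions (holes), using 4-connectivity."""
    h, w = len(glyph), len(glyph[0])
    seen = [[False] * w for _ in range(h)]

    def flood(y, x):
        """Mark one background region; report whether it touches the border."""
        stack, touches_border = [(y, x)], False
        seen[y][x] = True
        while stack:
            cy, cx = stack.pop()
            if cy in (0, h - 1) or cx in (0, w - 1):
                touches_border = True
            for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] and glyph[ny][nx] == ".":
                    seen[ny][nx] = True
                    stack.append((ny, nx))
        return touches_border

    loops = 0
    for y in range(h):
        for x in range(w):
            if glyph[y][x] == "." and not seen[y][x] and not flood(y, x):
                loops += 1
    return loops

def extract(glyph):
    """Return (loops, fully dark rows, fully dark columns) for a glyph."""
    full_rows = sum(all(c == "#" for c in row) for row in glyph)
    full_cols = sum(all(row[x] == "#" for row in glyph) for x in range(len(glyph[0])))
    return (count_loops(glyph), full_rows, full_cols)

# Small reference glyphs; only their feature vectors are stored.
REFERENCE = {
    "O": ["###", "#.#", "###"],
    "L": ["#..", "#..", "###"],
    "H": ["#.#", "###", "#.#"],
}
STORED = {ch: extract(g) for ch, g in REFERENCE.items()}

def nearest(glyph):
    """Return the stored character whose feature vector is closest (squared distance)."""
    feats = extract(glyph)
    return min(STORED, key=lambda ch: sum((a - b) ** 2 for a, b in zip(feats, STORED[ch])))

# A taller, wider "H" than the 3x3 reference glyph.
big_h = ["#..#", "#..#", "####", "#..#", "#..#"]
result = nearest(big_h)
print(extract(big_h), result)  # (0, 1, 2) H
```

Unlike the pixel comparison used in pattern matching, the input here has a different size from the reference glyph and is still recognized, because only the derived features are compared.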

Postprocessing

After analysis, the system converts the extracted text data into a computerized file. Some OCR systems can create annotated PDF files that include both the before and after versions of the scanned document.

Optical character recognition use cases

The most well-known use case for optical character recognition (OCR) is converting printed paper documents into machine-readable text documents. Once a scanned paper document goes through OCR processing, the text of the document can be edited with a word processor like Microsoft Word or Google Docs.

OCR is often used as a hidden technology, powering many well-known systems and services in our daily life. Important but less-known use cases for OCR technology include data-entry automation, assisting blind and visually impaired persons, indexing documents for search engines, automatic number plate recognition, and reading items such as passports, invoices, bank statements and business cards.

OCR enables the optimization of big-data modeling by converting paper and scanned image documents into machine-readable, searchable PDF files. Processing and retrieving valuable information cannot be automated without first applying OCR to documents where text layers are not already present.

With OCR text recognition, scanned documents can be integrated into a big-data system that is then able to read client data from bank statements, contracts and other important printed documents. Instead of having employees examine countless image documents and manually feed inputs into an automated big-data processing workflow, organizations can use OCR to automate the input stage of data mining. OCR software can identify and extract the text in an image, save it as a text file, and typically supports formats such as JPG, JPEG, PNG, BMP, TIFF and PDF.