Correcting OCR Errors

Optical Character Recognition, commonly referred to as OCR, is the process of converting scanned images of letters and words into an electronic version. For example, you can use the Recognize Text feature in Acrobat DC to convert an image of a page into a searchable version in which you can select text, comment on it, and even edit it.

OCR is an imperfect process. While some very good originals will process at or near 100% accuracy, results will suffer if you feed Acrobat a poor-quality document. So, yes, a fax of a fax of a fax is not going to OCR well. Scanned documents may also contain handwriting, which is seldom recognized as text.
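To make "accuracy" concrete, here is a minimal Python sketch that compares OCR output against a hand-typed reference transcription. It is only one rough way to put a number on OCR quality and is not how Acrobat measures anything. The function name `character_accuracy` and the sample strings are made up for illustration.

```python
from difflib import SequenceMatcher


def character_accuracy(reference: str, ocr_output: str) -> float:
    """Return a rough similarity score between 0.0 and 1.0.

    Uses difflib's ratio (2 * matching characters / total characters),
    which approximates character-level accuracy; it is not a formal
    character error rate.
    """
    if not reference and not ocr_output:
        return 1.0
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    return matcher.ratio()


# Hypothetical samples: a hand-typed reference and two OCR results.
reference = "This Agreement is made between the parties."
clean_scan = "This Agreement is made between the parties."
poor_scan = "Th1s Agreernent is rnade betvveen the part1es."

print(f"clean scan: {character_accuracy(reference, clean_scan):.0%}")
print(f"poor scan:  {character_accuracy(reference, poor_scan):.0%}")
```

A score well below 100% on a short sample page is a hint that the document may need a closer look before you rely on search.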

OCR quality affects search quality, and that should concern legal professionals. Consider a contract that may be part of your case. Perhaps the only place your client’s name can be found in the document is in handwritten Name and Signature fields.

If you use Acrobat (or other tools) to search for your client’s name, no results will be returned. Since your client’s name is an important search term in most cases, you might want to consider correcting key documents to enhance search results.
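The sketch below illustrates why the search comes up empty: a search tool can only match what is in the text layer, and handwriting that was never recognized is simply not there. The page text, the client name "Jane Example", and the helper `pages_containing` are all hypothetical; real tools extract the text layer from the PDF rather than from a Python list.

```python
def pages_containing(term: str, page_texts: list[str]) -> list[int]:
    """Return 1-based page numbers whose text layer contains the term."""
    needle = term.casefold()
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in text.casefold()
    ]


# Hypothetical text layer after OCR: the printed labels were recognized,
# but the handwritten name and signature on page 2 were not.
ocr_pages = [
    "SERVICES AGREEMENT This Agreement is entered into by the undersigned.",
    "Name: Signature: Date:",
]

# The same document after the handwritten fields were corrected by hand.
corrected_pages = [
    ocr_pages[0],
    "Name: Jane Example Signature: Jane Example Date:",
]

print(pages_containing("Jane Example", ocr_pages))        # prints []
print(pages_containing("Jane Example", corrected_pages))  # prints [2]
```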

Fortunately, Acrobat DC includes tools to help you audit OCR quality and correct OCR errors.

Auditing OCR Quality

Acrobat offers a feature in Preflight called “Make OCR Text Visible” which can help you audit OCR quality. Here’s how to use it:

  1. OCR the document or open a previously OCR’d document.
    Tip: Choose the Enhance Scans option in the Right Hand Pane, then choose Recognize Text
  2. In the Right Hand Pane:
    1. Enter Preflight in the search field
    2. Click the Preflight tool
      000_find_preflight
  3. The Preflight window opens.
    1. In the search field, enter Make OCR
    2. Select the Make OCR text visible fixup function
    3. Click Analyze and Fix
      001_find_preflight
  4. Acrobat will ask you to rename the file. I suggest adding “_QA” to the file name.
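If you audit many files, it helps to apply the “_QA” naming convention consistently. This is a small Python sketch of that convention only; the helper name `qa_copy_name` is hypothetical, and it just builds the new file name without touching Acrobat or the file itself.

```python
from pathlib import Path


def qa_copy_name(pdf_path: str, suffix: str = "_QA") -> str:
    """Insert the suffix between the file's base name and its extension."""
    path = Path(pdf_path)
    return str(path.with_name(f"{path.stem}{suffix}{path.suffix}"))


print(qa_copy_name("contract.pdf"))  # prints contract_QA.pdf
```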

Looking at the Results

To QA the document, first open the Layers Panel in the file:

002_open_layers_panel

The Layers panel shows two layers:

In the image below, both layers are turned on which means that the original scanned image is showing.

I added a red outline to some handwritten text in the document. Do you think Acrobat will recognize the handwriting? Let’s see . . .

Click the Visible Page Content eyeball to turn the layer off:

003a_visible_layer

Now, only the OCR text is visible in the document. I’ve added a red outline to show you that Acrobat did not recognize the handwritten text.

004_invisible_text_only

Correcting OCR Text in Acrobat

Acrobat makes it possible to correct OCR errors to enhance search quality. This can be a time-consuming process, but may be worthwhile when archiving high-value documents or in situations where you can identify certain documents in a case for which you want to ensure good search results.

To correct OCR in a document:

  1. OCR the document or open a previously OCR’d document
  2. In the Right Hand Pane:
    1. Click in the Search field and type “Correct”
    2. Click Correct Recognized Text
      005_find_correction_tool
    3. The Correct Text function appears
      1. Enable Review Recognized text
      2. Select a suspect on the page. It will be highlighted in red.
      3. Enter the correct text for the error
      4. Click the Accept button
        006a_correction_steps

Your Corrections are Found

Tap CMD/CTRL-F to open the Find widget.

Once corrections are made, Acrobat will find the corrected text, even the text you have assigned to handwritten portions of the document:

008_it_is_found

Tips for Correcting Text