Correcting OCR Errors
Optical Character Recognition, commonly referred to as OCR, is the process of converting scanned images of letters and words into electronic text. For example, you can use the Recognize Text feature in Acrobat DC to convert an image of a page into a searchable version in which you can select text, comment on it, and even edit it.
OCR is an imperfect process. While some very good originals will process at or near 100% accuracy, if you feed Acrobat a poor-quality document, results will suffer. So, yes, a fax of a fax of a fax is not going to OCR well. Scanned documents may also contain handwriting, which is seldom recognized as text.
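To put a rough number on "at or near 100% accuracy", you can compare OCR output against a transcript you have checked by hand. The sketch below is only an illustration using Python's standard difflib module. The function name `ocr_similarity` and the sample strings are hypothetical, and the ratio it returns is a general similarity score, not the way Acrobat itself measures accuracy.

```python
from difflib import SequenceMatcher


def ocr_similarity(ground_truth: str, ocr_text: str) -> float:
    """Return a 0.0-1.0 similarity ratio between a hand-checked
    transcript and the text produced by OCR."""
    # autojunk=False stops difflib from ignoring very frequent
    # characters when the texts are long.
    matcher = SequenceMatcher(None, ground_truth, ocr_text, autojunk=False)
    return matcher.ratio()


# Hypothetical example: the OCR engine read the letter "o" as a zero.
score = ocr_similarity("contract", "c0ntract")
print(f"Similarity: {score:.1%}")
```

A clean original should score close to 1.0, while a fax of a fax of a fax will score noticeably lower.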
OCR affects search quality and that should be a concern to legal professionals. Consider a contract that may be part of your case. Perhaps the only place your client’s name can be found in the document is in handwritten Name and Signature fields.
If you use Acrobat (or another tool) to search for your client’s name, no results will be returned, because the handwritten name was never recognized as text. Since your client’s name is an important term for most cases, you might want to consider correcting key documents to enhance search results.
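If you have already exported the text layer of your documents, a short script can flag which files never mention a key term and are therefore candidates for correction. This is a minimal sketch under assumptions: the function name `missing_terms`, the file names, and the sample text are all hypothetical, and it expects the text to have been extracted by some other tool first.

```python
def missing_terms(documents: dict[str, str], terms: list[str]) -> dict[str, list[str]]:
    """Map each document name to the key terms its text layer lacks.

    `documents` maps a file name to the text already extracted from it.
    Documents that contain every term are left out of the result.
    """
    report = {}
    for name, text in documents.items():
        haystack = text.casefold()  # case-insensitive comparison
        absent = [term for term in terms if term.casefold() not in haystack]
        if absent:
            report[name] = absent
    return report


# Hypothetical text layers, as if exported from two OCR'd files.
sample = {
    "contract.pdf": "This agreement is made between Acme Corp and the undersigned.",
    "cover_letter.pdf": "Dear Jane Doe, please find the signed contract enclosed.",
}
print(missing_terms(sample, ["Jane Doe", "Acme Corp"]))
```

Here the contract is reported as missing "Jane Doe", which is what you would expect if the name appears only in a handwritten signature block.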
Fortunately, Acrobat DC includes tools to help you audit OCR quality and correct OCR errors.
Auditing OCR Quality
Acrobat offers a feature in Preflight called “Make OCR Text Visible” which can help you audit OCR quality. Here’s how to use it:
- OCR the document or open a previously OCR’d document.
Tip: Choose the Enhance Scans option in the Right Hand Pane, then choose Recognize Text.
- In the Right Hand Pane:
- Enter Preflight in the search field
- Click the Preflight tool
- The Preflight window opens.
- In the search field, enter Make OCR
- Select the Make OCR text visible fixup function
- Click Analyze and Fix
- Acrobat will ask you to rename the file. I suggest adding “_QA” to the file name.
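If you audit many files, it helps to apply the “_QA” naming suggestion consistently. The helper below is a small sketch using Python's standard pathlib module. The function name `qa_copy_name` is hypothetical, and the script only builds the new name. It does not rename anything or interact with Acrobat.

```python
from pathlib import Path


def qa_copy_name(original: str, suffix: str = "_QA") -> str:
    """Insert a suffix between a file's base name and its extension."""
    path = Path(original)
    # with_name() swaps only the final path component and keeps the folder.
    return str(path.with_name(f"{path.stem}{suffix}{path.suffix}"))


print(qa_copy_name("contract.pdf"))  # contract_QA.pdf
```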
Looking at the Results
To QA the document, first open the Layers panel in the file:
The Layers panel shows two layers:
- Invisible text
- Visible Page Content
In the image below, both layers are turned on, which means that the original scanned image is showing.
I added a red outline to some handwritten text in the document. Do you think Acrobat will recognize the handwriting? Let’s see . . .
Click the Visible Page Content eyeball to turn the layer off:
Now, only the OCR text is visible in the document. I’ve added a red outline to show you that Acrobat did not recognize the handwritten text.
Correcting OCR Text in Acrobat
Acrobat makes it possible to correct OCR errors to enhance search quality. This can be a time-consuming process, but may be worthwhile when archiving high-value documents or in situations where you can identify certain documents in a case for which you want to ensure good search results.
To correct OCR text in a document:
- OCR the document or open a previously OCR’d document
- In the Right Hand Pane:
- Click in the Search field and type “Correct”
- Click Correct Recognized Text
- The Correct Text function appears
- Enable Review Recognized text
- Select a suspect (a word Acrobat has flagged as possibly misrecognized) on the page. It will be highlighted in red.
- Enter the correct text for the error
- Click the Accept button
Your Corrections are Found
Press CMD/CTRL-F to open the Find widget.
Once corrections are made, Acrobat will find the corrected text, even the text you have assigned to handwritten portions of the document:
Tips for Correcting Text
- You can toggle “Review Recognized Text” on or off to see the original scanned text.
- You can make all corrections “mouse free”. Simply press TAB to move the cursor to the correction text field and Enter to accept.
- Your document may contain artifacts such as smudges or marks that Acrobat could mistake for text. Simply clear the correction text field and Acrobat will show “This is not text” in the correction field:
- You can add Preflight steps such as “Make OCR Text Visible” and other steps mentioned in this article to Actions, which let you automate multi-step processes.