Is that PDF Searchable?
Most law firms and even solos have a scanner that can create PDFs from paper documents. Overwhelmingly, these devices create image-only, non-searchable PDFs.
Using Optical Character Recognition (OCR), Acrobat can add an invisible layer of searchable text while maintaining the original appearance.
The resulting searchable file is referred to as an image+text PDF.
An image+text PDF looks no different from a PDF that is not searchable. That creates a problem.
How can you tell if a PDF is searchable or not?
Check Manually for Text
Acrobat will discover there is no text on the page and ask you to OCR the document if you undertake any of the following:
- Search using Full Acrobat Search
  Edit—>Search
- Read Out Loud operation
  View—>Read Out Loud
- Select All
  Edit—>Select All or Ctrl-A
- Open an Image-only PDF
Try one of the above and Acrobat will offer this message:
If you don’t see this warning, choose Edit—>Preferences and click the General category. Click the Reset Warnings button. Many users click the “Do Not Show Again” option, but this feature can be useful.
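If you would rather test a file outside Acrobat, here is a minimal sketch of a rough heuristic. It rests on an assumption that is mine, not Acrobat's: a scan with no text layer usually references no fonts, so a file with no "/Font" marker is probably image-only. The function names are hypothetical, and the check can be fooled by encrypted files or by scanners that write an empty font entry, so treat the answer as a hint rather than a verdict.

```python
import re
import zlib

# Captures the raw bytes between "stream" and "endstream" in each PDF object.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def probably_has_text(pdf_bytes: bytes) -> bool:
    """Heuristic only: True if the file appears to reference any font."""
    if b"/Font" in pdf_bytes:
        return True
    # Newer PDFs can pack their dictionaries into Flate-compressed object
    # streams, so the marker may only show up after decompression.
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data, for example a JPEG page image
        if b"/Font" in data:
            return True
    return False


def is_probably_searchable(path: str) -> bool:
    """Read a PDF from disk and apply the heuristic above."""
    with open(path, "rb") as handle:
        return probably_has_text(handle.read())
```

Call `is_probably_searchable("scan.pdf")` with a path of your own. A result of False suggests the file is a candidate for OCR.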
Searchability Report
Although not designed specifically for checking for searchable text, Acrobat’s Accessibility Checker can provide reporting which may be useful.
To run an Accessibility report:
- Choose Advanced—>Accessibility—>Full Check
- Enable Report if it is not already enabled
- Deselect all settings except:
– Text language…
– Reliable character encoding…
- Click the Start Checking button
An image-only PDF will produce the following result:
Checking for Searchability on Many Documents
Acrobat 8’s Batch Processing feature may be used to check for the presence of text on a great many documents.
To create a Batch Process for checking documents:
- Choose Advanced—>Document Processing—>Batch Processing
- Click the New Sequence button
- Give the sequence a name:
- Click the Select Commands button:
- Choose Accessibility Checker from the list on the left.
  Click the Add button.
- Double-click Accessibility Checker from the list on the right.
- Disable all checkboxes except the following in the Accessibility Checker window:
– Create Accessibility Report
  You may wish to choose a folder to contain the individual reports
– Text Language is Specified
– Reliable Character Encoding is provided
- Click the Start Checking button
  Exits from the window
- Click OK
- Click Close
To run the Batch Sequence:
- Choose Advanced—>Document Processing—>Batch Processing
- Select the Sequence you created
- Click the Run Sequence button
- Click OK to acknowledge the actions that will take place in the Run Sequence Confirmation window
- Browse to find the folder and files you wish to check and click the Select button
- Browse and select a folder destination for the document reports
- The Warning and Errors window appears:
- This window is the only way to view a consolidated report on all the files. Unfortunately, you can’t save the contents of this window.
- Click OK
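Since the Warning and Errors window can't be saved, a small script can produce a summary you can keep. The sketch below works outside Acrobat: it walks a folder, applies a crude stand-in check (any "/Font" marker in the raw bytes counts as text, which is my assumption and not how Acrobat decides), and writes the results to a CSV file. The names `survey_folder` and `raw_font_check` are hypothetical, and you can pass in a stricter check of your own.

```python
import csv
from pathlib import Path
from typing import Callable


def raw_font_check(pdf_bytes: bytes) -> bool:
    # Crude stand-in (assumption): any "/Font" marker counts as text.
    return b"/Font" in pdf_bytes


def survey_folder(folder: str, report_csv: str,
                  check: Callable[[bytes], bool] = raw_font_check) -> list:
    """Check every PDF under `folder` and save a file/status summary."""
    rows = []
    for path in sorted(Path(folder).rglob("*")):
        # Compare the extension in lowercase so "SCAN.PDF" is included too.
        if not path.is_file() or path.suffix.lower() != ".pdf":
            continue
        status = "has text" if check(path.read_bytes()) else "image-only?"
        rows.append((str(path), status))
    with open(report_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["file", "status"])
        writer.writerows(rows)
    return rows
```

The resulting CSV opens in any spreadsheet program, so you can sort by status and hand the "image-only?" list to whoever runs OCR.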
Looking at Reports
The document reports may be found at the location you specified when using the Batch Sequence.
The reports are standard HTML. Acrobat will produce one report per document. If you double-click on them, they will open in your default browser.
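Because there is one HTML report per document, opening them one at a time gets tedious on a large batch. Here is a minimal sketch that reads every report in a folder and notes whether a phrase of your choosing appears in it. I am not assuming any particular wording in Acrobat's reports: open one report from an image-only file, copy the error phrase you see, and pass it in as `phrase`. The function names are hypothetical.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextOnly(HTMLParser):
    """Collects the text between tags and ignores the markup itself."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def report_text(html: str) -> str:
    """Return the visible text of an HTML report with whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def flag_reports(report_dir: str, phrase: str) -> dict:
    """Map each report file name to True if `phrase` appears in its text."""
    flagged = {}
    for path in sorted(Path(report_dir).glob("*.htm*")):
        text = report_text(path.read_text(encoding="utf-8", errors="replace"))
        flagged[path.name] = phrase.lower() in text.lower()
    return flagged
```

Printing the names that map to True gives you a saved, consolidated list of the documents that failed.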
Caveats and Final Thoughts
Accessibility Reporting is only useful for detecting image-only PDFs.
- If a PDF contains an image-only page and an image+text page or PDF Normal page, you will find the results confusing.
- The Accessibility Checker won’t help if the page contains any renderable text. For example, some Federal Courts deliver image-only PDFs that are Bates stamped. These pages will not cause errors.
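Both caveats come down to looking at pages one at a time rather than at the document as a whole. The sketch below assumes you already have a count of text characters for each page from some text-extraction tool (not shown here). It labels pages with only a little text as possible stamp-only pages. The 30-character limit and the function names are hypothetical, so tune the limit against your own Bates-stamped files.

```python
def classify_pages(char_counts, stamp_limit=30):
    """Label each page by how much text it carries."""
    labels = []
    for count in char_counts:
        if count == 0:
            labels.append("image-only")
        elif count <= stamp_limit:
            # A handful of characters is more likely a stamp than body text.
            labels.append("stamp-only?")
        else:
            labels.append("searchable")
    return labels


def summarize(char_counts, stamp_limit=30):
    """Reduce the per-page labels to one verdict for the document."""
    if not char_counts:
        return "no pages"
    labels = set(classify_pages(char_counts, stamp_limit))
    if labels == {"searchable"}:
        return "searchable"
    if "searchable" in labels:
        return "mixed"
    return "needs OCR"
```

For example, `summarize([1500, 0])` reports a mixed document, while `summarize([12, 9])` flags a stamped scan as needing OCR even though it contains some text.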
If you want all your documents to be searchable, your best bet is to OCR all of them. Check out my blog article on Batch OCR.
Documents with searchable text will be skipped automatically.