How can I detect if a PDF needs to be OCRd?
You just received 1000 PDFs from the other side which are a mix of PDFs created from Office applications and scans. Some of the documents might have been OCRd and some not.
How can you quickly detect which files need to be OCRd?
Further, how can you pull out and separate searchable and non-searchable PDFs?
I have written on this subject previously in my article “Is that PDF Searchable?” That post included information on how to test if individual documents are searchable and offered a basic way to detect searchability across files.
Why is detecting searchability hard? When would you call a PDF searchable? When one word is searchable? When 100 words are searchable? When a page is searchable? When all the pages are searchable? What about pictures or text inside of pictures?
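One way to make those questions concrete is to write the answer down as an explicit policy. The sketch below is only an illustration: the `SearchabilityPolicy` class, its thresholds, and the `classify` function are hypothetical names of my own, and the per-page word counts are assumed to come from whatever text extraction tool you use.

```python
from dataclasses import dataclass


@dataclass
class SearchabilityPolicy:
    # Hypothetical thresholds; tune them to your own review standards.
    min_words_per_page: int = 1      # words needed before a page counts as searchable
    min_page_fraction: float = 1.0   # share of pages that must be searchable


def classify(words_per_page: list[int], policy: SearchabilityPolicy) -> str:
    """Return 'searchable', 'partial', or 'not searchable' for one document."""
    if not words_per_page:
        return "not searchable"
    searchable_pages = sum(1 for n in words_per_page if n >= policy.min_words_per_page)
    if searchable_pages == 0:
        return "not searchable"
    fraction = searchable_pages / len(words_per_page)
    return "searchable" if fraction >= policy.min_page_fraction else "partial"


# A three-page document whose middle page is an image-only scan.
print(classify([120, 0, 95], SearchabilityPolicy()))
```

With the default policy (every page must contain at least one word), the three-page example is reported as "partial". Loosening `min_page_fraction` to 0.5 would call the same document "searchable", which is exactly why two people can disagree about the same file.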
I’ve been doing some research and in this article I offer up another way to check for searchable text.
To accomplish this, we will use the Preflight feature of Acrobat Pro. Acrobat’s Preflight feature offers hundreds of different tests including the ability to check for characters on the page. Preflight can be used on a single document or it can be automated using a batch sequence.
The following workflow isn’t perfect, but I offer it here to legal professionals who want to experiment with it.
In this article, you’ll learn how to create a Batch Sequence to run across folders of files which will:
- Separate searchable PDFs from non-searchable PDFs and place them in named folders
- Ignore non-PDF documents
- Create a Summary Report of searchability
Acrobat Pro’s Preflight feature includes several hundred detection and fix-up operations.
Leonard Rosenthal, Adobe’s Standards Evangelist, wrote about using Preflight to check for searchable text in his blog post on Checking a PDF for Searchable Text.
In that post, Leonard offered a custom preflight profile which tested for searchable text.
One oddity of how Preflight works is that when conditions are met, an error is generated. Leonard’s Preflight Profile, in fact, errors when searchable text is found.
I certainly found this confusing at first, but once taken into account, it is very powerful indeed.
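If the inverted logic trips you up, it may help to see it written as a lookup. This is a minimal sketch, not anything Acrobat exposes: the function name and the "success" and "error" labels are my own shorthand for the two Preflight outcomes.

```python
def destination_folder(preflight_result: str) -> str:
    """Map a Preflight outcome to the folder the PDF should land in.

    The profile raises an error when it FINDS searchable text, so
    'error' means searchable and 'success' means no text was found.
    """
    mapping = {
        "success": "Not Searchable",  # check passed: no text found, so OCR is needed
        "error": "Searchable",        # check "failed": text was found
    }
    try:
        return mapping[preflight_result]
    except KeyError:
        raise ValueError(f"unexpected Preflight result: {preflight_result!r}")


print(destination_folder("error"))
```

The point to remember is that an "error" from this profile is good news: the file already contains text.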
Read the Entire Article First: Before you go crazy loading up those new discovery documents you just received, read the entire article, including the caveats at the end.
Step 1: Import the Preflight Profile
To test for searchability, you will need to import a Preflight Profile.
- Download Leonard’s Text Searchable Preflight Profile from the link below:
Text Searchable.kfp (4K) (new link, 09/17)
- Open Acrobat Pro and any PDF document
(Preflight won’t be available until a document is open)
- Choose Advanced > Preflight
- Click the Options button
- Choose _Import Preflight Profile . . ._ from the Options menu
- Locate the Text Searchable.kfp file you downloaded in Step 1 above and click the Open button.
- Close the Preflight Window
Step 2: Create Sorting Folders
You will need to create two folders; one for the PDFs which are searchable and one for the PDFs that are not searchable.
- Create a new folder on your desktop. Change the folder name to “Searchable”
- Create a new folder on your desktop. Change the folder name to “Not Searchable”
Source File Folder: In case you haven’t already, copy all of your source files (the PDF files you want to check) to a folder on your hard drive. Sub-folders within the main folder are not a problem.
Step 3: Create a Batch Sequence
In this step, you’ll create a batch sequence that calls Preflight and places PDFs in the folders you created in Step 2.
- Choose Advanced > Document Processing > Batch Processing
- Click the New Sequence button
- Give the Sequence a name such as “No Searchable Text”
- Click the Select Commands . . . button
- The Select Commands window opens.
A) From the list at left, choose Preflight
B) Click the **Add>** button
C) Click the Edit . . . button
- The Preflight Setup Window opens:
A) Choose “Text can be searched” from the list
B) In _On success_, enable the checkbox and select the Move PDF file option
C) Use the Success folder button to choose your “Not Searchable” folder
D) In _On error_, enable the checkbox and select the Move PDF file option
E) Use the Error folder button to choose your “Searchable” folder
F) Enable reporting
G) Click the Save button
- Click OK to go back to the main Edit Sequence window.
- In the Edit Sequence window:
A) Change “Run Commands on” to Selected Folder
B) Click the Browse button and locate your source file folder
C) Click Source File Options . . . and deselect all the file types
See Source File Options window picture below
D) Set the Output location to “Same folder as originals”
- Click OK to exit the Edit Batch Sequence window
Step 4: Run the Batch Sequence
By now, you should have gathered up all your source files into the source file folder you selected while creating the batch sequence in Step 3. You also have folders on your desktop ready to receive searchable and non-searchable files.
Here’s how to run the sequence:
- Choose Advanced > Document Processing > Batch Processing
A) Select the “No Searchable Text” Sequence
B) Click the Run Sequence button
- The Batch Sequence will run. If you have thousands of files, you will need to get a beverage.
Acrobat will move the files into your Searchable and Not Searchable folders. In the case of a duplicate filename, Acrobat will append a sequence number and won’t overwrite the original.
When complete, a report will be generated.
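For readers who like to see the logic spelled out, here is a rough sketch of the same sorting behavior outside of Acrobat. It is not how Acrobat works internally. The `has_text` argument is a placeholder for whatever searchability test you trust, the function names are hypothetical, and the `_1`, `_2` suffix is my own choice for the sequence number. It also assumes the two destination folders live outside the source folder.

```python
import shutil
from pathlib import Path
from typing import Callable


def unique_destination(folder: Path, name: str) -> Path:
    """Return a path in folder that does not overwrite an existing file."""
    candidate = folder / name
    stem, suffix = Path(name).stem, Path(name).suffix
    counter = 1
    while candidate.exists():
        candidate = folder / f"{stem}_{counter}{suffix}"
        counter += 1
    return candidate


def sort_pdfs(source: Path, searchable_dir: Path, not_searchable_dir: Path,
              has_text: Callable[[Path], bool]) -> dict[str, int]:
    """Move every PDF under source into one of two folders and count the results."""
    counts = {"searchable": 0, "not searchable": 0, "skipped": 0}
    searchable_dir.mkdir(parents=True, exist_ok=True)
    not_searchable_dir.mkdir(parents=True, exist_ok=True)
    # Snapshot the file list first so moving files does not disturb the walk.
    for path in sorted(p for p in source.rglob("*") if p.is_file()):
        if path.suffix.lower() != ".pdf":
            counts["skipped"] += 1  # non-PDF documents are left where they are
            continue
        if has_text(path):
            target_dir, key = searchable_dir, "searchable"
        else:
            target_dir, key = not_searchable_dir, "not searchable"
        shutil.move(str(path), str(unique_destination(target_dir, path.name)))
        counts[key] += 1
    return counts
```

The returned counts are a bare-bones stand-in for the summary report. Sub-folders are walked as well, which is why two files named a.pdf in different sub-folders end up as a.pdf and a_1.pdf in the same destination.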
Step 5: Looking at the Report and Some Caveats
The process will generate a PDF Portfolio report that looks like this:
The cover sheet for the PDF Portfolio contains summary information about all the files.
Looking closer at this report, you can see which files passed the test (meaning they were not searchable; I know that sounds weird, but that’s how it works) and which didn’t:
If you click on the filename link, it will open the file.
Caveats: Understanding the Report
The yellow section above shows you how “texty” the file is. In the example above, 80 text blocks were found in the file. That’s not the same as 80 words, however. Depending on how the PDF was created, text could be grouped by word, sentence, paragraph or page.
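A toy example shows why a block count tells you little about the word count. The groupings below are made up for illustration and do not come from any real PDF producer; `block_stats` is a hypothetical helper.

```python
def block_stats(blocks: list[str]) -> tuple[int, int]:
    """Return (number of text blocks, total number of words) for a list of blocks."""
    return len(blocks), sum(len(block.split()) for block in blocks)


sentence = "The quick brown fox jumps over the lazy dog."

# The same nine words, grouped three different ways.
by_word = sentence.split()
by_pair = [" ".join(by_word[i:i + 2]) for i in range(0, len(by_word), 2)]
by_sentence = [sentence]

for label, blocks in [("per word", by_word), ("per pair", by_pair), ("per sentence", by_sentence)]:
    block_count, word_count = block_stats(blocks)
    print(f"{label}: {block_count} text blocks, {word_count} words")
```

The same sentence comes out as 9, 5, or 1 text blocks depending on the grouping, while the word count stays at 9 in every case.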
This process does a good job helping you identify documents which have zero searchable words, especially those that were raw image-only scanned PDFs. If the report shows a checkmark, then you will want to OCR the document.
The process is not effective in these circumstances:
- The PDF contains a mix of searchable and non-searchable pages. For example, this happens if you combine a PDF output from Word with one you scanned.
- The PDF hasn’t been OCRd, but has been Bates Stamped.
Bates Stamps are searchable text and would be reported.
- The PDF contains vector text that was converted to outlines. Since the characters were converted to graphics, they won’t be searchable. You’ll sometimes see files like this from CAD systems, illustration tools, or page layout programs.
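The first two caveats can be worked around if you are able to look at text page by page rather than per file. The sketch below assumes you already have the extracted text for each page from some other tool. The Bates pattern is purely hypothetical (a short uppercase prefix followed by four or more digits), so adjust it to the numbering used in your production. It does nothing for the outlined-text caveat.

```python
import re

# Hypothetical Bates pattern: 2-6 uppercase letters, an optional separator,
# then 4 or more digits, e.g. "ABC000123". Real numbering schemes vary.
BATES_PATTERN = re.compile(r"\b[A-Z]{2,6}[-_ ]?\d{4,}\b")


def pages_needing_ocr(page_texts: list[str]) -> list[int]:
    """Return 1-based page numbers whose only text (if any) is a Bates stamp."""
    needs_ocr = []
    for number, text in enumerate(page_texts, start=1):
        remaining = BATES_PATTERN.sub("", text).strip()
        if not remaining:
            needs_ocr.append(number)
    return needs_ocr


pages = [
    "This agreement is made between the parties. ABC000123",  # born-digital page
    "ABC000124",                                               # scan with a stamp only
    "",                                                        # scan with no text at all
]
print(pages_needing_ocr(pages))
```

Here pages 2 and 3 are flagged for OCR even though the file as a whole contains searchable text.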
But, I want to be sure . . .
Given the caveats above, you might just want to OCR everything. Follow the instructions in my Batch OCR using Acrobat Professional article.
Acrobat will skip any pages which don’t need to be OCRd and process any pages that do.