Troubleshooting Acrobat OCR

Searchable PDFs are critical in litigation and matter management. Using Acrobat’s OCR function, you can turn mountains of paper into searchable PDFs that look just like the original.

Occasionally, you may run into some issues.

Read on to learn about some workarounds and key considerations.

Acrobat OCR Troubleshooting

Acrobat OCR generally works well, but occasionally you might run into the following problems:

Slow Processing

Solutions:

**Read and Write Locally**
Make sure your source files are read from, and the OCR output files are written to, local volumes. Reading and writing to the network or from a CD or DVD is much slower. If you are short on space, try using an external USB 2.0 drive.
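If your files live on a network share, one way to follow this advice is to copy them to a local working folder before running OCR. The sketch below is only an illustration: `stage_locally` and the folder names are hypothetical, and it assumes the files have unique names.

```python
import shutil
from pathlib import Path


def stage_locally(sources, work_dir):
    """Copy each source file into a local work_dir and return the copies.

    Assumption: file names are unique; two sources with the same name
    would overwrite each other in work_dir.
    """
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    staged = []
    for src in sources:
        dest = work / Path(src).name
        shutil.copy2(src, dest)  # copy2 also preserves file timestamps
        staged.append(dest)
    return staged
```

You would then point Acrobat at the local copies and move the finished PDFs back to the network afterwards.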

**Input Resolution**
Have you scanned above 300 dpi? 600 dpi? There are diminishing returns on OCR accuracy above 600 dpi.

**Output Resolution**
I generally recommend that you downsample after OCR. For example, scanning at 600 dpi yields slightly better accuracy than scanning at 300 dpi, but downsampling back to 300 dpi to make a smaller PDF can add 20% or more to your conversion times.
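To see why resolution matters so much for processing time, it helps to count pixels. The short calculation below assumes a letter-sized page (8.5 by 11 inches); the function name is just for illustration.

```python
LETTER_INCHES = (8.5, 11.0)  # assumed page size: US letter


def pixel_count(dpi, size_inches=LETTER_INCHES):
    """Number of pixels in a page scanned at the given resolution."""
    width_in, height_in = size_inches
    return round(width_in * dpi) * round(height_in * dpi)


for dpi in (300, 600):
    print(dpi, "dpi:", pixel_count(dpi), "pixels per page")

print("ratio:", pixel_count(600) / pixel_count(300))
```

Doubling the resolution from 300 to 600 dpi quadruples the pixel count (8,415,000 versus 33,660,000 per page), so there is four times as much image data to read, analyze and downsample.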

**Did you scan in color or greyscale?**
Scanning black and white documents in color mode results in dramatically bigger files. Acrobat cannot convert color documents to black and white. (Adobe Photoshop can, and in batch if you need to.) An image-only, black and white, letter-sized document should almost never be more than a 50K PDF if properly compressed. If your PDFs are a lot bigger than this, check your scanning settings.
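A quick way to apply the 50K rule of thumb is to divide the file size by the page count. This sketch assumes the 50K figure applies per page and means 50 × 1024 bytes; the function names and sample numbers are made up for illustration, and you supply the page count yourself.

```python
PAGE_LIMIT_BYTES = 50 * 1024  # assumption: "50K" read as 50 KB per page


def average_kb_per_page(size_bytes, pages):
    """Average size of one page in kilobytes."""
    if pages <= 0:
        raise ValueError("pages must be a positive number")
    return size_bytes / pages / 1024


def looks_oversized(size_bytes, pages, limit_bytes=PAGE_LIMIT_BYTES):
    """True when the average page is larger than the rule-of-thumb limit."""
    return size_bytes / pages > limit_bytes


# Hypothetical example: two 10-page black and white scans
print(average_kb_per_page(2_000_000, 10), looks_oversized(2_000_000, 10))
print(average_kb_per_page(400_000, 10), looks_oversized(400_000, 10))
```

The first file averages about 195 KB per page, which suggests it was scanned in color or greyscale; the second, at about 39 KB per page, is in the expected range.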

Large PDFs

Solutions:

**Scan in Black and White**
To limit the size of your PDFs, make sure you do not scan in color.

**Use the PDF Optimizer in Acrobat Professional**
Take advantage of JBIG2 lossy compression to create smaller PDFs. Most incoming image-only PDFs use CCITT Group 4 fax compression, which was designed for fax machines with limited processing power. It was great technology ... in 1980. Choose Advanced > PDF Optimizer.

**Use Optimize Scanned PDF in Acrobat Standard**
This new feature of Acrobat 8 makes it easy to reduce the file size of scanned images. It can also deskew scanned pages and remove dirt and other artifacts. Choose Document > Optimize Scanned PDF.

**Scan to Size**
If your scanner supports it, choose automatic page size if you regularly scan documents smaller than 8.5 by 11 inches. Remember that PDF documents can support multiple page sizes. Scanning a business card at letter size makes a larger file.
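The business card example is easy to quantify. The calculation below assumes a standard US business card of 3.5 by 2 inches (my assumption, not a scanner setting) and compares its area to a letter page.

```python
LETTER = (8.5, 11.0)       # inches
BUSINESS_CARD = (3.5, 2.0)  # inches; assumed standard US card size


def area_ratio(large, small):
    """How many times more area the large page has than the small one."""
    return (large[0] * large[1]) / (small[0] * small[1])


print(round(area_ratio(LETTER, BUSINESS_CARD), 1))
```

A letter-sized scan covers roughly 13 times the area of the card. Compressed file size will not grow by exactly that factor, since blank areas compress well, but the scanner and OCR engine still have to process all of those extra pixels.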

Slow Scanning

Solutions:

**Buy a Faster Scanner**
If you are using a scanner that is more than three years old, it may be time to upgrade. Newer units are dramatically faster. Consider buying a dedicated document scanner. I like the Fujitsu ScanSnap (about $400 street price), which includes a full version of Acrobat Standard. The Fujitsu can scan 15 double-sided pages per minute directly to PDF Image-only format! The input bin can hold 50 pages. The downside with the ScanSnap is that it is not TWAIN or ISIS compliant (these are two standard methods that applications use to communicate with scanners), so it cannot be used directly from Create PDF from Scanner in Acrobat or with Photoshop, etc.

I also wrote an article about the Canon DR-2580C. This scanner may be used directly from Acrobat and works particularly well with Acrobat 8. The DR-2580C scans at 25 double-sided pages per minute.
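To put those speeds in perspective, here is a rough estimate for a hypothetical job of 1,000 double-sided sheets. It treats the quoted rates as sheets per minute and ignores the time spent reloading the feeder, so real jobs will take longer.

```python
import math


def scan_minutes(sheets, sheets_per_minute):
    """Pure feed time in minutes, ignoring reloads and jams."""
    return sheets / sheets_per_minute


def feeder_loads(sheets, bin_capacity=50):
    """How many times the input bin must be filled (50-page bin assumed)."""
    return math.ceil(sheets / bin_capacity)


job = 1000  # hypothetical job size in sheets
print("15 per minute:", round(scan_minutes(job, 15), 1), "minutes")
print("25 per minute:", round(scan_minutes(job, 25), 1), "minutes")
print("feeder loads:", feeder_loads(job))
```

At 15 sheets per minute the job takes about 67 minutes of feed time; at 25 it takes 40. With a 50-page input bin you would also reload the feeder 20 times.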

**Use a Scanning Service Bureau**
Send out those bankers boxes of documents to a local scanning provider. They can return Image-only PDFs for you to OCR. If your service bureau offers OCR’d PDFs, make sure you test them first. In many cases, we’ve found that selecting OCR’d text on the PDF is iffy. Ask them what kind of image compression they use. Test the documents to see if they are tagged. Most times, you’ll get better results OCRing in Acrobat.

Acrobat Won’t OCR Your File Because It Contains Renderable Text

You’ll see the Renderable Text Error when the PDF you are trying to OCR has vector elements on it like stamps, annotations or Bates Numbers. It’s a particular problem with federal court files that are image-only PDFs with stamped Bates numbers.

Solutions:

**Install Acrobat 8.1 or Higher**
Acrobat 8.1 allows OCRing of documents which contain vector elements within margins defined as 20% of the width/height of the page. See my Acrobat 8.1 Article for more details. This fix accommodates almost all Bates numbered PDFs received from the courts.
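The 20% margin rule can be pictured as four bands around the edge of the page. The sketch below is my own reading of that rule, not Adobe's actual logic: it assumes an element qualifies when its bounding box sits entirely inside one of the bands, with coordinates in points measured from the bottom-left corner.

```python
def within_margins(bbox, page_width, page_height, fraction=0.20):
    """True if bbox (x0, y0, x1, y1) lies entirely inside one margin band.

    Assumption: each band is `fraction` of the page width (left/right)
    or height (top/bottom); this is an interpretation of the 8.1 rule.
    """
    x0, y0, x1, y1 = bbox
    band_x = fraction * page_width
    band_y = fraction * page_height
    in_left = x1 <= band_x
    in_right = x0 >= page_width - band_x
    in_bottom = y1 <= band_y
    in_top = y0 >= page_height - band_y
    return in_left or in_right or in_bottom or in_top


# Hypothetical elements on a letter page (612 x 792 points)
bates_number = (450, 20, 590, 40)    # bottom-right corner
center_stamp = (250, 350, 350, 450)  # middle of the page
print(within_margins(bates_number, 612, 792))
print(within_margins(center_stamp, 612, 792))
```

Under this reading, a Bates number in the bottom corner falls inside the margin and would not block OCR, while a stamp in the middle of the page still would.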

**Remove Headers and Footers or Bates Numbers**
Go to Document > Add Headers and Footers and remove all entries. This solution only works if the headers, footers or Bates numbers were stamped using Acrobat.

**Remove the Headers and Footers Manually**
You can select and delete the vector elements by choosing Tools > Advanced Editing > TouchUp Object Tool in Acrobat Professional.

Another option is to use the Redaction Tools in Acrobat 8 Professional to remove them.

Error When Using Batch OCR

You may encounter this error when you run Batch OCR (Acrobat Professional only) together with the PDF Optimizer:

Warning Message from PDF Optimizer

- “Settings which allow retaining the document’s PDF Version, cannot be processed.”

Solution:

**Set the PDF Optimizer to Version 5 or Higher**
Note that Image+Text PDFs are supported only in Acrobat 5 (PDF 1.4) and higher. Do not use the Retain Existing option.
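If you want to know which files in a batch are older than PDF 1.4 before you start, you can read the version from the first line of each file, which looks like `%PDF-1.3`. This is a minimal sketch with illustrative function names; it only inspects the header, and a PDF can declare a newer version elsewhere in the file, so treat the result as a hint.

```python
import re

# Acrobat releases and the PDF version each one introduced
ACROBAT_TO_PDF = {4: "1.3", 5: "1.4", 6: "1.5", 7: "1.6", 8: "1.7"}


def header_version(data):
    """Return (major, minor) from the %PDF-x.y header, or None if absent."""
    match = re.search(rb"%PDF-(\d)\.(\d)", data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def meets_pdf_1_4(data):
    """True when the header declares PDF 1.4 (Acrobat 5) or higher."""
    version = header_version(data)
    return version is not None and version >= (1, 4)


# Usage with the first bytes of a file:
# with open("scan.pdf", "rb") as f:   # "scan.pdf" is a placeholder name
#     print(meets_pdf_1_4(f.read(1024)))
```

Files that report 1.3 or lower are the ones where Retain Existing will trigger the warning, because the optimizer cannot keep the old version and still write Image+Text output.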