Comparing Scanned Documents: Tips and Workarounds
Acrobat Pro includes tools that allow you to compare two PDFs to quickly spot the differences.
In Acrobat X Pro, the Document Comparison feature is available by choosing View>Compare Documents.
For a brief demonstration of this feature, click here.
Recently, I received an email from a law firm that was having trouble comparing two PDFs. The firm reported that Acrobat couldn’t find any differences between the documents, even though differences were plainly visible on inspection.
I was able to examine the documents and discovered that:
- The source documents were scanned documents
- One document was scanned in black and white. The other was scanned in greyscale.
- Documents were OCR’d
The difference in color space (black and white vs. greyscale) was enough to seriously affect Acrobat’s ability to detect changes. In effect, Acrobat saw these as being two completely different documents.
That’s probably a bug (I reported it), but we all still have a job to do. Fortunately, I was able to come up with a workaround.
Control the Scanning
When you can, make sure that documents you need to compare are scanned using identical settings, including the same color mode.
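If you keep a record of how each document was scanned, a quick check can flag mismatched settings before you run a comparison. The sketch below is only an illustration in Python: the `ScanSettings` record, its field names, and the sample values are assumptions made for this example, not anything Acrobat exposes.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ScanSettings:
    """Settings noted for one scan (hypothetical fields, not an Acrobat API)."""
    color_mode: str  # e.g. "black_and_white" or "greyscale"
    dpi: int         # scan resolution in dots per inch


def setting_mismatches(a: ScanSettings, b: ScanSettings) -> List[str]:
    """Describe every setting that differs between two scans."""
    problems = []
    for field in fields(ScanSettings):
        left, right = getattr(a, field.name), getattr(b, field.name)
        if left != right:
            problems.append(f"{field.name}: {left} vs. {right}")
    return problems


# Sample values made up for illustration.
original = ScanSettings(color_mode="black_and_white", dpi=300)
revised = ScanSettings(color_mode="greyscale", dpi=300)
print(setting_mismatches(original, revised))
# ['color_mode: black_and_white vs. greyscale']
```

An empty list means the two scans were made with the same recorded settings. A check like this only helps when you control the scanning yourself.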
That isn’t always possible if you receive documents from the other side or via the court. What to do?
ClearScan to the Rescue
Acrobat offers different “flavors” of OCR. The type used most frequently by legal professionals is Searchable Image, sometimes referred to as “image+text”. When Searchable Image OCR is performed on a scanned document, the original scanned image is left in place in the PDF and an invisible layer of searchable text is added.
Acrobat 9 introduced a new OCR option called ClearScan, designed to enhance the quality of the document post-scan.
ClearScan works by turning the images which represent text characters on the page into smoothed vector outlines. Each character on the page is compared and all matching characters are replaced with an outline character:
[Image: 800% View in Acrobat, 300 dpi scan]
You can read more about ClearScan in my article “Better PDF OCR: ClearScan is smaller, looks better.”
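To make the "matching characters" idea concrete, here is a toy sketch in Python. It is not Adobe's algorithm: it assumes characters arrive as tiny bitmaps, matches them only when they are exactly identical, and keeps the bitmap itself instead of producing a vector outline. The function name and the sample bitmaps are invented for this example. It only shows why storing each shape once and reusing it can shrink a page.

```python
from typing import Dict, List, Tuple

Bitmap = Tuple[str, ...]  # one scanned character as rows of '#' and '.'


def share_matching_glyphs(page: List[Bitmap]) -> Tuple[List[Bitmap], List[int]]:
    """Store each distinct character shape once and refer to it by index."""
    shapes: List[Bitmap] = []
    index_of: Dict[Bitmap, int] = {}
    refs: List[int] = []
    for glyph in page:
        if glyph not in index_of:
            # First time this shape appears: keep one shared copy of it.
            index_of[glyph] = len(shapes)
            shapes.append(glyph)
        refs.append(index_of[glyph])
    return shapes, refs


# Two made-up 3x3 character bitmaps.
LETTER_I = (".#.", ".#.", ".#.")
LETTER_L = ("#..", "#..", "###")

page = [LETTER_L, LETTER_I, LETTER_L, LETTER_L]
shapes, refs = share_matching_glyphs(page)
print(len(shapes), refs)  # 2 [0, 1, 0, 0]
```

Four characters on the page end up as two stored shapes plus a list of references. A real scan contains noise, so matching in practice would have to tolerate small differences between instances of the same character.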
Re-OCRing Using ClearScan
I found that after running ClearScan OCR on the documents the customer supplied, Acrobat could then find the differences in the text.
Since ClearScan is not the default type of OCR that Acrobat uses, many folks never discover it on their own.
Here’s how to OCR using ClearScan in Acrobat X:
- Open the Tools Panel
- Go to the Recognize Text section and choose In this File
- The Recognize Text window appears. Click the Edit button:
- In the Settings window, choose ClearScan from the PDF Output Style menu, then click OK.
- Click OK again to OCR your document using ClearScan.
Like some other choices in Acrobat, making the change above (Step 4) is a “sticky” setting. If you don’t want to use ClearScan the next time you OCR, you’ll need to remember to change it back. If you think you might occasionally want to use ClearScan, you might consider creating a quick Action which includes the ClearScan option.
ClearScan OCR has many benefits over traditional OCR (smaller file size, faster printing), but I do not recommend using it in all legal workflows.
Because ClearScan changes the original scanned image, replacing it with a cleaner vector representation, using it could call the validity of your documents into question.
Finally, while you can re-OCR an existing OCR’d PDF using ClearScan, you cannot revert to the Searchable Image flavor later. For that reason, you might wish to duplicate the documents and work on a copy.
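If you handle many files, duplicating them by script can make the "work on a copy" habit automatic. The following is a minimal Python sketch using only the standard library. The `make_working_copy` helper, the `_clearscan` suffix, and the file name are assumptions chosen for illustration, and the script does not touch Acrobat at all.

```python
import shutil
import tempfile
from pathlib import Path


def make_working_copy(original: Path, suffix: str = "_clearscan") -> Path:
    """Copy a PDF next to the original so the original is never re-OCR'd."""
    copy_path = original.with_name(f"{original.stem}{suffix}{original.suffix}")
    if copy_path.exists():
        # Refuse to overwrite an earlier working copy.
        raise FileExistsError(f"{copy_path} already exists")
    shutil.copy2(original, copy_path)  # copy2 also tries to preserve timestamps
    return copy_path


# Demonstration with a placeholder file standing in for a real scanned PDF.
with tempfile.TemporaryDirectory() as folder:
    scan = Path(folder) / "contract.pdf"
    scan.write_bytes(b"%PDF-1.4 placeholder")
    working_copy = make_working_copy(scan)
    print(working_copy.name)  # contract_clearscan.pdf
```

Run ClearScan on the copy and leave the Searchable Image original untouched.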