Generating TIFF and Text files from PDF for Concordance and Summation
Adobe is the custodian for both PDF and TIFF (Tagged Image File Format) formats.
While PDF is superior in many ways, TIFF remains a popular format for use in large case litigation support systems such as Concordance and Summation.
If you have a lot of PDFs in your production, it can be a challenge to work with these systems because they do not robustly support PDF, so conversion is necessary. These systems want to ingest a . . .
TIFF file to represent each individual document page
TEXT file of the text of each page
Processing several hundred documents to individual TEXT and TIFF files is a candidate for some serious automation!
Fortunately, repetitive tasks like this can be easily accomplished using Acrobat Professional. Since Acrobat can be automated using JavaScript, it is possible to string together several steps and save a lot of time.
In this article, I’ve included a Tiff-Text Processing batch script for download that handles all of this conversion automatically.
What does the script do exactly?
The TIFF-TEXT Processing script performs the following steps:
Tags the file for accessibility and text reflow. This should make the text files easier to review in litigation support products.
Splits the PDF into individual PDFs by page
Exports the individual PDFs as TIFF
Exports the individual PDFs as Text files
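The split and export steps above can be sketched in Acrobat JavaScript. This is a minimal illustration, not the contents of the downloadable sequence file: the function name exportPages, the output file names, and the destPath parameter are assumptions made for the example, and the tagging step is left out. Acrobat may also name exported image files differently from the path requested here.

```javascript
// Hypothetical helper (not taken from the sequence file): split `doc` into
// one-page PDFs under destPath, then save each page as TIFF and plain text.
function exportPages(doc, destPath) {
    // Drop the ".pdf" extension to get a base name for the output files.
    var base = doc.documentFileName.replace(/\.pdf$/i, "");
    var stems = [];
    for (var i = 0; i < doc.numPages; i++) {
        var stem = destPath + base + "_" + (i + 1);
        // Called without cPath, extractPages returns the new one-page Doc.
        var page = doc.extractPages({ nStart: i, nEnd: i });
        page.saveAs({ cPath: stem + ".pdf" });
        page.saveAs({ cPath: stem + ".tif", cConvID: "com.adobe.acrobat.tiff" });
        page.saveAs({ cPath: stem + ".txt", cConvID: "com.adobe.acrobat.plain-text" });
        page.closeDoc(true); // true = close without saving again
        stems.push(stem);
    }
    return stems; // the requested path stems, one per page
}
```

Inside a batch sequence the call would look like exportPages(this, "/c/dest/"). Both extractPages and saveAs need a privileged context such as a batch sequence or the JavaScript console.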
**Have you OCRd your files first?** Acrobat can’t export text if the file hasn’t been OCRd first. Check out this article on Batch OCR.
What’s Covered . . .
Installing the script
Setting related Acrobat Preferences
Tweaking the Script
Running the Script
Troubleshooting
Download the Sequence File
Below is a PDF file containing the sequence file. Select the file in the Attachments panel of the PDF and click the Save button to extract it.
The instructions below have been tested with Acrobat 8.
Caution: _Use of the TIFF-TEXT sequence file is not supported by Adobe Systems Incorporated. The sequence file is made available as-is and without warranty. Use at your own risk! Use on a copy of your files!_
The above obligatory warning aside, it seems to work.
Quit Acrobat if it is open.
Extract the sequence file contained in the Installation_instructions PDF to your desktop or other location you can find easily. The sequence file is called Tiff-Text Processor.sequ
Select the _Tiff-Text Processor.sequ_ file, right-click and choose Copy to place the file on the clipboard
Place the file in the following location:
WINDOWS C:\Documents and Settings\<username>\Application Data\Adobe\Acrobat\8.0\Sequences
MAC OSX PPC /Users/<username>/Library/Acrobat User Data/8.0_ppc/Sequences
MAC OSX INTEL /Users/<username>/Library/Acrobat User Data/8.0_x86/Sequences
Restart Acrobat
Note: If you or your IT administrator has customized your installation of Acrobat, you may not be able to find the correct folder at the location noted above. Consult your IT department or use the Search function to find the correct folder.
**Can’t see Files on Windows?**
1. Go to the Control Panel
2. Choose Folder Options
3. Click on the View tab
4. Find Hidden Files and Folders in the list and double-click to open it
5. Enable “Show hidden files and folders”
Set TIFF Conversion Preferences
The majority of case documents may be represented well as B&W TIFFs at 300 dpi resolution. Acrobat’s default preference, however, is to make an intelligent conversion of the document, which could result in grayscale or color TIFFs . . . and these can be really large!
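To get a feel for the size difference, here is a rough back-of-the-envelope calculation. It assumes a US Letter page (8.5 x 11 inches) and counts raw, uncompressed pixel data only. Real TIFFs are usually compressed, so actual files will be smaller, but the ratio between color depths still gives a sense of scale. The function name is made up for this example.

```javascript
// Approximate raw pixel data for one page, ignoring compression and headers.
// bitsPerPixel: 1 = monochrome, 8 = grayscale, 24 = RGB color.
function uncompressedBytes(widthIn, heightIn, dpi, bitsPerPixel) {
    var pixels = Math.round(widthIn * dpi) * Math.round(heightIn * dpi);
    return Math.ceil(pixels * bitsPerPixel / 8);
}

// Letter page at 300 dpi = 2550 x 3300 pixels:
// uncompressedBytes(8.5, 11, 300, 1)  -> 1051875   (about 1 MB)
// uncompressedBytes(8.5, 11, 300, 8)  -> 8415000   (about 8 MB)
// uncompressedBytes(8.5, 11, 300, 24) -> 25245000  (about 25 MB)
```

Multiply those figures by several thousand pages and the case for monochrome output is clear.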
Let’s make some changes:
Choose Edit->Preferences . . . (Acrobat->Preferences . . . on the Mac)
In the Preferences window: A) Choose Convert from PDF B) Choose TIFF C) Click the _Edit Settings_ button
Make the following changes in the Settings window: A) Change Colorspace to Monochrome B) Change Resolution to 300 pixels/inch. Click OK.
Set Batch Conversion Preferences: A) Click on the Batch Processing category (far left) B) Enable “Save warnings and errors in log file”. Click OK.
Sometimes a file may not convert properly. You can view a log file created by Acrobat to help with troubleshooting.
Destination Locations
One thing you should know about the script: the destination folder is hard-wired.
If you run the script sample as-is, it will prompt you to find the PDFs to process and then write individual PDFs, TIFFs and Text files into a folder at C:\dest.
Windows: At the very least, you will need to create the “dest” folder at the root level of your C drive to use the script.
Mac: On my Mac, I was surprised that Acrobat actually created a folder at /C/dest.
Still, you will probably want more control over where the files go. See below.
Changing the Destination Location
To change the place where files will be written:
Choose Advanced->Document Processing->Batch Processing . . . A) Scroll down to find the Tiff-Text Processor sequence B) Click Edit Sequence . . .
Click the Select Commands . . . button in the Edit Batch Sequence window
In the Edit Sequence window: A) Choose Execute JavaScript from the list on the right and B) Click the Edit button.
In the JavaScript editor window, scroll down to find the line:
var destPath = "/c/dest/"
**What’s in a path?** /c/dest/ represents the drive letter and path. If you want to put the transformed files into a folder on your desktop, you might change that portion to:
Windows /c/Documents and Settings/USERNAME/Desktop/FOLDERNAME
Macintosh /Users/USERNAME/Desktop/FOLDERNAME
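If you are used to Windows-style paths, a small helper can do the translation into the slash-separated form Acrobat expects. The sketch below is only an illustration: the function name toDevicePath is made up, and it assumes the script wants a trailing slash, as in /c/dest/.

```javascript
// Hypothetical helper: turn "C:\some\folder" into "/c/some/folder/".
function toDevicePath(winPath) {
    var m = /^([A-Za-z]):[\\\/]?(.*)$/.exec(winPath);
    if (!m) {
        throw new Error("Expected a path starting with a drive letter: " + winPath);
    }
    // Swap backslashes for forward slashes and drop any trailing slashes.
    var rest = m[2].replace(/\\/g, "/").replace(/\/+$/, "");
    // Drive letter becomes the first path segment; always end with a slash.
    return "/" + m[1].toLowerCase() + "/" + (rest ? rest + "/" : "");
}

// toDevicePath("C:\\dest") -> "/c/dest/"
```

Note that backslashes must be doubled inside a JavaScript string literal, which is one more reason to type the forward-slash form directly into the script.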
Source File Location
It isn’t always convenient to have to select the file(s) for conversion, especially if your document production spans several nested folders.
To process a folder and all subfolders within:
Choose Advanced->Document Processing->Batch Processing . . . A) Scroll down to find the Tiff-Text Processor sequence B) Click Edit Sequence . . .
In the Edit Batch Sequence window, change the following: A) Change Run commands on to “Selected Folder” B) Click the Browse button and locate your source folder. Click OK.
Warning! Do not change the Output location via the window above or the script will not work. Leave this as “Ask When Sequence is Run”.
Running the Sequence
This part is easy!
Choose Advanced->Document Processing->Batch Processing . . . A) Scroll down to find the Tiff-Text Processor sequence B) Click the Run Sequence button
Acrobat will prompt you to select files if you did not change the Source File location.
Acrobat will display the Run Confirmation window. You can turn this off in Preferences (Batch Processing category).
Acrobat will process the files. This could take a while!
Open your destination folder to view the results.
Troubleshooting
The script is not perfect. I have found that it does not work properly in the following cases:
If no text is created from a file, ensure that it has been OCRd.
You might receive messages that files could not be tagged or were already tagged. Generally, you can ignore these.
Not all PDF forms can be saved as TIFF. You might need to flatten them first using the PDF Optimizer.
Corrupted PDFs may cause a crash.
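For the first item, you can check a file for missing text before running the whole batch. The sketch below is not part of the sequence file. It uses Acrobat's getPageNumWords method to list pages that contain no words at all, which usually means the page is an image that still needs OCR. The function name is an assumption for this example.

```javascript
// Hypothetical helper: return the 1-based numbers of pages with no words.
function pagesWithoutText(doc) {
    var empty = [];
    for (var i = 0; i < doc.numPages; i++) {
        // getPageNumWords takes a 0-based page index.
        if (doc.getPageNumWords(i) === 0) {
            empty.push(i + 1);
        }
    }
    return empty;
}
```

With a document open, you could run pagesWithoutText(this) from the JavaScript console. An empty list means every page has at least some text. A genuinely blank page will also be reported, so treat the list as a hint rather than proof.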
I need to convert Word, Excel, etc. to PDF
You can use Batch Processing to convert any type of file supported by Acrobat to PDF. You would need to conduct this operation as a separate batch sequence before running the TIFF-Text Processor script. Regrettably, Acrobat doesn’t allow you to chain together PDF Creation and secondary processing.
**Thank you, Leonard!** Thanks to Leonard Rosenthol, Adobe’s Technical Standards Evangelist, for his help in developing this script. I don’t know anybody who knows more about the technical intricacies of PDF than Leonard!