Generating TIFF and Text files from PDF for Concordance and Summation
Adobe is the custodian for both PDF and TIFF (Tagged Image File Format) formats.
While PDF is superior in many ways, TIFF remains a popular format for use in large case litigation support systems such as Concordance and Summation.
If you have a lot of PDFs in your production, working with these systems can be a challenge: they do not robustly support PDF, so conversion is necessary. These systems want to ingest a . . .
- TIFF file to represent each individual document page
- TEXT file of the text of each page
Processing several hundred documents to individual TEXT and TIFF files is a candidate for some serious automation!
Fortunately, repetitive tasks like this can be easily accomplished using Acrobat Professional. Since Acrobat can be automated using JavaScript, it is possible to string together several steps and save a lot of time.
In this article, I’ve included a Tiff-Text Processing batch script to download, which handles all of this conversion automatically.
What does the script do exactly?
The TIFF-TEXT Processing script performs the following steps:
- Tags the file for accessibility and text reflow.
This should make the text files easier to review in litigation support products.
- Splits the PDF into individual PDFs by page
- Exports the individual PDFs as TIFF
- Exports the individual PDFs as Text files
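For readers curious how the split and export steps might look in Acrobat JavaScript, here is a minimal sketch. It is not the code from the downloadable sequence file. The function name, the helper, and the four-digit file naming are my own assumptions for illustration. It relies on the Doc methods extractPages, saveAs and closeDoc, and on the conversion IDs "com.adobe.acrobat.tiff" and "com.adobe.acrobat.plain-text" as I understand them from the Acrobat JavaScript reference. The tagging step is not shown, because a batch sequence handles it with a separate command.

```javascript
// Hypothetical helper: build a zero-padded, page-numbered file name
// so that the output files sort in page order.
function pageFileName(baseName, pageIndex, ext) {
  var n = String(pageIndex + 1);
  while (n.length < 4) {
    n = "0" + n;
  }
  return baseName + "_" + n + "." + ext;
}

// Sketch: split `doc` into single pages and save each page as PDF, TIFF and text.
// `doc` is an Acrobat Doc object (for example `this` inside a batch sequence).
// `destPath` is a folder path ending in "/" (assumed to exist already).
function splitAndExport(doc, destPath) {
  var base = doc.documentFileName.replace(/\.pdf$/i, "");
  var written = [];
  for (var i = 0; i < doc.numPages; i++) {
    // Without a cPath argument, extractPages returns the new one-page Doc.
    var page = doc.extractPages({ nStart: i, nEnd: i });
    var pdfPath = destPath + pageFileName(base, i, "pdf");
    var tifPath = destPath + pageFileName(base, i, "tif");
    var txtPath = destPath + pageFileName(base, i, "txt");
    page.saveAs({ cPath: pdfPath });
    page.saveAs({ cPath: tifPath, cConvID: "com.adobe.acrobat.tiff" });
    page.saveAs({ cPath: txtPath, cConvID: "com.adobe.acrobat.plain-text" });
    page.closeDoc(true); // close the temporary page without prompting
    written.push(pdfPath, tifPath, txtPath);
  }
  return written;
}
```

Inside an Execute JavaScript batch command, a call such as splitAndExport(this, destPath) would be the entry point. As far as I know, extractPages and saveAs are security-restricted and only run in batch, console or other trusted contexts.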
What’s Covered . . .
- Installing the script
- Setting related Acrobat Preferences
- Tweaking the Script
- Running the Script
- Troubleshooting
Download the Sequence File
Below is a PDF file containing the sequence file. Select the file in the Attachments panel of the PDF and click the Save button to extract it.
Installation_instructions (52K)
Installing the Sequence File for Acrobat 8
The instructions below have been tested with Acrobat 8.
Caution: Use of the TIFF-TEXT sequence file is not supported by Adobe Systems Incorporated. The sequence file is made available as-is and without warranty. Use at your own risk! Use on a copy of your files!
The above obligatory warning aside, it seems to work.
- Quit Acrobat if it is open.
- Extract the sequence file contained in the Installation_instructions PDF to your desktop or other location you can find easily.
The sequence file is called Tiff-Text Processor.sequ
- Select the Tiff-Text Processor.sequ file, right-click and choose Copy to place the file on the clipboard
- Place the file in the following location:
- WINDOWS
C:\Documents and Settings\<username>\Application Data\Adobe\Acrobat\8.0\Sequences
- MAC OSX PPC
/Users/<username>/Library/Acrobat User Data/8.0_ppc/Sequences
- MAC OSX INTEL
/Users/<username>/Library/Acrobat User Data/8.0_x86/Sequences
- Restart Acrobat
Note: If you or your IT administrator has customized your installation of Acrobat, you may not be able to find the correct folder at the location noted above. Consult your IT department or use the Search function to find the correct folder.
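If the folder is hard to track down, Acrobat itself may be able to tell you where it is. The sketch below assumes that app.getPath accepts the category "user" and the folder name "sequences", which is how I read the Acrobat JavaScript reference. The wrapper function name is my own. The call can fail if the folder does not exist yet, so the sketch returns null in that case.

```javascript
// Sketch: ask Acrobat for the per-user Sequences folder.
// `acrobatApp` is Acrobat's global `app` object; it is passed in as a
// parameter so the function does not depend on a global.
function findSequencesFolder(acrobatApp) {
  try {
    // Returns a device-independent path such as "/c/Documents and Settings/...".
    return acrobatApp.getPath({ cCategory: "user", cFolder: "sequences" });
  } catch (e) {
    // getPath throws when the folder cannot be found.
    return null;
  }
}
```

To try it, paste the function into the JavaScript console in Acrobat and evaluate findSequencesFolder(app).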
Can’t see files on Windows?
1. Go to the Control Panel
2. Choose Folder Options
3. Click on the View tab
4. Find Hidden Files and Folders in the list and double-click to open it
5. Enable “Show hidden files and folders”
Set TIFF Conversion Preferences
The majority of case documents may be represented well as B&W TIFFs at 300 dpi resolution. Acrobat’s default preference, however, is to make an intelligent conversion of the document, which could result in the creation of grayscale or color TIFFs . . . these can be really large!
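To put rough numbers on "really large", here is a small back-of-the-envelope calculation. It computes the uncompressed raster size of a US Letter page at 300 dpi for 1-bit, 8-bit and 24-bit pixels. The function name is my own, and real TIFFs are normally compressed, so actual files will be smaller. The ratios between the three are the point.

```javascript
// Uncompressed raster size in bytes for a page of the given size.
// Each row is rounded up to a whole number of bytes.
function rawImageBytes(widthInches, heightInches, dpi, bitsPerPixel) {
  var widthPx = Math.round(widthInches * dpi);
  var heightPx = Math.round(heightInches * dpi);
  var bytesPerRow = Math.ceil((widthPx * bitsPerPixel) / 8);
  return bytesPerRow * heightPx;
}

var letterMono = rawImageBytes(8.5, 11, 300, 1);   // 1,052,700 bytes (about 1 MB)
var letterGray = rawImageBytes(8.5, 11, 300, 8);   // 8,415,000 bytes (about 8 MB)
var letterColor = rawImageBytes(8.5, 11, 300, 24); // 25,245,000 bytes (about 25 MB)
```

Before compression, a color page carries roughly 24 times the data of a monochrome page, which adds up quickly across a production of several thousand pages.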
Let’s make some changes:
- Choose Edit->Preferences . . .
(Acrobat->Preferences . . . on the Mac)
- In the Preferences window
A) Choose Convert from PDF
B) Choose TIFF
C) Click the Edit Settings button
- Make the following changes in the Settings Window:
A) Change Colorspace to Monochrome
B) Change Resolution to 300 pixels/inch
Click OK
- Set Batch Conversion Preferences.
A) Click on the Batch Processing category (far left)
B) Enable “Save warnings and errors in log file”
Click OK
Destination Locations
One thing you should know about the script: the destination folder is hard-coded.
If you run the script sample as-is, it will prompt you to find the PDFs to process and then write individual PDFs, TIFFs and Text files into a folder at C:\dest.
Windows:
At the very least, you will need to create the “dest” folder at the root level of your C drive to use the script.
Mac:
On my Mac, I was surprised that Acrobat actually created a folder at /C/dest.
Still, you will probably want more control over where the files go. See below.
Changing the Destination Location
To change the place where files will be written:
- Choose Advanced->Document Processing->Batch Processing . . .
A) Scroll down to find the Tiff-Text Processor sequence
B) Click Edit Sequence . . .
- Click the Select Commands . . . button in the Edit Batch Sequence window
- In the Edit Sequence window:
A) Choose Execute JavaScript from the list on the right and
B) Click the Edit button.
- In the JavaScript editor window, scroll down to find the line:
var destPath = "c/dest/"
- Edit the path between the quotes to point to the folder where you want the files written.
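Acrobat JavaScript does not use Windows-style paths. It expects a device-independent form with forward slashes and no drive colon. The helper below is a hypothetical convenience, not part of the sequence file. It converts a Windows folder path into that form so you can paste the result into the destPath line. It emits a leading slash and a lower-case drive letter, which is how the Acrobat JavaScript reference writes absolute paths. The line quoted above has no leading slash, and I have not verified which form the script needs, so treat the output as a starting point.

```javascript
// Hypothetical helper: convert "C:\Cases\Smith" to "/c/Cases/Smith/".
function toDevicePath(windowsPath) {
  var m = /^([A-Za-z]):[\\\/]*(.*)$/.exec(windowsPath);
  if (!m) {
    throw new Error("Expected a path like C:\\dest, got: " + windowsPath);
  }
  var drive = m[1].toLowerCase();
  // Normalize separators and drop trailing slashes before re-adding one.
  var rest = m[2].replace(/\\/g, "/").replace(/\/+$/, "");
  return "/" + drive + "/" + (rest ? rest + "/" : "");
}
```

For example, toDevicePath("C:\\dest") gives "/c/dest/". The trailing slash matters if the script builds file names by appending them directly to destPath.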
Source File Location
It isn’t always convenient to have to select the file(s) for conversion, especially if your document production spans several nested folders.
To process a folder and all subfolders within:
- Choose Advanced->Document Processing->Batch Processing . . .
A) Scroll down to find the Tiff-Text Processor sequence
B) Click Edit Sequence . . .
- In the Edit Batch Sequence window, change the following:
A) Change Run commands on to “Selected Folder”
B) Click the Browse button and locate your source folder.
Click OK.
Running the Sequence
This part is easy!
- Choose Advanced->Document Processing->Batch Processing . . .
A) Scroll down to find the Tiff-Text Processor sequence
B) Click the Run Sequence button
- Acrobat will prompt you to select files if you did not change the Source File location.
- Acrobat will display the Run Confirmation window.
You can turn this off in Preferences (Batch Processing category).
- Acrobat will process the files. This could take a while!
- Open your destination folder to view the results.
Troubleshooting
The script is not perfect. Here are the problems I have run into:
- If no text is created from a file, ensure that it has been OCR’d.
- You might receive messages that files could not be tagged or were already tagged. Generally, you can ignore these.
- Not all PDF forms can be saved as TIFF. You might need to flatten them first using the PDF Optimizer.
- Corrupted PDFs may cause a crash.
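After a large run, it can help to check that every page ended up with both a TIFF and a text file before loading anything into Concordance or Summation. The sketch below is a hypothetical check written in plain JavaScript, not part of the sequence file. It assumes that a page's TIFF and text file share the same base name and differ only in extension. It only checks that the files exist, not that the text file has any content.

```javascript
// Given a list of file names from the destination folder, return the
// base names that are missing either the TIFF or the text file.
function findIncompletePages(fileNames) {
  var pages = {};
  fileNames.forEach(function (name) {
    var m = /^(.*)\.(tiff?|txt)$/i.exec(name);
    if (!m) {
      return; // ignore PDFs and anything else
    }
    var kind = m[2].toLowerCase() === "txt" ? "text" : "tiff";
    pages[m[1]] = pages[m[1]] || { tiff: false, text: false };
    pages[m[1]][kind] = true;
  });
  return Object.keys(pages).filter(function (base) {
    return !(pages[base].tiff && pages[base].text);
  }).sort();
}
```

Outside Acrobat, one way to get the list of names is Node's fs.readdirSync on the destination folder. Any base name the function returns is a page worth re-running or inspecting by hand.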
I need to convert Word, Excel, etc. to PDF
You can use Batch Processing to convert any type of file supported by Acrobat to PDF. You would need to conduct this operation as a separate batch sequence before running the TIFF-Text Processor script. Regrettably, Acrobat doesn’t allow you to chain together PDF Creation and secondary processing.
How do I learn more about scripting?
AcrobatUsers.Com Javascript Corner