Adobe Acrobat 9 How-To #61: Extracting Active Text from an Image in Acrobat 9
A page scanned in an older version of Acrobat, or one created from a photo or drawing, is only an image of a page: you can't extract its images or modify its text. However, Acrobat can use optical character recognition (OCR) to convert the image of the document into actual text or to add a text layer to the document.
To capture the content of an image document, follow these steps:
- Choose Document > OCR Text Recognition > Recognize Text Using OCR. The Recognize Text dialog box opens. Specify whether you want to capture the current page, an entire document, or specified pages in a multipage document.
- Click the Edit button to open the Recognize Text - Settings dialog box. Choose one of three options in the PDF Output Style pop-up menu:
  - Searchable Image compresses the foreground and places the searchable text behind the image. Note that compressing affects the image quality.
  - Searchable Image (Exact) keeps the foreground of the page intact and places the searchable text behind the image.
  - ClearScan rebuilds the page, converting the content into text, fonts, and graphics.
- If you selected either Searchable Image or ClearScan, choose one of four options from the Downsample Images pop-up menu, ranging from 600 dpi down to 72 dpi. (Downsampling reduces file size, but it can also result in unusable images.) Click OK to return to the Recognize Text dialog box.
- Click OK to start the capture process. Be patient. Depending on the size and complexity of the document, the process can take a minute or two. When the process is complete, the dialog box closes and the results of the conversion are shown in the document (see Figure 1).
Figure 1 A poster image converts with all content visible if you specify that you want an exact image (left). Using the ClearScan setting converts only content that doesn't overlay the denser parts of the poster image (right).
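To put the Downsample Images range in perspective, the short Python sketch below works out how many pixels a page holds at the two ends of that range. It is only an illustration: the US Letter page size (8.5 x 11 inches) and the function names are assumptions, and the only values taken from the steps above are 600 dpi and 72 dpi. Pixel count is a rough guide to image detail, not a prediction of the final file size, which also depends on compression.

```python
# Pixel dimensions of a page image at the two ends of the downsample range.
# Assumption: a US Letter page (8.5 x 11 in). Only 600 and 72 dpi come from the steps.
PAGE_WIDTH_IN = 8.5
PAGE_HEIGHT_IN = 11.0


def pixel_dimensions(dpi, width_in=PAGE_WIDTH_IN, height_in=PAGE_HEIGHT_IN):
    """Return (width, height) in pixels for a page rendered at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


def pixel_count(dpi):
    """Return the total number of pixels in the page image at the given dpi."""
    width, height = pixel_dimensions(dpi)
    return width * height


for dpi in (600, 72):
    width, height = pixel_dimensions(dpi)
    print(f"{dpi} dpi: {width} x {height} px = {width * height:,} pixels")

# Ratio of pixel counts between the highest and lowest settings.
ratio = pixel_count(600) / pixel_count(72)
print(f"600 dpi holds about {ratio:.0f}x as many pixels as 72 dpi")
```

Under these assumptions the page is 5100 x 6600 pixels at 600 dpi and 612 x 792 pixels at 72 dpi, which is why small type can become unreadable at the lowest setting.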
The point of OCR is to produce searchable text in your document. OCR isn't foolproof, and you should expect some errors that Acrobat doesn't flag as such. (See the next section for details on handling suspect content.) The example shown in Figure 2 is a case in point. At the top of the figure, the title was set in a complex font. Then the first paragraph was captured as a screenshot image, and the text was recognized using OCR in Acrobat. At the bottom of the figure, notice how many errors the captured text contains.
Figure 2 Acrobat tries to interpret as much as possible of the page's content as text, but won't recognize text on a dark-colored background.
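If you have the original wording of a passage, you can get a rough sense of how many errors slipped through by comparing it with the recognized text after copying it out of the PDF. The Python sketch below is one way to do that with the standard library's difflib module. It is not part of Acrobat, the function names are made up for illustration, and the sample strings are hypothetical rather than taken from Figure 2.

```python
import difflib


def character_accuracy(reference, recognized):
    """Return a similarity ratio between 0 and 1 for the two strings."""
    return difflib.SequenceMatcher(None, reference, recognized).ratio()


def differing_words(reference, recognized):
    """Return (expected, got) pairs for each run of words that does not match."""
    ref_words = reference.split()
    ocr_words = recognized.split()
    matcher = difflib.SequenceMatcher(None, ref_words, ocr_words)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            pairs.append((" ".join(ref_words[i1:i2]), " ".join(ocr_words[j1:j2])))
    return pairs


# Hypothetical sample strings showing typical OCR confusions (i/l, m/rn).
reference = "Extracting active text from an image"
recognized = "Extractlng active text frorn an irnage"

print(f"Similarity: {character_accuracy(reference, recognized):.2f}")
for expected, got in differing_words(reference, recognized):
    print(f"expected {expected!r}, got {got!r}")
```

A check like this only works when you already know what the text should say. For a scan with no source text, you still need to review the recognized text yourself.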