MathGroup Archive: November 2010 [00019]

Re: Importing "Plaintext" from PDF

To: mathgroup at smc.vnet.net
Subject: [mg113522] Re: Importing "Plaintext" from PDF
From: AES <siegman at stanford.edu>
Date: Mon, 1 Nov 2010 04:59:59 -0500 (EST)
References: <iaj4jt$n2v$1@smc.vnet.net>

In article <iaj4jt$n2v$1 at smc.vnet.net>,
 Bill Rowe <readnews at sbcglobal.net> wrote:

> > My source
> >pdf files were obtained from Google's Patent Search function. Just
> >wondering if there is some option I am missing or if Mathematica
> >cannot Import text from pdf files.
> 
> It is not at all difficult to import just the text from PDF
> files into Mathematica. The basic syntax is
> 
> Import["filename",{"PDF","Plaintext"}]
> 
> This will import all of the text in the PDF file assuming it
> exists. This will not do anything for you if the document was
> scanned into the PDF file. In that case, there is no plaintext
> to import.

A bit of additional info, in case it's helpful:

If these are the usual Patent Office copies of patents, the originals 
are in two-column format and have been scanned and converted into 
raster images which are delivered in TIFF or PDF format.
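
A quick way to check whether a given file is one of these image-only PDFs is to try the plaintext import quoted above and see whether anything comes back. The following is only a minimal sketch; "patent.pdf" is a placeholder file name, and treating an empty or whitespace-only result as "scanned" is my assumption rather than documented behavior.

```mathematica
(* "patent.pdf" is a placeholder; substitute your own file *)
text = Import["patent.pdf", {"PDF", "Plaintext"}];

(* Assumption: an image-only (scanned) PDF yields an empty or
   whitespace-only string, or the import fails outright *)
If[StringQ[text] && StringLength[StringTrim[text]] > 0,
  Print["Text layer found: ", StringLength[text], " characters"],
  Print["No usable text layer; the file probably needs OCR first"]
]
```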

You may be able to OCR these to convert all the scanned text to a "real 
text" PDF file.  Adobe Acrobat in particular has a very good and easy to 
use OCR capability built right into it (one click to OCR an entire 
multi-page raster-image PDF document) that I've often used with success, 
although I've never applied it specifically to patents.

I'm not at all sure, however, that the output from this OCR process will 
have any of the "flow" information associated with its two-column 
character, in which case you may have a mess to deal with in 
interpreting each page after reading it into Mathematica (the line 
numbering that's added to patents may cause some trouble also).  
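
If the line numbers do turn out to be a nuisance, something along the following lines might remove the simplest cases once the text is in Mathematica. This is a rough sketch under my own assumptions: that the OCR puts each marker (5, 10, 15, ...) on a line by itself, and that nothing else in the patent is a bare multiple of 5 on its own line. The name stripLineNumbers is made up for illustration.

```mathematica
(* Hypothetical helper: drop lines that contain nothing but a
   line-number marker (5, 10, 15, ..., 95) *)
stripLineNumbers[text_String] := StringJoin[Riffle[
   Select[StringSplit[text, "\n"],
    ! StringMatchQ[StringTrim[#],
       RegularExpression["(?:[1-9]?5|[1-9]0)"]] &],
   "\n"]]

stripLineNumbers["first line\n5\nsecond line\n10\nthird line"]
(* gives "first line\nsecond line\nthird line" *)
```

Markers that the OCR fuses into the middle of a text line would not be caught by this and would need a different approach.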

You might, for example, have to duplicate each original page into two 
pages before scanning it; use Crop operations to select the left column 
on the first page and the right column on the second; and then proceed 
with OCRing these cropped pages.
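
If the scanned pages are available as image files, the same cropping could probably be done inside Mathematica instead of by hand. The sketch below assumes a single-page scan in a file I have called "page1.tif" (a placeholder), and it assumes the gutter between the columns sits at the horizontal midpoint of the page, which may well need adjusting for real patents.

```mathematica
(* "page1.tif" is a placeholder for a single-page scan *)
page = Import["page1.tif"];
{w, h} = ImageDimensions[page];   (* {width, height} in pixels *)

(* Assumption: the column gutter is at the horizontal midpoint *)
mid = Floor[w/2];
left = ImageTake[page, {1, h}, {1, mid}];
right = ImageTake[page, {1, h}, {mid + 1, w}];

(* Write the two halves out as separate images for OCR *)
Export["page1-left.tif", left];
Export["page1-right.tif", right];
```

In Mathematica versions that provide TextRecognize, the two halves could presumably be passed to it directly rather than exported for an external OCR program.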

You might also be able to OCR the original pages; select each 
column individually by hand; and Copy and Paste them one by one into an RTF 
document.  (The Bean Services on a Mac can make this a pretty fast 
process.)

I'd be interested in a summary report to this group, if you find a way 
to get all this working, or any free source that will provide already 
OCRed and "single-columned" patents.
