Re: Importing "Plaintext" from PDF
- To: mathgroup at smc.vnet.net
- Subject: [mg113522] Re: Importing "Plaintext" from PDF
- From: AES <siegman at stanford.edu>
- Date: Mon, 1 Nov 2010 04:59:59 -0500 (EST)
- References: <iaj4jt$n2v$1@smc.vnet.net>
In article <iaj4jt$n2v$1 at smc.vnet.net>,
Bill Rowe <readnews at sbcglobal.net> wrote:
> > My source
> >pdf files were obtained from Google's Patent Search function. Just
> >wondering if there is some option I am missing or if Mathematica
> >cannot Import text from pdf files.
>
> It is not at all difficult to import just the text from PDF
> files into Mathematica. The basic syntax is
>
> Import["filename",{"PDF","Plaintext"}]
>
> This will import all of the text in the PDF file assuming it
> exists. This will not do anything for you if the document was
> scanned into the PDF file. In that case, there is no plaintext
> to import.
A bit of additional info, in case it's helpful:
If these are the usual Patent Office copies of patents, the originals
are in two-column format and have been scanned and converted into
raster images which are delivered in TIFF or PDF format.
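One quick way to tell whether a given file is a scanned, image-only PDF is to see whether the "Plaintext" import comes back blank. The following is only a minimal sketch: hasTextLayerQ is a hypothetical helper name, "patent.pdf" is a placeholder file name, and it assumes that Import returns an empty or whitespace-only string (or fails) when the PDF has no text layer.

```mathematica
(* Hypothetical helper, not a built-in function. *)
(* Assumption: a scanned PDF yields an empty or whitespace-only string, *)
(* or a non-string result such as $Failed. *)
hasTextLayerQ[file_String] := Module[{text},
  text = Import[file, {"PDF", "Plaintext"}];
  (* Only a non-blank string counts as importable text. *)
  StringQ[text] && StringLength[StringTrim[text]] > 0
]

(* "patent.pdf" is a placeholder for your own file. *)
hasTextLayerQ["patent.pdf"]
```

If this gives False, the file most likely needs to go through OCR before Import can extract anything useful.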
You may be able to OCR these to convert all the scanned text to a "real
text" PDF file. Adobe Acrobat in particular has a very good and
easy-to-use OCR capability built right into it (one click to OCR an entire
multi-page raster-image PDF document) that I've often used with success,
although I've never applied it specifically to patents.
I'm not at all sure, however, that the output from this OCR process will
have any of the "flow" information associated with its two-column
character, in which case you may have a mess to deal with in
interpreting each page after reading it into Mathematica (the line
numbering that's added to patents may cause some trouble also).
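As a rough illustration of cleaning up the line numbering after import, the sketch below drops any line of the imported text that consists only of digits. It assumes the OCR output places the patent line numbers on lines of their own, which may well not be the case; stripLineNumbers is a hypothetical helper name, and the approach will also discard any genuine line that happens to contain nothing but digits.

```mathematica
(* Hypothetical helper, not a built-in function. *)
(* Assumption: patent line numbers end up on their own lines after OCR. *)
stripLineNumbers[text_String] := Module[{lines},
  lines = StringSplit[text, "\n"];
  (* Remove lines made up only of digits, optionally padded by whitespace. *)
  lines = Select[lines,
    ! StringMatchQ[#,
        WhitespaceCharacter ... ~~ DigitCharacter .. ~~ WhitespaceCharacter ...] &];
  (* Reassemble the remaining lines into one string. *)
  StringJoin[Riffle[lines, "\n"]]
]

(* Small made-up sample, standing in for real imported text. *)
stripLineNumbers["first line\n 5 \nsecond line\n10\nthird line"]
```

This does nothing about the two-column flow problem itself; it only addresses stray number-only lines.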
You might, for example, have to duplicate each original page into two
pages before OCRing it; use Crop operations to select the left column
on the first copy and the right column on the second; and then proceed
with OCRing these cropped pages.
You might also be able to OCR the original pages; hand-select each
column individually; and Copy and Paste them one by one into an RTF
document. (The Bean Services on a Mac can make this a pretty fast
process.)
I'd be interested in a summary report to this group, if you find a way
to get all this working, or any free source that will provide already
OCRed and "single-columned" patents.