Re: Importing "Plaintext" from PDF
- To: mathgroup at smc.vnet.net
- Subject: [mg113522] Re: Importing "Plaintext" from PDF
- From: AES <siegman at stanford.edu>
- Date: Mon, 1 Nov 2010 04:59:59 -0500 (EST)
- References: <iaj4jt$n2v$1@smc.vnet.net>
In article <iaj4jt$n2v$1 at smc.vnet.net>,
 Bill Rowe <readnews at sbcglobal.net> wrote:

> > My source
> >pdf files were obtained from Google's Patent Search function. Just
> >wondering if there is some option I am missing or if Mathematica
> >cannot Import text from pdf files.
>
> It is not at all difficult to import just the text from PDF
> files into Mathematica. The basic syntax is
>
> Import["filename",{"PDF","Plaintext"}]
>
> This will import all of the text in the PDF file assuming it
> exists. This will not do anything for you if the document was
> scanned into the PDF file. In that case, there is no plaintext
> to import.

A bit of additional info, in case it's helpful:

If these are the usual Patent Office copies of patents, the originals
are in two-column format and have been scanned and converted into
raster images, which are delivered in TIFF or PDF format.

You may be able to OCR these to convert all the scanned text to a
"real text" PDF file. Adobe Acrobat in particular has a very good and
easy-to-use OCR capability built right into it (one click to OCR an
entire multi-page raster-image PDF document) that I've often used with
success, although I've never applied it specifically to patents.

I'm not at all sure, however, that the output from this OCR process
will retain any of the "flow" information associated with the
two-column layout. If it doesn't, you may have a mess to deal with in
interpreting each page after reading it into Mathematica (the line
numbering that's added to patents may cause some trouble also).

You might, for example, have to duplicate each original page into two
pages before running OCR on it; use Crop operations to select the left
column on the first page and the right column on the second; and then
proceed with OCRing these cropped pages.

You might also be able to OCR the original pages; hand select each
column individually; and Copy and Paste them one by one into an RTF
document.
(The Bean Services on a Mac can make this a pretty fast process.)

I'd be interested in a summary report to this group if you find a way
to get all this working, or if you find any free source that provides
patents that are already OCRed and "single-columned".
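
For what it's worth, the crop-then-OCR idea above could perhaps be tried inside Mathematica itself. What follows is only a rough sketch under several assumptions, not a tested recipe for patents: the helper names hasPlaintext, splitColumns and ocrTwoColumn are made up for illustration, the column gutter is assumed to sit at the horizontal midpoint of the page image, the file name in the usage comment is hypothetical, and TextRecognize requires Mathematica 8 or later. OCR quality on patent scans (and the effect of the printed line numbers) is not something this sketch addresses.

```mathematica
(* Sketch only. All three helper names are hypothetical. *)

(* True if the PDF yields any non-blank text via the Import syntax quoted above. *)
hasPlaintext[file_String] :=
  With[{t = Import[file, {"PDF", "Plaintext"}]},
    StringQ[t] && StringLength[StringTrim[t]] > 0]

(* Split a page image into left and right halves.
   Assumption: the gutter between columns is at the horizontal midpoint. *)
splitColumns[img_Image] :=
  Module[{w, h, mid},
    {w, h} = ImageDimensions[img];   (* {width, height} in pixels *)
    mid = Floor[w/2];
    {ImageTake[img, {1, h}, {1, mid}],       (* left column *)
     ImageTake[img, {1, h}, {mid + 1, w}]}]  (* right column *)

(* OCR each half separately and join left-then-right,
   so the text comes out in column reading order. *)
ocrTwoColumn[img_Image] :=
  StringJoin[Riffle[TextRecognize /@ splitColumns[img], "\n"]]

(* Usage, assuming a single-page scan saved under a hypothetical name:
   page = Import["patent-page1.tif"];
   text = ocrTwoColumn[page]
*)
```

Splitting at the midpoint is the crudest possible choice; if the gutter is off-center or the scan is skewed, the split position would need to be found per page rather than assumed.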