MathGroup Archive: October 2010 [00695]

Re: Importing "Plaintext" from PDF

To: mathgroup at smc.vnet.net
Subject: [mg113505] Re: Importing "Plaintext" from PDF
From: Joseph Gwinn <joegwinn at comcast.net>
Date: Sun, 31 Oct 2010 02:09:56 -0500 (EST)
References: <iagldl$3cm$1@smc.vnet.net>

In article <iagldl$3cm$1 at smc.vnet.net>,
 Mark Coleman <markspcoleman at gmail.com> wrote:

> Hi,
> 
> I'm attempting to use Mathematica (v7.01) to Import the text from a PDF file.
> If I simply Import[] the file, it returns a list of graphics objects
> representing each page of the file. If I use the "Plaintext" option of
> Import[], it returns an empty list. My source PDF files were obtained
> from Google's Patent Search function. Just wondering if there is
> some option I am missing or if Mathematica cannot Import text from PDF files.

The PDF contains scanned page images (like a fax), not text.  Google 
Patents has text generated by OCR of the scans, but even for plain 
English text the error rate is significant, at least 1% on older patents.
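
A quick way to confirm that a given file has no text layer is to test 
what the "Plaintext" import actually returns.  This is only a minimal 
sketch: the file name "patent.pdf" and the variable names are 
placeholders, and an empty result suggests, but does not prove, that the 
pages are scanned images.

```mathematica
(* Hypothetical file name; substitute the path to your own PDF. *)
file = "patent.pdf";

(* Ask Import for the text layer only. *)
text = Import[file, "Plaintext"];

(* Treat anything that is not a non-blank string as "no text found".
   This covers both an empty string and an empty list. *)
hasText = StringQ[text] && StringLength[StringTrim[text]] > 0;

If[hasText,
  Print["Text layer found: ", StringLength[text], " characters"],
  Print["No extractable text; the pages are probably scanned images"]]
```

If the second message prints, Import has nothing to extract, and no 
Import option will change that.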

OCR of math equations is basically hopeless.  Nor are the published 
equations written in Mathematica syntax, so you will have to enter them 
manually.

Joe Gwinn
