MathGroup Archive 2010

Re: Importing "Plaintext" from PDF

  • To: mathgroup at smc.vnet.net
  • Subject: [mg113505] Re: Importing "Plaintext" from PDF
  • From: Joseph Gwinn <joegwinn at comcast.net>
  • Date: Sun, 31 Oct 2010 02:09:56 -0500 (EST)
  • References: <iagldl$3cm$1@smc.vnet.net>

In article <iagldl$3cm$1 at smc.vnet.net>,
 Mark Coleman <markspcoleman at gmail.com> wrote:

> Hi,
> 
> I'm attempting to use Mathematica (v7.01) to Import the text from a PDF file.
> If I simply Import[] the file, it returns a list of graphics objects
> representing each page of the file. If I use the "Plaintext" option of
> Import[], it returns an empty list. My source pdf files were obtained
> from Google's Patent Search function. Just wondering if there is
> some option I am missing or if Mathematica cannot Import text from pdf files.

The PDF contains scanned page images (like a fax), not text.  Google 
Patents does have text generated by OCR of the scans, but even for plain 
English text the error rate is significant, at least 1% on older patents.  
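
A quick way to confirm that a given file has no text layer is to test what 
the "Plaintext" import actually returns.  This is only a sketch: the file 
name patent.pdf is a placeholder, and the result for an image-only file may 
differ between versions (an empty list was reported above for v7.01).

```mathematica
(* "patent.pdf" is a hypothetical file name; substitute your own path *)
text = Import["patent.pdf", "Plaintext"];

(* Treat anything that is not a non-blank string as "no extractable text".
   This also covers the empty list reported for scanned patents. *)
hasText = StringQ[text] && StringLength[StringTrim[text]] > 0
```

If hasText is False, the pages are most likely images, and no Import 
option will recover text from them.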

OCR of math equations is basically hopeless.  Nor are the published 
equations written in Mathematica.  You will have to transcribe them manually.

Joe Gwinn

