MathGroup Archive: August 1999 [00379]

Re: Re: out of memory reading large(?) file (Q:)
To: mathgroup at smc.vnet.net
Subject: [mg19358] Re: [mg18841] Re: out of memory reading large(?) file (Q:)
From: David Withoff <withoff at wolfram.com>
Date: Fri, 20 Aug 1999 23:09:36 -0400
Sender: owner-wri-mathgroup at wolfram.com
I finally got time to look at a message that was posted to mathgroup about
a month ago (22 Jul 1999) and which included some claims and suggestions
about ReadList that are not generally correct:

John A. Sidles <sidles at u.washington.edu> wrote:
> There is a non-obvious but *very* fast and *very* memory
> efficient way to read a text file as a list of numbers -- it
> is an order of magnitude faster than "ReadList", and handles
> more general file formats to boot.  This idiom is well-known to
> most Mathematica cogniscenti, and is rediscovered every few months
> by frustrated users --- so I guess it's time for us users to
> teach it (again) to you guys at Wolfram Inc headquarters!
> 
>   (* Open the file *)
>       stream = OpenRead["fileName here"];
> 
>   (* Read it in as one long string *)
>       theDataString = ReadList[stream,Record,RecordSeparators->{}][[1]];
> 
>   (* Convert it to a Mathematica expression *)
>       theDataArray = (
>       "{{"<>StringReplace[
>            StringDrop[theDataString,-1],
>            {"e"->"*10^",  (* trick for reading exponential notation! *)
>             "\t"->",",
>             "\n"->"},\n{"
>             }] <> "}}") //ToExpression;
> 
>   (* All done!  Release the string *)
>       Clear[theDataString];
> 
> In comparison to the above, you'll find ReadList[] to be
> slow, buggy, and a memory hog, to the point that (as the
> original poster "iMic", and I, and many other users have found)
> ReadList[] is simply unusable for importing large text files
> of data.

In typical examples that would be covered by the suggested alternative
approach (reading the file into one long string and converting that string
to an expression using ToExpression), ReadList by itself is faster and
uses less memory than this alternative.

For example, since the message that led to this thread referred to
a large file of zeroes and ones:

>In article <7lumlt$lf5 at smc.vnet.net>, "iMic" <schaferk at communique.net> writes:
>> i am trying to read in a large file (6.3MB) of zeroes (0.0) and ones (1.0).
>> after giving the kernel 50MB, the kernel reports out of memory. surely this
>> amount of memory should be enough. the version is Mathematica 3.0 
>> on a PowerMac 6500/225 with 96MB RAM.
>> i have tried before (on much smaller files), using unformatted binary output
>> of the file along with ReadBinary, only to find unbearable long read times
>> (> 5Mins).

I did an experiment with a large file of zeroes and ones, which I
generated using

Do[Write["file",
    OutputForm[Infix[{1.0, 0.0, 1.0, 0.0}, "\t"]]], {100000}]
Close["file"]

I read this file into an array using ReadList, in the current version
of Mathematica (Version 4) for Linux:

In[1]:= m = {MemoryInUse[], MaxMemoryUsed[]} ;

In[2]:= Timing[result = ReadList["file", Number, RecordLists->True]][[1]]

Out[2]= 5.07 Second

In[3]:= {MemoryInUse[], MaxMemoryUsed[]} - m

Out[3]= {10018392, 10334560}

and for comparison read the same file into an equivalent matrix using
the suggested alternative (reading the file into one long string and
using ToExpression to convert that string into an expression):

In[1]:= m = {MemoryInUse[], MaxMemoryUsed[]} ;

In[2]:= Timing[
            stream = OpenRead["file"];
            theDataString = ReadList[stream,Record,RecordSeparators->{}][[1]];
            Close[stream];
            theDataArray = (
            "{{"<>StringReplace[
                 StringDrop[theDataString,-1],
                 {"e"->"*10^",  (* trick for reading exponential notation! *)
                  "\t"->",",
                  "\n"->"},\n{"
                  }] <> "}}") //ToExpression;
            Clear[theDataString]][[1]]

Out[2]= 12.64 Second

In[3]:= {MemoryInUse[], MaxMemoryUsed[]} - m

Out[3]= {10020004, 14863856}

In this example ReadList by itself is about two and a half times as
fast as the suggested alternative (5.07 versus 12.64 seconds) and uses
less memory.  The alternative uses about 4.8 megabytes of extra memory
for intermediate results (a peak of 14863856 bytes against 10020004
bytes retained), while the extra memory (memory beyond that needed to
store the final result) used by ReadList alone was about 0.3 megabytes.
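
The measurement pattern used in both sessions above can be wrapped in a
small helper so that other approaches are compared the same way.  This
is only a sketch: the name "measure" is hypothetical (it is not a
built-in function), and it assumes the same built-ins used above
(Timing, MemoryInUse, MaxMemoryUsed).

```mathematica
(* measure is a hypothetical helper, not a built-in function. *)
(* HoldFirst keeps the argument unevaluated until Timing runs it. *)
SetAttributes[measure, HoldFirst];

measure[expr_] :=
  Module[{before = {MemoryInUse[], MaxMemoryUsed[]}, seconds},
    (* Timing evaluates expr once; only the CPU time is kept *)
    seconds = First[Timing[expr]];
    (* returns {time, {change in memory in use, change in peak}} *)
    {seconds, {MemoryInUse[], MaxMemoryUsed[]} - before}]

(* Usage, assuming "file" was generated as described above *)
measure[result = ReadList["file", Number, RecordLists -> True]]
```

MaxMemoryUsed reports the high-water mark for the whole session, so the
second memory figure is meaningful only when each approach is measured
in a fresh kernel, as was done above.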

I found similar behavior for all of the examples that I tried, both
in Version 3 and in Version 4 of Mathematica, and on other types of
computers (such as Mathematica for Windows).  ReadList by itself is
consistently faster than the suggested alternative and uses less
memory -- sometimes a lot less, especially for numbers in "e" (Fortran)
notation.  (In this regard, if the suggested alternative is used, the
replacement "e"->"*10^" really should be "e"->"*^".)
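
The difference between the two replacements can be inspected directly.
The following is a small sketch with made-up sample strings, not data
from the file above.  ReadList with the type Number should read the
"e" notation as it stands, giving the numbers 1000. and 0.025, while
wrapping the parsed input in HoldComplete shows what ToExpression has
to work with in each case.

```mathematica
(* The sample strings below are invented for illustration. *)

(* ReadList reads Fortran-style exponents without any rewriting *)
s = StringToStream["1.0e3\t2.5e-2"];
numbers = ReadList[s, Number];
Close[s];
numbers

(* "*^" is part of the number syntax, so this parses to one number *)
ToExpression["2.5*^-2", InputForm, HoldComplete]

(* "*10^" parses to a product with a power, which then has to be
   evaluated for every number in the file *)
ToExpression["2.5*10^-2", InputForm, HoldComplete]
```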

Although reading the entire file into one long string and processing
that string is not a recommended approach in general, that approach
is useful in some other examples that do not involve duplicating the
basic functionality of ReadList.  There are also quite a number of
other approaches that are of interest in other examples.  It is always
possible to read data into Mathematica in a way that is within a
tolerable factor of optimal.  The people in the technical support group
at Wolfram Research frequently can offer assistance with this sort
of thing.
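
As one illustration of another approach (my own assumption about what
might be useful, not something measured above): when the full matrix is
not needed at once, ReadList can be given a count so that the file is
processed a block of rows at a time.  The sketch assumes four numbers
per row, as in the test file, and the block size of 1000 is arbitrary.

```mathematica
(* Sum each column without holding the whole file in memory. *)
(* Assumes every row of "file" contains exactly four numbers. *)
stream = OpenRead["file"];
total = {0., 0., 0., 0.};

(* Each call reads at most 1000 rows; an empty list means end of file *)
While[
  (block = ReadList[stream, {Number, Number, Number, Number}, 1000]) =!= {},
  total += Plus @@ block];

Close[stream];
total
```

Only one block of rows is held at a time, so peak memory depends on
the block size rather than on the size of the file.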

Dave Withoff
Wolfram Research