Re: Re: out of memory reading large(?) file (Q:)
- To: mathgroup at smc.vnet.net
- Subject: [mg19358] Re: [mg18841] Re: out of memory reading large(?) file (Q:)
- From: David Withoff <withoff at wolfram.com>
- Date: Fri, 20 Aug 1999 23:09:36 -0400
- Sender: owner-wri-mathgroup at wolfram.com
I finally got time to look at a message that was posted to mathgroup
about a month ago (22 Jul 1999) and which included some claims and
suggestions about ReadList that are not generally correct:

John A. Sidles <sidles at u.washington.edu> wrote:

> There is a non-obvious but *very* fast and *very* memory
> efficient way to read a text file as a list of numbers -- it
> is an order of magnitude faster than "ReadList", and handles
> more general file formats to boot.  This idiom is well-known to
> most Mathematica cogniscenti, and is rediscovered every few months
> by frustrated users --- so I guess it's time for us users to
> teach it (again) to you guys at Wolfram Inc headquarters!
>
>   (* Open the file *)
>   stream = OpenRead["fileName here"];
>
>   (* Read it in as one long string *)
>   theDataString = ReadList[stream,Record,RecordSeparators->{}][[1]];
>
>   (* Convert it to a Mathematica expression *)
>   theDataArray = (
>     "{{"<>StringReplace[
>       StringDrop[theDataString,-1],
>       {"e"->"*10^",  (* trick for reading exponential notation! *)
>        "\t"->",",
>        "\n"->"},\n{"
>       }] <> "}}") //ToExpression;
>
>   (* All done! Release the string *)
>   Clear[theDataString];
>
> In comparison to the above, you'll find ReadList[] to be
> slow, buggy, and a memory hog, to the point that (as the
> original poster "iMic", and I, and many other users have found)
> ReadList[] is simply unusable for importing large text files
> of data.

In typical examples that would be covered by the suggested alternative
approach (reading the file into one long string and converting that
string to an expression using ToExpression), ReadList by itself is
faster and uses less memory than this alternative.

For example, since the message that led to this thread referred to
a large file of zeroes and ones:

> In article <7lumlt$lf5 at smc.vnet.net>, "iMic" <schaferk at communique.net> writes:
>> i am trying to read in a large file (6.3MB) of zeroes (0.0) and ones (1.0).
>> after giving the kernel 50MB, the kernel reports out of memory. surely this
>> amount of memory should be enough. the version is Mathematica 3.0
>> on a PowerMac 6500/225 with 96MB RAM.
>> i have tried before (on much smaller files), using unformatted binary output
>> of the file along with ReadBinary, only to find unbearable long read times
>> (> 5Mins).

I did an experiment with a large file of zeroes and ones, which I
generated using

    Do[Write["file", OutputForm[Infix[{1.0, 0.0, 1.0, 0.0}, "\t"]]], {100000}]

I read this file into an array using ReadList, in the current version
of Mathematica (Version 4) for Linux:

    In[1]:= m = {MemoryInUse[], MaxMemoryUsed[]} ;

    In[2]:= Timing[result = ReadList["file", Number, RecordLists->True]][[1]]

    Out[2]= 5.07 Second

    In[3]:= {MemoryInUse[], MaxMemoryUsed[]} - m

    Out[3]= {10018392, 10334560}

and for comparison read the same file into an equivalent matrix using
the suggested alternative (reading the file into one long string and
using ToExpression to convert that string into an expression):

    In[1]:= m = {MemoryInUse[], MaxMemoryUsed[]} ;

    In[2]:= Timing[
              stream = OpenRead["file"];
              theDataString = ReadList[stream,Record,RecordSeparators->{}][[1]];
              Close[stream];
              theDataArray = (
                "{{"<>StringReplace[
                  StringDrop[theDataString,-1],
                  {"e"->"*10^",  (* trick for reading exponential notation! *)
                   "\t"->",",
                   "\n"->"},\n{"
                  }] <> "}}") //ToExpression;
              Clear[theDataString]][[1]]

    Out[2]= 12.64 Second

    In[3]:= {MemoryInUse[], MaxMemoryUsed[]} - m

    Out[3]= {10020004, 14863856}

In this example ReadList by itself is about twice as fast as the
suggested alternative and uses less memory.  The alternative uses
about four megabytes of extra memory for intermediate results, while
the extra memory (memory beyond that needed to store the final result)
used by ReadList alone was negligible.
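For anyone who wants to repeat this kind of comparison on their own
files, the measurement pattern used in the sessions above can be
wrapped in a small helper.  This is only a sketch: the name
measureRead is hypothetical (it is not a built-in function and does
not appear in the original messages), and it assumes that "file" was
generated as described above.

```mathematica
(* Hypothetical helper, not part of Mathematica: evaluates its argument
   once and returns the CPU time, the change in {MemoryInUse[],
   MaxMemoryUsed[]}, and the dimensions of the result. *)
SetAttributes[measureRead, HoldFirst];  (* keep expr unevaluated until Timing runs it *)

measureRead[expr_] :=
  Module[{before, time, value},
    before = {MemoryInUse[], MaxMemoryUsed[]};
    {time, value} = Timing[expr];
    (* value is still held here, so the memory difference includes the result *)
    {time, {MemoryInUse[], MaxMemoryUsed[]} - before, Dimensions[value]}
  ]

(* Assumes "file" exists and holds tab-separated numbers, one row per line. *)
measureRead[ReadList["file", Number, RecordLists -> True]]
```

Since MaxMemoryUsed[] is a high-water mark for the whole kernel
session, each approach is best measured in a fresh kernel, as in the
two separate sessions shown above.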
I found similar behavior for all of the examples that I tried, both
in Version 3 and in Version 4 of Mathematica, and on other types of
computers (such as Mathematica for Windows).  ReadList by itself is
consistently faster than the suggested alternative and uses less
memory -- sometimes a lot less, especially for numbers in "e"
(Fortran) notation.  (In this regard, if the suggested alternative
is used, the replacement "e"->"*10^" really should be "e"->"*^".)

Although reading the entire file into one long string and processing
that string is not a recommended approach in general, that approach
is useful in some other examples that do not involve duplicating the
basic functionality of ReadList.  There are also quite a number of
other approaches that are of interest in other examples.

It is always possible to read data into Mathematica in a way that
is within a tolerable factor of optimal.  The people in the technical
support group at Wolfram Research frequently can offer assistance
with this sort of thing.

Dave Withoff
Wolfram Research