MathGroup Archive 2009

Efficient way to read binary data line-by-line

  • To: mathgroup at smc.vnet.net
  • Subject: [mg103163] Efficient way to read binary data line-by-line
  • From: pfalloon <pfalloon at gmail.com>
  • Date: Wed, 9 Sep 2009 04:46:55 -0400 (EDT)

Hi All,
I am trying to set up an efficient procedure to handle large binary
datasets in such a way that I can read (or write) them line-by-line
without ever needing to have the entire dataset in memory.

I have been using the BinaryRead/BinaryWrite functions to do this, but
am finding that they run significantly (even dramatically) slower than
reading the entire file using Import. It would be great to know if
anyone has found a solution for this and, if not, whether it's
something that's likely to improve in future versions.

Let me illustrate my attempts with an example (apologies for the
length of this; I've tried to make it as succinct as possible while
remaining non-trivial):

(* initial definitions *)
{nRow, nCol} = {100000,10};
mat = RandomReal[{-1,1}, {nRow,nCol}];
file = "C:\\falloon\\test.dat";
fmt = ConstantArray["Real64", nCol]; (* type specification for one row *)


In[240]:= (* METHOD 1A: write to file using Export: very efficient *)
Export[file, mat, "Real64"] // AbsoluteTiming

Out[240]= {0.0937500,C:\falloon\test.dat}


In[241]:= (* METHOD 2A: write to file line-by-line *)
If[FileExistsQ[file], DeleteFile[file]];
str = OpenWrite[file, BinaryFormat->True];
Do[BinaryWrite[str, row, fmt], {row, mat}] // AbsoluteTiming
Close[str];

Out[249]= {2.1718750,Null}


(* METHOD 3A: write to file element-by-element *)
If[FileExistsQ[file], DeleteFile[file]];
str = OpenWrite[file, BinaryFormat->True];
Do[BinaryWrite[str, mat[[i,j]], "Real64"], {i,nRow}, {j,nCol}] // AbsoluteTiming
Close[str];

Out[253]= {11.4296875,Null}


In[266]:= (* METHOD 1B: read entire file using Import *)
mat2 = Partition[Import[file, "Real64"], nCol]; // AbsoluteTiming
mat2 == mat

Out[266]= {0.1093750,Null}
Out[267]= True


In[255]:= (* METHOD 2B: read file line-by-line *)
str = OpenRead[file, BinaryFormat->True];
mat2 = Table[BinaryRead[str, fmt], {nRow}]; // AbsoluteTiming
Close[str];
mat == mat2

Out[256]= {11.7500000,Null}
Out[258]= True

In[259]:= (* METHOD 3B: read file element-by-element *)
str = OpenRead[file, BinaryFormat->True];
mat3 = Table[BinaryRead[str, "Real64"], {nRow}, {nCol}]; // AbsoluteTiming
Close[str];
mat == mat3

Out[260]= {2.2812500,Null}
Out[262]= True

So, based on this example, I guess my questions can be summarized as:

1. Why is line-by-line or element-by-element reading so much slower
than importing everything at once?

2. Why is writing faster line-by-line than element-by-element, but
the other way around when reading?

3. Is there any solution or workaround that avoids reading the entire
file at once? (A sketch of the kind of chunked reading I have in mind
follows below.)
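
For reference, here is a sketch of the chunked reading I have in mind
as a possible workaround for question 3. It uses BinaryReadList, which
can read a fixed number of values from an open stream; the chunk size
nChunk is an arbitrary choice of mine, and I've assumed for simplicity
that it divides nRow evenly:

(* METHOD 4B (sketch): read nChunk rows at a time, so that only one
chunk is ever in memory at once; BinaryReadList[str, "Real64", n]
reads up to n values starting from the current stream position *)
nChunk = 1000;
str = OpenRead[file, BinaryFormat->True];
mat4 = Join @@ Table[
   Partition[BinaryReadList[str, "Real64", nChunk*nCol], nCol],
   {nRow/nChunk}]; // AbsoluteTiming
Close[str];
mat == mat4

(Here mat4 ends up holding the whole matrix only so I can check it
against mat; in a real streaming application each chunk would be
processed and discarded inside the loop instead of being joined.)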


Many thanks for any help!

Cheers,
Peter.

