MathGroup Archive: September 2009 [00383]

Re: Efficient way to read binary data line-by-line

To: mathgroup at smc.vnet.net
Subject: [mg103248] Re: [mg103163] Efficient way to read binary data line-by-line
From: Leonid Shifrin <lshifr at gmail.com>
Date: Fri, 11 Sep 2009 05:28:59 -0400 (EDT)
References: <200909090846.EAA06073@smc.vnet.net>

Hi Peter,

While I don't know enough to answer your questions in depth, I would
suggest looking at BinaryReadList. It takes an optional third argument that
specifies how many elements to read at once. In the following
example, I read your test file in 1000 iterations, reading 1000 elements
at each iteration. Even though I used AppendTo to build the resulting
1000 x 1000 matrix, the timing is comparable to that of the single-shot
BinaryReadList call.

In[1]:= file = "C:\\test.dat";

In[2]:=
res = {};
str = OpenRead[file, BinaryFormat -> True];
Do[AppendTo[res,
    BinaryReadList[str, "Real64", 1000]], {1000}]; // AbsoluteTiming
Close[str];

Out[4]= {0.1250016, Null}

In[6]:= res1 = {};
str = OpenRead[file, BinaryFormat -> True];
res1 = BinaryReadList[str, "Real64", 1000000]; // AbsoluteTiming
Close[str];

Out[8]= {0.0937512, Null}

In[10]:= Flatten[res] == res1

Out[10]= True
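
As a side note, the AppendTo in the loop above is not essential. A minimal
sketch of the same chunked read using Table to collect the rows follows. The
temporary file name "chunk_test.dat" and the variable names are hypothetical,
chosen only so the snippet can be run on its own; it is an illustration, not
a timed comparison.

```mathematica
(* Hypothetical test file: 10^6 Real64 values in the temp directory *)
file = FileNameJoin[{$TemporaryDirectory, "chunk_test.dat"}];
Export[file, RandomReal[{-1, 1}, 10^6], "Real64"];

(* Read 1000 chunks of 1000 elements each; Table collects the rows,
   so no list is rebuilt on every iteration as with AppendTo *)
str = OpenRead[file, BinaryFormat -> True];
res2 = Table[BinaryReadList[str, "Real64", 1000], {1000}];
Close[str];

(* Compare against a single-shot read of the whole file *)
str = OpenRead[file, BinaryFormat -> True];
all = BinaryReadList[str, "Real64"];
Close[str];

Flatten[res2] == all
```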

If you keep the stream open, you can read more of your file when needed,
without losing the efficiency of BinaryReadList. It looks like the only
limitation is that you will not have random access to the file, only
sequential access. It would be nice if, in the future, the various read/write
Mathematica built-ins supported random-access streams, with functionality
similar to, say, the Java RandomAccessFile class.
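
To illustrate keeping the stream open, here is a rough sketch that processes
a file chunk by chunk without ever holding the whole dataset in memory. The
file name, chunkSize, and the running count/sum are assumptions made up for
the illustration. The loop stops as soon as a read no longer returns a
non-empty list, which is what I would expect at the end of the file.

```mathematica
(* Hypothetical test file: 10^5 Real64 values in the temp directory *)
file = FileNameJoin[{$TemporaryDirectory, "chunk_test.dat"}];
data = RandomReal[{-1, 1}, 10^5];
Export[file, data, "Real64"];

chunkSize = 1000;  (* elements per read; an arbitrary choice *)
count = 0; sum = 0.;

str = OpenRead[file, BinaryFormat -> True];
(* Continue only while the read yields a non-empty list; this ends the
   loop on an empty result or on any non-list value such as EndOfFile *)
While[MatchQ[chunk = BinaryReadList[str, "Real64", chunkSize], {__}],
  count += Length[chunk];   (* only the current chunk is in memory *)
  sum += Total[chunk]];
Close[str];

(* Element count, and the difference from the in-memory total,
   which should be zero up to rounding *)
{count, sum - Total[data]}
```

Choosing chunkSize as a multiple of the number of columns keeps each chunk
aligned with whole rows, so Partition[chunk, nCol] recovers the rows of that
chunk.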

Regards,
Leonid



On Wed, Sep 9, 2009 at 12:46 PM, pfalloon <pfalloon at gmail.com> wrote:

> Hi All,
> I am trying to set up an efficient procedure to handle large binary
> datasets in such a way that I can read (or write) them line-by-line
> without ever needing to have the entire dataset in memory.
>
> I have been using the BinaryRead/Write functions to do this, but am
> finding that they run significantly (dramatically even) slower than
> reading the entire file using Import. It would be great to know if
> anyone has found a solution for this and if not whether it's something
> that's likely to improve in future versions.
>
> Let me illustrate my attempts with an example (apologies for the
> length of this; I've tried to make it as succinct as possible while
> remaining non-trivial):
>
> (* initial definitions *)
> {nRow, nCol} = {100000,10};
> mat = RandomReal[{-1,1}, {nRow,nCol}];
> file = "C:\\falloon\\test.dat";
> fmt = ConstantArray["Real64", nCol];
>
>
> In[240]:= (* METHOD 1A: write to file using Export: very efficient *)
> Export[file, mat, "Real64"] // AbsoluteTiming
>
> Out[240]= {0.0937500,C:\falloon\test.dat}
>
>
> In[241]:= (* METHOD 2A: write to file line-by-line *)
> If[FileExistsQ[file], DeleteFile[file]];
> str = OpenWrite[file, BinaryFormat->True];
> Do[BinaryWrite[file, row, fmt], {row, mat}] // AbsoluteTiming
> Close[str];
>
> Out[249]= {2.1718750,Null}
>
>
> (* METHOD 3A: write to file element-by-element *)
> If[FileExistsQ[file], DeleteFile[file]];
> str = OpenWrite[file, BinaryFormat->True];
> Do[BinaryWrite[file, mat[[i,j]], "Real64"], {i,nRow}, {j,nCol}] //
> AbsoluteTiming
> Close[str];
>
> Out[253]= {11.4296875,Null}
>
>
> In[266]:= (* METHOD 1B: read entire file using Import *)
> mat2 = Partition[Import[file, "Real64"], nCol]; // AbsoluteTiming
> mat2 == mat
>
> Out[266]= {0.1093750,Null}
> Out[267]= True
>
>
> In[255]:= (* METHOD 2B: read file line-by-line *)
> str = OpenRead[file, BinaryFormat->True];
> mat2 = Table[BinaryRead[str, fmt], {nRow}]; // AbsoluteTiming
> Close[str];
> mat == mat2
>
> Out[256]= {11.7500000,Null}
> Out[258]= True
>
> In[259]:= (* METHOD 3B: read file element-by-element *)
> str = OpenRead[file, BinaryFormat->True];
> mat3 = Table[BinaryRead[str, "Real64"], {nRow}, {nCol}]; //
> AbsoluteTiming
> Close[str];
> mat == mat3
>
> Out[260]= {2.2812500,Null}
> Out[262]= True
>
> So, based on this example, I guess my question can be summarized as:
>
> 1. Why is line-by-line or element-by-element reading so much slower
> than importing all-at-once?
>
> 2. Why is line-by-line writing better than element-by-element, but
> vice versa when reading?
>
> 3. Is there any solution or workaround that avoids reading the entire
> file at once?
>
>
> Many thanks for any help!
>
> Cheers,
> Peter.
>
>
