MathGroup Archive 2009

Re: Efficient way to read binary data line-by-line

  • To: mathgroup at smc.vnet.net
  • Subject: [mg103238] Re: Efficient way to read binary data line-by-line
  • From: pfalloon <pfalloon at gmail.com>
  • Date: Fri, 11 Sep 2009 05:27:09 -0400 (EDT)
  • References: <200909090846.EAA06073@smc.vnet.net> <h8an98$hai$1@smc.vnet.net>

On Sep 10, 9:18 pm, "Kurt TeKolste" <tekol... at fastmail.us> wrote:
> I believe that the answer to this is trivial:
>
> The expensive part of a write operation is the interaction with the
> storage medium (disk drive?), which requires establishing the proper
> logical and physical relationships: any write involves finding the
> physical space on the disk, which may or may not be a simple set of
> contiguous physical addresses, and updating the data that defines the
> file so that the proper set of physical addresses is read back in the
> proper order to reconstruct the file. This not only requires many
> instructions; it also involves the slowest interaction on your
> computer -- waiting for the read-write head to reach the proper
> location on the disk.
>
> These factors are inherent in the use of a disk drive.  The design
> tradeoff is capacity and persistence against speed -- drives give
> capacity and persistence, RAM and cache give speed.  
>
> ekt
>
> On Wed, 09 Sep 2009 04:46 -0400, "pfalloon" <pfall... at gmail.com> wrote:
> > Hi All,
> > I am trying to set up an efficient procedure to handle large binary
> > datasets in such a way that I can read (or write) them line-by-line
> > without ever needing to have the entire dataset in memory.
>
> > I have been using the BinaryRead/Write functions to do this, but am
> > finding that they run significantly (dramatically, even) slower than
> > reading the entire file using Import. It would be great to know if
> > anyone has found a solution for this and, if not, whether it's
> > something that's likely to improve in future versions.
>
> > Let me illustrate my attempts with an example (apologies for the
> > length of this; I've tried to make it as succinct as possible while
> > remaining non-trivial):
>
> > (* initial definitions *)
> > {nRow, nCol} = {100000,10};
> > mat = RandomReal[{-1,1}, {nRow,nCol}];
> > file = "C:\\falloon\\test.dat";
> > fmt = ConstantArray["Real64", nCol];
>
> > In[240]:= (* METHOD 1A: write to file using Export: very efficient *)
> > Export[file, mat, "Real64"] // AbsoluteTiming
>
> > Out[240]= {0.0937500,C:\falloon\test.dat}
>
> > In[241]:= (* METHOD 2A: write to file line-by-line *)
> > If[FileExistsQ[file], DeleteFile[file]];
> > str = OpenWrite[file, BinaryFormat->True];
> > Do[BinaryWrite[str, row, fmt], {row, mat}] // AbsoluteTiming
> > Close[str];
>
> > Out[249]= {2.1718750,Null}
>
> > (* METHOD 3A: write to file element-by-element *)
> > If[FileExistsQ[file], DeleteFile[file]];
> > str = OpenWrite[file, BinaryFormat->True];
> > Do[BinaryWrite[file, mat[[i,j]], "Real64"], {i,nRow}, {j,nCol}] //
> > AbsoluteTiming
> > Close[str];
> > Out[253]= {11.4296875,Null}
>
> > In[266]:= (* METHOD 1B: read entire file using Import *)
> > mat2 = Partition[Import[file, "Real64"], nCol]; // AbsoluteTiming
> > mat2 == mat
>
> > Out[266]= {0.1093750,Null}
> > Out[267]= True
>
> > In[255]:= (* METHOD 2B: read file line-by-line *)
> > str = OpenRead[file, BinaryFormat->True];
> > mat2 = Table[BinaryRead[str, fmt], {nRow}]; // AbsoluteTiming
> > Close[str];
> > mat == mat2
>
> > Out[256]= {11.7500000,Null}
> > Out[258]= True
>
> > In[259]:= (* METHOD 3B: read file element-by-element *)
> > str = OpenRead[file, BinaryFormat->True];
> > mat3 = Table[BinaryRead[str, "Real64"], {nRow}, {nCol}]; // AbsoluteTiming
> > Close[str];
> > mat == mat3
>
> > Out[260]= {2.2812500,Null}
> > Out[262]= True
>
> > So, based on this example, I guess my questions can be summarized as:
>
> > 1. Why is line-by-line or element-by-element reading so much slower
> > than importing all at once?
>
> > 2. Why is line-by-line writing faster than element-by-element, but
> > the reverse true when reading?
>
> > 3. Is there any solution or workaround that avoids reading the
> > entire file at once?
>
> > Many thanks for any help!
>
> > Cheers,
> > Peter.
>
> Regards,
> Kurt Tekolste

Kurt, thanks for the comments. I appreciate your point, and I
certainly agree that these types of considerations would explain why
the Import versions (method 1 in my example) are faster than the
others.
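
(Incidentally, the single-call analogue of method 1 using the Binary*
functions may be worth noting; here is a minimal sketch, reusing file,
mat, and nCol from the example above, which I'd expect to land close to
the Export/Import timings, though I haven't benchmarked it carefully:)

str = OpenWrite[file, BinaryFormat -> True];
BinaryWrite[str, Flatten[mat], "Real64"]; (* all elements in one call *)
Close[str];
mat2 = Partition[BinaryReadList[file, "Real64"], nCol]; (* one call to read back *)
mat2 == mat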

But I don't think that's the whole story since, by this reasoning,
method 2B should presumably be faster than 3B, whereas the reverse is
true! (It has been suggested that BinaryReadList may be more suitable,
but I haven't tried it yet.)
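
(For completeness, the record-wise BinaryReadList call I have in mind
would look something like this sketch, reusing file, fmt, and mat from
above; it reads all nRow rows in a single call:)

mat2 = BinaryReadList[file, fmt]; (* fmt is the list of nCol "Real64" types *)
mat2 == mat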

Obviously working with everything in memory is the way to go if at all
possible; what I'm looking for is the optimal (or at least, a not-too-
sub-optimal) solution when that *isn't* possible.
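
(One workaround in that direction would be to read a fixed number of
rows per call from an open stream, so that only one chunk is ever in
memory at a time. A rough sketch, where chunkSize and processRows are
made-up names for illustration:)

chunkSize = 1000; (* rows to hold in memory at once *)
processRows[rows_] := Total[rows, 2]; (* placeholder for real per-chunk work *)
str = OpenRead[file, BinaryFormat -> True];
acc = 0.;
While[(chunk = BinaryReadList[str, fmt, chunkSize]) =!= {},
  acc += processRows[chunk]];
Close[str];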

Cheers,
Peter.

