MathGroup Archive: September 2009 [00351]

Re: Efficient way to read binary data line-by-line

To: mathgroup at smc.vnet.net
Subject: [mg103179] Re: [mg103163] Efficient way to read binary data line-by-line
From: "Kurt TeKolste" <tekolste at fastmail.us>
Date: Thu, 10 Sep 2009 07:17:57 -0400 (EDT)
References: <200909090846.EAA06073@smc.vnet.net>

I believe that the answer to this is trivial:

The expensive part of a write operation is the interaction with the
storage medium (disk drive?), which requires establishing the proper
logical and physical relationships.  Any write involves finding
physical space on the disk, which may or may not be a simple set of
contiguous physical addresses, and then updating the data that defines
the file so that the proper set of physical addresses is read in the
proper order to reconstruct the file.  This not only requires a lot of
instructions, it also involves the slowest interaction on your computer:
waiting for the read-write head to be in the proper location on the
disk.

These factors are inherent in the use of a disk drive.  The design
tradeoff is capacity and persistence against speed -- drives give
capacity and persistence, RAM and cache give speed.  
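
If a fixed cost is paid on every read or write call, whatever its
source, one possible workaround is to transfer many rows per call
instead of one.  The following is a minimal sketch under that
assumption, not a benchmarked solution: it reads the file in blocks
with BinaryReadList so that only one block is in memory at a time.
The block size blockRows, the helper processBlock and the temporary
file location are assumptions made for illustration; nRow, nCol and
the Export call follow the original post.

```mathematica
(* Set-up: write a test file as in the original post, but to an
   assumed temporary location. *)
{nRow, nCol} = {100000, 10};
mat = RandomReal[{-1, 1}, {nRow, nCol}];
file = FileNameJoin[{$TemporaryDirectory, "test.dat"}];
Export[file, mat, "Real64"];

blockRows = 10000;  (* assumed block size; tune to available memory *)
processBlock[rows_] := Total[rows, 2];  (* hypothetical per-block work *)

(* Read blockRows*nCol numbers per call until the stream is exhausted. *)
str = OpenRead[file, BinaryFormat -> True];
results = {};
While[True,
  flat = BinaryReadList[str, "Real64", blockRows*nCol];
  If[flat === {}, Break[]];
  AppendTo[results, processBlock[Partition[flat, nCol]]]
];
Close[str];
```

Each pass holds at most blockRows rows, so memory use is set by
blockRows rather than by the file size.  Writing can be batched the
same way by passing a flattened block of rows to a single BinaryWrite
call on the open stream.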

ekt

On Wed, 09 Sep 2009 04:46 -0400, "pfalloon" <pfalloon at gmail.com> wrote:
> Hi All,
> I am trying to set up an efficient procedure to handle large binary
> datasets in such a way that I can read (or write) them line-by-line
> without ever needing to have the entire dataset in memory.
> 
> I have been using the BinaryRead/Write functions to do this, but am
> finding that they run significantly (dramatically even) slower than
> reading the entire file using Import. It would be great to know if
> anyone has found a solution for this and if not whether it's something
> that's likely to improve in future versions.
> 
> Let me illustrate my attempts with an example (apologies for the
> length of this; I've tried to make it as succinct as possible while
> remaining non-trivial):
> 
> (* initial definitions *)
> {nRow, nCol} = {100000,10};
> mat = RandomReal[{-1,1}, {nRow,nCol}];
> file = "C:\\falloon\\test.dat";
> fmt = ConstantArray["Real64", nCol];
> 
> 
> In[240]:= (* METHOD 1A: write to file using Export: very efficient *)
> Export[file, mat, "Real64"] // AbsoluteTiming
> 
> Out[240]= {0.0937500,C:\falloon\test.dat}
> 
> 
> In[241]:= (* METHOD 2A: write to file line-by-line *)
> If[FileExistsQ[file], DeleteFile[file]];
> str = OpenWrite[file, BinaryFormat->True];
> Do[BinaryWrite[file, row, fmt], {row, mat}] // AbsoluteTiming
> Close[str];
> 
> Out[249]= {2.1718750,Null}
> 
> 
> (* METHOD 3A: write to file element-by-element *)
> If[FileExistsQ[file], DeleteFile[file]];
> str = OpenWrite[file, BinaryFormat->True];
> Do[BinaryWrite[file, mat[[i,j]], "Real64"], {i,nRow}, {j,nCol}] //
> AbsoluteTiming
> Close[str];
> Out[253]= {11.4296875,Null}
> 
> 
> In[266]:= (* METHOD 1B: read entire file using Import *)
> mat2 = Partition[Import[file, "Real64"], nCol]; // AbsoluteTiming
> mat2 == mat
> 
> Out[266]= {0.1093750,Null}
> Out[267]= True
> 
> 
> In[255]:= (* METHOD 2B: read file line-by-line *)
> str = OpenRead[file, BinaryFormat->True];
> mat2 = Table[BinaryRead[str, fmt], {nRow}]; // AbsoluteTiming
> Close[str];
> mat == mat2
> 
> Out[256]= {11.7500000,Null}
> Out[258]= True
> 
> In[259]:= (* METHOD 3B: read file element-by-element *)
> str = OpenRead[file, BinaryFormat->True];
> mat3 = Table[BinaryRead[str, "Real64"], {nRow}, {nCol}]; //
> AbsoluteTiming
> Close[str];
> mat == mat3
> 
> Out[260]= {2.2812500,Null}
> Out[262]= True
> 
> So, based on this example, I guess my question can be summarized as:
> 
> 1. Why is line-by-line or element-by-element reading so much slower
> than importing all-at-once?
> 
> 2. Why is line-by-line writing better than element-by-element, but
> vice versa when reading?
> 
> 3. Is there any solution or workaround that can avoid reading the
> entire file at once?
> 
> 
> Many thanks for any help!
> 
> Cheers,
> Peter.
> 
Regards,
Kurt Tekolste
