Efficient way to read binary data line-by-line
- To: mathgroup at smc.vnet.net
- Subject: [mg103163] Efficient way to read binary data line-by-line
- From: pfalloon <pfalloon at gmail.com>
- Date: Wed, 9 Sep 2009 04:46:55 -0400 (EDT)
Hi All,

I am trying to set up an efficient procedure to handle large binary datasets in such a way that I can read (or write) them line-by-line without ever needing to have the entire dataset in memory. I have been using the BinaryRead/BinaryWrite functions to do this, but am finding that they run significantly (dramatically, even) slower than reading the entire file using Import. It would be great to know if anyone has found a solution for this and, if not, whether it's something that's likely to improve in future versions.

Let me illustrate my attempts with an example (apologies for the length of this; I've tried to make it as succinct as possible while remaining non-trivial):

(* initial definitions *)
{nRow, nCol} = {100000, 10};
mat = RandomReal[{-1, 1}, {nRow, nCol}];
file = "C:\\falloon\\test.dat";
fmt = ConstantArray["Real64", nCol];

In[240]:= (* METHOD 1A: write to file using Export: very efficient *)
Export[file, mat, "Real64"] // AbsoluteTiming

Out[240]= {0.0937500, C:\falloon\test.dat}

In[241]:= (* METHOD 2A: write to file line-by-line *)
If[FileExistsQ[file], DeleteFile[file]];
str = OpenWrite[file, BinaryFormat -> True];
Do[BinaryWrite[file, row, fmt], {row, mat}] // AbsoluteTiming
Close[str];

Out[249]= {2.1718750, Null}

(* METHOD 3A: write to file element-by-element *)
If[FileExistsQ[file], DeleteFile[file]];
str = OpenWrite[file, BinaryFormat -> True];
Do[BinaryWrite[file, mat[[i, j]], "Real64"], {i, nRow}, {j, nCol}] // AbsoluteTiming
Close[str];

Out[253]= {11.4296875, Null}

In[266]:= (* METHOD 1B: read entire file using Import *)
mat2 = Partition[Import[file, "Real64"], nCol]; // AbsoluteTiming
mat2 == mat

Out[266]= {0.1093750, Null}
Out[267]= True

In[255]:= (* METHOD 2B: read file line-by-line *)
str = OpenRead[file, BinaryFormat -> True];
mat2 = Table[BinaryRead[str, fmt], {nRow}]; // AbsoluteTiming
Close[str];
mat == mat2

Out[256]= {11.7500000, Null}
Out[258]= True

In[259]:= (* METHOD 3B: read file element-by-element *)
str = OpenRead[file, BinaryFormat -> True];
mat3 = Table[BinaryRead[str, "Real64"], {nRow}, {nCol}]; // AbsoluteTiming
Close[str];
mat == mat3

Out[260]= {2.2812500, Null}
Out[262]= True

So, based on this example, I guess my questions can be summarized as:

1. Why is line-by-line or element-by-element reading so much slower than importing the whole file at once?
2. Why is line-by-line writing faster than element-by-element writing, yet element-by-element reading faster than line-by-line reading?
3. Is there any solution or workaround that avoids reading the entire file at once?

Many thanks for any help!

Cheers,
Peter
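P.S. Regarding question 3, one partial workaround I have been playing with (only a sketch, and I haven't benchmarked it carefully) is to read the file in fixed-size chunks with BinaryReadList, so that only one chunk is ever in memory at a time; the chunk size of 1000 rows and the names nChunk/mat4 below are just arbitrary placeholder choices:

(* read the file in chunks of nChunk rows; BinaryReadList[str, type, n]
   reads at most n values, and returns {} once the end of file is reached *)
nChunk = 1000;
str = OpenRead[file, BinaryFormat -> True];
mat4 = Join @@ Reap[
      While[(chunk = BinaryReadList[str, "Real64", nChunk*nCol]) =!= {},
        Sow[Partition[chunk, nCol]]  (* keep each chunk as a list of rows *)
      ]][[2, 1]];
Close[str];
mat == mat4

In a real application one would process or discard each chunk inside the loop instead of Sow-ing it, which is what actually keeps the full dataset out of memory; here I collect the chunks only to check the result against mat.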
- Follow-Ups:
- Re: Efficient way to read binary data line-by-line
- From: Leonid Shifrin <lshifr@gmail.com>
- Re: Efficient way to read binary data line-by-line
- From: "Kurt TeKolste" <tekolste@fastmail.us>
- Re: Efficient way to read binary data line-by-line