Re: Efficient way to read binary data line-by-line
- To: mathgroup at smc.vnet.net
- Subject: [mg103248] Re: [mg103163] Efficient way to read binary data line-by-line
- From: Leonid Shifrin <lshifr at gmail.com>
- Date: Fri, 11 Sep 2009 05:28:59 -0400 (EDT)
- References: <200909090846.EAA06073@smc.vnet.net>
Hi Peter,

While I don't know enough to answer your questions in depth, I would suggest looking at BinaryReadList. It takes an optional third argument that specifies how many elements to read at once. In the following example, I read your test file in 1000 iterations, reading 1000 elements per iteration. Even though I used AppendTo to build the resulting 1000 x 1000 matrix, the timing is comparable to that of a single-shot BinaryReadList call.

In[1]:= file = "C:\\test.dat";

In[2]:= res = {};
str = OpenRead[file, BinaryFormat -> True];
Do[AppendTo[res, BinaryReadList[str, "Real64", 1000]], {1000}]; // AbsoluteTiming
Close[str];

Out[4]= {0.1250016, Null}

In[6]:= res1 = {};
str = OpenRead[file, BinaryFormat -> True];
res1 = BinaryReadList[str, "Real64", 1000000]; // AbsoluteTiming
Close[str];

Out[8]= {0.0937512, Null}

In[10]:= Flatten[res] == res1

Out[10]= True

If you keep the stream open, you can read more of your file when needed, without losing the efficiency of BinaryReadList. It looks like the only limitation is that you will not have random access to the file, only sequential access. It would be nice if future versions of the Mathematica read/write built-ins supported random-access streams, with functionality similar to, say, Java's RandomAccessFile class.

Regards,
Leonid

On Wed, Sep 9, 2009 at 12:46 PM, pfalloon <pfalloon at gmail.com> wrote:

> Hi All,
> I am trying to set up an efficient procedure to handle large binary
> datasets in such a way that I can read (or write) them line-by-line
> without ever needing to have the entire dataset in memory.
>
> I have been using the BinaryRead/Write functions to do this, but am
> finding that they run significantly (dramatically even) slower than
> reading the entire file using Import. It would be great to know if
> anyone has found a solution for this and if not whether it's something
> that's likely to improve in future versions.
>
> Let me illustrate my attempts with an example (apologies for the
> length of this; I've tried to make it as succinct as possible while
> remaining non-trivial):
>
> (* initial definitions *)
> {nRow, nCol} = {100000,10};
> mat = RandomReal[{-1,1}, {nRow,nCol}];
> file = "C:\\falloon\\test.dat";
> fmt = ConstantArray["Real64", nCol];
>
>
> In[240]:= (* METHOD 1A: write to file using Export: very efficient *)
> Export[file, mat, "Real64"] // AbsoluteTiming
>
> Out[240]= {0.0937500,C:\falloon\test.dat}
>
>
> In[241]:= (* METHOD 2A: write to file line-by-line *)
> If[FileExistsQ[file], DeleteFile[file]];
> str = OpenWrite[file, BinaryFormat->True];
> Do[BinaryWrite[file, row, fmt], {row, mat}] // AbsoluteTiming
> Close[str];
>
> Out[249]= {2.1718750,Null}
>
>
> (* METHOD 3A: write to file element-by-element *)
> If[FileExistsQ[file], DeleteFile[file]];
> str = OpenWrite[file, BinaryFormat->True];
> Do[BinaryWrite[file, mat[[i,j]], "Real64"], {i,nRow}, {j,nCol}] //
> AbsoluteTiming
> Close[str];
> Out[253]= {11.4296875,Null}
>
>
> In[266]:= (* METHOD 1B: read entire file using Import *)
> mat2 = Partition[Import[file, "Real64"], nCol]; // AbsoluteTiming
> mat2 == mat
>
> Out[266]= {0.1093750,Null}
> Out[267]= True
>
>
> In[255]:= (* METHOD 2B: read file line-by-line *)
> str = OpenRead[file, BinaryFormat->True];
> mat2 = Table[BinaryRead[str, fmt], {nRow}]; // AbsoluteTiming
> Close[str];
> mat == mat2
>
> Out[256]= {11.7500000,Null}
> Out[258]= True
>
> In[259]:= (* METHOD 3B: read file element-by-element *)
> str = OpenRead[file, BinaryFormat->True];
> mat3 = Table[BinaryRead[str, "Real64"], {nRow}, {nCol}]; //
> AbsoluteTiming
> Close[str];
> mat == mat3
>
> Out[260]= {2.2812500,Null}
> Out[262]= True
>
> So, based on this example, I guess my question can be summarized as:
>
> 1. Why are line-by-line or element-by-element reading so much slower
> than importing all-at-once?
>
> 2. Why is line-by-line writing better than element-by-element, but
> vice versa when reading?
>
> 3. Is there any solution or workaround that can avoid reading entire
> file at once?
>
>
> Many thanks for any help!
>
> Cheers,
> Peter.
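
For illustration only, here is a minimal sketch of how the BinaryReadList suggestion in the reply could be used to process a file in blocks of rows, so that only one block is held in memory at a time. It is not code from either poster. The names tmpFile, chunkRows, chunk and colSums are hypothetical, the sizes are deliberately small, and the sketch assumes that BinaryReadList returns an empty list once the stream is exhausted.

```mathematica
(* Hypothetical names and sizes, chosen only for this sketch *)
{nRow, nCol} = {1000, 10};
chunkRows = 100;   (* number of rows read per call *)
mat = RandomReal[{-1, 1}, {nRow, nCol}];
tmpFile = FileNameJoin[{$TemporaryDirectory, "chunk-test.dat"}];

(* Write the test data as a flat sequence of Real64 values *)
str = OpenWrite[tmpFile, BinaryFormat -> True];
BinaryWrite[str, Flatten[mat], "Real64"];
Close[str];

(* Read chunkRows rows per call and accumulate column sums;
   the full matrix is never rebuilt in memory *)
str = OpenRead[tmpFile, BinaryFormat -> True];
colSums = ConstantArray[0., nCol];
While[(chunk = BinaryReadList[str, "Real64", chunkRows*nCol]) =!= {},
  colSums += Total[Partition[chunk, nCol]]];
Close[str];
DeleteFile[tmpFile];

(* Sanity check against the in-memory matrix; should be near zero *)
Max[Abs[colSums - Total[mat]]]
```

The column sums stand in for whatever per-block processing is actually needed. Larger values of chunkRows mean fewer BinaryReadList calls at the cost of more memory per block, which is the trade-off the timings in the reply point at.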
- References:
- Efficient way to read binary data line-by-line
- From: pfalloon <pfalloon@gmail.com>