Reading binary files

*To*: mathgroup at smc.vnet.net
*Subject*: [mg14652] Reading binary files
*From*: jenningsj at mail.utexas.edu (Jim Jennings)
*Date*: Sat, 7 Nov 1998 02:10:01 -0500
*Organization*: University of Texas at Austin
*Sender*: owner-wri-mathgroup at wolfram.com

I recently obtained the MathLink applications MathHDF and FastBinary from MathSource in my quest to get large data files into Mathematica faster. Both packages needed updating; they did not contain PowerMac native applications, and MathHDF did not work at all on my system:

    PowerMac 7100/80
    104MB RAM, virtual memory set to 105MB
    System 7.5.5
    Mathematica 3.0.1

With some tinkering I was able to recompile both packages into PowerMac native applications using the most recent MathLink goodies (the developer's kit that came on the Mathematica CD, plus the updated mathlink.h and SAmprep downloaded from the Wolfram Web site) and CodeWarrior Pro 2 (CodeWarrior IDE 2.1). The new MathHDF uses HDF 4.1r1. The new MathHDF has been submitted to MathSource; the new FastBinary will be submitted soon.

Here are the results of a simple benchmark reading a large array into Mathematica using various methods on the computer described above:

    Function        Package / method                          Time (seconds)
    ReadListBinary  FastBinary (ppc)                                23
    ReadSDS         MathHDF (ppc)                                   28
    ReadListBinary  FastBinary (68k)                               189
    ReadList        built-in function reading ASCII text           939
    ReadListBinary  standard package Utilities`BinaryFiles`       9597

The times are elapsed times, not CPU times. I was careful not to do anything else with the computer while the benchmarks were running.

The files contained a 21 by 30 by 300 array of 4-byte real numbers. The result for ReadListBinary from the package Utilities`BinaryFiles` is actually an estimate; a single 30 by 300 slice of the array was read and the measured time was multiplied by 21. ReadSDS read from a 741K HDF file containing the 3D array. ReadListBinary read from a 741K binary file containing 189,000 numbers in a single list (except for the Utilities`BinaryFiles` test, which read from a file with 9,000 numbers). ReadList read a 2.5MB ASCII text file with 189,000 numbers. The ASCII text file was created with Mathematica; the numbers ranged from 2 to 18 characters long.
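As a quick sanity check on the figures above, the arithmetic can be reproduced in a few lines. This is only a sketch in Python (not part of the original benchmark, which was run entirely in Mathematica); the dictionary keys and the helper names `slowdown` and `throughput` are made up for illustration, and the only inputs are the timings, array dimensions, and element size quoted in this message.

```python
# Elapsed times in seconds, copied from the benchmark figures in this message.
# The dictionary keys are illustrative labels, not real function names.
timings = {
    "FastBinary ReadListBinary (ppc)": 23,
    "MathHDF ReadSDS (ppc)": 28,
    "FastBinary ReadListBinary (68k)": 189,
    "ReadList (ASCII text)": 939,
    "Utilities`BinaryFiles` ReadListBinary (estimated)": 9597,
}

# Array shape and element size as stated in the message.
dims = (21, 30, 300)
bytes_per_real = 4

count = dims[0] * dims[1] * dims[2]   # total number of reals in the array
payload = count * bytes_per_real      # raw data bytes, ignoring any file overhead

fastest = min(timings.values())


def slowdown(name):
    """Elapsed time of a method relative to the fastest method."""
    return timings[name] / fastest


def throughput(name):
    """Raw data bytes delivered per elapsed second (hypothetical helper)."""
    return payload / timings[name]


if __name__ == "__main__":
    print(f"{count} numbers, {payload} bytes of raw data")
    for name in timings:
        print(f"{name:52s} {slowdown(name):7.1f}x {throughput(name):9.0f} bytes/s")
```

The raw payload works out to 189,000 numbers and 756,000 bytes, which is close to the 741K file size reported. Since the Utilities`BinaryFiles` figure is an extrapolation, dividing it by 21 recovers the time for the one slice that was actually read.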
The ppc native ReadListBinary and ReadSDS read the file in about the same time. This is satisfying, but it still seems somewhat slow to me for files of this size on this computer. Can anyone explain it?

The ppc native ReadListBinary has a factor of 8 advantage over the 68k version, which makes sense since the 68k version was running in emulation. It looks like it will be worth the trouble to submit the updated FastBinary to MathSource.

The ppc native ReadListBinary has a factor of 41 advantage over ReadList. For those of you wanting to read large files, it will be well worth the trouble to convert your files to binary (or better, make them that way in the first place) and use FastBinary or MathHDF.

The result for ReadListBinary from Utilities`BinaryFiles` is outrageous! Did I do something wrong? Anyone attempting to get faster reads by converting their ASCII files to binary and using this standard package will get a nasty surprise; it will be more than a factor of 10 slower than ReadList on the ASCII file! Can anyone explain this seemingly absurd behavior?

--
Jim Jennings, Research Associate    jenningsj at mail.utexas.edu
Bureau of Economic Geology          (512) 471-4364 (voice)
University of Texas at Austin       (512) 471-0140 (fax)