Re: Working with huge text files with mathematica

*To*: mathgroup at smc.vnet.net
*Subject*: [mg56596] Re: Working with huge text files with mathematica
*From*: Maxim <ab_def at prontomail.com>
*Date*: Fri, 29 Apr 2005 03:22:11 -0400 (EDT)
*References*: <d4clv2$357$1@smc.vnet.net> <d4mrtu$1uq$1@smc.vnet.net>
*Sender*: owner-wri-mathgroup at wolfram.com

On Wed, 27 Apr 2005 02:03:42 +0000 (UTC), David Bailey
<dave at Remove_Thisdbailey.co.uk> wrote:

> Andrey Shevchuk wrote:
>> Hi Everybody,
>>
>> I face a problem when try to read in a huge data file with my
>> Mathematica5.1
>>
>> The file is approx. 4Gb large and was created by another Mathematica
>> application (So, I think Mathematica should be able to handle it).
>>
>> Now, if I try to use Read on a stream from this file and it returns
>> EndOfFile and nothing else!
>> I checked the file with the HexEdit (an editor for huge files) and it
>> is not corrupt and has actually the data I need.
>>
>> A similar file (from the same application) but 1Gb works perfectly.
>> Any ideas?
>>
>> Does Mathematica have an internal limit for the file size (and if yes,
>> can one somehow override this option) ?
>>
>> I would appreciate any feedback!
>>
> It would seem from what others have written that there is a 2G limit on
> the size of files that can be read by Mathematica under Windows. This
> must be a bug in Mathematica - Windows itself can exceed this limit.
> Rather than changing operating systems or splitting the file, you might
> want to try using J/Link and reading the file via Java. For best
> performance, you might find it was better to read the file in
> substantial chunks and buffer it inside Mathematica, however.
>
> David Bailey
> dbaileyconsultancy.co.uk
>

Here's my attempt to do it with J/Link.
Creating a test file:

  sz = 10^6;
  chunks = Ceiling[2^31 / sz];
  L = Table[Random[Integer, {0, 255}], {sz}];
  f = OpenWrite["/math/2gb", BinaryFormat -> True];
  Do[BinaryWrite[f, L], {chunks}]
  Close[f];

And now reading it back (note that the read buffer has to be created on
the Java side):

  <<jlink`
  InstallJava[];
  f = JavaNew["java.io.FileInputStream", "/math/2gb"];
  L2 = JavaNew["[B", sz];
  goodF = And@@ Table[
    f@read[L2] == sz &&
      Mod[JavaObjectToExpression[L2], 256] == L,
    {chunks}];
  f@close[];
  ReleaseJavaObject[L2];
  goodF

This should return True, but there is a problem: during the evaluation of
Table memory usage remains constant (as one would expect) until we have
read approximately half of the file, after which memory usage begins to
grow almost linearly, reaching 2Gb and crashing the Mathematica kernel. So
we can check that the first parts of the file are being read correctly,
but it's still impossible to process the whole file in one go (without
skipping some parts using f@skip or RandomAccessFile).

Testing the parts of the code separately, we can see that
JavaObjectToExpression is the culprit:

  L2 = JavaNew["[B", sz];
  Do[JavaObjectToExpression[L2], {chunks}]

causes the memory blow-up.

Maxim Rytin
m.r at inbox.ru