Re: shuffling 10^8 numbers
- To: mathgroup at smc.vnet.net
- Subject: [mg53190] Re: shuffling 10^8 numbers
- From: David Bailey <dave at Remove_Thisdbailey.co.uk>
- Date: Tue, 28 Dec 2004 23:12:06 -0500 (EST)
- References: <cqrgpe$qgv$1@smc.vnet.net>
- Sender: owner-wri-mathgroup at wolfram.com
George Szpiro wrote:
> Hi,
>
> I am trying to shuffle 10^8 numbers stored in the file GG.doc in the root
> directory. (Size of GG.doc approx. 360 MB)
>
> According to previous suggestions from this group I try to shuffle them
> with the following program:
>
> GG=OpenRead["c:\GG.doc"];
> AA=ReadList[GG];
> Timing[
> OrigList=Table[AA];
> p=RandomPermutation@Length@OrigList;
> ShuffledList=OrigList[[p]];
>
>
> But the file is far too big. I can read it but then I get the following
> error message:
>
> <<No more memory available. Mathematica kernel has shut down. Try quitting
> other applications and then retry.>>
>
> No other programs are open, so I guess I am at the limit. Can anybody
> suggest a workaround? Is there a possibility to shuffle numbers without
> loading them all into memory simultaneously?
>
> NEW IDEA: I thought there might be a possibility of just reading one single
> number each time from the file GG.doc, and putting them into a randomly
> chosen slot in a new file.
>
> Any answers greatly appreciated to:
> george at netvision.net.il
>
> Thanks,
> George
>
>

I think speed may be a problem with a file of this size whatever you do, but you could read your file one number at a time using Read (with a type of Number). Then, if you have version 5.1, you could use BinaryWrite to write the values to a number of smaller binary files. You could then shuffle each file in turn, and combine them by repeatedly reading a number from a randomly chosen file and writing it to your final output file (presumably in text format). Since you would never have the whole file in memory at one time, you should not hit memory limits.

Before spending a great deal of effort on this, I would time a program that simply reads your file one number at a time and writes another (without shuffling). The cost of I/O is likely to dominate, so this will give you an idea of the performance to expect. If it is too slow, you may have to think about C++.
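The split, shuffle and merge steps described above could be sketched roughly as follows. This is an illustration in Python rather than Mathematica, and it rests on several assumptions that are not in the original post: the input is whitespace-separated text with line breaks, every number is stored as an 8-byte float in the intermediate files, and all function names (`split_to_chunks`, `shuffle_chunk`, `merge_chunks`, `shuffle_file`) are made up for the example. One refinement is also added: the merge picks a chunk with probability proportional to the numbers it still holds, instead of picking uniformly among files, so that every ordering of the output is equally likely.

```python
import os
import random
import struct
import tempfile

RECORD = struct.Struct("<d")  # assumption: each number is kept as an 8-byte float


def split_to_chunks(text_path, chunk_dir, chunk_size):
    """Stream numbers from the text file into binary chunks of chunk_size values."""
    paths, out, count = [], None, 0
    with open(text_path) as src:
        for line in src:  # assumption: the file has line breaks
            for token in line.split():
                if count % chunk_size == 0:
                    if out is not None:
                        out.close()
                    paths.append(os.path.join(chunk_dir, f"chunk{len(paths)}.bin"))
                    out = open(paths[-1], "wb")
                out.write(RECORD.pack(float(token)))
                count += 1
    if out is not None:
        out.close()
    return paths


def shuffle_chunk(path, rng):
    """Load one chunk (small enough for memory), shuffle it, write it back."""
    with open(path, "rb") as f:
        values = [v for (v,) in RECORD.iter_unpack(f.read())]
    rng.shuffle(values)
    with open(path, "wb") as f:
        for v in values:
            f.write(RECORD.pack(v))
    return len(values)


def merge_chunks(paths, counts, out_path, rng):
    """Interleave the shuffled chunks, weighting each pick by what a chunk has left."""
    files = [open(p, "rb") for p in paths]
    remaining = list(counts)
    try:
        with open(out_path, "w") as out:
            while any(remaining):
                i = rng.choices(range(len(files)), weights=remaining)[0]
                (value,) = RECORD.unpack(files[i].read(RECORD.size))
                out.write(f"{value!r}\n")
                remaining[i] -= 1
    finally:
        for f in files:
            f.close()


def shuffle_file(text_path, out_path, chunk_size, seed=None):
    """Shuffle a large text file of numbers without holding it all in memory."""
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as tmp:
        paths = split_to_chunks(text_path, tmp, chunk_size)
        counts = [shuffle_chunk(p, rng) for p in paths]
        merge_chunks(paths, counts, out_path, rng)
```

A call such as `shuffle_file("GG.doc", "GG_shuffled.txt", chunk_size=10**6)` would keep at most one chunk in memory at a time; the chunk size and the output file name here are only placeholders.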
Your idea of writing each number into a randomly chosen slot would work with a binary file (where every number takes the same number of bytes), but you would have to ensure that you did not write several numbers into the same slot (and therefore leave others empty).

David Bailey
dbaileyconsultancy.co.uk
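One way to guarantee that no slot is written twice is to swap records inside the binary file instead of placing them at random. The sketch below uses an in-place Fisher-Yates shuffle, which is a different technique from the random-slot idea in the question but relies on the same fixed-width layout. It is written in Python for illustration only, and the 8-byte record format and the name `shuffle_in_place` are assumptions. Each step seeks to two positions in the file, so for 10^8 records the I/O cost would still need to be timed.

```python
import random
import struct

RECORD = struct.Struct("<d")  # assumption: every number occupies 8 bytes


def shuffle_in_place(path, seed=None):
    """Fisher-Yates shuffle of fixed-width records, done directly on disk."""
    rng = random.Random(seed)
    size = RECORD.size
    with open(path, "r+b") as f:
        f.seek(0, 2)                  # jump to the end to measure the file
        n = f.tell() // size          # number of complete records
        for i in range(n - 1, 0, -1):
            j = rng.randint(0, i)     # partner slot, chosen from 0..i inclusive
            if j == i:
                continue
            f.seek(i * size)
            a = f.read(size)
            f.seek(j * size)
            b = f.read(size)
            # Swap the two records, so every slot always holds exactly one number.
            f.seek(i * size)
            f.write(b)
            f.seek(j * size)
            f.write(a)
```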