Re: Obtaining Random LIne from A file
- To: mathgroup at smc.vnet.net
- Subject: [mg129872] Re: Obtaining Random LIne from A file
- From: David Bailey <dave at removedbailey.co.uk>
- Date: Tue, 19 Feb 2013 18:54:14 -0500 (EST)
- References: <kfn7nt$qaj$1@smc.vnet.net> <kfq6mb$4us$1@smc.vnet.net> <kfv4vg$3g3$1@smc.vnet.net>
On 19/02/2013 06:09, Ramiro wrote:
> Thank you so much for the reply. My files are 50MB each, I don't think ReadList would work for my purposes, it would be too slow. I am actually doing an MCMC simulation, doing (hopefully if I have time) millions of iterations and in each one I need to read a random line from one of many files, thus requiring this reading to happen as quickly as possible. Any suggestions? Each line is pretty much the same length.
>
> Thanks,
> Ramiro
>

OK - let's establish two points:

1) Are the records in the files of a fixed length?

2) When you say you want an 'arbitrary line', I am assuming that you calculate a number N and then want the N'th line of the file. If you really don't care which line you choose, use Ramiro's method (above).

If your files are not guaranteed to have equal-length records, there is obviously a problem, as I explained before, because you have to read all N-1 preceding lines to establish where the N'th begins. One option, therefore, might be to pre-process your files into fixed-length records by padding with blanks.

Once you have files with a fixed record length, you can open them with BinaryFormat->True, use SetStreamPosition to move the stream to the byte position where your record starts, and read the relevant number of bytes. Unless you are using extended characters, you can convert these bytes to characters with FromCharacterCode.

This should be VERY fast, because once all the files have been preprocessed, the cost of each access does not depend on the size of the file.

If the records are variable length but contain some identification such as a line number, another option would be to pull out a line as Ramiro suggested, and then use a binary chop procedure to zero in on the line of interest.

Hint: You may want to look at the processed file with a hex editor to make sure the record length is what you expect. Remember that Windows uses 2 characters per end of line!

David Bailey
http://www.dbaileyconsultancy.co.uk
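
The fixed-length-record approach described above can be sketched as follows. This is only an illustration of the arithmetic, written in Python rather than Mathematica; `seek` plays the role that SetStreamPosition plays in the description. The function names (`pad_to_fixed_records`, `read_record`, `random_record`) are made up for this sketch, and it assumes plain ASCII data with records numbered from 0.

```python
import os
import random


def pad_to_fixed_records(src_path, dst_path, newline=b"\n"):
    """Copy src_path to dst_path, padding every line with blanks to the
    length of the longest line. Returns the record length in bytes,
    including the end-of-line bytes."""
    # First pass: find the longest line, ignoring its line ending.
    with open(src_path, "rb") as src:
        width = max((len(line.rstrip(b"\r\n")) for line in src), default=0)
    # Second pass: write each line padded with spaces to that width.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            dst.write(line.rstrip(b"\r\n").ljust(width) + newline)
    return width + len(newline)


def read_record(f, n, record_len):
    """Return record n (0-based) from a file opened in binary mode.
    Trailing blanks are stripped, so any trailing spaces that were in
    the original line are lost too (an assumption of this sketch)."""
    f.seek(n * record_len)  # jump straight to the start of record n
    return f.read(record_len).rstrip(b" \r\n").decode("ascii")


def random_record(f, record_len, rng=random):
    """Return a uniformly chosen record; the cost does not grow with file size."""
    count = os.fstat(f.fileno()).st_size // record_len
    return read_record(f, rng.randrange(count), record_len)
```

In a long simulation you would presumably pad each file once, open it once in binary mode, and then call `random_record` on the open handle in every iteration. If the padded file is written with Windows line endings, pass `newline=b"\r\n"` so the returned record length counts both end-of-line bytes.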