MathGroup Archive: February 2013 [00236]

Re: Obtaining Random LIne from A file

To: mathgroup at smc.vnet.net
Subject: [mg129872] Re: Obtaining Random LIne from A file
From: David Bailey <dave at removedbailey.co.uk>
Date: Tue, 19 Feb 2013 18:54:14 -0500 (EST)

On 19/02/2013 06:09, Ramiro wrote:
> Thank you so much for the reply.  My files are 50MB each, I don't think ReadList would work for my purposes, it would be too slow.  I am actually doing an MCMC simulation, doing (hopefully if I have time) millions of iterations and in each one I need to read a random line from one of many files, thus requiring this reading to happen as quickly as possible. Any suggestions? Each line is pretty much the same length.
>
> Thanks,
> Ramiro
>

OK - let's establish two points:

1)      Are the records in the files of a fixed length?

2)      When you say you want an 'arbitrary line' I am assuming that you 
calculate a number N, and then want the N'th line of the file. If you 
really don't care which line you choose, use Ramiro's method (above).

If your files are not guaranteed to have equal-length records, there is 
obviously a problem, as I explained before, because you have to read the 
first N-1 lines to establish where the N'th begins. One option, therefore, 
might be to pre-process your files to make fixed-length records by padding 
with blanks.
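
A minimal sketch of that padding step, under some assumptions of mine: 
the text is plain ASCII, there are no blank lines, and each file is small 
enough to read whole once before the simulation starts. The helper name 
padFile and the demo file names are hypothetical.

```mathematica
(* padFile is a hypothetical helper; assumes plain ASCII and no blank lines *)
padFile[in_String, out_String] := Module[{lines, width, s},
  lines = ReadList[in, String];              (* one-off full read, done offline *)
  width = Max[StringLength /@ lines];        (* longest line sets the record width *)
  s = OpenWrite[out, BinaryFormat -> True];  (* binary mode: no end-of-line translation *)
  Scan[
   BinaryWrite[s, ToCharacterCode[
      # <> StringJoin[Table[" ", {width - StringLength[#]}]] <> "\n"]] &,
   lines];
  Close[s];
  width + 1]                                 (* bytes per record, including the "\n" *)

(* Small demo input so the sketch runs on its own *)
in = FileNameJoin[{$TemporaryDirectory, "ragged_demo.txt"}];
tmp = OpenWrite[in]; WriteString[tmp, "a\nbbb\ncc\n"]; Close[tmp];
out = FileNameJoin[{$TemporaryDirectory, "padded_demo.txt"}];
recLen = padFile[in, out]
```

The returned record length is the number you would later multiply by 
N-1 to find the byte offset of record N.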

Once you have fixed record length files, you can open them with 
BinaryFormat->True and use SetStreamPosition to set the stream to the 
position in bytes where your record starts, and read the relevant number 
of bytes. If every record occupies L bytes (including the end-of-line 
characters), record N starts at byte (N-1)*L, counting from zero. Unless 
you are using extended characters, you could convert these bytes to 
characters with FromCharacterCode.
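
Something along these lines should do it. This is only a sketch: the 
helper name readRecord, the demo file and its 6-byte record length are 
my own inventions for illustration, and it assumes single-byte 
characters.

```mathematica
(* Demo file with three 6-byte records so the sketch runs on its own *)
file = FileNameJoin[{$TemporaryDirectory, "fixed_records_demo.txt"}];
out = OpenWrite[file, BinaryFormat -> True];
BinaryWrite[out, ToCharacterCode["aaaaa\nbbbbb\nccccc\n"]];
Close[out];

(* readRecord is a hypothetical helper; recLen must count the line-end bytes too *)
readRecord[stream_, n_Integer, recLen_Integer] := (
  SetStreamPosition[stream, (n - 1) recLen];   (* byte offset of record n, from zero *)
  FromCharacterCode[BinaryReadList[stream, "Byte", recLen]])

stream = OpenRead[file, BinaryFormat -> True];
readRecord[stream, 2, 6]    (* the second record, line end included *)
Close[stream];
```

For millions of reads you would open each stream once and keep it open, 
rather than opening and closing the file on every iteration.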

This should be VERY fast, because the cost of each access does not grow 
with the size of the file (once all the files have been preprocessed).

If the records are variable length but contain some identification such 
as a line number, another option would be to pull out a line as Ramiro 
suggested, but then use a binary chop procedure to zero in on the line 
of interest.
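
A rough sketch of such a binary chop, assuming (my assumption, not 
something known about the real files) that every line begins with its 
1-based line number followed by a space, and that there are no blank 
lines. The names lineNumber and findLine, and the window argument, are 
hypothetical. The chop narrows a range of byte offsets, then a short 
sequential scan finishes the job.

```mathematica
(* Hypothetical layout: each line starts with its line number and a space *)
lineNumber[line_String] := FromDigits[First[StringSplit[line]]]

findLine[file_String, target_Integer, window_Integer : 4096] :=
 Module[{s, lo = 0, hi = FileByteCount[file], mid, line,
   result = Missing["NotFound"]},
  s = OpenRead[file, BinaryFormat -> True];
  (* lo always marks the start of a line at or before the target line *)
  While[hi - lo > window,
   mid = Quotient[lo + hi, 2];
   SetStreamPosition[s, mid];
   Read[s, String];                (* discard the line that contains byte mid *)
   line = Read[s, String];         (* first complete line after mid *)
   If[line === EndOfFile || lineNumber[line] >= target,
    hi = mid,
    lo = StreamPosition[s]]];
  (* short sequential scan from lo *)
  SetStreamPosition[s, lo];
  While[(line = Read[s, String]) =!= EndOfFile && lineNumber[line] <= target,
   If[lineNumber[line] == target, result = line; Break[]]];
  Close[s];
  result]

(* Demo file with 5000 numbered lines of varying length *)
file = FileNameJoin[{$TemporaryDirectory, "numbered_demo.txt"}];
out = OpenWrite[file, BinaryFormat -> True];
Do[WriteString[out, ToString[i] <> " " <>
    StringJoin[Table["x", {Mod[i, 7] + 1}]] <> "\n"], {i, 1, 5000}];
Close[out];
findLine[file, 3210]
```

Each lookup then costs a number of seeks that grows only with the 
logarithm of the file size, plus one short scan of at most a few 
kilobytes.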

Hint: You may want to look at the processed file with a hex editor, to 
make sure the record length is as you expect - remember that Windows uses 
2 characters (CR LF) per end of line!
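
If you don't have a hex editor to hand, the same check can be sketched 
inside Mathematica by looking at the raw bytes. The demo file below is 
made up for illustration and deliberately uses Windows-style line ends; 
the 4096-byte sample size is an arbitrary choice of mine.

```mathematica
(* Demo file with Windows-style line ends so the sketch runs on its own *)
file = FileNameJoin[{$TemporaryDirectory, "crlf_demo.txt"}];
out = OpenWrite[file, BinaryFormat -> True];
BinaryWrite[out, ToCharacterCode["aaaaa\r\nbbbbb\r\nccccc\r\n"]];
Close[out];

bytes = BinaryReadList[file, "Byte", 4096];  (* sample from the start of the file *)
ends = Flatten[Position[bytes, 10]];         (* 1-based offsets of each LF byte *)
recLens = Differences[Prepend[ends, 0]];     (* bytes per record, line end included *)
Union[recLens]                               (* a single value if the sampled records match *)
MemberQ[bytes, 13]                           (* True if CR bytes are present *)
```

This only inspects the start of the file, so it confirms the record 
length you should use but does not prove that every record in a large 
file has that length.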

David Bailey
http://www.dbaileyconsultancy.co.uk
