MathGroup Archive 2013

Re: Obtaining Random Line from a File

  • To: mathgroup at smc.vnet.net
  • Subject: [mg129864] Re: Obtaining Random LIne from A file
  • From: "Kevin J. McCann" <kjm at KevinMcCann.com>
  • Date: Tue, 19 Feb 2013 18:51:34 -0500 (EST)
  • References: <kfn7nt$qaj$1@smc.vnet.net> <kfq6mb$4us$1@smc.vnet.net> <kfv4vg$3g3$1@smc.vnet.net>

If you plan to do this millions of times, then your only hope is to load 
the file(s) into memory, e.g. with ReadList. If you do a disk access for 
each line, you will be waiting for quite a while. Memory is cheap.
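
A minimal sketch of the in-memory approach, assuming each file holds one record per line. The file names and the `fileLines` and `randomLine` names are made up for illustration and do not come from the thread; choosing the file uniformly at random is also an assumption.

```mathematica
(* Hypothetical file names; substitute the real paths. *)
files = {"data1.txt", "data2.txt"};

(* Read every file once, up front. Each entry of fileLines is a list of strings. *)
fileLines = ReadList[#, String] & /@ files;

(* Pick a random file, then a random line from it, with no disk access. *)
randomLine[] := RandomChoice[RandomChoice[fileLines]]

(* Inside the simulation loop, each draw is then a pure in-memory lookup. *)
line = randomLine[];
```

The cost of ReadList is paid once per file rather than once per iteration.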

Kevin

On 2/19/2013 1:09 AM, Ramiro wrote:
> Thank you so much for the reply. My files are 50 MB each, so I don't think ReadList would work for my purposes; it would be too slow. I am actually running an MCMC simulation with (hopefully, if I have time) millions of iterations, and in each one I need to read a random line from one of many files, so the read has to be as fast as possible. Any suggestions? The lines are all pretty much the same length.
>
> Thanks,
> Ramiro
>
> On Sunday, February 17, 2013 4:08:27 AM UTC-5, David Bailey wrote:
>> On 16/02/2013 06:07, Ramiro Barrantes wrote:
>>
>>> Hello,
>>>
>>> I would like to get a random line from a file. I know this can be done
>>> with Mathematica, but I am experimenting with sed to see if it is
>>> faster. Say I want to get line 1000.
>>>
>>> In Mathematica it would be:
>>>
>>> <<"! sed -n 1000p filename.txt"
>>>
>>> However, I am trying to put the filename in a variable, say
>>>
>>> filename="hugefile.txt"
>>>
>>> cmd="! sed -n 1000p "<>filename
>>> <<cmd
>>>
>>> does not work.
>>>
>>> How can I do this?
>>>
>>> Lastly, I am getting a random line using Mathematica by doing:
>>>
>>> getRandomLine[file_, n_] :=
>>>     Block[{i = RandomInteger[{1, n}], str = OpenRead[file], res},
>>>      Skip[str, "String", i];
>>>      res = Read[str, Expression];
>>>      Close[str];
>>>      res[[2]]
>>>      ]
>>>
>>> However, it is very slow, so I was going to try with sed. Any suggestions?
>>>
>>> Thanks in advance,
>>> Ramiro
>>>
>>
>> I would stick with Mathematica for this job! How big is the file
>> (number of lines and number of bytes)? If it fits comfortably inside
>> Mathematica, I'd see how well it works to read it all in as a list of
>> strings and pick the one you want:
>>
>> xx=ReadList["C:\\some file",String];//Timing
>>
>> Then you have a list of strings, and you can select what you want
>> directly.
>>
>> Remember, the basic problem with reading at an arbitrary position in a
>> text file is that, if the line lengths are not the same, any algorithm
>> has to read every line before the one you want! If you create this file
>> yourself, you should consider padding the lines to make them all the same
>> length - then you could access what you want very efficiently (but with a
>> little more coding!)
>>
>> David Bailey
>> http://www.dbaileyconsultancy.co.uk
>
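
On the original question of why `<<cmd` fails: `<<` takes the text after it literally as a file name, so `<<cmd` looks for a file called cmd instead of using the variable's value. One possible workaround, sketched here on the assumption that sed is on the path, is to pass the command string to a function instead. The variable names `n` and `line` are illustrative only.

```mathematica
filename = "hugefile.txt";
n = 1000;

(* Build the pipe command as an ordinary string.
   sed prints line n with the script "np", e.g. 1000p. *)
cmd = "!sed -n " <> ToString[n] <> "p " <> filename;

(* ReadList accepts a "!command" pipe and returns the output lines as strings. *)
line = First[ReadList[cmd, String]];

(* Get[cmd] is the function form of <<, but it evaluates the
   command's output as Mathematica input instead of returning text. *)
```

Each call still starts a process and scans the file up to line n, so this is unlikely to beat a list already held in memory.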
>
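
David Bailey's suggestion of equal-length lines could be sketched as follows, assuming every line has been padded to exactly `recLen` bytes, line terminator included, in a single-byte encoding. The names `recLen`, `nLines` and `recordAt` and the value 80 are hypothetical.

```mathematica
recLen = 80;  (* assumed bytes per line, line terminator included *)

(* Open the file once and keep the stream for the whole simulation. *)
str = OpenRead["hugefile.txt"];
nLines = Quotient[FileByteCount["hugefile.txt"], recLen];

(* Jump straight to the first byte of line i (1-based) and read that line. *)
recordAt[i_Integer] := (
  SetStreamPosition[str, (i - 1) recLen];
  Read[str, String]
)

line = recordAt[RandomInteger[{1, nLines}]];

(* Close the stream only when the simulation is finished. *)
Close[str];
```

The seek skips the preceding lines without reading them, which is the benefit of equal line lengths. The returned string still carries any padding characters.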
