MathGroup Archive 2011


Re: Counting Matching Patterns in a Large File

  • To: mathgroup at smc.vnet.net
  • Subject: [mg116177] Re: Counting Matching Patterns in a Large File
  • From: Ulrich Arndt <ulrich.arndt at data2knowledge.de>
  • Date: Fri, 4 Feb 2011 01:39:08 -0500 (EST)

The function g does something similar, but reads m records at a time before evaluating the search condition:

g[filename_, n_, m_] := Module[{str, ct, lct, eof, row},
   str = OpenRead[filename];
   ct = 0;  (* matching rows found so far *)
   lct = 0; (* lines read so far; counted in whole blocks of m,
               so it may overshoot slightly at end-of-file *)
   eof = 0;
   While[ct < n && eof == 0,
     row = Table[Read[str, "String"], {i, 1, m}];
     lct += m;
     (* past the end of the file, Read returns the symbol EndOfFile;
        drop those entries and stop after this block *)
     If[Head[row[[-1]]] === Symbol, eof = 1;
      row = Cases[row, x_ /; Head[x] === String]];
     ct += Count[
       ToExpression[StringSplit[row, "\t"]], {a_?NumberQ, b_?NumberQ,
        c_?NumberQ}];
     ];
   Close[str];
   {ct, lct, If[ct < n, False, True]}
   ];

On my laptop it saves more than 20% of the time, and m = 200 seems to be the optimal block length for my laptop and this data set. I would expect the optimal m to be related to the block size of the disk - but this is just a guess....
The file contained 100000 records, of which 4958 were rows of length 3, so n = 4000 scans about 80% of the file.
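A slightly simpler variant (an untested sketch of the same idea, not from the thread) uses ReadList with a count: ReadList[str, "String", m] reads at most m strings and simply returns a shorter list at end-of-file, so the EndOfFile check and the Cases filter disappear, and the line count stays exact:

g2[filename_, n_, m_] := Module[{str, ct = 0, lct = 0, block},
   str = OpenRead[filename];
   While[ct < n && (block = ReadList[str, "String", m]) =!= {},
     lct += Length[block]; (* exact, even for a short final block *)
     ct += Count[
       ToExpression[StringSplit[block, "\t"]], {a_?NumberQ, b_?NumberQ,
        c_?NumberQ}]
     ];
   Close[str];
   {ct, lct, ct >= n}
   ];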

In[257]:= AbsoluteTiming[
 str = OpenRead["/tmp/BigFile.tsv"];
 count = 0;
 While[count < 4000 && (rl = ReadList[str, Record, 1]) =!= {},
   If[MatchQ[
     ToExpression /@ StringSplit[rl[[1]], "\t"], {a_?NumberQ,
      b_?NumberQ, c_?NumberQ}], count++]];
  Close[str];
 count]

Out[257]= {4.960967, 4000}

In[258]:=
AbsoluteTiming[
 g["/tmp/BigFile.tsv", 4000, 200]
 ]

Out[258]= {3.405888, {4009, 80600, True}}

r = {#, AbsoluteTiming[
      g["/tmp/BigFile.tsv", 2000, #]
      ]} & /@ (Range[100]*10);

ListLinePlot[r[[All, 2, 1]]]
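Each entry of r pairs a value of m with its timing, so to plot the timing against m itself rather than against the list index, one could use something like (a sketch, not from the thread):

ListLinePlot[Transpose[{r[[All, 1]], r[[All, 2, 1]]}],
 AxesLabel -> {"m (block length)", "time / s"}]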


Am 03.02.2011 um 11:33 schrieb Sjoerd C. de Vries:

> Craig,
>
> Something like
>
> str = OpenRead["BigFile.tsv"];
> count = 0;
> While[count < 5000 && (rl = ReadList[str, Record, 1]) =!= {},
> If[
>  MatchQ[ToExpression /@ StringSplit[rl[[1]], "\t"], {a_?NumberQ,
>    b_?NumberQ, c_?NumberQ}],
>  count++
>  ]
> ]
> Close[str];
> count
>
> should do. It's not particularly good looking and I'm not too sure
> about efficiency.
>
> ReadList doesn't seem to be able to read variable numbers of fields
> per record, so I read each line as a string and split using
> StringSplit.
>
> Cheers -- Sjoerd
>
> On Feb 2, 12:08 pm, "W. Craig Carter" <ccar... at mit.edu> wrote:
>> MathGroup,
>>
>> (*
>> I'm trying to find a more efficient way to check if a file has more than n lines that match a pattern.
>>
>> As a test, one might use a test example file obtained from:
>> *)
>>
>> Export["BigFile.tsv", Map[RandomReal[{0, 1}, {#}] &, RandomInteger[{1, 20}, {10000}]]]
>>
>> (*
>> Right now, I am using:
>> *)
>>
>> n=5 (*for example*)
>>
>> Count[Import["BigFile.tsv", "Table"], {a_?NumberQ, b_?NumberQ, c_?NumberQ}] > n
>>
>> (*
>> But, in many cases, a count of 5 *would* be obtained well before the end-of-file is reached.
>>
>> My target files are *much* larger than 10000 lines...
>>
>> I haven't dealt with Streams very much---I am guessing that is where the answer lies.
>>
>> Many Thanks, Craig
>> *)
>
>
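
As Sjoerd notes, ReadList cannot read a variable number of fields per record, so each line is read as a string and split manually. On a single sample line (illustrative values, not from the thread) the conversion looks like:

ToExpression /@ StringSplit["0.12\t0.34\t0.56", "\t"]
(* {0.12, 0.34, 0.56}, which matches {a_?NumberQ, b_?NumberQ, c_?NumberQ} *)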

