Re: Counting Matching Patterns in a Large File
- To: mathgroup at smc.vnet.net
- Subject: [mg116177] Re: Counting Matching Patterns in a Large File
- From: Ulrich Arndt <ulrich.arndt at data2knowledge.de>
- Date: Fri, 4 Feb 2011 01:39:08 -0500 (EST)
The function g below does something similar, but reads m records at a time before the evaluation of the search condition starts:

g[filename_, n_, m_] :=
 Module[{str, ct, lct, eof, row},
  str = OpenRead[filename];
  ct = 0;   (* matching rows found so far *)
  lct = 0;  (* lines read so far *)
  eof = 0;
  While[ct < n && eof == 0,
   (* read a block of m lines before testing the pattern *)
   row = Table[Read[str, "String"], {i, 1, m}];
   lct += m;
   (* past the end of the file, Read returns the symbol EndOfFile *)
   If[Head[row[[-1]]] === Symbol,
    eof = 1;
    row = Cases[row, x_ /; Head[x] === String]];
   ct += Count[ToExpression[StringSplit[row, "\t"]],
     {a_?NumberQ, b_?NumberQ, c_?NumberQ}];
   ];
  Close[str];
  {ct, lct, If[ct < n, False, True]}];

On my laptop this saved more than 20% of the time, and m = 200 seems to be the optimal block length for my laptop and this data set. I would expect the optimal m to be related to the block size of the disk, but this is just a guess...

The test file contained 100000 records, of which 4958 rows had length 3, so n = 4000 scans about 80% of the file.

In[257]:= AbsoluteTiming[
  str = OpenRead["/tmp/BigFile.tsv"];
  count = 0;
  While[count < 4000 && (rl = ReadList[str, Record, 1]) =!= {},
   If[MatchQ[ToExpression /@ StringSplit[rl[[1]], "\t"],
     {a_?NumberQ, b_?NumberQ, c_?NumberQ}], count++]];
  Close[str];
  count]

Out[257]= {4.960967, 4000}

In[258]:= AbsoluteTiming[g["/tmp/BigFile.tsv", 4000, 200]]

Out[258]= {3.405888, {4009, 80600, True}}

To see how the timing depends on the block length m = 10, 20, ..., 1000:

r = {#, AbsoluteTiming[g["/tmp/BigFile.tsv", 2000, #]]} & /@ (Range[100]*10);
ListLinePlot[r[[All, 2, 1]]]

On 03.02.2011, at 11:33, Sjoerd C. de Vries wrote:

> Craig,
>
> Something like
>
> str = OpenRead["BigFile.tsv"];
> count = 0;
> While[count < 5000 && (rl = ReadList[str, Record, 1]) =!= {},
>  If[
>   MatchQ[ToExpression /@ StringSplit[rl[[1]], "\t"],
>    {a_?NumberQ, b_?NumberQ, c_?NumberQ}],
>   count++
>   ]
>  ]
> Close[str];
> count
>
> should do. It's not particularly good looking and I'm not too sure
> about efficiency.
>
> ReadList doesn't seem to be able to read variable numbers of fields
> per record, so I read each line as a string and split using
> StringSplit.
>
> Cheers -- Sjoerd
>
> On Feb 2, 12:08 pm, "W. Craig Carter" <ccar... at mit.edu> wrote:
>> MathGroup,
>>
>> (*
>> I'm trying to find a more efficient way to check if a file has more
>> than n lines that match a pattern.
>>
>> As a test, one might use a test example file obtained from:
>> *)
>>
>> Export["BigFile.tsv",
>>  Map[RandomReal[{0, 1}, {#}] &, RandomInteger[{1, 20}, {10000}]]]
>>
>> (*
>> Right now, I am using:
>> *)
>>
>> n = 5 (* for example *)
>>
>> Count[Import["BigFile.tsv", "Table"],
>>   {a_?NumberQ, b_?NumberQ, c_?NumberQ}] > n
>>
>> (*
>> But, in many cases, a count of 5 *would* be obtained well before the
>> end-of-file is reached.
>>
>> My target files are *much* larger than 10000 lines...
>>
>> I haven't dealt with Streams very much---I am guessing that is where
>> the answer lies.
>>
>> Many Thanks, Craig
>> *)
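
PS: A possible simplification, sketched but untested against the file above (g2 is just a hypothetical name): since ReadList can read a whole block of records at once and simply returns a shorter list (or {}) at the end of the file, the explicit EndOfFile test in g should not be needed, and the line count stays exact even when the final block is short.

g2[filename_, n_, m_] :=
 Module[{str, ct = 0, lct = 0, rows},
  str = OpenRead[filename];
  (* ReadList[str, Record, m] reads at most m records; at the end of
     the file it returns whatever is left, possibly {} *)
  While[ct < n && (rows = ReadList[str, Record, m]) =!= {},
   lct += Length[rows];  (* exact, even for a short final block *)
   ct += Count[ToExpression[StringSplit[rows, "\t"]],
     {a_?NumberQ, b_?NumberQ, c_?NumberQ}]];
  Close[str];
  {ct, lct, ct >= n}];

Called the same way, e.g. g2["/tmp/BigFile.tsv", 4000, 200], it should return the same kind of {count, lines, found} triple; the difference is that lct is exact at the end of the file, where g adds a full m even when fewer lines remain. Like g, it still tests whole blocks, so count can overshoot n by up to m - 1.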