MathGroup Archive 2011

Re: Counting Matching Patterns in a Large File

  • To: mathgroup at smc.vnet.net
  • Subject: [mg116157] Re: Counting Matching Patterns in a Large File
  • From: "Sjoerd C. de Vries" <sjoerd.c.devries at gmail.com>
  • Date: Thu, 3 Feb 2011 05:33:41 -0500 (EST)
  • References: <iibdv9$npo$1@smc.vnet.net>

Craig,

Something like

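str = OpenRead["BigFile.tsv"];
count = 0;
(* Read one line at a time; ReadList returns {} at end of file.
   Stop early once 5000 matching lines have been counted. *)
While[count < 5000 && (rl = ReadList[str, Record, 1]) =!= {},
 If[
  (* Split the line on tabs and test for exactly three numeric fields *)
  MatchQ[ToExpression /@ StringSplit[rl[[1]], "\t"], {a_?NumberQ,
    b_?NumberQ, c_?NumberQ}],
  count++
  ]
 ]
Close[str];
count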

should do. It's not particularly good looking and I'm not too sure
about efficiency.

ReadList doesn't seem to be able to read variable numbers of fields
per record, so I read each line as a string and split using
StringSplit.
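
If you only need a yes/no answer to "more than n matching lines?", the same idea can be wrapped in a function, roughly as below. This is only a sketch: the name moreThanNMatchesQ is made up for illustration, it uses Read[str, String] (which returns EndOfFile at the end of the stream) in place of ReadList, and it assumes the file opens successfully and that the fields are tab-separated as in the test file.

```mathematica
(* Hypothetical helper: True if more than n lines of the file
   consist of exactly three numeric, tab-separated fields *)
moreThanNMatchesQ[file_String, n_Integer] :=
 Module[{str = OpenRead[file], line, count = 0},
  (* Stop reading as soon as the count exceeds n or the file ends *)
  While[count <= n && (line = Read[str, String]) =!= EndOfFile,
   If[MatchQ[ToExpression /@ StringSplit[line, "\t"],
     {_?NumberQ, _?NumberQ, _?NumberQ}],
    count++]];
  Close[str];
  count > n]

moreThanNMatchesQ["BigFile.tsv", 5]
```

Like the loop above, this still runs ToExpression on every field, so it is only appropriate for files whose contents you trust.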

Cheers -- Sjoerd

On Feb 2, 12:08 pm, "W. Craig Carter" <ccar... at mit.edu> wrote:
> MathGroup,
>
> (*
> I'm trying to find a more efficient way to check if a file has more than n lines that match a pattern.
>
> As a test, one might use a test example file obtained from:
> *)
>
> Export["BigFile.tsv", Map[RandomReal[{0, 1}, {#}] &, RandomInteger[{1, 20}, {10000}]]]
>
> (*
> Right now, I am using:
> *)
>
> n=5 (*for example*)
>
> Count[Import["BigFile.tsv", "Table"], {a_?NumberQ, b_?NumberQ, c_?NumberQ}] > n
>
> (*
> But, in many cases, a count of 5 *would* be obtained well before the end-of-file is reached.
>
> My target files are *much* larger than 10000 lines...
>
> I haven't dealt with Streams very much---I am guessing that is where the answer lies.
>
> Many Thanks, Craig
> *)


