MathGroup Archive 2011

Re: Counting Matching Patterns in a Large File

  • To: mathgroup at smc.vnet.net
  • Subject: [mg116161] Re: Counting Matching Patterns in a Large File
  • From: Bill Rowe <readnews at sbcglobal.net>
  • Date: Thu, 3 Feb 2011 05:34:26 -0500 (EST)

On 2/2/11 at 6:08 AM, ccarter at mit.edu (W. Craig Carter) wrote:

>(* I'm trying to find a more efficient way to check if a file has
>more than n lines that match a pattern.

>As a test, one might use a test example file obtained from: *)

>Export["BigFile.tsv",   Map[RandomReal[{0, 1}, {#}] &,
>RandomInteger[{1, 20}, {10000}]]]

>(* Right now, I am using: *)

>n=5 (*for example*)

>Count[Import["BigFile.tsv", "Table"], {a_?NumberQ, b_?NumberQ,
>c_?NumberQ}] > n

>(* But, in many cases, a count of 5 *would* be obtained well before
>the end-of-file is reached.

>My target files are *much* larger than 10000 lines...

>I haven't dealt with Streams very much---I am guessing that is where
>the answer lies.

Not necessarily. You might find that the function FindList does
what you need. If there is a specific identifier for the lines
you want to count, FindList[file,text,n] will find the first n
lines containing text. It should also be more efficient than the
approach you show above.
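
As a minimal sketch of that idea, assume the lines of interest all
contain some known marker text. The string "marker" below is
hypothetical; substitute whatever identifies your lines. Asking for
n + 1 matches is enough to decide whether there are more than n:

```mathematica
n = 5;

(* "marker" is a placeholder for text that identifies the target lines *)
(* Request at most n + 1 matching lines, so the search can stop early *)
hits = FindList["BigFile.tsv", "marker", n + 1];

(* True if the file has more than n lines containing the marker *)
Length[hits] > n
```

Keep in mind that FindList matches text within a line. It does not
test a structural pattern such as "exactly three numbers".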

When you do Import[file,"Table"] you have two sources of
inefficiency. First, Import is a general purpose function and
works with a wide variety of file formats. In order to do this,
Import has to do some amount of computation to determine how
things on a given line are to be represented in Mathematica.
FindList avoids any conversion overhead since it simply reads
the data in as text.

Additional overhead (both memory and time) occurs since Import
will read in the entire file before anything is done to count
the target lines. FindList stops as soon as it has found the
number of lines you asked for. It reads to the end of the file
only when the last requested match lies there or the file holds
fewer matching lines than you requested.
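
For comparison, the same early-exit behaviour can be sketched with
an explicit stream when the test is structural rather than textual.
This is an illustration under assumptions, not something tested on
your data: moreThanNQ is a hypothetical helper name, fields are
assumed to be tab separated, and the code only counts the fields on
each line without verifying that each one is a number.

```mathematica
(* Hypothetical helper: True if more than n lines have exactly
   three tab-separated fields *)
moreThanNQ[file_String, n_Integer] :=
 Module[{stream = OpenRead[file], line, count = 0},
  (* Stop as soon as count exceeds n or the file ends *)
  While[count <= n && (line = Read[stream, String]) =!= EndOfFile,
   If[Length[StringSplit[line, "\t"]] == 3, count++]];
  Close[stream];
  count > n]

moreThanNQ["BigFile.tsv", 5]
```

Each line is read as plain text, so nothing is converted until the
line is split, and no more of the file is read once the count
exceeds n.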

Note, you may also be able to achieve satisfactory results with
ReadList and some of the options available to it. But if the
lines in your target file are not formatted in a fairly
structured way, the most likely result is an error when ReadList
encounters a line that does not match your expected pattern
before it has obtained the data you want.
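
A sketch of the ReadList route, assuming every line holds only
numbers separated by whitespace, as in the test file above. The
option RecordLists -> True groups the numbers from each line into
its own sublist. Written this way the whole file is still read,
and a non-numeric entry would typically produce a read error
rather than being skipped.

```mathematica
n = 5;

(* One sublist of numbers per line of the file *)
data = ReadList["BigFile.tsv", Number, RecordLists -> True];

(* True if more than n lines hold exactly three numbers *)
Count[data, {_, _, _}] > n
```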


