Re: Counting Matching Patterns in a Large File
- To: mathgroup at smc.vnet.net
- Subject: [mg116161] Re: Counting Matching Patterns in a Large File
- From: Bill Rowe <readnews at sbcglobal.net>
- Date: Thu, 3 Feb 2011 05:34:26 -0500 (EST)
On 2/2/11 at 6:08 AM, ccarter at mit.edu (W. Craig Carter) wrote:

>(* I'm trying to find a more efficient way to check if a file has
>more than n lines that match a pattern.
>As a test, one might use a test example file obtained from: *)
>Export["BigFile.tsv", Map[RandomReal[{0, 1}, {#}] &,
>RandomInteger[{1, 20}, {10000}]]]
>(* Right now, I am using: *)
>n=5 (*for example*)
>Count[Import["BigFile.tsv", "Table"], {a_?NumberQ, b_?NumberQ,
>c_?NumberQ}] > n
>(* But, in many cases, a count of 5 *would* be obtained well before
>the end-of-file is reached.
>My target files are *much* larger than 10000 lines...
>I haven't dealt with Streams very much---I am guessing that is where
>the answer lies.

Not necessarily. You might find the function FindList will do what you need. If there is a specific identifier for the lines you want to count, FindList[file,text,n] will find the first n lines containing text. And it will be more efficient than the approach you are indicating above.

When you do Import[file,"Table"] you have two sources of inefficiency. First, Import is a general-purpose function and works with a wide variety of file formats. In order to do this, Import has to do some amount of computation to determine how things on a given line are to be represented in Mathematica. FindList avoids any conversion overhead since it simply reads the data in as text.

Additional overhead (both memory and time) occurs since Import will read in the entire file before anything is done to count the target lines. FindList will read to the end of the file only if that is required to get the number of lines you asked for, or if fewer lines in the file meet your criteria than you asked for.

Note, you may also be able to achieve satisfactory results using ReadList by using some of the options available to ReadList.
But if the lines in your target file are not formatted in a fairly structured way, ReadList will most likely return an error when it encounters a line that does not meet your expected pattern before it reaches the end of the file or obtains the data you want.
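
As a rough illustration of both ideas, here is a minimal sketch. It is not tested against your real files, and it rests on some assumptions: "ID" stands in for whatever text identifies your target lines, moreThanNMatchesQ is a hypothetical helper name, and the numbers in the file are taken to be written in plain decimal notation (so that NumberString recognizes them). Asking FindList for n + 1 lines is what lets you test for "more than n" without reading further than necessary. The stream version reads one line at a time and stops as soon as the count exceeds n.

```mathematica
n = 5;

(* FindList approach: "ID" is a placeholder for your own identifying text. *)
(* Requesting n + 1 lines is enough to decide whether more than n exist.   *)
hits = FindList["BigFile.tsv", "ID", n + 1];
Length[hits] > n

(* Stream approach (hypothetical helper): count lines that consist of      *)
(* exactly three numeric fields, stopping as soon as the count exceeds n.  *)
(* Assumes fields are whitespace/tab separated and in plain decimal form.  *)
moreThanNMatchesQ[file_String, n_Integer] :=
 Module[{stream = OpenRead[file], line, fields, count = 0},
  While[count <= n && (line = Read[stream, String]) =!= EndOfFile,
   fields = StringSplit[line];
   If[Length[fields] == 3 &&
      VectorQ[fields, StringMatchQ[#, NumberString] &],
    count++]];
  Close[stream];
  count > n]

moreThanNMatchesQ["BigFile.tsv", n]
```

The stream version never converts the text to numbers; it only checks the shape of each line, which keeps the per-line cost low. If your files use scientific notation or other separators, the field test would need to be adjusted accordingly.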