Regular Expressions -- count lines with a specific pattern in a flat file

Sathyaish

I have a CSV file like so:

"HDR",20060629133932,"9845","9083","0010"
1,"3","000000000690","000007","rsM4hJXR5Ik0O8RWghjtDBlUVAOZq7tO","BAR","0010","","",20.00
2,"3","000000000691","000007","65Xbp5dMcDFflPJnxWCrsJtV1jzcUjgd","BAR","0010","","",20.00
3,"3","000000000692","000007","SEjcf3eDA7hWmwGrNsLWoCWt1Geyh4GN","BAR","0010","","",20.00
4,"3","000000000693","000007","MJMkrp/kRMMGimeZo1uFOJzeDTVeOkFU","BAR","0010","","",20.00
5,"3","000000000694","000007","fDIBFgockQHhN+eVQxEBqqrJfZ78roja","BAR","0010","","",20.00
......and so on...

Each file has about a million records or more. Instead of iterating
through each line, counting line breaks, and skipping the header and
footer records so that only data records are counted, I thought of
writing a regex pattern to do this. Here's what I've written to count
only data records, i.e. rows that start with a number followed by a
comma and then any other text, ending with a line break.

numRecords = System.Text.RegularExpressions.Regex.Matches(ret,
"(?m)^[0-9]{1, 6}*$",
System.Text.RegularExpressions.RegexOptions.Multiline).Count;

I get a zero match collection count.
 
Marc Gravell

So using that form of .Matches means that you have to load the entire string
at once? Bet that's fast... ;-p Especially with the lookaround ...

However, it makes sense that it fails:

^[0-9]{1,6}*$

says start of line, then "between 1 and 6 digits" "zero or more times", then
end of line, with nothing else; the commas and quotes in your data get in the
way. Did you mean

^[0-9]{1-6},.*$

which is start of line, "between 1 and 6 digits", comma, "zero or more chars
except newline", end of line.
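
As a rough illustration of the whole-string approach with that kind of pattern, here is a minimal sketch. It assumes the quantifier is written in its comma form, {1,6}, as in the original question, and the class and method names (WholeStringCount, CountDataRecords) are made up for this example.

```csharp
using System.Text.RegularExpressions;

public static class WholeStringCount
{
    // Hypothetical helper: counts the lines in `text` that start with
    // 1 to 6 digits followed by a comma (the data records).
    public static int CountDataRecords(string text)
    {
        // RegexOptions.Multiline makes ^ and $ match at each line, so the
        // inline (?m) prefix from the question is not needed as well.
        return Regex.Matches(text, @"^[0-9]{1,6},.*$", RegexOptions.Multiline).Count;
    }
}
```

A header row such as the "HDR" line begins with a quote rather than a digit, so it is not counted. This still requires the whole file contents to be held in one string.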

However, for performance I would still suggest reading the file line by
line from a stream, and also re-using a single Regex instance (ideally
precompiled):

Regex re = new Regex("[0-9]{1-6},.*",RegexOptions.Compiled);
int count = 0;
using(StreamReader reader = File.OpenText(path)) {
    while(!reader.EndOfStream) {
        string line = reader.ReadLine();
        if(!string.IsNullOrEmpty(line) && re.IsMatch(line))
            count++;
    }
}

Marc
 
Marc Gravell

Sorry - typo by me: I meant {1,6} (as per your original example); likewise
"^[0-9]{1,6},.*$" in the example code - although given we don't care about
the rhs it may also work with just "^[0-9]{1,6},".
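
Putting the correction together with the streaming suggestion, a sketch of the loop might look like the following. It uses the shorter "^[0-9]{1,6}," form; the class and method names (StreamingCount, CountDataRecords) are invented for illustration, and it takes a TextReader so that it can be fed from a file or from a string.

```csharp
using System.IO;
using System.Text.RegularExpressions;

public static class StreamingCount
{
    // Built once and reused for every line. The trailing ".*$" is dropped
    // because only the start of the line decides whether it is a data record.
    private static readonly Regex DataRecord =
        new Regex(@"^[0-9]{1,6},", RegexOptions.Compiled);

    // Hypothetical helper: reads one line at a time, so the whole file
    // never has to be held in memory as a single string.
    public static int CountDataRecords(TextReader reader)
    {
        int count = 0;
        string line;
        // ReadLine returns null once the end of the input is reached.
        while ((line = reader.ReadLine()) != null)
        {
            if (DataRecord.IsMatch(line))
                count++;
        }
        return count;
    }
}
```

For a file on disk, the reader would typically come from File.OpenText(path) inside a using block, as in the earlier post.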

Marc
 
