Regular expression rejecting invalid files

Hi,

I am using a regular expression to read records from a text file. But when
reading files with invalid formats, it takes ages before the program rejects
the file. So I want to optimise the expression to reject invalid files
faster.

The valid files are well formed and look something like this:

Once upon a time
CODE NUMBER:123
There was a little lamb
CODE NUMBER:2134

Each record is terminated by a form feed, and the regular expression is
something like this:
..Pattern = "(.*)\r\nCODE NUMBER:(\d+)\r\n\f"

Any ideas on how to speed up file rejection?

Regards
Bertrand
 
Since you have newlines embedded in your regex, you are obviously
reading the entire file in before comparing it to your pattern, rather
than reading a record at a time. Then your regex has to look through
the whole thing to see if your pattern is there. Since binary files
can be large, you're probably taking a hit on the file I/O, and then
another hit on the regex.

I would probably try reading a dozen bytes from the beginning of each
file, and make sure each of the characters I got was alphanumeric,
whitespace, or some small set of punctuation. If it was, I'd go to the
full I/O and regex; if not, I'd assume I had a binary file and go on to
the next one.
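
A minimal sketch of that pre-check in Python (the original code is presumably in another language, so treat this as an illustration only). The function names, the 12-byte sample size, and the exact set of allowed punctuation are assumptions for the example, not anything taken from the real files:

```python
import string

# Bytes treated as "plausibly text": letters, digits, whitespace (which
# includes CR, LF and form feed) and a small set of punctuation.
# This set is an assumption; adjust it to what valid files really contain.
ALLOWED = set(
    (string.ascii_letters + string.digits + string.whitespace + ".,:;'\"!?-")
    .encode("ascii")
)

def is_text_sample(head: bytes) -> bool:
    """Return True if every byte in the sample is in ALLOWED."""
    return all(b in ALLOWED for b in head)

def looks_like_text(path: str, sample_size: int = 12) -> bool:
    """Read only the first sample_size bytes of the file and check them."""
    with open(path, "rb") as f:
        head = f.read(sample_size)
    return is_text_sample(head)
```

Only files for which looks_like_text() returns True would go on to the full read and regex. Note that an empty file passes this check, so it still has to be rejected by the later step.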
 
Thanks, that is one way to do it. It does not seem to be I/O though. I think
that perhaps my expression is too "loose" in the sense that it does not
include file start/end symbols. If I could only make the grammar more strict,
then I would presume that files would be rejected sooner. Any ideas whether
this is possible?

Regards
Bertrand
 
Since you're reading the entire file into a string before executing
your regex, the start of file is the start of string. The way your
regex is coded, with that (.*) at the beginning, the regex has to go
all the way through the file before it can reject it. Is it really
necessary to capture everything that comes before CODE NUMBER?

If it is, you might try something like "^([a-zA-Z ]{5}.*)" in place of
your (.*). Without knowing what your "Once upon a time"s really look
like, it's kind of hard to say.
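
Another way to make the grammar stricter is to anchor every record where the previous one ended, so the first malformed record rejects the file at once. Here is a rough sketch in Python rather than your language. It assumes the text part of each record is a single line, as in your sample; the names RECORD and parse_records are made up for the example:

```python
import re

# One record: a single text line, the CODE NUMBER line, then a form feed.
# [^\r\n]* cannot run past the end of the line, unlike a loose (.*).
RECORD = re.compile(r"([^\r\n]*)\r\nCODE NUMBER:(\d+)\r\n\f")

def parse_records(text):
    """Return a list of (text, code) tuples, or None if the file is invalid."""
    records = []
    pos = 0
    while pos < len(text):
        # match() only tries the pattern at pos; it never scans ahead
        # looking for a later place where the pattern might fit.
        m = RECORD.match(text, pos)
        if m is None:
            return None  # the first malformed record rejects the file
        records.append((m.group(1), int(m.group(2))))
        pos = m.end()
    return records
```

With this shape a binary file fails on the very first attempt at position 0, instead of the regex retrying at every offset in the string. If your real records span several lines, the first group would need to be widened accordingly.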
 