Regular expression rejecting invalid files

Guest · May 8, 2006

Hi,

I am using a regular expression to read records from a text file. But when
reading files with invalid formats it takes ages before the program rejects
the file. So I want to optimise the expression to reject invalid files
faster.

The valid files are well-formed and look something like this:

Once upon a time
CODE NUMBER:123
There was a little lamb
CODE NUMBER:2134

Each record is terminated by a form feed and the regular expression is
something like this:
..Pattern = "(.*)\r\nCODE NUMBER:(\d+)\r\n\f"

Any ideas on how to speed up file rejection?

Regards
Bertrand

Guest · May 8, 2006

Since you have newlines embedded in your regex, you are obviously
reading the entire file in before comparing it to your pattern, rather
than reading a record at a time. Then your regex has to look through
the whole thing to see if your pattern is there. Since binary files
can be large, you're probably taking a hit on the file I/O, and then
another hit on the regex.

I would probably try reading a dozen bytes from the beginning of each
file, and make sure each of the characters I got was alphanumeric,
whitespace, or some small set of punctuation. If it was, I'd go to the
full I/O and regex; if not, I'd assume I had a binary file and go on to
the next one.
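
A minimal sketch of that pre-check, written in Python purely for illustration (the thread does not say which language the program uses). The names `looks_like_text` and `sniff_file`, the 12-byte sample size, and the exact punctuation whitelist are all assumptions, not something from the original program:

```python
import string

# Assumed whitelist: ASCII letters, digits, whitespace (which includes the
# form feed used as record terminator) and a small set of punctuation.
ALLOWED = set(
    (string.ascii_letters + string.digits + string.whitespace + ".,:;'\"!?-")
    .encode("ascii")
)


def looks_like_text(head: bytes) -> bool:
    """Return True if every byte of the sample is in the whitelist.

    Note that an empty sample passes, so an empty file is not rejected here.
    """
    return all(b in ALLOWED for b in head)


def sniff_file(path: str, sample_size: int = 12) -> bool:
    """Read only the first few bytes of the file and test them."""
    with open(path, "rb") as f:
        return looks_like_text(f.read(sample_size))
```

Only files for which `sniff_file` returns True would then go on to the full read and the regex; the rest can be skipped without reading them completely.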

Guest · May 9, 2006

Thanks, that is one way to do it. It does not seem to be I/O though. I think
that perhaps my expression is too "loose" in the sense that it does not
include file start/end symbols. If I could only make the grammar more strict
then I would presume that files would be rejected sooner. Any ideas whether
this is possible?

Regards
Bertrand

Guest · May 9, 2006

Since you're reading the entire file into a string before executing
your regex, the start of file is the start of string. The way your
regex is coded, the regex has to go all the way through the file before
it can reject it (that (.*) at the beginning). Is it really necessary
to capture everything that comes before CODE NUMBER?

If it is, you might try something like "^([a-zA-Z ]{5}.*)" in place of
your (.*). Without knowing what your "Once upon a time"s really look
like, it's kind of hard to say.
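
A rough sketch of the anchored version, in Python for illustration only. It assumes the rest of the pattern is `\r\nCODE NUMBER:(\d+)\r\n\f`, as the sample records suggest, and that the first five characters of a valid file are always letters or spaces; `first_record` is a made-up helper name:

```python
import re

# Anchored variant: the record must begin at the very start of the string
# and open with five letters or spaces (an assumed property of valid files).
RECORD = re.compile(r"^([a-zA-Z ]{5}.*)\r\nCODE NUMBER:(\d+)\r\n\f")


def first_record(text: str):
    """Return (body, code) for the leading record, or None if the text
    does not start with a well-formed record."""
    m = RECORD.match(text)
    return (m.group(1), m.group(2)) if m else None
```

Because the pattern is tied to the start of the string, text whose first five characters are not letters or spaces fails on the first attempt instead of being retried at every position in the file.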
