Neri
A document-processing program I'm writing has to deal with documents
that have headers and footers that are unnecessary for the main
processing part. Therefore, I'm using a regular expression to go over
each document, find out whether it contains a header and/or a footer,
and extract only the main content part.
The headers and the footers have no specific format, and I have to
detect and remove them using a list of strings that may appear as
separators between the header (or the footer) and the content. These
strings are allowed to act as separators only if they appear on a
separate line, which may also contain whitespace, commas, dots, etc.,
but no letters or digits. In addition, the headers and the footers
don't appear in all the documents, so I have to consider this
possibility too.
For example, the list of header separators may contain the following
strings: "header ended", "here comes the content", "hhhhh". The list
of footer separators may contain the strings: "footer begins",
"fffff". The following document sample has a header but no footer:
To: George W. Bush
Subject: Hello!
hhhhh ,/
bla bla bla
bla bla fffff bla bla this is not a footer since
the separator string doesn't come on a separate line
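To make the separator rule concrete, here is a minimal sketch of the
line test on its own. It is written in Python purely for illustration,
so the names (separator_line, SAMPLE, and so on) are made up for this
example and are not part of my program. It reuses the same
^\W*(...)\W*$ shape as the expression further down.

```python
import re

# Separator strings taken from the example lists in this post.
HEADER_SEPARATORS = ["header ended", "here comes the content", "hhhhh"]
FOOTER_SEPARATORS = ["footer begins", "fffff"]


def separator_line(separators):
    """Pattern for a line that holds one separator plus only non-word characters."""
    alternatives = "|".join(re.escape(s) for s in separators)
    # MULTILINE makes ^ and $ match at line boundaries, not only at the
    # start and end of the whole document.
    return re.compile(rf"^\W*(?:{alternatives})\W*$", re.MULTILINE)


# The sample document from above (header present, no footer).
SAMPLE = (
    "To: George W. Bush\n"
    "Subject: Hello!\n"
    "hhhhh ,/\n"
    "bla bla bla\n"
    "bla bla fffff bla bla this is not a footer since\n"
    "the separator string doesn't come on a separate line\n"
)

header_hit = separator_line(HEADER_SEPARATORS).search(SAMPLE)
footer_hit = separator_line(FOOTER_SEPARATORS).search(SAMPLE)

print(repr(header_hit.group(0)))  # the "hhhhh ,/" line is accepted as a separator
print(footer_hit)                 # None: "fffff" has letters on the same line
```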
Well, here's a regular expression I managed to write:
(?<header>.*^\W*(header ended|hhhhh|here comes the content)\W*$|)(?<content>.+)(?<footer>^\W*(footer begins|fffff)\W*$.*|)
When I run it with the options Multiline and Singleline turned on, it
works fine for the header, but the content capture group of the
regular expression is "greedy" and it captures the footer too. Note
that it can do so because the footer group is allowed to match a
zero-length string, which is how I handle documents that have no
footer.
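Here is a small reproduction of the greedy behaviour, as a sketch
only. It is a translation of my expression into Python's re module,
assuming the two engines behave alike here: named groups become
(?P<name>...), and Multiline and Singleline become re.MULTILINE and
re.DOTALL. The test document DOC is made up for the example and has
both a header and a footer.

```python
import re

# The expression from this post, translated to Python syntax (assumption:
# the behaviour matches the original engine for this input).
GREEDY = re.compile(
    r"(?P<header>.*^\W*(header ended|hhhhh|here comes the content)\W*$|)"
    r"(?P<content>.+)"
    r"(?P<footer>^\W*(footer begins|fffff)\W*$.*|)",
    re.MULTILINE | re.DOTALL,
)

# Hypothetical document with a header separator and a footer separator.
DOC = (
    "Subject: Hello!\n"
    "hhhhh ,/\n"
    "bla bla bla\n"
    "fffff\n"
    "Sent from my terminal\n"
)

m = GREEDY.search(DOC)

print(repr(m.group("header")))   # ends at the "hhhhh ,/" line, as intended
print(repr(m.group("content")))  # runs to the end of DOC, footer included
print(repr(m.group("footer")))   # '' because the empty alternative matched
```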
Now, I tried to fix it by making the content capture group "lazy":
(?<content>.+?)
Well, this helped capture the footer, but now the content group
captures every content character in a separate match. If I want to use
it, I have to append all the matches using a StringBuilder...
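A sketch of what the lazy version does, again translated to Python's
re module for illustration (same assumptions as for any such
translation: (?P<name>...) for named groups, re.MULTILINE and
re.DOTALL for Multiline and Singleline). DOC is a made-up test
document. Joining the pieces at the end stands in for the
StringBuilder step.

```python
import re

# Same expression as in this post, but with a lazy content group.
LAZY = re.compile(
    r"(?P<header>.*^\W*(header ended|hhhhh|here comes the content)\W*$|)"
    r"(?P<content>.+?)"
    r"(?P<footer>^\W*(footer begins|fffff)\W*$.*|)",
    re.MULTILINE | re.DOTALL,
)

# Hypothetical document with a header separator and a footer separator.
DOC = "Subject: Hello!\nhhhhh ,/\nbla bla\nfffff\nbye\n"

matches = list(LAZY.finditer(DOC))
pieces = [m.group("content") for m in matches]

# Each match contributes a single content character, because after one
# character the empty footer alternative is already allowed to succeed.
print(pieces)

# Only the last match, the one ending right before the footer line,
# captures a non-empty footer.
print(repr(matches[-1].group("footer")))

# Reassembling the content by hand, which is what I'd like to avoid.
content = "".join(pieces)
print(repr(content))
```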
Any ideas?
BTW, I know I can do it using two regular expressions and two phases,
but I'm looking for a single regular expression that will capture the
content without further processing.
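For reference, the two-phase version I'd like to avoid looks roughly
like the sketch below. It is in Python for illustration only, and
extract_content, HEADER_END and FOOTER_START are names invented for
this example. It assumes the header ends at the last header separator
line and the footer starts at the first footer separator line after
it.

```python
import re

# One pattern per phase, each matching a whole separator line.
HEADER_END = re.compile(
    r"^\W*(?:header ended|hhhhh|here comes the content)\W*$", re.MULTILINE
)
FOOTER_START = re.compile(r"^\W*(?:footer begins|fffff)\W*$", re.MULTILINE)


def extract_content(document):
    """Return the document without its header and footer, if present."""
    # Phase 1: skip everything up to the end of the last header separator.
    start = 0
    for hit in HEADER_END.finditer(document):
        start = hit.end()
    body = document[start:]

    # Phase 2: cut everything from the first footer separator onward.
    hit = FOOTER_START.search(body)
    return body if hit is None else body[: hit.start()]


with_both = "Subject: Hello!\nhhhhh ,/\nbla bla\nfffff\nbye\n"
with_neither = "no separators here\n"

print(repr(extract_content(with_both)))
print(repr(extract_content(with_neither)))
```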
Thanks,
Avner