(href=(?:"|')http://www\.thesite\.com/.+?/)(?<filename>(?:.(?!\.th\.))+?)["|']
A few words of explanation:
First, your regular expression has a "." prior to the "www". This
indicates that a single non-line-break character *must* precede the
"www". I don't know if that's intentional, but I took it out.
Second, I'm not sure why you specified any non-line-break character
repeated between 1 and 7 times, followed by a slash. That seemed
excessive. Regardless of the length, the important thing is that the
length is non-zero, and it is followed by a slash. So, I used the
dot (non-line-break) with a lazy plus ("+?"). Lazy matching is an
important aspect of regular expressions. It means that the match
should be repeated as *few* times as possible. The default is
"greedy" matching, which indicates that the match should be repeated
as *many* times as possible. In a regular expression where a lazy
match is followed by some other match, the lazy match stops at the
first incidence of the next match. This means that the sequence:
.+?/
indicates "any non-line-break character repeated as few times as
possible, until a slash is reached, followed by a slash."
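
To make the lazy/greedy difference concrete, here is a minimal sketch in
Python (not the .NET engine the question is about, though both treat
these two quantifiers the same way). The path string is invented for
illustration.

```python
import re

# Hypothetical path with two directory levels, so the difference is visible.
path = "gallery/2007/photo.jpg"

lazy = re.match(r".+?/", path)   # lazy plus: stops at the first slash
greedy = re.match(r".+/", path)  # greedy plus: runs on to the last slash

print(lazy.group(0))    # gallery/
print(greedy.group(0))  # gallery/2007/
```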
It wasn't so important in that section of the regular expression,
but it's critical to the latter part:
(?<filename>(?:.(?!\.th\.))+?)["|']
This is your filename group. You can use quantifiers with un-named
groups, so the inner (?:...) group carries the lazy plus. What the
last part says about the "filename" group is:
Match any single non-line-break character that is *not* followed by
the ".th." character sequence, as few times as possible.
This is followed by ["|'], a character class that matches a single
quote or a double quote (the "|" inside the brackets is a literal
pipe, not alternation, so ["'] would be enough). Therefore, the
matching of the "filename" group halts when the quote is reached. In
other words, because the group sits at the end of the match, it is
always followed by a single or double quote, and the lazy quantifier
makes the group stop at that quote.
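
As a rough check of the whole expression, here is a sketch in Python.
Two things are assumptions on my part: Python spells the named group
(?P<filename>...) rather than (?<filename>...), and the sample href
values (directory and file names) are made up.

```python
import re

# The pattern from this post, with Python's (?P<filename>...) in place of
# the .NET-style (?<filename>...) named group.
pattern = re.compile(
    r"""(href=(?:"|')http://www\.thesite\.com/.+?/)"""
    r"""(?P<filename>(?:.(?!\.th\.))+?)["|']"""
)

# Hypothetical href attributes; only the host name comes from the question.
samples = [
    'href="http://www.thesite.com/pics/photo1.jpg"',
    'href="http://www.thesite.com/pics/photo1.th.jpg"',
    "href='http://www.thesite.com/pics/photo2.jpg'",
]


def extract_filename(attribute):
    """Return the captured filename, or None when the pattern does not match."""
    m = pattern.search(attribute)
    return m.group("filename") if m else None


results = [extract_filename(s) for s in samples]
print(results)  # ['photo1.jpg', None, 'photo2.jpg']
```

The thumbnail is rejected because the character just before ".th."
fails the lookahead, so the group can never reach the closing quote.
Note that this sketch tests each attribute on its own. Because "." also
matches quote characters, the .+?/ part can run past the closing quote
when a later slash and another quote follow on the same line, so it is
worth testing against your real markup.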
Kevin Spencer
Microsoft MVP
DSI PrintManager, Miradyne Component Libraries:
http://www.miradyne.net
"pedrito" <pixbypedrito at yahoo.com> wrote in message
I have a regex question and it never occurred to me to ask here,
until I saw Jesse Houwing's quick response to Phil for his Regex
question.
I have some filenames that I'm trying to parse out of URLs.
(href=("|')http://.www\.thesite\.com/.{1,7}/)(?<filename>.[^"|'])
This generally works, but the problem is some of the image files
have .th.jpg at the end to indicate thumbnails. I want to exclude
those. I just want the ones that don't have .th. before the file
extension.
I've tried using forward and reverse negative lookups, but I guess
I'm not using them correctly. Any help on how to get a non-match
for a .th. would be great.