Using Regular Expressions

Xarky · Nov 20, 2004

Hi,
I have a text file full of HTML tags. I would like to extract the
text that is situated between two specific tags. The two tags are
<span class="serif">?????</span>.

Now I would like to get the text that is marked as ?. Its length is
not predefined.

Can someone help me out?

Thanks in Advance

Chris R. Timmons · Nov 20, 2004

Xarky,

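using System.Text.RegularExpressions;

....

// Singleline lets . match newlines, so the contents may span several lines.
Match m = Regex.Match(htmlText,
    @"<\s*?span.*?class\s*?=\s*?""serif"".*?[^>]*?>(?<contents>.*?)</\s*?span\s*?>",
    RegexOptions.Singleline |
    RegexOptions.IgnoreCase);

// The "contents" group is an empty string when nothing matched.
return m.Groups["contents"].Value;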
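
To try the pattern in isolation, here is a minimal, self-contained sketch. The class name SpanExtractorDemo, the method ExtractSerifSpan, and the sample HTML are made up for illustration; only the pattern and the options come from the snippet above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SpanExtractorDemo
{
    // Hypothetical helper: returns the text inside the first
    // <span class="serif">...</span>, or "" when there is no match.
    public static string ExtractSerifSpan(string htmlText)
    {
        Match m = Regex.Match(htmlText,
            @"<\s*?span.*?class\s*?=\s*?""serif"".*?[^>]*?>(?<contents>.*?)</\s*?span\s*?>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        return m.Groups["contents"].Value;
    }

    public static void Main()
    {
        // Sample input invented for this example.
        string html = "<p>intro</p><span class=\"serif\">Hello, world</span><p>outro</p>";
        Console.WriteLine(ExtractSerifSpan(html));
    }
}
```

With this sample input the program should print Hello, world. Only the first matching span is returned; Regex.Matches could be used instead if every occurrence is needed.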

Barry Mossman · Nov 22, 2004

Hi Chris,

I understand the string except for the "*?"s. What is the effect of the
"?"?
How does the behaviour differ from
@"<\s*span.*class\s*=\s*""serif"".*[^>]*>(?<contents>.*)</\s*span\s*>"?

thanks
Barry Mossman

Chris R. Timmons · Nov 22, 2004

Barry,

Normally, quantifiers like * and + are "greedy", which means they
match as many characters as possible. A ? placed after a quantifier
makes it "non-greedy", so it matches only the minimum number of
characters necessary. (Note that in the regex I posted, the \s*? is
equivalent to \s*, since each one is followed by a literal character
that is not whitespace. I kind of went overboard with the ?s :-) )

For example, assume the following input:

<text>first</text><text>second</text>

Let's say I want to extract the text between the first set of <text>
tags. Using this greedy expression:

<text>(.*)</text>

would return this text:

first</text><text>second

See how the .* matched as much as possible?

If the regex is changed to:

<text>(.*?)</text>

then only the minimum amount of text is matched:

first
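
The two results above can be reproduced with a short sketch. The class GreedyVsLazyDemo and the helper FirstGroup are hypothetical names chosen for this example; the input and both patterns are the ones shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical helper: returns capture group 1 of the first match,
    // or "" when the pattern does not match.
    public static string FirstGroup(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }

    public static void Main()
    {
        string input = "<text>first</text><text>second</text>";

        // Greedy: .* runs to the end, then backs up to the LAST </text>.
        Console.WriteLine(FirstGroup(input, "<text>(.*)</text>"));

        // Lazy: .*? stops at the FIRST </text> that lets the match succeed.
        Console.WriteLine(FirstGroup(input, "<text>(.*?)</text>"));
    }
}
```

The first line printed should be first</text><text>second and the second line should be first.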

Barry Mossman · Nov 23, 2004

Chris R. Timmons said:
Hope this helps.

Yes, lots. Thanks

Barry Mossman

xarky · Nov 23, 2004

Thanks, that helped a lot.

Now suppose I have

<text>......</text>
and I would like to extract the text between the <text> tags, but remove
any possible tags.

I am doing it as follows:

First I extract the text, and then I apply this:
string newText = Regex.Replace(text, "<[^>]*>", "");

Is this the correct way to do it?

Thanks

Chris R. Timmons · Nov 23, 2004

xarky,

Yes.
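
For reference, a minimal sketch of the two steps put together. The class TagStripDemo, the method ExtractAndStrip, and the sample input are invented for illustration; the two patterns are the ones already used in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagStripDemo
{
    // Hypothetical helper: takes the text between the first <text>...</text>
    // pair, then deletes any tags nested inside it.
    public static string ExtractAndStrip(string html)
    {
        // Step 1: lazy match, so only the first <text> element is captured.
        string inner = Regex.Match(html, "<text>(.*?)</text>",
            RegexOptions.Singleline).Groups[1].Value;

        // Step 2: remove everything that looks like <...>.
        return Regex.Replace(inner, "<[^>]*>", "");
    }

    public static void Main()
    {
        // Sample input invented for this example.
        string html = "<text>Some <b>bold</b> and <i>italic</i> words</text>";
        Console.WriteLine(ExtractAndStrip(html));
    }
}
```

This should print Some bold and italic words. The replace pattern works for simple markup like this; it assumes no > character appears inside an attribute value or a comment, and it does not decode entities such as &amp;.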

Chris.