Using Regular Expressions

Xarky

Hi,
I have a text file full of HTML tags. I would like to extract the
text that sits between two specific tags:
<span class="serif"> ????? </span>

I want to get the text marked as ?????. Its length is not known in
advance.

Can someone help me out?

Thanks in advance.
 
Xarky,


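using System.Text.RegularExpressions;

....

// Singleline lets "." match newlines, so the contents may span several lines.
Match m = Regex.Match(htmlText,
    @"<\s*?span.*?class\s*?=\s*?""serif"".*?[^>]*?>(?<contents>.*?)</\s*?span\s*?>",
    RegexOptions.Singleline | RegexOptions.IgnoreCase);

// The named group "contents" holds the text between the tags
// (an empty string if nothing matched).
return m.Groups["contents"].ToString();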
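
For anyone who wants to try this end to end, here is a minimal sketch that wraps the same pattern in a small helper. The class and method names (SpanExtractor, ExtractFirst, ExtractAll) are made up for illustration and are not part of any library; ExtractAll is an assumption that the file may contain more than one such span.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class SpanExtractor
{
    // Same pattern as in the post above, compiled once and reused.
    private static readonly Regex SerifSpan = new Regex(
        @"<\s*?span.*?class\s*?=\s*?""serif"".*?[^>]*?>(?<contents>.*?)</\s*?span\s*?>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns the contents of the first matching span,
    // or an empty string when nothing matches.
    public static string ExtractFirst(string htmlText)
    {
        Match m = SerifSpan.Match(htmlText);
        return m.Success ? m.Groups["contents"].Value : "";
    }

    // Returns the contents of every matching span, in document order.
    public static List<string> ExtractAll(string htmlText)
    {
        var results = new List<string>();
        foreach (Match m in SerifSpan.Matches(htmlText))
        {
            results.Add(m.Groups["contents"].Value);
        }
        return results;
    }
}
```

Calling SpanExtractor.ExtractFirst(File.ReadAllText(path)) would be one way to use it on a file. One thing to watch: because of Singleline, the .*? between "span" and "class" can run past the end of the opening tag, so a plain <span> followed later by some other element with class="serif" may also match. Replacing that .*? with [^>]*? is a possible tightening if that matters for your input.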
 
Chris R. Timmons said:
Match m = Regex.Match(htmlText,
    @"<\s*?span.*?class\s*?=\s*?""serif"".*?[^>]*?>(?<contents>.*?)</\s*?span\s*?>",
    RegexOptions.Singleline | RegexOptions.IgnoreCase);

return m.Groups["contents"].ToString();

Hi Chris,

I understand the pattern except for the "*?"s. What effect does the
"?" have?
How does the behaviour differ from this version?
@"<\s*span.*class\s*=\s*""serif"".*[^>]*>(?<contents>.*)</\s*span\s*>"

Thanks,
Barry Mossman
 
Barry,

Normally, quantifiers like * and + are "greedy", which means they
match as many characters as possible while still letting the rest of
the pattern match. Adding ? after a quantifier makes it "non-greedy"
(lazy), so it matches only the minimum number of characters
necessary. (Note that in the regex I posted, each \s*? is equivalent
to \s*. I kind of went overboard with the ?s :-) ).

For example, assume the following input:

<text>first</text><text>second</text>

Let's say I want to extract the text between the first set of <text>
tags. Using this greedy expression:

<text>(.*)</text>

would return this text:

first</text><text>second

See how the .* matched as much as possible?

If the regex is changed to:

<text>(.*?)</text>

then only the minimum amount of text is matched:

first
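
If you want to check this yourself, the comparison can be sketched in a few lines of C#. The class and member names here (GreedyVsLazy, FirstGroup and the two pattern constants) are just illustrative, not anything from the framework.

```csharp
using System.Text.RegularExpressions;

// Hypothetical demo class; the names are illustrative only.
public static class GreedyVsLazy
{
    public const string Input = "<text>first</text><text>second</text>";
    public const string Greedy = "<text>(.*)</text>";   // .* takes as much as it can
    public const string Lazy = "<text>(.*?)</text>";    // .*? takes as little as it can

    // Returns what the first capture group matched,
    // or an empty string when the pattern does not match.
    public static string FirstGroup(string input, string pattern)
    {
        return Regex.Match(input, pattern).Groups[1].Value;
    }
}
```

With these definitions, GreedyVsLazy.FirstGroup(GreedyVsLazy.Input, GreedyVsLazy.Greedy) gives "first</text><text>second", while the same call with GreedyVsLazy.Lazy gives "first".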
 
Thanks, that helped a lot.

Now suppose I have

<text><I>......</I></text>
and I would like to extract the text between the <text> tags, but remove
any tags inside it.

I am doing it as follows:

First I extract the text, and then I apply this:
string newText = Regex.Replace(text, "<[^>]*>", "");

Is this the correct way to do it?


Thanks
 
Xarky,

Yes.
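
Put together, the two steps might look like the sketch below. It assumes the <text> wrapper from your example; the class and method names (TagStripper, StripTags, ExtractPlainText) are made up for illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class TagStripper
{
    // Removes anything shaped like a tag: '<', any run of characters
    // other than '>', then '>'.
    public static string StripTags(string text)
    {
        return Regex.Replace(text, "<[^>]*>", "");
    }

    // Step 1: lazily capture what sits between <text> and </text>.
    // Step 2: strip any tags left inside the captured text.
    // Returns an empty string when there is no <text> element.
    public static string ExtractPlainText(string input)
    {
        Match m = Regex.Match(input, "<text>(.*?)</text>", RegexOptions.Singleline);
        return m.Success ? StripTags(m.Groups[1].Value) : "";
    }
}
```

For example, TagStripper.ExtractPlainText("<text><I>hello</I> world</text>") gives "hello world". Keep in mind this is only pattern matching: a literal > inside an attribute value would end the "tag" early, and entities such as &amp; are left as they are.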


Chris.
 