Regular Expression Help

Guest

I am creating a screen scraping app that will extract data from a website.
The screen scraping is pretty straightforward using .NET 2.0, but stripping
out all extraneous characters is proving to be more difficult. I am
basically trying to extract the team, quarter, score for the quarter, and
score for the entire game from this html. (This html is a subset of the
entire page)

<table border="0" width="100%"><tr><td width="40%">Team</td><td width="10%"
align="center">1</td> <td width="10%" align="center">2</td><td width="10%"
align="center">3</td> <td width="10%" align="center">4</td><td width="20%"
align="center">Score</td></tr><tr><td width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/078.htm">New
Orleans</A></td><td width="10%" align="center" >0</td><td width="10%"
align="center" >10</td><td width="10%" align="center" >0</td><td width="10%"
align="center" >0</td><td width="20%" align="center" >10</td></tr><tr><td
width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/071.htm">Indianapolis</A></td><td
width="10%" align="center" >7</td><td width="10%" align="center" >3</td><td
width="10%" align="center" >14</td><td width="10%" align="center" >17</td><td
width="20%" align="center" >41</td></tr></table>

In essence I want to be able to put the names and scores into an array so I
can add to a database. From what I read regular expressions should be able
to do this but I am a complete beginner using regex. Could someone assist
in getting me started? Many thanks.
 
Arnshea

One way to do this is to use the pattern:
>([\w\s]+)<

and then ignore whitespace-only matches. The code below should print out
what you're looking for. Since \s also matches line breaks and the pattern
has no ^ or $ anchors, it should behave the same whether your input is on a
single line or in one big multi-line string. RegexOptions.Multiline only
changes how ^ and $ match, so you should not need it here. If you have the
input as multiple strings, join them into one first so that a value split
across lines is still matched.

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main(string[] args)
    {
        // Capture runs of word characters and whitespace between '>' and '<'.
        string pat = @">([\w\s]+)<";
        string html = @"<table border=""0"" width=""100%""><tr><td
width=""40%"">Team</td><td width=""10%"" align=""center"">1</td> <td
width=""10%"" align=""center"">2</td><td width=""10%""
align=""center"">3</td> <td width=""10%"" align=""center"">4</td><td
width=""20%"" align=""center"">Score</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/078.htm"">New Orleans</A></td><td
width=""10%"" align=""center"" >0</td><td width=""10%"" align=""center"" >10</td><td
width=""10%"" align=""center"" >0</td><td width=""10%"" align=""center"" >0</td><td
width=""20%"" align=""center"" >10</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/071.htm"">Indianapolis</A></td><td
width=""10%"" align=""center"" >7</td><td width=""10%"" align=""center"" >3</td><td
width=""10%"" align=""center"" >14</td><td width=""10%"" align=""center"" >17</td><td
width=""20%"" align=""center"" >41</td></tr></table>";

        Regex r = new Regex(pat);

        foreach (Match m in r.Matches(html))
        {
            // Ignore whitespace-only matches (such as the gap in "</td> <td").
            if (m.Groups[1].Value.Trim() == "")
                continue;

            // Do whatever it is you want to do here.
            Console.WriteLine(m.Groups[1].Value);
        }
    }
}
 
Guest

Thanks! I think that almost did it. I ran into a problem testing "St.
Louis" though. Possibly due to the "."? When writing the results to the
debug window, St. Louis was omitted.

 
Arnshea

Yeah, the pattern will only match letters, digits, underscores and
whitespace, so the "." in "St. Louis" stops the match. Now that I think
about it though, try changing the pattern to:

that should match any innermost text of the table cells.
 
Jesse Houwing

Hello JP,

I posted a regex a while back that did almost this.

<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^>]*>

This will extract all the rows with the info you need.

It will store the respective values in named groups, so they're easily extracted:

foreach (Match m in regex.Matches(input))
{
    string team = m.Groups["team"].Value;
    string quarter = m.Groups["quarter"].Value;
    string score = m.Groups["score"].Value;
}

You can even chain this expression so you can get all the results in one
pass:

(<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^>]*>\s*)+

Match m = regex.Match(input);
for (int i = 0; i < m.Groups["team"].Captures.Count; i++)
{
    string team = m.Groups["team"].Captures[i].Value;
    string quarter = m.Groups["quarter"].Captures[i].Value;
    string score = m.Groups["score"].Captures[i].Value;
}

As an alternative you might want to have a look at the HTML Agility pack.
It allows you to do XPath queries over HTML. A very powerful way to extract
data from HTML files.

http://www.codeplex.com/htmlagilitypack
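
To give an idea of what the XPath route could look like, here is a rough sketch. It assumes the HtmlAgilityPack assembly is referenced in your project, and it uses a shortened, made-up copy of the table (the class name and the trimmed markup are only for illustration), so treat it as a starting point rather than a finished solution.

```csharp
using System;
using HtmlAgilityPack; // third-party library, assumed to be referenced

class AgilityPackSketch
{
    static void Main()
    {
        // Hypothetical, shortened version of the score table.
        string html =
            "<table><tr><td>Team</td><td>1</td><td>Score</td></tr>" +
            "<tr><td><A href=\"/default.asp?c=sportsnetwork&page=nfl/teams/078.htm\">" +
            "New Orleans</A></td><td>0</td><td>10</td></tr></table>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty list) when nothing matches.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr");
        if (rows == null)
            return;

        foreach (HtmlNode row in rows)
        {
            HtmlNodeCollection cells = row.SelectNodes("td");
            if (cells == null)
                continue;

            // InnerText drops the markup, so the <A> wrapper around the team name goes away.
            string[] values = new string[cells.Count];
            for (int i = 0; i < cells.Count; i++)
                values[i] = cells[i].InnerText.Trim();

            Console.WriteLine(string.Join(" | ", values));
        }
    }
}
```

Each row comes back as an array of cell texts, header row included, so you would still need to skip the first row before writing to the database.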
 
Jesse Houwing

Hello JP,
Guest said:
Thanks! I think that almost did it. I ran into a problem testing
"St. Louis" though. Possibly due to the "."? When writing the
results to the debug window, St. Louis was omitted.

This is caused by the [\w\s]+, which does not include the "." character.
If you replace it with [^>]+ it should work.

Keep in mind, though, that this isn't very strict and will easily break if
the page layout changes: instead of returning no result at all, it will
return results that may not be what you expected.
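
A small sketch to see the difference between the two character classes. The two-cell fragment and the class name are made up for illustration; only the two patterns come from this thread.

```csharp
using System;
using System.Text.RegularExpressions;

class CharClassSketch
{
    static void Main()
    {
        // Hypothetical fragment with a team name that contains a ".".
        string html = "<td width=\"40%\">St. Louis</td><td align=\"center\">7</td>";

        // The original pattern, then the same pattern with the looser class.
        string[] patterns = { @">([\w\s]+)<", @">([^>]+)<" };

        foreach (string pat in patterns)
        {
            Console.WriteLine("Pattern: " + pat);
            foreach (Match m in Regex.Matches(html, pat))
                Console.WriteLine("  " + m.Groups[1].Value);
        }
    }
}
```

With the first pattern only the 7 should come out, because the "." is not part of [\w\s]. With the second one both "St. Louis" and 7 should be captured.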

Jesse

 
Guest

Jesse,

That's exactly what I am trying to do with the data within the HTML. The
expressions and code you listed don't apply to the HTML I posted, correct?

I am not sure how I could group the teams and quarters if they are not
labeled. I probably don't understand how regex works...

 
Jesse Houwing

Hello JP,
Guest said:
That's exactly what I am trying to do with the data within the HTML.
The expressions and code you listed don't apply to the HTML I posted,
correct?

They do. By just saying <td[^>]*> you say: find a <td tag, then match
everything up to the closing >, and then match the closing > itself. It
ignores all the border, width, height and other stuff in there.

For now I ignored the <a href> around the team name, but other than that,
the expression should work.

Guest said:
I am not sure how I could group the teams and quarters if they are not
labeled. I probably don't understand how regex works...

Well, if they're always in the same cell position, as in your example, you
can use their position (as I've done in the expression). The (?<name>...)
construct in the expression then labels them.
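
A minimal sketch of labelling cells by position. It uses a made-up three-cell row (not the six-cell table from the question), and the closing part of the pattern is written as </tr[^>]*> so that a plain </tr> matches. The class name is only illustrative.

```csharp
using System;
using System.Text.RegularExpressions;

class NamedGroupSketch
{
    static void Main()
    {
        // Three cells per row; each (?<name>...) labels a cell purely by its position.
        string pat =
            @"<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*" +
            @"<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>\s*" +
            @"<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^>]*>";

        // Hypothetical input: one row with exactly three cells.
        string input =
            "<tr><td width=\"40%\">St. Louis</td>" +
            "<td align=\"center\">1</td><td align=\"center\">7</td></tr>";

        foreach (Match m in Regex.Matches(input, pat))
        {
            Console.WriteLine("team:    " + m.Groups["team"].Value);
            Console.WriteLine("quarter: " + m.Groups["quarter"].Value);
            Console.WriteLine("score:   " + m.Groups["score"].Value);
        }
    }
}
```

The attributes inside each <td ...> are skipped by [^>]*, so only the cell text ends up in the groups.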

Jesse
 
Guest

Hi Jesse,

Thanks again for the assistance.

I have taken what you have posted and put it into a console app but it
doesn't seem to pick up anything using the expression. If I uncomment the
shortpat and the first foreach, I get data. Am I missing something?

static void Main(string[] args)
{
    //string shortpat = @">([\w\s]+)<";
    string pat =
        @"<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^>]>";

    string html = @"<table border=""0"" width=""100%""><tr><td
width=""40%"">Team</td><td width=""10%"" align=""center"">1</td> <td
width=""10%"" align=""center"">2</td><td width=""10%""
align=""center"">3</td> <td width=""10%"" align=""center"">4</td><td
width=""20%"" align=""center"">Score</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/078.htm"">New Orleans</A></td><td
width=""10%"" align=""center"" >0</td><td width=""10%"" align=""center"" >10</td><td
width=""10%"" align=""center"" >0</td><td width=""10%"" align=""center"" >0</td><td
width=""20%"" align=""center"" >10</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/071.htm"">Indianapolis</A></td><td
width=""10%"" align=""center"" >7</td><td width=""10%"" align=""center"" >3</td><td
width=""10%"" align=""center"" >14</td><td width=""10%"" align=""center"" >17</td><td
width=""20%"" align=""center"" >41</td></tr></table>";

    Regex r = new Regex(pat);

    //foreach (Match m in r.Matches(html))
    //{
    //    if (m.Groups[1].Value.Trim() == "")
    //        // ignore these
    //        continue;
    //    else
    //        Console.WriteLine(m.Groups[1].Value);
    //    Console.ReadLine();
    //}

    foreach (Match m in r.Matches(html))
    {
        string team = m.Groups["team"].Value;
        string quarter = m.Groups["quarter"].Value;
        string score = m.Groups["score"].Value;

        Console.WriteLine(team);
        Console.WriteLine(quarter);
        Console.WriteLine(score);
    }

    Console.ReadLine();
}


 
Arnshea

Hi Jesse,

Thanks again for the assistance.

I have taken what you have posted and put it into a console app but it
doesn't seem to pick up anything using the expression. If I uncomment the
shortpat and the first foreach, I get data. Am I missing something?

static void Main(string[] args)
{
//string shortpat = @">([\w\s]+)<";
string pat =
@"<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<quarter>(­(?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^­>]>";

string html = @"<table border=""0"" width=""100%""><tr><td
width=""40%"">Team</td><td width=""10%"" align=""center"">1</td>
<td
width=""10%"" align=""center"">2</td><td width=""10%""
align=""center"">3</td> <td width=""10%""
align=""center"">4</td><td
width=""20%"" align=""center"">Score</td></tr><tr><td
width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/078.htm"">New
Orleans</A></td><td width=""10%"" align=""center"" >0</td><td
width=""10%"" align=""center"" >10</td><td width=""10%""
align=""center"" >0</td><td width=""10%"" align=""center"">0</td><td

width=""20%"" align=""center"" >10</td></tr><tr><td
width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/
071.htm"">Indianapolis</A>-</td><td width=""10%""
align=""center"" >7</
td><td width=""10%"" align=""center"" >3</td><td width=""10%""
align=""center"" >14</td><td width=""10%"" align=""center"" >17</
td><td width=""20%"" align=""center"" >41</td></tr></table>";

Regex r = new Regex(pat);

//foreach (Match m in r.Matches(html))
//{
// if (m.Groups[1].Value.Trim() == "")
// // ignore these
// continue;
// else
// Console.WriteLine(m.Groups[1].Value);
// Console.ReadLine();
//}

foreach (Match m in r.Matches(html))
{
string team = m.Groups["team"].Value;
string quarter = m.Groups["quarter"].Value;
string score = m.Groups["score"].Value;

Console.WriteLine(team);
Console.WriteLine(quarter);
Console.WriteLine(score);
}

Console.ReadLine();

}



Jesse Houwing said:
Hello JP,
They do. by just saying <td[^>]+> you say find a <td tag. Then find everything
up to the closing > and then match the closing >. It ignores all the border,
widtch, height and other stuff in there.
For now I ignored the <a href> around the team name, but other than that,
the expression should work.
Well if they're always in the xth cell as in you example, you can use their
position (as I've done in the expression). The (?<name>...) construct in
the expression then labels them.
:
Hello JP,
I am creating a screen scraping app that will extract data from a
website. The screen scraping is pretty straightforward using .NET
2.0, but stripping out all extraneous characters is proving to be
more difficult. I am basically trying to extract the team,
quarter, score for the quarter, and score for the entire game from
this html. (This html is a subset of the entire page)
<table border="0" width="100%"><tr><td width="40%">Team</td><td
width="10%" align="center">1</td> <td width="10%"
align="center">2</td><td width="10%" align="center">3</td> <td
width="10%" align="center">4</td><td width="20%"
align="center">Score</td></tr><tr><td width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/078.htm">New
Orleans</A></td><td width="10%" align="center" >0</td><td
width="10%" align="center" >10</td><td width="10%" align="center"
0</td><td width="10%" align="center" >0</td><td width="20%"
align="center"
10</td></tr><tr><td width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/071.htm">Indianapo
li s</A></td><td width="10%" align="center" >7</td><td width="10%"
align="center" >3</td><td width="10%" align="center" >14</td><td
width="10%" align="center" >17</td><td width="20%" align="center"
41</td></tr></table>
In essance I want to be able to put the names and scores into an
array so I can add to a database. From what I read regular
expressions should be able to do this but I am a complete beginner
using regex. Could someone assist in getting me started? Many
thanks.
I posted a regex a while back that did almost this.

<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^>]*>

This will extract all the rows with the info you need.
It stores the respective values in named groups, so they're
easily extracted:

foreach (Match m in regex.Matches(input))
{
    string team = m.Groups["team"].Value;
    string quarter = m.Groups["quarter"].Value;
    string score = m.Groups["score"].Value;
}

You can even chain this expression so you can get all the results in
one pass:

(<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^>]*>\s*)+

Match m = regex.Match(input);
for (int i = 0; i < m.Groups["team"].Captures.Count; i++)
{
    string team = m.Groups["team"].Captures[i].Value;
    string quarter = m.Groups["quarter"].Captures[i].Value;
    string score = m.Groups["score"].Captures[i].Value;
}

As an alternative you might want to have a look at the HTML Agility
Pack. It allows you to do XPath queries over HTML, which is a very
powerful way to extract data from HTML files.
http://www.codeplex.com/htmlagilitypack




To capture "St. Louis", try replacing shortpat with:
string shortpat = @">([^<]+)<";
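
A quick sketch of why the character class matters. The cell below is hypothetical ("St. Louis" is not in the table posted in this thread), and FirstText is just a helper name made up for the example. The period is neither \w nor \s, so the original shortpat never gets from > to < in one run, while [^<]+ accepts anything up to the next tag:

```csharp
using System;
using System.Text.RegularExpressions;

class ShortPatDemo
{
    // Returns the text captured between '>' and '<', or null when nothing matches.
    public static string FirstText(string pattern, string html)
    {
        Match m = Regex.Match(html, pattern);
        return m.Success ? m.Groups[1].Value : null;
    }

    static void Main()
    {
        // Hypothetical cell, not taken from the posted table.
        string cell = "<td><A href=\"x.htm\">St. Louis</A></td>";

        // [\w\s]+ cannot cross the period, so no text run is found.
        Console.WriteLine(FirstText(@">([\w\s]+)<", cell) ?? "(no match)");

        // [^<]+ takes every character except '<', period included.
        Console.WriteLine(FirstText(@">([^<]+)<", cell));
    }
}
```

The trade-off is that [^<]+ also picks up whitespace-only runs between tags, so trimming and skipping empty values is still needed.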
 
J

Jesse Houwing

Hello JP,

I updated the expression and tested it. I'm not sure what to capture where.
Maybe if you can describe the table row to me I could do a better expression.

Right now it captures as follows: team, quarter and the score (4x).


<tr[^>]*>\s*<td[^>]*>\s*<a[^>]*>(?<team>((?!</a).)*)</a[^>]*>\s*</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>(\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>)+\s*</tr[^>]*>

it results in a match that is built up as follows:

Match
+- Groups
   +- Team.Value = "New Orleans"
   +- Quarter.Value = "0"
   +- Score.Captures[0].Value = 10
   +- Score.Captures[1].Value = 0
   +- Score.Captures[2].Value = 0
   +- Score.Captures[3].Value = 10

My question now is... did I interpret it right?

Is the row built up of a team name, quarter number and 4 scores?

To explain the expression:

<tr[^>]*> Find the start of a row
\s* allow whitespace between the <tr> tag and the first <td>
<td[^>]*> Find the first table cell.
\s* allow whitespace between <td> and <a>
<a[^>]*> Find the opening a href tag
(?<team>((?!</a).)*) capture everything you find from there to the </a tag
in a group named team.
</a[^>]*> Find the closing a tag
\s* allow whitespace between </a> and </td>
</td[^>]*> Find the closing td tag
\s* allow whitespace between </td> and <td>
<td[^>]*> Find opening <td>
(?<quarter>((?!</td).)*) capture everything from there to the </td tag into
a group named quarter
</td[^>]*> Find the closing td tag
( start repeating group
\s* allow whitespace between </td> and <td>
<td[^>]*> Find opening <td>
(?<score>((?!</td).)*) capture everything from there to the </td tag into
a group named score
</td[^>]*> Find the closing td tag
)+ end repeating group. If the named group inside this repeating group captured
multiple times, the values can be found in the Group.Captures collection.
\s*
</tr[^>]*> Finally make sure we've captured a complete row by finding its
end tag.
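
To see the Captures collection in action, here is a minimal sketch that runs the expression above against the New Orleans row. The class and method names are made up for the example, the href is shortened, and the IgnoreCase and Singleline options are an assumption (the page writes <A ...> in upper case, and cells may contain line breaks):

```csharp
using System;
using System.Text.RegularExpressions;

class CaptureDemo
{
    // The updated expression from the post, split across lines only for readability.
    const string RowPattern =
        @"<tr[^>]*>\s*<td[^>]*>\s*<a[^>]*>(?<team>((?!</a).)*)</a[^>]*>\s*</td[^>]*>\s*" +
        @"<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>" +
        @"(\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>)+\s*</tr[^>]*>";

    // Returns "team | quarter | score,score,..." for the first matching row,
    // or null when no row matches.
    public static string DescribeFirstRow(string html)
    {
        // Assumed options: IgnoreCase for <A ...>, Singleline so . matches line breaks.
        Match m = Regex.Match(html, RowPattern,
            RegexOptions.IgnoreCase | RegexOptions.Singleline);
        if (!m.Success) return null;

        // A group inside a repeated group keeps every value it matched in Captures.
        CaptureCollection caps = m.Groups["score"].Captures;
        string[] scores = new string[caps.Count];
        for (int i = 0; i < caps.Count; i++)
            scores[i] = caps[i].Value;

        return m.Groups["team"].Value + " | " + m.Groups["quarter"].Value +
               " | " + string.Join(",", scores);
    }

    static void Main()
    {
        // The New Orleans row from the posted table, href shortened for the sketch.
        string row =
            "<tr><td width=\"40%\"><A href=\"/teams/078.htm\">New Orleans</A></td>" +
            "<td width=\"10%\" align=\"center\" >0</td>" +
            "<td width=\"10%\" align=\"center\" >10</td>" +
            "<td width=\"10%\" align=\"center\" >0</td>" +
            "<td width=\"10%\" align=\"center\" >0</td>" +
            "<td width=\"20%\" align=\"center\" >10</td></tr>";

        Console.WriteLine(DescribeFirstRow(row)); // New Orleans | 0 | 10,0,0,10
    }
}
```

Note that Group.Value only returns the last capture, which is why the loop reads from Captures instead.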

Jesse



Jesse Houwing said:
Hello JP,
Jesse,

That's exactly what I am trying to do with the data within the HTML.
The expressions and code you listed don't apply to the HTML I posted
correct?
They do. By just saying <td[^>]+> you say: find a <td tag, then find
everything up to the closing >, and then match the closing >. It
ignores all the border, width, height and other stuff in there.

For now I ignored the <a href> around the team name, but other than
that, the expression should work.
I am not sure how I could group the teams and quarters if they are
not labeled. I probably don't understand how regex works...
Well, if they're always in the xth cell as in your example, you can use
their position (as I've done in the expression). The (?<name>...)
construct in the expression then labels them.

Jesse
 
A

Arnshea

Hi Jesse,

Thanks again for the assistance.

I have taken what you have posted and put it into a console app but it
doesn't seem to pick up anything using the expression. If I uncomment the
shortpat and the first foreach, I get data. Am I missing something?

static void Main(string[] args)
{
//string shortpat = @">([\w\s]+)<";
string pat =
@"<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^>]*>";

string html = @"<table border=""0"" width=""100%""><tr><td
width=""40%"">Team</td><td width=""10%"" align=""center"">1</td>
<td
width=""10%"" align=""center"">2</td><td width=""10%""
align=""center"">3</td> <td width=""10%""
align=""center"">4</td><td
width=""20%"" align=""center"">Score</td></tr><tr><td
width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/078.htm"">New
Orleans</A></td><td width=""10%"" align=""center"" >0</td><td
width=""10%"" align=""center"" >10</td><td width=""10%""
align=""center"" >0</td><td width=""10%"" align=""center"">0</td><td

width=""20%"" align=""center"" >10</td></tr><tr><td
width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/071.htm"">Indianapolis</A></td>
<td width=""10%"" align=""center"" >7</td><td width=""10%"" align=""center"" >3</td>
<td width=""10%"" align=""center"" >14</td><td width=""10%"" align=""center"" >17</td>
<td width=""20%"" align=""center"" >41</td></tr></table>";

Regex r = new Regex(pat);

//foreach (Match m in r.Matches(html))
//{
// if (m.Groups[1].Value.Trim() == "")
// // ignore these
// continue;
// else
// Console.WriteLine(m.Groups[1].Value);
// Console.ReadLine();
//}

foreach (Match m in r.Matches(html))
{
string team = m.Groups["team"].Value;
string quarter = m.Groups["quarter"].Value;
string score = m.Groups["score"].Value;

Console.WriteLine(team);
Console.WriteLine(quarter);
Console.WriteLine(score);
}

Console.ReadLine();

}





Ok, this should (will?) grab everything you're looking for and print
it out in order:

static void Main(string[] args)
{
string html = @"<table border=""0"" width=""100%""><tr><td
width=""40%"">Team</td><td width=""10%"" align=""center"">1</td> <td
width=""10%"" align=""center"">2</td><td width=""10%""
align=""center"">3</td> <td width=""10%"" align=""center"">4</td><td
width=""20%"" align=""center"">Score</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/078.htm"">New
Orleans</A></td><td width=""10%"" align=""center"" >0</td><td
width=""10%"" align=""center"" >10</td><td width=""10%""
align=""center"" >0</td><td width=""10%"" align=""center"" >0</td><td
width=""20%"" align=""center"" >10</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/071.htm"">Indianapolis</A></td>
<td width=""10%"" align=""center"" >7</td><td width=""10%"" align=""center"" >3</td>
<td width=""10%"" align=""center"" >14</td><td width=""10%"" align=""center"" >17</td>
<td width=""20%"" align=""center"" >41</td></tr></table>";
string contents = null;
string tdPat = "<td[^>]*>(.+?)</td>"; // grabs everything between <td>...</td>
string innerTextPat = ">([^>]+)<"; // grabs innermost non-html

Regex tdRegex = new Regex(tdPat);
Regex innerTextRegex = new Regex(innerTextPat);

foreach (Match m in tdRegex.Matches(html))
{
contents = m.Groups[1].Value;

Console.WriteLine(contents); // will include <a href="...">TEAM NAME</a>

// the following will print out the team name w/o the hyperlink
foreach (Match m2 in innerTextRegex.Matches(contents))
Console.WriteLine(m2.Groups[1].Value);
}
}
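
Since the cells come out in document order, one way to turn that flat list back into rows is to chunk it by the column count. This is only a sketch under assumptions: six columns per row as in the posted table, no empty cells (the lazy .+? needs at least one character), and Singleline/IgnoreCase added so wrapped lines and upper-case tags don't break the match. CellGrouper and Rows are names invented for the example:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class CellGrouper
{
    // Team, four quarters, final score, as in the posted table.
    const int Columns = 6;

    // Returns one string[] per table row, with tags stripped from each cell.
    public static List<string[]> Rows(string html)
    {
        Regex tdRegex = new Regex("<td[^>]*>(.+?)</td>",
            RegexOptions.Singleline | RegexOptions.IgnoreCase);
        Regex tagRegex = new Regex("<[^>]+>");

        // Collect the text of every cell, dropping inner tags such as <A ...>.
        List<string> cells = new List<string>();
        foreach (Match m in tdRegex.Matches(html))
            cells.Add(tagRegex.Replace(m.Groups[1].Value, "").Trim());

        // Cut the flat list into rows of Columns cells each.
        List<string[]> rows = new List<string[]>();
        for (int i = 0; i + Columns <= cells.Count; i += Columns)
            rows.Add(cells.GetRange(i, Columns).ToArray());
        return rows;
    }

    static void Main()
    {
        // Shortened stand-in for the posted table (attributes dropped).
        string html =
            "<table><tr><td>Team</td><td>1</td><td>2</td><td>3</td><td>4</td><td>Score</td></tr>" +
            "<tr><td><A href=\"078.htm\">New Orleans</A></td><td>0</td><td>10</td><td>0</td><td>0</td><td>10</td></tr>" +
            "<tr><td><A href=\"071.htm\">Indianapolis</A></td><td>7</td><td>3</td><td>14</td><td>17</td><td>41</td></tr></table>";

        foreach (string[] row in Rows(html))
            Console.WriteLine(string.Join(" | ", row));
    }
}
```

The first row returned is the header row, so skip index 0 before writing to the database. This breaks as soon as a row has a different number of cells, which is where the row-based expression is more robust.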
 
G

Guest

Jesse,

Cut and paste this text into notepad and then save as an html file. You
will see how the data relates. The teams run vertically along with the
quarters and final score. I guess that is what I am having a hard time
understanding: how you can group the different data points given the way
this table is structured.

I did write a little routine to pull out different data points based on
looping through the data but it is a bit kludgy. Was hoping to do something
more robust like your solution.

<html>
<table border="0" width="100%"><tr><td width="40%">Team</td><td width="10%"
align="center">1</td> <td width="10%" align="center">2</td><td width="10%"
align="center">3</td> <td width="10%" align="center">4</td><td width="20%"
align="center">Score</td></tr><tr><td width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/078.htm">New
Orleans</A></td><td width="10%" align="center" >0</td><td width="10%"
align="center" >10</td><td width="10%" align="center" >0</td><td width="10%"
align="center" >0</td><td width="20%" align="center" >10</td></tr><tr><td
width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/071.htm">Indianapolis</A></td><td
width="10%" align="center" >7</td><td width="10%" align="center" >3</td><td
width="10%" align="center" >14</td><td width="10%" align="center" >17</td><td
width="20%" align="center" >41</td></tr></table>
</html>

 
J

Jesse Houwing

Hello JP,

Given that the last column is the final score (excuse my lack of insight
into sports with more than two halves; I'm a European soccer watcher), it should
be quite easy to alter the expression I gave before:

<tr[^>]*>\s*<td[^>]*>\s*<a[^>]*>(?<team>((?!</a).)*)</a[^>]*>\s*</td[^>]*>\s*(\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>){4}\s*<td[^>]*>(?<finalscore>((?!</td).)*)</td[^>]*></tr[^>]*>

It uses the position in the table relative to the row as the way to find
out which is which.

1st td: team
2nd to 5th td: score for each quarter (score x4)
6th td: final score

Now when you match against it, make sure you specify the option RegexOptions.Singleline,
so that the . will match a newline. I might have forgotten to mention that
before.

The code should work out like this:

// ".." stands for the expression given above
private static Regex scoreRegex = new Regex("..", RegexOptions.Singleline
    | RegexOptions.Compiled | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);

public void ExtractData(string htmlInput)
{
    Match m = scoreRegex.Match(htmlInput);

    while (m != null && m.Success)
    {
        string team = m.Groups["team"].Value;
        string quarter1 = m.Groups["score"].Captures[0].Value;
        string quarter2 = m.Groups["score"].Captures[1].Value;
        string quarter3 = m.Groups["score"].Captures[2].Value;
        string quarter4 = m.Groups["score"].Captures[3].Value;
        string final = m.Groups["finalscore"].Value;

        // Do your thing with the data before moving on to the next match

        m = m.NextMatch();
    }
}

If you look at the differences between this regex and the last, the most
important one is that the score part is now repeated exactly 4 times:
{4}.

I hope this works out.
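
For completeness, a self-contained sketch of the same idea that collects the rows into a list, ready to be written to a database. ScoreRow, ScoreExtractor and Extract are names invented for this example, the sample table has its attributes and hrefs shortened, and the int.Parse calls assume every score cell holds a plain number:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class ScoreRow
{
    public string Team;
    public int[] Quarters = new int[4];
    public int Final;
}

class ScoreExtractor
{
    // The expression from the post, split across lines only for readability.
    static readonly Regex ScoreRegex = new Regex(
        @"<tr[^>]*>\s*<td[^>]*>\s*<a[^>]*>(?<team>((?!</a).)*)</a[^>]*>\s*</td[^>]*>\s*" +
        @"(\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>){4}\s*" +
        @"<td[^>]*>(?<finalscore>((?!</td).)*)</td[^>]*></tr[^>]*>",
        RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);

    // Returns one ScoreRow per team row; the header row has no <a> and is skipped.
    public static List<ScoreRow> Extract(string html)
    {
        List<ScoreRow> rows = new List<ScoreRow>();
        for (Match m = ScoreRegex.Match(html); m.Success; m = m.NextMatch())
        {
            ScoreRow row = new ScoreRow();
            row.Team = m.Groups["team"].Value.Trim();
            for (int i = 0; i < 4; i++)
                row.Quarters[i] = int.Parse(m.Groups["score"].Captures[i].Value.Trim());
            row.Final = int.Parse(m.Groups["finalscore"].Value.Trim());
            rows.Add(row);
        }
        return rows;
    }

    static void Main()
    {
        // Shortened stand-in for the posted table.
        string html =
            "<table><tr><td>Team</td><td>1</td><td>2</td><td>3</td><td>4</td><td>Score</td></tr>" +
            "<tr><td><A href=\"078.htm\">New Orleans</A></td><td >0</td><td >10</td><td >0</td><td >0</td><td >10</td></tr>" +
            "<tr><td><A href=\"071.htm\">Indianapolis</A></td><td >7</td><td >3</td><td >14</td><td >17</td><td >41</td></tr></table>";

        foreach (ScoreRow r in Extract(html))
            Console.WriteLine(r.Team + ": " + r.Final);
    }
}
```

If a cell might ever hold something other than digits, int.TryParse would be the safer choice.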

Jesse
<html>
<table border="0" width="100%"><tr><td width="40%">Team</td><td
width="10%"
align="center">1</td> <td width="10%" align="center">2</td><td
width="10%"
align="center">3</td> <td width="10%" align="center">4</td><td
width="20%"
align="center">Score</td></tr><tr><td width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/078.htm">New
Orleans</A></td><td width="10%" align="center" >0</td><td width="10%"
align="center" >10</td><td width="10%" align="center" >0</td><td
width="10%"
align="center" >0</td><td width="20%" align="center"
10</td></tr><tr><td
width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/071.htm">Indianapoli
s</A></td><td
width="10%" align="center" >7</td><td width="10%" align="center"
3</td><td
width="10%" align="center" >14</td><td width="10%" align="center"
17</td><td
width="20%" align="center" >41</td></tr></table>
</html>
Jesse Houwing said:
Hello JP,

I updated the expression and tested it. I'm not sure what to capture
where. Maybe if you can describe the table row to me I could do a
better expression.

Right now it captures as follows: team, quarter and the score (4x).

<tr[^>]*>\s*<td[^>]*>\s*<a[^>]*>(?<team>((?!</a).)*)</a[^>]*>\s*</td[
^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>(\s*<td[^>]*>(?<sc
ore>((?!</td).)*)</td[^>]*>)+\s*</tr[^>]*>

it results in a match that is built up as follows:

Match
+- Groups
+- Team.Value "New Orleans"
+- Quarter.Value "0"
+- Score.Captures[0].Value = 10
+- Score.Captures[1].Value = 0
+- Score.Captures[2].Value = 0
+- Score.Captures[3].Value = 10
My question now is... did I interpret it right?

Is the row built up of a team name, quarter number and 4 scores?

To explain the expression:

<tr[^>*> Find a start of a row
\s* allow whitespace between <tr> tag and the first <td>
<td[^>]*> Find the first table cell.
\s* allow whitespace between <td> and <a>
<a[^>]*> Find the opening a href tag
(?team((?!</a).)*) capture everything you find from there to the </a
tag
in a group named team.
</a[^>]*> Find the closing a tag
\s* allow whitespace between </a> and </td>
</td[^>]*> Find the closing td tag
\s* allow whitespace between </td> and <td>
<td[^>]*> Find opening <td>
(?<quarter>((?!</td).)*) capture everything from there to the </ta
tag into
a group named quarter
</td[^>]*> Find the closing td tag
( start repeating group
\s* allow whitespace between </td> and <td>
<td[^>]*> Find opening <td>
(?<score>((?!</td).)*) capture everything from there to the </ta tag
into
a group named score
</td[^>]*> Find the closing td tag
)+ end repeating group. If the named group in this repeating group
captured
multiple times. The value scan be found in the Group.Captures
collection.
\s*
</tr[^>]*> Finally make sure we've captured a complete row by finding
it's
end tag.
Jesse
Hi Jesse,

Thanks again for the assistance.

I have taken what you have posted and put it into a console app but
it doesn't seem to pick up anything using the expression. If I
uncomment the shortpat and the first foreach, I get data. Am I
missing something?

static void Main(string[] args)
{
//string shortpat = @">([\w\s]+)<";
string pat =
@"<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<qu
ar
ter>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>
]*
\s*</tr[^>]>";

string html = @"<table border=""0"" width=""100%""><tr><td
width=""40%"">Team</td><td width=""10%""
align=""center"">1</td>
<td
width=""10%"" align=""center"">2</td><td width=""10%""
align=""center"">3</td> <td width=""10%""
align=""center"">4</td><td
width=""20%"" align=""center"">Score</td></tr><tr><td
width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/078.htm"">New
Orleans</A></td><td width=""10%"" align=""center""

0</td><td

width=""10%"" align=""center"" >10</td><td width=""10%""
align=""center"" >0</td><td width=""10%"" align=""center""

0</td><td

width=""20%"" align=""center"" >10</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/
071.htm"">Indianapolis</A>-</td><td width=""10%"" align=""center""
7</
td>> <td width=""10%"" align=""center"" >3</td><td width=""10%"" td>>
align=""center"" >14</td><td width=""10%""
align=""center"" >17</
td>> <td width=""20%"" align=""center"" >41</td></tr></table>"; td>>
Regex r = new Regex(pat);

//foreach (Match m in r.Matches(html))
//{
// if (m.Groups[1].Value.Trim() == "")
// // ignore these
// continue;
// else
// Console.WriteLine(m.Groups[1].Value);
// Console.ReadLine();
//}
foreach (Match m in r.Matches(html))
{
string team = m.Groups["team"].Value;
string quarter = m.Groups["quarter"].Value;
string score = m.Groups["score"].Value;
Console.WriteLine(team);
Console.WriteLine(quarter);
Console.WriteLine(score);
}
Console.ReadLine();
}

:

Hello JP,

Jesse,

That's exactly what I am trying to do with the data within the
HTML. The expressions and code you listed don't apply to the HTML
I posted correct?

They do. by just saying <td[^>]+> you say find a <td tag. Then find
everything up to the closing > and then match the closing >. It
ignores all the border, widtch, height and other stuff in there.

For now I ignored the <a href> around the team name, but other than
that, the expression should work.

I am not sure how I could group the teams and quarters if they are
not labeled. I probably don't understand how regex works...

Well if they're always in the xth cell as in you example, you can
use their position (as I've done in the expression). The
(?<name>...) construct in the expression then labels them.

Jesse

:

Hello JP,

I am creating a screen scraping app that will extract data from
a website. The screen scraping is pretty straightforward using
.NET 2.0, but stripping out all extraneous characters is proving
to be more difficult. I am basically trying to extract the
team, quarter, score for the quarter, and score for the entire
game from this html. (This html is a subset of the entire
page)

<table border="0" width="100%"><tr><td width="40%">Team</td><td
width="10%" align="center">1</td> <td width="10%"
align="center">2</td><td width="10%" align="center">3</td> <td
width="10%" align="center">4</td><td width="20%"
align="center">Score</td></tr><tr><td width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/078.htm">New
Orleans</A></td><td width="10%" align="center" >0</td><td
width="10%" align="center" >10</td><td width="10%"
align="center" >0</td><td width="10%" align="center" >0</td><td
width="20%" align="center" >10</td></tr><tr><td width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/071.htm">Indianapolis</A></td><td
width="10%" align="center" >7</td><td width="10%" align="center" >3</td><td
width="10%" align="center" >14</td><td width="10%" align="center" >17</td><td
width="20%" align="center" >41</td></tr></table>

In essence I want to be able to put the names and scores into an
array so I can add to a database. From what I read regular
expressions should be able to do this but I am a complete
beginner using regex. Could someone assist in getting me
started? Many thanks.

I posted a regex a while back that did almost this.

<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^>]*>

this will extract all the rows with the info you need.

It will store the respective values in a named group, so they're
easily extracted:

foreach (Match m in regex.Matches(input))
{
string team = m.Groups["team"].Value;
string quarter = m.Groups["quarter"].Value;
string score = m.Groups["score"].Value;
}
You can even chain this expression so you can get all the results in
one pass:

(<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^>]*>\s*)+

Match m = regex.Match(input);
// Each repetition of the outer group adds one capture to every named group.
for (int i = 0; i < m.Groups["team"].Captures.Count; i++)
{
    string team = m.Groups["team"].Captures[i].Value;
    string quarter = m.Groups["quarter"].Captures[i].Value;
    string score = m.Groups["score"].Captures[i].Value;
}

As an alternative you might want to have a look at the HTML Agility
Pack. It allows you to do XPath queries over HTML. A very powerful
way to extract data from HTML files.
http://www.codeplex.com/htmlagilitypack
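
For comparison, a rough sketch of the XPath route. It assumes the third-party HtmlAgilityPack assembly is referenced in the project; the cut-down table and the shortened href are stand-ins for illustration, not the real page. The XPath //tr[td[1]/a] is one possible way to pick the team rows (rows whose first cell holds a link) and skip the header row.

```csharp
using System;
using HtmlAgilityPack; // third-party library, not part of the .NET Framework

public static class AgilityPackSketch
{
    public static void Main()
    {
        // Cut-down stand-in for the score table; the href is shortened.
        string html = "<table><tr><td>Team</td><td>1</td><td>Score</td></tr>"
                    + "<tr><td><A href=\"/teams/078.htm\">New Orleans</A></td>"
                    + "<td>0</td><td>10</td></tr></table>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Rows whose first cell contains a link, i.e. the team rows.
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr[td[1]/a]");
        if (rows == null) return; // SelectNodes returns null when nothing matches

        foreach (HtmlNode row in rows)
        {
            // Relative XPath: the cells of this row, in document order.
            foreach (HtmlNode cell in row.SelectNodes("td"))
                Console.WriteLine(cell.InnerText.Trim());
        }
    }
}
```

The appeal over a regex is that cell attributes, whitespace and line breaks inside tags stop mattering, because the parser has already dealt with them.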
 
G

Guest

Thanks Jesse.

Jesse Houwing said:
Hello JP,
Jesse,

Cut and paste this text into notepad and then save as an html file.
You will see how the data relates. The teams run vertically along
with the quarters and final score. I guess that is what I am having a
hard time understanding. How you can group the different data points
with the way this table is structured.

I did write a little routine to pull out different data points based
on looping through the data but it is a bit kludgy. Was hoping to do
something more robust like your solution.

Given that the last column is the final score (excuse my lack of insight
into sports with more than 2 halves, I'm a European soccer watcher), it should
be quite easy to alter the expression I gave before:

<tr[^>]*>\s*<td[^>]*>\s*<a[^>]*>(?<team>((?!</a).)*)</a[^>]*>\s*</td[^>]*>\s*(\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>){4}\s*<td[^>]*>(?<finalscore>((?!</td).)*)</td[^>]*></tr[^>]*>

It uses the position in the table relative to the row as the way to find
out which is which.

1st td, team
2nd td .. 5th td, score for each quarter (score x4)
6th column final score

Now if you match against it, make sure you specify the option RegexOptions.Singleline,
so that the . will match a newline. I might have forgotten to mention that
before.
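
A small sketch of what that option changes (the cell text with an embedded line break is made up for illustration; in .NET the enum member is spelled RegexOptions.Singleline). Without the option, . stops at a newline, so a cell whose text wraps onto a second line is not matched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SinglelineDemo
{
    // Same "everything up to </td" construct as in the expression above.
    private const string CellPattern = @"<td>(?<team>((?!</td).)*)</td>";

    public static bool CellMatches(string input, RegexOptions options)
    {
        return Regex.IsMatch(input, CellPattern, options);
    }

    public static void Main()
    {
        string cell = "<td>New\nOrleans</td>"; // team name wrapped over two lines

        Console.WriteLine(CellMatches(cell, RegexOptions.None));       // False
        Console.WriteLine(CellMatches(cell, RegexOptions.Singleline)); // True
    }
}
```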

The code should work out like this:

// ".." stands for the expression given above.
private static Regex scoreRegex = new Regex ("..", RegexOptions.Singleline
    | RegexOptions.Compiled | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);

public void ExtractData(string htmlInput)
{
    Match m = scoreRegex.Match(htmlInput);

    // One match per table row that has a team link in its first cell.
    while (m != null && m.Success)
    {
        string team = m.Groups["team"].Value;
        // The score group sits inside the {4} repetition, so it holds 4 captures.
        string quarter1 = m.Groups["score"].Captures[0].Value;
        string quarter2 = m.Groups["score"].Captures[1].Value;
        string quarter3 = m.Groups["score"].Captures[2].Value;
        string quarter4 = m.Groups["score"].Captures[3].Value;
        string final = m.Groups["finalscore"].Value;

        // Do your thing with the data before moving on to the next match

        m = m.NextMatch();
    }
}

If you look at the differences between this regex and the last, the most
important one is that the score part will only be repeated 4 times:
{4}.

I hope this works out.

Jesse
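
Putting the pieces of this post together, a self-contained sketch might look like the following. The class name, the Extract helper and the string[] layout are made up for illustration; the expression is the one given above and SampleHtml is the table from the question, joined onto one logical line.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ScoreScraper
{
    // The row expression from the post, split over three literals for readability.
    private const string RowPattern =
        @"<tr[^>]*>\s*<td[^>]*>\s*<a[^>]*>(?<team>((?!</a).)*)</a[^>]*>\s*</td[^>]*>\s*" +
        @"(\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>){4}\s*" +
        @"<td[^>]*>(?<finalscore>((?!</td).)*)</td[^>]*></tr[^>]*>";

    private static readonly Regex RowRegex = new Regex(RowPattern,
        RegexOptions.Singleline | RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);

    // The table from the question, without the line wrapping.
    public const string SampleHtml =
        @"<table border=""0"" width=""100%""><tr><td width=""40%"">Team</td>" +
        @"<td width=""10%"" align=""center"">1</td> <td width=""10%"" align=""center"">2</td>" +
        @"<td width=""10%"" align=""center"">3</td> <td width=""10%"" align=""center"">4</td>" +
        @"<td width=""20%"" align=""center"">Score</td></tr>" +
        @"<tr><td width=""40%""><A href=""/default.asp?c=sportsnetwork&page=nfl/teams/078.htm"">New Orleans</A></td>" +
        @"<td width=""10%"" align=""center"" >0</td><td width=""10%"" align=""center"" >10</td>" +
        @"<td width=""10%"" align=""center"" >0</td><td width=""10%"" align=""center"" >0</td>" +
        @"<td width=""20%"" align=""center"" >10</td></tr>" +
        @"<tr><td width=""40%""><A href=""/default.asp?c=sportsnetwork&page=nfl/teams/071.htm"">Indianapolis</A></td>" +
        @"<td width=""10%"" align=""center"" >7</td><td width=""10%"" align=""center"" >3</td>" +
        @"<td width=""10%"" align=""center"" >14</td><td width=""10%"" align=""center"" >17</td>" +
        @"<td width=""20%"" align=""center"" >41</td></tr></table>";

    // Returns one array per team row: { team, q1, q2, q3, q4, final }.
    public static List<string[]> Extract(string html)
    {
        List<string[]> rows = new List<string[]>();
        foreach (Match m in RowRegex.Matches(html))
        {
            string[] row = new string[6];
            row[0] = m.Groups["team"].Value.Trim();
            for (int i = 0; i < 4; i++)
                row[i + 1] = m.Groups["score"].Captures[i].Value.Trim();
            row[5] = m.Groups["finalscore"].Value.Trim();
            rows.Add(row);
        }
        return rows;
    }

    public static void Main()
    {
        foreach (string[] row in Extract(SampleHtml))
            Console.WriteLine(string.Join(", ", row));
        // New Orleans, 0, 10, 0, 0, 10
        // Indianapolis, 7, 3, 14, 17, 41
    }
}
```

The header row is skipped without any special handling: its first cell contains plain text rather than a link, so the <a[^>]*> part of the expression fails there and the search moves on to the next <tr>.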
<html>
<table border="0" width="100%"><tr><td width="40%">Team</td><td
width="10%" align="center">1</td> <td width="10%" align="center">2</td><td
width="10%" align="center">3</td> <td width="10%" align="center">4</td><td
width="20%" align="center">Score</td></tr><tr><td width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/078.htm">New
Orleans</A></td><td width="10%" align="center" >0</td><td width="10%"
align="center" >10</td><td width="10%" align="center" >0</td><td
width="10%" align="center" >0</td><td width="20%" align="center" >10</td></tr><tr><td
width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/071.htm">Indianapolis</A></td><td
width="10%" align="center" >7</td><td width="10%" align="center" >3</td><td
width="10%" align="center" >14</td><td width="10%" align="center" >17</td><td
width="20%" align="center" >41</td></tr></table>
</html>
Jesse Houwing said:
Hello JP,

I updated the expression and tested it. I'm not sure what to capture
where. Maybe if you can describe the table row to me I could do a
better expression.

Right now it captures as follows: team, quarter and the score (4x).

<tr[^>]*>\s*<td[^>]*>\s*<a[^>]*>(?<team>((?!</a).)*)</a[^>]*>\s*</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>(\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>)+\s*</tr[^>]*>

it results in a match that is built up as follows:

Match
+- Groups
+- Team.Value "New Orleans"
+- Quarter.Value "0"
+- Score.Captures[0].Value = 10
+- Score.Captures[1].Value = 0
+- Score.Captures[2].Value = 0
+- Score.Captures[3].Value = 10
My question now is... did I interpret it right?

Is the row built up of a team name, quarter number and 4 scores?

To explain the expression:

<tr[^>]*>                  Find the start of a row
\s*                        Allow whitespace between the <tr> tag and the first <td>
<td[^>]*>                  Find the first table cell
\s*                        Allow whitespace between <td> and <a>
<a[^>]*>                   Find the opening a href tag
(?<team>((?!</a).)*)       Capture everything from there to the </a tag
                           in a group named team
</a[^>]*>                  Find the closing a tag
\s*                        Allow whitespace between </a> and </td>
</td[^>]*>                 Find the closing td tag
\s*                        Allow whitespace between </td> and <td>
<td[^>]*>                  Find the opening <td>
(?<quarter>((?!</td).)*)   Capture everything from there to the </td tag
                           into a group named quarter
</td[^>]*>                 Find the closing td tag
(                          Start repeating group
\s*                        Allow whitespace between </td> and <td>
<td[^>]*>                  Find the opening <td>
(?<score>((?!</td).)*)     Capture everything from there to the </td tag
                           into a group named score
</td[^>]*>                 Find the closing td tag
)+                         End repeating group. If the named group in this
                           repeating group captured multiple times, the values
                           can be found in the Group.Captures collection.
\s*                        Allow whitespace between the last </td> and </tr>
</tr[^>]*>                 Finally, make sure we've captured a complete row
                           by finding its end tag.
Jesse
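
The same breakdown can live inside the pattern itself. The sketch below is an illustration, not something posted in the thread: it uses RegexOptions.IgnorePatternWhitespace, which makes the engine ignore unescaped whitespace in the pattern and treat # as the start of a comment. The class and method names are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentedPatternDemo
{
    // Same expression as above, laid out one step per line.
    private const string Pattern = @"
        <tr[^>]*>                         # start of a row
        \s* <td[^>]*>                     # first cell
        \s* <a[^>]*>                      # opening link tag
        (?<team> ((?!</a).)* )            # team name, up to the closing link tag
        </a[^>]*> \s* </td[^>]*>          # close the link and the cell
        \s* <td[^>]*>
        (?<quarter> ((?!</td).)* )        # second cell
        </td[^>]*>
        ( \s* <td[^>]*>
          (?<score> ((?!</td).)* )        # every remaining cell
          </td[^>]*> )+
        \s* </tr[^>]*>                    # end of the row";

    private static readonly Regex RowRegex = new Regex(Pattern,
        RegexOptions.IgnorePatternWhitespace | RegexOptions.Singleline |
        RegexOptions.ExplicitCapture | RegexOptions.IgnoreCase);

    public static Match MatchRow(string html)
    {
        return RowRegex.Match(html);
    }

    public static void Main()
    {
        string row = "<tr><td width=\"40%\"><A href=\"/default.asp?c=sportsnetwork&page=nfl/teams/078.htm\">New Orleans</A></td>"
                   + "<td width=\"10%\" align=\"center\" >0</td><td width=\"10%\" align=\"center\" >10</td>"
                   + "<td width=\"10%\" align=\"center\" >0</td><td width=\"10%\" align=\"center\" >0</td>"
                   + "<td width=\"20%\" align=\"center\" >10</td></tr>";

        Match m = MatchRow(row);
        Console.WriteLine(m.Groups["team"].Value);           // New Orleans
        Console.WriteLine(m.Groups["quarter"].Value);        // 0
        Console.WriteLine(m.Groups["score"].Captures.Count); // 4
    }
}
```

The result has the same shape as the match tree shown in the post: one team, one quarter value, and four captures in the score group.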

Hi Jesse,

Thanks again for the assistance.

I have taken what you have posted and put it into a console app but
it doesn't seem to pick up anything using the expression. If I
uncomment the shortpat and the first foreach, I get data. Am I
missing something?

static void Main(string[] args)
{
    //string shortpat = @">([\w\s]+)<";
    string pat =
        @"<tr[^>]*><td[^>]*>(?<team>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<quarter>((?!</td).)*)</td[^>]*>\s*<td[^>]*>(?<score>((?!</td).)*)</td[^>]*>\s*</tr[^>]>";

    string html = @"<table border=""0"" width=""100%""><tr><td
width=""40%"">Team</td><td width=""10%"" align=""center"">1</td> <td
width=""10%"" align=""center"">2</td><td width=""10%""
align=""center"">3</td> <td width=""10%"" align=""center"">4</td><td
width=""20%"" align=""center"">Score</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/078.htm"">New
Orleans</A></td><td width=""10%"" align=""center"" >0</td><td
width=""10%"" align=""center"" >10</td><td width=""10%""
align=""center"" >0</td><td width=""10%"" align=""center"" >0</td><td
width=""20%"" align=""center"" >10</td></tr><tr><td width=""40%""><A
href=""/default.asp?c=sportsnetwork&page=nfl/teams/071.htm"">Indianapolis</A></td><td
width=""10%"" align=""center"" >7</td><td width=""10%"" align=""center"" >3</td><td
width=""10%"" align=""center"" >14</td><td width=""10%"" align=""center"" >17</td><td
width=""20%"" align=""center"" >41</td></tr></table>";

    Regex r = new Regex(pat);

    //foreach (Match m in r.Matches(html))
    //{
    //    if (m.Groups[1].Value.Trim() == "")
    //        // ignore these
    //        continue;
    //    else
    //        Console.WriteLine(m.Groups[1].Value);
    //    Console.ReadLine();
    //}
    foreach (Match m in r.Matches(html))
    {
        string team = m.Groups["team"].Value;
        string quarter = m.Groups["quarter"].Value;
        string score = m.Groups["score"].Value;
        Console.WriteLine(team);
        Console.WriteLine(quarter);
        Console.WriteLine(score);
    }
    Console.ReadLine();
}

 
A

alex_f_il

I am creating a screen scraping app that will extract data from a website.
The screen scraping is pretty straightforward using .NET 2.0, but stripping
out all extraneous characters is proving to be more difficult. I am
basically trying to extract the team, quarter, score for the quarter, and
score for the entire game from this html. (This html is a subset of the
entire page)

<table border="0" width="100%"><tr><td width="40%">Team</td><td width="10%"
align="center">1</td> <td width="10%" align="center">2</td><td width="10%"
align="center">3</td> <td width="10%" align="center">4</td><td width="20%"
align="center">Score</td></tr><tr><td width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/078.htm">New
Orleans</A></td><td width="10%" align="center" >0</td><td width="10%"
align="center" >10</td><td width="10%" align="center" >0</td><td width="10%"
align="center" >0</td><td width="20%" align="center" >10</td></tr><tr><td
width="40%"><A
href="/default.asp?c=sportsnetwork&page=nfl/teams/071.htm">Indianapolis</A></td><td
width="10%" align="center" >7</td><td width="10%" align="center" >3</td><td
width="10%" align="center" >14</td><td width="10%" align="center">17</td><td
width="20%" align="center" >41</td></tr></table>

In essence I want to be able to put the names and scores into an array so I
can add to a database. From what I read regular expressions should be able
to do this but I am a complete beginner using regex. Could someone assist
in getting me started? Many thanks.

You can also try SWExplorerAutomation (SWEA), http://webius.net. SWEA's
Visual Data Extractors (XPathDataExtractor and TableDataExtractor)
save time when developing web scraping solutions.
 
