Need help in Regex.

jmchadha

I have got the following html:

"something in html ... etc.. city1... etc... <a class="font1"
href="city1.html" onclick="etc."> click for <b>info</b> on city1 </a>
... some html. city1.. can repeat lot of times here....

Requirement:
-------------------
I want to get the value of "href", i.e. "city1.html", by searching for "city1"
between the <a> and </a> tags. Please note that "city1" can repeat many
times in the HTML, but I only want the "city1" that lies between the
<a></a> tags. Please also note that there can be other tags between the
<a></a> tags, like the <b></b> tag in the HTML above.

I want to do this in C# using Regex. I am new to Regex and I would
really appreciate any help on this.

Thanks
JM
 
Nicholas Paldino [.NET/C# MVP]

JM,

Why use Regex? Why not use MSHTML through COM interop and just parse
the HTML? Then, you can access the object model and find the item that way.

Hope this helps.
 
jmchadha

I am using this for a small .NET program and don't want to use
unmanaged code/COM. That's why I'm looking for a solution based on Regex.

Thanks
JM
 
Jesse Houwing


Regex might not be the best solution for this problem, but it is
possible. Note that you should probably not use this if the HTML you're
parsing is arbitrary, but for an ad hoc search through a couple of files,
it should do:

@"<a[^>]+href=""([^""]+)""[^>]+>(?:(?!</a).)+city1(?:(?!</a).)+</a"

It starts searching when it encounters an <a tag, then continues on
looking for anything that's not > and picks out the href attribute. It
captures the href value in a group and searches on for the end of the
opening tag. Once there it searches for city1, failing if it finds </a
before encountering that specific text.

And the code to extract the value would then be something like this (I
haven't compiled it, so it might contain a few small errors):

using System.Text.RegularExpressions;

// Group 1 captures the value of the href attribute.
Regex rx = new Regex(
    @"<a[^>]+href=""([^""]+)""[^>]+>(?:(?!</a).)+city1(?:(?!</a).)+</a",
    RegexOptions.None);
Match m = rx.Match(input);
if (m.Success)
{
    string href = m.Groups[1].Value;
}
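
For anyone who wants to try this end to end, here is a minimal console sketch of the same pattern run against the sample HTML from the question. The class name HrefFinder and the method FindHref are made up for illustration. The two regex options are my own assumption and not part of the reply above: RegexOptions.Singleline lets "." cross line breaks inside the link text, and RegexOptions.IgnoreCase also accepts <A HREF=...>.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefFinder
{
    // Same pattern as in the reply; group 1 captures the double-quoted href value.
    // The options are an assumption added for this sketch (see the note above).
    private static readonly Regex LinkPattern = new Regex(
        @"<a[^>]+href=""([^""]+)""[^>]+>(?:(?!</a).)+city1(?:(?!</a).)+</a",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns the href of the first link whose content contains "city1", or null.
    public static string FindHref(string html)
    {
        Match m = LinkPattern.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }

    public static void Main()
    {
        string html =
            "city1 appears here too ... <a class=\"font1\"\n" +
            "href=\"city1.html\" onclick=\"etc.\"> click for <b>info</b> on city1 </a>" +
            " ... and city1 again";

        Console.WriteLine(FindHref(html)); // prints "city1.html"
    }
}
```

The occurrences of "city1" outside the link are ignored because the pattern only starts matching at an <a tag and refuses to run past </a.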


Jesse
 
Jesse Houwing


@"<a[^>]+href=""([^""]+)""[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a"

might be even better, as it will also find: <a href="...">city1</a>

Jesse
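
To see what the change from + to * buys, here is a small sketch that runs both patterns against a bare link with nothing after the href and no text around "city1". The names PatternComparison, Strict and Relaxed are only labels for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternComparison
{
    // First version: requires at least one character after the href
    // attribute and on both sides of "city1".
    public const string Strict =
        @"<a[^>]+href=""([^""]+)""[^>]+>(?:(?!</a).)+city1(?:(?!</a).)+</a";

    // Second version: those parts may be empty.
    public const string Relaxed =
        @"<a[^>]+href=""([^""]+)""[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a";

    public static bool Matches(string pattern, string html)
    {
        return Regex.IsMatch(html, pattern);
    }

    public static void Main()
    {
        string bare = "<a href=\"city1.html\">city1</a>";

        Console.WriteLine(Matches(Strict, bare));  // False
        Console.WriteLine(Matches(Relaxed, bare)); // True
    }
}
```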
 
jmchadha

Hi Jesse,

Thanks a lot for your help. It worked like a charm. I am searching
only one particular page, so I think it should be fine.

The only thing is that it works only if the value of "href" is in double
quotes. But I also need to handle an "href" that has:
(a) a value within single quotes, like href='city1.html'
(b) no single or double quotes at all, i.e.
href=city1.html. I think the criterion here is that after
"city1.html" there is at least a single space.

Thanks & Regards
JM

 
Jesse Houwing


Ok, can do that as well:

@"<a[^>]+href=(?:""(?<url>[^""]+)""|'(?<url>[^']+)'|(?<url>[^\s'"">]+))[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a"

To make extracting the value easier I've named the capturing group "url"
(watch the wrapping).

// The named group "url" holds the href value, whichever quoting style matched.
Regex rx = new Regex(
    @"<a[^>]+href=(?:""(?<url>[^""]+)""|'(?<url>[^']+)'|(?<url>[^\s'"">]+))[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a",
    RegexOptions.None);
Match m = rx.Match(input);
if (m.Success)
{
    string href = m.Groups["url"].Value;
}
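
A quick way to check the three quoting styles is a sketch like the one below. QuotedHrefFinder and FindHref are hypothetical names, and the three test links are made up for this example. Note that inside a C# verbatim string (@"...") every double quote has to be doubled, including the one in the last character class, so it is written [^\s'"">] here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedHrefFinder
{
    // .NET allows the same group name in several alternatives; only the
    // alternative that actually matched contributes a value to "url".
    private static readonly Regex LinkPattern = new Regex(
        @"<a[^>]+href=(?:""(?<url>[^""]+)""|'(?<url>[^']+)'|(?<url>[^\s'"">]+))" +
        @"[^>]*>(?:(?!</a).)*city1(?:(?!</a).)*</a",
        RegexOptions.None);

    // Returns the href of the first link whose content contains "city1", or null.
    public static string FindHref(string html)
    {
        Match m = LinkPattern.Match(html);
        return m.Success ? m.Groups["url"].Value : null;
    }

    public static void Main()
    {
        string[] samples =
        {
            "<a class=\"x\" href=\"city1.html\">info on city1</a>", // double quotes
            "<a href='city1.html'>city1</a>",                       // single quotes
            "<a href=city1.html class=x>city1</a>"                  // no quotes
        };

        foreach (string html in samples)
        {
            Console.WriteLine(FindHref(html)); // prints "city1.html" each time
        }
    }
}
```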
 
jmchadha

Hi Jesse,

It works great. Thanks a lot for all your time. I really appreciate it.

By the way, what are some good resources on the net where I can start
learning about Regex?

And since you said that Regex is not the best option for arbitrary HTML,
what is the best option? Just curious to know.

Thanks again for all your help
JM

 
Jesse Houwing

jmchadha said:
Hi Jesse,

It works great. Thanks a lot for all your time. I really appreciate it.

You're welcome.

jmchadha said:
btw, what are the good resources on net from where I can start learning
about Regex ?

Download The Regulator and try experimenting (just disable intellisense
from the options window, it doesn't work).
http://regex.osherove.com/

Try some of the exercises here:
http://www.cs.princeton.edu/introcs/72regular/

There are a number of articles on MSDN that might also be of help:
http://msdn.microsoft.com/library/d...html/2380d458-3366-402b-996c-9363906a7353.asp
http://msdn2.microsoft.com/en-us/library/az24scfc.aspx

And a general explanation on regular expressions (not specific to .Net):
http://www.regularexpressions.info/

Especially for .Net, buy the ebook from Dan Appleman:
http://www.amazon.com/gp/product/B0000632ZU/103-6339483-1670225?v=glance&n=551440

And for general insight on the workings of Regular Expressions the
following book is one of the best resources available:
http://www.amazon.com/gp/product/05...103-6339483-1670225?s=books&v=glance&n=283155

And finally:
Keep trying! Keep practising and don't be afraid to ask questions.

jmchadha said:
And as you were saying that for arbitrary html Regex is not the best
option, then what is the best option for the same. Just curious to
know.

HTML isn't that strict and people write some funny stuff in their pages
once in a while. You can't predict these things, and you're better
equipped with a parser that knows most of these exceptions.

There is the .NET Html Agility Pack, which parses HTML and makes it
searchable. The MSHTML object can also be a great help.
http://smourier.blogspot.com/
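
As a rough idea of what the parser route looks like, here is a sketch using the Html Agility Pack. It assumes the HtmlAgilityPack package has been added to the project (it is a third-party library, not part of .NET), and the class name AgilityPackExample and the method FindHref are made up for illustration.

```csharp
using System;
using HtmlAgilityPack; // third-party package, assumed to be referenced by the project

public static class AgilityPackExample
{
    // Returns the href of the first link whose text contains the given string, or null.
    public static string FindHref(string html, string text)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return null;
        }

        foreach (HtmlNode a in anchors)
        {
            // InnerText is the link content with nested tags such as <b> removed.
            if (a.InnerText.Contains(text))
            {
                return a.GetAttributeValue("href", string.Empty);
            }
        }
        return null;
    }

    public static void Main()
    {
        string html =
            "city1 ... <a class='font1' href='city1.html'> click for <b>info</b> on city1 </a>";
        Console.WriteLine(FindHref(html, "city1"));
    }
}
```

The parser deals with quoting styles, attribute order and nested tags itself, which is exactly what the regex had to be extended for by hand.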

jmchadha said:
Thanks again for all your help

Again, welcome!

Jesse
 
jmchadha

I will definitely go through the resources you have mentioned here. You
have solved a problem I had been struggling with for the past week. Thanks
for your time and professional help.

Regards
JM

 
