I need to find all instances of "http://" and copy each one, plus every character up to ".com", into an array.

trint

I need to find all instances of "http://" and copy each one, plus every
character up to ".com", into an array.
example:

mshtml.HTMLDocumentClass doc = (mshtml.HTMLDocumentClass)this.axWebBrowser1.Document;
StringBuilder sb = new StringBuilder();
string str1 = "";
sb.Append(doc.documentElement.innerHTML);
str1 = sb.ToString();

The contents of str1 at this point are:

<TR>
<TD VALIGN=center ALIGN=center><A HREF="http://www.apl.com/" target="new">American President Line</A></TD>
<TD VALIGN=center ALIGN=center><A HREF="http://www.maerskline.com/" target="new">Maersk-SeaLand</A></TD>
<TD VALIGN=center ALIGN=center><A HREF="http://www.chevron.com/about/our_businesses/shipping.asp" target="new">Chevron Shipping</A></TD>
</TR>

I want each element of an array to contain only the entire
"http://www.address.com".

Anything that you gurus can point me to or tell me is much
appreciated. Searcharoo isn't what I really need to do.
Thanks,
Trint
 
Pavel Minaev

You really need to be more specific. How about stuff such as:

<link rel="stylesheet" href="http://example.com/style.css" />

Somehow I don't think you'd want the above URL in your list. Or, say,
how about a URL that's just plain text in the middle of a paragraph:

<p>The following is not a link: http://example.com - just plain
text</p>

Also, what about "https://"?..

It's also not clear what kind of input you have there. Is it
guaranteed to be valid XHTML (i.e. parsable as XML)? Or is it HTML tag
soup? Or is it arbitrary text which may happen to be HTML sometimes
but not always?
 
Jesse Houwing

Hello trint,

You can easily fetch the href attributes using the HTMLDocumentClass (you
might also want to look at the HTML Agility Pack on CodePlex).

Then use the System.Uri class to get the part that you need.

Or you could use a regex if this isn't going to be a production app but a
one-time tool (something like http://[^/]+ ).
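
As a rough sketch of the System.Uri step, assuming the href values have already been collected into a sequence of strings: the class name HrefHosts and the method ToSiteRoots below are made up for illustration. Uri.GetLeftPart(UriPartial.Authority) keeps the scheme and host and drops the path, which seems to be the part being asked for.

```csharp
using System;
using System.Collections.Generic;

static class HrefHosts
{
    // Hypothetical helper: reduces each absolute http/https URL to "scheme://host".
    public static string[] ToSiteRoots(IEnumerable<string> hrefs)
    {
        var roots = new List<string>();
        foreach (string href in hrefs)
        {
            Uri uri;
            // Skip relative or malformed values instead of throwing.
            if (!Uri.TryCreate(href, UriKind.Absolute, out uri))
                continue;
            // Ignore mailto:, javascript:, file: and similar schemes.
            if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
                continue;
            // Scheme plus host (and port, if a non-default one is given); the path is dropped.
            roots.Add(uri.GetLeftPart(UriPartial.Authority));
        }
        return roots.ToArray();
    }
}
```

Unlike a pattern that ends at ".com", this keeps https:// links and hosts in other top-level domains, so filter on uri.Host afterwards if only .com sites are wanted.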
 
Ignacio Machin ( .NET/ C# MVP )

Hi,

There are different ways of doing it. The easiest and safest way is to use
document.getElementById and then read the value of the href property.
 
Tim Williams

Here's how to extract all of the links from a WebBrowser control hosted on a
form.

private void button1_Click(object sender, EventArgs e)
{
    HtmlDocument d = this.wbTest.Document;
    HtmlElementCollection links = d.GetElementsByTagName("A");
    foreach (HtmlElement l in links)
    {
        this.listBox1.Items.Add(l.GetAttribute("href"));
    }
}


Tim




 
trint

Here is what I'm trying to do with my program:

1. Start Google.com:

object loc = "http://www.google.com/";
object null_obj_str = "";
System.Object null_obj = 0;
this.axWebBrowser1.Navigate2(ref loc, ref null_obj, ref null_obj,
    ref null_obj_str, ref null_obj_str);

2. Then start a search for all cargo ship companies:
2a. This works great, because it opens a site that lists all the cargo
ship companies I need on the page it finds:

HTMLInputElement otxtSearchBox = (HTMLInputElement) myDoc.all.item("q", 0);

otxtSearchBox.value = "cargo ship companies";
// Google HTML source for the I'm Feeling Lucky button:
// <INPUT type=submit value="I'm Feeling Lucky" name=btnI>
//
HTMLInputElement btnSearch = (HTMLInputElement) myDoc.all.item("btnI", 0);
btnSearch.click();

3. I want to take the page that my program has now loaded in the browser
and put its source into a string:
3a. This also works great, and the string contains all of the
"http://addresses.com" URLs that I want to extract:

mshtml.HTMLDocumentClass doc = (mshtml.HTMLDocumentClass)this.axWebBrowser1.Document;
StringBuilder sb = new StringBuilder();
string str1 = "";

sb.Append(doc.documentElement.innerHTML);

str1 = sb.ToString();

4. Now, all I want is for the .com addresses in this string (str1) to be
parsed out into a string array that looks like this:
4a. This is where I am stuck:

http://www.address1.com
http://www.address2.com
http://www.address3.com
http://www.address4.com
...etc.

Any help is appreciated.
Thanks,
Trint
 
Pavel Minaev

Putting the page source into a string (your step 3) is where you go wrong.
It's much more convenient to extract links from the HTML DOM document than
it is from a string. Tim has posted how to do that above
(GetElementsByTagName / GetAttribute).
 
trint

Thanks, I just now saw that!
I will reply after I do this.
Trint
 
trint

I have a question...where is wbTest defined?
 
trint

This doesn't seem to work with:

mshtml.HTMLDocumentClass doc = (mshtml.HTMLDocumentClass)this.axWebBrowser1.Document;

Thanks,
Trint
 
Tim Williams

What does it do instead of working?

Tim


 
Ratnesh Maurya

Hi Trint,

I would suggest going with Regex; you can do the whole operation on the
string. I am not sure how HTMLDocumentClass is a better solution than
Regex in this case. Regex is meant for searching and replacing string
patterns, and that is what we need here.

Cheers,
-Ratnesh
 
Ratnesh Maurya

Regex exp = new Regex(@"http://(www\.)?([^\.]+)\.com", RegexOptions.IgnoreCase);
MatchCollection collection = exp.Matches("your string goes here");

Iterate through the collection of Match objects;
match.Value is what you are looking for.
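
A minimal sketch of that iteration, collecting each match.Value into a string array. The pattern is the one posted above; the class name ComUrlFinder and the method Find are only illustrative names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class ComUrlFinder
{
    // Same pattern as above: "http://", an optional "www.", one dot-free label, then ".com".
    static readonly Regex Pattern =
        new Regex(@"http://(www\.)?([^\.]+)\.com", RegexOptions.IgnoreCase);

    // Returns every match in document order; repeated URLs are kept as-is.
    public static string[] Find(string text)
    {
        return Pattern.Matches(text)
                      .Cast<Match>()
                      .Select(m => m.Value)
                      .ToArray();
    }
}
```

Calling ComUrlFinder.Find(str1) on the sample table should give the three addresses without their paths. Note that the pattern stops at the first dot after the optional "www.", so a host such as http://shop.example.com is not matched, and neither is any https:// or non-.com address.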


I would not suggest going into XML parsing. The advantage of this
approach is that you can run the same code against different sources
(XML, HTML, plain text); all it needs is a string input to match against
the regex. I love regex :)

Cheers,
-Ratnesh
S7Software
 
Pavel Minaev

Using Regex assumes that he needs to retrieve all URLs in the document, no
matter where they are. If you read the description of his problem,
this isn't the case: he only needs URLs in HREF attributes of
particular A elements. You cannot do that without parsing HTML.
 
Pavel Minaev

Actually, I think that one could probably come up with a regex that  
targets only those specific matches, leaving other URLs untouched.  It's  
just a matter of being context-sensitive, rather than blindly matching  
arbitrary URL patterns.

How do you correctly handle things such as character entities, CDATA,
and other HTML/XML quirks, in a regex? I'm not aware of any fool-proof
solution. Regex, after all, is not a universal parsing tool - there
are many kinds of inputs which it is not sufficiently expressive to
define.
 
trint

Ratnesh,
This worked and is exactly what I needed to do!
Thanks very much
 
Pavel Minaev

I'm not really sure what you mean.  AFAIK, you can't use character  
entities or CDATA as part of the syntax for declaring an element start  
tag, which is where one would find an "href" attribute for an element.  A  
regex can safely ignore anything outside of the particular element start  
tag.

But how do you determine if something is actually a start tag? I mean,
distinguishing stuff like this:

The following is a link: <a href="http://...">real link</a>

and

<![CDATA[ The following is not a link: <a href="http://...">not a link</a> ]]>

or even

<!-- The following is not a link: <a href="http://...">not a link</a> -->

And then there are more interesting cases. Another example:

<!-- Yes, unescaped '>' in attribute values is legal SGML/HTML! -->
<a id="foo>bar" href="http://example.com"></a>
<a id="foo>bar"> href="http://example.com"</a>

So you have to understand quoted/non-quoted attribute values and treat
'>' accordingly to be able to find if you're looking at an
attribute-value pair in a tag, or just something that looks like it.

Again, to be clear: I agree that handling the document as a proper HTML
document is a much better approach.  But there are some people out there
who are extremely fluent using regex and for whom it would not be
impossible to approach the problem that way.

The usual problem one runs into with regex is correctly parsing nested
constructs. .NET regex extensions have something for this, but I don't
think it covers all possible cases.
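
To make the comment case concrete, here is a small sketch that assumes a deliberately naive href pattern. The pattern, the class name NaiveHrefScan and the method Scan are hypothetical and were not proposed by anyone in this thread; the point is only that a pattern with no notion of comments cannot tell the two anchors apart.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class NaiveHrefScan
{
    // Hypothetical, deliberately naive pattern: any double-quoted http:// value after href=.
    static readonly Regex Href =
        new Regex("href=\"(http://[^\"]+)\"", RegexOptions.IgnoreCase);

    // Returns the captured URL of every textual match, whether or not
    // the surrounding markup is a live element.
    public static string[] Scan(string html)
    {
        return Href.Matches(html)
                   .Cast<Match>()
                   .Select(m => m.Groups[1].Value)
                   .ToArray();
    }
}
```

Given one real anchor followed by the same anchor wrapped in <!-- ... -->, Scan returns both URLs, whereas a DOM walk such as GetElementsByTagName would only see the first.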
 
