How to get WebBrowser.DocumentText with the right encoding

MaartyMan

Hi, I am new to C#. Maybe someone can help me with this:
I am writing a web crawler that loads one page at a time into a WebBrowser
control, and I want to get the DocumentText and work with it. Since I don't
know the encoding of the page beforehand, I have to detect the encoding and
then apply it so that I get the correct HTML text (without any "funny"
characters). Any suggestions on the best way of doing this? Thanks in advance.
 
MaartyMan

I read the encoding by string searching in the "META" HtmlElement of the
stream from WebBrowser (for example, string charSetStr = "iso-8859-1"). Now I
try to read the stream again using this encoding, but I still get wrong
characters in the extracted HTML (string htmlText). Any ideas why the code
below is not working?

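// Requires: using System.IO; using System.Net; using System.Text;
HttpWebRequest request2 = (HttpWebRequest)WebRequest.Create(url);
request2.UserAgent = "A1 .NET Web Crawler";

// charSetStr holds the charset name read from the META tag, e.g. "iso-8859-1".
Encoding charsetEncoding = Encoding.GetEncoding(charSetStr);

string htmlText;
using (WebResponse response2 = request2.GetResponse())
using (Stream stream2 = response2.GetResponseStream())
// Previously: new StreamReader(stream2), which decodes as UTF-8 by default.
using (StreamReader reader = new StreamReader(stream2, charsetEncoding))
{
    // Decode the raw response bytes with the detected encoding.
    htmlText = reader.ReadToEnd();
}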
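
For reference, the META lookup described above could be sketched roughly like this. It is only an illustration: the class name CharsetSniffer, the method FindMetaCharset, and the regular expression are assumptions made for this example, not the code actually used in the crawler, and a regex like this will miss unusual markup.

```csharp
using System;
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Matches charset=NAME inside a <meta ...> tag, with or without quotes.
    // (Hypothetical pattern for illustration only.)
    private static readonly Regex MetaCharset = new Regex(
        @"<meta[^>]*charset\s*=\s*[""']?([A-Za-z0-9_\-]+)",
        RegexOptions.IgnoreCase);

    // Returns the declared charset name, or null if no META charset is found.
    public static string FindMetaCharset(string html)
    {
        Match m = MetaCharset.Match(html);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

The result could then be passed to Encoding.GetEncoding. The Content-Type header of the response may also name a charset (HttpWebResponse exposes it as the CharacterSet property), so it may be worth comparing the two when they disagree.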
 
Mihai N.

Any ideas why this is not working below:?

Looks OK (without trying it).
Can you make sure the page is indeed 8859-1?
Some pages are not tagged correctly.
Or maybe you can post here what you see and what you expect
(even better, describe it, e.g. "e with grave accent", and post it,
to make sure nothing got damaged on the way).
Some hex values might also help.
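
One way to produce such hex values is a small helper like the sketch below. The names HexDump and ToHex are made up for this example; it simply prints each UTF-16 code unit of a string as four hex digits, so a suspicious character can be posted without being damaged on the way.

```csharp
using System;
using System.Text;

static class HexDump
{
    // Returns the UTF-16 code units of a string as space-separated hex values.
    public static string ToHex(string text)
    {
        var sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

Calling HexDump.ToHex on a short substring around the bad character is usually enough; dumping the whole page is rarely needed.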
 
MaartyMan

Thanks for the reply. I'm not sure if it is really "8859-1", although I've
checked that it is specified that way in the meta tag. It seems a single
apostrophe (') is replaced with two hex characters, 0xC2 0x91, in the
extracted HTML string. How can I check whether the page really is encoded in
8859-1? (I don't know much about code pages.) Any suggestions? Thanks in
advance.
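
One possible explanation for this pair of values, offered as an assumption since the raw page bytes are not shown in this thread: the byte 0x91 is a left single quotation mark in Windows-1252 but a control character (U+0091) in ISO-8859-1. If a page tagged as 8859-1 actually contains Windows-1252 quotes, decoding 0x91 as ISO-8859-1 and then storing the text as UTF-8 yields exactly C2 91. The sketch below reproduces that round trip; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

static class Latin1Check
{
    // Decodes a single raw byte as ISO-8859-1, re-encodes the result as UTF-8,
    // and returns the UTF-8 bytes as hyphen-separated hex.
    public static string RoundTripToUtf8Hex(byte raw)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        string decoded = latin1.GetString(new[] { raw });
        return BitConverter.ToString(Encoding.UTF8.GetBytes(decoded));
    }
}
```

Latin1Check.RoundTripToUtf8Hex(0x91) returns "C2-91", while a plain ASCII apostrophe (0x27) comes back unchanged as "27". If the raw response contains bytes in the 0x80 to 0x9F range, the page is probably not pure ISO-8859-1, whatever the meta tag says.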
 
MaartyMan

It seems I was looking at entries in the database that I had saved before I
started extracting with the encoding, which is why I was seeing incorrect
characters in the earlier items. When I looked at the latest entries, the text
is now being saved correctly with the encoding. Thanks for all the help.
 
