How to get WebBrowser.DocumentText with the right encoding

MaartyMan · Jul 17, 2009

Hi, I am new to C#. Maybe someone can help me with this:
I am writing a web crawler that loads one page at a time into a WebBrowser
control, and I want to get the DocumentText and work with it. Since I don't
know the encoding of the page beforehand, I have to detect the encoding and
then apply it so that I get the correct HTML text (without any "funny"
characters). Any suggestions on the best way of doing this? Thanks in advance.

MaartyMan · Jul 17, 2009

I read the encoding by string searching in the "META" HtmlElement (string
charSetStr = "iso-8859-1") from the stream from WebBrowser. Now I try to get
the stream again, this time reading it with that encoding, but I still get
wrong characters in the extracted HTML (string htmlText). Any ideas why the
code below is not working?

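using System.IO;
using System.Net;
using System.Text;

// charSetStr holds the charset name read from the META tag, e.g. "iso-8859-1".
HttpWebRequest request2 = (HttpWebRequest)WebRequest.Create(url);
request2.UserAgent = "A1 .NET Web Crawler";

string htmlText;
using (WebResponse response2 = request2.GetResponse())
using (Stream stream2 = response2.GetResponseStream())
{
    // Decode the response bytes with the declared charset.
    Encoding charsetEncoding = Encoding.GetEncoding(charSetStr);
    // Without an explicit encoding, StreamReader defaults to UTF-8:
    //StreamReader reader = new StreamReader(stream2);
    using (StreamReader reader = new StreamReader(stream2, charsetEncoding))
    {
        htmlText = reader.ReadToEnd();
    }
}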
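
A possible variation on the approach above, offered only as a sketch: instead of requesting the page a second time, keep the raw bytes, look for the charset declaration in them, and decode the same bytes once. The class name HtmlDecoder, both method names, the 4096-byte search window and the fallback parameter are hypothetical choices made for this illustration, not something from the posts above. It also assumes the markup up to the META tag is ASCII-compatible, which does not hold for UTF-16 pages.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class HtmlDecoder
{
    // Hypothetical helper: returns the charset name declared near the top of
    // the page, or null if no "charset=" declaration is found.
    public static string FindMetaCharset(byte[] raw)
    {
        // ISO-8859-1 maps every byte to one character, so it is safe to use
        // for searching the ASCII part of the markup.
        int count = Math.Min(raw.Length, 4096);
        string head = Encoding.GetEncoding("iso-8859-1").GetString(raw, 0, count);
        Match m = Regex.Match(head, @"charset\s*=\s*[""']?([\w.:-]+)",
                              RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }

    // Decodes the raw page bytes with the declared charset, falling back to
    // fallbackName when nothing is declared or the name is not recognised.
    public static string Decode(byte[] raw, string fallbackName)
    {
        string name = FindMetaCharset(raw) ?? fallbackName;
        Encoding enc;
        try
        {
            enc = Encoding.GetEncoding(name);
        }
        catch (ArgumentException)
        {
            enc = Encoding.GetEncoding(fallbackName);
        }
        return enc.GetString(raw);
    }
}
```

The raw bytes could come from, for example, WebClient.DownloadData(url), followed by a call such as HtmlDecoder.Decode(bytes, "iso-8859-1"). A charset sent in the HTTP Content-Type header is not consulted in this sketch.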

Mihai N. · Jul 18, 2009

Any ideas why this is not working below:?

Looks OK (without trying it).
Can you make sure the page is indeed 8859-1?
Some pages are not tagged correctly.
Or maybe you can post here what you see and what you expect.
Even better, describe it in words as well (for example "e with grave accent")
to make sure nothing got damaged on the way.
Some hex values might also help.

MaartyMan · Jul 20, 2009

Thanks for the reply. I'm not sure if it is really "8859-1", although I've
checked that it is specified that way in the meta tag. It seems it replaces a
single apostrophe (') with 2 hex characters \0xC2 \0x91 in the extracted html
string. I don't know how to check whether it really is encoded in 8859-1
(I don't know much about code pages). Any suggestions? Thanks in advance.

Mihai N. · Jul 20, 2009

It seems it replaces a single apostrophe (') with 2 hex characters
\0xC2 \0x91 in the extracted html string.

That "smells" like utf-8.
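
One way such a byte pair can arise, given here as an assumption about this case rather than something confirmed in the thread: byte 0x91 is a curly left single quote in Windows-1252, but in ISO-8859-1 it decodes to the control character U+0091. If that character is later written out as UTF-8, it becomes the two bytes C2 91. The small sketch below reproduces just that step; the class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Decodes raw bytes with the given source charset, re-encodes the
    // resulting text as UTF-8, and returns the UTF-8 bytes as a hex string.
    public static string ReencodeAsUtf8Hex(byte[] raw, string sourceCharset)
    {
        string text = Encoding.GetEncoding(sourceCharset).GetString(raw);
        return BitConverter.ToString(Encoding.UTF8.GetBytes(text));
    }
}
```

Calling MojibakeDemo.ReencodeAsUtf8Hex(new byte[] { 0x91 }, "iso-8859-1") returns "C2-91". If this is what is happening, the page would really be Windows-1252 while being labelled iso-8859-1.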

MaartyMan · Jul 31, 2009

It seems I was looking at entries in the database that I had saved before I
started extracting with the encoding, which is why I was seeing incorrect
characters in the data for the earlier items. When I looked at the latest
entries saved in the database, they are now being saved correctly with the
encoding. Thanks for all the help.

