Is this HttpWebRequest correct?

Nightcrawler

I am currently using the HttpWebRequest and HttpWebResponse to pull
webpages down from a few urls.

string url = "some url";
HttpWebRequest httpWebRequest = (HttpWebRequest)WebRequest.Create(url);

using (HttpWebResponse httpWebResponse =
           (HttpWebResponse)httpWebRequest.GetResponse())
{
    string html = string.Empty;

    StreamReader responseReader = new StreamReader(
        httpWebResponse.GetResponseStream(), Encoding.UTF7);
    html = responseReader.ReadToEnd();
}

My code works, but my question is: am I doing this the right way
(especially the encoding part)? Some of the websites I pull content
from contain characters that do not exist in the English alphabet,
and currently the only way for my StreamReader to read them correctly
is to use UTF7 encoding. Is this really the only way?

Before I move forward in the project, I would like to understand
whether this is indeed the way to do it, or if I am missing anything.

Any help is appreciated.

Thanks
 
Martin Honnen

Nightcrawler said:
I am currently using the HttpWebRequest and HttpWebResponse to pull
webpages down from a few urls. [code snipped] My code works but my
question is, am I doing it the right way (especially the encoding
part)? [...] Is this really the only way?

You should check the HTTP response header Content-Type for a charset
parameter and use that to create the stream reader. So for instance, if
the server sends the header

Content-Type: text/html; charset=Windows-1252

then you would use

new StreamReader(httpWebResponse.GetResponseStream(),
    Encoding.GetEncoding("Windows-1252"))

On the other hand, on the wild wild web the server often does not send
a charset parameter, and the author of the HTML document only includes
the charset in a meta element, e.g.

<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">

Therefore user agents like browsers put in a lot of effort to read
enough of the document to find and parse that meta element, so that
they can then decode the rest of the document.
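A small helper along these lines (the names are illustrative, and the ISO-8859-1 fallback is just a common default; HttpWebResponse also exposes a CharacterSet property that does similar parsing) could pull the charset parameter out of the Content-Type value:

```csharp
using System;

public class CharsetFromContentType
{
    // Extract the charset parameter from a Content-Type header value,
    // e.g. "text/html; charset=Windows-1252" -> "Windows-1252".
    // Falls back to ISO-8859-1 when no charset parameter is present.
    public static string ParseCharset(string contentType)
    {
        const string marker = "charset=";
        int i = contentType.IndexOf(marker, StringComparison.OrdinalIgnoreCase);
        if (i < 0)
            return "ISO-8859-1";
        string value = contentType.Substring(i + marker.Length).Trim();
        int semi = value.IndexOf(';');
        if (semi >= 0)
            value = value.Substring(0, semi);   // drop any trailing parameters
        return value.Trim().Trim('"');
    }

    public static void Main()
    {
        Console.WriteLine(ParseCharset("text/html; charset=Windows-1252"));
        Console.WriteLine(ParseCharset("text/html"));
        // The result feeds straight into the stream reader:
        // new StreamReader(stream, Encoding.GetEncoding(ParseCharset(resp.ContentType)))
    }
}
```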
 
Nightcrawler

So what you are basically saying is that my best bet is to look for
the meta tags in the page to determine which encoding to use, and not
to rely on the HTTP response header.

Most of the sites I read using the StreamReader say: <meta http-
equiv="Content-Type" content="text/html; charset=UTF-8" /> but there
are a few that do not have that meta tag included in their code. How
should I approach those? Is there a way for the StreamReader to detect
what encoding the page is using?

Thanks for your help!
 
Nightcrawler

What is even more annoying is that one of the websites I read states
it's using UTF-8, and my StreamReader still does not decode the
characters correctly. I get little square boxes instead of the
characters.
 
Nightcrawler

If I view the very same page in my browser it shows up correctly.

The meta tag states it's using UTF-8 but when I use:

StreamReader responseReader = new
StreamReader(httpWebResponse.GetResponseStream(), Encoding.UTF8);

The characters are still unreadable. However, if I use UTF7 instead,
the characters show up correctly BUT, when I try to convert the page
to XML, I get an error saying "hexadecimal value 0xD85E, is an invalid
character". I am very confused by all this. Seems a little like the
wild wild west.

Any further help is highly appreciated.

Thanks
 
Nightcrawler

I guess another interesting point is that when I change the code to
use "ISO-8859-1" instead of the UTF-8 the website claims it uses, it
actually reads the characters correctly AND the string converts to
XML without any issues. Why? I have no idea, and I wish I understood
it better. Again, any insight into this problem is appreciated.

Thanks
 
Arne Vajhøj

Nightcrawler said:
I am currently using the HttpWebRequest and HttpWebResponse to pull
webpages down from a few urls. [code snipped] My code works but my
question is, am I doing it the right way (especially the encoding
part)? [...] Is this really the only way?

I am a bit surprised by the UTF-7 - that is a rare encoding, at least
where I surf.

Otherwise Martin Honnen is correct - you need to look at both the HTTP
header and the HTML META tag.

See the code attached below for a starting point.

Arne

=========================================================

using System;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;

public class HttpDownloadCharset
{
    private static Regex encpat = new Regex("charset=([A-Za-z0-9-]+)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    private static string ParseContentType(string contenttype)
    {
        Match m = encpat.Match(contenttype);
        if (m.Success)
        {
            return m.Groups[1].Value;
        }
        else
        {
            return "ISO-8859-1"; // common default when no charset is given
        }
    }

    private static Regex metaencpat = new Regex(
        "<META\\s+HTTP-EQUIV\\s*=\\s*[\"']Content-Type[\"']\\s+CONTENT\\s*=\\s*[\"']([^\"']*)[\"']>",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    private static string ParseMetaContentType(string html, string defenc)
    {
        Match m = metaencpat.Match(html);
        if (m.Success)
        {
            return ParseContentType(m.Groups[1].Value);
        }
        else
        {
            return defenc;
        }
    }

    private const int DEFAULT_BUFSIZ = 1000000;

    public static string Download(string urlstr)
    {
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlstr);
        using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
        {
            if (resp.StatusCode == HttpStatusCode.OK)
            {
                // Charset from the HTTP header, if any
                string enc = ParseContentType(resp.ContentType);
                int bufsiz = (int)resp.ContentLength;
                if (bufsiz < 0) // ContentLength unknown
                {
                    bufsiz = DEFAULT_BUFSIZ;
                }
                byte[] buf = new byte[bufsiz];
                int ix = 0;
                int n;
                using (Stream stm = resp.GetResponseStream())
                {
                    while ((n = stm.Read(buf, ix, buf.Length - ix)) > 0)
                    {
                        ix += n;
                    }
                }
                // Decode as ASCII first just to look for a META charset,
                // then decode the actual bytes with the detected encoding.
                // Note: only the ix bytes actually read are decoded,
                // not the whole (possibly larger) buffer.
                string temp = Encoding.ASCII.GetString(buf, 0, ix);
                enc = ParseMetaContentType(temp, enc);
                return Encoding.GetEncoding(enc).GetString(buf, 0, ix);
            }
            else
            {
                throw new ArgumentException("URL " + urlstr +
                    " returned " + resp.StatusDescription);
            }
        }
    }
}
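One caveat about the META regex above: it only matches that exact attribute order and quoting. A more tolerant pattern (still just a sketch - real pages vary wildly, and the examples below are made up) might be:

```csharp
using System;
using System.Text.RegularExpressions;

public class MetaCharsetDemo
{
    // Match a charset=... inside any <meta ...> tag, regardless of
    // attribute order or quoting style (also catches <meta charset=...>).
    static readonly Regex MetaCharset = new Regex(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9_-]+)",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    public static string FindCharset(string html, string fallback)
    {
        Match m = MetaCharset.Match(html);
        return m.Success ? m.Groups[1].Value : fallback;
    }

    public static void Main()
    {
        // Attributes in the "wrong" order for the stricter regex:
        string a = "<meta content=\"text/html; charset=UTF-8\" http-equiv=\"Content-Type\" />";
        string b = "<meta charset=utf-8>";
        string c = "<p>no meta here</p>";
        Console.WriteLine(FindCharset(a, "ISO-8859-1"));  // UTF-8
        Console.WriteLine(FindCharset(b, "ISO-8859-1"));  // utf-8
        Console.WriteLine(FindCharset(c, "ISO-8859-1"));  // ISO-8859-1
    }
}
```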
 
Nightcrawler

Peter,

Thanks for your feedback. One example of data that I was having
trouble with would be the 6th row from the bottom (Love & Happiness
(Yemaya y Ochùn) [12' Club Mix]). The special "u" character in the
word Ochun was coming out wrong when I used UTF-8 encoding. Once I
changed it to ISO-8859-1 I was able to parse it out correctly.

I really would like to understand encodings and why I was running into
this problem. Are there any articles or websites you can recommend
that would allow me to learn a bit more about this? I hate "solving" a
problem and moving on without really knowing why it works.

Thanks again.
 
Jon Skeet [C# MVP]

Nightcrawler said:
[...]
I really would like to understand encodings and why I was running into
this problem. Are there any articles or websites you can recommend
that will allow me to learn a bit more about this? I hate "solving" a
problem and moving on without really knowing why it works.

I have an article on Unicode at http://pobox.com/~skeet/csharp/unicode.html
Whether it contains anything you don't already know is a different
matter...

Jon
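For what it's worth, one plausible explanation for the symptoms described earlier (boxes with UTF-8, readable text with ISO-8859-1) is that the page declares UTF-8 but actually serves ISO-8859-1 bytes; decoding them the "declared" way produces the U+FFFD replacement character, which renders as a box. A minimal sketch of that mismatch:

```csharp
using System;
using System.Text;

public class EncodingMismatchDemo
{
    public static void Main()
    {
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // "Ochùn": the ù is one byte (0xF9) in ISO-8859-1,
        // but two bytes (0xC3 0xB9) in UTF-8.
        byte[] latin1Bytes = latin1.GetBytes("Ochùn");
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("Ochùn");
        Console.WriteLine(latin1Bytes.Length);  // 5
        Console.WriteLine(utf8Bytes.Length);    // 6

        // Decoding ISO-8859-1 bytes as UTF-8 fails: 0xF9 is not valid
        // UTF-8, so the decoder substitutes U+FFFD (shown as a box).
        string misread = Encoding.UTF8.GetString(latin1Bytes);
        Console.WriteLine(misread);

        // Decoding them as ISO-8859-1 round-trips correctly.
        Console.WriteLine(latin1.GetString(latin1Bytes));  // Ochùn
    }
}
```

The UTF-7 behaviour fits the same picture: UTF-7 base64 runs can decode stray bytes into arbitrary UTF-16 code units, which is one way to end up with an unpaired surrogate like the 0xD85E that the XML parser rejected.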
 
