Is this an encoding problem?

David Taylor · Nov 12, 2004

In .NET I am using an HttpWebRequest to read from a web site. I am getting
everything back except for some characters above hex 7F, which appear to have
been stripped out of my response. I see these characters if I examine the
site with IE.

It has been suggested that this is an encoding problem, but I'm unsure
what I need to do about it. Can anybody help?

Morten Wennevik · Nov 12, 2004

Hi David,

This does indeed look like an encoding problem. The web site probably does
not use UTF-8, which is the default encoding of the StreamReader you wrap
around the stream returned by GetResponseStream. If there is no encoding
information in the headers you should be able to find the information
somewhere in the data itself.

Check the charset parameter of WebResponse.ContentType, or if no charset is
set there look for 'charset' in the source code.
Read the response as raw bytes, decode them as UTF-8 to search for the
charset, then decode the same bytes again using the encoding you found.
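
A minimal sketch of the decoding step, assuming the response body has already been read into a byte array and a charset name has been found somewhere. DecodeBody is a hypothetical helper name, and falling back to UTF-8 for a missing or unrecognised name is an assumption, not something the framework does for you.

```csharp
using System;
using System.Text;

static class BodyDecoder
{
    // Hypothetical helper: decode raw response bytes with the named charset,
    // falling back to UTF-8 when the name is missing or not recognised.
    public static string DecodeBody(byte[] raw, string charset)
    {
        Encoding enc = Encoding.UTF8;
        if (charset != null && charset.Trim().Length > 0)
        {
            try
            {
                enc = Encoding.GetEncoding(charset.Trim());
            }
            catch (ArgumentException)
            {
                // Unknown charset name: keep the UTF-8 default.
            }
        }
        return enc.GetString(raw, 0, raw.Length);
    }
}
```

For example, the bytes 63 61 66 E9 decode to "café" with "ISO-8859-1", but the lone E9 is not valid UTF-8, so a UTF-8 decode cannot produce the "é". That would be consistent with characters above hex 7F going missing.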

David Taylor · Nov 15, 2004

It certainly sounds as though you know what you are talking about; however, I
need more help.

The web site talks OK to IE, just not to my C# .NET program.

I checked the WebResponse with a quick watch and the stated encoding was "".

Can you tell me more precisely what changes I need to make to my code
to get it to work? I tried setting the TransferEncoding, but that just caused
an error in GetResponse.

My current code for getting the URL response is as follows.

private String ReadURL()
{
    HttpWebRequest reqURL =
        (HttpWebRequest)WebRequest.Create(ToString());
    reqURL.Credentials = CredentialCache.DefaultCredentials;
    HttpWebResponse respURL = (HttpWebResponse)reqURL.GetResponse();
    Stream streamURL = respURL.GetResponseStream();
    return (new StreamReader(streamURL)).ReadToEnd();
}

Morten Wennevik · Nov 15, 2004

This is a piece of code I used to handle encodings not found in the
response stream.

Stream s = resp.GetResponseStream();
byte[] buffer = ReadStream(s); // ReadStream reads the Stream into a byte[]

// First pass: decode as UTF-8 so the page can be searched for a charset.
// resp.ContentEncoding is the Content-Encoding header (e.g. "gzip"), not a
// character set, so it is not used to pick the decoder.
string temp = Encoding.UTF8.GetString(buffer, 0, buffer.Length);

// Look for a charset in the Content-Type header, then in the page itself,
// and re-decode the original bytes if one is found.
string charset = GetCharSet(resp.ContentType, true);
if(charset == null)
    charset = GetCharSet(temp, false);
if(charset != null)
    temp = Encoding.GetEncoding(charset).GetString(buffer, 0, buffer.Length);
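
ReadStream is not shown in the snippet above. A minimal sketch of what such a helper might look like, assuming it only needs to copy the stream into memory until the end is reached (the class name and buffer size are arbitrary choices, not taken from the original code):

```csharp
using System.IO;

static class StreamUtil
{
    // Hypothetical stand-in for the ReadStream helper: copies everything
    // from the stream into a byte[] without interpreting it as text.
    public static byte[] ReadStream(Stream s)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            byte[] chunk = new byte[4096];
            int read;
            // Read returns 0 once the end of the stream is reached.
            while ((read = s.Read(chunk, 0, chunk.Length)) > 0)
                ms.Write(chunk, 0, read);
            return ms.ToArray();
        }
    }
}
```

Keeping the raw bytes is what allows the page to be decoded a second time once the real charset is known.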

....

// The idea of GetCharSet is to look for the charset declaration in the source.
// I forget the reason for every detail, but they are probably there to ensure
// that all ways of writing it are detected.

private static string GetCharSet(string s, bool header)
{
    try
    {
        int i = s.IndexOf("charset"); // try lower case first
        if(i == -1)
            i = s.IndexOf("CHARSET");
        if(i == -1) // charset not found, return
            return null;

        int j = s.IndexOf("=", i+1);
        if(j == -1)
            return null;

        if(header)
        {
            // In a header the value runs to the next ';' or to the end
            int n = s.IndexOf(";", j+1);
            if(n == -1)
                return s.Substring(j+1);
            else
                return s.Substring(j+1, n-(j+1));
        }

        int k = s.IndexOf("\"", j+1);
        int l = s.IndexOf(">", j+1);
        int m = s.IndexOf("'", j+1);

        // not able to detect end of the encoding word
        if(k == -1 && l == -1 && m == -1)
            return null;

        if(k == -1)
            k = Int32.MaxValue;
        if(l == -1)
            l = Int32.MaxValue;
        if(m == -1)
            m = Int32.MaxValue;
        if(k == Int32.MaxValue)
            return null;

        // the previous eight lines are probably obsolete code I forgot to remove
        // if k == -1 the substring wouldn't work

        // Only the closing double quote (k) is actually used as the end marker
        string temp = s.Substring(j+1, k-j-1);
        if(temp.Length == 0)
            return null;
        else
            return temp;
    }
    catch(Exception ex)
    {
        MessageBox.Show("GetCharSet Error: " + ex.Message);
        return null;
    }
}
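
As an alternative to the hand-rolled index arithmetic, the same search can be sketched with a regular expression. This is a different technique from the one in the post, offered only as an illustration: the CharsetSniffer class, the Find method and the pattern are assumptions, and the pattern simply takes the first "charset=" it sees, whether in a Content-Type header or in a meta tag.

```csharp
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Matches charset=NAME with optional whitespace and an optional opening
    // quote, ignoring case. The name stops at the first character that is
    // not a letter, digit, '_', '-', ':' or '.'.
    static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?\s*([A-Za-z0-9_\-:.]+)",
        RegexOptions.IgnoreCase);

    // Returns the first charset name found in the text, or null if none.
    public static string Find(string text)
    {
        if (text == null)
            return null;
        Match m = CharsetPattern.Match(text);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

The result can be passed to Encoding.GetEncoding in the same way as the value returned by GetCharSet. Unlike the version above, it handles mixed case such as "Charset" and does not depend on a closing double quote.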
