Is this an encoding problem?

David Taylor

In .NET I am using an HttpWebRequest to read from a web site. I am getting
everything back except for some characters above hex 7F, which appear to have
been stripped out of my response. I see these characters if I examine the
site with IE.

It has been suggested that this is an encoding problem, but I'm unsure
what I need to do about it. Can anybody help?
 
Hi David,

This does indeed look like an encoding problem. The web site probably does
not use UTF-8, which is the default encoding for a StreamReader created over
the stream returned by GetResponseStream. If there is no charset information
in the headers, you should be able to find it somewhere in the data itself.

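Check HttpWebResponse.CharacterSet, which holds the charset from the
Content-Type header (ContentEncoding is something else: it reports
compression such as gzip). If no charset is set, look for 'charset' in the
page source.
Then read the response as raw bytes and decode those bytes with the right
encoding. Converting an already decoded UTF-8 string back to bytes will not
help, because the characters above hex 7F are lost by then.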
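
To make the idea concrete, here is a minimal sketch of decoding a stream with an explicitly named charset instead of the StreamReader default. The class and method names (CharsetAwareReader, ReadAll) are made up for illustration; only StreamReader and Encoding.GetEncoding are framework calls.

```csharp
using System;
using System.IO;
using System.Text;

static class CharsetAwareReader
{
    // Hypothetical helper: decodes the whole stream with the named charset,
    // falling back to UTF-8 when no name is supplied.
    public static string ReadAll(Stream stream, string charsetName)
    {
        Encoding enc = string.IsNullOrEmpty(charsetName)
            ? Encoding.UTF8
            : Encoding.GetEncoding(charsetName); // throws on an unknown name

        using (StreamReader reader = new StreamReader(stream, enc))
        {
            return reader.ReadToEnd();
        }
    }
}
```

With a real response this would be called along the lines of CharsetAwareReader.ReadAll(resp.GetResponseStream(), resp.CharacterSet). It is worth checking what CharacterSet reports for a response that declares no charset before relying on it.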
 
It certainly sounds as though you know what you are talking about, however I
need more help.

The web site talks OK to IE, just not to my C# .NET program.

I checked the WebResponse with a QuickWatch and the stated encoding was "".

Can you tell me more precisely what changes I need to make to my code
to get it to work? I tried setting the TransferEncoding, but that just caused
an error in GetResponse.

My current code for getting the URL response is as follows:

private String ReadURL()
{
    HttpWebRequest reqURL = (HttpWebRequest)WebRequest.Create(ToString());
    reqURL.Credentials = CredentialCache.DefaultCredentials;
    HttpWebResponse respURL = (HttpWebResponse)reqURL.GetResponse();
    Stream streamURL = respURL.GetResponseStream();
    return (new StreamReader(streamURL)).ReadToEnd();
}
 
This is a piece of code I used to handle encodings that are not declared in
the response headers.

Stream s = resp.GetResponseStream();
byte[] buffer = ReadStream(s); // ReadStream reads the Stream into a byte[]

// time to check encoding
// note: resp.ContentEncoding is the Content-Encoding header (e.g. "gzip"),
// not the character set, so look in the Content-Type header instead
string charset = GetCharSet(resp.ContentType, true);

// decode as UTF-8 for now, just so the page source can be searched
string temp = Encoding.UTF8.GetString(buffer, 0, buffer.Length);

// no charset in the header, look for one in the page source
if(charset == null)
    charset = GetCharSet(temp, false);

// redecode the original bytes with the declared charset
// (GetEncoding throws an ArgumentException for an unknown name)
if(charset != null)
    temp = Encoding.GetEncoding(charset).GetString(buffer, 0, buffer.Length);

....
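
ReadStream itself is not shown in this post. One plausible way to write it, purely as an assumption about what the helper does (copy the whole response into a byte[]), is the following sketch. The StreamHelper class name is invented for the example.

```csharp
using System.IO;

static class StreamHelper
{
    // Assumed behaviour: copies everything from the stream into a byte array.
    public static byte[] ReadStream(Stream s)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            byte[] chunk = new byte[4096];
            int read;
            // Read returns 0 only at end of stream; a network stream may
            // hand back fewer bytes than requested before that point.
            while ((read = s.Read(chunk, 0, chunk.Length)) > 0)
            {
                ms.Write(chunk, 0, read);
            }
            return ms.ToArray();
        }
    }
}
```

Looping until Read returns 0 matters here, because a single Read call on a response stream is not guaranteed to return the whole body.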

// the idea of GetCharSet is to look for the charset tag in the source
// I forget the reason for all the details, but they are probably there to
// ensure all manners of writing will be detected

private static string GetCharSet(string s, bool header)
{
    try
    {
        // case-insensitive, so "charset", "CHARSET" and "Charset" all match
        int i = s.IndexOf("charset", StringComparison.OrdinalIgnoreCase);
        if(i == -1) // charset not found, return
            return null;

        int j = s.IndexOf("=", i+1);
        if(j == -1)
            return null;

        int start = j+1;
        int end;

        if(header)
        {
            // header value, e.g. "text/html; charset=iso-8859-1"
            end = s.IndexOf(";", start);
            if(end == -1)
                end = s.Length;
        }
        else
        {
            // page source: skip an opening quote, as in <meta charset="utf-8">
            if(start < s.Length && (s[start] == '"' || s[start] == '\''))
                start++;

            // the value ends at the first ", ' or > after it
            end = s.IndexOfAny(new char[] { '"', '\'', '>' }, start);
            if(end == -1) // not able to detect end of the encoding word
                return null;
        }

        string temp = s.Substring(start, end-start).Trim(' ', '"', '\'', '/');
        if(temp.Length == 0)
            return null;
        else
            return temp;
    }
    catch(Exception ex)
    {
        MessageBox.Show("GetCharSet Error: " + ex.Message);
        return null;
    }
}
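
As an alternative to the index arithmetic above, the same search can be sketched with a regular expression. This is only an illustration, not the code I actually used: the CharsetSniffer class and its Find method are hypothetical names, and the pattern assumes charset names contain only letters, digits and a few punctuation characters.

```csharp
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Matches charset=value in any letter case, with or without quotes.
    private static readonly Regex CharsetPattern = new Regex(
        @"charset\s*=\s*[""']?\s*([A-Za-z0-9_.:\-]+)",
        RegexOptions.IgnoreCase);

    // Returns the first declared charset name, or null if none is found.
    public static string Find(string text)
    {
        if (text == null)
            return null;

        Match m = CharsetPattern.Match(text);
        return m.Success ? m.Groups[1].Value : null;
    }
}
```

The same call works for both a Content-Type header value and the page source, so the header flag is not needed in this variant.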
 
