Encoding XML troubles

fitsch

Hi,

I am trying to write a generic RSS/Atom/OPML feed client. The problem
is that these XML feeds may use different encodings:

- <?xml version="1.0" encoding="ISO-8859-1" ?>...
- <?xml version="1.0" encoding="utf-8" ?>...
- ...

I am using the WebRequest functionality to get the feeds. Simplified,
my code looks like this:

WebRequest req = WebRequest.Create(url);
StreamReader reader = new StreamReader(..., Encoding.Default);
string result = reader.ReadToEnd();

As you can see on the second line, I can (or must, because UTF-8 is
the default) specify the encoding of the expected stream up front.
However, as I do not know the encoding while fetching the XML
stream, I use Encoding.Default.

And now I am in the middle of the problem: I would like to read the
resulting XML string, get the encoding from it, and re-encode the
string with the correct encoding. Otherwise, special characters are
unreadable or missing in the result string.

I have unsuccessfully tried the following workarounds:
- Converting the result XML string directly from Encoding.Default to the
encoding declared in the XML:
result = this.convertString(result, Encoding.Default,
Encoding.GetEncoding(myEncodingStringFromXMLFile));

The convertString function uses code similar to the Convert example on
MSDN: http://msdn.microsoft.com/library/d...ef/html/frlrfsystemtextencodingclasstopic.asp
--> did not work - the characters remained as they were before

- Creating a second StreamReader instance with the right encoding:
StreamReader reader2 = new StreamReader(...,
Encoding.GetEncoding(myEncodingStringFromXMLFile));
string result = reader2.ReadToEnd();
--> did not work - it seems that the response stream from the
WebRequest class can only be read once! I get an error when
trying to modify the Position property on the stream (another guy had
exactly the same problem:
http://groups.google.ch/groups?hl=de&lr=&th=cfabc4548f67a0c2&rnum=1)

Is there another solution than fetching the URL twice? Am I missing
some basic functionality? Thanks for your help...

Greets,

Phil
 
Joerg Jooss

fitsch said:
Hi,

I am trying to write a generic RSS/Atom/OPML feed client. The problem
is that these XML feeds may use different encodings:

- <?xml version="1.0" encoding="ISO-8859-1" ?>...
- <?xml version="1.0" encoding="utf-8" ?>...
- ...

I am using the WebRequest functionality to get the feeds. Simplified,
my code looks like this:

WebRequest req = WebRequest.Create(url);
StreamReader reader = new StreamReader(..., Encoding.Default);
string result = reader.ReadToEnd();

As you can see on the second line, I can (or must, because UTF-8 is
the default) specify the encoding of the expected stream up front.
However, as I do not know the encoding while fetching the XML
stream, I use Encoding.Default.

Note that Encoding.Default is your OS's default character set, not a
magic catch-all encoding. This step alone will already render a lot of
XML input useless.

And now I am in the middle of the problem: I would like to read the
resulting XML string, get the encoding from it, and re-encode the
string with the correct encoding. Otherwise, special characters are
unreadable or missing in the result string.

Once it's a string, it's a string. You must re-*de*code the bytes.

I have unsuccessfully tried the following workarounds:
- Converting the result XML string directly from Encoding.Default to the
encoding declared in the XML:
result = this.convertString(result, Encoding.Default,
Encoding.GetEncoding(myEncodingStringFromXMLFile));

The convertString function uses code similar to the Convert example on
MSDN:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cpref/html/frlrfsystemtextencodingclasstopic.asp
--> did not work - the characters remained as they were before

- Creating a second StreamReader instance with the right encoding:
StreamReader reader2 = new StreamReader(...,
Encoding.GetEncoding(myEncodingStringFromXMLFile));
string result = reader2.ReadToEnd();
--> did not work - it seems that the response stream from the
WebRequest class can only be read once! I get an error when
trying to modify the Position property on the stream (another guy had
exactly the same problem:
http://groups.google.ch/groups?hl=de&lr=&th=cfabc4548f67a0c2&rnum=1)

Is there another solution than fetching the URL twice? Am I missing
some basic functionality? Thanks for your help...

The functionality to safely decode XML content is already available in
the BCL. Just use an XmlTextReader.
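
To illustrate both points, here is a minimal sketch that runs without a network connection. The sample document, the class name DecodeDemo, and the use of Encoding.UTF8 as the "wrong" encoding are assumptions made for the demo, not part of the original code. It shows that decoding ISO-8859-1 bytes with the wrong encoding damages the text, while an XmlTextReader given the raw bytes picks the encoding up from the XML declaration:

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

class DecodeDemo
{
    static void Main()
    {
        // Hypothetical sample feed, encoded as ISO-8859-1 the way a server might send it.
        string xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<title>Gr\u00FC\u00DFe</title>";
        byte[] raw = Encoding.GetEncoding("ISO-8859-1").GetBytes(xml);

        // Decoding the bytes with the wrong encoding (UTF-8 here) damages the
        // non-ASCII characters; no later string conversion can repair that.
        string wrong = Encoding.UTF8.GetString(raw);
        Console.WriteLine("Wrong decoding preserved the text: " + (wrong == xml));

        // Handing the raw bytes to XmlTextReader lets it read the encoding
        // from the XML declaration and decode the content accordingly.
        using (XmlTextReader reader = new XmlTextReader(new MemoryStream(raw)))
        {
            reader.MoveToContent();                   // skip the declaration, stop at <title>
            string title = reader.ReadElementString();
            Console.WriteLine("Reader decoded the title correctly: "
                              + (title == "Gr\u00FC\u00DFe"));
        }
    }
}
```

The key point is that the reader is constructed from a Stream (bytes), not from a string or a StreamReader, so no decoding has happened before the parser sees the declaration.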

Cheers,
 
Jon Skeet [C# MVP]

fitsch said:
I am trying to write a generic RSS/Atom/OPML feed client. The problem
is that these XML feeds may use different encodings:

- <?xml version="1.0" encoding="ISO-8859-1" ?>...
- <?xml version="1.0" encoding="utf-8" ?>...
- ...

I am using the WebRequest functionality to get the feeds. Simplified,
my code looks like this:

WebRequest req = WebRequest.Create(url);
StreamReader reader = new StreamReader(..., Encoding.Default);
string result = reader.ReadToEnd();

Why bother reading it as a string? The best solution is to get the
stream and pass it directly to XmlTextReader - then the XmlTextReader,
which knows how to deal with the encoding part of the XML declaration,
can do the right thing.
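
A rough sketch of what that could look like, assuming the goal is to end up with an XmlDocument. The class name FeedFetcher, the helper LoadFeed, and taking the URL from the command line are illustrative choices, not taken from the original post:

```csharp
using System;
using System.IO;
using System.Net;
using System.Xml;

class FeedFetcher
{
    // Hypothetical helper: loads a feed into an XmlDocument without ever
    // decoding the response into a string first.
    static XmlDocument LoadFeed(string url)
    {
        WebRequest req = WebRequest.Create(url);
        using (WebResponse resp = req.GetResponse())
        using (Stream stream = resp.GetResponseStream())
        using (XmlTextReader reader = new XmlTextReader(stream))
        {
            // The reader consumes the response stream once, front to back,
            // and takes the encoding from the XML declaration itself.
            XmlDocument doc = new XmlDocument();
            doc.Load(reader);
            return doc;
        }
    }

    static void Main(string[] args)
    {
        // The feed URL is expected as the first command-line argument.
        XmlDocument doc = LoadFeed(args[0]);
        Console.WriteLine(doc.DocumentElement.Name);
    }
}
```

Because the stream is read only once, this also sidesteps the problem of the response stream not being seekable.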
 
