UTF-8 preamble -> Possible bug in StreamWriter (or at least strange behaviour)

Oscar Thornell

Hi,

I generate a text file and temporarily save it to disk. Later I upload this
file to Microsoft MapPoint (not so important).
The file needs to be UTF-8 encoded, and I explicitly pass
"Encoding.UTF8" to the constructor like this:

StreamWriter writer = new StreamWriter(file, Encoding.UTF8);

When I do this, the StreamWriter inserts a UTF-8 preamble "ï»¿" at the
beginning of the file.
// http://www.chilkatsoft.com/faq/Utf8Preamble.html

MapPoint throws an Exception for this UTF-8 preamble and aborts the parsing
of the file.

The annoying thing is that if I don't explicitly state the Encoding in the
constructor, the documentation for the StreamWriter.Encoding property says:
"The Encoding specified in the constructor for the current instance, or
UTF8Encoding if an encoding was not specified."

But! If I don't specify the encoding I end up with text that is not UTF-8
(without the preamble..).

Without the Encoding in the constructor: "FÃ¤ltÃ¶verstens Teleshop"
With the Encoding in the constructor: "Fältöverstens Teleshop"

So my question is: how can I get rid of this preamble? Because if I get rid
of that, everything should work...

Regards
/Oscar
 
Guest

But! If I don't specify the encoding I end up with text that is not UTF-8
(without the preamble..).

Are you sure about that? Perhaps it's just the application you use to view
the output (Notepad?) that fails to recognize it as UTF-8 if the preamble is
missing.


Mattias
 
Oscar Thornell

I can't explain it otherwise...
Characters like åäö end up like this in the file:
"FÃ¤ltÃ¶verstens Teleshop"

If I specify UTF8:
"Fältöverstens Teleshop"

The problem is the IO write operation. If I change the behaviour and write
the data directly to the HTTP output stream and save the file, it looks OK!
//
Response.Clear();
Response.Charset = "iso-8859-1";
Response.ContentEncoding = System.Text.Encoding.GetEncoding("iso-8859-1");
Response.ContentType = "text/plain";
Response.AddHeader("content-disposition", "attachment; filename=\"" +
fileName + "\"");
Response.Write(fileData);
Response.End();

The following code writes "fileData" (a String) to disk. In this case the
file would be messed up with: "FÃ¤ltÃ¶verstens Teleshop"
//
file = new FileStream(filePath, fileMode, fileAccess);
StreamWriter writer = new StreamWriter(file);
writer.Write(fileData);

Not messed up but with the preamble...
//
file = new FileStream(filePath, fileMode, fileAccess);
StreamWriter writer = new StreamWriter(file, Encoding.UTF8);
writer.Write(fileData);


Maybe I should use the GetEncoding() method for the IO version instead of
going directly for UTF8!?

/Oscar
 
Oscar Thornell

Another thing: my fix for this is to read the file into a byte[] buffer and
get rid of the first three bytes, i.e. the preamble...
It feels awkward (and very 1990) though, and .NET is bound to have a better
approach for this..

/Oscar
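
A minimal sketch of the byte-stripping workaround described above, assuming the whole file fits in memory. `PreambleTools` and `StripUtf8Preamble` are hypothetical names, not part of .NET; only `Encoding.UTF8.GetPreamble()` and `Array.Copy` are framework calls. It removes the three bytes only when they really are the UTF-8 preamble, so a file written without one passes through unchanged.

```csharp
using System;
using System.Text;

public static class PreambleTools
{
    // Returns the data without a leading UTF-8 preamble (EF BB BF).
    // If no preamble is present, the input array is returned as-is.
    public static byte[] StripUtf8Preamble(byte[] data)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble(); // EF BB BF
        if (data.Length < preamble.Length)
        {
            return data;
        }
        for (int i = 0; i < preamble.Length; i++)
        {
            if (data[i] != preamble[i])
            {
                return data;
            }
        }
        // Copy everything after the preamble into a new, shorter array.
        byte[] result = new byte[data.Length - preamble.Length];
        Array.Copy(data, preamble.Length, result, 0, result.Length);
        return result;
    }
}
```

The returned bytes could then be written back with a plain FileStream, which adds no preamble of its own.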
 
Christof Nordiek

Oscar Thornell said:
I can't explain it otherwise...
Characters like åäö end up like this in the file:
"FÃ¤ltÃ¶verstens Teleshop"

This looks like your text below encoded in UTF-8 and then interpreted as
iso-8859-1 or similar.

If I specify UTF8:
"Fältöverstens Teleshop"

The problem is the IO write operation. If I change the behaviour and write
the data directly to the HTTP output stream and save the file, it looks
OK!
//
Response.Clear();
Response.Charset = "iso-8859-1";

This is not UTF-8!

Response.ContentEncoding = System.Text.Encoding.GetEncoding("iso-8859-1");
Response.ContentType = "text/plain";
Response.AddHeader("content-disposition", "attachment; filename=\"" +
fileName + "\"");
Response.Write(fileData);

Here, I suppose, Response.Write encodes in iso-8859-1, not in UTF-8.

Response.End();

The following code writes "fileData" (a String) to disk. In this case the
file would be messed up with: "FÃ¤ltÃ¶verstens Teleshop"
//

That's actually perfectly good UTF-8; it's only being read with another encoding.

file = new FileStream(filePath, fileMode, fileAccess);
StreamWriter writer = new StreamWriter(file);
writer.Write(fileData);

Not messed up but with the preamble...
//

How did you read this?
If the reader correctly interprets UTF-8, the preamble should be invisible.
That really puzzles me.
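
The misreading described in this reply can be reproduced in a few lines. This is only a sketch: `MojibakeDemo` and its method names are made up for illustration, and it assumes the reader decodes the bytes as ISO-8859-1, which is one plausible candidate rather than a confirmed fact about MapPoint.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Encodes the text as UTF-8 (no preamble), then decodes those bytes
    // with the wrong encoding, ISO-8859-1. Each two-byte UTF-8 sequence
    // shows up as two separate characters.
    public static string Utf8ReadAsLatin1(string text)
    {
        byte[] utf8Bytes = new UTF8Encoding(false).GetBytes(text);
        return Encoding.GetEncoding("iso-8859-1").GetString(utf8Bytes);
    }

    // Decoding the very same bytes as UTF-8 gives the original text back.
    public static string Utf8ReadAsUtf8(string text)
    {
        byte[] utf8Bytes = new UTF8Encoding(false).GetBytes(text);
        return Encoding.UTF8.GetString(utf8Bytes);
    }
}
```

Calling `MojibakeDemo.Utf8ReadAsLatin1("Fältöverstens Teleshop")` returns "FÃ¤ltÃ¶verstens Teleshop", while `Utf8ReadAsUtf8` returns the input unchanged. The bytes are identical in both cases; only the reader's assumption differs.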
 
Jon Skeet [C# MVP]

Oscar Thornell said:
I generate a text file and temporarily save it to disk. Later I upload this
file to Microsoft MapPoint (not so important).
The file needs to be UTF-8 encoded, and I explicitly pass
"Encoding.UTF8" to the constructor like this:

StreamWriter writer = new StreamWriter(file, Encoding.UTF8);

When I do this, the StreamWriter inserts a UTF-8 preamble "ï»¿" at the
beginning of the file.
// http://www.chilkatsoft.com/faq/Utf8Preamble.html

MapPoint throws an Exception for this UTF-8 preamble and aborts the parsing
of the file.

The annoying thing is that if I don't explicitly state the Encoding in the
constructor, the documentation for the StreamWriter.Encoding property says:
"The Encoding specified in the constructor for the current instance, or
UTF8Encoding if an encoding was not specified."

But! If I don't specify the encoding I end up with text that is not UTF-8
(without the preamble..).

That sounds very unlikely. As others have suggested, it sounds like
whatever you're using to read the file is assuming the wrong thing.

Could you post a short but complete program which demonstrates the
problem?

See http://www.pobox.com/~skeet/csharp/complete.html for details of
what I mean by that.

You should be able to provide an example where writing without
specifying an encoding and writing where you specify Encoding.UTF8 make
a difference to the binary output, other than in terms of the existence
of the preamble.
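
One way to check the binary output along these lines is to write the same text through differently constructed StreamWriters into memory and compare the bytes. This is a sketch, not the program asked for in the thread: `WriterComparison`, `WriteToBytes` and `Hex` are hypothetical names, and a MemoryStream stands in for the FileStream used in the original code.

```csharp
using System;
using System.IO;
using System.Text;

public static class WriterComparison
{
    // Writes the text through a StreamWriter into memory and returns the raw bytes.
    // Passing null selects the StreamWriter(Stream) constructor, i.e. no explicit encoding.
    public static byte[] WriteToBytes(string text, Encoding encoding)
    {
        using (MemoryStream stream = new MemoryStream())
        {
            StreamWriter writer = encoding == null
                ? new StreamWriter(stream)
                : new StreamWriter(stream, encoding);
            writer.Write(text);
            writer.Flush(); // push buffered characters (and any preamble) into the stream
            return stream.ToArray();
        }
    }

    // Formats bytes as hex pairs separated by dashes, e.g. "46-C3-A4".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }
}
```

For the text "Fä", the writer with no explicit encoding and the one given `new UTF8Encoding(false)` both produce 46-C3-A4, while `Encoding.UTF8` produces EF-BB-BF-46-C3-A4. The only difference is the three preamble bytes.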
 
Oscar Thornell

Hi again! I have worked some more with this and..

First, the unlikely thing that is part of my problem is Microsoft's MapPoint
Web Service.
Hosted at: https://mappoint-*****.partners.extranet.microsoft.com/*****

If I create a file with the following code..
FileStream file = new FileStream(filePath, fileMode, fileAccess);
StreamWriter writer = new StreamWriter(file);
// or (neither variant inserts the preamble):
// StreamWriter writer = new StreamWriter(file, new UTF8Encoding(false));
writer.Write(fileData);

MapPoint serves my client with this: "FÃ¤ltÃ¶verstens Teleshop" instead of
this: "Fältöverstens Teleshop".

If I create a file with this instantiation of StreamWriter..
StreamWriter writer = new StreamWriter(file, Encoding.UTF8);

MapPoint throws an Exception telling me that it does not recognize "ï»¿",
i.e. the UTF-8 preamble!

If I take that very file, open it with a BinaryReader, drop the first
three bytes (the "ï»¿" preamble) and then upload it to MapPoint,
everything works nicely!
No errors and no messed up text!

If I instantiate StreamWriter with:
StreamWriter writer = new StreamWriter(file, Encoding.Default);
Everything works right away!
But I do not want to use that method since it depends on the current
code page of the system.

What I really can't understand here is why MapPoint messes up the text with
this code:
StreamWriter writer = new StreamWriter(file, new UTF8Encoding(false));

and works with this (if I drop the first three bytes):
StreamWriter writer = new StreamWriter(file, Encoding.UTF8);


//The following code can be used to read the preamble from a file.
//In this case it recognizes UTF-8 and UTF-16 (little-endian).
FileStream stream = new FileStream("The_File.txt", FileMode.Open);
BinaryReader reader = new BinaryReader(stream);

//Three bytes cover both preambles; a shorter file returns fewer bytes.
byte[] buffer = reader.ReadBytes(3);
reader.Close();

if ( buffer.Length >= 2 && buffer[0] == 0xff && buffer[1] == 0xfe )
{
//UTF-16 (little-endian)
Console.WriteLine("UTF-16");
}
else if( buffer.Length >= 3 && buffer[0] == 0xef && buffer[1] == 0xbb && buffer[2] == 0xbf)
{
//UTF-8
Console.WriteLine("UTF-8");
}

/Oscar
 
Jon Skeet [C# MVP]

Hi again! I have worked some more with this and..

First, the unlikely thing that is part of my problem is Microsoft's MapPoint
Web Service.
Hosted at: https://mappoint-*****.partners.extranet.microsoft.com/*****

If I create a file with the following code..
FileStream file = new FileStream(filePath, fileMode, fileAccess);
StreamWriter writer = new StreamWriter(file);
// or (neither variant inserts the preamble):
// StreamWriter writer = new StreamWriter(file, new UTF8Encoding(false));
writer.Write(fileData);

MapPoint serves my client with this: "FÃ¤ltÃ¶verstens Teleshop" instead of
this: "Fältöverstens Teleshop".

According to what - MapPoint? What's reading the file at that point?
That's the important bit - I bet you'll find the file is actually
exactly the same, just missing the UTF-8 preamble.
 
Oscar Thornell

First, the only application that reads the file is MapPoint. After that
step, MapPoint creates a geocoded datasource based on the file.
The behaviour is consistent across a number of different ways of reading
data from the MapPoint datasource at that point.

1) A client utilizing the Web Service Find() method that queries the
mappoint datasource and retrieves textual descriptions...
a) The clients are in this case both dev test clients written in .NET/C#
running on Win2003
b) J2EE production clients running on Solaris

2) MapPoint supports exports of datasources in several formats: CSV, XML and so
on...
a) Exporting a datasource in Access 2003 XML format and reading it into
a new Access db also gives the presentation problems with
encoding/text (as described in this thread..)

My only conclusion is that MapPoint does not support UTF-8, and I am running
tests to solely use "iso-8859-1".

/Oscar

 
Jon Skeet [C# MVP]

Oscar Thornell said:
First, the only application that reads the file is MapPoint. After that
step, MapPoint creates a geocoded datasource based on the file.
The behaviour is consistent across a number of different ways of reading
data from the MapPoint datasource at that point.

1) A client utilizing the Web Service Find() method that queries the
mappoint datasource and retrieves textual descriptions...
a) The clients are in this case both dev test clients written in .NET/C#
running on Win2003
b) J2EE production clients running on Solaris

2) MapPoint supports exports of datasources in several formats: CSV, XML and so
on...
a) Exporting a datasource in Access 2003 XML format and reading it into
a new Access db also gives the presentation problems with
encoding/text (as described in this thread..)

My only conclusion is that MapPoint does not support UTF-8, and I am running
tests to solely use "iso-8859-1".

Does the MapPoint documentation not give any indication about which
encodings are supported, or any way of specifying the encoding?
 
Oscar Thornell

No way of specifying...
I haven't found any specs for upload, only which formats a datasource can be
transformed to during an export.

Among those are: "TabDelimitedTextUTF8"...ISO 10646-1:2000 Annex D

So one could assume that UTF8 is supported for "uploads" as well... :-(

Anyway, "ISO 8859-1" seems OK for now, so I'll stick with that...

Regards
/Oscar
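
For sticking with ISO 8859-1 without depending on the system code page (the concern raised earlier about Encoding.Default), the encoding can be requested by name. This is a minimal sketch: `Latin1Export` and `WriteLatin1` are hypothetical names, and the path is whatever the caller supplies.

```csharp
using System.IO;
using System.Text;

public static class Latin1Export
{
    // Writes the text as ISO-8859-1 regardless of the system's current code page.
    // Note: characters that ISO-8859-1 cannot represent are not preserved.
    public static void WriteLatin1(string path, string text)
    {
        Encoding latin1 = Encoding.GetEncoding("iso-8859-1");
        // false = overwrite instead of append; ISO-8859-1 has no preamble.
        using (StreamWriter writer = new StreamWriter(path, false, latin1))
        {
            writer.Write(text);
        }
    }
}
```

With this, "ä" ends up in the file as the single byte E4 rather than the two UTF-8 bytes C3 A4.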
 
