StreamReader.StreamReader(String, bool) bug - no BOM detection

Polanski24

Hello!

While testing my app I discovered the following bug in .NET v2.0 (I have
not tested 1.1 yet).

The StreamReader constructors that are supposed to detect the byte order
mark fail to do so.

A simple test case is below. Just feed it files with different BOMs
and you can see that the StreamReader encoding is always the default
UTF8Encoding, regardless of the file's BOM.
In case someone needs BOM detection, use the code below instead of the
StreamReader constructors.

// Needs: using System; using System.Diagnostics; using System.IO; using System.Text;
// "path" is assumed to hold the name of the file to test.
StreamReader reader = null;
System.IO.FileStream file = null;
Encoding enc = null;
try
{
    file = new System.IO.FileStream(path,
        FileMode.Open, FileAccess.Read, FileShare.ReadWrite);

    // Get the byte-order mark, if there is one
    byte[] bom = new byte[4];
    int n = file.Read(bom, 0, 4);

    if (n >= 3 && bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf)
    {
        enc = Encoding.UTF8; // UTF-8
    }
    else if (n >= 4 && bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0)
    {
        enc = Encoding.UTF32; // UTF-32 little endian; must be checked before UTF-16 LE
    }
    else if (n >= 2 && bom[0] == 0xff && bom[1] == 0xfe)
    {
        enc = Encoding.Unicode; // UTF-16 little endian
    }
    else if (n >= 2 && bom[0] == 0xfe && bom[1] == 0xff)
    {
        enc = Encoding.BigEndianUnicode; // UTF-16 big endian
    }
    else if (n >= 4 && bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff)
    {
        enc = new UTF32Encoding(true, true); // UTF-32 big endian
    }
    else
    {
        enc = Encoding.ASCII; // no BOM found
    }

    file.Close();

    reader = new StreamReader(path, true);

    Trace.WriteLine("StreamReader encoding: " + reader.CurrentEncoding);
    Trace.WriteLine("BOM detected encoding: " + enc.ToString());
}
catch (Exception ex)
{
    Trace.WriteLine(ex.ToString());
}
finally
{
    if (reader != null) reader.Close();
    if (file != null) file.Close();
}


Cheers,

http://sourceforge.net/projects/ngmp
 
Jon Skeet [C# MVP]

Polanski24 said:
While testing my app I discovered the following bug in .NET v2.0 (I have
not tested 1.1 yet).

The StreamReader constructors that are supposed to detect the byte order
mark fail to do so.

No, they don't. You're trying to use CurrentEncoding prior to reading
any data. From the docs:

<quote>
Property Value
The current character encoding used by the current reader. The value
can be different after the first call to any Read method of
StreamReader, since encoding autodetection is not done until the first
call to a Read method.
</quote>

I don't believe there's a bug at all.


If you run the code below, you'll see it doing the right thing. In the
last case, the same data as a previous test case is used, claiming to
be UTF-8 but then using little endian UTF-16 data. That's the only case
in which things go "wrong" (understandably) - it copes with all the
rest.

using System;
using System.IO;

class Test
{
    static void Main (string[] args)
    {
        byte[] littleEndian = new byte[]
            {0xff, 0xfe, 0x41, 0x00, 0x42, 0x00};
        byte[] bigEndian = new byte[]
            {0xfe, 0xff, 0x00, 0x41, 0x00, 0x42};
        byte[] utf8 = new byte[]
            {0xef, 0xbb, 0xbf, 0x41, 0x42};
        byte[] utf8DuffData = new byte[]
            {0xef, 0xbb, 0x41, 0x00, 0x42, 0x00};

        ShowEncoding ("Big endian", bigEndian);
        ShowEncoding ("Little endian", littleEndian);
        ShowEncoding ("UTF-8", utf8);
        ShowEncoding ("UTF-8 with little endian UTF-16 data",
                      utf8DuffData);
    }

    static void ShowEncoding (string correct, byte[] data)
    {
        using (MemoryStream ms = new MemoryStream(data))
        {
            using (StreamReader reader = new StreamReader(ms, true))
            {
                Console.WriteLine (correct);
                Console.WriteLine (reader.CurrentEncoding);
                Console.WriteLine (reader.ReadLine());
            }
        }
    }
}
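
To make the timing described in the quoted docs visible, here is a minimal sketch that records CurrentEncoding both before and after the first read. The class and method names (EncodingTiming, CodePagesAroundFirstRead) are made up for illustration; only StreamReader, MemoryStream and Encoding.CodePage come from the framework.

```csharp
using System;
using System.IO;
using System.Text;

class EncodingTiming
{
    // Hypothetical helper: returns the code page reported by CurrentEncoding
    // before and after the first call to Read.
    public static int[] CodePagesAroundFirstRead(byte[] data)
    {
        using (MemoryStream ms = new MemoryStream(data))
        using (StreamReader reader = new StreamReader(ms, true))
        {
            int before = reader.CurrentEncoding.CodePage; // nothing has been read yet
            reader.Read();                                 // autodetection happens here
            int after = reader.CurrentEncoding.CodePage;
            return new int[] { before, after };
        }
    }

    static void Main()
    {
        // UTF-16 little endian BOM followed by "AB"
        byte[] littleEndian = { 0xff, 0xfe, 0x41, 0x00, 0x42, 0x00 };
        int[] pages = CodePagesAroundFirstRead(littleEndian);
        Console.WriteLine("Before first read: code page " + pages[0]); // 65001 (UTF-8 default)
        Console.WriteLine("After first read:  code page " + pages[1]); // 1200 (UTF-16 LE)
    }
}
```

The first value is just the UTF-8 default the reader starts with; only the second one reflects the BOM.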
 
Polanski24

Hello!

Thanks for the reply. I agree, with some hesitation, that it's not a bug,
but rather a design flaw (or bug). The reason is very simple: there
is no way to seek freely in the stream using StringReader, since it works
in only one direction, but during code execution it is usually
necessary to detect the encoding before any reads or processing are done.

Since StreamReader internally uses a Stream to go through the data, it should
do the check inside the constructor, then rewind the Stream
position to 0 and wait for the first read operation instead of providing
an incorrect value in the CurrentEncoding property. This is at least the way I
would design that feature.

Cheers

http://sourceforge.net/projects/ngmp
 
Carl Daniel [VC++ MVP]

Polanski24 said:
Hello!

Thanks for the reply. I agree, with some hesitation, that it's not a bug,
but rather a design flaw (or bug). The reason is very simple: there
is no way to seek freely in the stream using StringReader, since it works
in only one direction, but during code execution it is usually
necessary to detect the encoding before any reads or processing are done.

Since StreamReader internally uses a Stream to go through the data, it should
do the check inside the constructor, then rewind the Stream
position to 0 and wait for the first read operation instead of providing
an incorrect value in the CurrentEncoding property. This is at least the way
I would design that feature.

... which would then make it useless on non-seekable streams, like the
stream from a decompressor or a network socket. Poor choice, IMO.

-cd
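
As an illustration of the non-seekable case, here is a small sketch that pushes BOM-prefixed bytes through a GZipStream (which reports CanSeek as false) and lets StreamReader detect the encoding on the first read. NonSeekableDemo and ReadThroughGZip are hypothetical names chosen for this example, and the sketch assumes the decompressor hands back the first bytes in one read, which is likely for such a tiny payload.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class NonSeekableDemo
{
    // Hypothetical helper: compresses the data in memory, then reads it back
    // as text through the (non-seekable) decompression stream.
    public static string ReadThroughGZip(byte[] data, out bool canSeek, out int codePage)
    {
        MemoryStream compressed = new MemoryStream();
        using (GZipStream zip = new GZipStream(compressed, CompressionMode.Compress, true))
        {
            zip.Write(data, 0, data.Length);
        } // disposing flushes the compressed data; "true" keeps the MemoryStream open
        compressed.Position = 0;

        using (GZipStream unzip = new GZipStream(compressed, CompressionMode.Decompress))
        using (StreamReader reader = new StreamReader(unzip, true))
        {
            canSeek = unzip.CanSeek;                    // cannot be rewound
            string text = reader.ReadToEnd();           // detection happens during this read
            codePage = reader.CurrentEncoding.CodePage; // valid now that data has been read
            return text;
        }
    }

    static void Main()
    {
        // UTF-16 big endian BOM followed by "AB"
        byte[] bigEndian = { 0xfe, 0xff, 0x00, 0x41, 0x00, 0x42 };
        bool canSeek;
        int codePage;
        string text = ReadThroughGZip(bigEndian, out canSeek, out codePage);
        Console.WriteLine("CanSeek: " + canSeek);
        Console.WriteLine("Text: " + text);
        Console.WriteLine("Code page: " + codePage);
    }
}
```

A constructor that had to rewind the stream could not work on a stream like this, while detection on first read still can.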
 
Jon Skeet [C# MVP]

Polanski24 said:
Thanks for the reply. I agree, with some hesitation, that it's not a bug,
but rather a design flaw (or bug). The reason is very simple: there
is no way to seek freely in the stream using StringReader, since it works
in only one direction, but during code execution it is usually
necessary to detect the encoding before any reads or processing are done.

I very rarely find that necessary, actually. If I don't know what the
encoding is before I start, I rarely care about it at all, so long as
I'm getting the right text data.

Since StreamReader internally uses a Stream to go through the data, it should
do the check inside the constructor, then rewind the Stream
position to 0 and wait for the first read operation instead of providing
an incorrect value in the CurrentEncoding property. This is at least the way I
would design that feature.

You're making some assumptions there:

1) Reading the stream when you haven't been asked to won't have any
nasty side effects

2) The stream can be rewound


Now, assuming you meant to write StreamReader rather than StringReader
in the first paragraph, all you need to do is use the BaseStream
property, set the position on *that*, and then call
StreamReader.DiscardBufferedData.

So, you can get the behaviour you want by:

1) Find the position of the stream
2) Create the StreamReader
3) Call StreamReader.Read()
4) Set StreamReader.BaseStream.Position = <whatever it was before>
5) Call StreamReader.DiscardBufferedData
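
The five steps above could be sketched roughly as follows. EncodingProbe and DetectAndRewind are hypothetical names, not framework APIs, and the sketch assumes a seekable stream. It returns only the detected encoding and leaves the stream at its starting position.

```csharp
using System;
using System.IO;
using System.Text;

class EncodingProbe
{
    // Hypothetical helper: follows the five steps and returns the encoding
    // that StreamReader detected, with the stream back where it started.
    public static Encoding DetectAndRewind(Stream stream)
    {
        if (!stream.CanSeek)
            throw new NotSupportedException("This sketch needs a seekable stream.");

        long start = stream.Position;                          // 1) remember the position
        StreamReader reader = new StreamReader(stream, true);  // 2) create the reader
        reader.Read();                                         // 3) first read triggers detection
        reader.BaseStream.Position = start;                    // 4) rewind the underlying stream
        reader.DiscardBufferedData();                          // 5) drop what the reader buffered

        // The reader is deliberately not closed: closing it would close the stream too.
        return reader.CurrentEncoding;
    }

    static void Main()
    {
        // UTF-16 big endian BOM followed by "AB"
        byte[] bigEndian = { 0xfe, 0xff, 0x00, 0x41, 0x00, 0x42 };
        using (MemoryStream ms = new MemoryStream(bigEndian))
        {
            Encoding enc = DetectAndRewind(ms);
            Console.WriteLine("Detected code page: " + enc.CodePage);
            Console.WriteLine("Stream position: " + ms.Position);
        }
    }
}
```

With the encoding known, a second reader created with new StreamReader(stream, enc) can do the real reading. How the probing reader itself treats the BOM bytes when it meets them again after the rewind is not verified here, which is why the sketch hands back the encoding rather than the reader.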

This will still cause problems if the stream isn't seekable, of course.
In that case, you'd have to read the first character and remember it
for when you first wanted actual data.
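
For the non-seekable case, one way to "read the first character and remember it" is a thin TextReader wrapper. This is only a sketch: PrimedReader and DetectedEncoding are invented names, and it relies on the base TextReader methods (ReadLine, ReadToEnd) being built on top of the overridden Read and Peek, which makes it simple rather than fast.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical wrapper: reads one character up front so the encoding is known,
// then hands that character back on the first Read call.
class PrimedReader : TextReader
{
    private readonly StreamReader inner;
    private int pending; // the remembered first character, or -1 if none is waiting

    public PrimedReader(Stream stream)
    {
        inner = new StreamReader(stream, true);
        pending = inner.Read(); // triggers BOM detection; -1 if the stream is empty
    }

    // Available immediately, before the caller has read anything.
    public Encoding DetectedEncoding
    {
        get { return inner.CurrentEncoding; }
    }

    public override int Peek()
    {
        return pending != -1 ? pending : inner.Peek();
    }

    public override int Read()
    {
        if (pending != -1)
        {
            int c = pending;
            pending = -1; // the remembered character is handed out only once
            return c;
        }
        return inner.Read();
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing) inner.Close(); // also closes the underlying stream
        base.Dispose(disposing);
    }
}

class PrimedReaderDemo
{
    static void Main()
    {
        // UTF-16 little endian BOM followed by "AB"
        byte[] littleEndian = { 0xff, 0xfe, 0x41, 0x00, 0x42, 0x00 };
        using (PrimedReader reader = new PrimedReader(new MemoryStream(littleEndian)))
        {
            Console.WriteLine("Code page: " + reader.DetectedEncoding.CodePage);
            Console.WriteLine("Text: " + reader.ReadLine());
        }
    }
}
```

The trade-off is that the constructor already consumes data from the stream, which is exactly the kind of side effect mentioned in assumption 1 above.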
 
