StreamReader.StreamReader(String, bool) bug - no BOM detection

Polanski24 · Apr 22, 2006

Hello!

While testing my app I discovered the following bug in .NET v2.0 (I have
not tested 1.1 yet).

The StreamReader constructors that are supposed to detect the byte order
mark fail to do so.

A simple test case is below. Feed it files with different BOMs and you
will see that the StreamReader encoding is always the default
UTF8Encoding, regardless of the file's BOM.
In case someone needs BOM detection, use the code below instead of the
StreamReader constructors.

// Needs: using System; using System.Diagnostics; using System.IO; using System.Text;
// 'path' is the file under test.
StreamReader reader = null;
FileStream file = null;
Encoding enc = null;
try
{
    file = new FileStream(path,
        FileMode.Open, FileAccess.Read, FileShare.ReadWrite);

    // Get the byte-order mark, if there is one. 'count' guards short files.
    byte[] bom = new byte[4];
    int count = file.Read(bom, 0, 4);
    if (count >= 3 && bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf)
    {
        enc = Encoding.UTF8; // UTF-8
    }
    else if (count >= 4 && bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0)
    {
        enc = Encoding.UTF32; // UTF-32 LE; must be tested before UTF-16 LE
    }
    else if (count >= 4 && bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff)
    {
        enc = new UTF32Encoding(true, true); // UTF-32 BE
    }
    else if (count >= 2 && bom[0] == 0xff && bom[1] == 0xfe)
    {
        enc = Encoding.Unicode; // UTF-16 LE
    }
    else if (count >= 2 && bom[0] == 0xfe && bom[1] == 0xff)
    {
        enc = Encoding.BigEndianUnicode; // UTF-16 BE
    }
    else
    {
        enc = Encoding.ASCII; // no BOM recognized
    }
    file.Close();

    reader = new StreamReader(path, true);

    Trace.WriteLine("StreamReader encoding: " + reader.CurrentEncoding);
    Trace.WriteLine("BOM detected encoding: " + enc.ToString());
}
catch (Exception ex)
{
    Trace.WriteLine(ex.ToString());
}
finally
{
    if (reader != null) reader.Close();
    if (file != null) file.Close();
}

Cheers,

http://sourceforge.net/projects/ngmp

Jon Skeet [C# MVP] · Apr 22, 2006

Polanski24 said:
While testing my app I discovered the following bug in .NET v2.0 (I have
not tested 1.1 yet).

The StreamReader constructors that are supposed to detect the byte order
mark fail to do so.

No, they don't. You're trying to use CurrentEncoding prior to reading
any data. From the docs:

<quote>
Property Value
The current character encoding used by the current reader. The value
can be different after the first call to any Read method of
StreamReader, since encoding autodetection is not done until the first
call to a Read method.
</quote>

I don't believe there's a bug at all.

If you run the code below, you'll see it doing the right thing. In the
last case, the same data as a previous test case is used, claiming to
be UTF-8 but then using little endian UTF-16 data. That's the only case
in which things go "wrong" (understandably) - it copes with all the
rest.

using System;
using System.IO;

class Test
{
    static void Main (string[] args)
    {
        byte[] littleEndian = new byte[]
            {0xff, 0xfe, 0x41, 0x00, 0x42, 0x00};
        byte[] bigEndian = new byte[]
            {0xfe, 0xff, 0x00, 0x41, 0x00, 0x42};
        byte[] utf8 = new byte[]
            {0xef, 0xbb, 0xbf, 0x41, 0x42};
        byte[] utf8DuffData = new byte[]
            {0xef, 0xbb, 0x41, 0x00, 0x42, 0x00};

        ShowEncoding ("Big endian", bigEndian);
        ShowEncoding ("Little endian", littleEndian);
        ShowEncoding ("UTF-8", utf8);
        ShowEncoding ("UTF-8 with little endian UTF-16 data",
                      utf8DuffData);
    }

    static void ShowEncoding (string correct, byte[] data)
    {
        using (MemoryStream ms = new MemoryStream(data))
        {
            using (StreamReader reader = new StreamReader(ms, true))
            {
                Console.WriteLine (correct);
                // Read first: CurrentEncoding only reflects the BOM
                // after the first call to a Read method.
                string line = reader.ReadLine();
                Console.WriteLine (reader.CurrentEncoding);
                Console.WriteLine (line);
            }
        }
    }
}
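
The timing described in the docs quote can be seen directly by looking at CurrentEncoding before and after the first read. The following is only a minimal sketch: the EncodingTiming class and its NamesBeforeAndAfter helper are hypothetical names, not part of the program above, and Peek is used as the "first read" on the assumption that it fills the reader's buffer without consuming a character.

```csharp
using System;
using System.IO;

// Hypothetical demo class, added for illustration only.
static class EncodingTiming
{
    // Returns the encoding names reported before and after the first read.
    public static string[] NamesBeforeAndAfter(byte[] data)
    {
        using (MemoryStream ms = new MemoryStream(data))
        using (StreamReader reader = new StreamReader(ms, true))
        {
            // Nothing has been read yet, so this is the constructor default.
            string before = reader.CurrentEncoding.WebName;

            // Peek fills the internal buffer, which is when the BOM is examined.
            reader.Peek();

            string after = reader.CurrentEncoding.WebName;
            return new string[] { before, after };
        }
    }

    static void Main()
    {
        // Same little-endian UTF-16 bytes as in the test program.
        byte[] littleEndian = { 0xff, 0xfe, 0x41, 0x00, 0x42, 0x00 };
        string[] names = NamesBeforeAndAfter(littleEndian);
        Console.WriteLine("Before first read: " + names[0]);
        Console.WriteLine("After first read:  " + names[1]);
    }
}
```

The first value should be the UTF-8 default and the second the detected UTF-16 encoding, which matches the behaviour reported in the original post.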

Polanski24 · Apr 22, 2006

Hello!

Thanks for the reply. With some hesitation I agree that it's not a bug,
but rather a design flaw (or bug). The reason is very simple: there is
no way to seek freely in the stream using StringReader, as it works in
only one direction, but code usually needs to detect the encoding
before any reads or processing are done.

Since StreamReader internally uses a Stream to go through the data, its
constructor should do the check internally, then rewind the Stream
position to 0 and wait for the first read operation, instead of
providing an incorrect value in the CurrentEncoding property. This is
at least the way I would design that feature.

Cheers

http://sourceforge.net/projects/ngmp

Carl Daniel [VC++ MVP] · Apr 22, 2006

Polanski24 said:
Hello!

Thanks for the reply. With some hesitation I agree that it's not a bug,
but rather a design flaw (or bug). The reason is very simple: there is
no way to seek freely in the stream using StringReader, as it works in
only one direction, but code usually needs to detect the encoding
before any reads or processing are done.

Since StreamReader internally uses a Stream to go through the data, its
constructor should do the check internally, then rewind the Stream
position to 0 and wait for the first read operation, instead of
providing an incorrect value in the CurrentEncoding property. This is
at least the way I would design that feature.

... which would then make it useless on non-seekable streams, like the
stream from a decompressor or a network socket. Poor choice, IMO.

-cd
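
The objection about non-seekable streams is easy to check. The following is a minimal sketch, assuming GZipStream from System.IO.Compression as the example of a decompressor stream; the NonSeekableDemo class and RewindFails helper are hypothetical names used only for illustration.

```csharp
using System;
using System.IO;
using System.IO.Compression;

// Hypothetical demo class, added for illustration only.
static class NonSeekableDemo
{
    // Returns true if moving the stream back to position 0 is refused.
    public static bool RewindFails(Stream stream)
    {
        try
        {
            stream.Position = 0;
            return false;
        }
        catch (NotSupportedException)
        {
            return true;
        }
    }

    static void Main()
    {
        // An empty backing stream is enough: nothing is read from it here.
        using (MemoryStream backing = new MemoryStream())
        using (GZipStream gzip = new GZipStream(backing, CompressionMode.Decompress))
        {
            Console.WriteLine("CanSeek: " + gzip.CanSeek);
            Console.WriteLine("Rewind fails: " + RewindFails(gzip));
        }
    }
}
```

A constructor that read the BOM and then tried to rewind would hit exactly this failure on such a stream.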

Jon Skeet [C# MVP] · Apr 22, 2006

Polanski24 said:
Thanks for the reply. With some hesitation I agree that it's not a bug,
but rather a design flaw (or bug). The reason is very simple: there is
no way to seek freely in the stream using StringReader, as it works in
only one direction, but code usually needs to detect the encoding
before any reads or processing are done.

I very rarely find that necessary, actually. If I don't know what the
encoding is before I start, I rarely care about it at all, so long as
I'm getting the right text data.

Polanski24 said:
Since StreamReader internally uses a Stream to go through the data, its
constructor should do the check internally, then rewind the Stream
position to 0 and wait for the first read operation, instead of
providing an incorrect value in the CurrentEncoding property. This is
at least the way I would design that feature.

You're making some assumptions there:

1) Reading the stream when you haven't been asked to won't have any
nasty side effects

2) The stream can be rewound

Now, assuming you meant to write StreamReader rather than StringReader
in the first paragraph, all you need to do is use the BaseStream
property, set the position on *that*, and then call
StreamReader.DiscardBufferedData.

So, you can get the behaviour you want by:

1) Find the position of the stream
2) Create the StreamReader
3) Call StreamReader.Read()
4) Set StreamReader.BaseStream.Position = <whatever it was before>
5) Call StreamReader.DiscardBufferedData
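
The five steps above can be sketched as follows. This is only an illustration under stated assumptions: EncodingProbe and DetectEncoding are hypothetical names, the stream is assumed to be seekable, and the helper hands back only the detected encoding so that the caller opens a fresh reader. That last choice is a deliberate variation: reusing the same reader after the rewind may surface the BOM as a leading character, since detection is not repeated.

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical helper class, added for illustration only.
static class EncodingProbe
{
    // Detects the encoding and leaves the stream where it started.
    // Assumes stream.CanSeek is true.
    public static Encoding DetectEncoding(Stream stream)
    {
        long start = stream.Position;                          // 1) find the position
        StreamReader reader = new StreamReader(stream, true);  // 2) create the reader
        reader.Read();                                         // 3) first read triggers detection
        Encoding detected = reader.CurrentEncoding;
        reader.BaseStream.Position = start;                    // 4) restore the position
        reader.DiscardBufferedData();                          // 5) drop the buffered data
        // The reader is deliberately not disposed: that would close the stream.
        return detected;
    }

    static void Main()
    {
        byte[] bigEndian = { 0xfe, 0xff, 0x00, 0x41, 0x00, 0x42 };
        using (MemoryStream ms = new MemoryStream(bigEndian))
        {
            Encoding enc = DetectEncoding(ms);
            Console.WriteLine("Detected: " + enc.WebName);

            // The stream is back at its start, so a new reader sees all the data.
            using (StreamReader reader = new StreamReader(ms, enc))
            {
                Console.WriteLine("Text: " + reader.ReadLine());
            }
        }
    }
}
```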

This will still cause problems if the stream isn't seekable, of course.
In that case, you'd have to read the first character and remember it
for when you first wanted actual data.
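
For the non-seekable case, one way to avoid remembering the first character by hand is to let the reader hold it. The following is a minimal sketch, assuming StreamReader.Peek is an acceptable substitute for Read here, because it triggers the first buffer fill without consuming anything. PeekDetection, Compress and DetectWithoutConsuming are hypothetical names, and GZipStream merely stands in for any forward-only source.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// Hypothetical helper class, added for illustration only.
static class PeekDetection
{
    // Gzip-compresses the bytes so they can be read back through a
    // non-seekable decompression stream.
    public static MemoryStream Compress(byte[] data)
    {
        MemoryStream compressed = new MemoryStream();
        using (GZipStream gzip = new GZipStream(compressed, CompressionMode.Compress, true))
        {
            gzip.Write(data, 0, data.Length);
        }
        compressed.Position = 0;
        return compressed;
    }

    // Peek looks at the first character without consuming it, so the
    // encoding is known before any data is taken from the reader.
    public static Encoding DetectWithoutConsuming(StreamReader reader)
    {
        reader.Peek();
        return reader.CurrentEncoding;
    }

    static void Main()
    {
        byte[] littleEndian = { 0xff, 0xfe, 0x41, 0x00, 0x42, 0x00 };
        using (GZipStream source = new GZipStream(Compress(littleEndian), CompressionMode.Decompress))
        using (StreamReader reader = new StreamReader(source, true))
        {
            Encoding enc = DetectWithoutConsuming(reader);
            Console.WriteLine("CanSeek: " + source.CanSeek);
            Console.WriteLine("Detected: " + enc.WebName);
            Console.WriteLine("Text: " + reader.ReadLine());
        }
    }
}
```

The catch is that all further reading has to go through this same reader, since the bytes it has buffered cannot be handed back to the stream.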
