Size of a file and Unicode



Tony Johansson

Hi!

I have this simple program that writes 16 characters into a file called
test.txt.
If I check the size of the file using the Length property on the FileInfo
class, I can see that the myLong variable is 16.
If I use a binary viewer to look at test.txt, I see that the number
of bytes is 16, with the following hex values:
31 32 33 34 35 36 37 38 39 31 32 33 34 35 36 37

Now to my question: Unicode is two bytes.
I thought that Unicode was used, but here it's not, because each
character is just a single byte and not the 2 bytes that Unicode is?
static void Main(string[] args)
{
    using (StreamWriter writer = new StreamWriter("test.txt"))
    {
        writer.Write("1234567891234567"); // the 16 characters from the hex dump
    }
    FileInfo file = new FileInfo("test.txt");
    long myLong = file.Length; // 16
}

//Tony
 

Patrice

Hello,

The problem is that Unicode defines the characters, but there are
several ways to encode those characters as bytes (a bit like how a
compressed file is still the same file but doesn't have the same byte
content).

The documentation says that StreamWriter uses UTF-8 encoding by default
(http://msdn.microsoft.com/en-us/library/system.io.streamwriter(VS.80).aspx),
which uses a variable number of bytes depending on the character being
encoded (in particular, ASCII characters are encoded as single bytes,
making this encoding scheme backward compatible with ASCII).

You can choose the appropriate constructor from
http://msdn.microsoft.com/en-us/library/system.io.streamwriter.streamwriter(VS.80).aspx
if you need to use another encoding...
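As a small sketch of that constructor overload (the file names here are just illustrative), the encoding parameter changes the on-disk size of the very same text:

```csharp
using System;
using System.IO;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        // Default constructor: UTF-8 (without a BOM) is used silently.
        using (var w = new StreamWriter("utf8.txt"))
            w.Write("1234567891234567");

        // Explicit UTF-16 (little-endian) via the Encoding parameter;
        // StreamWriter also writes the encoding's 2-byte BOM preamble.
        using (var w = new StreamWriter("utf16.txt", false, Encoding.Unicode))
            w.Write("1234567891234567");

        // 16 ASCII chars: 16 bytes in UTF-8, 2 (BOM) + 32 bytes in UTF-16.
        Console.WriteLine(new FileInfo("utf8.txt").Length);  // 16
        Console.WriteLine(new FileInfo("utf16.txt").Length); // 34
    }
}
```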
 

Harlan Messinger

Tony said:
[snip]

Now to my question: Unicode is two bytes.
I thought that Unicode was used, but here it's not, because each
character is just a single byte and not the 2 bytes that Unicode is?
That's the correct result if the characters you wrote all come from the
ASCII subset.

UTF-8 isn't Unicode; it's an *encoding*, one of several that have been
designed for Unicode. It uses a variable number of bytes, from one to
four, for each character. For Unicode code points up to 127, the
UTF-8 encoding is a single byte, identical to the code point value.
For U+0080 to U+07FF, it uses two bytes; for U+0800 to U+FFFF,
it uses three bytes; and for U+10000 to U+10FFFF it uses four bytes.

See http://en.wikipedia.org/wiki/Utf-8 for details.
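Those ranges are easy to verify from C# with Encoding.UTF8.GetByteCount; a minimal sketch (the sample characters are just illustrative picks from each range):

```csharp
using System;
using System.Text;

class Utf8Widths
{
    static void Main()
    {
        // One sample character from each UTF-8 width class:
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"));          // U+0041  -> 1 byte
        Console.WriteLine(Encoding.UTF8.GetByteCount("Å"));          // U+00C5  -> 2 bytes
        Console.WriteLine(Encoding.UTF8.GetByteCount("€"));          // U+20AC  -> 3 bytes
        Console.WriteLine(Encoding.UTF8.GetByteCount("\U0001F600")); // U+1F600 -> 4 bytes
    }
}
```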
 

Arne Vajhøj

Tony Johansson said:
[snip]

Now to my question: Unicode is two bytes.
I thought that Unicode was used, but here it's not, because each
character is just a single byte and not the 2 bytes that Unicode is?

Unicode does not have anything to do with bytes. It just specifies
the numeric value of each character, from 0 up to U+10FFFF (about
1.1 million code points).

If you store Unicode as UTF-16, then each Unicode character
is stored as two bytes (except for the very high code
points, which end up as a pair of two-byte units, a so-called
surrogate pair, but forget about that for now).

If you store Unicode as UTF-8, then each Unicode character
is stored as a variable number of bytes. Characters in
US-ASCII (the English alphabet) take one byte. The extra
characters in the Swedish alphabet take two bytes. And
you need more bytes for the Asian languages.

Try comparing the output files from:

using System;
using System.IO;
using System.Text;

namespace E
{
    public class Program
    {
        public static void Test(string s, string fnm, Encoding enc)
        {
            using (StreamWriter sw = new StreamWriter(@"C:\" + fnm, false, enc))
            {
                sw.WriteLine(s);
            }
        }

        public static void Main(string[] args)
        {
            string s = "ABCabc123ÅÄÖåäö";
            Test(s, "utf8.txt", Encoding.UTF8);
            Test(s, "utf16.txt", Encoding.GetEncoding("UTF-16"));
            Test(s, "cp1252.txt", Encoding.GetEncoding(1252));
        }
    }
}
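As a rough check of what those files should contain, the payload sizes (ignoring BOM preambles and the line terminator that WriteLine appends) can be predicted with GetByteCount; a sketch:

```csharp
using System;
using System.Text;

class SizePrediction
{
    static void Main()
    {
        // 9 ASCII characters plus 6 Swedish characters:
        string s = "ABCabc123ÅÄÖåäö";

        Console.WriteLine(s.Length);                          // 15 characters
        Console.WriteLine(Encoding.UTF8.GetByteCount(s));     // 9*1 + 6*2 = 21 bytes
        Console.WriteLine(Encoding.Unicode.GetByteCount(s));  // 15*2     = 30 bytes
    }
}
```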

Arne
 

Tony Johansson

Arne Vajhøj said:
[snip]

I just wonder about this text that you wrote:
If you store Unicode as UTF-8, then each Unicode character
is stored as a variable number of bytes. Characters in
US-ASCII (the English alphabet) take one byte. The extra
characters in the Swedish alphabet take two bytes. And
you need more bytes for the Asian languages.

Do I understand you right if I say that you have to use UTF-16 only when
you are dealing with Asian languages and the like?

//Tony
 

Konrad Neitzel

Hi Tony!
I just wonder about this text that you wrote:
[snip]

Do I understand you right if I say that you have to use UTF-16 only when
you are dealing with Asian languages and the like?

No. You do not have to use any special encoding.

UTF-8, UTF-16 and UTF-32 all support every character (as far as I
understand it), because UTF-8 and UTF-16 can use up to 4 bytes to
store a character.

So you have to decide what you need:
- If you need random access to characters, you should take UTF-32 (so
every character requires 32 bits = 4 bytes).
- If you want to keep files small, you should choose UTF-8 (in the hope
that most characters only use 1 or 2 bytes).

But you can encode the text in any of these formats. (Simply read
Wikipedia on UTF-8, UTF-16 and UTF-32.)
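A small sketch of that trade-off (the sample string is just illustrative): UTF-32 always spends 4 bytes per code point, while UTF-8 and UTF-16 vary per character:

```csharp
using System;
using System.Text;

class FixedVsVariable
{
    static void Main()
    {
        // 4 code points: ASCII, Latin-1, BMP, and a supplementary character.
        string s = "Aä€\U0001F600";

        Console.WriteLine(s.Length);                         // 5 UTF-16 code units
        Console.WriteLine(Encoding.UTF8.GetByteCount(s));    // 1+2+3+4 = 10 bytes
        Console.WriteLine(Encoding.Unicode.GetByteCount(s)); // 2+2+2+4 = 10 bytes
        Console.WriteLine(Encoding.UTF32.GetByteCount(s));   // 4*4     = 16 bytes, fixed width
    }
}
```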

With kind regards,

Konrad
 

Arne Vajhøj

Tony Johansson said:
[snip]

Do I understand you right if I say that you have to use UTF-16 only when
you are dealing with Asian languages and the like?

UTF-8 and UTF-16 both support all of Unicode.

They just encode it differently.

Usually UTF-8 is the best for external usage, because it is backward
compatible with ASCII (the common subset of the ISO-8859-x encodings).

But there are a few trillion strings stored in memory as UTF-16!
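That is because .NET strings themselves are sequences of UTF-16 code units; a minimal sketch (the emoji is just an example of a code point above U+FFFF):

```csharp
using System;

class Utf16InMemory
{
    static void Main()
    {
        string s = "\U0001F600"; // one Unicode code point (U+1F600)

        // A .NET string stores UTF-16 code units, so Length is 2 here:
        // a surrogate pair, not one "character".
        Console.WriteLine(s.Length);     // 2
        Console.WriteLine(sizeof(char)); // 2 bytes per code unit
        Console.WriteLine((int)s[0]);    // 55357 (0xD83D, high surrogate)
        Console.WriteLine((int)s[1]);    // 56832 (0xDE00, low surrogate)
    }
}
```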

Arne
 
