Size of a file and Unicode



Tony Johansson

Hi!

I have this simple program that writes 16 characters into a file called
test.txt.
If I check the size of the file using the Length property on the FileInfo
class, I can see that the myLong variable is 16.
If I use a binary viewer to look at test.txt, I see that the number
of bytes is 16, with the following hex values:
31 32 33 34 35 36 37 38 39 31 32 33 34 35 36 37

Now to my question: Unicode is two bytes.
I thought that Unicode was used, but here it's not, because each
character is just a single byte and not the 2 bytes that Unicode is?
static void Main(string[] args)
{
    using (StreamWriter writer = new StreamWriter("test.txt"))
    {
        writer.Write("1234567891234567"); // the 16 characters from the hex dump
    }
    FileInfo file = new FileInfo("test.txt");
    long myLong = file.Length; // 16
}

//Tony
 

Patrice

Hello,

The problem is that Unicode defines the characters, but there are
several ways to encode those characters as bytes (a bit like how a
compressed file is still the same file but doesn't have the same byte
content).

The documentation says that StreamWriter uses UTF-8 encoding by default
(http://msdn.microsoft.com/en-us/library/system.io.streamwriter(VS.80).aspx),
which uses a variable number of bytes depending on the character being
encoded (in particular, ASCII characters are encoded as single bytes,
making this encoding scheme backward compatible with ASCII).

You can choose the appropriate constructor from
http://msdn.microsoft.com/en-us/library/system.io.streamwriter.streamwriter(VS.80).aspx
if you need to use another encoding...
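As a small sketch of that constructor overload (the file names here are just illustrative), the encoding parameter changes the on-disk size of the very same text:

```csharp
using System;
using System.IO;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        // Default constructor: UTF-8 (without a BOM) is used silently.
        using (var w = new StreamWriter("utf8.txt"))
            w.Write("1234567891234567");

        // Explicit UTF-16 (little-endian) via the Encoding parameter;
        // StreamWriter also writes the encoding's 2-byte BOM preamble.
        using (var w = new StreamWriter("utf16.txt", false, Encoding.Unicode))
            w.Write("1234567891234567");

        // 16 ASCII chars: 16 bytes in UTF-8, 2 (BOM) + 32 bytes in UTF-16.
        Console.WriteLine(new FileInfo("utf8.txt").Length);  // 16
        Console.WriteLine(new FileInfo("utf16.txt").Length); // 34
    }
}
```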
 

Harlan Messinger

Tony said:
[snip]

Now to my question: Unicode is two bytes.
I thought that Unicode was used, but here it's not, because each
character is just a single byte and not the 2 bytes that Unicode is?
That's the correct result if the characters you wrote all come from the
ASCII subset.

UTF-8 isn't Unicode; it's an *encoding*, one of several that have been
designed for Unicode. It uses a variable number of bytes, from one to
four, for each character. For Unicode code points up to 127, the
UTF-8 encoding is a single byte, identical to the code point value.
For U+0080 to U+07FF, it uses two bytes; for U+0800 to U+FFFF,
it uses three bytes; and for U+10000 to U+10FFFF it uses four bytes.

See http://en.wikipedia.org/wiki/Utf-8 for details.
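Those ranges are easy to verify from C# with Encoding.UTF8.GetByteCount; a minimal sketch (the sample characters are just illustrative picks from each range):

```csharp
using System;
using System.Text;

class Utf8Widths
{
    static void Main()
    {
        // One sample character from each UTF-8 width class:
        Console.WriteLine(Encoding.UTF8.GetByteCount("A"));          // U+0041  -> 1 byte
        Console.WriteLine(Encoding.UTF8.GetByteCount("Å"));          // U+00C5  -> 2 bytes
        Console.WriteLine(Encoding.UTF8.GetByteCount("€"));          // U+20AC  -> 3 bytes
        Console.WriteLine(Encoding.UTF8.GetByteCount("\U0001F600")); // U+1F600 -> 4 bytes
    }
}
```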
 

Arne Vajhøj

Tony Johansson said:
[snip]

Now to my question: Unicode is two bytes.
I thought that Unicode was used, but here it's not, because each
character is just a single byte and not the 2 bytes that Unicode is?

Unicode does not have anything to do with bytes. It just specifies
the numeric value of each character, from 0 up to U+10FFFF (about
1.1 million code points).

If you store Unicode as UTF-16, then each Unicode character
is stored as two bytes (except for the very high code
points, which end up as a pair of two-byte units, a so-called
surrogate pair, but forget about that for now).

If you store Unicode as UTF-8, then each Unicode character
is stored as a variable number of bytes. Characters in
US-ASCII (the English alphabet) take one byte. The extra
characters in the Swedish alphabet take two bytes. And
you need more bytes for the Asian languages.

Try comparing the output files from:

using System;
using System.IO;
using System.Text;

namespace E
{
    public class Program
    {
        public static void Test(string s, string fnm, Encoding enc)
        {
            using (StreamWriter sw = new StreamWriter(@"C:\" + fnm, false, enc))
            {
                sw.WriteLine(s);
            }
        }

        public static void Main(string[] args)
        {
            string s = "ABCabc123ÅÄÖåäö";
            Test(s, "utf8.txt", Encoding.UTF8);
            Test(s, "utf16.txt", Encoding.GetEncoding("UTF-16"));
            Test(s, "cp1252.txt", Encoding.GetEncoding(1252));
        }
    }
}
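As a rough check of what those files should contain, the payload sizes (ignoring BOM preambles and the line terminator that WriteLine appends) can be predicted with GetByteCount; a sketch:

```csharp
using System;
using System.Text;

class SizePrediction
{
    static void Main()
    {
        // 9 ASCII characters plus 6 Swedish characters:
        string s = "ABCabc123ÅÄÖåäö";

        Console.WriteLine(s.Length);                          // 15 characters
        Console.WriteLine(Encoding.UTF8.GetByteCount(s));     // 9*1 + 6*2 = 21 bytes
        Console.WriteLine(Encoding.Unicode.GetByteCount(s));  // 15*2     = 30 bytes
    }
}
```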

Arne
 

Tony Johansson

Arne Vajhøj said:
[snip]

I just wonder about this text that you wrote:
If you store Unicode as UTF-8, then each Unicode character
is stored as a variable number of bytes. Characters in
US-ASCII (the English alphabet) take one byte. The extra
characters in the Swedish alphabet take two bytes. And
you need more bytes for the Asian languages.

Do I understand you right if I say that you have to use UTF-16 only when
you are dealing with Asian languages and the like?

//Tony
 

Konrad Neitzel

Hi Tony!
I just wonder about this text that you wrote:
[snip]

Do I understand you right if I say that you have to use UTF-16 only when
you are dealing with Asian languages and the like?

No. You do not have to use any special encoding.

UTF-8, UTF-16 and UTF-32 all support every character (as far as I
understand it), because UTF-8 and UTF-16 can use up to 4 bytes to
store a character.

So you have to decide what you need:
- If you need random access to characters, you should take UTF-32 (so
every character requires 32 bits = 4 bytes).
- If you want to keep files small, you should choose UTF-8 (in the hope
that most characters only use 1 or 2 bytes).

But you can encode the text in any of these formats. (Simply read
Wikipedia on UTF-8, UTF-16 and UTF-32.)
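A small sketch of that trade-off (the sample string is just illustrative): UTF-32 always spends 4 bytes per code point, while UTF-8 and UTF-16 vary per character:

```csharp
using System;
using System.Text;

class FixedVsVariable
{
    static void Main()
    {
        // 4 code points: ASCII, Latin-1, BMP, and a supplementary character.
        string s = "Aä€\U0001F600";

        Console.WriteLine(s.Length);                         // 5 UTF-16 code units
        Console.WriteLine(Encoding.UTF8.GetByteCount(s));    // 1+2+3+4 = 10 bytes
        Console.WriteLine(Encoding.Unicode.GetByteCount(s)); // 2+2+2+4 = 10 bytes
        Console.WriteLine(Encoding.UTF32.GetByteCount(s));   // 4*4     = 16 bytes, fixed width
    }
}
```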

With kind regards,

Konrad
 

Arne Vajhøj

Tony Johansson said:
[snip]

Do I understand you right if I say that you have to use UTF-16 only when
you are dealing with Asian languages and the like?

UTF-8 and UTF-16 both support all of Unicode.

They just encode it differently.

Usually UTF-8 is the best for external usage, because it is backward
compatible with ASCII (the common subset of the ISO-8859-x encodings).

But there are a few trillion strings stored in memory as UTF-16!
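That is because .NET strings themselves are sequences of UTF-16 code units; a minimal sketch (the emoji is just an example of a code point above U+FFFF):

```csharp
using System;

class Utf16InMemory
{
    static void Main()
    {
        string s = "\U0001F600"; // one Unicode code point (U+1F600)

        // A .NET string stores UTF-16 code units, so Length is 2 here:
        // a surrogate pair, not one "character".
        Console.WriteLine(s.Length);     // 2
        Console.WriteLine(sizeof(char)); // 2 bytes per code unit
        Console.WriteLine((int)s[0]);    // 55357 (0xD83D, high surrogate)
        Console.WriteLine((int)s[1]);    // 56832 (0xDE00, low surrogate)
    }
}
```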

Arne
 
