I'm using about twice as many bytes of memory as the size of the file

Tony Johansson

Hi!

I received the text marked with * below as an answer to a mail several days
ago, when I used the method CompressFile_1 (shown at the bottom) to compress
a file.

*This will only work with text files, and you may even lose some information
after going through the
StreamReader and then the StreamWriter in case the file has some characters
that don't fit into the default encoding that you are using. And besides,
you read the whole file into memory before writing the compressed file. In
fact, since the string is stored as Unicode, you are using about twice as
many bytes of memory as the size of the file.*

My question is about what the text says at the end: "In fact, since the string
is stored as Unicode, you are using about twice as many bytes of memory as the
size of the file." I don't understand this. If I look at the file in a binary
viewer I can see that each character occupies one byte, which I think means
that I am using the Unicode encoding UTF-8, which is the default because I
didn't specify any encoding. When I run my program I check the size of the
file using FileInfo.Length and the length was 1582, and when I check using
these two statements
string data = sourceFile.ReadToEnd();
int sizeData = data.Length;
I can also see that sizeData is 1582, which is the same value as the size of
the file.
So, in summary, the statement "In fact, since the string is stored as Unicode,
you are using about twice as many bytes of memory as the size of the file."
is wrong, because I have just checked this by comparing the size of the file
with the size of the data that is read from the file.
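
This is roughly how I do the comparison (just a sketch of the check; the file
path is only a placeholder, not the real file):

using System;
using System.IO;

class SizeCheck
{
    static void Main()
    {
        string path = "infile.txt";   // placeholder path

        long fileSize = new FileInfo(path).Length;   // size of the file on disk, in bytes

        using (StreamReader sourceFile = File.OpenText(path))
        {
            string data = sourceFile.ReadToEnd();
            int sizeData = data.Length;              // what I compare against the file size

            Console.WriteLine("FileInfo.Length = {0}", fileSize);
            Console.WriteLine("data.Length     = {0}", sizeData);
        }
    }
}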


private static void CompressFile_1(string inFile, string utFile)
{
    StreamReader sourceFile = File.OpenText(inFile);
    string data = sourceFile.ReadToEnd();

    FileStream myFileStream = new FileStream(utFile, FileMode.Create, FileAccess.Write);
    GZipStream compStream = new GZipStream(myFileStream, CompressionMode.Compress);
    StreamWriter streamWriter = new StreamWriter(compStream);

    streamWriter.Write(data);
    streamWriter.Close();
    sourceFile.Close();
}

//Tony
 
Patrice

So, in summary, the statement "In fact, since the string is stored as Unicode,
you are using about twice as many bytes of memory as the size of the file."
is wrong, because I have just checked this by comparing the size of the file
with the size of the data that is read from the file.

See "utf-8 encoding" itself as a "compression scheme" for unicode such as
jpg, bmp, tiff for images or zip, rar etc... for general files.

This is not because the file is compressed on disk that you are using the
exact same bytes once this content is loaded into memory...

The purpose of an "encoding" is to translate back and forth between a given
representation (a unicode string, a bitmap in your graphics card, a docx
document shown on your screen) and a particular representation (ut-8 or
utf-16 or utf-32 file bmp , png, gif images, zipped files which is what is
behind docx) etc...
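
As a small illustration (a minimal sketch; the sample string is made up), the
same text translates into a different number of bytes depending on the
encoding, and the encoding translates the bytes back into the same string:

using System;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        string text = "h\u00e9llo";   // "héllo", where 'é' is U+00E9

        byte[] utf8  = Encoding.UTF8.GetBytes(text);    // a typical on-disk representation
        byte[] utf16 = Encoding.Unicode.GetBytes(text); // same layout .NET strings use in memory

        Console.WriteLine(utf8.Length);   // 6 bytes: 'é' needs two bytes in UTF-8
        Console.WriteLine(utf16.Length);  // 10 bytes: two bytes per char

        // Decoding translates the bytes back into the original string
        string roundTrip = Encoding.UTF8.GetString(utf8);
        Console.WriteLine(roundTrip == text);  // True
    }
}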
 
Alberto Poblacion

Tony Johansson said:
[...] If I look at the file in a binary viewer I can see that each character
occupies one byte, which I think means that I am using the Unicode encoding
UTF-8, which is the default because I didn't specify any encoding.

UTF-8 will use only one byte for ASCII characters, but may use more than one
byte for other characters. So yes, if you are using only ASCII characters,
the file will contain as many bytes as characters.
When I run my program I check the size of the file using FileInfo.Length and
the length was 1582, and when I check using these two statements
string data = sourceFile.ReadToEnd();
int sizeData = data.Length;
I can also see that sizeData is 1582, which is the same value as the size of
the file.

The Length property of a String returns the number of *characters*, not
the number of *bytes*. Those 1582 characters of the string are stored in
memory using 3164 bytes. This is because the Strings in .Net use 16 bits to
store every character.
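
A quick way to see it (just a minimal sketch; the file path is a placeholder
and the file is assumed to be plain ASCII text) is to compare the size on disk
with the number of characters and with the bytes those characters occupy in
memory:

using System;
using System.IO;

class StringSizeDemo
{
    static void Main()
    {
        string path = "infile.txt";   // placeholder path

        long bytesOnDisk = new FileInfo(path).Length;
        string data = File.ReadAllText(path);   // decodes the UTF-8 bytes into a UTF-16 string

        int charCount = data.Length;             // number of 16-bit chars, e.g. 1582
        long bytesInMemory = charCount * 2L;     // 2 bytes per char, e.g. 3164

        Console.WriteLine("On disk:    {0} bytes", bytesOnDisk);
        Console.WriteLine("Characters: {0}", charCount);
        Console.WriteLine("In memory:  {0} bytes for the character data", bytesInMemory);
    }
}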
 
Peter Duniho

Patrice said:
See "utf-8 encoding" itself as a "compression scheme" for unicode such as
jpg, bmp, tiff for images or zip, rar etc... for general files.

This is not because the file is compressed on disk that you are using the
exact same bytes once this content is loaded into memory...

Actually, it has the same characteristic as other compression schemes:
the bytes on the disk are not the same as in memory.

Of course, the bytes on disk need to reside in memory temporarily. This
is true for JPEG and UTF-8 alike. But just as JPEG results in the
compressed bytes being expanded into a full bitmap in memory, so too the
UTF-8 bytes get translated into the .NET native format of UTF-16, which
takes more room.

How much more room depends on the level of compression achieved, of
course. But for a lot of examples of text, it easily could be a
doubling or nearly so of the size on disk, once the uncompressed data is
stored in memory.

I think where Tony may be going wrong is expecting the _temporary_ copy
of bytes in memory to take more room than the count of bytes on disk,
which is clearly a mistake. If he were to look at the memory taken by
the actual System.String instance that results from reading UTF-8 text
from disk, he would in fact find an increase in size.
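
For example (a minimal sketch with made-up sample strings), the expansion
factor depends entirely on what the text looks like:

using System;
using System.Text;

class ExpansionDemo
{
    static void Main()
    {
        // Sample strings, purely for illustration
        string ascii    = "plain ASCII text";
        string japanese = "日本語のテキスト";

        Report(ascii);
        Report(japanese);
    }

    static void Report(string s)
    {
        int utf8Bytes  = Encoding.UTF8.GetByteCount(s);  // size the text takes on disk as UTF-8
        int utf16Bytes = s.Length * 2;                   // size of the character data in a .NET string

        Console.WriteLine("UTF-8: {0} bytes, UTF-16 in memory: {1} bytes", utf8Bytes, utf16Bytes);
        // For the ASCII text the in-memory form is about twice the UTF-8 form;
        // for the Japanese text UTF-8 already uses 3 bytes per character, so
        // the in-memory form is actually smaller.
    }
}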

Pete
 
Tony Johansson

Peter Duniho said:
Actually, it has the same characteristic as other compression schemes: the
bytes on the disk are not the same as in memory.

Of course, the bytes on disk need to reside in memory temporarily. This
is true for JPEG and UTF-8 alike. But just as JPEG results in the
compressed bytes being expanded into a full bitmap in memory, so too the
UTF-8 bytes get translated into the .NET native format of UTF-16, which
takes more room.

How much more room depends on the level of compression achieved, of
course. But for a lot of examples of text, it easily could be a doubling
or nearly so of the size on disk, once the uncompressed data is stored in
memory.

I think where Tony may be going wrong is expecting the _temporary_ copy of
bytes in memory to take more room than the count of bytes on disk, which
is clearly a mistake. If he were to look at the memory taken by the
actual System.String instance that results from reading UTF-8 text from
disk, he would in fact find an increase in size.

Pete

I forgot that each character in a string takes up two bytes, not one.

//Tony
 
Random

    The Length property of a String returns the number of *characters*, not
the number of *bytes*. Those 1582 characters of the string are stored in
memory using 3164 bytes. This is because the Strings in .Net use 16 bits to
store every character.

And just to be clear, the .Length property returns the number of "char
objects" in the string. In other words, it returns the length of the
internal UTF-16 buffer. This is almost always the number of
characters, but to be perfectly correct, you'd need to use the
StringInfo helper class to count the number of characters.
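
For example (a small sketch; the musical symbol is just a convenient example
of a character outside the Basic Multilingual Plane):

using System;
using System.Globalization;

class SurrogateDemo
{
    static void Main()
    {
        // U+1D11E MUSICAL SYMBOL G CLEF is stored as a surrogate pair in UTF-16
        string s = "\U0001D11E";

        Console.WriteLine(s.Length);                               // 2 (two 16-bit chars)
        Console.WriteLine(new StringInfo(s).LengthInTextElements); // 1 (one character as the user sees it)
    }
}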
 
Peter Duniho

Random said:
And just to be clear, the .Length property returns the number of "char
objects" in the string. In other words, it returns the length of the
internal UTF-16 buffer. This is almost always the number of
characters, but to be perfectly correct, you'd need to use the
StringInfo helper class to count the number of characters.

I'm not entirely sure I get what you mean.

It's true that with UTF-16, it's possible to have character pairs that
represent a single typographical character. But given that the question
here is about the memory cost of the string, and given that the Length
property returns the number of UTF-16 elements in the string, regardless
of (for example) surrogate pairs (i.e. the surrogate pairs aren't
considered as a single typographical character for the purpose of the
Length property…"grapheme" in the StringInfo lexicon), it seems to me
that the Length property, multiplied by two, is exactly the information
being asked for here.

Why would you need the StringInfo helper class to do a character count
calculation related to the number of bytes in memory used by the string?
I would think it would just complicate matters.

Pete
 
Arne Vajhøj

I received the text marked with * below as an answer to a mail several days
ago, when I used the method CompressFile_1 (shown at the bottom) to compress
a file.

*This will only work with text files, and you may even lose some information
after going through the
StreamReader and then the StreamWriter in case the file has some characters
that don't fit into the default encoding that you are using. And besides,
you read the whole file into memory before writing the compressed file. In
fact, since the string is stored as Unicode, you are using about twice as
many bytes of memory as the size of the file.*

My question is about what the text says at the end: "In fact, since the string
is stored as Unicode, you are using about twice as many bytes of memory as the
size of the file." I don't understand this. If I look at the file in a binary
viewer I can see that each character occupies one byte, which I think means
that I am using the Unicode encoding UTF-8, which is the default because I
didn't specify any encoding. When I run my program I check the size of the
file using FileInfo.Length and the length was 1582, and when I check using
these two statements
string data = sourceFile.ReadToEnd();
int sizeData = data.Length;
I can also see that sizeData is 1582, which is the same value as the size of
the file.
So, in summary, the statement "In fact, since the string is stored as Unicode,
you are using about twice as many bytes of memory as the size of the file."
is wrong, because I have just checked this by comparing the size of the file
with the size of the data that is read from the file.


private static void CompressFile_1(string inFile, string utFile)
{
    StreamReader sourceFile = File.OpenText(inFile);
    string data = sourceFile.ReadToEnd();

    FileStream myFileStream = new FileStream(utFile, FileMode.Create, FileAccess.Write);
    GZipStream compStream = new GZipStream(myFileStream, CompressionMode.Compress);
    StreamWriter streamWriter = new StreamWriter(compStream);

    streamWriter.Write(data);
    streamWriter.Close();
    sourceFile.Close();
}

size on disk = N bytes
size byte array in memory = N bytes
size string in memory if only English alphabet used = N chars = 2*N bytes
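
If you copy the raw bytes straight into the GZipStream you avoid the string
entirely (a minimal sketch, not the original method; the method name is made
up, it uses System.IO and System.IO.Compression, and Stream.CopyTo requires
.NET 4 - on earlier versions you would write a small read/write loop instead):

private static void CompressFileAsBytes(string inFile, string outFile)
{
    using (FileStream source = File.OpenRead(inFile))
    using (FileStream target = File.Create(outFile))
    using (GZipStream compStream = new GZipStream(target, CompressionMode.Compress))
    {
        // Copies the file in chunks; no string, no encoding, works for any file type
        source.CopyTo(compStream);
    }
}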

Arne
 
