Unicode Optimization


Tamir Khason

I have a program with really big embedded text resources.
Because of internationalization I have to save the embedded text as UTF-8,
but then it is more than three times bigger than the original file.
The lesser problem is the compiled file size; the REAL problem is the amount
of memory used by the program, because of the embedded Unicode file. It loads
the values of the file into some hashes.
With ASCII the program takes about 17k of RAM at runtime.
With UTF-8 the program takes at least 70k (!!!) of RAM.
1) Is it possible to optimize the embedded resource, and if so, how?
2) Failing that, how can I at least shrink the memory usage of the program?

TNX
 
Tamir Khason said:
I have a program with really big embedded text resources.
Because of internationalization I have to save the embedded text as UTF-8,
but then it is more than three times bigger than the original file.
The lesser problem is the compiled file size; the REAL problem is the amount
of memory used by the program, because of the embedded Unicode file. It loads
the values of the file into some hashes.
With ASCII the program takes about 17k of RAM at runtime.
With UTF-8 the program takes at least 70k (!!!) of RAM.
1) Is it possible to optimize the embedded resource, and if so, how?
2) Failing that, how can I at least shrink the memory usage of the program?

70K is hardly huge. However, if you *can* store your file correctly as
ASCII, then the same text in UTF-8 should be exactly the same file, as
every ASCII character has the same encoding in UTF-8.
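
For example, something like this (untested, just illustrative) should show
identical byte counts - and identical bytes - for any purely-ASCII text:

using System;
using System.Text;

class AsciiUtf8Demo
{
    static void Main()
    {
        // Any 7-bit text: UTF-8 encodes each ASCII character as one byte
        string text = "[Title]\nSome Title\n[Value]\nWhatever";

        byte[] ascii = Encoding.ASCII.GetBytes(text);
        byte[] utf8 = Encoding.UTF8.GetBytes(text);

        Console.WriteLine(ascii.Length); // 35
        Console.WriteLine(utf8.Length);  // 35 - same size, same bytes
    }
}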

It's not at all clear exactly what's going on here. Could you post a
short but complete program which demonstrates the problem?

See http://www.pobox.com/~skeet/csharp/complete.html for details of
what I mean by that.

My guess is that something else is going on beyond what you're aware
of.
 
Jon, thanks for the reply.
Here is a sample of such files:
BOF file1---------------------
[Title]
Some Title
[Value]
Whatever
EOF file1---------------------

BOF file2----------------------
[Title]
????? ?????
[Value]
?? ??? ????
EOF file2----------------------

BOF file3---------------------
[Title]
?????????
[Value]
? ????? ???????
EOF file3---------------------

Here is the sample code that reads those files:
if (_fileEmbedded == null)
{
    // Read the file from disk
    fs = new FileStream(filePath, FileMode.Open, FileAccess.Read,
                        FileShare.Read);
    sr = new StreamReader(fs, Encoding.UTF8); // Today - have to change encoding
}
else
{
    // Read from the embedded resource stream
    sr = new StreamReader(_fileEmbedded, Encoding.UTF8); // Today - have to change encoding
}

Only UTF-8 (when the file is saved as UTF-8) returns correct results.

Saving the files and creating the readers with any other encoding does not
work.
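
For reference, _fileEmbedded is a stream over the embedded resource,
obtained roughly like this (the resource name below is just an example,
not the real one):

using System.IO;
using System.Reflection;

// Sketch: load the embedded text file from the assembly's manifest resources
Stream _fileEmbedded = Assembly.GetExecutingAssembly()
    .GetManifestResourceStream("MyApp.Resources.file1.txt");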
 
Tamir Khason said:
Following the sample of such files:

<snip>

That still doesn't describe the situation adequately. Yes, if you've
saved the file as UTF-8 you obviously need to use Encoding.UTF8 to read
it. You haven't really explained why that's a problem though, or how
you've determined that it *is* a problem.
 
The problem is the following.
Current situation: I'm using a Unicode (UTF-8) text file as an embedded
resource and use it in my application. This works, but the problem is the
amount of memory it needs (because of the huge UTF-8 file).
After some research I tried to convert this file to ANSI encoding and the
file became 3x smaller, as did the RAM needed, BUT the program works BADLY
due to the encoding issue (we spoke about it in a previous thread about all
strings in .NET being Unicode, remember?).
I want to do either of these:
1) Optimize the Unicode
2) Convert it properly into ANSI and use it as ANSI from C#

Please advise
 
Tamir Khason said:
The problem is the following.
Current situation: I'm using a Unicode (UTF-8) text file as an embedded
resource and use it in my application. This works, but the problem is the
amount of memory it needs (because of the huge UTF-8 file).
After some research I tried to convert this file to ANSI encoding and the
file became 3x smaller

That's very strange in itself. If it's 3x smaller, that suggests you're
losing a lot of data. Which ANSI encoding are you using, and what
proportion of the file is actually just plain ASCII? Could you mail me
both the UTF-8 and the ANSI files?
as did the RAM needed

It should make no difference to the amount of RAM needed. By the time
you've loaded the strings into memory, they'll be in Unicode anyway.
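
To illustrate (a sketch - windows-1255 is just a guess at your code page;
substitute whichever one you actually used):

using System;
using System.Text;

class EncodingMemoryDemo
{
    static void Main()
    {
        // Hebrew sample text (4 characters)
        string original = "\u05E9\u05DC\u05D5\u05DD";

        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);              // 8 bytes on disk
        byte[] ansiBytes = Encoding.GetEncoding(1255).GetBytes(original); // 4 bytes on disk

        string fromUtf8 = Encoding.UTF8.GetString(utf8Bytes);
        string fromAnsi = Encoding.GetEncoding(1255).GetString(ansiBytes);

        // Different file sizes, but identical strings - and identical
        // memory use - once decoded, because strings are UTF-16 internally
        Console.WriteLine(fromUtf8 == fromAnsi); // True
    }
}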
BUT the program works BADLY
due to the encoding issue (we spoke about it in a previous thread about all
strings in .NET being Unicode, remember?).
I want to do either of these:
1) Optimize the Unicode
2) Convert it properly into ANSI and use it as ANSI from C#

I suspect you'll find that your ANSI file actually doesn't have nearly as
much real data in it as the UTF-8 file, which is why it's using less
memory.
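
One quick way to check for that is to round-trip the UTF-8 text through
the ANSI code page and compare (again untested, and windows-1255 is just
an example):

using System;
using System.Text;

class RoundTripCheck
{
    static void Main()
    {
        // Hebrew plus Cyrillic - no single ANSI code page covers both
        string original = "\u05E9\u05DC\u05D5\u05DD \u043F\u0440\u0438\u0432\u0435\u0442";

        Encoding ansi = Encoding.GetEncoding(1255); // example code page
        string roundTripped = ansi.GetString(ansi.GetBytes(original));

        // Characters the code page can't represent come back as '?' -
        // that's exactly the data loss to look for
        Console.WriteLine(original == roundTripped); // False if data was lost
        Console.WriteLine(roundTripped);
    }
}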
 