C# and encodings

Pavel Minaev

> Will try it out. But how would/should app behave if it encounters an
> unknown character? I thought question marks were used for that?!

What do you mean by "unknown character"? Do you mean a codepoint that
doesn't represent any valid character in Unicode, or do you mean an
invalid UTF-8 sequence? For the former, the sensible thing to do is just to
try to get the glyph for that codepoint from the selected font as
usual - if the font cannot render it, then it will end up being a
"question mark" anyway. For the latter, it depends on what you're
doing; for example, whenever Notepad encounters an invalid UTF-8
sequence, it seems to render the individual bytes constituting it as
blank squares.
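
To make the "invalid UTF-8 sequence" case concrete, here is a minimal sketch of how .NET itself treats one. The class and method names are made up for illustration. It assumes the default Encoding.UTF8 behaviour (invalid bytes become U+FFFD) and uses the UTF8Encoding constructor flag that makes decoding throw instead:

```csharp
using System;
using System.Text;

static class InvalidUtf8Demo
{
    // Lenient decoding: each invalid sequence is replaced with U+FFFD.
    public static string DecodeLenient(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Strict decoding: returns false if the input is not valid UTF-8.
    public static bool TryDecodeStrict(byte[] bytes, out string text)
    {
        // First argument: do not emit a BOM; second: throw on invalid bytes.
        var strict = new UTF8Encoding(false, true);
        try
        {
            text = strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = null;
            return false;
        }
    }

    static void Main()
    {
        byte[] valid = { 0xC5, 0xBF };         // U+017F encoded as UTF-8
        byte[] invalid = { 0x41, 0xFF, 0x42 }; // 0xFF never occurs in UTF-8

        // Print the decoded code points rather than the characters themselves,
        // so the result does not depend on the console font or code page.
        foreach (char c in DecodeLenient(invalid))
        {
            Console.Write("U+{0:X4} ", (int)c);
        }
        Console.WriteLine();

        string ignored;
        Console.WriteLine(TryDecodeStrict(valid, out ignored));
        Console.WriteLine(TryDecodeStrict(invalid, out ignored));
    }
}
```

What an editor then draws for U+FFFD, or for raw bytes it could not decode, is up to that editor and its font.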
 
Göran Andersson

> I tried it. I wrote some characters (via Console app) into a .txt
> file and then opened the file via Notepad. Everything up to 380 or so
> was displayed correctly, but from that point only question mark (or
> whatever it uses to signal unknown character … can’t really remember)
> was displayed. I admit I only checked about 20 or so characters (none
> of which was displayed), but based on those I assume it can't display
> anything above 385+.

Did you pick character codes that actually are valid code points? The
Unicode codespace contains more than 1.1 million possible code points,
but only about 100,000 of them are used...
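
One way to check this from code, as a rough sketch: ask .NET for the Unicode category of each value and look for OtherNotAssigned. The helper name below is hypothetical, the cast to char limits it to code points up to 0xFFFF, and the answer reflects whatever Unicode version the runtime's tables were built from:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static class AssignedCheck
{
    // Returns the values in [first, last] that the runtime reports as
    // unassigned (Unicode category Cn). Only valid for first/last <= 0xFFFF.
    public static List<int> FindUnassigned(int first, int last)
    {
        var result = new List<int>();
        for (int i = first; i <= last; i++)
        {
            if (char.GetUnicodeCategory((char)i) == UnicodeCategory.OtherNotAssigned)
            {
                result.Add(i);
            }
        }
        return result;
    }

    static void Main()
    {
        // 383..449 is the range written by the test loop in this thread.
        List<int> unassigned = FindUnassigned(383, 449);
        Console.WriteLine("Unassigned in U+017F..U+01C1: {0}", unassigned.Count);
        foreach (int cp in unassigned)
        {
            Console.WriteLine("U+{0:X4}", cp);
        }
    }
}
```

If the count comes back as zero, the characters are valid and the display problem lies elsewhere (font or encoding).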
 
Mihai N.

> Did you pick character codes that actually are valid code points? The
> Unicode codespace contains more than 1.1 million possible code points,
> but only about 100,000 of them are used...

Unallocated characters will show up as squares (null glyphs), not question
marks.
 
beginwithl

hi


> Probably this is the problem, right there.
> Was the file encoded as UTF-8 or UTF-16?

I tried both encodings, and when using UTF-8, I also called GetPreamble()
to specify the BOM at the beginning of the file (0xEF 0xBB 0xBF).
Anyway, Notepad still displayed empty squares.
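
For reference, a small sketch that prints the BOM bytes and the encoded bytes of one of the test characters, which is what a hex editor should show in the file. The Hex helper is made up for this example. As far as I know, a StreamWriter created on an empty stream with Encoding.UTF8 writes the preamble on its own, so an explicit GetPreamble() call should not be needed:

```csharp
using System;
using System.Text;

static class PreambleDemo
{
    // Formats bytes as space-separated hex, e.g. "EF BB BF".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    static void Main()
    {
        Console.WriteLine(Hex(Encoding.UTF8.GetPreamble()));         // UTF-8 BOM
        Console.WriteLine(Hex(Encoding.Unicode.GetPreamble()));      // UTF-16 LE BOM
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes("\u017F")));    // U+017F as UTF-8
        Console.WriteLine(Hex(Encoding.Unicode.GetBytes("\u017F"))); // U+017F as UTF-16 LE
    }
}
```

Here Encoding.Unicode is .NET's name for little-endian UTF-16.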



> Did you pick character codes that actually are valid code points? The
> Unicode codespace contains more than 1.1 million possible code points,
> but only about 100,000 of them are used...

I think I did, since I checked at the following URL which character
each code point should represent:

http://www.columbia.edu/kermit/utf8-t1.html


> What do you mean by "unknown character"? Do you mean a codepoint that
> doesn't represent any valid character in Unicode, or do you mean an
> invalid UTF-8 sequence? For the former, the sensible thing to do is just to
> try to get the glyph for that codepoint from the selected font as
> usual - if the font cannot render it, then it will end up being a
> "question mark" anyway. For the latter, it depends on what you're
> doing; for example, whenever Notepad encounters an invalid UTF-8
> sequence, it seems to render the individual bytes constituting it as
> blank squares.

Notepad displayed blank squares for characters it couldn’t render,
so I assume the problem is not that the font lacks glyphs for
certain characters.

BTW, I checked the file with a hex editor and everything was OK, so
the problem is with Notepad, not the file. Here is the code:

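    using System.IO;
    using System.Text;

    class Program
    {
        static void Main()
        {
            // StreamWriter emits the UTF-8 BOM itself when given Encoding.UTF8.
            using (FileStream fs = new FileStream(@"D:\test.txt", FileMode.Create))
            using (StreamWriter sw = new StreamWriter(fs, Encoding.UTF8))
            {
                // U+017F to U+01C1: Notepad shows just empty squares for these.
                for (int i = 383; i < 450; i++)
                {
                    sw.Write((char)i);
                }
            }
        }
    }
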
> A question mark in Windows doesn't mean "unknown character". It means
> a "character for which this font doesn't have a glyph". This has
> nothing to do with codepages, and everything to do with the font that
> you're using. Try changing Notepad font to something like "Arial Unicode MS",
> and see what it does then.

I checked what fonts Notepad supports and while it has several Arial
types, none contains the word Unicode



thank you
 
Pavel Minaev

> I checked what fonts Notepad supports and while it has several Arial
> types, none contains the word Unicode

Notepad "supports" all fonts that are installed in Windows. Arial
Unicode MS is a font that gets installed with Microsoft Office, and
it's distinctive in that it provides glyphs for most of the Unicode
codepoints assigned in the range 0x0000 - 0xFFFF. You can use any other font that
provides glyphs for most characters to test that Notepad does indeed
handle Unicode correctly (for example, you could try the Code2000
shareware font for that purpose).
 
beginwithl

>> I checked what fonts Notepad supports and while it has several Arial
>> types, none contains the word Unicode
>
> Notepad "supports" all fonts that are installed in Windows. Arial
> Unicode MS is a font that gets installed with Microsoft Office, and
> it's distinctive in that it provides glyphs for most of the Unicode
> codepoints assigned in the range 0x0000 - 0xFFFF. You can use any other font that
> provides glyphs for most characters to test that Notepad does indeed
> handle Unicode correctly (for example, you could try the Code2000
> shareware font for that purpose).

Truth be told, I only know of the following way to change Notepad’s
font:

Format menu --> Font

In any case, as was pointed out earlier by you guys, Notepad
displays a question mark when a particular font doesn’t have a glyph for
a particular character, and an empty square for an unallocated character. If
that is the case, then the problem is not in using the wrong font, but…?!


cheers
 
Mihai N.

> In any case, as was pointed out earlier by you guys, Notepad
> displays a question mark when a particular font doesn’t have a glyph for
> a particular character, and an empty square for an unallocated character.

Notepad displays a square for a missing glyph *or* an unallocated character.
Question marks mean damaged encoding.
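
A minimal sketch of how "damaged encoding" produces literal question marks in .NET: pushing text through an encoding that cannot represent a character replaces it with '?' at encode time, before any font is involved. The names are hypothetical, and ASCII stands in here for any legacy code page that lacks the character:

```csharp
using System;
using System.Text;

static class LossyRoundTrip
{
    // Encodes text with the given encoding and decodes it back.
    public static string RoundTrip(string text, Encoding encoding)
    {
        return encoding.GetString(encoding.GetBytes(text));
    }

    static void Main()
    {
        string original = "a\u017Fb"; // U+017F is one of the test characters

        // ASCII cannot represent U+017F, so the encoder substitutes '?'.
        Console.WriteLine(RoundTrip(original, Encoding.ASCII) == "a?b");

        // UTF-8 can represent every Unicode character, so nothing is lost.
        Console.WriteLine(RoundTrip(original, Encoding.UTF8) == original);
    }
}
```

Once the '?' is in the bytes, no font change can bring the original character back, whereas a square caused by a missing glyph goes away with a better font.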
 
beginwithl

Alrighty, I will try to figure out how to enable the Arial Unicode MS font
in Notepad

thank you for your help
 
