Fonts and character encodings


klem s

hi


1)
a) Do fonts know anything about coded character sets (Unicode, ASCII,
…)? In other words, does a font file specify which coded character sets
may use that font?

b) I assume that if a font supports a certain coded character set, then
any character encoding (aka code page) for that coded character set can
use that font?


2)
a) Is the font file also where it is specified which code point a
particular glyph is mapped to?

b) Can a glyph be mapped to several code points at once? If yes, then
how is the correct mapping (of glyphs to code points) chosen when some
application tries to map the font to a particular coded character set?

c) Is a particular font basically a file with instructions on how to
draw its glyphs? So each font has its own set of glyphs (i.e. its own
set of instructions for how to draw glyphs)?


Thank you
 

klem s

Yes.  Fonts (font families, actually) associate each glyph with specific
characters in specific character sets.
Are you implying that all fonts in a font family share the same glyphs?
That depends on what you mean by "coded character set".  In general, I
would consider a coded character set essentially equivalent to a
character encoding or code page.  They all mean the same thing: a way to
make a one-to-one association between some numeric value and a
named/described character.

I go by the following definitions:

CHARACTER SET:

“Character set is a set of characters. It is not a way of representing
characters on a page or in a file. Rather, it defines the range of
characters that can be so represented. Unicode is technically not so
much a character set as a coded character set”

CODE POINTS:
“A character set defines a set of characters, but does not specify any
way to refer to them. You have to assign them numbers so that they can
be referred to, and those numbers are termed code points.”

CODED CHARACTER SET:
“A coded character set is a character set which has code points matched
with characters so that the characters can be referred to.”


CHARACTER ENCODING / CODE PAGE:
“Character encoding is the system by which the characters in a set are
represented in binary form in a file. The Unicode set may be
represented using three encodings: UTF-8, UTF-16 and UTF-32.”

So, going by these definitions, a coded character set is not the same as
a character encoding / code page.
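
For illustration, here is a rough sketch of that difference (é is just an
arbitrary character; assumes a C# snippet with the usual usings):

using System;
using System.Text;

// One character from the Unicode coded character set (é, code point U+00E9),
// expressed in three different encodings of that same set:
string s = "é";
Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));    // C3-A9
Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s))); // E9-00 (UTF-16 LE)
Console.WriteLine(BitConverter.ToString(Encoding.UTF32.GetBytes(s)));   // E9-00-00-00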

Using that definition, the answer to your second question is yes, but
only trivially so, since "any character encoding for that coded
character set" is exactly just the one character encoding that's the
same as that coded character set.

If I understand you correctly, you’re implying that if, for example,
font F supports coded character set C, then F implicitly supports all
character encodings for coded character set C (thus we could say F
doesn’t EXPLICITLY support character encodings for C)?

Real-world example - if font F supports Unicode, then it can
implicitly be used with any character encoding for Unicode (UTF-8,
UTF-16, UTF-32 etc.). Thus, font F doesn’t really distinguish between
different encodings for C?!

But according to the link you’ve provided http://www.microsoft.com/typography/unicode/cscp.htm:
“An application can determine which codepages a specific font
supports”

This excerpt COULD also be understood as saying that even if font F
supports coded character set C, it doesn’t mean it supports all
character encodings for C. In other words, even if font F supports
Unicode, it doesn’t mean it supports all character encodings for
Unicode?

The main point is that you _can't_ draw strings using other encodings.
No matter what the encoding, it has to be converted to a .NET
System.String first.  You can see the consequences of this by comparing
the information that the Windows utility Character Map shows with the
output you see in a .NET program.
I assume you mean comparing the visual appearance of glyphs (for font
F) in the Windows utility Character Map with the visual appearance of
text drawn with F in .NET apps?
Some of the TT/OT fonts (e.g. Marlett and Bookshelf Symbol 7, to name a
couple that I find on my own Windows 7 installation) don't support
Unicode though.  And drawing them from a .NET program only reproduces
their glyphs one-for-one as Character Map shows them for the character
values below 128.

Which makes sense, if you think about it.  To support drawing using a
font that doesn't support Unicode, the Unicode character values have to
be converted to a (the) character encoding that the font supports.  So,
for example, code point 128 in Unicode is not going to produce the same
output as code point 128 using the character encoding that the font is
actually supporting.  Drawing using Unicode code point 128 is going to
result in whatever code point that 128 mapped to in the font's supported
character encoding.
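
For illustration, a rough sketch of that mismatch (windows-1252 is just an
arbitrary example of a non-Unicode code page):

using System.Text;

// Unicode code point 128 (U+0080) is an obscure control character, but byte
// value 128 in the windows-1252 code page is the euro sign (U+20AC).
byte[] b = { 128 };
string fromCp1252 = Encoding.GetEncoding(1252).GetString(b);
// fromCp1252[0] == '\u20AC', not '\u0080'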

So even if font F supports just coded character set C, we could still
use F with character encodings for coded character set D?! Font F
won’t “complain”, but if D has characters mapped to different code
points than C, then glyphs will be mapped to the wrong characters when
used with character encodings for D?

If you know the character encoding that the font supports (*), then you
could select the appropriate Unicode character value to get the glyph
you want from a font that doesn't support Unicode, when drawing using a
Unicode-only API such as .NET's.

“If you know the character encoding that the font supports…”

* You mean if I know the coded character set that the font supports?

* Anyhow, are you saying that in .NET we have the ability to map a
particular Unicode code point (say Unicode character value 100) to a
particular font code point? Thus, if font F has a glyph representing
character A at code point 100, while Unicode has character A at code
point 65, then I could somehow map Unicode’s code point 65 to the
font’s code point 100?

Some of the TT/OT fonts (e.g. Marlett and Bookshelf Symbol 7, to name a
couple that I find on my own Windows 7 installation) don't support
Unicode though. And drawing them from a .NET program only reproduces
their glyphs one-for-one as Character Map shows them for the character
values below 128.

• So when using font F with coded character set D, .NET somehow knows
that only for font values below 128 do glyphs map correctly to
characters in D? And thus when .NET detects that we’re matching F with
an incompatible coded character set D, it restricts usage of font F
only to glyphs with code points up to 128?

• But you did imply previously that we still have the option to
somehow (I assume we do that by preventing .NET from restricting usage
of font F only to code points below 128) manually map glyphs to
characters?

I don't even understand this part of your post.  It seems circular in
nature.
My assumption, that at least vector fonts don’t contain an actual
visual representation of characters (thus they don’t contain an actual
image file for each code point), but instead contain blueprints for how
to draw glyphs when they are requested by some app, comes from the way
I interpret the following definitions:

"Outline fonts (also called vector fonts) use Bézier curves, drawing
instructions and mathematical formulae to describe each glyph, which
make the character outlines scalable to any size."

"TrueType is an outline font standard. TrueType glyphs are described
with quadratic Bezier curves. It is currently very popular and
implementations exist for all major operating systems."
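
For illustration, a rough sketch of what that looks like from .NET (assumes
System.Drawing is referenced and that Arial is installed; it asks for a
glyph's outline rather than a bitmap):

using System.Drawing;
using System.Drawing.Drawing2D;

// A glyph is stored as drawing instructions (outlines), not as a fixed
// image, so it can be produced at any size. GraphicsPath exposes the outline.
using (GraphicsPath path = new GraphicsPath())
{
    path.AddString("A", new FontFamily("Arial"), (int)FontStyle.Regular,
                   72f, new PointF(0f, 0f), StringFormat.GenericDefault);
    // path.PathPoints now holds the points of the curves describing the glyph.
}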


I appreciate your help
 

klem s

No.  I just mean that a font file corresponds to a font family in .NET.
I know fonts from the same font family share a design, but are you
saying that in .NET fonts aren't represented as fonts, but only as a
font family? Thus, .NET doesn't distinguish between fonts, but only
between font families? If so, why?

Alternatively, I suppose one might consider Unicode to be _the_ coded
character set, and treat all character encodings as mappings from binary
data to that coded character set (with the mappings from UTF8, UTF16,
and UTF32 being more trivial than from, for example, various code pages
that might be in use on Windows).


That would be my interpretation, yes.  But even more importantly, if
font F supports Unicode, then it can implicitly be used with any
character encoding that can be translated to Unicode.  Not just UTF8,
UTF16, and UTF32, but pretty much any code page you want.  The OS API
that actually uses the font can deal with translation from the actual
character encoding to the coded character set used by the font.
So, conceptually speaking, the OS API will take a code page CP (where,
for example, code point 100 represents character A) and map it to font
F (which only supports Unicode) in such a way that, when an app using
CP tries to display character A, the OS API will make sure that F
returns the glyph at code point 65 (and not code point 100)?!
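
For illustration, a rough sketch of that idea using a real encoding where 'A'
really does sit at a different code point than in Unicode (assumes EBCDIC
code page 37 is available on the machine):

using System.Text;

// In EBCDIC (IBM code page 37) the letter 'A' is code point 193 (0xC1),
// while in Unicode it is code point 65 (U+0041).
byte[] ebcdicA = { 0xC1 };
string s = Encoding.GetEncoding(37).GetString(ebcdicA);
// s[0] == 'A' (U+0041): the translation to Unicode happens before any
// font ever sees the text.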

Yes.  The characters for values 128 and above in Character Map are
mostly remapped to other UTF16 (i.e. .NET string) characters, not even
in the 128-255 range.
What do you mean by “ not even in the 128-255 range”?


I need to ask this:
* Does .NET use its own fonts, or those installed by the underlying OS,
or both?

* Are fonts in .NET represented by a class (there is a Font class in
.NET, but I think it only represents HTML fonts, not fonts in
general)?

[...]
So even if font F supports just coded character set C, we could still
use F with character encodings for coded character set D?! Font F
won’t “complain”, but if D has characters mapped to different code
points than C, then glyphs will be mapped to the wrong characters when
used with character encodings for D?

I guess that's one way to put it.  And in the case of .NET, the only
character encoding you get to use is UTF16.  So if your font only
supports some other coded character set than Unicode, you have to work
backwards from the coded character set it does support to find the UTF16
character that maps to the character in the font-supported coded
character set you want.

* Anyhow, are you saying that in .NET we have the ability to map a
particular Unicode code point (say Unicode character value 100) to a
particular font code point? Thus, if font F has a glyph representing
character A at code point 100, while Unicode has character A at code
point 65, then I could somehow map Unicode’s code point 65 to the
font’s code point 100?

What you _can_ do is use the existing character encoding support to
determine the Unicode character that corresponds to a character of
interest in some other encoding.  See the System.Text.Encoding class.
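
For illustration, a rough sketch of that lookup (windows-1251 is just an
arbitrary single-byte encoding picked as an example):

using System.Text;

// Byte 0xC0 in the windows-1251 code page is the Cyrillic capital А (U+0410).
// GetString() hands back the proper UTF-16 System.String for it.
byte[] legacyBytes = { 0xC0 };
string s = Encoding.GetEncoding("windows-1251").GetString(legacyBytes);
// s is now "\u0410" and can be passed to any .NET text API.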

So, if you know the character in the specific coded character set that
the font is supporting, and you have a character encoding for which that
character is known

What do you mean by “for which that character is known”
(and as I mentioned, I believe in pretty much
everything except Unicode, these are the same things), you can pass that
character to the appropriate Encoding class to get back a proper
System.String corresponding to that character.

at the end you wind up with a System.String instance that
you can then pass to the .NET API. After that, it's out of your hands
and you have to rely on .NET to handle the mapping back from the
System.String instance to the appropriate encoding the font supports.
I assume you’re describing a situation where I want a .NET app to first
read text from a file some_File (which uses encoding some_Encoding for
coded character set some_CCS) and then display this text using font
some_Font (some_Font only supports some_CCS)? Thus, the code would be
something like the following:

// needs: using System.IO; using System.Text;
FileStream fStream = new FileStream("some_File", FileMode.Open, FileAccess.Read);
StreamReader readStream = new StreamReader(fStream,
    Encoding.GetEncoding("some_Encoding"));

// ReadLine() returns null at end of file, so accumulate line by line.
string msg = "";
string line;
while ((line = readStream.ReadLine()) != null)
    msg += line;

* But then what? What method (call it some_Method) needs to be called,
which in turn would somehow map msg’s UTF-16 code points to font
some_Font?


Or more likely, you'll just do all of your text in the character
encoding that goes with the font you are trying to use, and then use the
appropriate Encoding class to convert to System.String all at once.
This is more awkward, because there's no .NET support for manipulating
strings not in UTF16, other than as raw byte arrays.  But if you're just
dealing with pre-generated text, this might not be so bad.


You mean I could instead do the following, or are you implying
something else:

// needs: using System.IO; using System.Collections.Generic;
FileStream fStream = new FileStream("some_File", FileMode.Open, FileAccess.Read);

// ReadByte() returns -1 (an int) at end of stream, so test before casting to byte.
int crt;
List<byte> byte_Msg_List = new List<byte>();
while ((crt = fStream.ReadByte()) != -1)
    byte_Msg_List.Add((byte)crt);

byte[] byteMsg = byte_Msg_List.ToArray();

Once byteMsg is populated, I could manipulate it and, when I’m done,
convert it into a string (via
Encoding.GetEncoding("some_Encoding").GetString(byteMsg) ) and then
pass this string as a parameter into some_Method?


BTW - if .NET also allowed us (I realize it doesn't) to map to a font
other encodings besides UTF-16, then I assume we could just pass
byteMsg into some_Method and inform it that the data in byteMsg is
already in some_Encoding format?
No, I don't think it's restricting anything.  It's just taking the input
System.String, which is UTF16, and translating that to the character
encoding supported by the font, so that it can draw the characters with
that font.  A few values above 128 do in fact map to the same
characters, and so ...
Assuming Marlett is a character encoding for coded character set M, and
assuming M contains 255 characters, then aren’t all characters
represented by M also represented by Unicode, only at different code
points? If so, then why wouldn’t some_Method be able to map all of M’s
glyphs to the correct UTF-16 characters?

After all, isn’t (conceptually speaking) the job of some_Method to
take a code page CP (where, for example, code point 100 represents
character A) and map it to font F (which only supports Unicode) in
such a way that, when a .NET app using CP tries to display character A,
some_Method will make sure that F returns the glyph at code point 65
(and not code point 100)?!

Most of the glyphs below 128 are just empty squares. Why?


thank you
 

Jeff Johnson

[Argh, why do people use quoted-printable?!]
I know fonts from the same font family share a design, but are you
saying that in .NET fonts aren't represented as fonts, but only as a
font family? Thus, .NET doesn't distinguish between fonts, but only
between font families? If so, why?

What is your definition of "font"? Technically, a font is a SPECIFIC
INSTANCE of a font family (e.g., Times New Roman) and any specific styles,
such as bold, italic, etc., and font size. In other words, Times New Roman
regular 12pt and Times New Roman regular 10pt are two different fonts even
though they differ in nothing but size.

In ye olden days, computers used to need one font file per size (as well as
per unique combination of style) because the fonts were pre-rendered. Adobe
and TrueType fonts came about to allow computers to scale fonts to a given
size, requiring then that only style combinations (regular, italic, bold,
bold italic) needed to have separate files, and even then it was possible to
"fake it."

So now Windows has files like Arial, Arial Italic, Arial Bold, and Arial
Bold Italic. Anything you need to do with Arial you can do from these 4 font
files.

Hmmm, perhaps I understood your question wrong. Let me answer it a different
way. I hope one of these is the answer you're looking for. When you create a
Font object in .NET, you specify all the needed characteristics of the font.
So, going back to the first example, if you need Times New Roman regular
12pt and Times New Roman regular 10pt then you have to create two Font
objects. Were you thinking Pete was telling you that this was not the case?
It is not and he was not.
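
For illustration, a rough sketch of that (assumes System.Drawing is
referenced):

using System.Drawing;

// Two different fonts from the same family: same face and style,
// different size.
Font tnr12 = new Font("Times New Roman", 12f, FontStyle.Regular);
Font tnr10 = new Font("Times New Roman", 10f, FontStyle.Regular);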
 

Jeff Johnson

However, note that the Marlett font is a special-purpose font, for
drawing UI (and in particular, it was used in very early versions of
Windows).

Really? I'd never heard of it until Windows 95. Did it vanish from 3.x and
make a reappearance?
 

Jeff Johnson

I consider Windows 95 a "very early version of Windows". :p

I don't really know when it first showed up. But as far as I'm concerned,
any of the pre-NT versions of Windows are "very early". That was, after
all, 15 years ago, when Windows was less than half the age it is today.

To me, "very early" is before its popularity soared, and that was Windows
3.0. Depends on how much of a Windows old fart you are, I guess. I started
seriously with 3.1, but I had fleeting experience with 3.0. To me, Windows
95 was the beginning of "the next generation," so I see it as new. I still
think its basic interface is good enough, and that Vista and Windows 7's UIs
are different for the sake of being different, as opposed to being
significantly better.

<Insert obligatory "get the hell off my lawn" quote here>
 

klem s

.NET does distinguish between fonts, but as Jeff says, a font is a
specific instance of a font family, with style, size, etc. already chosen.

All of your questions seem to be about the font file itself, which in
.NET is represented as a font family, not an individual font.
In web forms all font families (and thus fonts) are represented via a
single class, FontInfo. But I thought that perhaps .NET also has
classes representing specific font families (class Arial, class
Verdana etc.)?

.NET does distinguish between different fonts, but not in a way that is
relevant to the questions you've asked, hence the distinction I've made
here.
How does it distinguish between different fonts ( this is more of a
“what’s happening under the hood” type of question )?

As I wrote, after you convert to a System.String (which is what in your
code example the StreamReader class does for you), it's out of your
hands. There is no "some_Method" that needs to be called that maps
UTF16 to the font supporting "some_CCS".

Instead, you just use the normal text-display API and let .NET worry
about it for you.

Assuming you provide .NET with valid characters — that is, characters
that _can_ be represented in the coded character set supported by the
font file — then .NET should just handle that for you. ....

As I mentioned, there is no "some_Method". You just use the normal text
output API, once your text data is in the .NET string format.

And yes, assuming you are using the characters that the font supports,
.NET should transparently handle the translation from Unicode to
whatever code page the font is supporting.
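
For illustration, a rough sketch of what "just use the normal text output
API" might look like in a Windows Forms Paint handler (the family name and
the msg variable are placeholders, not anything specific from this thread):

using System.Drawing;

// e is the PaintEventArgs of a Paint event handler. .NET takes the UTF-16
// string and handles any mapping to the font's supported character set.
using (Font f = new Font("Some Font Family", 12f))
{
    e.Graphics.DrawString(msg, f, Brushes.Black, new PointF(10f, 10f));
}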

I’m learning ASP.NET, so I’m primarily interested in how things work
there. Anyhow, with web forms you can specify which font a control or
page should use via the Font property (which returns a FontInfo object)
or via CSS. But these two options won’t help if I want the web form to
render the page using some font file of which the OS knows nothing.
Thus I thought that perhaps .NET allows a “more direct” way (via
some_Method or something similar) of specifying which font file to use.




[Argh, why do people use quoted-printable?!]
Not sure if this is directed at me or even what you’re trying to
say : ).
What is your definition of "font"? Technically, a font is a SPECIFIC
INSTANCE of a font family (e.g., Times New Roman) and any specific styles,
such as bold, italic, etc., and font size. In other words, Times New Roman
regular 12pt and Times New Roman regular 10pt are two different fonts even
though they differ in nothing but size.

In ye olden days, computers used to need one font file per size (as well as
per unique combination of style) because the fonts were pre-rendered. Adobe
and TrueType fonts came about to allow computers to scale fonts to a given
size, requiring then that only style combinations (regular, italic, bold,
bold italic) needed to have separate files, and even then it was possible to
"fake it."
And now one font file can represent several different fonts (aka
several specific instances of a font family)?


thank you
 

klem s

“The following table lists the supported encodings and their
associated code pages. An asterisk in the last column indicates that
the code page is natively supported by the .NET Framework, regardless
of the underlying platform.”

“The GetEncoding method relies on the underlying platform to support
most code pages. However, the .NET Framework natively supports some
encodings.”

I’m assuming the term “natively” refers to code pages implemented by
the .NET environment itself, of which the OS knows nothing?!
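
For illustration, a rough sketch that lists what GetEncoding can actually
return on a given machine:

using System;
using System.Text;

// Lists every code page / encoding that Encoding.GetEncoding() can return
// on this particular machine (framework-supplied plus platform-supplied).
foreach (EncodingInfo info in Encoding.GetEncodings())
    Console.WriteLine("{0}\t{1}", info.CodePage, info.Name);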


A System.Drawing.Font instance, for example, will be a specific size and
style of a given font family.
Then it’s similar to the FontInfo class, where a specific font is also
specified by setting the Bold, Italic and Size properties.

BTW – FontInfo.Name sets the primary font name. I assume the primary
font name is basically a font family name?
Then you should post your questions in the ASP.NET newsgroup.
I just wanted to get the broader picture of how encodings and fonts
work together

thanx Pete
 
