Asc code question

Bill

In the Courier New font, ASCII code 166 is supposed to be a solid
"blob". I determined that by inserting the symbol in Word, then
copying and pasting that character into a table field in Access and
displaying it via MsgBox Asc(fieldname). My intent was to set
similar table fields to Chr(166), where the text box's font
property is Courier New and the control is bound to such fields.
However, what I get is a vertical bar with a break in its middle.
When I use MsgBox to display that field as Asc(fieldname), I get
166. That is, the Asc function "tells" me the blob character is
166, but when I try to use Chr(166) I don't get the blob.

What am I missing?

Bill
 
AscW(fieldname of the field containing the blob) returns 9608.
Yet if I try to set a field using Chr(9608), I get an invalid
argument error. Using ChrW(9608), I get the same character
displayed as with Chr(166), as reported earlier. Using ChrB(9608),
I get an overflow. Never having dealt with Unicode before, I'm not
sure whether that is where my problem lies. Anyway, any ideas
would be appreciated.

Thanks,
Bill
 
Hi Bill,

In the ordinary "Western" Windows character set, 166 is the "broken bar"
character in all normal fonts including Courier New. Character positions
beyond 255 don't exist because this is a one-byte (8-bit) character set.

On machines with "Western" regional settings, the presence of
characters beyond 255 (hex 0x0100 and up) indicates a Unicode font
and Unicode characters.
9608 is Unicode U+2588 (i.e. 0x2588), the "full block" character.

Provided you're using a recent version of Access and the font you're
using in Access includes the characters you need, things usually work
just fine. If you're using Access 97 you're stuck, because it doesn't
support Unicode.

Possibly all that's happening is that you're using Asc() where you
should be using AscW(). The former is 8-bit only, and if you pass it a
Unicode character it will return the position of (hopefully) the nearest
equivalent character in the available 8-bit character set, e.g.
?Asc(ChrW(9608))
166

Likewise, Chr(9608) and ChrB(9608) fail because these too are 8-bit
functions; you need to use ChrW().
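
To see all of this in one place, here's a quick sketch you can
paste into a standard module (just an illustration of the
character functions, not your table setup):

Sub DemoBlock()
    Dim blob As String
    blob = ChrW(9608)            ' U+2588, the "full block"
    Debug.Print AscW(blob)       ' 9608 - the real Unicode code point
    Debug.Print Asc(blob)        ' 166  - nearest 8-bit equivalent
    Debug.Print Hex(AscW(blob))  ' 2588 - the same code point in hex
    ' Chr(9608) raises "Invalid procedure call or argument",
    ' because Chr() only accepts values 0-255.
End Sub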
 
Hi John,

Since I last posted, I had indeed discovered that I was dealing with
a Unicode font. Previously, it wasn't clear how current PC environments
handled "special" characters. In earlier days, such characters were
accessed on the mainframes using what was called codepages wherein
an 8-bit byte was used in accessing the codepage which was loaded
into the printer hardware when such characters were to be printed.

It would give me a complete understanding if you would explain your
notation "hex 0x0100". There has to be something inherent in the first
8 bits of the 16-bit Unicode that triggers recognition that a Unicode font
is present, unless internal string handling code operates differently than
I would imagine.

You've been a great help, thanks.

Bill
 
Bill,

The "0x" prefix is one of the common ways of indicating a hexadecimal
number; in Basic and VB/A code the equivalent is &H. So 0x0100 is
hexadecimal 100 or decimal 256.
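
You can check the notation in the Immediate window (&H is just
VB/A's spelling of the same prefix):

?&H100
256
?Hex(256)
100
?Hex(9608)
2588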

Code pages were and are used in the Windows world, but Unicode is
gradually superseding them. However, Unicode is not just a 16-bit
character set: it's a universal character set with multiple encodings.
UTF-16 is the commonest: it encodes almost all characters in two
8-bit bytes, and some rare ones (there are more than 65,536
different characters in the world's scripts) in (I think) four
8-bit bytes. There are also UTF-7, which uses between one and about
five 7-bit bytes per character, UTF-8, and UTF-32. And no, there's
nothing special about the first bits of
UTF-16 (although there are some byte sequences that Unicode reserves for
signalling at the start of a text file which encoding it uses).
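
Incidentally, you can see the two-bytes-per-character storage from
VB/A itself, which keeps its strings as UTF-16 internally: Len()
counts characters, LenB() counts bytes. In the Immediate window:

?Len("AB")
2
?LenB("AB")
4
?LenB(ChrW(9608))
2

The "full block" is still just one two-byte character as far as the
string is concerned.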

 
John,
Not to belabor the subject, but I'm curious how the text-processing
code recognizes a Unicode character when there's nothing inherent
in the first of the two bytes to signal that the second byte is the
second of an ordered pair. Unless, of course, the entire text file
is converted to Unicode if even one Unicode character is present.
Were that the case, text files would double in size for the sake of
maybe just a few special characters.

Bill


 
With text files, Unicode-aware editors look at the first bytes of the
file. If these are one of the sequences used to signal that it's a
Unicode file in a particular encoding, well and good. Otherwise, if
every other byte is 0x00 (null) and the computer's regional settings are
for a "roman" script, then the editor assumes that it's UTF-16. There
are probably more heuristics going on. For the gory details of Unicode,
see the official site, I think it's www.unicode.org.
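
If you want to poke at this yourself, here's a rough VB/A sketch of
the BOM check described above (the file path is a placeholder, and
it assumes the file is at least three bytes long):

Sub SniffBOM()
    Dim b(0 To 2) As Byte
    Dim f As Integer
    f = FreeFile
    Open "C:\temp\sample.txt" For Binary Access Read As #f
    Get #f, 1, b              ' read the first three bytes
    Close #f
    If b(0) = &HFF And b(1) = &HFE Then
        Debug.Print "UTF-16 little-endian BOM"
    ElseIf b(0) = &HFE And b(1) = &HFF Then
        Debug.Print "UTF-16 big-endian BOM"
    ElseIf b(0) = &HEF And b(1) = &HBB And b(2) = &HBF Then
        Debug.Print "UTF-8 BOM"
    Else
        Debug.Print "No BOM - time for heuristics"
    End If
End Sub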

And you're right: a UTF-16 file of "roman" script is double the size of
an ANSI file containing the exact same text. At least some versions
of Word intermingle 2-byte and 1-byte text fairly promiscuously inside
their binary .doc files, using some sort of tables and pointers to keep
track of what's where. Access 2000 onwards has a Unicode compression
setting on text and memo fields. I think this uses UTF-8 or something
similar internally, but the details aren't exposed.
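
If you ever create tables from code rather than in the table
designer, the same compression setting is exposed in Jet 4's SQL
dialect. A sketch, with a made-up table name, run through ADO
(which I believe the extended DDL requires):

CurrentProject.Connection.Execute _
    "CREATE TABLE tblDemo (Notes TEXT(100) WITH COMPRESSION)"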

 
Thanks for the expanded view. Sounds like things have become rather
convoluted.
Bill


 