Euro symbol, ISO-8859-1, and XML

Guest

I am having problems reading an XML file which contains Euros represented as character 128.

The xml header is: <?xml version="1.0" encoding="iso-8859-1"?>

and character 128 is meant to be the euro symbol in this character set, and in ANSICodePage 1252 (which is the active code page on my Windows XP machine).

Using the XmlTextReader to read an element value containing a single euro symbol (character 128) produces a string which displays as a square block in VB .NET. The AscW() function returns 128 for the character, while the Asc() function returns 63 (a question mark!).

Is there something I've misunderstood about encoding, or does this seem like a problem in the XmlTextReader?
 
Jon Skeet [C# MVP]

John Bown said:
I am having problems reading an XML file which contains Euros represented as character 128.

The xml header is: <?xml version="1.0" encoding="iso-8859-1"?>

and character 128 is meant to be the euro symbol in this character
set, and in ANSICodePage 1252 (which is the active code page on my
Windows XP machine).

It is in Cp1252, but it isn't in ISO-8859-1.
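
A quick way to see the difference is to decode the same byte under both encodings. This is only a sketch in Python rather than VB .NET, and it relies on Python's "iso-8859-1" codec, which follows the IANA mapping where bytes 128 to 159 are C1 control codes:

```python
# The same byte, 0x80 (decimal 128), decoded under the two encodings discussed.
raw = bytes([128])

as_cp1252 = raw.decode("cp1252")      # Windows code page 1252
as_latin1 = raw.decode("iso-8859-1")  # what the XML declaration claims

print(hex(ord(as_cp1252)))  # 0x20ac, the euro sign
print(hex(ord(as_latin1)))  # 0x80, a C1 control character, not the euro
```

That U+0080 result would match the 128 you saw from AscW().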

From http://www.stanford.edu/~laurik/fsmbook/faq/utf8.html:

<quote>
There is no Euro symbol in the part of Unicode that corresponds to ISO-
8859-1.
</quote>

And from http://www.cs.tu-berlin.de/~czyborra/charsets/ (the section on
Latin1):

<quote>
The lack of the new C=-resembling Euro currency symbol U+20AC has
opened the discussion of a new Latin0.
</quote>

You may also find http://www.cs.tut.fi/~jkorpela/chars.html#latin1
useful.

Using the XmlTextReader to read an element value containing a single
euro symbol (character 128) produces a string which displays as a
square block in VB .NET. The AscW() function returns 128 for the
character, while the Asc() function returns 63 (a question mark!).

To be honest, I've never been *entirely* sure what Asc and AscW do
*exactly* (partly due to being a C# programmer) - but personally I'd
avoid them. Cast the character to an integer to find out its Unicode
value as a number - that'll show you what the string *really* contains.
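
As a rough illustration (again in Python, not VB .NET), this looks at the code point directly and then tries to push the character back through code page 1252. The second step rests on an assumption: that Asc() converts the character to the current ANSI code page, which would explain the 63. The variable names are just for illustration.

```python
# Stand-in for the string the reader returned: byte 128 decoded as ISO-8859-1.
text = bytes([128]).decode("iso-8859-1")

# Numeric Unicode value of the character, comparable to what AscW() reported.
code_point = ord(text)

# Assumption: Asc() maps the character back to the ANSI code page (1252 here).
# U+0080 has no cp1252 encoding, so a replacement "?" comes out instead.
ansi_byte = text.encode("cp1252", errors="replace")

print(code_point)    # 128
print(ansi_byte[0])  # 63, the code for "?"
```
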
Is there something I've misunderstood about encoding, or does this
seem like a problem in the XmlTextReader?

Nope, just a problem that the encoding you're using doesn't contain the
character you want. I'd suggest using UTF-8, personally.
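
For what it's worth, here is a minimal sketch of the UTF-8 route using Python's standard XML parser. The <price> element is a made-up example, not taken from your file:

```python
import xml.etree.ElementTree as ET

# Hypothetical document: a <price> element whose text is a single euro sign.
xml_text = '<?xml version="1.0" encoding="utf-8"?><price>\u20ac</price>'

# Encoding as UTF-8 turns the euro into the three bytes E2 82 AC.
xml_bytes = xml_text.encode("utf-8")

# The parser honours the declared encoding and hands back U+20AC.
price = ET.fromstring(xml_bytes).text

print(b"\xe2\x82\xac" in xml_bytes)  # True
print(hex(ord(price)))               # 0x20ac
```

The important part is that the declared encoding and the bytes actually written agree with each other.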
 
Guest

Thanks for the very prompt feedback.

The basic question then is 'Is the euro symbol in ISO-8859-1, or isn't it?' For every (oldish) article that says it isn't, I managed to find a (newish) article that says it is.

One article I found (2003): http://www.bcs.org.uk/siggroup/nalatran/wiggd.htm states:

.......This has been achieved for Western European countries (and others) by allocating characters to those values in a byte from 160 to 255 (Values from 128 to 159 were not allocated initially as it was thought they might cause a clash with the corresponding control characters in the basic ASCII character set from 0 to 31. However, 128 has now been allocated to the Euro sign, €). These are encapsulated in the ISO 8859-1 (Latin 1) standard character set in what is sometimes referred to as the upper half of the set as shown below.

The Euro sign, which is a relatively recent introduction, has now also been included officially in ISO 8859-15, an alternative Western European character set, with the value of 164, so there are now two values for the Euro, 128 for ISO 8859-1 and 164 for ISO 8859-15. If your keyboard has a Euro symbol on it what code you get may depend on the character set you are using. To be certain, use ISO 8859-15, otherwise you will need to experiment with ISO 8859-1.....

So what's the real answer? Microsoft articles tend to deny the existence of character 128 in ISO 8859-1. Without forking out good money to actually buy the standard from ISO, is there a more definitive check?
 
Jon Skeet [C# MVP]

John Bown said:
Thanks for the very prompt feedback.

The basic question then is 'Is the euro symbol in ISO-8859-1, or
isn't it?' For every (oldish) article that says it isn't, I managed
to find a (newish) article that says it is.

I find it hard to believe that ISO-8859-1 itself has changed - after
all, who's going to tell all the systems which are using ISO-8859-1 and
were developed before the change?

It at least seems reasonable to suggest that assuming that the Euro
character *is* in ISO-8859-1 is going to cause problems for some
systems.
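
To make that concrete, here is a small Python sketch of the mismatch, assuming a sender that writes with Windows-1252 and a receiver that trusts the ISO-8859-1 label. It also checks where ISO-8859-15 puts the euro:

```python
euro = "\u20ac"

# A sender that writes the euro with Windows-1252 produces byte 128...
sent = euro.encode("cp1252")

# ...and a receiver that trusts an ISO-8859-1 label gets U+0080 instead.
received = sent.decode("iso-8859-1")

# ISO-8859-15 does define the euro, at byte 164 (0xA4).
latin9 = euro.encode("iso-8859-15")

# A strict ISO-8859-1 encoder cannot represent the euro at all.
try:
    euro.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(sent[0], hex(ord(received)), latin9[0], latin1_ok)  # 128 0x80 164 False
```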

Put it this way: if someone suddenly said that US-ASCII was changing,
how many systems do you think would take that change seriously?

Hmm... the fact that it uses the term "Extended ASCII" puts me off it
to start with. On the other hand, the rest seems reasonable, and the
site location itself suggests an air of authority.

Of course, you could always email the author and ask for some evidence
of this (seemingly unlikely, IMO) change.

So what's the real answer? Microsoft articles tend to deny the
existence of character 128 in ISO 8859-1. Without forking out good
money to actually buy the standard from ISO, is there a more
definitive check?

Unfortunately not, as far as I can see - and 61 CHF is quite a lot of
money to spend just to get a standard! To my mind, the idea of an
internet standard which isn't free and open is somewhat bizarre, but
there we go.

Interestingly the ISO store only lists it as a 1998 document, which is
pretty old...

All told it's a strange situation.
 
