Checking character - problem in non-English languages?

Jon · Mar 7, 2008

I'm checking a character in a string for whitespace (spacebar or tab) or comment character (I use #
for this, signifying that anything following it is a comment so should be ignored). My C# code is:

string str = ????
int ch = ????   // character position in string
if (str[ch] == ' ' || str[ch] == '\t' || str[ch] == '#') {
    // delimiter found
    ????
}

It's what I've been using for many years (mainly in C - I've recently converted it to C#). It's
occurred to me that there might be problems with the above for non-English languages.

For instance, I know that in MS Word there are different types of spacebar (eg non-breaking space)
and also different lengths of spacebar (eg em-space, en-space). I wondered if these also exist in
Unicode, and if I should be checking for them. The same goes for tab and #.

If there are problems, how can I fix the above code?

Jon Skeet [C# MVP] · Mar 7, 2008

Well, what's the actual context here? If it's a plain text file, it's
likely to just contain a normal space. I wouldn't worry about that.

On the other hand, there's always Char.IsWhiteSpace.
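
A minimal sketch of how the original check might look with Char.IsWhiteSpace. The names DelimiterCheck and IsDelimiter and the sample line are made up for illustration. Note that Char.IsWhiteSpace also returns true for line breaks and for "fancy" spaces such as the non-breaking space (U+00A0), which may or may not be what you want in a configuration file.

```csharp
using System;

static class DelimiterCheck
{
    // Hypothetical helper: true if the character at `index` is whitespace
    // (as defined by Char.IsWhiteSpace) or the comment character '#'.
    public static bool IsDelimiter(string text, int index)
    {
        char c = text[index];
        return char.IsWhiteSpace(c) || c == '#';
    }

    static void Main()
    {
        // Assumed sample: a non-breaking space after "key", then ordinary spaces.
        string line = "key\u00A0value # comment";
        for (int i = 0; i < line.Length; i++)
        {
            if (IsDelimiter(line, i))
            {
                Console.WriteLine("Delimiter U+{0:X4} at index {1}", (int)line[i], i);
            }
        }
    }
}
```

If only the plain space and tab should count, the original three-way comparison is still the right check; the sketch above is the more permissive variant.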

Jon · Mar 7, 2008

Thanks for your reply Jon,

It is a normal text file - a sort of a configuration file, although I can't guarantee how the end
user will generate it (eg Notepad or maybe a different editor).

Are you implying that normal text files use 1-byte characters rather than unicode characters (which
I assume are 2-byte)?

Thanks for the tip on using Char.IsWhiteSpace.

Jon

Jon Skeet [C# MVP] · Mar 7, 2008

Thanks for your reply Jon,

It is a normal text file - a sort of a configuration file, although
I can't guarantee how the end
user will generate it (eg Notepad or maybe a different editor).

Are you implying that normal text files use 1-byte characters rather
than unicode characters (which I assume are 2-byte)?

No - I'm implying that tools like notepad usually won't generate
"fancy" whitespace, regardless of which encoding they save the file in.

If you're confronted with a Word document, that might have different
kinds of spaces in - but in a plaintext document you're *likely* to
just have normal spaces.

Jon · Mar 7, 2008

OK, I understand - thanks.

Now the following is the bit that concerns me, probably because my knowledge of Unicode isn't that
great.

I assume that Unicode is effectively lots of code pages, most of which are 256 bytes in length. So,
0-255 is US English, 256-511 is some other language, etc. Some of these won't even be Latin-based.
Some (eg Chinese) will use more than 256 characters.

If so, then presumably there are lots of space characters in Unicode. Perhaps even lots of #
characters in Unicode. Let's say that French is 256-511 (I've no idea where it really is). Is there
a # in US English, and one also in French? Or is there only one # in the entire Unicode.

If someone in France, with their PC set up with French locale, writes a normal text file, then gives
it to me in the UK which I use with my PC set to UK English locale, will the French person's #
character be the same character as the one that I'm checking for?

Jon

Jon Skeet [C# MVP] · Mar 7, 2008

OK, I understand - thanks.

Now the following is the bit that concerns me, probably because my knowledge of Unicode isn't that
great.

I assume that Unicode is effectively lots of code pages, most of which are 256 bytes
in length. So, 0-255 is US English, 256-511 is some other language, etc.

Not really...

If so, then presumably there are lots of space characters in Unicode. Perhaps even lots of #
characters in Unicode. Let's say that French is 256-511 (I've no idea where it really is). Is there
a # in US English, and one also in French? Or is there only one # in the entire Unicode?

No, that's not the way it works.

See http://pobox.com/~skeet/csharp/unicode.html for an overview of
Unicode, and the difference between an *encoding* and a character set.

Basically whatever encoding the file is in, you'll end up with Unicode
strings in memory when you read the file in - it's up to you to
specify the encoding.
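
As a rough illustration of that point, the sketch below writes the same text to a temporary file in several encodings and reads it back, naming the encoding each time. EncodingDemo and RoundTrip are hypothetical names and the sample text is an assumption; File.WriteAllText and File.ReadAllText are the standard System.IO calls.

```csharp
using System;
using System.IO;
using System.Text;

static class EncodingDemo
{
    // Writes `text` to a temporary file using `encoding`, then reads it back
    // with the same encoding and returns the resulting in-memory string.
    public static string RoundTrip(string text, Encoding encoding)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllText(path, text, encoding);
            return File.ReadAllText(path, encoding);
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        string original = "name = value # comment";
        Encoding[] encodings = { Encoding.ASCII, Encoding.UTF8, Encoding.Unicode };
        foreach (Encoding encoding in encodings)
        {
            string loaded = RoundTrip(original, encoding);
            // The bytes on disk differ per encoding, but the loaded string does not.
            Console.WriteLine("{0}: same string = {1}, '#' at index {2}",
                encoding.WebName, loaded == original, loaded.IndexOf('#'));
        }
    }
}
```

Once the text is in a string, comparisons such as `c == ' '` or `c == '#'` do not depend on how the file was stored.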

When you've loaded the file, only " " means space. There are various
different kinds of spaces (non-breaking, wide etc) but only one
"normal" one, U+0020.

Jon

Martin Bonner · Mar 7, 2008

OK, I understand - thanks.

Now the following is the bit that concerns me, probably because my
knowledge of Unicode isn't that great.

I assume that Unicode is effectively lots of code pages, most of which
are 256 bytes in length.

No. It is *one* code space of just over 1.1 million code points
(U+0000 to U+10FFFF). The code points are divided into different
regions (Latin letters, Greek, Hangul, etc).

So, 0-255 is US English, 256-511 is some other language,
etc. Some of these won't even be Latin-based.
Some (eg Chinese) will use more than 256 characters.

If so, then presumably there are lots of space characters in Unicode.

Absolutely not. There is space, and then a few other white space
characters (non-breaking, thin, etc).

Perhaps even lots of # characters in Unicode. Let's say that French is
256-511 (I've no idea where it really is). Is there
a # in US English, and one also in French? Or is there only one # in the
entire Unicode?

Just the one in the entire Unicode. French is written with a
combination of Basic Latin (aka ASCII) and Latin-1 supplement.

If someone in France, with their PC set up with French locale, writes
a normal text file, then gives it to me in the UK which I use with my PC
set to UK English locale, will the French person's #
character be the same character as the one that I'm checking for?

Yes.
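
A small sketch that may make this concrete: it encodes "#" with a few encodings, prints the bytes, and decodes them again. HashDemo and DecodedHashCodePoint are hypothetical names; the list of encodings is just an assumed sample of what a French or UK machine might plausibly use.

```csharp
using System;
using System.Text;

static class HashDemo
{
    // Encodes "#" to bytes with `encoding`, decodes the bytes again,
    // and returns the numeric value of the resulting character.
    public static int DecodedHashCodePoint(Encoding encoding)
    {
        byte[] bytes = encoding.GetBytes("#");
        string decoded = encoding.GetString(bytes);
        return decoded[0];
    }

    static void Main()
    {
        Encoding[] encodings =
        {
            Encoding.ASCII,
            Encoding.UTF8,
            Encoding.Unicode,                    // UTF-16, little-endian
            Encoding.GetEncoding("iso-8859-1")   // Latin-1
        };
        foreach (Encoding encoding in encodings)
        {
            byte[] bytes = encoding.GetBytes("#");
            Console.WriteLine("{0}: bytes {1} -> U+{2:X4}",
                encoding.WebName,
                BitConverter.ToString(bytes),
                DecodedHashCodePoint(encoding));
        }
    }
}
```

The byte sequences can differ between encodings, but every one of them decodes to the same character, which is why a single `== '#'` comparison is enough.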

Jon · Mar 10, 2008

Thanks Jon, that has cleared up my misconceptions of Unicode. Sorry for the late reply (just moved
house, no internet yet so can't reply weekends).

I started to look through your link on Friday and will continue today.

Jon

Jon · Mar 10, 2008

Thanks Martin, it's now clear to me.

Jon
