Checking character - problem in non-English languages?

  • Thread starter: Jon

Jon

I'm checking a character in a string for whitespace (spacebar or tab) or comment character (I use #
for this, signifying that anything following it is a comment so should be ignored). My C# code is:

string str = ????;
int ch = ????; // character position in string
if (str[ch] == ' ' || str[ch] == '\t' || str[ch] == '#') {
    // delimiter found
    ????
}

It's what I've been using for many years (mainly in C - I've recently converted it to C#). It's
occurred to me that there might be problems with the above for non-English languages.

For instance, I know that in MS Word there are different types of space (e.g. non-breaking space)
and also different widths of space (e.g. em space, en space). I wondered whether these also exist in
Unicode, and whether I should be checking for them. The same goes for tab and #.

If there are problems, how can I fix the above code?
 
Well, what's the actual context here? If it's a plain text file, it's
likely to just contain a normal space. I wouldn't worry about that.

On the other hand, there's always Char.IsWhiteSpace.
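
As a rough sketch of how Char.IsWhiteSpace could replace the hand-written comparisons: the class and method names below (DelimiterCheck, IsDelimiter, IndexOfDelimiter) are made up for illustration and are not from the original code. Note that Char.IsWhiteSpace is broader than the original test; it also accepts line breaks and the non-breaking, en and em spaces.

```csharp
using System;

public static class DelimiterCheck
{
    // Hypothetical helper: true for the comment character '#' or for any
    // character that Char.IsWhiteSpace classifies as white space.
    public static bool IsDelimiter(char c)
    {
        return c == '#' || char.IsWhiteSpace(c);
    }

    // Returns the index of the first delimiter at or after 'start',
    // or -1 if the rest of the string contains none.
    public static int IndexOfDelimiter(string str, int start)
    {
        for (int i = start; i < str.Length; i++)
        {
            if (IsDelimiter(str[i]))
            {
                return i;
            }
        }
        return -1;
    }
}
```

If only space and tab should count, the original comparison against ' ' and '\t' is still the stricter choice.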
 
Thanks for your reply Jon,

It is a normal text file - a sort of a configuration file, although I can't guarantee how the end
user will generate it (eg Notepad or maybe a different editor).

Are you implying that normal text files use 1-byte characters rather than Unicode characters (which
I assume are 2-byte)?

Thanks for the tip on using Char.IsWhiteSpace.

Jon



No - I'm implying that tools like Notepad usually won't generate
"fancy" whitespace, regardless of which encoding they save the file in.

If you're confronted with a Word document, that might have different
kinds of spaces in it - but in a plain text document you're *likely* to
just have normal spaces.
 
OK, I understand - thanks.

Now the following is the bit that concerns me, probably because my knowledge of Unicode isn't that
great.

I assume that Unicode is effectively lots of code pages, most of which are 256 bytes in length. So,
0-255 is US English, 256-511 is some other language, etc. Some of these won't even be Latin-based.
Some (eg Chinese) will use more than 256 characters.

If so, then presumably there are lots of space characters in Unicode. Perhaps even lots of #
characters in Unicode. Let's say that French is 256-511 (I've no idea where it really is). Is there
a # in US English, and one also in French? Or is there only one # in the whole of Unicode?

If someone in France, with their PC set up with French locale, writes a normal text file, then gives
it to me in the UK which I use with my PC set to UK English locale, will the French person's #
character be the same character as the one that I'm checking for?

Jon


I assume that Unicode is effectively lots of code pages, most of which are 256 bytes
in length. So, 0-255 is US English, 256-511 is some other language, etc.

Not really...

If so, then presumably there are lots of space characters in Unicode. Perhaps even lots of #
characters in Unicode. Let's say that French is 256-511 (I've no idea where it really is). Is there
a # in US English, and one also in French? Or is there only one # in the whole of Unicode?

No, that's not the way it works.

See http://pobox.com/~skeet/csharp/unicode.html for an overview of
Unicode, and the difference between an *encoding* and a character set.

Basically whatever encoding the file is in, you'll end up with Unicode
strings in memory when you read the file in - it's up to you to
specify the encoding.
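
To illustrate the point about encodings, here is a small sketch that decodes the same line from two different byte encodings. The class name EncodingDemo and the sample text "café # note" are assumptions chosen for the example; only the System.Text.Encoding calls are real API.

```csharp
using System;
using System.Text;

public static class EncodingDemo
{
    // Decodes raw file bytes into a string using the encoding the caller names.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    // Hypothetical sample line "café # note", encoded as UTF-8.
    public static byte[] SampleAsUtf8()
    {
        return Encoding.UTF8.GetBytes("caf\u00E9 # note");
    }

    // The same sample line, encoded as ISO-8859-1 (Latin-1).
    public static byte[] SampleAsLatin1()
    {
        return Encoding.GetEncoding("iso-8859-1").GetBytes("caf\u00E9 # note");
    }
}
```

Decoded with the matching encoding, both byte arrays give the same string, and the # in it compares equal to '#'. When reading from disk, File.ReadAllText(path, encoding) or a StreamReader constructed with an encoding does the same decoding step.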

When you've loaded the file, only " " means space. There are various
different kinds of spaces (non-breaking, wide etc) but only one
"normal" one, U+0020.

Jon
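
For reference, a sketch listing a few of the space characters that Unicode defines, next to the ordinary one. The ordinary space has decimal value 32, which is 20 in hexadecimal, so its code point is written U+0020. The class SpaceKinds and its member names are invented for this example.

```csharp
using System;

public static class SpaceKinds
{
    // A few of the space characters defined by Unicode.
    public const char Space = '\u0020';            // the ordinary space
    public const char NoBreakSpace = '\u00A0';
    public const char EnSpace = '\u2002';
    public const char EmSpace = '\u2003';
    public const char IdeographicSpace = '\u3000';

    // True only for the ordinary space, like the comparison str[ch] == ' '.
    public static bool IsPlainSpace(char c)
    {
        return c == Space;
    }

    // Formats a character as a code point string such as "U+0020".
    public static string CodePoint(char c)
    {
        return "U+" + ((int)c).ToString("X4");
    }
}
```

All five of these characters are reported as white space by Char.IsWhiteSpace, whereas IsPlainSpace accepts only the first.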
 
I assume that Unicode is effectively lots of code pages, most of which
are 256 bytes in length.

No. It is *one* code space, with room for a little over 1.1 million
code points (U+0000 to U+10FFFF). The code space is divided into
different blocks (Latin letters, Greek, Hangul, etc).

So, 0-255 is US English, 256-511 is some other language,
etc. Some of these won't even be Latin-based.
Some (eg Chinese) will use more than 256 characters.

If so, then presumably there are lots of space characters in Unicode.

Absolutely not. There is space, and then a few other white space
characters (non-breaking, thin, etc).

Perhaps even lots of # characters in Unicode. Let's say that French is
256-511 (I've no idea where it really is). Is there
a # in US English, and one also in French? Or is there only one # in the
entire Unicode.

Just the one, U+0023 NUMBER SIGN, apart from a couple of rarely used
compatibility variants such as the fullwidth form. French is written
with a combination of Basic Latin (aka ASCII) and Latin-1 Supplement.

If someone in France, with their PC set up with French locale, writes
a normal text file, then gives it to me in the UK which I use with my PC
set to UK English locale, will the French person's #
character be the same character as the one that I'm checking for?

Yes.
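
A tiny sketch of the two blocks named above, to show where # and an accented French letter sit. The class LatinBlocks and the method BlockOf are hypothetical, and anything outside the two ranges is simply reported as "Other" here.

```csharp
using System;

public static class LatinBlocks
{
    // Names the Unicode block for the two ranges discussed in the thread.
    // Basic Latin is U+0000 to U+007F; Latin-1 Supplement is U+0080 to U+00FF.
    public static string BlockOf(char c)
    {
        if (c <= '\u007F') return "Basic Latin";
        if (c <= '\u00FF') return "Latin-1 Supplement";
        return "Other";
    }
}
```

With this, '#' (U+0023) falls in Basic Latin and 'é' (U+00E9) in Latin-1 Supplement, regardless of the locale of the PC that wrote the file.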
 
Thanks Jon, that has cleared up my misconceptions of Unicode. Sorry for the late reply (just moved
house, no internet yet so can't reply weekends).

I started to look through your link on Friday and will continue today.

Jon


 
Thanks Martin, it's now clear to me.

Jon
