Determining encoding (Thai)

John J. Hughes II

I have enabled Thai language support on English WinXP Pro SP1 via the IME.

I have an old application where I enter the Thai data on a form and then
save it to the SQL server as varchar. This works fine in and of itself.

Now I have a C# application that needs to read it. I get garbage.

I am trying to determine how to change the encoding from the one format to
the other and part of the problem is I don't know what the two formats are.

Any suggestions? I am pretty sure Excel wants UNICODE and it's in two byte
something.

///////////////////////////////
One of many versions of the code I have tried:

Encoding a = Encoding.UTF7;
Encoding u = Encoding.UTF8;

Decoder d = Encoding.UTF7.GetDecoder();

byte[] aBytes = a.GetBytes((string)dr["Location"]);
char[] c = new char[aBytes.Length];
d.GetChars(aBytes, 0, aBytes.Length, c, 0);

string uStr = new string(c);

oSheet.Cells[row, 8] = uStr;


Regards,
John
 
Jon Skeet [C# MVP]

John J. Hughes II said:
I have enabled Thai language support on English WinXP Pro SP1 via the IME.

I have an old application where I enter the Thai data on a form and then
save it to the SQL server as varchar. This works fine in and of itself.

Now I have a C# application that needs to read it. I get garbage.

I am trying to determine how to change the encoding from the one format to
the other and part of the problem is I don't know what the two formats are.

You shouldn't need to know anything about the database format - it
should be converted for you.

Any suggestions? I am pretty sure Excel wants UNICODE and it's in two byte
something.

That would suggest Encoding.Unicode for output.

///////////////////////////////
One of many versions of the code I have tried:

Encoding a = Encoding.UTF7;
Encoding u = Encoding.UTF8;

Decoder d = Encoding.UTF7.GetDecoder();

byte[] aBytes = a.GetBytes((string)dr["Location"]);
char[] c = new char[aBytes.Length];
d.GetChars(aBytes, 0, aBytes.Length, c, 0);

string uStr = new string(c);

oSheet.Cells[row, 8] = uStr;

Ah, if you're using an Excel COM object, you shouldn't need to worry
about that side either. Everything should be fine. If it's not, there's
a chance you've got corrupt data in your database - it's possible that
your old app is reading and writing in a bad way, which means that *it*
can read the data it has written, but the data isn't really stored in
the database correctly. The easiest way to check this is to use SQL
Query Analyzer - does *that* show the correct results?
 
John J. Hughes II

Hum, the problem seems to be reading the data... I guess it's likely that
the old application is doing this in an incorrect way. The compiler I was
using at the time seemed to do everything in an incorrect way.

From Query Analyzer:

¿Ë¡´àÒÊ ¿Ë¡´àÒÊ

As you can see it's unreadable. I am assuming this is stored as DBCS but I
am not sure how to tell. I know the Query Analyzer will display the data
correctly if I store it using UNICODE in an nvarchar or nchar field but I
did not think the Query Analyzer would display DBCS in a varchar or char
field.

Unfortunately I have to find a way of getting the data out of the SQL and
displayed using C#.

Regards,
John
 
Jon Skeet [C# MVP]

John J. Hughes II said:
Hum, the problem seems to be reading the data... I guess it's likely that
the old application is doing this in an incorrect way. The compiler I was
using at the time seemed to do everything in an incorrect way.

Was it an ASP application by any chance? That's where I've seen the
problem before.

From Query Analyzer:

¿Ë¡´àÒÊ ¿Ë¡´àÒÊ

As you can see it's unreadable. I am assuming this is stored as DBCS but I
am not sure how to tell. I know the Query Analyzer will display the data
correctly if I store it using UNICODE in an nvarchar or nchar field but I
did not think the Query Analyzer would display DBCS in a varchar or char
field.

Unfortunately I have to find a way of getting the data out of the SQL and
displayed using C#.

Okay. It sounds like the old app has indeed written bad data into the
database. I would expect query analyzer to get the character set right
for a varchar.

So, you need to find out what encoding the original uses, and then
"recode" it. Essentially, the process being used is something like:

1) Original character data entered by the user
2) Data is encoded from characters to bytes and stored in the database
(encoding 1)
3) Reading the database, the bytes are being reinterpreted as
characters in a different encoding (encoding 2)

You then need to "undo" the two levels of encoding/decoding here. If
you can work out the two encodings, you should be able to do something
like

string realText = encoding1.GetString (encoding2.GetBytes(bogusText));

Working out encoding1 and encoding2 is the difficult bit :(
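
A minimal sketch of that round trip, for illustration only. The thread does not establish the two encodings, so the choice of Thai code page 874 for encoding1 and Windows-1252 for encoding2 is purely an assumption, and the class and method names are hypothetical. On .NET Core and later, code page encodings such as 874 are only available after CodePagesEncodingProvider has been registered.

```csharp
using System;
using System.Text;

static class RecodeSketch
{
    // Undo "encoded with encoding1, then decoded with encoding2":
    // re-encode the bogus string with encoding2 to recover the original
    // bytes, then decode those bytes with encoding1.
    public static string Recode(string bogusText, Encoding encoding1, Encoding encoding2)
    {
        byte[] originalBytes = encoding2.GetBytes(bogusText);
        return encoding1.GetString(originalBytes);
    }

    public static void Main()
    {
        // Required on .NET Core / .NET 5+ before asking for code pages 874 or 1252.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding encoding1 = Encoding.GetEncoding(874);   // ASSUMPTION: Thai code page
        Encoding encoding2 = Encoding.GetEncoding(1252);  // ASSUMPTION: Western European

        // Stand-in for the garbage string read from the database.
        string bogusText = "\u00BF\u00CB\u00A1\u00B4";
        string realText = Recode(bogusText, encoding1, encoding2);

        // Dump the code points so they can be compared with the Unicode charts.
        foreach (char c in realText)
        {
            Console.WriteLine("{0:x4}", (int)c);
        }
    }
}
```

If the guess at the two encodings is wrong, the output will still be garbage, just different garbage, so the pair has to be confirmed against known input.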

Now, do you still use the existing application in everyday life? If
not, the easiest thing *might* be to try to read all the data from the
original application, write it (carefully!) to an nvarchar field
instead, and then in C# just read it as normal.

That's a good long-term fix - but obviously doesn't work if the
original app needs to keep working as it is. If that's the case, you
need to work out what the two encodings used are. One of them may well
be Encoding.Default, and you might have a look at what the database
properties are to see if that specifies a character encoding somewhere.
Sorry not to be of more use - I know the general principles, but don't
know much about the real specifics of SQL Server 2K in particular :(
 
John J. Hughes II

Thanks for the help.

Yes the old application is still being used and I really need to get the
data out of it into Excel.

It is using Clarion 5 if that means anything to you, it's from SoftVelocity.
I don't think they use ASP in any form but do have an MSSQL driver that it's
using. My impression is that they are using a Windows text box for editing
(more than likely an out of date API call). When the user types in the IME
it shows correctly in the text box if I set the code page to 222 and use the
correct font. If the code page or font are incorrect only question marks
are shown. It's also interesting to note if I do a copy and paste to
notepad it comes up with the same garbage but I can type the Thai into
notepad without problem.

I have tried setting the code page to 222 in C# but it says there is no such
thing (Invalid or unsupported code page type). Code page 874 which the VS
help says is for Thai just returns junk.

I have tried the following to no avail. Is there another choice?

I assume encoding1 will always be unicode since that is what I want as an
end result. For encoding2 I have tried ASCII, BigEndianUnicode, UTF7, UTF8,
and Unicode (did not expect the last to work but :)... Mostly I seem to be
ending up with what looks like Korean or maybe Chinese.

Encoding encoding1 = Encoding.Unicode;
Encoding encoding2 = Encoding.ASCII; // BigEndianUnicode, UTF7, UTF8, and Unicode

string realText = encoding1.GetString(encoding2.GetBytes((string)dr["Location"]));

Regards,
John
 
Jon Skeet [C# MVP]

John J. Hughes II said:
Thanks for the help.

Yes the old application is still being used and I really need to get the
data out of it into Excel.

It is using Clarion 5 if that means anything to you, it's from SoftVelocity.
I don't think they use ASP in any form but do have an MSSQL driver that it's
using. My impression is that they are using a Windows text box for editing
(more than likely an out of date API call). When the user types in the IME
it shows correctly in the text box if I set the code page to 222 and use the
correct font. If the code page or font are incorrect only question marks
are shown. It's also interesting to note if I do a copy and paste to
notepad it comes up with the same garbage but I can type the Thai into
notepad without problem.

Interesting... I wonder if we can write a code page 222 encoding...
I'll investigate.

I have tried setting the code page to 222 in C# but it says there is no such
thing (Invalid or unsupported code page type). Code page 874 which the VS
help says is for Thai just returns junk.

I have tried the following to no avail. Is there another choice?

I assume encoding1 will always be unicode since that is what I want as an
end result. For encoding2 I have tried ASCII, BigEndianUnicode, UTF7, UTF8,
and Unicode (did not expect the last to work but :)... Mostly I seem to be
ending up with what looks like Korean or maybe Chinese.

No, you don't necessarily want to be using Unicode for encoding1.
Although Unicode characters will be the result, that doesn't mean that
you want to be using the UTF-16 (Encoding.Unicode) encoding to do the
conversion.

Encoding encoding1 = Encoding.Unicode;
Encoding encoding2 = Encoding.ASCII; // BigEndianUnicode, UTF7, UTF8, and Unicode

string realText = encoding1.GetString(encoding2.GetBytes((string)dr["Location"]));

When you get the right encodings, that should give the right results.

The best way of testing this is to work out what Unicode characters
should be in the result, and test that in the actual result - don't
bother saving it in Excel and testing it there, as you may have other
problems at that stage, and we don't want to get confused.
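
One way to automate that check, sketched under the assumption that a correct result contains only ASCII and characters from the Unicode Thai block (U+0E00 to U+0E7F). The class and method names are hypothetical.

```csharp
using System;

static class ThaiCheck
{
    // True when every char is either ASCII or inside the Unicode Thai
    // block (U+0E00..U+0E7F). ASCII is allowed so that mixed Thai and
    // English text still passes.
    public static bool LooksLikeThai(string text)
    {
        foreach (char c in text)
        {
            bool isAscii = c <= 0x007F;
            bool isThai = c >= 0x0E00 && c <= 0x0E7F;
            if (!isAscii && !isThai)
            {
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(LooksLikeThai("\u0E01\u0E14 test"));  // True
        Console.WriteLine(LooksLikeThai("\u00BF\u00A1 test"));  // False
    }
}
```

This only shows that the characters are in the right block, not that they are the right characters, so it complements rather than replaces comparing a known input by hand.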
 
John J. Hughes II

News from the front. If I change the font in Excel to "AngsanaDSE" which is
a free font I found on the web Excel displays the data correctly. I am not
sure why I did not try this in the first place. Code page only seems to
affect the size of the font in the old application but the selected font
seems to change how it's displayed.

I am going to try to get this solution past the customer.

Thanks again for the help,
John


Jon Skeet [C# MVP]

John J. Hughes II said:
News from the front. If I change the font in Excel to "AngsanaDSE" which is
a free font I found on the web Excel displays the data correctly. I am not
sure why I did not try this in the first place. Code page only seems to
affect the size of the font in the old application but the selected font
seems to change how it's displayed.

I am going to try to get this solution past the customer.

So is this without trying any change in encoding at all? That's
encouraging. Have you checked whether the appropriate characters are in
the string as specified on www.unicode.org?
 
John J. Hughes II

Well I am pretty sure it's not UNICODE but just to look. The Thai UNICODE
seems to all start with 0x0E??. The data I am getting from the SQL all
seems to start with 0xb?. Now if it's DBCS then the first one below is
0xBFA1. I may look into the font tomorrow and see what values it has.

Sample string byte decoded.
BF,A1,B4,BF,CB,A1,D3,E4,E8,E4,D3,B9,BE,C3,E4,D3,BE,B9,D0,C2,E0,B9,BF,B4,A1,
CB,E0,BF,C2,D3,E4,BE,A4,E9,E8,D0,C2,B9,C3,BF,E0,E8,B4,C2,BE,D3,B5,BF,A4,E9,
D0,E0,BF,B9,C3,B4,CB,A1,D5,E0,BF,B4,A1,


Regards,
John
 
Jon Skeet [C# MVP]

John J. Hughes II said:
Well I am pretty sure it's not UNICODE but just to look. The Thai UNICODE
seems to all start with 0x0E??. The data I am getting from the SQL all
seems to start with 0xb?. Now if it's DBCS then the first one below is
0xBFA1. I may look into the font tomorrow and see what values it has.

Sample string byte decoded.
BF,A1,B4,BF,CB,A1,D3,E4,E8,E4,D3,B9,BE,C3,E4,D3,BE,B9,D0,C2,E0,B9,BF,B4,A1,
CB,E0,BF,C2,D3,E4,BE,A4,E9,E8,D0,C2,B9,C3,BF,E0,E8,B4,C2,BE,D3,B5,BF,A4,E9,
D0,E0,BF,B9,C3,B4,CB,A1,D5,E0,BF,B4,A1,

Sorry, I should have been clear - I don't mean how it is byte-decoded,
I mean how it is in the Unicode string after you've read it from the
database. You can print out the contents of the string with something
like:

foreach (char c in theString)
{
    Console.WriteLine ("{0:x4}", (int)c);
}
 
John J. Hughes II

Well what I had done is basically the same as you suggested. There is no
value in the upper byte. If I switch to your method I get 0x00BF for the
first value instead of 0xBF is all. I know I said byte in my message but I
was using int32. Since there was no upper value I am assuming the data is
being recorded as a byte and that is how C# is reading it even if a char is
longer than a byte.

string s = (string)dr["Location"];
System.Diagnostics.Debug.WriteLine("line");
foreach(char c in s)
{
    System.Diagnostics.Debug.Write(Convert.ToInt32(c).ToString("X") + ",");
}
System.Diagnostics.Debug.WriteLine("");

Using your method:

00bf
00a1
00b4
00bf
00cb
00a1
00d3
00e4
00e8
00e4
00d3
00b9
00be
00c3
00e4
00d3
00be
00b9
00d0
00c2
00e0
00b9
00bf
00b4
00a1
00cb
00e0
00bf
00c2
00d3
00e4
00be
00a4
00e9
00e8
00d0
00c2
00b9
00c3
00bf
00e0
00e8
00b4
00c2
00be
00d3
00b5
00bf
00a4
00e9
00d0
00e0
00bf
00b9
00c3
00b4
00cb
00a1
00d5
00e0
00bf
00b4
00a1
00bf
00cb
00a1
00b4
00bf
00cb
00a1
00b4
00cb
00a1
00b4
00bf
00cb
00a1
00b4
00cb
00bf
00a1
00b4
004c
0061
0070
0020
0070
0068
006f
006e
0065
0020
0075
0073
0065
0020
0066
006f
0072
0020
0074
0065
0073
0074
0069
006e
0067
0020
0074
0068
0069
0073
0020
0052
0041
004d
0053
002c
0020
0077
006f
0077
0021

Regards,
John
 
Jon Skeet [C# MVP]

John J. Hughes II said:
Well what I had done is basically the same as you suggested. There is no
value in the upper byte. If I switch to your method I get 0x00BF for the
first value instead of 0xBF is all. I know I said byte in my message but I
was using int32. Since there was no upper value I am assuming the data is
being recorded as a byte and that is how C# is reading it even if a char is
longer than a byte.

Well, it's being stored as a 16-bit entity that just happens to be less
than 256.

<snip>

Okay, those definitely don't look like Thai characters to me, so the
decoding is still going awry. Unfortunately, I haven't been able to
find a spec for code page 222 :(

http://www.unicode.org/charts/PDF/U0E00.pdf gives the Thai characters
as far as Unicode is concerned (there may be more; I'm not sure). If
you could write a record with the old app containing each Thai
character, you could work out a mapping yourself - or you could stick
with the solution you've got.

Let me know if you want any more help - sorry I haven't been able to
give a really good solution :(
 
John J. Hughes II

Thanks again for the help.

I think I now understand what's happened at least.

As I think I said changing to code page 222 does not matter. I had done
that because the Clarion compiler's help file says it's the code page for
Thai. Basically leaving the Windows default code page shows the same values
as code page 222. The only real difference is the size of the characters,
which might just mean that somewhere in the code page it says the font's
default is larger or something.

Using the symbol view in MS Word I have been browsing the "AngsanaDSE" font
which seems to work. 0xb4 is a Thai char in that font instead of the value
which is supposed to be in that position. So basically they have mapped the
Thai chars into the upper part of the font. Unfortunately just adding the
0xe00 in front of the value read does not fix the mapping. If I do need to
fix this correctly in the future I will have to remap the whole font it seems.
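
For what it is worth, the Thai standard TIS-620 (which Windows code page 874 extends) places Thai characters at bytes 0xA1 to 0xDA and 0xDF to 0xFB, with byte 0xA1 corresponding to U+0E01. Under that layout the offset is 0x0E00 minus 0xA0, not a plain 0x0E00 prefix. Whether the old application and the AngsanaDSE font actually follow TIS-620 is an assumption that nothing in this thread confirms. A hand-written remap under that assumption might look like the sketch below (class and method names are hypothetical).

```csharp
using System;
using System.Text;

static class Tis620Remap
{
    // Maps chars that carry raw byte values (0x00A1..0x00FB) into the
    // Unicode Thai block, ASSUMING a TIS-620 layout:
    // byte 0xA1 -> U+0E01, ..., byte 0xFB -> U+0E5B.
    // Bytes 0xDB..0xDE are unassigned in TIS-620 and are left untouched,
    // as is everything outside the range (for example ASCII).
    public static string Remap(string raw)
    {
        StringBuilder sb = new StringBuilder(raw.Length);
        foreach (char c in raw)
        {
            bool inThaiRange = c >= 0x00A1 && c <= 0x00FB;
            bool unassigned = c >= 0x00DB && c <= 0x00DE;
            if (inThaiRange && !unassigned)
            {
                sb.Append((char)(c - 0x00A0 + 0x0E00));
            }
            else
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // First four values from the dump earlier in the thread: 00bf, 00a1, 00b4, 00bf.
        string remapped = Remap("\u00BF\u00A1\u00B4\u00BF");
        foreach (char c in remapped)
        {
            Console.WriteLine("{0:x4}", (int)c);  // 0e1f, 0e01, 0e14, 0e1f
        }
    }
}
```

The result would still need to be compared against a record whose Thai content is known before trusting it.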

Regards,
John
 
