StreamReader / StreamWriter Encoding

Jaroslav Jakes

Hi,

please help.

It sounds so simple. We receive text files (customer orders) as e-mail
attachments. These text files contain a simple order structure, like:
custno, itemno, qty, text

Since these text files are created on different systems, the "text" field
causes some trouble.

Characters like ä, ö, ü are not always converted correctly.

The source code looks like:

- open the text file (StreamReader, encoding = Default, detect encoding = true)
- convert it to a new structure
- write the new text file (StreamWriter)
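
In C#, these three steps might look roughly like the sketch below. The class and method names and the choice of UTF-8 for the output are assumptions made for illustration, not taken from the real code. Note that the detect-encoding flag of StreamReader only recognizes files that start with a byte order mark; for files without one, the reader uses the encoding passed in.

```csharp
using System.IO;
using System.Text;

public static class OrderFileConverter
{
    // Reads sourcePath with fallbackEncoding (unless a byte order mark says
    // otherwise) and writes the result to targetPath as UTF-8.
    public static void Convert(string sourcePath, string targetPath, Encoding fallbackEncoding)
    {
        using (var reader = new StreamReader(sourcePath, fallbackEncoding, true))
        using (var writer = new StreamWriter(targetPath, false, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // Placeholder: the real conversion of custno, itemno, qty, text goes here.
                writer.WriteLine(line);
            }
        }
    }
}
```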

What would you suggest? How could we "detect" the encoding of the file in
order to convert the text field correctly?

Thanks and regards - Jari
 
Marcin Grzębski

Hi Jaroslav,

I recommend building a byte (character) histogram to determine how
often each character code from 128 to 255 occurs.
If you compare those frequencies with the "special" codes of German,
Polish (or any other) encodings, you can guess the source encoding.

A more sophisticated (e.g. dictionary-based) algorithm can be used
to reduce errors.
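
To illustrate the histogram idea, here is a minimal sketch in C#; the class and method names are made up for this example. It only counts how often each byte value from 128 to 255 occurs. Comparing the counts against the codes typical for a given encoding is a separate step.

```csharp
public static class ByteHistogram
{
    // Counts how often each byte value from 128 to 255 occurs.
    // Index 0 of the result corresponds to byte value 128.
    public static int[] CountHighBytes(byte[] data)
    {
        var counts = new int[128];
        foreach (byte b in data)
        {
            if (b >= 128)
            {
                counts[b - 128]++;
            }
        }
        return counts;
    }
}
```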

HTH
Marcin
 
Jaroslav Jakes

Hi Marcin,

do you have a link to samples or a further description? Sorry, I don't know
how to do that...

Thanks and regards - Jari
 
Marcin Grzębski

Hmmm...
I don't know of any links or samples, but I'm sure your problem
came up in this group some time ago.

I can show you the concept of the algorithm:

using System.IO;

int germanEncodingCounter = 0;
int polishEncodingCounter = 0;
// The raw bytes of the text file (the file name is a placeholder).
byte[] bytesOfText = File.ReadAllBytes("orders.txt");

// I don't know the German character codes, so the case values
// below are arbitrary examples.
for (int i = 0; i < bytesOfText.Length; i++) {
    switch (bytesOfText[i]) {
        case 170:
            germanEncodingCounter++;
            break;
        case 163: // Ł in ISO-8859-2
            germanEncodingCounter++;
            polishEncodingCounter++;
            break;
        case 175: // Ż in ISO-8859-2
            polishEncodingCounter++;
            break;
    }
}

if (polishEncodingCounter > 0 || germanEncodingCounter > 0) {
    if (germanEncodingCounter > polishEncodingCounter) {
        // it looks like a German encoding
    }
    else if (polishEncodingCounter > germanEncodingCounter) {
        // it looks like a Polish encoding
    }
    else {
        // a tie: the counts do not favor either encoding
    }
}
else {
    // no special characters found, so the encoding cannot be guessed
}
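
With real German character codes filled in, the same idea might look like the sketch below. It assumes, beyond anything stated in this thread, that the files arrive as UTF-8, ISO-8859-1/Windows-1252, or DOS code page 437/850; the class name, method names, and return labels are hypothetical. It first checks whether the bytes form valid UTF-8 and only then compares the umlaut counts.

```csharp
using System;
using System.Text;

public static class GermanEncodingGuesser
{
    // Byte values of Ä Ö Ü ß ä ö ü in ISO-8859-1 / Windows-1252.
    static readonly byte[] Latin1Umlauts = { 0xC4, 0xD6, 0xDC, 0xDF, 0xE4, 0xF6, 0xFC };
    // Byte values of the same letters in DOS code pages 437 and 850.
    static readonly byte[] DosUmlauts = { 0x8E, 0x99, 0x9A, 0xE1, 0x84, 0x94, 0x81 };

    // Returns "utf-8", "latin1", "dos", or "unknown" (labels chosen for this sketch).
    public static string Guess(byte[] data)
    {
        // Pure ASCII input gives no evidence for any of the candidates.
        if (!Array.Exists(data, b => b >= 128)) return "unknown";

        if (IsValidUtf8(data)) return "utf-8";

        int latin1 = 0, dos = 0;
        foreach (byte b in data)
        {
            if (Array.IndexOf(Latin1Umlauts, b) >= 0) latin1++;
            if (Array.IndexOf(DosUmlauts, b) >= 0) dos++;
        }
        if (latin1 > dos) return "latin1";
        if (dos > latin1) return "dos";
        return "unknown"; // a tie, or only other high bytes were present
    }

    static bool IsValidUtf8(byte[] data)
    {
        try
        {
            // throwOnInvalidBytes = true makes malformed sequences raise an exception.
            new UTF8Encoding(false, true).GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

The returned label could then be mapped to an Encoding object for the StreamReader. A guess like this can still be wrong for short texts, which is where the dictionary-based refinement would help.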

HTH
Marcin
 
Jaroslav Jakes

Hi Marcin,

thanks! Now I understand what I need to do...

Regards - Jari

 
