StreamReader / StreamWriter Encoding

Jaroslav Jakes · Jan 24, 2005

Hi,

please help.

Sounds so simple. We receive text files (customer orders) as e-mail
attachments. These text files contain a simple order structure, like:
custno, itemno, qty, text

Since these text files are created on different systems, the "text" field
causes some trouble.

Characters like ä, ö, ü are not always converted correctly.

The source code looks like:

- open the text file (StreamReader, encoding = Default, detectEncoding = true)
- convert to a new structure
- write the new text file (StreamWriter)

What would you suggest? How could we "detect" the encoding of the file in
order to convert the text field correctly?

Thanks and regards - Jari
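
For readers following along, here is a minimal sketch of the read / convert / write steps listed in the question. One thing worth knowing: the detectEncoding flag of StreamReader only looks for a byte order mark (UTF-8, UTF-16, UTF-32). A file without one is decoded with whatever encoding was passed in, so the source encoding still has to be known or guessed. The names OrderFileConverter and ConvertToUtf8 are made up for illustration, and writing UTF-8 is an assumption, not something the original post specifies.

```csharp
using System;
using System.IO;
using System.Text;

public static class OrderFileConverter
{
    // Reads inputPath using sourceEncoding (unless a byte order mark says
    // otherwise) and rewrites it line by line as UTF-8 with a BOM.
    public static void ConvertToUtf8(string inputPath, string outputPath, Encoding sourceEncoding)
    {
        using (StreamReader reader = new StreamReader(inputPath, sourceEncoding, true))
        using (StreamWriter writer = new StreamWriter(outputPath, false, new UTF8Encoding(true)))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                // The real conversion to the new order structure would go here.
                writer.WriteLine(line);
            }
        }
    }
}
```

Writing the output in one fixed encoding with a byte order mark means later readers of the file no longer have to guess.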

Marcin Grzębski · Jan 24, 2005

Hi Jaroslav,

I recommend using a byte (character) histogram to determine how often
each character code from 128 to 255 occurs.
If you compare those counts with the "special" codes of German, Polish
(or any other) encodings, you can guess the source encoding.

A more sophisticated (e.g. dictionary-based) algorithm can be used
to eliminate errors.

HTH
Marcin
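
A minimal sketch of the histogram step described above, assuming the whole file has already been read into a byte array. The class and method names (ByteHistogram, CountHighBytes) are hypothetical.

```csharp
using System;

public static class ByteHistogram
{
    // Returns a 256-entry table; entry n holds how often byte value n occurs.
    // Only values 128..255 are counted, since plain ASCII bytes look the same
    // in all the single-byte encodings discussed here.
    public static int[] CountHighBytes(byte[] data)
    {
        int[] counts = new int[256];
        foreach (byte b in data)
        {
            if (b >= 128)
            {
                counts[b]++;
            }
        }
        return counts;
    }
}
```

The resulting table can then be compared against the byte values each candidate encoding uses for its accented letters.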

Jaroslav Jakes · Jan 24, 2005

Hi Marcin,

do you have a link to samples or a further description? Sorry, I don't know
how to do that...

Thanks and regards - Jari

Marcin Grzębski · Jan 24, 2005

hmmm...
I don't know of any links or samples, but I'm sure your problem
has come up in this group before.

I can show you the concept of this algorithm:

int germanEncodingCounter = 0;
int polishEncodingCounter = 0;
// An array with the bytes of the text file ("orders.txt" is a placeholder name)
byte[] bytesOfText = System.IO.File.ReadAllBytes("orders.txt");

// I don't know the German character codes, so I used arbitrary numbers
for (int i = 0; i < bytesOfText.Length; i++) {
    switch (bytesOfText[i]) {
        case 170:
            germanEncodingCounter++;
            break;
        case 163:
            germanEncodingCounter++;
            polishEncodingCounter++; // Ł in ISO-8859-2
            break;
        case 175:
            polishEncodingCounter++; // Ż in ISO-8859-2
            break;
    }
}

if (polishEncodingCounter > 0 || germanEncodingCounter > 0) {
    if (germanEncodingCounter > polishEncodingCounter) {
        // it looks like a German encoding
    }
    else if (polishEncodingCounter > germanEncodingCounter) {
        // it looks like a Polish encoding
    }
    else {
        // tie: cannot decide
    }
}
else {
    // encoding not found!
}

HTH
Marcin
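
Applied to the original question (ä, ö, ü arriving from different systems), the same counting idea could look like the sketch below. It rests on assumptions the thread does not confirm: that the files are either UTF-8, a Windows/Latin-1 style code page, or the DOS code page 850. The class name EncodingGuesser and the returned labels are made up for illustration. The byte values are the standard positions of ä ö ü Ä Ö Ü ß in ISO-8859-1 / Windows-1252 and in code page 850.

```csharp
using System;
using System.Text;

public static class EncodingGuesser
{
    // ä ö ü Ä Ö Ü ß in ISO-8859-1 / Windows-1252
    static readonly byte[] Latin1Umlauts = { 0xE4, 0xF6, 0xFC, 0xC4, 0xD6, 0xDC, 0xDF };
    // ä ö ü Ä Ö Ü ß in DOS code page 850
    static readonly byte[] Cp850Umlauts = { 0x84, 0x94, 0x81, 0x8E, 0x99, 0x9A, 0xE1 };

    public static string Guess(byte[] data)
    {
        // Pure ASCII needs no guessing.
        bool hasHighByte = false;
        foreach (byte b in data)
        {
            if (b >= 128) { hasHighByte = true; break; }
        }
        if (!hasHighByte) return "ascii";

        // A strict UTF-8 decoder throws on byte sequences that are not valid UTF-8.
        try
        {
            new UTF8Encoding(false, true).GetString(data);
            return "utf-8";
        }
        catch (ArgumentException)
        {
            // Not valid UTF-8: fall through to the byte-count comparison.
        }

        int latin1Score = Count(data, Latin1Umlauts);
        int cp850Score = Count(data, Cp850Umlauts);
        if (latin1Score > cp850Score) return "windows-1252";
        if (cp850Score > latin1Score) return "ibm850";
        return "unknown";
    }

    // Counts how many bytes of data appear in the candidate set.
    static int Count(byte[] data, byte[] candidates)
    {
        int n = 0;
        foreach (byte b in data)
        {
            if (Array.IndexOf(candidates, b) >= 0) n++;
        }
        return n;
    }
}
```

This is only a heuristic: short texts or texts with few umlauts can easily be misjudged, which is where the dictionary-based refinement mentioned earlier in the thread would help.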

Jaroslav Jakes · Jan 24, 2005

Hi Marcin,

thanks! I understand what I need to do now...

Regards - Jari

