Trying to get an ASCII string.

Heandel

Hello,

I have an application that extracts ASCII strings from a binary file
(characters from 0 to 255). Depending on the user's Windows code page,
the characters above 127 do not "mean" the same thing; that is normal.

My problem now is saving that string in text format while keeping
every character as a single byte rather than a .NET char.

Example :

Let's say a Russian user runs my program and the tool extracts
C0 F0 E3 E5 ED F2 00.
(Assuming buffer contains the binary data, this is how I extract the
string:)

StringBuilder sb = new StringBuilder();

for (int i = 0; i < buffer.Length; i++)
    sb.Append((char)buffer[i]);

Now I want to save it in text format, and I get this in the text file:
D1 8E D0 9F D0 A6 D0 95 D0 9C D0

I want to write the same string I read!

Can anyone help me?

Thanks,

Heandel
 
Peter Duniho

Heandel said:
[...]
StringBuilder sb = new StringBuilder();

for (int i = 0; i < buffer.Length; i++)
    sb.Append((char)buffer[i]);

Now I want to save it in TEXT format and I get this in the text file :
D1 8E D0 9F D0 A6 D0 95 D0 9C D0

I want to write the same string I read !

Can anyone help me ?


Without a concise-but-complete code example, we have no way to know
important facts such as the type of the "buffer" variable and how you
are saving the data from the StringBuilder back to a file.

That said, IMHO you have two valid options:

– Don't do any character conversion at all. Retain the original byte
data and write it when you want to save it again.

– Do your character conversions correctly. This means using the
System.Text.Encoding class and selecting the appropriate character
encoding for your input. Likewise when writing the text, using Encoding
again to convert the text data back to bytes.

I prefer the latter, but you can successfully follow the former approach
if you like. What you can't do is make up your own random
binary-to-text-to-binary conversion routines and expect them to work
without completely reproducing the implementation that already exists in
the Encoding class.
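
A minimal sketch of those two options, assuming the input bytes are Windows code page 1251 (an assumption at this point in the thread) and using hypothetical names (RoundTripSketch, SaveRaw, DecodeThenEncode). The RegisterProvider call is only needed on .NET Core and .NET 5 or later, where code pages such as 1251 are not available by default; on the classic .NET Framework it can be dropped.

```csharp
using System;
using System.IO;
using System.Text;

public static class RoundTripSketch
{
    // Option 1: no character conversion; the bytes read are the bytes written.
    public static void SaveRaw(string path, byte[] rawData)
    {
        File.WriteAllBytes(path, rawData);
    }

    // Option 2: decode with a named encoding, then encode with the same one.
    public static byte[] DecodeThenEncode(byte[] rawData, int codePage)
    {
        Encoding enc = Encoding.GetEncoding(codePage);
        string text = enc.GetString(rawData); // bytes -> UTF-16 string
        return enc.GetBytes(text);            // UTF-16 string -> bytes
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, so register code pages first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] rawData = { 0xC0, 0xF0, 0xE3, 0xE5, 0xED, 0xF2 };

        // Hypothetical output location, chosen only for this illustration.
        string path = Path.Combine(Path.GetTempPath(), "raw-copy.bin");
        SaveRaw(path, rawData);

        byte[] roundTripped = DecodeThenEncode(rawData, 1251);
        Console.WriteLine(BitConverter.ToString(roundTripped)); // C0-F0-E3-E5-ED-F2
    }
}
```

Casting each byte to char, as in the original loop, is neither of these: it silently treats the data as Latin-1, so the text no longer matches what a Cyrillic code page would display.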

Pete
 
Heandel

Peter Duniho said:
Without a concise-but-complete code example, we have no way to know
important facts such as the type of the "buffer" variable and how you are
saving the data from the StringBuilder back to a file.

That said, IMHO you have two valid options:

– Don't do any character conversion at all. Retain the original byte
data and write it when you want to save it again.

– Do your character conversions correctly. This means using the
System.Text.Encoding class and selecting the appropriate character
encoding for your input. Likewise when writing the text, using Encoding
again to convert the text data back to bytes.

I prefer the latter, but you can successfully follow the former approach
if you like. What you can't do is make up your own random
binary-to-text-to-binary conversion routines and expect them to work
without completely reproducing the implementation that already exists in
the Encoding class.

Pete

Okay, I'll try to be clearer. Here is a small example:

// That is a Russian word, in 8-bit ASCII (call it "Extended ASCII" or
// whatever you like)
byte[] rawData = { 0xD1, 0x8E, 0xD0, 0x9F, 0xD0, 0xA6, 0xD0, 0x95, 0xD0,
                   0x9C, 0xD0 };

We get a string out of this ("manually", because none of the Encoding
classes can do the work, as none is 8 bits):

// Code page Windows-1251 is Cyrillic
string s = Encoding.GetEncoding(1251).GetString(rawData);

Now let's assume I save the string:

File.WriteAllText("somepath.txt", s);

somepath.txt will contain the string encoded in Unicode, 2 bytes per char,
while I want to save it as 1 byte per char, like the source string.

Am I explaining myself better?

Thanks again for your help,

Heandel
 
Heandel

Whoops, my sample is wrong; the original Russian word is:

byte[] rawData = { 0xC0, 0xF0, 0xE3, 0xE5, 0xED, 0xF2, 0x00 };

After saving, it looks like:

byte[] rawData = { 0xD1, 0x8E, 0xD0, 0x9F, 0xD0, 0xA6, 0xD0, 0x95, 0xD0,
                   0x9C, 0xD0 };

Excuse me, wrong copy-paste ;)
 
Adam Clauss

Heandel said:
We get a string out of this ("manually", because none of the Encoding
classes can do the work, as none is 8 bits)
string s = Encoding.GetEncoding(1251).GetString(rawData); // Codepage
Windows 1251 is Cyrillic

Now let's assume I save the string :
File.WriteAllText("somepath.txt", s);

Check the documentation of that overload of File.WriteAllText
(http://msdn.microsoft.com/en-us/library/ms143375.aspx), first sentence
under "Remarks":
"This method uses UTF-8 encoding..."

So, if you do not want to write out in UTF-8, use the other overload of
File.WriteAllText, and pass it the same encoding you used when reading
the string.
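
To see why the file grows, it may help to compare encoded sizes directly. The sketch below assumes the string is the six Cyrillic letters that the thread's 1251 bytes (C0 F0 E3 E5 ED F2) decode to; EncodedSizeSketch and ByteCount are hypothetical names, and the RegisterProvider call is only needed on .NET Core and .NET 5 or later.

```csharp
using System;
using System.Text;

public static class EncodedSizeSketch
{
    // Number of bytes the text occupies once encoded with the given encoding.
    public static int ByteCount(string text, Encoding encoding)
    {
        return encoding.GetBytes(text).Length;
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, so register code pages first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // The six letters, written as escapes so the source file encoding does not matter.
        string s = "\u0410\u0440\u0433\u0435\u043d\u0442";

        // UTF-8 needs two bytes for each of these Cyrillic letters.
        Console.WriteLine(ByteCount(s, Encoding.UTF8));              // 12

        // Code page 1251 stores each of them in a single byte.
        Console.WriteLine(ByteCount(s, Encoding.GetEncoding(1251))); // 6
    }
}
```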

-Adam
 
Arne Vajhøj

Whoops, my sample is wrong, the original russian word is :

byte[] rawData = { 0xC0, 0xF0, 0xE3, 0xE5, 0xED, 0xF2, 0x00 };

After saving it looks like :

byte[] rawData = { 0xD1, 0x8E, 0xD0, 0x9F, 0xD0, 0xA6, 0xD0, 0x95, 0xD0,
0x9C, 0xD0 };

Example showing various stuff:

using System;
using System.IO;
using System.Text;
using System.Windows.Forms;

namespace E
{
    public class Program
    {
        public static void Main(string[] args)
        {
            // Decode the code page 1251 bytes into a .NET (UTF-16) string.
            byte[] byt = { 0xC0, 0xF0, 0xE3, 0xE5, 0xED, 0xF2, 0x00 };
            string str = Encoding.GetEncoding(1251).GetString(byt);
            MessageBox.Show(str);

            // Encoding back with 1251 yields one byte per character.
            Console.WriteLine(Encoding.GetEncoding(1251).GetBytes(str).Length);

            // Variant 1: write the original bytes unchanged.
            using (Stream stm = new FileStream(@"C:\z.1",
                                               FileMode.Create, FileAccess.Write))
            {
                stm.Write(byt, 0, byt.Length);
            }
            Console.WriteLine((new FileInfo(@"C:\z.1")).Length);

            // Variant 2: write the string through a writer that encodes as 1251.
            using (StreamWriter sw = new StreamWriter(@"C:\z.2", false,
                                                      Encoding.GetEncoding(1251)))
            {
                sw.Write(str);
            }
            Console.WriteLine((new FileInfo(@"C:\z.2")).Length);
            Console.ReadKey();
        }
    }
}

Arne
 
Peter Duniho

Heandel said:
Okay, I'll try being clearer. Here is a small example :

byte[] rawData = { 0xD1, 0x8E, 0xD0, 0x9F, 0xD0, 0xA6, 0xD0, 0x95, 0xD0,
0x9C, 0xD0 }; // That is a russian word, in 8-bits ASCII (call it
"Extended ASCII" or whatever you like)

We get a string out of this ("manually", because none of the Encoding
classes can do the work, as none is 8 bits)
string s = Encoding.GetEncoding(1251).GetString(rawData); // Codepage
Windows 1251 is Cyrillic

The above claim, that you have to convert the string from bytes manually,
doesn't make sense. It's true that (ignoring the bad copy/paste) your
8-bit encoding can't be directly entered into a Unicode string. But
those _characters_ certainly can be represented in Unicode; that's what
the Encoding.GetString() method does for you. So instead of defining
the string as bytes from the 8-bit code page 1251, just define it in
Unicode using the correct Unicode characters instead.

This can be done by inserting the appropriate UTF-16 literals in the
string. For example, your string (not including the null-terminator,
which may or may not be important):

string str = "\u0410\u0440\u0433\u0435\u043d\u0442";

Or, if you are willing for your source files to be encoded in a format
supporting those characters (e.g. UTF-8, UTF-16, etc.; you might even be
able to use the Windows CP-1251 encoding, if the compiler supports it; I
don't know off the top of my head if it does), you can even just type
the actual characters into the source code directly.
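
As a rough check of the literal above, and of the null-terminator remark, the sketch below decodes the thread's bytes up to the first 0x00 and compares the result with the escaped string. NullTerminatedSketch and ReadCString are hypothetical names; the RegisterProvider call is only needed on .NET Core and .NET 5 or later.

```csharp
using System;
using System.Text;

public static class NullTerminatedSketch
{
    // Decodes bytes from 'start' up to (not including) the first 0x00,
    // or to the end of the buffer when no terminator is found.
    public static string ReadCString(byte[] buffer, int start, Encoding encoding)
    {
        int end = Array.IndexOf(buffer, (byte)0, start);
        if (end < 0)
        {
            end = buffer.Length;
        }
        return encoding.GetString(buffer, start, end - start);
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, so register code pages first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        byte[] rawData = { 0xC0, 0xF0, 0xE3, 0xE5, 0xED, 0xF2, 0x00 };
        string s = ReadCString(rawData, 0, Encoding.GetEncoding(1251));

        Console.WriteLine(s == "\u0410\u0440\u0433\u0435\u043d\u0442"); // True
        Console.WriteLine(s.Length);                                    // 6
    }
}
```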

Now let's assume I save the string :
File.WriteAllText("somepath.txt", s);

somepath.txt will contain the string encoded in Unicode, 2 bytes per
char, while I want to save it as 1 byte per char, like the source string.

So use Encoding to get the same bytes back from the string (see the
GetBytes() method). Once the string is encoded as Unicode (i.e. UTF-16,
which is the Unicode encoding used by .NET strings and characters) this
would be the correct approach regardless of the original source of the
string.

In your example, it's simply a matter of providing the correct Encoding
instance to the File.WriteAllText() method:

File.WriteAllText("somepath.txt", s, Encoding.GetEncoding(1251));

That will convert your text back to the 1251 code page. (Of course, it
raises the question of why you want to keep using a non-Unicode
encoding, but presumably you have some important justification for that.)
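
One way to confirm this on disk is to read the file back as raw bytes. This is only a sketch: FileRoundTripSketch and WriteAndReadBack are hypothetical names, the file is placed in the temp directory for illustration, and the RegisterProvider call is only needed on .NET Core and .NET 5 or later.

```csharp
using System;
using System.IO;
using System.Text;

public static class FileRoundTripSketch
{
    // Writes the text with the given encoding and returns the bytes that ended up on disk.
    public static byte[] WriteAndReadBack(string path, string text, Encoding encoding)
    {
        File.WriteAllText(path, text, encoding);
        return File.ReadAllBytes(path);
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, so register code pages first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding cp1251 = Encoding.GetEncoding(1251);

        byte[] rawData = { 0xC0, 0xF0, 0xE3, 0xE5, 0xED, 0xF2 };
        string s = cp1251.GetString(rawData);

        string path = Path.Combine(Path.GetTempPath(), "somepath.txt");
        byte[] written = WriteAndReadBack(path, s, cp1251);

        // One byte per character, identical to the input.
        Console.WriteLine(BitConverter.ToString(written)); // C0-F0-E3-E5-ED-F2
    }
}
```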

Pete
 
Harlan Messinger

Heandel said:
Peter Duniho said:
Without a concise-but-complete code example, we have no way to know
important facts such as the type of the "buffer" variable and how you
are saving the data from the StringBuilder back to a file.

That said, IMHO you have two valid options:

– Don't do any character conversion at all. Retain the original
byte data and write it when you want to save it again.

– Do your character conversions correctly. This means using the
System.Text.Encoding class and selecting the appropriate character
encoding for your input. Likewise when writing the text, using
Encoding again to convert the text data back to bytes.

I prefer the latter, but you can successfully follow the former
approach if you like. What you can't do is make up your own random
binary-to-text-to-binary conversion routines and expect them to work
without completely reproducing the implementation that already exists
in the Encoding class.

Pete

Okay, I'll try being clearer. Here is a small example :

byte[] rawData = { 0xD1, 0x8E, 0xD0, 0x9F, 0xD0, 0xA6, 0xD0, 0x95, 0xD0,
0x9C, 0xD0 }; // That is a russian word, in 8-bits ASCII (call it
"Extended ASCII" or whatever you like)

"Extended ASCII" doesn't really mean anything; this is one of numerous
8-bit encodings that have the ASCII set as a subset. You needn't have
been at a loss for the name of this one, because it's called ... wait
for it ...

We get a string out of this ("manually", because none of the Encoding
classes can do the work, as none is 8 bits)
string s = Encoding.GetEncoding(1251).GetString(rawData); // Codepage
Windows 1251 is Cyrillic

... Windows code page 1251.

Now let's assume I save the string :
File.WriteAllText("somepath.txt", s);

You read it into C# strings using the applicable encoding, but then when
you wrote it out to a file without specifying the encoding, you were
puzzled that it didn't use the encoding that you didn't specify. You
have to specify the encoding on output, just as you do on input, if you
want something other than UTF-8.
 
