Reading an Ascii string

J

John

Hi,

I'm a beginner at using C# and .NET.

I have big legacy files that store various values (ints, bytes, strings) and I want to read them into
a C# programme so that I can store them in a database. The files were written by a late-1980s PC
Pascal programme, for which I don't have the source code. I've managed to reverse engineer the file
format.

The strings are stored as Ascii in the file, with the first byte indicating the string length, and
the rest are the Ascii (ie 8-bit) characters. The string length is always 0, 20 or 40 characters
(never any more) and strings are end-padded with space characters where necessary.

What is the best way to quickly read a string and get rid of the space padding at the end? To make
sure I can read them correctly, I'll put them in a text box. I assume the string used in a text box
uses 16-bit characters (Unicode?) but I may be wrong here. When I'm happy I can read them correctly,
I'll get rid of the text box and store them directly in the database. Is it best to store it in the
database as unicode? I'm tempted to use Ascii for efficiency.

I was thinking of using a binary reader (_br) to extract from the file. That should be fine for
everything, but I don't know how to cope with the Ascii strings.
 
M

Morten Wennevik

Hi John,

First, ASCII is 7-bit, and any character above 127 will need the proper encoding to be read right.
I'm assuming the characters are stored as 8 bit.

You can either read the FileStream directly or treat the data as a single string from a textbox.

You will need a loop, which in this case would be something like

index = 0
ArrayList strings

while (index < length of data)
{
    numbytes = data[index]
    index++

    copy the next numbytes bytes to a new string
    strings.Add(newstring)

    index += numbytes

    remove space padding if needed
    index++ if needed
}
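The loop above could be sketched in C# as follows. This is a minimal sketch, assuming the file content has already been read into a byte[] and that every record is a string, which won't be true for John's mixed-field files; the class, method, and sample data are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

class PascalStrings
{
    // Parse length-prefixed, space-padded strings from a byte buffer.
    static List<string> ParseStrings(byte[] data, Encoding encoding)
    {
        var strings = new List<string>();
        int index = 0;
        while (index < data.Length)
        {
            int numBytes = data[index];   // first byte is the length prefix
            index++;
            string s = encoding.GetString(data, index, numBytes);
            strings.Add(s.TrimEnd(' ')); // remove the space padding
            index += numBytes;
        }
        return strings;
    }

    static void Main()
    {
        // "Joe" padded to 20 characters, prefixed by the length byte 20.
        byte[] sample = new byte[21];
        sample[0] = 20;
        Encoding.ASCII.GetBytes("Joe".PadRight(20)).CopyTo(sample, 1);

        foreach (string s in ParseStrings(sample, Encoding.ASCII))
            Console.WriteLine("'" + s + "'");   // 'Joe'
    }
}
```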

It may be easiest to treat the data as a char array or as a string, in which case a textbox should be easy enough. Using a FileStream you would need to read the file as ASCII.

File.ReadAllText(filepath, System.Text.Encoding.ASCII); (C# 2.0)

If the file isn't ASCII, but uses all 8 bits for data, then you need to figure out the correct encoding by trial and error.
 
K

Kevin Spencer

Check out the System.Text Namespace, specifically the Encoding, Encoder, and
Decoder classes.

--
HTH,

Kevin Spencer
Microsoft MVP
Professional Chicken Salad Alchemist

Big thicks are made up of lots of little thins.
 
G

Göran Andersson

John said:
Hi,

I'm a beginner at using C# and .NET.

I have big legacy files that store various values (ints, bytes, strings) and I want to read them into
a C# programme so that I can store them in a database. The files were written by a late-1980s PC
Pascal programme, for which I don't have the source code. I've managed to reverse engineer the file
format.

The strings are stored as Ascii in the file, with the first byte indicating the string length, and
the rest are the Ascii (ie 8-bit) characters.

Yes, that's how strings are stored in Pascal.
The string length is always 0, 20 or 40 characters
(never any more) and strings are end-padded with space characters where necessary.

Does the length include the padding or not? If it does, you just have to
trim the string. If it doesn't, you have to calculate how much padding
there is from the length of the string, and skip that number of bytes.
What is the best way to quickly read a string and get rid of the space padding at the end?

Read the length using ReadByte, then use the ReadChars method to get the
string. You get an array of Char, if you want a string just create one
from the array.
To make
sure I can read them correctly, I'll put them in a text box. I assume the string used in a text box
uses 16-bit characters (unicode?) but I may be wrong here. When I'm happy I can read them correctly,
I'll get rid of the text box and store them directly in the database. Is it best to store it in the
database as unicode? I'm tempted to use Ascii for efficiency.

I was thinking of using a binary reader (_br) to extract from the file. That should be fine for
everything, but I don't know how to cope with the Ascii strings.

Yes, a BinaryReader is exactly what I would suggest to use.

Specify the encoding when you create the BinaryReader, that way it can
handle reading chars, and you don't have to read bytes and decode them.

ASCII encoding won't work if the strings contain extended characters
(above 127). Use Encoding.GetEncoding(850) to get the encoding for
extended ASCII.
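Göran's suggestion could be sketched like this (the file name is illustrative, and code page 850 is an assumption that should be verified against the real data):

```csharp
using System;
using System.IO;
using System.Text;

class Reader
{
    static void Main()
    {
        // Code page 850 is a guess for DOS-era UK data; verify by inspection.
        Encoding enc = Encoding.GetEncoding(850);
        using (var br = new BinaryReader(File.OpenRead("legacy.dat"), enc))
        {
            byte length = br.ReadByte();          // length prefix (0, 20 or 40)
            char[] chars = br.ReadChars(length);  // decoded using 'enc'
            string s = new string(chars).TrimEnd(' ');
            Console.WriteLine(s);
        }
    }
}
```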
 
J

Jon Skeet [C# MVP]

ASCII encoding won't work if the strings contain extended characters
(above 127). Use Encoding.GetEncoding(850) to get the encoding for
extended ASCII.

Well, use GetEncoding(850) to get one particular form of "extended
ASCII". Several different code pages have been called "extended ASCII"
over the course of time.
 
J

John

Thanks for all your replies.

Just to clarify...

"ASCII is 7-bit, and any character above 127 will need the proper encoding to be read right. I'm
assuming the characters are stored as 8 bit."

Sorry for being imprecise. When I said "Ascii (ie 8-bit) characters", I meant that they are stored as
bytes rather than 16-bit quantities as unicode requires. The Ascii characters are all 7-bit from a
quick glance but really I ought to write a quick test programme to check this, which would find and
flag up non-ascii characters. Then I could try to deduce what the encoding is. The original
DOS-based [stock control] programme that created the file was by a UK company and this software was
used on a PC in the UK to generate the files. Is there a standard type of encoding for the UK? Is
this likely to be extended ASCII, for which Göran Andersson suggested using
Encoding.GetEncoding(850)? Thanks Jon Skeet for your comment about extended Ascii. I suppose that
there could potentially be things like a letter e with an acute accent, and I don't want to mangle
these. Once I've discovered the encoding, I'm pretty certain that it will be consistent across all
the files.

Göran Andersson: "Does the length include the padding or not?"

It does include padding, so a string with three characters appears as byte 0x14 (ie 20) followed by
the three characters followed by 17 space characters. I will trim the string.



"John" <-> wrote in message Hi,

I'm a beginner is using C# and .net.

I have big legacy files that stores various values (ints, bytes, strings) and want to read them into
a C# programme so that I can store them in a database. The files are written by a late 1980's PC
Pascal programme, for which I don't have the source code. I've managed to reverse engineer the file
format.

The strings are stored as Ascii in the file, with the first byte indicating the string length, and
the rest are the Ascii (ie 8-bit) characters. The string length is always 0, 20 or 40 characters
(never any more) and strings are end-padded with space characters where necessary.

What is the best way to quickly read a string and get rid of the space padding at the end? To make
sure I can read them correctly, I'll put them in a text box. I assume the string used in a test box
uses 16-bit characters (unicode?) but I may be wrong here. When I'm happy I can read them correctly,
I'll get rid of the text box and store them directly in the database. Is it best to store it in the
database as unicode? I'm tempted to use Ascii for efficiency.

I was thinking of using a binary reader (_br) to extract from the file. That should be fine for
everything, but I don't know how to cope the the Ascii strings.
 
J

John

Thanks, that's very helpful.

"Read the length using ReadByte, then use the ReadChars method to get the string. You get an array
of Char, if you want a string just create one from the array."

I've just tried this. How do I create a string from an array of char?

The following didn't work - ToString() could not do the conversion:
string str;
char[] charArray;
for (...) {
    str = charArray.ToString();
}


This worked, but it seems very inefficient to have to create a new string every time:
string str;
char[] charArray;
for (...) {
    str = new string(charArray);
}



 
M

Morten Wennevik

String s = new String(chararray);


 
M

Morten Wennevik

Sorry, a bit fast on the send button there.

new String(charArray) is the way to go. It is not any less efficient than ToString would be since ToString would create a new string as well.

Array.ToString is not overridden, so it merely returns the type name of the object.
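A quick check of the difference Morten describes, in a minimal sketch:

```csharp
using System;

class Demo
{
    static void Main()
    {
        char[] charArray = { 'a', 'b', 'c' };
        // Array does not override ToString, so this prints the type name:
        Console.WriteLine(charArray.ToString());   // System.Char[]
        // The string(char[]) constructor copies the characters:
        Console.WriteLine(new string(charArray));  // abc
    }
}
```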



 
G

Göran Andersson

Here you can see the most common DOS code pages:

http://en.wikipedia.org/wiki/Codepage

Code page 850 is the most likely for a British computer.
 
J

John

Thanks Morten,

Does this mean that if I have say 1 million strings to read in, a new string must be allocated for
each? This must add substantial overhead. If so, what is the earliest time that the strings can be
garbage collected by the runtime - within the for loop or (sounds very inefficient) at the end of
the for loop?

I'm from a C background where I knew exactly what was happening, so what appears to me to be
happening with the C# code looks extremely inefficient, although my knowledge of what goes on behind
the scenes is poor, so I may be missing something.

I know that StringBuilder doesn't allocate new strings whenever a change is made to a string, so
would it be possible (and more efficient) to use this instead?



 
M

Morten Wennevik

Basically, each time you modify or read a string, a new string is created. Strings are unique in this matter. Once they are created they cannot be changed. Read Jon Skeet's article:

[Strings in .NET and C#]
http://www.yoda.arachsys.com/csharp/strings.html

If your goal is to assemble a long string out of several smaller ones you might benefit from using other methods, such as storing all the characters in an array until the complete sentence is read, before turning it into a string.

The StringBuilder is much better at concatenating several strings than string = string + string, but you won't notice any difference unless you have many strings to concatenate.

If a string might be split into N blocks, I would probably use ReadBytes and store the result in a premade array using Array.Copy/CopyTo or Buffer.BlockCopy.
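Morten's block-copy suggestion might be sketched like this (buffer size, block contents, and names are all illustrative):

```csharp
using System;
using System.Text;

class BlockAssembly
{
    static void Main()
    {
        // Premade buffer large enough for all expected blocks.
        byte[] buffer = new byte[80];
        int filled = 0;

        byte[] block1 = Encoding.ASCII.GetBytes("Hello, ");
        byte[] block2 = Encoding.ASCII.GetBytes("world");

        Buffer.BlockCopy(block1, 0, buffer, filled, block1.Length);
        filled += block1.Length;
        Buffer.BlockCopy(block2, 0, buffer, filled, block2.Length);
        filled += block2.Length;

        // Convert to a string only once, after all blocks are in place.
        string result = Encoding.ASCII.GetString(buffer, 0, filled);
        Console.WriteLine(result); // Hello, world
    }
}
```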
 
G

Göran Andersson

John said:
Thanks Morten,

Does this mean that if I have say 1 million strings to read in, a new string must be allocated for
each?
Yes.

This must add substantial overhead. If so, what is the earliest time that the strings can be
garbage collected by the runtime - within the for loop or (sounds very inefficient) at the end of
the for loop?

They can be garbage collected as soon as they are not used any more. If
you are creating the strings in a loop, it means that all strings that
you created can be collected, except the last one that you are still using.
I'm from a C background where I knew exactly what was happening, so what appears to me to be
happening with the C# code looks extremely inefficient, although my knowledge of what goes on behind
the scenes is poor, so I may be missing something.

It's normal for a .NET program to allocate and release quite some memory
during execution. The garbage collector handles the deallocation, and
also moves allocated objects so that the memory doesn't get fragmented.

To allocate and release objects is faster in a garbage collected
environment than in a traditional heap that uses reference counters.
I know that StringBuilder doesn't allocate new strings whenever a change is made to a string, so
would it be possible (and more efficient) to use this instead?

I don't think that using a StringBuilder just to avoid creating objects
is going to make any big difference in performance.

If you are using a StringBuilder anyway, there is an overload of the
Append method that takes a char array, so then you wouldn't need the
step of creating the string from the array.
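The Append overload Göran mentions, in a minimal sketch:

```csharp
using System;
using System.Text;

class AppendDemo
{
    static void Main()
    {
        var sb = new StringBuilder();
        char[] chunk = { 'a', 'b', 'c' };
        // Append(char[]) copies the characters without an intermediate string.
        sb.Append(chunk);
        // There is also an (array, startIndex, count) overload.
        sb.Append(chunk, 1, 2);
        Console.WriteLine(sb.ToString()); // abcbc
    }
}
```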
 
J

Jon Skeet [C# MVP]

Morten Wennevik said:
Basically, each time you modify or read a string, a new string is
created. Strings are unique in this matter.

They're not particularly unique - it's easy to create your own
immutable type, and often that can be a really good idea. (It makes it
easy to be thread-safe etc.)
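The kind of immutable type Jon mentions could look like this; the class is hypothetical, not from any poster's code:

```csharp
// An immutable value: all fields are readonly and set only in the
// constructor, so instances can be shared freely between threads.
public sealed class StockItem
{
    private readonly string code;
    private readonly int quantity;

    public StockItem(string code, int quantity)
    {
        this.code = code;
        this.quantity = quantity;
    }

    public string Code { get { return code; } }
    public int Quantity { get { return quantity; } }
}
```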
 
L

Larry Lard

John said:
Thanks for all your replies.

Just to clarify...

"ASCII is 7-bit, and any character above 127 will need the proper encoding to be read right. I'm
assuming the characters are stored as 8 bit."

Sorry for being imprecise. When I said "Ascii (ie 8-bit) characters", I meant that they are stored as
bytes rather than 16-bit quantities as unicode requires. The Ascii characters are all 7-bit from a
quick glance but really I ought to write a quick test programme to check this, which would find and
flag up non-ascii characters. Then I could try to deduce what the encoding is.

You will have to forgive the group for jumping on your comment... we
see a LOT of people who are completely oblivious to the many
complexities of encoding, so when someone says "ascii (ie 8-bit)" our
tripwires are tripped... it's rare (but nice) to have someone such as
yourself who actually knows that there are potential problems in store
in situations like this.
 
J

John

Thanks Larry, and everyone else for your very helpful comments. I'm pleased that the group jumped on
my comments. I've learnt quite a bit.

...but I can't help thinking that the way that C# handles strings is inefficient.

Göran Andersson commented: "to allocate and release objects is faster in a garbage collected
environment than in a traditional heap that uses reference counters."

OK, but in C, I would only allocate the string once, and it would be a fixed-length string of 41
characters (max num characters is 40 plus one extra character for the null terminator). I wouldn't
keep allocating it and de-allocating it - I would use it repeatedly, and if I'm using it a million
times in a for loop, then surely this would be much much faster than what C# does, since it doesn't
need to be allocated and de-allocated each time. Or am I missing something here?

Thanks again everyone for your help,

John


 
J

Jon Skeet [C# MVP]

Thanks Larry, and everyone else for your very helpful comments. I'm
pleased that the group jumped on my comments. I've learnt quite a
bit.

...but I can't help thinking that the way that C# handles strings is
inefficient.

Göran Andersson commented: "to allocate and release objects is faster
in a garbage collected environment than in a traditional heap that
uses reference counters."

OK, but in C, I would only allocate the string once, and it would be
a fixed-length string of 41 characters (max num characters is 40 plus
one extra character for the null terminator). I wouldn't keep
allocating it and de-allocating it - I would use it repeatedly, and
if I'm using it a million times in a for loop, then surely this would
be much much faster than what C# does, since it doesn't need to be
allocated and de-allocated each time. Or am I missing something here?

Yes - consider what's involved in allocating 40 characters. On the managed heap,
that involves increasing a pointer by 80 bytes, checking whether or not
it's exceeded the boundaries of the generation, and (assuming it
hasn't) zeroing out the memory. It's not a lot of work.
 
G

Göran Andersson

Jon said:
Yes - what's involved in allocating 40 characters. On the managed heap,
that involves increasing a pointer by 80 bytes, checking whether or not
it's exceeded the boundaries of the generation, and (assuming it
hasn't) zeroing out the memory. It's not a lot of work.

Indeed not a lot of work.

I made a simple test by creating a 40 character array, and creating
strings from that array in a loop. Creating a million strings took about
200 ms on my laptop (Pentium M 2.13 GHz).

Actually, as I was using the regular clock to time it, I had to make it
ten million strings to get a reasonable execution time. That means that
the 200 ms includes garbage collections also, not just allocating the
memory.
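A sketch of that kind of timing test, using Stopwatch rather than the wall clock (the iteration count is illustrative, and results will vary by machine):

```csharp
using System;
using System.Diagnostics;

class Bench
{
    static void Main()
    {
        char[] chars = new char[40];
        for (int i = 0; i < chars.Length; i++) chars[i] = 'x';

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 1000000; i++)
        {
            string s = new string(chars); // allocated, then garbage collected
        }
        sw.Stop();
        Console.WriteLine("Elapsed: {0} ms", sw.ElapsedMilliseconds);
    }
}
```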
 
J

John

Thanks again everyone. That's reassured me.

Best wishes,

John

 
