Compress ASCII text as Hex?

Ben Bloom

Hi -

I was speaking with someone who mentioned that it's possible to encode
an ASCII string as hex(?) in order to fit more data into the same number
of characters. Can anyone enlighten me?

The scenario is - I've got a CSV with a field that has a 16 character
limit. I need to fit potentially 24 ASCII characters into it.

Thanks.
-Ben
 
Nicholas Paldino [.NET/C# MVP]

Ben,

You can't do that unless you limit the range of characters that can be
used in the 24 character string. Without doing that, you have to accept the
full range of characters and you can't just squeeze them in there without
some loss.

Hope this helps.
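
To put a rough number on "limit the range": the sketch below estimates how large the input alphabet may be before 24 characters stop fitting into 16. It assumes every output position can hold one of `outputSymbols` distinct values (256 for a raw byte, 95 for printable ASCII); `FieldCapacity` and `MaxInputAlphabet` are names invented for this illustration.

```csharp
using System;

static class FieldCapacity
{
    // Largest alphabet size A for which A^inputLength <= outputSymbols^outputLength,
    // computed via bits of information. Uses floating point, so treat results
    // that sit exactly on a boundary with suspicion.
    public static int MaxInputAlphabet(int inputLength, int outputLength, int outputSymbols)
    {
        double bitsAvailable = outputLength * Math.Log(outputSymbols, 2);
        double bitsPerInputChar = bitsAvailable / inputLength;
        return (int)Math.Floor(Math.Pow(2, bitsPerInputChar));
    }
}
```

Under those assumptions, 24 characters fit into 16 raw bytes only if they come from about 40 distinct characters, and into 16 printable ASCII characters only if they come from about 20.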
 
Ben Bloom

Thanks Nicholas,

The 24 character string is a concatenation of a number (8-10 digits, I
believe) and two other string fields. Would I have more success if I
tried to shrink the number only?

-Ben
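
For the number on its own, writing it in a higher radix (hex, or base 36) does shorten it, though only slightly. A minimal sketch, assuming the number is a non-negative integer of at most 10 decimal digits and that the field accepts digits and uppercase letters; `RadixCodec` and its members are hypothetical names, not part of any library:

```csharp
using System;

static class RadixCodec
{
    public const string Hex = "0123456789ABCDEF";
    public const string Base36 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Writes value using the digits of the given alphabet (radix = alphabet length).
    public static string Encode(ulong value, string alphabet)
    {
        if (value == 0) return alphabet[0].ToString();
        ulong radix = (ulong)alphabet.Length;
        char[] buffer = new char[64]; // enough for any ulong in radix 2 or higher
        int pos = buffer.Length;
        while (value > 0)
        {
            buffer[--pos] = alphabet[(int)(value % radix)];
            value /= radix;
        }
        return new string(buffer, pos, buffer.Length - pos);
    }

    // Reverses Encode; throws if text contains a character outside the alphabet.
    public static ulong Decode(string text, string alphabet)
    {
        ulong radix = (ulong)alphabet.Length;
        ulong value = 0;
        foreach (char c in text)
        {
            int digit = alphabet.IndexOf(c);
            if (digit < 0) throw new FormatException("Unexpected character: " + c);
            value = value * radix + (ulong)digit;
        }
        return value;
    }
}
```

The largest 10-digit number, 9999999999, needs 9 characters in hex and 7 in base 36, so shrinking the number alone saves about 3 characters at best, not the 8 that are needed.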
 
Guest

If you are only using a subset of characters, you could try fitting two
characters into each character written to the CSV file.
Say, for example, you were only interested in character codes 0-127:
you could write the string "me", i.e. hex codes 6d and 65, into one character
[pseudo]
char c = (char)0x6d65; // 'm' (0x6d) in the high byte, 'e' (0x65) in the low byte
[/pseudo]

and write that single char to the text file.
Then, when you read it back, you break it up again.
Hope that helps.
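
The pairing idea above can be sketched roughly as follows. This assumes the input is 7-bit ASCII with no NUL characters, that the file is written in a Unicode-capable encoding, and that the 16-character limit counts characters rather than bytes; `PairPacker` is a made-up name for illustration.

```csharp
using System;
using System.Text;

static class PairPacker
{
    // Packs each pair of 7-bit ASCII characters into one UTF-16 char:
    // first character in the high byte, second in the low byte.
    public static string Pack(string ascii)
    {
        var sb = new StringBuilder((ascii.Length + 1) / 2);
        for (int i = 0; i < ascii.Length; i += 2)
        {
            char hi = ascii[i];
            // An odd-length input gets a NUL in the final low byte.
            char lo = i + 1 < ascii.Length ? ascii[i + 1] : '\0';
            if (hi > 0x7F || lo > 0x7F)
                throw new ArgumentException("Input must be 7-bit ASCII.", nameof(ascii));
            sb.Append((char)((hi << 8) | lo));
        }
        return sb.ToString();
    }

    // Splits every packed char back into its two ASCII characters,
    // dropping a NUL low byte (the padding added by Pack).
    public static string Unpack(string packed)
    {
        var sb = new StringBuilder(packed.Length * 2);
        foreach (char c in packed)
        {
            sb.Append((char)(c >> 8));
            char lo = (char)(c & 0xFF);
            if (lo != '\0') sb.Append(lo);
        }
        return sb.ToString();
    }
}
```

With this scheme a 24-character ASCII string becomes 12 chars, but most of them are CJK or other non-Latin code points, and in UTF-8 each one typically takes three bytes on disk.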
 
Jon Skeet [C# MVP]

Note that that will only work if your CSV file is written in a Unicode-
supporting encoding. There's also no absolute guarantee that it won't
end up forming invalid characters, or characters which the reader might
normalize to a different but equivalent form as far as Unicode is
concerned. I doubt that it'll be a problem, but it's worth bearing in
mind.
 
Guest

Yes, you're right.
Encoding could present a problem, but my description was also more than
slightly incorrect.
If we were limited to characters 0-127 of the ASCII table, then we
would be using 7 bits to represent each character, so every 7 bytes could
hold 8 characters: one extra character for every 7.
More trouble than it's worth, I guess.

regards
Brian.
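
A rough sketch of that 7-bit packing, assuming the input is strictly 7-bit ASCII and that the character count is stored or known separately; `SevenBitPacker` is a hypothetical name:

```csharp
using System;

static class SevenBitPacker
{
    // Writes the low 7 bits of each character back to back, most significant bit first.
    public static byte[] Pack(string ascii)
    {
        byte[] output = new byte[(ascii.Length * 7 + 7) / 8];
        int bitPos = 0;
        foreach (char c in ascii)
        {
            if (c > 0x7F)
                throw new ArgumentException("Input must be 7-bit ASCII.", nameof(ascii));
            for (int bit = 6; bit >= 0; bit--, bitPos++)
            {
                if (((c >> bit) & 1) != 0)
                    output[bitPos / 8] |= (byte)(0x80 >> (bitPos % 8));
            }
        }
        return output;
    }

    // Reads charCount groups of 7 bits back out of the packed bytes.
    public static string Unpack(byte[] packed, int charCount)
    {
        char[] chars = new char[charCount];
        int bitPos = 0;
        for (int i = 0; i < charCount; i++)
        {
            int value = 0;
            for (int bit = 0; bit < 7; bit++, bitPos++)
            {
                int b = (packed[bitPos / 8] >> (7 - bitPos % 8)) & 1;
                value = (value << 1) | b;
            }
            chars[i] = (char)value;
        }
        return new string(chars);
    }
}
```

Eight characters pack into 7 bytes, so 24 characters still need 21 bytes, which is well short of the target of 16.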
 
Jon Skeet [C# MVP]

Certainly when the only necessity is to squeeze 24 characters into 16
:)
 
James Curran

There's a method, but it's a bit snarky....

There is an encoding format called Base64 (sometimes loosely lumped together
with uuencoding, which is a different scheme with the same 3-bytes-to-4-characters
ratio). It takes fully binary data (0-255) and converts it to a set of 64
printable characters (digits, uppercase, lowercase, plus the two symbols + and
/). Since email messages were originally required to be pure printable text
(due to some ancient hardware, which is almost certainly no longer on the 'net),
attachments are typically Base64 encoded. It converts 3 binary bytes into 4
characters, so encoded blocks increase 33% in size.

So, how does this affect you? Well, as long as your "encoded" string meets
the criteria of Base64 encoding, you can "decode" it into a smaller block of
binary data. 4 characters will become 3 bytes, or in your case, 20
characters can become 15 bytes.

// Requires "using System;" and must run inside a method.
string origString = "123456,abcdef,ghijkl"; // 20 character CSV text

// Replace commas with plus signs so every character is in the Base64 alphabet.
// (This assumes the original text never contains a '+' of its own.)
string prepareText = origString.Replace(',', '+');
byte[] compressedText = Convert.FromBase64String(prepareText);
Console.WriteLine("Length of compressed text = {0}", compressedText.Length);
// Save compressedText to your store.
// :
// Later read it back
string alteredText = Convert.ToBase64String(compressedText);
string finalString = alteredText.Replace('+', ',');

Console.WriteLine("Text: {0}, this {1} the same as the original",
    finalString, finalString == origString ? "IS" : "IS NOT");

Running the above, I get:
Length of compressed text = 15
Text: 123456,abcdef,ghijkl, this IS the same as the original



--
Truth,
James Curran
[erstwhile VC++ MVP]
Home: www.noveltheory.com Work: www.njtheater.com
Blog: www.honestillusion.com Day Job: www.partsearch.com
 
Jon Skeet [C# MVP]

Yes... it does mean you can only have 63 distinct characters though
(IIRC, '=' is used for end padding, which you also need to work out).

It also doesn't get 24 characters down to 16 :( Possibly a combination
of that (if it all applies appropriately) with something clever to do
with the 8 digits (which can be represented as a 4 byte integer, which
should help) could help.
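
One way that combination might look, as a sketch only. It assumes the number fits in a `uint` (so at most 4294967295, which not every 10-digit number satisfies), that the two string fields together are exactly 16 characters drawn from the Base64 alphabet with no padding or whitespace, and that the destination can hold 16 arbitrary bytes, which a plain CSV field may well not. `CombinedPacker` is a made-up name.

```csharp
using System;

static class CombinedPacker
{
    // 4 bytes for the number (big-endian) followed by the Base64-decoded text.
    public static byte[] Pack(uint number, string text)
    {
        byte[] textBytes = Convert.FromBase64String(text);
        byte[] result = new byte[4 + textBytes.Length];
        result[0] = (byte)(number >> 24);
        result[1] = (byte)(number >> 16);
        result[2] = (byte)(number >> 8);
        result[3] = (byte)number;
        Array.Copy(textBytes, 0, result, 4, textBytes.Length);
        return result;
    }

    // Reverses Pack. Leading zeros of the original number are not preserved;
    // the caller would have to re-pad it, e.g. with number.ToString("D8").
    public static void Unpack(byte[] packed, out uint number, out string text)
    {
        number = ((uint)packed[0] << 24) | ((uint)packed[1] << 16)
               | ((uint)packed[2] << 8) | (uint)packed[3];
        text = Convert.ToBase64String(packed, 4, packed.Length - 4);
    }
}
```

Under those assumptions, 8 digits plus 16 text characters come out at exactly 4 + 12 = 16 bytes, but the bytes can take any value, including commas, quotes and line breaks, so they would still need escaping before going into a CSV file.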

It all sounds like something which should be redesigned rather than
munged like this though...
 
