using gzlib from c#

  • Thread starter: Bonj
Bonj said:
No! Because I didn't actually stop and think "oooh... what's a million
divided by thirty?" ah - about thirty-three thousand... mmm, that looks
wrong. I just ran the query and didn't at the time see any reason why it
wasn't right.

*Anything* being compressed to 30 bytes is pretty unusual...
No, because I wasn't particularly bothered about the exact compression ratio.

Fair enough.
I don't know - I can't see the point in that. I want to find out exactly how
big it's getting compressed to *in the database* - and ideally, I want the
database to tell me this info.

Well, I'm sure there's a way that the database can tell you how many
bytes are in the blob - but finding out how much actual disk space that
means is a different matter. That sounds somewhat more advanced. Maybe
I'm misunderstanding you though.
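
For what it's worth, here's a minimal sketch of asking SQL Server for the byte
count of a blob value via DATALENGTH - the connection string, table name,
column name and key are all placeholders of mine:

using System;
using System.Data;
using System.Data.SqlClient;

class BlobSizeCheck
{
    static void Main()
    {
        using (SqlConnection conn = new SqlConnection("your connection string here"))
        {
            conn.Open();
            // DATALENGTH returns the number of bytes stored in the value,
            // i.e. the logical size of the blob, not the disk space it occupies.
            SqlCommand cmd = new SqlCommand(
                "SELECT DATALENGTH(CompressedData) FROM MyBlobTable WHERE Id = @id", conn);
            cmd.Parameters.Add("@id", SqlDbType.Int).Value = 1;
            Console.WriteLine("Blob is {0} bytes", cmd.ExecuteScalar());
        }
    }
}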
OK, so it's obviously not 30 bytes. I see that now. But I've no doubts that
it's compressing it to at least some degree, probably in the region of
10%-25%, and come to that, I've no reason to believe it's losing anything or
compressing it any less than the managed library would.

I suspect it's doing pretty much exactly the same as the managed
library would, to be honest. I'd still be interested to find out what
went wrong with the managed code though. If you find the actual code
which failed, please let me know.
 
Hi Jon,

Reading this thread, I think I can guess what went wrong when Bonj was
trying to use the #ziplib. There is a little issue one can run into when not
reading the docs - the ZipInputStream.Read() doesn't always return the
amount of data requested, only a part. So the following doesn't work:

byte[] decompressed = new byte[decompressedSize];

zipStream.Read(decompressed, 0, decompressedSize);
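// note: the return value (the number of bytes actually read) is ignored here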

You need to call the Read method in a loop to get all the data, so I just
guess this might have been the problem...
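
Something like the following is the usual fix - just a sketch, and the
ReadExactly name is my own, but since ZipInputStream derives from Stream the
same loop works for any stream:

using System.IO;

// Reads exactly 'count' bytes, looping because Stream.Read (including
// ZipInputStream.Read) may return fewer bytes than requested on any call.
static byte[] ReadExactly(Stream stream, int count)
{
    byte[] buffer = new byte[count];
    int offset = 0;
    while (offset < count)
    {
        int read = stream.Read(buffer, offset, count - offset);
        if (read == 0)
            throw new EndOfStreamException("Stream ended before all data was read");
        offset += read;
    }
    return buffer;
}

byte[] decompressed = ReadExactly(zipStream, decompressedSize);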

Hope you'll sleep better now ;)

Stefan
 
Stefan Simek said:
Reading this thread, I think I can guess what went wrong when Bonj was
trying to use the #ziplib. There is a little issue one can run into when not
reading the docs - the ZipInputStream.Read() doesn't always return the
amount of data requested, only a part.

That's exactly the problem I predicted a few articles ago (Dec 3rd,
17:33 GMT) - it's horrendously common. The Stream.Read docs should have
in big, big writing "DO NOT IGNORE THE RETURN VALUE!" :(
Hope you'll sleep better now ;)

I'd rather actually *see* the bug though...
 
*Anything* being compressed to 30 bytes is pretty unusual...

Well, what about some data that was originally, say, about 120 bytes?
Well, I'm sure there's a way that the database can tell you how many
bytes are in the blob - but finding out how much actual disk space that
means is a different matter. That sounds somewhat more advanced. Maybe
I'm misunderstanding you though.

I know how to find out how much actual disk space the database is using, I
can even find out how much disk space the table is apparently using. I just
wanted to make sure it wasn't doing anything silly like allocating disk
space for the blobs only in pages of, say, 2KB min. per blob, 'cos if it is,
I want to find some other way of storing them other than in SQL that is more
efficient. Whilst I have no reason to suspect that is the case, I do want to
make sure.
I suspect it's doing pretty much exactly the same as the managed
library would, to be honest. I'd still be interested to find out what
went wrong with the managed code though. If you find the actual code
which failed, please let me know.

I will do, if I find it.
 
Well, what about some data that was originally, say, about 120 bytes?

If it's got a *lot* of redundancy in, maybe - but I just tried zipping
a file which consisted only of the letters "o" and "x", typed
reasonably randomly - and the size of the file within the zip file was
32 bytes.

The above paragraph (218 bytes) only compresses to 148 bytes.
Basically, longer files tend to have more redundancy in - compression
doesn't tend to work well on small files.
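
To see the effect, here's a small self-contained sketch - it uses the
framework's GZipStream from the v2.0 System.IO.Compression namespace rather
than #ziplib, and the sample text is just something I typed:

using System;
using System.IO;
using System.IO.Compression;
using System.Text;

class SmallFileDemo
{
    static void Main()
    {
        // A short input made only of 'o' and 'x', typed fairly randomly
        string text = "oxoxooxxoxooxoxxoooxoxoxxoxoxooxoxoxoxooxxoxoxoxooxox";
        byte[] original = Encoding.ASCII.GetBytes(text);

        byte[] compressed;
        using (MemoryStream buffer = new MemoryStream())
        {
            using (GZipStream gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                gzip.Write(original, 0, original.Length);
            } // disposing the GZipStream flushes the final compressed block
            compressed = buffer.ToArray();
        }

        // For inputs this small, the fixed gzip header and footer
        // (around 18 bytes) eat up much of any saving.
        Console.WriteLine("Original: {0} bytes, compressed: {1} bytes",
                          original.Length, compressed.Length);
    }
}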
I know how to find out how much actual disk space the database is using, I
can even find out how much disk space the table is apparently using. I just
wanted to make sure it wasn't doing anything silly like allocating disk
space for the blobs only in pages of, say, 2KB min. per blob, 'cos if it is,
I want to find some other way of storing them other than in SQL that is more
efficient. Whilst I have no reason to suspect that is the case, I do want to
make sure.

Right. I suspect blobs will be the most efficient way, but that you may
be able to tune the column in terms of page size or something similar.
I will do, if I find it.

Thanks.
 
That is not my point - I don't care what you do with your code. My point is
"did you try to compile it with a version 1.1 compiler", which is what it was
authored for anyway.
Now you are compiling with a pre-beta version of a non-released product
(the October CTP of VS Express), and the errors you get are due to:
1. The wrong command line being used to compile the VB files (errors).
2. Stricter compiler checks in v2.0 for implicit conversions (warnings).

Willy.
 
[ ... ]
No. I've told you that the restoration ISN'T failing. Believe me on this.
I've seen the file reproduced.
However I have discovered that my method of determining that it is only 30
bytes is flawed: I have done a test which proves that the particular check I
did returns 30 even when I have deliberately made the actual data more than
30 bytes. However I have no reason to suspect that it isn't compressing; when
I did it on a test file that was 55KB, it compressed it to 5KB.
But yes, you are right, 1MB to 30 bytes is ridiculous. So those of you who
are clamouring in a rush to steal my algorithm, don't bother - it isn't
*actually* that good.

[ ... ]

Perhaps it would be useful to step back from things for a moment.

Compression works by removing redundancy from data. To work
correctly, it can only remove as much redundancy as the data has to
start with -- and no real compressor is likely to remove _all_ the
redundancy either.

That makes one question obvious: how much redundancy does the data
have to start with? Shannon studied this a long time ago, and
concluded that in normal English text, you get about one bit of
entropy per character -- i.e. all but about one bit of each character
is redundant.

This means that with normal English represented as 8-bit
ANSI characters, about the best compression you can _possibly_ hope
for is that the compressed version be about 12.5% as large as the
original. If you have an algorithm that _seems_ to compress a lot
better than this, there are only a couple of possibilities: either
you're not measuring things correctly and not getting what you think
you are, or else the input you're giving it has substantially more
redundancy than normal English.
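
As a rough illustration of that arithmetic (one bit of entropy out of eight
bits per character is 12.5%), here's a sketch of my own that computes an
order-zero entropy estimate. It only counts single-character frequencies, so
it reports roughly four bits per character for English text and therefore
*overstates* the entropy - Shannon's ~1 bit/char figure also exploits the
context between characters:

using System;
using System.Collections.Generic;

class EntropyEstimate
{
    static void Main()
    {
        string text = "this is a rough order-zero estimate of entropy per character";

        // Count how often each character occurs
        Dictionary<char, int> counts = new Dictionary<char, int>();
        foreach (char c in text)
        {
            int n;
            counts.TryGetValue(c, out n);
            counts[c] = n + 1;
        }

        // H = -sum over characters of p(c) * log2 p(c)
        double bitsPerChar = 0.0;
        foreach (int count in counts.Values)
        {
            double p = (double)count / text.Length;
            bitsPerChar -= p * Math.Log(p, 2);
        }

        Console.WriteLine("Order-0 entropy: {0:F2} bits/char ({1:P0} of 8 bits)",
                          bitsPerChar, bitsPerChar / 8);
    }
}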

Now, it's true that computer programs (for example) typically ARE
somewhat more redundant than English -- you have a smaller
vocabulary, often a lot of more or less predictable indentation,
etc. At the same time, the 12.5% is a rough approximation of a
theoretical optimum, NOT something that's likely to be achieved on a
regular basis.

Of course, if your input is substantially different from English
text, that theoretical optimum changes (to at least some degree). It
seems surprisingly consistent across most natural languages -- higher
for some and lower for others, but the variations are fairly small.

Other types of input do often have much higher redundancy -- audio
and video streams being a couple of the most obvious examples. These,
however, have so much redundancy that most file formats for either
already use at least some sort of compression.
 