using gzlib from c#

  • Thread starter: Bonj
The native one does come with a specification that says the output buffer
must be at least 0.1% larger than the input, plus 12 bytes. So I allocate
((1.001 * inputsize) + 12) bytes, but the actual number of bytes received is returned.
 
Jon Skeet said:
Could you give us the exact URL?

I *think* it could have been this one
http://www.planetsourcecode.com/vb/scripts/ShowCode.asp?txtCodeId=2115&lngWId=10

I suspect it will actually take *less* effort to get the managed code
working, given my experiences with interop...

Well it'd have gone negative and would have to be *giving me back* effort
then, because I've already got the unmanaged code working, after the first
attempt.
An entirely managed solution, which means you don't require as many
security permissions to use it

Well, you apparently wouldn't get that if you used the
SuppressUnmanagedCodeSecurityAttribute().
, you don't need to worry about leaking
memory in unmanaged code

Right - so you're saying the managed version is bug-free, but the unmanaged
one has got memory leaks?
, the threading implications are easier to
understand, you can step in with the debugger if necessary, etc.


And that's what suggests that it's the use of it which is wrong rather
than the library.


Why not apply the same logic to everything else though? Why bother
using FileStream when you can just p/invoke fopen somehow? Managed code
is simply easier to use.

That may be the case - however, it's not the using of it that I'm concerned
about, it's what's actually going on inside it. Too many bad programs are
out there that haven't been coded especially badly, but whose lazy programmers
have used incompatible components, and I'm determined that mine isn't going to
fall victim to that. That said, I normally wouldn't have a
problem with, say, a third-party component (one that I had "wheeled in" to do
some task that I couldn't do well myself) that had been around for, say, two
years and been written from the ground up in managed code in its own right, and
in an instance such as this I would normally use it over a similarly evolved
unmanaged component, given the same functionality and suitability.
Writing my own algorithm was not really a viable option for this, as it
would have been too much effort given the size of the project, and
as you point out elsewhere, things like encryption and compression are
hard, so it is better to use the algorithm of someone who has studied this
as a profession (or at least to as good a standard). Not wanting to
boast, but since it's something that I myself can't personally do (well), it
seems to me that it is perhaps generally quite a hard algorithm to construct
to a reasonably robust standard - so it makes sense to find the best library,
the one that has had the most work done on its algorithm.
To this end, I researched this zlib thing. It seemed like what had happened
is that some people had honed an algorithm over a number of versions. That
seemed like a sign that it was something that would probably be
trustworthy. It then seems that when managed code came out, some *other*
people took the source for the original unmanaged algorithm (since they
made it available) and "converted" it to .NET. Now, a lot of this conversion
process can only be presumed to be work 'bending' the interfaces of
the code so it can marshal byte arrays to streams, and the rest to be
copying the existing (probably C) code into managed syntax that can work with
the new object-oriented interface.
It's not this object-oriented interface I have beef with, it's the damage
that would have been done to the algorithm by the process of converting the
syntax to managed code *as well* as giving it a new interface. I decided
that, best case scenario, this might not be true at all - they may have done
it very well. So I thought, 'well, if the first test I do on it works - then
they have proved me wrong. But otherwise, my suspicions are confirmed'. So I
did the test, and on decompression half the text file that I'd pasted into
the test program was missing. I looked at the test program to see if I could
spot any errors, and if so fix them to prove the anticipated success of the
library. I couldn't, so I thought that no matter what the interfaces are
like, if I can't spot any problems with someone else's (fairly easy to
read) code, then I don't see how I can write my own program to use this
library.
Well, you're saying that his code is buggy...

Well - maybe that's a bit strong. I'm not saying it's buggy - I'm just
saying that it *possibly* is. The reason it doesn't work could be a number
of things. But if I've got one solution using unmanaged code that 'works',
and one using pure managed that
'doesn't-work-yet-but-might-if-I-faff-around-with-it', I'm going to use the
one that works now.
No, it's done because streams are the idiomatic way of working with
what are naturally streams of data. The fact that it *does* use streams
is what enabled me to allow a Compact Framework application I was
working on to accept compressed data with about 5 lines of code - I
just wrapped the Stream I was given back by HttpWebResponse in a
GzipInputStream. Very easy, and elegant.
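
For reference, a rough sketch of that pattern (class and namespace names as in SharpZipLib; the URL and surrounding code are only illustrative):

using System;
using System.IO;
using System.Net;
using ICSharpCode.SharpZipLib.GZip;

class CompressedDownload
{
    static void Main()
    {
        // The URL is a placeholder; the server is assumed to return gzip data.
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://example.com/data.gz");
        HttpWebResponse response = (HttpWebResponse)request.GetResponse();
        try
        {
            // Wrap the raw response stream in a decompressing stream and read as usual.
            using(Stream unzipped = new GZipInputStream(response.GetResponseStream()))
            using(StreamReader reader = new StreamReader(unzipped))
            {
                Console.WriteLine(reader.ReadToEnd());
            }
        }
        finally
        {
            response.Close();
        }
    }
}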

That's great - you've obviously got it to work to such an extent that it
will work on a pocket pc, so for its sake and your sake I'm pleased. But
there is no reason why it would be any better for me than unmanaged code at
all.
Enjoy your time with interop then. I know I wouldn't...

Well, I don't see why I wouldn't - you see, remember this is a service that
I only decided to build for just my own home PC, because a whole C file got
wiped in a power cut - it will be built once, never touched again, and just
left there choodling away, keeping all source code backed up. It's not a big
distributed application for work or anything that *must* be as elegant as
possible.
 
Well, you're a very funny kind of person.

And? That in itself doesn't make me a bad programmer.
First, you are going to write your own encryption,

Well no, actually that's where you're wrong. If you'd read the whole of my
recent thread about encryption "how good an encryption algorithm is this"
(100+ posts) you'd have discovered that I was trying to solicit ideas as to
how practicable an idea it was to use my own algorithm, but was convinced
that it wouldn't be a good idea, and decided to use DPAPI instead.
own compression

The very existence of this thread contradicts that.
and whatever else, because you think you'll do better than people whose
products are widespread.

Wrong again. The difference being that I'm willing to use *some* components
developed by other people that are widespread, but I'm discerning in that I
don't believe that just because a component is widespread, it's
automatically going to be better than my own component. And at the end of
the day, please don't insult my intelligence by claiming that there is
no-one at all out there that's less intelligent than I am or a worse
programmer than I am, that has managed to widely distribute their component,
and therefore that I should put my trust in any component I find on the
internet, safe in the knowledge that because it's made it onto the internet
it *must* be better than anything I could cook up myself.

On the other hand, you won't use a library because it's only TWO YEARWS
since people begun to use it?

No.
I won't use it because it hasn't successfully proved itself to me in a nice
short time.

Second arrogance: I don't see why I should spend very much of my *own* time
researching some interface or other that is *only* related to a particular
piece of *somebody else*'s work.
Like I say - compression is quite a hard thing to do. And it's not entirely
unfeasible that even very clever people have failed to do it well - I'm not
saying SharpDevelop fits into this category - it might just be that I was
unlucky and happened to download a bad test project.

And also,

I won't use a library because it's only two years since people began to use
it *when there's an equivalent one that's been around for a lot longer than
two years*.
I'm not a ".NET snob" - I don't think that all unmanaged code is to be
untrusted because it doesn't have the benefit of a GC and therefore might
succumb to memory leaks. I recognise the benefit of .NET - which is that
having a GC means you can write code faster - not necessarily better. The
fact that I have to free() / delete stuff that I've malloc'd / new'd doesn't
mean that I'm going to forget to do it just because it's inessential to the
operation of the program, thereby making the program worse; it just means
that I'm going to have to spend a bit of time checking that I have made such
clean-up calls.
This, and other RAD characteristics of .NET, enable you to write complex
functionality in a short time. Since this is a project just for myself,
I am more than happy to take advantage of .NET for it* - however, I
personally feel that using something *just* because it is written in .NET is
a silly decision, and one that would make me a ".NET snob".

You've admitted that you think the only reason why I should use the
SharpDevelop version of the zlib library is so I can have my project purely
in managed code. And I'm answering back by saying the only reason I'm using
the older, unmanaged version of the zlib software is so I can have my
project purely in *good* code.
Just for fun, I've used the SharpZipLib yesterday for the first time. I've
made several tests and it didn't fail me even once. It works perfectly for
anything I could try.

How glib of you. (I mean, how gzlib of you.)


*If you want to see why I wouldn't use C# for anything I plan to sell, see
http://groups.google.com/groups?q=g:thl2660319408d&dq=&hl=en&lr=&selm=#WPRkX$PCHA.2568@tkmsftngp10

Stefan Simek said:
My fingers got tied up... ;)

... *two years*

Stefan

Stefan Simek said:
Well, you're a very funny kind of person.

First, you are going to write your own encryption, own compression and
whatever else, because you think you'll do better than people whose
products are widespread.

On the other hand, you won't use a library because it's only TWO YEARWS

Bonj said:
In addition, SharpZipLib is developed by "SharpDevelop" - I've tried the
"SharpDevelop" "IDE" before and found it appalling, so I don't actually
trust anything written by them.
It set up a load of bogus users with random strings as their names, and
made them owners of its directory structure, I had to fiddle around removing
them all after I'd uninstalled it. Cygwin did exactly the same thing (only to
its own directory structure though). (GNU people trying their hand at
Microsoft stuff? What's that all about anyway?)
And whoever's written "SharpZipLib" - however well they've tried to convert
it, it can only be a maximum of two years old - as that's how old C# is.
The unmanaged version looks like it's been around a lot longer than that,
and seems to have gone through numerous versions and bug fixes.


:

You can avoid using unsafe code:

[DllImport("zlib1.dll", CallingConvention=CallingConvention.Cdecl)]
static extern short compress(IntPtr dest, ref uint destLen, IntPtr
source,
uint sourceLen);

short compress(byte[] dest, ref uint destLen, byte[] source, uint
sourceLen)
{
// do some checking on null references...

GCHandle hDest = GCHandle.Alloc(dest, GCHandleType.pinned);
GCHandle hSource = GCHandle.Alloc(source, GCHandleType.pinned);

try
{
return compress(hDest.AddrOfPinnedObject(), ref destLen,
hSource.AddrOfPinnedObject(), sourceLen);
}
finally
{
hSource.Free();
hDest.Free();
}
}

But I would recommend using the SharpZipLib anyway.

HTH,
Stefan

"Bonj" <benjtaylor at hotpop d0t com> wrote in message
Thanks, that seems to work.

"Bonj" <benjtaylor at hotpop d0t com> wrote in message
I tried it with
[DllImport("zlib1.dll", CallingConvention=CallingConvention.Cdecl)]
static extern Int16 compress(ref byte[] dest, ref uint destlen, byte[] source, int sourcelen);
and it still does the same thing, bombs on the call to that method.

I also tried it with the return value as int (Int32) and that did the
same thing as well.


"Bonj" <benjtaylor at hotpop d0t com> wrote in message
I downloaded the gzlib library from zlib in order to do compression.
(http://www.gzip.org/zlib)
The prototype of the compression function seems to be
int compress (Bytef *dest, uLongf *destLen, const Bytef *source, uLong sourceLen);
It is meant to be called by C, but I would rather use it from C#.

So I wrote the following C# program to test it, but it failed to work.
The call to compress doesn't return or throw an exception, it simply
bombs the program. I'm probably calling it wrong, but have no idea why.

This is the program:
using System;
using System.IO;
using System.Security;
using System.Runtime.InteropServices;

class class1
{
    // int compress (Bytef *dest, uLongf *destLen, const Bytef *source, uLong sourceLen);
    [SuppressUnmanagedCodeSecurityAttribute()]
    [DllImport("zlib1.dll")]
    static extern Int16 compress(ref byte[] dest, ref uint destlen, byte[] source, int sourcelen);

    static void Main(string[] args)
    {
        if(args.Length != 2)
        {
            Console.WriteLine("Usage: testgz <inputfile> <outputfile>");
            return;
        }
        try
        {
            int filelen = (int)((new FileInfo(args[0])).Length);
            uint outputlen = (uint)Math.Ceiling(1.001 * filelen) + 12;
            using(FileStream fsr = new FileStream(args[0], FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                byte[] inputbytes = new byte[filelen],
                       outputbytes = new byte[outputlen];
                fsr.Read(inputbytes, 0, filelen);
                if(compress(ref outputbytes, ref outputlen, inputbytes, filelen) == 0)
                    using(FileStream fsw = new FileStream(args[1], FileMode.Create, FileAccess.Write, FileShare.Read))
                    {
                        fsw.Write(outputbytes, 0, (int)outputlen);
                        fsw.Close();
                    }
                fsr.Close();
            }
        }
        catch(Exception ex)
        {
            Console.WriteLine(ex.ToString());
        }
    }
}
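
A likely culprit in the declaration above is the ref byte[] dest parameter: ref byte[] is marshalled as a pointer to the array reference rather than a pointer to the array's contents, so compress() ends up writing through the wrong pointer (and this first attempt also omits the Cdecl calling convention). A declaration along these lines should avoid both problems without unsafe code or manual pinning - offered only as a sketch, not tested against the program above:

// With a plain byte[] parameter the marshaller pins the array for the duration
// of the call and passes a pointer to its first element, which is what
// compress() expects. zlib's uLong maps to uint on 32-bit Windows, and the
// calling convention must be Cdecl.
[DllImport("zlib1.dll", CallingConvention=CallingConvention.Cdecl)]
static extern int compress(byte[] dest, ref uint destLen, byte[] source, uint sourceLen);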
 
Although it was a good suggestion you gave me to use AddrOfPinnedObject to
avoid unsafe code, thanks

Stefan Simek said:
My fingers got tied up... ;)

... *two years* since people begun to use it?

Just for fun, I've used the SharpZipLib yesterday for the first time. I've
made several tests and it didn't fail me even once. It works perfectly for
anything I could try.

Stefan

 

Right. Could you email me the file that you were trying to compress,
and how you used the VB.NET project to compress it? I would like to get
to the bottom of this if possible, just for my own personal reasons.

Having had a quick look at it, it doesn't look like *nice* code (he's
resizing his own arrays painstakingly rather than using a MemoryStream,
for instance) and calling Flush on the stream but not Close, but I
can't immediately see any actual bugs. (Admittedly I'm not as adept at
reading VB.NET as I might be, so there might be a bug I've missed.)
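
For comparison, a minimal sketch of the MemoryStream-based shape being described, using SharpZipLib's GZipOutputStream (this is not the code from the downloaded sample):

using System.IO;
using ICSharpCode.SharpZipLib.GZip;

class GZipHelper
{
    public static byte[] CompressBytes(byte[] input)
    {
        // Let MemoryStream do the growing instead of resizing arrays by hand.
        MemoryStream buffer = new MemoryStream();
        GZipOutputStream gzip = new GZipOutputStream(buffer);
        gzip.Write(input, 0, input.Length);
        // Close (not just Flush) so the gzip trailer is actually written out.
        gzip.Close();
        return buffer.ToArray();
    }
}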
Well it'd have gone negative and would have to be *giving me back* effort
then, because I've already got the unmanaged code working, after the first
attempt.

Just because you've got it working to start with doesn't mean that's
the end of the matter, of course...
Well, you apparently wouldn't get that if you used the
SuppressUnmanagedCodeSecurityAttribute().

I think you need to read the docs for that more carefully:

<quote>
The demand for the UnmanagedCode permission will still occur at link
time. For example, if function A calls function B and function B is
marked with SuppressUnmanagedCodeSecurityAttribute, function A will be
checked for unmanaged code permission during just-in-time compilation,
but not subsequently during run time.
Right - so you're saying the managed version is bug-free, but the unmanaged
one has got memory leaks?

Nope - I'm saying that when you're calling from managed code to
unmanaged code, it's fairly easy to leak memory (or corrupt memory).

Well - maybe that's a bit strong. I'm not saying it's buggy

You *did* say it's buggy. You wrote:

<quote>
Nah - I've tried it, and it's not lossless.
</quote>

It's clearly *meant* to be lossless, and you're saying it isn't,
therefore you're saying it's buggy.
That's great - you've obviously got it to work to such an extent that it
will work on a pocket pc, so for its sake and your sake I'm pleased. But
there is no reason why it would be any better for me than unmanaged code at
all.

Except for all the reasons I've already mentioned...
Well, I don't see why I wouldn't

Because I've had problems with interop before. It can be a tricky
business - in particular, if and when it *does* go wrong, working out
exactly *how* it went wrong can be very nasty.
 
Right. Could you email me the file that you were trying to compress,
and how you used the VB.NET project to compress it? I would like to get
to the bottom of this if possible, just for my own personal reasons.

Well, I don't think it could have been that one, because it won't even
compile now.
The results I get from compiling it with

vbc /r:icsharpcode.sharpziplib.dll /debug *.vb

are

Microsoft (R) Visual Basic .NET Compiler version 8.0.40607.42
for Microsoft (R) .NET Framework version 2.0.40607.42
Copyright (C) Microsoft Corporation 1987-2003. All rights reserved.

vbc : error BC30420: 'Sub Main' was not found in 'AssemblyInfo'.
C:\Documents and Settings\TheMagicBonj\My
Documents\Installers\Downloaded\VB_NET_Str1712922242004\String Compression w
SharpZipLib\fAbout.vb(124) : error BC30451: Name 'Process' is not declared.

Process.Start("http://www.icsharpcode.net/OpenSource/SharpZipLib/Default.aspx")
~~~~~~~
C:\Documents and Settings\TheMagicBonj\My
Documents\Installers\Downloaded\VB_NET_Str1712922242004\String Compression w
SharpZipLib\fStringCompression.vb(297) : error BC30451: Name 'Process' is
not declared.

Process.Start("http://www.icsharpcode.net/OpenSource/SharpZipLib/Default.aspx")
~~~~~~~
C:\Documents and Settings\TheMagicBonj\My
Documents\Installers\Downloaded\VB_NET_Str1712922242004\String Compression w
SharpZipLib\fStringCompression.vb(329) : warning BC42016: Implicit
conversion from 'Integer' to 'Microsoft.VisualBasic.MsgBoxStyle'.

MsgBox("Text returned uncorrupted.", MsgBoxStyle.OKOnly +
MsgBoxStyle.Information, "Success")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\Documents and Settings\TheMagicBonj\My
Documents\Installers\Downloaded\VB_NET_Str1712922242004\String Compression w
SharpZipLib\fStringCompression.vb(331) : warning BC42016: Implicit
conversion from 'Integer' to 'Microsoft.VisualBasic.MsgBoxStyle'.

MsgBox("Text returned corrupted.", MsgBoxStyle.OKOnly +
MsgBoxStyle.Exclamation, "Failure")
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

But I *really* still can't see what I would gain by trying to get it
working...
Having had a quick look at it, it doesn't look like *nice* code (he's
resizing his own arrays painstakingly rather than using a MemoryStream,
for instance) and calling Flush on the stream but not Close, but I
can't immediately see any actual bugs. (Admittedly I'm not as adept at
reading VB.NET as I might be, so there might be a bug I've missed.)


Just because you've got it working to start with doesn't mean that's
the end of the matter, of course...

Well... perhaps when (if) it breaks, I'll research the managed version more,
until then...
Perhaps you could point out some way in which I could analyze my program
using the unmanaged version, to see where there may be a problem in it, a
problem that I have deduced doesn't exist due to having seen it doing its
job and thus apparently working properly? The URL of your favourite free
memory leak finder, for instance - that would prove to me that your fears
about there being leaks in the interop were well-founded?

I think you need to read the docs for that more carefully:

<quote>
The demand for the UnmanagedCode permission will still occur at link
time. For example, if function A calls function B and function B is
marked with SuppressUnmanagedCodeSecurityAttribute, function A will be
checked for unmanaged code permission during just-in-time compilation,
but not subsequently during run time.
</quote>

Right, so there will still be some checking, just not as much. Fair enough.
I can cope with that.
Nope - I'm saying that when you're calling from managed code to
unmanaged code, it's fairly easy to leak memory (or corrupt memory).

Well - looking at it logically, all that is happening on the call is that a
piece of memory is being allocated by managed code, and a GCHandle is pinned
to this in order for the unmanaged code to write to that block of memory.
Since the managed code allocated this memory, it knows when the unmanaged
function has returned, but furthermore it also knows explicitly when it is
allowed to mark the memory as GCable again, through a call to GCHandle.Free.
Surely this tells the runtime that that piece of memory is now safely back
on managed shores? If *not*, then (a) is that not a bit dumb of the runtime,
and (b) what is the point of the GCHandle.Free call then. I can't see any
other way for the interaction to be any less watertight, unless the
unmanaged code allocated memory that it did not free, which would imply
something down to the writers of the unmanaged DLL, rather than the
interaction.
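
For comparison, the same pinning scope written with unsafe code looks roughly like this (a sketch that reuses the IntPtr-based extern declaration from Stefan's post, and assumes the project is compiled with /unsafe):

// The buffers stay pinned only for the duration of the fixed blocks, which
// mirrors the GCHandle.Alloc / GCHandle.Free pair in the earlier wrapper.
static unsafe short CompressUnsafe(byte[] dest, ref uint destLen, byte[] source, uint sourceLen)
{
    fixed(byte* pDest = dest)
    fixed(byte* pSource = source)
    {
        return compress((IntPtr)pDest, ref destLen, (IntPtr)pSource, sourceLen);
    }
}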
Are you saying that *all* interaction between managed and unmanaged code is
to be avoided, because of possible memory leaks? If not, then where do you
draw the boundary - what is especially dangerous about this particular
example?
You *did* say it's buggy. You wrote:

<quote>
Nah - I've tried it, and it's not lossless.
</quote>

It's clearly *meant* to be lossless, and you're saying it isn't,
therefore you're saying it's buggy.

Well, OK, I did - if you want to be black and white about it. But it is more
the grey area in between I was getting at - which was that my *perception*
was that it *may* be buggy, it *may* be perfectly sound, it may just be
reasonably unbuggy but easy to write a buggy wrapper round. It may be a
combination of all three. Who knows. I fully accept your point that you
personally and others have used it with much success (and yes, this is
evidence for the case that it is unbuggy) - this proves that it is
*possible* to write code that uses it effectively. But in my opinion, like I
say, it should not be that clear cut. There is a difference between
"*possible* to use effectively " and "foolproof - therefore excellent ".

Because I've had problems with interop before. It can be a tricky
business - in particular, if and when it *does* go wrong, working out
exactly *how* it went wrong can be very nasty.

Like I say before - when *is* it OK?
 
Bonj said:
Well, I don't think it could have been that one, because it won't even
compile now.
The results I get from compiling it with

vbc /r:icsharpcode.sharpziplib.dll /debug *.vb

are
You are compiling this with a Beta version of .NET.
Did you ever try to do this using a released product version?

Willy.
 
You are compiling this with a Beta version of .NET.
Did you ever try to do this using a released product version?

No - not at all. This is a product for myself only, not to sell - so I can
rest assured that any code I write is backed up, without me having to
do anything and without having to get loads of disk space (as well as being
able to keep it all in a 2GB-max MSDE database (which before you ask, yes,
it is the Yukon beta)). No other purpose.
Besides, I wouldn't do a "released product version" in *any* version of
.NET, for the reasons discussed here
http://groups.google.com/groups?q=g:thl2660319408d&dq=&hl=en&lr=&selm=#WPRkX$PCHA.2568@tkmsftngp10
No offence to .NET though - it is *great* for in-house, personal and free
software. It's only saleable applications the above thread is relevant to.
 
Interestingly, it seems ALL text files, no matter how big originally, are
compressed to 26 bytes, which plus a 4-byte header containing the
uncompressed size means they are all 30 bytes in the database. There's a
file that's 956,990 bytes uncompressed, and 30 bytes when compressed.
 
Bonj said:
Interestingly, it seems ALL text files, no matter how big originally,
are compressed to 26 bytes, which plus a 4-byte header containing the
uncompressed size means they are all 30 bytes in the database.
There's a file that's 956,990 bytes uncompressed, and 30 bytes when
compressed.

....which is nonsense. If you're only getting 30 bytes regardless of the
input size, you don't have a working compression system.

-cd
 
Well, I don't think it could have been that one, because it won't even
compile now.

Fair enough.

But I *really* still can't see what I would gain by trying to get it
working...

A wholly managed application, which I believe is an easier beast to
work with.
Well... perhaps when (if) it breaks, I'll research the managed version more,
until then...
Perhaps you could point out some way in which I could analyze my program
using the unmanaged version, to see where there may be a problem in it, a
problem that I have deduced doesn't exist due to having seen it doing its
job and thus apparently working properly? The URL of your favourite free
memory leak finder, for instance - that would prove to me that your fears
about there being leaks in the interop were well-founded?

I don't have a favourite memory leak finder, I'm afraid - but the main
problem is that while you may have it working in a small test
environment, the kind of mistakes which can occur in interop may well
only show themselves in a more stressful situation.
Right, so there will still be some checking, just not as much. Fair enough.
I can cope with that.

There's just as much checking in terms of you need just as much
security permission - it's just that it only gets checked once. In
other words, you'll still find it a pain to run it in various
situations.
Well - looking at it logically, all that is happening on the call is that a
piece of memory is being allocated by managed code, and a GCHandle is pinned
to this in order for the unmanaged code to write to that block of memory.
Since the managed code allocated this memory, it knows when the unmanaged
function has returned, but furthermore it also knows explicitly when it is
allowed to mark the memory as GCable again, through a call to GCHandle.Free.
Surely this tells the runtime that that piece of memory is now safely back
on managed shores? If *not*, then (a) is that not a bit dumb of the runtime,
and (b) what is the point of the GCHandle.Free call then. I can't see any
other way for the interaction to be any less watertight, unless the
unmanaged code allocated memory that it did not free, which would imply
something down to the writers of the unmanaged DLL, rather than the
interaction.

It may well be fine. It may well not. I'm not enough of an expert on
interop to say for sure - and that's the kind of thing that worries me.
Are you saying that *all* interaction between managed and unmanaged code is
to be avoided, because of possible memory leaks? If not, then where do you
draw the boundary - what is especially dangerous about this particular
example?

I'm saying that you should avoid it whenever possible - when, for
example, there's a tried, tested and well-accepted library which does
the same thing in managed code.
Well, OK, I did - if you want to be black and white about it. But it is more
the grey area in between I was getting at - which was that my *perception*
was that it *may* be buggy, it *may* be perfectly sound, it may just be
reasonably unbuggy but easy to write a buggy wrapper round. It may be a
combination of all three. Who knows. I fully accept your point that you
personally and others have used it with much success (and yes, this is
evidence for the case that it is unbuggy) - this proves that it is
*possible* to write code that uses it effectively. But in my opinion, like I
say, it should not be that clear cut. There is a difference between
"*possible* to use effectively " and "foolproof - therefore excellent ".

The interesting thing is that when/if we find out where you went wrong
with the managed code, we may find that the same mistake would have
applied to the unmanaged code. Are you saying it's impossible to get
the interop wrong?
Like I say before - when *is* it OK?

When there's no other choice, IMO. And when there *is* no other choice,
hopefully other people have done it before and already written a
managed wrapper around it, which has been tested by lots of other
people. Any time I do interop myself without getting lots of peer
review, I get nervous...
 
So.... what results do you get, how big does it compress to for you?
For example the file instcat.sql which on my system is in c:\windows\system32.
It is about a megabyte in size. When compressed, it's 30 bytes. *And it
restores perfectly*. How *isn't* that working....?

Tell me how I can prove you right ... because currently I don't know.....

I'd *love* for you to tell me a way in which I can prove my system flawed,
then I could believe that getting an all-.NET library and learning how to use
that would be a good idea...but currently I don't!

I suspect it's that the compressor is working at its optimum rate, and is
getting it down as much as it can - but it *needs* at least 26 bytes to store
any file.
 
Bonj said:
So.... what results do you get, how big does it compress to for you?
For example the file instcat.sql which on my system is in c:\windows\system32.
It is about a megabyte in size. When compressed, it's 30 bytes. *And it
restores perfectly*. How *isn't* that working....?

However you're testing the restoration is failing, basically. What's
your current methodology?
Tell me how I can prove you right ... because currently I don't know.....

Common sense should tell you that any compression system which
supposedly compresses several different large files all to 26 bytes is
almost certainly too good to be true.
I'd *love* for you to tell me a way in which I can prove my system flawed,
then I could believe that getting an all-.NET library and learning how to use
that would be a good idea...but currently I don't!

I suspect it's that the compressor is working at its optimum rate, and is
getting it down as much as it can - but it *needs* at least 26 bytes to store
any file.

Zipping the instcat.sql on my box (which is also about a meg)
compresses it down to 64K. That's actually pretty good - 93%
compression - but do you really think it's likely that the method
you're using is 500 times better? On all text files?

Can you take that compressed file to another machine, decompress it,
and still get the same file out? Try it on a web page or something (the
printable version of my threading article is a nice long one -
http://www.pobox.com/~skeet/csharp/threads/printable.shtml). Download
it on computer A, compress it, transfer the compressed file to computer
B, and decompress it - never having visited the page on computer B.
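
A small sketch of that kind of check, comparing the restored copy byte-for-byte with the original (file paths are placeholders; this does the same job as running fc):

using System;
using System.IO;

class RoundTripCheck
{
    static void Main(string[] args)
    {
        // args[0] = original file, args[1] = file restored after compress/decompress.
        using(FileStream a = File.OpenRead(args[0]))
        using(FileStream b = File.OpenRead(args[1]))
        {
            if(a.Length != b.Length)
            {
                Console.WriteLine("Different lengths - restoration failed.");
                return;
            }
            int x, y;
            do
            {
                x = a.ReadByte();
                y = b.ReadByte();
            } while(x == y && x != -1);
            Console.WriteLine(x == y ? "Files are identical." : "Files differ.");
        }
    }
}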
 
I don't have a favourite memory leak finder, I'm afraid - but the main
problem is that while you may have it working in a small test
environment, the kind of mistakes which can occur in interop may well
only show themselves in a more stressful situation.

Well like I say, tell me how I can prove it. I've already built a stresstest
program and run that and it didn't show up any problems
(see my final working program at
http://www.planetsourcecode.com/vb/scripts/ShowCode.asp?txtCodeId=3045&lngWId=10
and you can slag it off all you want, but it'll be wasted breath if I can't
see what I'd *actually* gain in real terms by the alternative - for instance
"Use this - and you'll get average processor usage / file backup time of x%
and typical compression ratio of y%, but use your unmanaged solution and
you'll get rates of p% and q%" is tangible gain in real terms, while "but
you'd have the elegance and manageability of an entirely managed app" ISN'T
tangible gain in real terms.)

I need a tangible way of seeing for myself why a pure .NET application is
better than one that's mixing unmanaged code in order to believe you I'm
afraid.

I'm saying that you should avoid it whenever possible - when, for
example, there's a tried, tested and well-accepted library which does
the same thing in managed code.

But its interface is far too difficult to use!
I mean, what's all this GZip, BZip, PZip, TZip, ZZip, etc....? I bet they
only did that in order to make use of inherited classes for the sake of it.
Why not just have one, simple function?

The interesting thing is that when/if we find out where you went wrong
with the managed code, we may find that the same mistake would have
applied to the unmanaged code. Are you saying it's impossible to get
the interop wrong?

I think that very unlikely, because like I say above, the picture I'm
getting is that the managed code has a complex web of different namespaces
and classes, and the unmanaged one has two simple functions - one for
compress, one for uncompress.
No, of course it's not impossible to get interop wrong - I've done it many a
time. But on this occasion, I *think* I happen to have got it right! If you
disagree, check my code at the link above and tell me where. I keep shaking
my head in confusion, dismay, and almost pity, at the constant stream of
replies saying things like
"yes, but an all-managed solution would be more elegant.", and
"oh yes, but you just wait till you get memory leaks further down the line!"
I can't understand it! This thread should have been over long ago (even
though that doesn't alter the fact that it's an interesting argument
nonetheless...) - Stefan Simek gave me the answer I needed, but it drags on
and on... and not because the answer didn't work, but because it *did*!
I'm flattered - it'd be almost heart-warming that someone actually cares
about my code even more than I do, if only there was that little bit
more substance to it that made me suck my fingers and think yes, maybe
these people have got a point...!
When there's no other choice, IMO. And when there *is* no other choice,
hopefully other people have done it before and already written a
managed wrapper around it, which has been tested by lots of other
people. Any time I do interop myself without getting lots of peer
review, I get nervous...

I really don't see why there seems to exist this obsession with preferring
to use other people's code rather than your own. Encryption, OK - maybe,
perhaps even compression - they are something that people do solely as a
profession.

But you can't/daren't/won't even write a managed wrapper round some unmanaged
code unless somebody else has done exactly the same thing before? There
doesn't exist some job that someone does solely to write managed
wrappers round a particular library - it's not a saleable service in and of
itself, because anyone can do it themselves, and no one in their right mind
would *buy* a wrapper like this when they can do it themselves.
It would only really be necessary if you were writing code blind and just
submitting it, having to guarantee that it will work - whereas in most parts
of the world, we have invented something that is called *testing* ...

And anyway, why are *you*, an MVP, saying "hopefully other people have done
it before" , i.e. looking to *other* people to have written a managed wrapper
round something? Surely shouldn't *you* be the one setting the example?
 
However you're testing the restoration is failing, basically. What's
your current methodology?

No. I've told you that the restoration ISN'T failing. Believe me on this.
I've seen the file reproduced.
However, I have discovered that my method of determining that it is only 30
bytes is flawed: I have done a test that proves that the particular test I
did returns 30 even when I have deliberately set the actual data to more than
30 bytes.
However, I have no reason to suspect that it isn't compressing; when I did it
on a test file that was 55KB it compressed it to 5KB.
But yes, you are right, 1MB to 30 bytes is ridiculous. So those of you that
are clamouring in a rush to steal my algorithm, don't bother, it isn't
*actually* that good.
I was doing
select fileversion, len(cast(filedata as varbinary)) from fileversion

which for some reason defaults to a length of 30. Why, god knows.
Any ideas as to how to find out the actual length of an image field?
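
For what it's worth, the 30 comes from the CAST: converting to varbinary without an explicit length defaults to 30 bytes in SQL Server, while DATALENGTH reports the full stored size of an image column. A sketch of checking it from C# (the connection string and database name are placeholders; the table and column names are taken from the query above):

using System;
using System.Data.SqlClient;

class StoredSizeCheck
{
    static void Main()
    {
        using(SqlConnection conn = new SqlConnection(
            "Server=(local);Database=filebackup;Integrated Security=SSPI"))
        {
            conn.Open();
            // DATALENGTH returns the actual number of bytes stored in the column.
            SqlCommand cmd = new SqlCommand(
                "select fileversion, datalength(filedata) from fileversion", conn);
            using(SqlDataReader reader = cmd.ExecuteReader())
            {
                while(reader.Read())
                {
                    Console.WriteLine("{0}: {1} bytes", reader[0], reader[1]);
                }
            }
        }
    }
}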
 
Any idea how I can run a test that will give me the actual length of data in
the SQL image field?
(Don't say use sp_spaceused, I know about that, and please don't say "don't
use a SQL database" either, because I find it good for cataloguing of files.)
 
Bonj said:
Any idea how I can run a test that will give me the actual length of data in
the SQL image field?
(Don't say use sp_spaceused, I know about that, and please don't say "don't
use a SQL database" either, because I find it good for cataloguing of files.)

I'm sure it's fine for storing files, but I wouldn't use it this early
in the proceedings. Surely you're the one putting the data into the
database in the first place - why not find out at that stage?

I'm sure there is a way of doing it in T-SQL, of course, but I don't
know it off the top of my head, I'm afraid.
 
I'm sure it's fine for storing files, but I wouldn't use it this early
in the proceedings.

I don't know what you mean by early.... I've finished the program, I'm not
just starting it... it's working, finished, choodling away merrily,
drawn-a-line-under, whatever you want to call it! I had always planned to use
SQL.... although you do intrigue me slightly, why would you imply I should
initially not use SQL but perhaps switch to it later? Did you mean so I can
test the validity of the compression routine? If so, then I did this with a
test program. It compressed a file from 55KB to 5KB. Then, an uncompression
program which takes the original size, the compressed filename, and a
filename for the new uncompressed file. I deleted the original file and
restored it, and it was how it was supposed to be... I even did "fc" on them,
and they were identical. So that was that, from then on it was going in SQL
server.
Surely you're the one putting the data into the
database in the first place - why not find out at that stage?

Well, I could. But ideally I want to be able to run a query without having
to store the compressed size in a separate column. But I suppose I could just
do that if it came down to it.
I'm sure there is a way of doing it in T-SQL, of course, but I don't
know it off the top of my head, I'm afraid.

I've posted a question in sqlserver.programming to that effect.
 
Bonj said:
I don't know what you mean by early.... I've finished the program, I'm not
just starting it... it's working, finished, choodling away merrily,
drawn-a-line-under, whatever you want to call it! I had always planned to use
SQL.... although you do intrigue me slightly, why would you imply I should
initially not use SQL but perhaps switch to it later? Did you mean so I can
test the validity of the compression routine? If so, then I did this with a
test program. It compressed a file from 55KB to 5KB. Then, an uncompression
program which takes the original size, the compressed filename, and a
filename for the new uncompressed file. I deleted the original file and
restored it, and it was how it was supposed to be... I even did "fc" on them,
and they were identical. So that was that, from then on it was going in SQL
server.

But if you had a test program which compressed from 55K to 5K, didn't
you get mightily suspicious when you first thought it was compressing a
meg to just 30 bytes? (And in particular, didn't you run the test
program on the same file to check?) I'd have thought you'd reach the
conclusion that the reported size wasn't what you were after much
earlier.
Well, I could. But ideally I want to be able to run a query without having
to store the compressed size in a separate column. But I suppose I could just
do that if it came down to it.

I'm not talking about long-term - just to find out what the compression
ratio is like for a while. It sounds like your test program was doing
that before though - so what happens if you give your test program the
instcat.sql file?
I've posted a question in sqlserver.programming to that effect.

Right.
 
But if you had a test program which compressed from 55K to 5K, didn't
you get mightily suspicious when you first thought it was compressing a
meg to just 30 bytes?

No! Because I didn't actually stop and think "oooh... what's a million
divided by thirty?" - ah, about thirty-three thousand... mmm, that looks
wrong. I just ran the query and didn't at the time see any reason why it
wasn't right.

(And in particular, didn't you run the test
program on the same file to check?) I'd have thought you'd reach the
conclusion that the reported size wasn't what you were after much
earlier.

No, because I wasn't particularly bothered about the exact compression ratio.
I'm not talking about long-term - just to find out what the compression
ratio is like for a while. It sounds like your test program was doing
that before though - so what happens if you give your test program the
instcat.sql file?

I don't know - I can't see the point in that. I want to find out exactly how
big it's getting compressed to *in the database* - and ideally, I want the
database to tell me this info.

OK, so it's obviously not 30 bytes. I see that now. But I've no doubts that
it's compressing it to at least some degree, probably in the region of
10%-25%, and come to that, I've no reason to believe it's losing anything or
compressing it any less than the managed library would.
 