System.IO.Compression.GZipStream Decompression Losing Data

ian_jacobsen

I'm having some trouble with GZipStream in System.IO.Compression. I
can compress a file without much trouble (at least as far as I can
see); however, when I decompress the data, multiple problems arise.

1) I'm missing a couple bytes of data after performing the
decompression. I've included my two sample files below. The file
test.txt is the original file, and test_after.txt is the file after
decompression. Notice that the test_after.txt is missing z-9. Why are
these bytes falling off?

2) Notice how I'm performing a Read from the unzipStream during
decompression. On the first Read, the count returned is always 0.
Then if I perform the Read again, the count returned is some positive
integer, however it does not match the size of the original file that
would be expected. This behavior is not consistent with Read on other
stream objects, so what's the problem here? Am I crazy, or am I
missing something here?

P.S. - I followed the example that was given on MSDN for GZipStream
that can be found at:
http://msdn2.microsoft.com/en-us/library/system.io.compression.gzipstream.aspx

Thanks, --Ian


Contents of c:\temp\test.txt (no newline after second line):
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxyz0123456789

Contents of c:\temp\test_after.txt (with bytes missing!):
ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
abcdefghijklmnopqrstuvwxy


Sample code from a stub app (see noted comments):

using System.IO;
using System.IO.Compression;

private void button1_Click(object sender, EventArgs e)
{
    // Read input file
    FileStream fsInput =
        new FileStream(@"c:\temp\test.txt", FileMode.Open);
    int nInputFileSize = (int)fsInput.Length;
    byte[] abInput = new byte[nInputFileSize];
    fsInput.Read(abInput, 0, nInputFileSize);

    // Compress
    MemoryStream ms = new MemoryStream();
    GZipStream zipStream = new System.IO.Compression.GZipStream(ms,
        CompressionMode.Compress, false);
    zipStream.Write(abInput, 0, nInputFileSize);

    // Uncompress
    ms.Seek(0, SeekOrigin.Begin);
    GZipStream unzipStream = new GZipStream(ms,
        CompressionMode.Decompress, false);
    byte[] abOutput = new byte[nInputFileSize];

    // NOTE: I know the next few lines look goofy,
    // but the first Read does not fill the buffer
    int nCount = 0;
    while (nCount == 0 && nInputFileSize > 0)
    {
        nCount = unzipStream.Read(abOutput, 0, abOutput.Length);
    }

    if (nCount != nInputFileSize)
    {
        // NOTE: step over this exception to see the output file
        throw new Exception("Uncompressed file size does not match original file!");
    }

    FileStream fsOutput =
        new FileStream(@"c:\temp\test_after.txt", FileMode.Create);
    fsOutput.Write(abOutput, 0, abOutput.Length);
    fsOutput.Close();
}
 
Marc Gravell

You haven't closed the zip-stream, so it is probably holding some data:

using(GZipStream zipStream = new System.IO.Compression.GZipStream(ms,
CompressionMode.Compress, false)) {
zipStream.Write(abInput, 0, nInputFileSize);
zipStream.Close(); // just for good measure
}

With many streams a Flush() would be enough, but I have previously
posted demonstrable code showing that some compression streams do
not always respect Flush() completely - presumably an optimisation,
since the compressor wants a suitably sized buffer to work with.
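
A minimal sketch of the point above, for anyone who wants to see it on their own machine. It counts how many compressed bytes have reached the MemoryStream after Write plus Flush, and again after the GZipStream has been disposed. The class and method names (FlushVersusClose, Measure) are made up for this illustration, and the exact byte counts will vary by framework version, so none are claimed here.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class FlushVersusClose
{
    // Hypothetical helper: reports the length of the underlying stream
    // before and after the compressing GZipStream is disposed.
    public static void Measure(byte[] data, out long beforeClose, out long afterClose)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            // leaveOpen = true so ms can still be inspected afterwards
            using (GZipStream zip = new GZipStream(ms, CompressionMode.Compress, true))
            {
                zip.Write(data, 0, data.Length);
                zip.Flush(); // may leave data buffered inside the compressor
                beforeClose = ms.Length;
            } // Dispose writes whatever is still pending, plus the gzip trailer
            afterClose = ms.Length;
        }
    }

    public static void Main()
    {
        // Same text as the test.txt sample in the original post
        byte[] data = Encoding.ASCII.GetBytes(
            "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789\r\n" +
            "abcdefghijklmnopqrstuvwxyz0123456789");
        long before, after;
        Measure(data, out before, out after);
        Console.WriteLine("Bytes in ms after Flush:   " + before);
        Console.WriteLine("Bytes in ms after Dispose: " + after);
    }
}
```

If the second number is larger than the first, a decompressor that starts reading before the Dispose is working from a truncated stream, which would explain bytes going missing at the end.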

I will play with your code and re-post when I have tidied it a little.

Marc
 
Marc Gravell

et voila; note also that I have used the "leave open" ctor overload so
that I can close the zip-streams without closing the underlying
memory-stream. Note that to get the compressed binary "in memory" just
use ms.ToArray() at the "seek" point; to pre-fill with compressed
binary "in memory" just init ms = new MemoryStream(theByteArray) and
jump to the unzip code.

Marc

using System.IO;
using System.IO.Compression;
using System;

class Program
{
    static long CopyStream(Stream input, Stream output)
    {
        const int BUFFER_SIZE = 1024;
        byte[] buffer = new byte[BUFFER_SIZE];
        int bytes;
        long totalBytes = 0;

        // Read returns 0 only at end of stream; any smaller positive
        // count just means "call me again"
        while ((bytes = input.Read(buffer, 0, BUFFER_SIZE)) > 0)
        {
            output.Write(buffer, 0, bytes);
            totalBytes += bytes;
        }
        output.Flush();
        return totalBytes;
    }

    static void Main()
    {
        // NOTE: would never really do this; this is an example of
        // re-using a MemoryStream when reading & writing to / from
        // compression streams
        long bytesRead, bytesWritten;
        string inPath = @"c:\temp\test.txt",
               outPath = @"c:\temp\test_after.txt";
        using (MemoryStream ms = new MemoryStream())
        {
            // read file input & compress to ms
            using (FileStream fsInput = File.OpenRead(inPath))
            using (GZipStream zip = new GZipStream(ms,
                CompressionMode.Compress, true))
            {
                bytesRead = CopyStream(fsInput, zip);
                fsInput.Close();
                zip.Close();
            }

            // seek
            ms.Position = 0;

            // uncompress; File.Create truncates an existing file,
            // which File.OpenWrite would not
            using (FileStream fsOutput = File.Create(outPath))
            using (GZipStream unzip = new GZipStream(ms,
                CompressionMode.Decompress, true))
            {
                bytesWritten = CopyStream(unzip, fsOutput);
                fsOutput.Close();
                unzip.Close();
            }
        }
        // final compare; long winded to prove correct
        byte[] input = File.ReadAllBytes(inPath),
               output = File.ReadAllBytes(outPath);
        int size = input.Length;
        if (size == output.Length)
        {
            for (int i = 0; i < size; i++)
            {
                // compare byte by byte, not the array references
                if (input[i] != output[i])
                {
                    Console.WriteLine("Error offset " + i.ToString());
                    break;
                }
            }
        }
        else
        {
            Console.WriteLine("Wrong size");
        }
    }
}
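
The remark above about ms.ToArray() and new MemoryStream(theByteArray) can be sketched as a pair of byte-array helpers. This is only an illustration under stated assumptions: the names GZipBytes, Compress and Decompress are invented here, not part of the framework, and the read loop follows the same pattern as CopyStream. It also covers the second question in the original post, since a single Read may return fewer bytes than requested and only a return value of 0 means the stream has ended.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GZipBytes
{
    // Hypothetical helper: returns the gzip-compressed form of raw.
    public static byte[] Compress(byte[] raw)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            using (GZipStream zip = new GZipStream(ms, CompressionMode.Compress, true))
            {
                zip.Write(raw, 0, raw.Length);
            } // zip must be disposed before ToArray, or the data is incomplete
            return ms.ToArray();
        }
    }

    // Hypothetical helper: reverses Compress.
    public static byte[] Decompress(byte[] compressed)
    {
        using (MemoryStream input = new MemoryStream(compressed))
        using (GZipStream unzip = new GZipStream(input, CompressionMode.Decompress))
        using (MemoryStream output = new MemoryStream())
        {
            byte[] buffer = new byte[1024];
            int bytes;
            // Keep reading until Read reports end of stream with 0
            while ((bytes = unzip.Read(buffer, 0, buffer.Length)) > 0)
            {
                output.Write(buffer, 0, bytes);
            }
            return output.ToArray();
        }
    }

    public static void Main()
    {
        byte[] raw = Encoding.ASCII.GetBytes("abcdefghijklmnopqrstuvwxyz0123456789");
        byte[] roundTrip = Decompress(Compress(raw));
        Console.WriteLine(Encoding.ASCII.GetString(roundTrip));
    }
}
```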
 
ian_jacobsen

Yeah, opening the GZipStream with "leaveOpen" set to true, and then
calling Close on the stream after the Write fixed the problem. Thanks
a lot!

-- Ian
 
Marc Gravell

On a wider issue... IDisposable... actively look for it... if you see
.Dispose(), then in most circumstances you should be "using" that
instance so that you get deterministic clean-up. This applies whenever
the current code /owns/ (if you see what I mean) the instance.
CopyStream just works with the objects it is given, and feels no
ownership... whereas code that creates an object which then
dies *within scope* clearly should use "using".

I find the easiest way is to think of each IDisposable object as a baton,
and always ask "who has the baton now"? When you create the object (or
call a factory method etc), you have it. If you use the GZipStream's
"leaveOpen=false" ctor, then you are effectively handing the
MemoryStream's baton to the GZipStream - which you don't want in this
case. In CopyStream nobody hands over the baton at all, since the
caller uses the objects both before and afterwards. If you hold the
baton, then it is your responsibility to ensure that the object is
Dispose()d appropriately.
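
The baton idea can be checked with a small experiment. The sketch below (BatonDemo and InnerStreamSurvives are hypothetical names chosen for illustration) disposes a GZipStream built with each leaveOpen value and then asks the MemoryStream whether it can still be read.

```csharp
using System;
using System.IO;
using System.IO.Compression;

public static class BatonDemo
{
    // Hypothetical helper: true if the MemoryStream is still open after
    // the GZipStream wrapped around it has been disposed.
    public static bool InnerStreamSurvives(bool leaveOpen)
    {
        MemoryStream ms = new MemoryStream();
        using (GZipStream zip = new GZipStream(ms, CompressionMode.Compress, leaveOpen))
        {
            byte[] data = { 1, 2, 3 };
            zip.Write(data, 0, data.Length);
        }
        bool usable = ms.CanRead; // CanRead is false on a closed MemoryStream
        ms.Dispose();             // harmless if the GZipStream already closed it
        return usable;
    }

    public static void Main()
    {
        Console.WriteLine(InnerStreamSurvives(true));  // True: we kept the baton
        Console.WriteLine(InnerStreamSurvives(false)); // False: GZipStream took it
    }
}
```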

Well, OK - maybe that doesn't help you - but it helps me ;-p

Marc
 
