J. P. Gilliver (John) said:
So does reading a file. I think he was asking: with today's fast
processors (and memory), does reading a compressed file and then
decompressing it take less time than just reading the uncompressed file?
(I think he's talking about the compression offered by the OS [I
didn't know it still was: I first became aware of that around, I
think, Windows 3.1 or 95 or so, and never used it then, as I didn't
like the idea of how it was done then, which was to have all of your
drive C: compressed into one vulnerable file. I presume modern
compression is on a file basis], rather than any .zip or similar -
i.e. such that it's transparent to the user.)
When I tested it here, it was a property that applied to a whole
partition.
I set up a data partition (not C:). And compression is applied to all the
objects on that data partition.
Interesting. Presumably, however, it treats the objects individually,
not rewriting the whole partition every time something is changed.
I think it's just the clusters currently being written.
The stuff I was writing was compressible, and wasn't pathologically bad.
If I'd known in advance that the data couldn't be compressed, it would
have been pretty dumb to find out the hard way that it wasn't going to
save any space. (I did some tests first, to see how much individual files
would compress. So I knew my 600GB of files would easily fit in the 500GB
of available space.)
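If anyone wants to do the same kind of pre-check, here is a rough sketch of
one way to go about it, assuming Python, with zlib standing in (crudely) for
NTFS's LZNT1. The D:\frames path and the 500GB target are just placeholders
for whatever you're actually checking.

# Rough pre-check: stream every file through zlib and total the before/after
# sizes, to see whether the data would plausibly fit the target partition.
import os
import zlib

def estimate_compressed(path, chunk=1 << 20):
    """Return (original_bytes, compressed_bytes) for one file."""
    comp = zlib.compressobj(level=1)   # fast setting; NTFS also favours speed
    orig = out = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            orig += len(block)
            out += len(comp.compress(block))
    out += len(comp.flush())
    return orig, out

total_orig = total_comp = 0
for root, _dirs, files in os.walk(r"D:\frames"):      # placeholder directory
    for name in files:
        o, c = estimate_compressed(os.path.join(root, name))
        total_orig += o
        total_comp += c

print(f"{total_orig / 1e9:.1f} GB raw, roughly {total_comp / 1e9:.1f} GB compressed")
print("fits on a 500GB partition?", total_comp < 500e9)

zlib won't give the same ratios as LZNT1, but it's close enough to tell
"easily fits" from "not a chance".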
I don't know if the NTFS compression scheme is clever enough to leave
things that don't compress well uncompressed, or goes ahead anyway.
It takes slightly more space to store something that doesn't compress
well, and the I/O rate would be pretty "lumpy" if the file system
had to decide on the fly how to store things. So my guess would
be that the compressor does its thing in any case.
*******
http://en.wikipedia.org/wiki/Doublespace
"DoubleSpace is, however, different from such programs in other aspects.
For instance, it compresses whole discs rather than select files.
Furthermore, it hooks into the file routines in the operating system so
that it can handle the compression/decompression (which operates on a
per-cluster basis) transparently to the user and to programs running
on the system."
http://en.wikipedia.org/wiki/NTFS_Compression#File_compression
"NTFS can compress files using LZNT1 algorithm (a variant of the LZ77).
Files are compressed in 16-cluster chunks. With 4 kB clusters, files
are compressed in 64 kB chunks. If the compression reduces 64 kB of data
to 60 kB or less, NTFS treats the unneeded 4 kB pages like empty sparse
file clusters - they are not written. This allows not unreasonable
random-access times."
Those descriptions sound different, but I'm not sure they are. When I
enabled the tick box on that NTFS partition, it applied to the whole partition,
so you could claim it applies to the "whole disk". I don't know if the DoubleSpace
concept was different, in that it was "visible", or whether it was just the
description of the thing that wasn't clear about how it worked.
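Going by that 64 kB chunk description, you can estimate per file whether the
chunked scheme would save anything at all. A minimal sketch of the arithmetic,
again with zlib standing in for LZNT1 and a made-up file path; the
round-up-to-4kB-clusters and "60 kB or less" rule are taken straight from the
quote above.

# Simulate NTFS-style chunked compression: compress each 64 kB chunk on its
# own, round the result up to 4 kB clusters, and only count it as compressed
# if it fits in 15 clusters (60 kB) or less.
import zlib

CLUSTER = 4 * 1024
CHUNK = 16 * CLUSTER            # 64 kB with 4 kB clusters

def clusters_used(path):
    raw = packed = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            full = -(-len(chunk) // CLUSTER)          # ceiling division
            raw += full
            need = -(-len(zlib.compress(chunk, 1)) // CLUSTER)
            packed += need if need < 16 else full     # keep chunk raw if no gain
    return raw, packed

raw, packed = clusters_used(r"D:\frames\frame000001.bmp")   # placeholder file
print(f"{raw} clusters uncompressed vs {packed} clusters chunked-compressed")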
Windows has the ability to unzip a ZIP archive (a thing ending in .zip), but
that's different from a file system compression scheme. The files on my disk
didn't end up with a new extension of .zip or anything. They still had their
original file names.
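For a single file, rather than a whole partition, the same compression can be
turned on from the command line. A quick sketch, assuming Windows and Python:
it compresses one (hypothetical) file with the built-in compact.exe tool, then
reads back the allocated size with GetCompressedFileSizeW. The file keeps its
original name throughout, as described above.

# Compress one file in place with compact.exe, then compare its logical size
# to the (possibly smaller) size NTFS actually allocates on disk.
import ctypes
import os
import subprocess

def size_on_disk(path):
    """On-disk size of a file as reported by NTFS (honours compression)."""
    fn = ctypes.windll.kernel32.GetCompressedFileSizeW
    fn.restype = ctypes.c_ulong
    high = ctypes.c_ulong(0)
    low = fn(ctypes.c_wchar_p(path), ctypes.byref(high))
    return (high.value << 32) + low

path = r"D:\test\sample.dat"                         # placeholder file name
subprocess.run(["compact", "/c", path], check=True)  # compress just this file
print("logical size :", os.path.getsize(path))
print("size on disk :", size_on_disk(path))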
And whole-partition compression is not something I left enabled. After the
experiment was finished and I had my answer, I returned the disk to
uncompressed mode. If compression takes a whole CPU core and handles 20MB/sec
with something like video content, it wouldn't be a very good permanent choice.
A highly compressible file (pathologically so) is worthwhile to compress, if
you count keeping a CPU core pegged as a good use of CPU. It would allow
a write rate faster than the raw disk alone. But not many things are going
to be that compressible. I don't generally work with data files that are all
zeros (or some other repetitive value besides zero).
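You can get a feel for that trade-off with a quick and unscientific timing
test: compress an all-zeros buffer (pathologically compressible) and a buffer
of random bytes (a stand-in for video-like data that won't compress). A sketch
in Python with zlib; zlib is not LZNT1 and the numbers vary a lot by machine,
so treat them as illustrative only.

# Compare compression throughput and ratio for best-case and worst-case input.
import os
import time
import zlib

def measure(buf, label):
    t0 = time.perf_counter()
    out = zlib.compress(buf, 1)
    dt = time.perf_counter() - t0
    print(f"{label:12s}: {len(buf) / dt / 1e6:6.0f} MB/s in, "
          f"ratio {len(buf) / len(out):5.1f}:1")

size = 64 * 1024 * 1024                    # 64 MB test buffers
measure(bytes(size), "all zeros")          # compresses to almost nothing
measure(os.urandom(size), "random bytes")  # barely compresses at all

On the all-zeros buffer the compressor can easily outrun a hard disk; on the
random one it just burns a core without saving anything.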
What I was doing at the time was converting a Camstudio screen capture movie into
individual BMP files, for analysis - only to discover that the stupid thing duplicates
captured frames if it hasn't finished processing the current frame. In Camstudio,
the default screen capture speed is set to 200 frames per second, leaving the
impression the program can actually do that. It turns out, in fact, that the
same frame is repeated about 30 times before a new frame has finished
processing and can be written. It means the output "screen movie" contains
30x more data than it really has. It was actually only capturing at 6 to 7
frames per second. So again, I'd be stupid to leave the tool set at the
default 200 if it had no intention of actually capturing 200 frames per second
and making a "smooth" movie. The movie was far from smooth, and didn't succeed
in capturing mouse movement well. By converting the movie to BMP format,
then computing a checksum on each frame, I was able to determine how many
frames contained identical content, and I could easily see how the Camstudio
algorithm worked. But to get there, I needed 600GB of space while the
BMP files were being written out. The individual BMPs were compressible, so
the whole thing fit easily on a 500GB compressing drive. And once I understood
what the thing was doing, I could dump the 600GB of BMP files, as they were
no longer needed. If I were to use Camstudio again, I'd simply adjust the
default to something more realistic (perhaps a setting of 12 to 14 FPS, if
the actual hardware can only manage to capture 6 to 7 real frames - there'd be
no sense in keeping the 200 setting). The wasted space is only evident if you
attempt to convert the movie to something else. Then it balloons.
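For what it's worth, the checksum step boils down to a few lines. A sketch,
assuming the BMPs were dumped into a directory like D:\frames and that the
duplicates show up as consecutive identical frames, as in the behaviour
described above; the 200 FPS figure is Camstudio's default setting.

# Hash every exported frame, count runs of consecutive identical frames, and
# back out the real capture rate from the nominal one.
import glob
import hashlib

SET_FPS = 200                                   # Camstudio's default setting

frames = sorted(glob.glob(r"D:\frames\*.bmp"))  # placeholder dump directory
digests = []
for path in frames:
    with open(path, "rb") as f:
        digests.append(hashlib.md5(f.read()).hexdigest())

if digests:
    distinct = 1 + sum(1 for a, b in zip(digests, digests[1:]) if a != b)
    repeats = len(digests) / distinct
    print(f"{len(digests)} frames on disk, {distinct} distinct")
    print(f"each real frame repeated about {repeats:.0f} times")
    print(f"effective capture rate roughly {SET_FPS / repeats:.1f} FPS")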
Paul