hare-brained code versioning and compression idea

After losing a bit of code when a power cut happened, I decided I wanted a
version control system that is second to none. SourceSafe doesn't fit my
needs (at home, anyway - it does fine at work): I have to remember to check
in the files, I have to keep all related projects in relative directory
structures, and so on - plus various other reasons, such as not having a
licensed copy and being too poor to afford one.

The criterion is that whenever I save a file, it is backed up, without me
having to do ANYTHING else other than make sure the service is running. And
if it's a new version of an existing file, it has to keep a record of both
versions.
I recently made a post about using my own home-grown encryption; after over
100 replies I can now see why that's not a good idea.
But I think my own home-grown compression algorithm IS a good idea, as I've
tested it, and after adding only 3 files my compression ratio was about 0.37.
What's more, unlike encryption, there's nobody to be "up against" working in
the other direction.
I'm not going to use Huffman compression or RLE compression - because
they're for bytes. I'm only going to be storing my code files - .cs files,
.c/.cpp files, .h files, even .xml files - so I'm going to use my own type of
compression - word compression.
Basically, the FileSystemWatcher in the service watches files, and any text
file that gets created or changed gets put into the database.
The definition of a text file is one that doesn't have any zero bytes in it,
so most operating system files will be out straight away.
The interesting bit is the definition of "put it in the database" -
basically using regular expressions, I parse it into words. Then, for each
word, if it's not already in the database, it puts it in the word table, but
if it's already in there, then it just stores a foreign key to the
already-existing row.
Now, you can imagine that words like "System" would be quite popular, hence
wouldn't take up much space. Also, the text between words can just be stored
as another word, so "}\n\t" would probably be quite popular as well.
To keep the db small enough to always be able to add new/changed files, I'd
use some filtering and purging techniques.
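To give a rough idea of the shape of it, the storage side looks something
like this (just a sketch, with an in-memory dictionary standing in for the
word table and the per-file FK rows - names are illustrative, not my real
schema):

    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;

    // Every run of word characters, and every run of the stuff between
    // words, becomes an entry in the shared word table; each file version
    // is then just an ordered list of references to those entries.
    class WordStore
    {
        private readonly Dictionary<string, int> wordIds = new Dictionary<string, int>();
        private readonly List<string> words = new List<string>();

        // Alternating "word" / "separator" tokens, so the original file can
        // be rebuilt exactly by concatenating the tokens in order.
        private static readonly Regex Tokenizer = new Regex(@"\w+|\W+");

        public List<int> StoreFile(string text)
        {
            var tokenIds = new List<int>();                // the per-file "FK rows"
            foreach (Match m in Tokenizer.Matches(text))
            {
                int id;
                if (!wordIds.TryGetValue(m.Value, out id)) // new word: add to the word table
                {
                    id = words.Count;
                    words.Add(m.Value);
                    wordIds[m.Value] = id;
                }
                tokenIds.Add(id);                          // existing word: just reference it
            }
            return tokenIds;
        }

        public string Rebuild(List<int> tokenIds)
        {
            var sb = new StringBuilder();
            foreach (int id in tokenIds) sb.Append(words[id]);
            return sb.ToString();                          // lossless round trip
        }
    }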

Basically, has anyone got any suggestions on this, or opinions on whether it
is likely to work, how useful it would be, or whether anyone would be
interested in using it if it ever came to production?

And I don't think it will, but if anyone reckons it will sell, let me know
'cos I might as well start writing it in unmanaged C right now. (The regex
bit already is.) If not, then I might just release the source code to all and
sundry anyway and see what they think.

Opinions / ideas ?
 
I'm not going to use Huffman compression or RLE compression - because
they're for bytes. I'm only going to be storing my code files - .cs files,
.c/.cpp files, .h files, even .xml files - so I'm going to use my own type of
compression - word compression.

Now, you can imagine that words like "System" would be quite popular, hence
wouldn't take up much space. Also, the text between words can just be stored
as another word, so "}\n\t" would probably be quite popular as well.

That sounds very much like what Huffman compression is going to do
anyway. Admittedly, it'll be better if all your files use the same
encoding, but that's likely to be the case anyway - and then a common
word becomes a common byte sequence, so Huffman compression will spot
it and compress it to a short bit sequence.

Have you tried just using things like zip and compared the space saved
by that with the space saved by your algorithm?
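Even a quick test along these lines would give you a number to compare
against (GZipStream here purely for convenience - SharpZipLib or a
command-line zip would do just as well):

    using System;
    using System.IO;
    using System.IO.Compression;

    class CompressionCheck
    {
        // Prints how much a general-purpose compressor shrinks one file.
        static void Main(string[] args)
        {
            byte[] original = File.ReadAllBytes(args[0]);

            using (var compressed = new MemoryStream())
            {
                // leaveOpen: true so we can read the length afterwards
                using (var gzip = new GZipStream(compressed, CompressionMode.Compress, true))
                {
                    gzip.Write(original, 0, original.Length);
                }
                Console.WriteLine("{0} -> {1} bytes (ratio {2:F2})",
                    original.Length, compressed.Length,
                    (double)compressed.Length / original.Length);
            }
        }
    }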
 
This all seems like re-inventing the wheel.

Replacing words with more compact tokens is a conceptually simple idea, to
be sure, and would be fun to implement, but I suspect that cheap-to-free
compression libraries can be had that can achieve better compression ratios
with greater reliability, generality, etc. I would not dismiss
general-purpose algorithms as being just for binary files. Compression
ratios of 90% and better are commonly achieved for program text within ZIP
files, for instance. Come to think of it, a ZIP file would be quite a
useful container for what you're suggesting, since all the generations for a
given file could be stored in one ZIP. For example, all the generations for
MyProgram.cs could be stored in MyProgram.cs.zip, as 0001MyProgram.cs
through nnnnMyProgram.cs.

If you insist on a pure tokenizing approach, though, I would have a more
complex definition of a "text" file. I would define it as a file with no
characters outside the ASCII range of 32-127 plus CR, LF, and TAB; and then I
would perform an analysis of typical content and assign single-byte tokens
to the 150 or so most common, reserving, say, ASCII zero as an escape
sequence for longer tokens for less frequently-seen words. In this fashion,
considering the relatively limited vocabulary of a programming language,
with a little tuning you might be able to balance your selection of 1-byte
tokens between the commonality and length of the "words" represented, to
get rather dramatic compression results. Most, perhaps all, other tokens
could be three bytes. You could assume one space between "words" and
run-length-encode multiple spaces, plus a few other tricks of that nature.
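Something along these lines, say - the split between 1-byte and 3-byte
tokens here is just one possible arrangement, and the word lists are assumed
to be stored once, centrally, rather than per file:

    using System.Collections.Generic;
    using System.IO;

    // Codes 128..255 (high bit set) stand for the most common words; any
    // other word is an escape byte 0 followed by a 16-bit index into the
    // full word list, i.e. three bytes.  One space between words is implied
    // (runs of whitespace would need the run-length trick mentioned above).
    class TokenCoder
    {
        private readonly Dictionary<string, byte> common = new Dictionary<string, byte>();
        private readonly Dictionary<string, ushort> rare = new Dictionary<string, ushort>();

        public TokenCoder(IList<string> commonWords, IList<string> rareWords)
        {
            for (int i = 0; i < commonWords.Count && i < 128; i++)
                common[commonWords[i]] = (byte)(128 + i);
            for (int i = 0; i < rareWords.Count && i < ushort.MaxValue; i++)
                rare[rareWords[i]] = (ushort)i;
        }

        public byte[] Encode(IEnumerable<string> words)
        {
            var output = new MemoryStream();
            foreach (string w in words)
            {
                byte code;
                if (common.TryGetValue(w, out code))
                {
                    output.WriteByte(code);                    // 1 byte
                }
                else
                {
                    ushort index = rare[w];
                    output.WriteByte(0);                       // escape
                    output.WriteByte((byte)(index >> 8));      // 3 bytes in total
                    output.WriteByte((byte)(index & 0xFF));
                }
            }
            return output.ToArray();
        }
    }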

--Bob
 
Bonj said:
After losing a bit of code when a power cut happened ...<snip>

What about a cheap UPS? Beats writing code for a hardware (electrical)
problem.

Jim Hubbard
 
That sounds very much like what Huffman compression is going to do
anyway. Admittedly, it'll be better if all your files use the same
encoding, but that's likely to be the case anyway - and then a common
word becomes a common byte sequence, so Huffman compression will spot
it and compress it to a short bit sequence.

Yes, I'm using ASCII encoding to store the words in the database.
Have you tried just using things like zip and compared the space saved
by that with the space saved by your algorithm?

I've tried one from www.xceedsoft.com which seemed to have code in the
format I want, i.e. a library, but the test application seemed to lose bytes
from the end when decompressing, and it didn't seem to be calling the
library incorrectly. (I downloaded the test app off www.planetsourcecode.com.)
A lot of the other samples seem to be MFC applications that are built just
to convert a file to another file, or unix style ones using fopen and
fwrite, that would take just as long to rip out the guts of it and put it
into a library that returns a byte array as it would to design my own....
My own algorithm has the advantages that it recognises common sequences over
as many files as I put in it, not just in each individual file, and I
understand everything that's going into it and can therefore be sure that
it's lossless.
Is a compression ratio of 0.3 - 0.4 pretty good?
 
If you insist on a pure tokenizing approach, though, I would have a more
complex definition of a "text" file. I would define it as a file with no
characters outside the ASCII range of 32-127 plus CR, LF, and TAB; and then
I would perform an analysis of typical content and assign single-byte
tokens to the 150 or so most common, reserving, say, ASCII zero as an
escape sequence for longer tokens for less frequently-seen words. In this
fashion, considering the relatively limited vocabulary of a programming
language, with a little tuning you might be able to balance your selection
of 1-byte tokens between the commonality and length of the "words"
represented, to get rather dramatic compression results. Most, perhaps
all, other tokens could be three bytes. You could assume one space
between "words" and run-length-encode multiple spaces, plus a few other
tricks of that nature.

Mmm, interesting...
The only thing with variable-length tokens that I can't get my head round
is how you can get away without storing a whole load of metadata about how
long the tokens are, so the program can know how long the next token it is
to read will be. You've got to remember that I don't for the life of me
understand things like Huffman 'trees' - I'm a database programmer by
nature! To me, a file is an ordered stream / array / collection of bytes,
whereas a tree is a hierarchical thing - and seeing the connection between
the two is like trying to see the fourth dimension. I understand the Huffman
algorithm to the extent that the idea is to replace long, commonly occurring
tokens with a pointer to that token stored singly in some metadata, but it
stops there unfortunately.

However, I am expecting to perform some "stress-tests" on it to see how many
files it takes to fill the database up to, say, half to two-thirds capacity,
and then I can decide how vigorous the purging / filtering routines need to
be. Once I've done that, I can see whether the purge/filtering levels are
acceptable, as obviously the database can't just keep filling up ad
infinitum...
The idea I have for purging and filtering is that it will be a separate
service. (Is this the idea behind service-oriented architecture?) I'm
thinking of fairly clever rules for that too - say, when a file is deleted
from disk, purge all but the latest version, and if the database is getting
too full, start to purge the oldest versions of the oldest files (possibly,
if the FileSystemWatcher will allow it, based on frequency of access) - all
running as a separate service in the same exe. If this could be got to an
acceptable level, maybe it would prove a reliable solution that offers
complete peace of mind that all source-code files are backed up, no matter
what?
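The sort of purge pass I have in mind looks roughly like this - the table
and column names are purely illustrative, since the real schema isn't pinned
down yet:

    using System.Data.SqlClient;

    // Two purge rules: (1) a file has been deleted from disk, so keep only
    // its latest version; (2) the database is getting full, so drop the
    // oldest versions first.
    class Purger
    {
        private readonly string connectionString;

        public Purger(string connectionString)
        {
            this.connectionString = connectionString;
        }

        public void PurgeDeletedFile(int fileId)
        {
            Execute(
                @"DELETE FROM FileVersion
                  WHERE FileId = @fileId
                    AND VersionNo < (SELECT MAX(VersionNo)
                                     FROM FileVersion WHERE FileId = @fileId)",
                new SqlParameter("@fileId", fileId));
        }

        public void PurgeOldest(int versionsToDrop)
        {
            Execute(
                @"DELETE FROM FileVersion
                  WHERE VersionId IN (SELECT TOP (@n) VersionId
                                      FROM FileVersion
                                      ORDER BY SavedAt ASC)",
                new SqlParameter("@n", versionsToDrop));
        }

        private void Execute(string sql, params SqlParameter[] parameters)
        {
            using (var connection = new SqlConnection(connectionString))
            using (var command = new SqlCommand(sql, connection))
            {
                command.Parameters.AddRange(parameters);
                connection.Open();
                command.ExecuteNonQuery();
            }
        }
    }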
 
Yes, I'm using ASCII encoding to store the words in the database.
Right.


I've tried one from www.xceedsoft.com which seemed to have code in the
format I want, i.e. a library, but the test application seemed to lose bytes
from the end when decompressing, and it didn't seem to be calling the
library incorrectly. (I downloaded the test app off www.planetsourcecode.com.)
A lot of the other samples seem to be MFC applications that are built just
to convert a file to another file, or unix style ones using fopen and
fwrite, that would take just as long to rip out the guts of it and put it
into a library that returns a byte array as it would to design my own....

Well, the zip library at
http://www.icsharpcode.net/OpenSource/SharpZipLib/Default.aspx works
pretty well.
My own algorithm has the advantages that it recognises common sequences over
as many files as I put in it, not just in each individual file, and I
understand everything that's going into it and can therefore be sure that
it's lossless.

Hmm - one advantage of having each file self-contained, however, is
that if one particular bit goes wrong, that doesn't screw up *all* the
files - just one file. I take your point though. If you implemented
Huffman directly yourself, you could keep something like that by
keeping the current code table (or whatever it's called - it's a
while since I've done Huffman stuff).
Is a compression ratio of 0.3 - 0.4 pretty good?

Well, zipping up my MiscUtil sources, the zip file (which includes
directory structure, of course) is 56K and the sources themselves end
up totalling 209K.

One alternative (which most source control systems use) is taking diffs
between the last version of the file in the repository and the new
file. That means that if, like me, you like to save very often, it
doesn't make the repository enormous due to lots of tiny changes.
 
The definition of a text file is one that doesn't have any zero bytes in it,
so most operating system files will be out straight away.
The interesting bit is the definition of "put it in the database" -
basically using regular expressions, I parse it into words. Then, for each
word, if it's not already in the database, it puts it in the word table, but
if it's already in there, then it just stores a foreign key to the
already-existing row.

What you describe is essentially LZ dictionary compression - the same family
as the LZ77/DEFLATE scheme at the core of the ZIP file format. And there's
even a FREE library implementation for .Net
(http://icsharpcode.com/OpenSource/SharpZipLib/Default.aspx).

I also think it's unlikely that you'll get better results with your
manual parse than with its automatic parsing. You'll be biased by word
boundaries, while it won't be (e.g., you picked "System"; it would pick
"System.", "System.Data.O" and "System.Data.Sql").

I believe you can also avoid having to check for zero bytes by using the
Filter property on the FileSystemWatcher object. The docs are a bit vague
on the matter, but they claim it can accept "wildcards similarly to the
Windows 'dir' command". Now, dir will accept "*.cs;*.asp;*.resx;*.csproj"
to display all source files in the folder, so, in theory, FileSystemWatcher
should as well.
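In code, something like this - one watcher per pattern, which works whether
or not a combined filter string is accepted (paths and patterns here are
only examples):

    using System;
    using System.Collections.Generic;
    using System.IO;

    class SourceWatcher
    {
        static void Main()
        {
            string root = @"C:\Code";                     // illustrative path
            string[] patterns = { "*.cs", "*.c", "*.cpp", "*.h", "*.xml" };
            var watchers = new List<FileSystemWatcher>(); // keep them alive

            foreach (string pattern in patterns)
            {
                var watcher = new FileSystemWatcher(root, pattern);
                watcher.IncludeSubdirectories = true;
                watcher.Created += OnChanged;
                watcher.Changed += OnChanged;
                watcher.EnableRaisingEvents = true;
                watchers.Add(watcher);
            }

            Console.WriteLine("Watching... press Enter to stop.");
            Console.ReadLine();
        }

        static void OnChanged(object sender, FileSystemEventArgs e)
        {
            // this is where the file would be read and pushed into the database
            Console.WriteLine("{0}: {1}", e.ChangeType, e.FullPath);
        }
    }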
--
Truth,
James Curran
[erstwhile VC++ MVP]
Home: www.noveltheory.com Work: www.njtheater.com
Blog: www.honestillusion.com Day Job: www.partsearch.com
 
Bonj said:
Mmm, interesting...
The only thing with variable-length tokens that I can't get my head round
is how you can get away without storing a whole load of metadata about how
long the tokens are, so the program can know how long the next token it is
to read will be. You've got to remember that I don't for the life of me

One method is to use bit markers. Assuming two lengths, one byte and three
bytes, you could define one-byte tokens as not having their high-order bit
set, while three-byte tokens do.
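So the decoder only needs to look at the top bit of each byte to know how
far to advance - no separate length metadata. A minimal sketch (how the
code values map back to words is a separate question):

    using System.Collections.Generic;

    static class TokenReader
    {
        // High bit clear = 1-byte code (0..127); high bit set = 3-byte code
        // (7 bits from the first byte plus two more bytes).
        public static List<int> ReadCodes(byte[] data)
        {
            var codes = new List<int>();
            int i = 0;
            while (i < data.Length)
            {
                if ((data[i] & 0x80) == 0)
                {
                    codes.Add(data[i]);
                    i += 1;
                }
                else
                {
                    codes.Add(((data[i] & 0x7F) << 16) | (data[i + 1] << 8) | data[i + 2]);
                    i += 3;
                }
            }
            return codes;
        }
    }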
 

Actually that's the one that I tried that was lossy. The other one worked,
but didn't compress very well. Can't remember why.
One alternative (which most source control systems use) is taking diffs
between the last version of the file in the repository and the new
file. That means that if, like me, you like to save very often, it
doesn't make the repository enormous due to lots of tiny changes.

How could I implement diffs with the database scenario? I mean, if I save a
new version that is only a bit different to the old one, the new version is
not actually going to be storing many new words, just FK rows. But can I get
any better than that, do you think? Bearing in mind I might add text to the
beginning, middle, or end of a file, and I might delete some text and
replace it with different text that leaves the file the same size...?


Thanks for the input once again.
 
Actually that's the one that I tried that was lossy. The other one worked,
but didn't compress very well. Can't remember why.

If you were getting errors with that, it suggests your code (or the
sample you were copying) was wrong - I've used it several times without
any problems.

Could you post a short but complete program which demonstrates the
problem?

See http://www.pobox.com/~skeet/csharp/complete.html for details of
what I mean by that.
How could I implement diffs with the database scenario? I mean, if I save a
new version that is only a bit different to the old one, the new version is
not actually going to be storing many new words, just FK rows. But can I get
any better than that, do you think? Bearing in mind I might add text to the
beginning, middle, or end of a file, and I might delete some text and
replace it with different text that leaves the file the same size...?

If you store the complete most recent copy of the file in the database,
then just use a normal diff tool (there are loads around; I have some
C# code to decompress binary diffs in the VCDIFF format, but not a
compressor) to find out the difference between that copy and the new
one; store that diff along with the previous diffs, and replace the old
complete copy with the new one.

Of course, you can compress the diffs as well - zip should work fine
for that, too - but you may well find you don't need to. To be honest,
with disk space so cheap these days, I'm not sure that it's really
worth worrying about compression at all: take a complete copy of the
file every time it changes, and then purge this log appropriately, so
that (say) you have every revision within the last week; for revisions
older than a week but younger than a month you have the latest revision
per day; for revisions older than that you have the latest revision per
week.
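The retention rule is easy enough to express in code. A sketch (the
Revision type and its properties are just placeholders for however you
store versions):

    using System;
    using System.Collections.Generic;
    using System.Linq;

    class Revision
    {
        public int Id;
        public DateTime SavedAt;
    }

    static class RetentionPolicy
    {
        // Keep everything from the last week, the latest revision per day
        // for the last month, and the latest revision per week beyond that.
        public static List<Revision> SelectToKeep(IEnumerable<Revision> revisions, DateTime now)
        {
            var keep = new List<Revision>();
            foreach (var bucket in revisions.GroupBy(r => BucketFor(r.SavedAt, now)))
            {
                if (bucket.Key == null)
                    keep.AddRange(bucket);                                      // last week: keep all
                else
                    keep.Add(bucket.OrderByDescending(r => r.SavedAt).First()); // latest in bucket
            }
            return keep;
        }

        // null means "keep everything"; otherwise a day- or week-sized bucket key.
        private static string BucketFor(DateTime saved, DateTime now)
        {
            double ageDays = (now - saved).TotalDays;
            if (ageDays <= 7) return null;
            if (ageDays <= 30) return "day:" + saved.Date.ToString("yyyy-MM-dd");
            return "week:" + (int)(ageDays / 7);
        }
    }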
 
Bonj said:
But I think my own home-grown compression algorithm IS a good idea, as I've
tested it, and after adding only 3 files my compression ratio was about 0.37.
What's more, unlike encryption, there's nobody to be "up against" working in
the other direction.
I'm not going to use Huffman compression or RLE compression - because
they're for bytes. I'm only going to be storing my code files - .cs files,
.c/.cpp files, .h files, even .xml files - so I'm going to use my own type of
compression - word compression.

Unless you're doing this as a toy for learning, I really think it's foolish
to reinvent the wheel with any thought of doing this better than people who
do this for a living and have bazillions of installations to learn from:
you'll spend all your time building tools you can get replacements for and
not on the things that are unique to your own circumstances. Why not find a
good compression library ( http://www.gzip.org/zlib/ ), study the interface,
then use *it* while you solve the deeper problem? Later, if your time and
inclination remain, you can play around with your own code having some
real-world experience to compare it with.

Regarding, the "deeper problem" of source code control, this smells like
something you'd need to do a filesystem driver for, so it could handle the
versioning for you. I remember using a system years ago (VMS?) that did this
kind of automatic versioning, but it was implemented in the operating
system. A bit of digging suggests that Source Safe has a filesystem-like
module, but it's not clear that I'm reading it right.

The alternative at the user level may be to write a program that plants a
raft of DirectoryChangeNotifications and automatically scans the directory
when something changes. This is hard to do efficiently in the face of lots of
updates (especially when you rebuild a project), and I smell lots of locking
issues too. It may be hard to ever do this in anything like realtime without
a filesystem driver.

You could use something like Second Copy 2000 ( http://www.centered.com/ )
that periodically backs up data to another directory or machine: it runs in
the background, doesn't use many resources, and is basically
setup-and-forget. Cheap, too.

But if you want a great version control system (that nevertheless won't
think for you), it's hard to beat Perforce ( http://www.perforce.com ). Just
fantastic software, great SCC integration, and even integration into
Explorer and Office. Available on every platform you could think of.

Steve
 
Unless you're doing this as a toy for learning,

Mmmm, sort of... but my "toys for learning" are only ever things I actually
want, as this is the only way I can build up enough motivation to design
'toys for learning' - if I'm not actually going to want the product, I don't
have enough motivation to do it to a standard that's satisfactory enough for
me to learn anything useful.
I really think it's foolish
to reinvent the wheel

That's fine, agreed: But I've yet to find a round wheel with the right axle
fastenings for this particular truck.
with any thought of doing this better than people who
do this for a living and have bazillions of installations to learn from:
you'll spend all your time building tools you can get replacements for and
not on the things that are unique to your own circumstances. Why not find a
good compression library ( http://www.gzip.org/zlib/ ),

From two seconds looking at the website, it reeks of GNU, which, apart from
the obvious problem of the GPL, tends to mean that installing it and
configuring it is an art in itself, and therefore impossible for me as a
Microsoft Win32 / .NET / database programmer. But, I won't be prejudiced, so
I will definitely check it out and try to plumb it in and get it working. If
it fails, I'll post the reason why and what I tried to do to overcome it to
this thread.
study the interface,
then use *it* while you solve the deeper problem. Later, if your time and
inclination remain, you can play around with your own code having some
real-world experience to compare it with.

That would sound like a good plan, if the "studying the interface" bit
actually took less time than the "play around with your own code" bit.
Basically, what I'm saying is - I'm not using my own algorithm because I'm
pompous enough to think that it's actually better than one made by zip
experts whose day job it is to design zip strategies and algorithms, I'm
doing it because it's actually easier - I wrote the whole algorithm, all the
SQL, and tested it, all inside yesterday. It would have taken me *at least* a
whole day to understand the interface to somebody else's library.
Regarding, the "deeper problem" of source code control, this smells like
something you'd need to do a filesystem driver for,

I'm going to be using the .NET FileSystemWatcher. There'll be no kernel mode
operation or "filesystem drivers". I'm not that good at low-level stuff!

A bit of digging suggests that Source Safe has a filesystem-like
module, but it's not clear that I'm reading it right.

SourceSafe requires you to *either* use Visual Studio, or explicitly open
its interface and add the files that you want to be backed up. Basically, if
you don't use VS.NET, it's a lot of work - and it's not zero work even if
you do.
The alternative at the user level may be to write a program that plants a
raft of DirectoryChangeNotifications and automatically scans the directory
when something changes.

Mmmm... sort of, but the FileSystemWatcher can do it at the file level, not
at the directory level. I wouldn't ever have to scan a whole directory.
This is hard to do efficiently in the face of lots of
updates (especially when you rebuild a project)

When I rebuild a project, the only files that are going to be changed are
binaries. The program will discard these on two counts:
1) their file extension will be .obj, .dll, .exe, etc., which won't be among
the list of file types I want operated on at all.
2) any files that get through 1) undergo an additional check to make sure
they are not binary. An .exe, .dll, .obj, .lib etc. file would instantly show
up as binary and be completely ignored from that point on.
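The binary check itself is trivial - something along these lines:

    using System.IO;

    static class TextFileCheck
    {
        // A "text" file is one with no zero bytes; anything else is treated
        // as binary and ignored from then on.
        public static bool LooksLikeText(string path)
        {
            foreach (byte b in File.ReadAllBytes(path))
            {
                if (b == 0)
                    return false;   // .exe, .dll, .obj etc. fail almost immediately
            }
            return true;
        }
    }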
, and I smell lots of locking
issues too.

I shouldn't think so. Text editors generally have no problem with another
application opening a file that they have open for read access (it's a bad
text editor that does have a problem with that). My service will only be
requesting read access on the file, as that's all it needs.
It may be hard to ever do this in anything like realtime without
a filesystem driver.

Well, I don't know. I'll have to stress-test it, which I've got running now.

But at the end of the day, I'm after thoughts on "how", "what", "where",
rather than "whether to".
 
Is a compression ratio of 0.3 - 0.4 pretty good?

not really - for text you should do a lot better - and for source code, where
there is an even smaller vocabulary, you should see much better ratios. but
then that depends a lot on the size of the file to be compressed. small files
don't yield nearly as much.

if you are going to pursue file compression (with cheap disks and fast
interconnect connections, this is quickly becoming a lost art), you should
have a look at the dictionary-based schemes.

there is a lot of opportunity when working in application-specific
environments: you can seed the dictionary with common words that have high
occurrence rates, and achieve incredible compression rates. Early
implementations of the Lempel-Ziv algorithms started their dictionary with
256 entries - one for each possible byte value. consider this link:
http://www.dogma.net/markn/articles/lzw/lzw.htm

that chews up 8 bits for the index right off the start. for pure text (like
C# source) you could get by with 128 initial entries, then add some common
words - like "for", "int", "string", "class", "using System.", and so on.
you can still limit the output to 12 bit codes plus the new token (4 phrases
per 10 bytes of output) - ergo the longer the phrases, the better the
compression - and you don't have to transmit the initial dictionary!
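a rough sketch of the encoding side - greedy longest-match against a seeded
dictionary rather than a tuned lzw implementation, and the seed words are
just examples (assumes plain ascii source text):

    using System.Collections.Generic;

    // start the dictionary with the 128 ascii characters plus a few common
    // C# words, then greedily take the longest match at each position and
    // emit its code, growing the dictionary as we go.
    class SeededDictionaryCoder
    {
        private readonly Dictionary<string, int> codes = new Dictionary<string, int>();

        public SeededDictionaryCoder()
        {
            for (char c = (char)0; c < 128; c++)
                Add(c.ToString());                       // 128 single-character entries
            foreach (string word in new[] { "for", "int", "string", "class", "using System." })
                Add(word);                               // seeded common words
        }

        private void Add(string phrase)
        {
            if (!codes.ContainsKey(phrase)) codes[phrase] = codes.Count;
        }

        public List<int> Encode(string text)
        {
            var output = new List<int>();
            int pos = 0;
            while (pos < text.Length)
            {
                int bestLen = 1;                         // longest phrase starting at pos
                for (int len = 2; pos + len <= text.Length; len++)
                    if (codes.ContainsKey(text.Substring(pos, len))) bestLen = len;

                string match = text.Substring(pos, bestLen);
                output.Add(codes[match]);

                if (pos + bestLen < text.Length && codes.Count < 4096)
                    Add(match + text[pos + bestLen]);    // grow the dictionary, lzw-style

                pos += bestLen;
            }
            return output;
        }
    }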

regards
roy fine
 