Unzipping 2 text files with same fields but different # of records

Hari · Feb 25, 2007

Hi,

I have two zipped text files (raw data), each with the same five fields
delimited by a pipe character.

The first one is 195803 KB while the second one is 333065 KB.

When I unzip these files, the first one expands to 2927031 KB while
the second one expands to only 1733375 KB.

I expected the first one to be smaller than the second one after unzipping.

Any thoughts?

Regards,
HP
India
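
To put numbers on the question, here is a small sketch that computes the compression ratio implied by the sizes quoted above. The dictionary layout and variable names are only for illustration; the four sizes are the ones from the post.

```python
# Sizes in KB, copied from the post above.
files = {
    "first": {"zipped": 195803, "unzipped": 2927031},
    "second": {"zipped": 333065, "unzipped": 1733375},
}

# Ratio = unzipped size / zipped size (higher means the data compressed better).
ratios = {name: s["unzipped"] / s["zipped"] for name, s in files.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}:1")
# first: 14.95:1
# second: 5.20:1
```

So the first file compressed almost three times as well as the second, which is why the smaller archive turns into the larger text file.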

Bill James · Feb 25, 2007

Can depend on a lot of things. The program used to do the zip compression. The amount of compression specified. The number of repeated strings in a file will make a difference in how much it can be compressed.
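
The effect of repeated strings can be seen with a short sketch. It uses Python's `zlib` module, which implements DEFLATE, the method most zip programs use by default; the sample record and the helper name `compression_ratio` are made up for illustration and are not taken from the files in this thread.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return original size divided by compressed size."""
    return len(data) / len(zlib.compress(data, level))


# One pipe-delimited record repeated many times: lots of repeated strings.
repetitive = b"1001|ADDR-42|Main Street|Springfield|IL\n" * 10_000

# Random bytes of the same length: essentially no repeated strings.
random_like = os.urandom(len(repetitive))

rep_ratio = compression_ratio(repetitive)
rand_ratio = compression_ratio(random_like)

print(f"repetitive data: {rep_ratio:.1f}:1")
print(f"random data:     {rand_ratio:.2f}:1")
```

The repetitive input shrinks dramatically, while the random input stays at about its original size. Real data such as the address files here would fall somewhere in between.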

Larry(LJL269) · Feb 25, 2007

Can depend on a lot of things.

For example: zip a bunch of JPEGs and you save next to nothing, because they are already compressed.

Just my 2¢ worth. Larry

Any advice given is my attempt to show appreciation for all
the excellent help I've received here, but I'm no MVP, so it
may only apply NUGS (Normally, Usually, Generally, Sometimes).
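
The JPEG point can be illustrated without any image files: data that has already been compressed leaves little for a second pass to remove. This is a rough sketch that uses the output of `zlib` as a stand-in for an already-compressed file; the word list and the fixed seed are arbitrary choices made for this example.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable
words = ["alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel"]

# Plain text built from a small vocabulary: compresses well the first time.
text = " ".join(rng.choice(words) for _ in range(50_000)).encode("ascii")

once = zlib.compress(text)   # first pass removes most of the redundancy
twice = zlib.compress(once)  # second pass has almost nothing left to remove

print(len(text), len(once), len(twice))
```

The first pass shrinks the text substantially, while the second pass leaves the size roughly unchanged. A JPEG is in a similar position, since the format applies its own compression before a zip program ever sees the file.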

Hari · Feb 26, 2007

Can depend on a lot of things. The program used to do the zip compression. The amount of compression specified. The number of repeated strings in a file will make a difference in how much it can be compressed.

Ah, so it must be the number of repeated strings that is causing it
(the program and the compression level are the same for both files).
I'm actually extracting addresses (in key/ID form) from a database.
The first file is for business and the second one is for personal,
and business records are expected to have more duplicate address IDs
than personal ones.

Btw, it is new to me that zip software uses the number of times a
particular string repeats for compression. How does it even know how
long a string to consider? I mean, would it take a complete record, or
does it use any concept of a field?

hp

Bill James · Feb 26, 2007

I am not a zip technology expert by any means. My understanding is that part of the logic is to find frequently used strings and substitute one character for those, then include a table for converting them back. So, in your business list, if there are many email addresses ending in "@somebusiness.com", each of those can be represented by one byte, saving 16 bytes for each. Now, an expert may jump in and tell me how wrong my example is, but I believe as a high level example it is OK.
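
The high-level idea above can be sketched as a toy substitution scheme. This is only an illustration of the description in this post, not how zip actually stores data: DEFLATE does not keep a table of strings but replaces a repeat with a (distance, length) reference to an earlier occurrence and then Huffman-codes the result. The function names, the sample addresses, and the `\x01` placeholder byte are all assumptions made for this example.

```python
def substitute(text: str, table: dict) -> str:
    """Replace each frequent phrase with its one-character token."""
    for phrase, token in table.items():
        text = text.replace(phrase, token)
    return text


def restore(text: str, table: dict) -> str:
    """Reverse the substitution using the same table."""
    for phrase, token in table.items():
        text = text.replace(token, phrase)
    return text


# The token must be a character that never occurs in the real data.
table = {"@somebusiness.com": "\x01"}

records = "alice@somebusiness.com|bob@somebusiness.com|carol@somebusiness.com"
packed = substitute(records, table)
saved = len(records) - len(packed)

print(saved)  # 48: three occurrences, 16 characters saved on each
```

The saving only pays off when a phrase repeats often enough to outweigh the cost of storing the table alongside the data.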

Larry(LJL269) · Feb 27, 2007

It would look at the file as a finite sequence of bits (0s and 1s) and
start by looking at patterns of length 2, of which there are 4: (00),
(10), (01), (11). What the bits represent is immaterial. Now 8 bits
(101 10 11 1) = (5 2 3 1) could represent 20 bits (00000000001010010111).

HTH-Larry
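
On the question of records versus fields: DEFLATE-style compressors work on raw bytes and look back through a sliding window (up to 32 KB, with matches of up to 258 bytes) for the longest earlier copy of what comes next. The sketch below is a naive longest-match search in that spirit, not the real zip implementation; the function name and the sample records are invented for illustration.

```python
def longest_match(data: bytes, pos: int, window: int = 32768, max_len: int = 258):
    """Find the longest earlier copy of the bytes starting at pos.

    Returns (length, distance); (0, 0) means no earlier copy was found.
    """
    best_len, best_dist = 0, 0
    for cand in range(max(0, pos - window), pos):
        length = 0
        while (length < max_len
               and pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_len, best_dist = length, pos - cand
    return best_len, best_dist


data = b"101|A17|X\n102|A17|Y\n"
pos = data.index(b"|A17|", 5)  # start of the second "|A17|"

print(longest_match(data, pos))  # (5, 10): copy 5 bytes from 10 bytes back
```

The match here is `|A17|`, delimiters included, so the search has no notion of where a field or record begins or ends. It simply takes the longest run of identical bytes it can find.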

