Unzipping 2 text files with same fields but different # of records

  • Thread starter: Hari

Hari

Hi,

I have 2 zipped text files (raw data), each with the same five fields
delimited by the pipe character.

The first one is 195803 KB while the second one is 333065 KB.

I unzipped these files, and the first one blows up to 2927031 KB while
the second one reaches just 1733375 KB.

I expected the first one to be smaller than the second one after unzipping.

Any thoughts?

Regards,
HP
India
 
It can depend on a lot of things: the program used to do the zip compression, the compression level specified, and the number of repeated strings in the file, which makes a difference in how much it can be compressed.
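
To illustrate the point about repeated strings, here is a minimal sketch using Python's standard zlib module, which implements the same Deflate method that most zip tools use. The sample record and the helper name compressed_ratio are made up for the demonstration and are not taken from the original files.

```python
import os
import zlib


def compressed_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(data, level)) / len(data)


# Hypothetical pipe-delimited record repeated many times: lots of repeated strings.
repetitive = b"1001|ACME|12 Main St|Pune|411001\n" * 10_000

# Random bytes stand in for data with essentially no repeated strings
# (already-compressed files such as JPEGs behave similarly).
random_bytes = os.urandom(len(repetitive))

ratio_repetitive = compressed_ratio(repetitive)
ratio_random = compressed_ratio(random_bytes)

print(f"repetitive text: compressed to {ratio_repetitive:.2%} of original")
print(f"random bytes:    compressed to {ratio_random:.2%} of original")
```

Both inputs have the same size, yet the repetitive one should shrink to a small fraction while the random one barely changes, so two files with the same fields can compress very differently.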
 
Can depend on a lot of things.
For example: zip a bunch of JPGs and you save next to nothing, since they are already compressed.

Just my 2¢ worth. Larry

Any advice given is my attempt to show appreciation for all
the excellent help I've received here but I'm no MVP so it
may only apply NUGS (Normally, Usually, Generally, Sometimes :)
 
Ah, so it must be the number of repeated strings that is causing it
(the program and compression level are the same). I'm actually
extracting addresses (in key/ID form) from a database. The first file
is for business and the second is for personal, and business records
are expected to have more duplicate address IDs than personal ones.
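
For reference, the sizes quoted at the top of the thread work out to an expansion of roughly 15:1 for the first file versus about 5:1 for the second. A quick sketch of that arithmetic (the dictionary labels are only for illustration):

```python
# (zipped KB, unzipped KB) as quoted in the first post of the thread.
sizes_kb = {
    "first (business)": (195803, 2927031),
    "second (personal)": (333065, 1733375),
}

# Expansion factor: how many times larger the file is once unzipped.
ratios = {name: unzipped / zipped for name, (zipped, unzipped) in sizes_kb.items()}

for name, ratio in ratios.items():
    print(f"{name}: unzipped size is {ratio:.2f}x the zipped size")
```

A much higher factor for the first file is consistent with it containing far more repeated strings.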

Btw, it is new to me that zip software uses how often a particular
string repeats for compression. How does it even know how long a
string to consider? I mean, would it take a complete record, or
does it use any concept of a field?

hp
 
I am not a zip technology expert by any means. My understanding is that part of the logic is to find frequently used strings and substitute a much shorter code for each, in a way that lets them be converted back. So, in your business list, if there are many email addresses ending in "@somebusiness.com", each of those could be represented by about one byte, saving roughly 16 bytes each time. Now, an expert may jump in and tell me how wrong my example is, but I believe as a high-level example it is OK.
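
As a rough sketch of how the repeated-string idea can work at the byte level: the Deflate method used by most zip tools replaces a repeated run of bytes with a back-reference (distance, length) to an earlier copy within a 32 KB window, with match lengths from 3 to 258 bytes, and then Huffman-codes the result. It has no notion of fields or records. The toy code below is not the real zip implementation; the function names and the sample data are invented for illustration, the search is brute force, and the Huffman step is left out.

```python
def lz77_tokens(data: bytes, window: int = 32 * 1024,
                min_len: int = 3, max_len: int = 258) -> list:
    """Greedy toy LZ77: emit literal byte values or (distance, length) pairs."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force search of the window for the longest earlier match.
        for j in range(max(0, i - window), i):
            length = 0
            while (length < max_len and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_len:
            tokens.append((best_dist, best_len))  # back-reference
            i += best_len
        else:
            tokens.append(data[i])  # literal byte
            i += 1
    return tokens


def lz77_decode(tokens: list) -> bytes:
    """Rebuild the original bytes from literals and back-references."""
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # copy one byte from `dist` bytes back
        else:
            out.append(tok)
    return bytes(out)


# Two hypothetical pipe-delimited records sharing an email domain.
sample = b"17|bob@somebusiness.com\n18|amy@somebusiness.com\n"
tokens = lz77_tokens(sample)
print(tokens)
```

In this sample the second "@somebusiness.com" plus the trailing newline becomes a single back-reference, which shows that the match length is simply whatever run of bytes happens to repeat and can cross field and record boundaries.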
 
It would look at the file as a finite sequence of (0,1) bits and start by
looking at patterns of length 2, of which there are 4: (00) (10) (01)
(11). What the bits represent is immaterial. Now 8 bits (101 10 11
1) = (5 2 3 1) could represent 22 bits (0000000000101001010111): five 00s, two 10s, three 01s, and one 11.

HTH-Larry
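
The bit-pattern idea above is essentially run-length encoding, which is a simpler scheme than what zip tools actually use. A minimal sketch, with hypothetical function names, that counts consecutive repeats of each 2-bit pattern and expands them again:

```python
def run_length_pairs(bits: str) -> list:
    """Split a bit string into 2-bit patterns and count consecutive repeats."""
    if len(bits) % 2:
        raise ValueError("bit string length must be even")
    runs = []
    for k in range(0, len(bits), 2):
        pattern = bits[k:k + 2]
        if runs and runs[-1][0] == pattern:
            runs[-1] = (pattern, runs[-1][1] + 1)  # extend the current run
        else:
            runs.append((pattern, 1))  # start a new run
    return runs


def expand(runs: list) -> str:
    """Inverse of run_length_pairs: repeat each pattern by its count."""
    return "".join(pattern * count for pattern, count in runs)


# Run counts 5, 2, 3, 1 for the patterns 00, 10, 01, 11.
bits = "00" * 5 + "10" * 2 + "01" * 3 + "11"
runs = run_length_pairs(bits)
print(runs, len(bits))
```

Note that a real encoder also has to store which pattern each count belongs to and where each count ends, so the savings are smaller than the bare counts suggest.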
