AT&T Labs Research pzip command
Glenn Fowler <gsf@research.att.com>
AT&T Labs Research - Florham Park NJ
Fixed length record data, although easy to access, is often viewed
as a waste of space.
Many projects go to great lengths optimizing data schemas to save space,
but complicate the data interface in the process.
pzip shows that in many cases this view of fixed length data is wrong.
In fact, variable length data may become more compressible when
converted to a sparse, fixed length format.
Intense semantic schema analysis can be replaced by an automated record
partition, resulting in compression space improvements of 2 to 10 times
and decompression speed improvements of 2 to 3 times over gzip
for a large class of data.
pzip(1) compresses and decompresses data files of fixed length rows
(records) and columns (fields). It performs better than gzip(1) in both
space and time on data in which many columns (typically more than 50%)
change at a low rate. Columns with a low rate of change are called low
frequency; columns with a high rate of change are called high frequency.
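
The notion of column frequency can be made concrete with a small sketch. It assumes that a column's rate of change is the fraction of rows in which its byte differs from the same column in the previous row; this is an illustrative reading, not necessarily the measure that pop(1) or pin(1) computes, and the function name is hypothetical.

```python
def column_change_rates(data: bytes, row_size: int) -> list[float]:
    """Return, per column, the fraction of rows whose byte differs from the previous row.

    Assumption: "rate of change" is measured row to row; the real tools may differ.
    """
    rows = len(data) // row_size
    if rows < 2:
        return [0.0] * row_size
    changes = [0] * row_size
    for r in range(1, rows):
        prev = data[(r - 1) * row_size : r * row_size]
        cur = data[r * row_size : (r + 1) * row_size]
        for c in range(row_size):
            if cur[c] != prev[c]:
                changes[c] += 1
    return [n / (rows - 1) for n in changes]


# Three 4-byte records: column 0 never changes, columns 2 and 3 always change.
rates = column_change_rates(b"AAx1AAy2ABz3", row_size=4)
print(rates)  # [0.0, 0.5, 1.0, 1.0]
```

Under this reading, a cutoff such as 10% would classify columns with a rate at or below 0.1 as low frequency and the rest as high frequency.
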
The pzip compressed format is itself gzipped; the uncompressed data is
reorganized according to the user-specified partition file before being
passed to gzip.
Low frequency columns are difference encoded and high frequency column
groups are transposed to column-major order.
The gzip tables are flushed between column partition groups. This has a
positive space/time effect on the gzip string match and Huffman tables.
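
The reorganization step can be sketched as follows. This is a minimal illustration under stated assumptions, not pzip's implementation or file format: the difference encoding is stood in for by an XOR against the previous row (unchanged bytes become zeros), the partition is passed in as plain column index lists rather than read from a partition file, and compressing each group with a separate `zlib.compress` call only approximates flushing the gzip tables between groups. All function names are hypothetical.

```python
import zlib


def transpose_columns(data: bytes, row_size: int, columns: list[int]) -> bytes:
    """Emit the selected columns in column-major order (all of column a, then column b, ...)."""
    rows = len(data) // row_size
    return bytes(data[r * row_size + c] for c in columns for r in range(rows))


def xor_delta(data: bytes, row_size: int, columns: list[int]) -> bytes:
    """XOR the selected columns of each row with the previous row (row 0 against zeros).

    Stand-in for difference encoding: a column that rarely changes yields long runs of zeros.
    """
    rows = len(data) // row_size
    out = bytearray()
    prev = bytes(len(columns))
    for r in range(rows):
        cur = bytes(data[r * row_size + c] for c in columns)
        out.extend(a ^ b for a, b in zip(cur, prev))
        prev = cur
    return bytes(out)


def reorganize(data: bytes, row_size: int,
               low: list[int], high_groups: list[list[int]]) -> list[bytes]:
    """Return one byte string per partition group: low frequency first, then each high group."""
    parts = [xor_delta(data, row_size, low)]
    parts += [transpose_columns(data, row_size, g) for g in high_groups]
    return parts


def compress_parts(parts: list[bytes]) -> list[bytes]:
    """Compress each group independently so no match or Huffman state crosses groups."""
    return [zlib.compress(p, 9) for p in parts]


# Three 4-byte records; columns 0-1 are treated as low frequency, 2-3 as one high group.
parts = reorganize(b"AAx1AAy2ABz3", 4, low=[0, 1], high_groups=[[2, 3]])
compressed = compress_parts(parts)
```

Both transforms are invertible given the row size and the column lists, which is why a decompressor needs the same partition information that the compressor used.
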
Two other commands are part of the pzip package.
The pop(1) command lists column frequencies for fixed length data and the
pin(1) command induces pzip partitions from training data.
This table shows timing and size results for pzip and gzip,
run on a 400 MHz SGI MIPS processor.
US Census field group 301 / all states
| COMMAND | SIZE | RATE | REAL | USER | SYS |
| raw | 342,279,796 | 1.0 | | | |
| gzip | 31,471,465 | 10.9 | 4m04.22s | 3m58.79s | 0m04.40s |
| pzip | 17,549,599 | 19.5 | 2m26.95s | 1m53.72s | 0m03.58s |
| gunzip | | | 0m29.82s | 0m28.63s | 0m00.67s |
| punzip | | | 0m10.52s | 0m09.81s | 0m00.43s |
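
The RATE column is the raw size divided by the compressed size. The short calculation below reproduces it from the SIZE column and also derives two ratios the table does not list: pzip output relative to gzip output, and punzip real time relative to gunzip real time. The variable names are illustrative; the numbers are taken from the table.

```python
# Sizes from the SIZE column; real decompression times in seconds from the REAL column.
raw_size = 342_279_796
gzip_size = 31_471_465
pzip_size = 17_549_599
gunzip_real = 29.82
punzip_real = 10.52

gzip_rate = raw_size / gzip_size                # RATE for gzip
pzip_rate = raw_size / pzip_size                # RATE for pzip
space_gain = gzip_size / pzip_size              # how much smaller pzip output is than gzip output
decompress_speedup = gunzip_real / punzip_real  # how much faster punzip is than gunzip

print(f"gzip rate {gzip_rate:.1f}, pzip rate {pzip_rate:.1f}")
print(f"space gain {space_gain:.2f}x, decompression speedup {decompress_speedup:.2f}x")
```

For this data set the pzip output is about 1.8 times smaller than the gzip output, and decompression is about 2.8 times faster.
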
Generate a partition, letting pin determine the row size and high
frequency cutoff:
    pin test.dat > test.prt
Generate a partition with a 10% high frequency cutoff for fixed length
data with 100 byte records, and trace the progress:
    pin -v -r 100 -h 10% test.dat > test.prt
Compress the data:
    pzip -p test.prt test.dat > test.pz
Decompress the data:
    pzip test.pz > t
pzip is part of the ast-open package posted at the
AT&T Software Technology download site.
The binary tarballs contain executables for pop, pin, and pzip.
The man page for each command can be listed on the standard error
in text form using the --man option or in html form using the
--html option.
Use the --?help option for help details.
Glenn Fowler
Information and Software Systems Research
AT&T Labs Research
Florham Park NJ
March 13, 2009