

AT&T Labs Research pzip command


Glenn Fowler <gsf@research.att.com>

AT&T Labs Research - Florham Park NJ


Abstract

Fixed length record data, although easy to access, is often viewed as a waste of space. Many projects go to great lengths to optimize data schemas to save space, but complicate the data interface in the process. pzip shows that in many cases this view of fixed length data is wrong: variable length data may become more compressible when converted to a sparse, fixed length format. Intense semantic schema analysis can be replaced by automated record partitioning, yielding compressed sizes 2 to 10 times smaller and decompression 2 to 3 times faster than gzip for a large class of data.


Description

pzip(1) compresses and decompresses data files of fixed length rows (records) and columns (fields). It outperforms gzip(1) in both space and time on data in which many columns (typically more than 50%) change at a low rate. Columns with a low rate of change are called low frequency; columns with a high rate of change are called high frequency.

The pzip compressed format is itself gzipped: before being passed to gzip, the uncompressed data is reorganized according to the user-specified partition file. Low frequency columns are difference encoded, and high frequency column groups are transposed to column-major order. The gzip tables are flushed between column partition groups, which has a positive space/time effect on the gzip string match and Huffman tables.
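
The reorganization described above can be illustrated with a toy sketch. This is not the pzip file format or its implementation: the function names, the byte-wise subtraction used for difference encoding, and the use of one independent zlib stream per group (standing in for flushing the gzip tables) are all assumptions made for illustration.

```python
import zlib


def split_rows(data, row_size):
    """Split fixed length record data into a list of rows."""
    if row_size <= 0 or len(data) % row_size:
        raise ValueError("data length must be a multiple of row_size")
    return [data[i:i + row_size] for i in range(0, len(data), row_size)]


def delta_encode(rows, cols):
    """Assumed difference encoding: each selected byte minus the same
    column in the previous row (mod 256), emitted in row-major order."""
    out = bytearray()
    prev = bytes(len(cols))
    for row in rows:
        cur = bytes(row[c] for c in cols)
        out.extend((a - b) & 0xFF for a, b in zip(cur, prev))
        prev = cur
    return bytes(out)


def transpose(rows, cols):
    """Emit the selected columns in column-major order."""
    out = bytearray()
    for c in cols:
        out.extend(row[c] for row in rows)
    return bytes(out)


def toy_compress(data, row_size, low_cols, high_groups):
    """low_cols and high_groups together must cover every column once."""
    rows = split_rows(data, row_size)
    streams = [delta_encode(rows, low_cols)]
    streams += [transpose(rows, group) for group in high_groups]
    # One independent deflate stream per group, so no match or Huffman
    # state carries over from one group to the next.
    return [zlib.compress(s, 9) for s in streams]


def toy_decompress(chunks, row_size, low_cols, high_groups):
    """Invert toy_compress and return the original row-major data."""
    streams = [zlib.decompress(c) for c in chunks]
    n_rows = sum(len(s) for s in streams) // row_size
    rows = [bytearray(row_size) for _ in range(n_rows)]
    k = len(low_cols)
    prev = bytes(k)
    for r in range(n_rows):
        delta = streams[0][r * k:(r + 1) * k]
        cur = bytes((d + p) & 0xFF for d, p in zip(delta, prev))
        for c, value in zip(low_cols, cur):
            rows[r][c] = value
        prev = cur
    for group, stream in zip(high_groups, streams[1:]):
        for j, c in enumerate(group):
            for r in range(n_rows):
                rows[r][c] = stream[j * n_rows + r]
    return b"".join(rows)
```

In this sketch a low frequency column that rarely changes becomes a long run of zero bytes after difference encoding, and each high frequency group places similar bytes next to each other, which is the kind of input gzip-style compressors handle well.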

Two other commands are part of the pzip package. The pop(1) command lists column frequencies for fixed length data and the pin(1) command induces pzip partitions from training data.
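
To make the notion of column frequency concrete, here is a minimal sketch. It assumes that a column's frequency is the fraction of row transitions in which that column's byte changes; the actual definition and output format used by pop(1), and the partition induction performed by pin(1), are not described here, so the function names and the cutoff handling are hypothetical.

```python
def column_frequencies(data, row_size):
    """Return, for each column, the fraction of consecutive row pairs
    in which the byte in that column differs (assumed definition)."""
    if row_size <= 0 or len(data) % row_size:
        raise ValueError("data length must be a multiple of row_size")
    n_rows = len(data) // row_size
    if n_rows < 2:
        return [0.0] * row_size
    changes = [0] * row_size
    for r in range(1, n_rows):
        prev = data[(r - 1) * row_size:r * row_size]
        cur = data[r * row_size:(r + 1) * row_size]
        for c in range(row_size):
            if cur[c] != prev[c]:
                changes[c] += 1
    return [n / (n_rows - 1) for n in changes]


def split_by_cutoff(freqs, cutoff):
    """Classify columns as low or high frequency around a cutoff,
    where cutoff is a fraction such as 0.10 for 10%."""
    low = [c for c, f in enumerate(freqs) if f <= cutoff]
    high = [c for c, f in enumerate(freqs) if f > cutoff]
    return low, high
```

The two lists returned by split_by_cutoff correspond to the low frequency and high frequency columns in the sense used above; grouping the high frequency columns into partition groups is a separate step that this sketch does not attempt.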


Results

This table shows timing and size results for pzip and gzip, run on a 400MHz SGI MIPS processor. RATE is the raw size divided by the compressed size.

US Census field group 301 / all states
COMMAND         SIZE  RATE      REAL      USER       SYS
raw      342,279,796   1.0         -         -         -
gzip      31,471,465  10.9  4m04.22s  3m58.79s  0m04.40s
pzip      17,549,599  19.5  2m26.95s  1m53.72s  0m03.58s
gunzip             -     -  0m29.82s  0m28.63s  0m00.67s
punzip             -     -  0m10.52s  0m09.81s  0m00.43s
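
The ratios implied by the table can be recomputed directly from its figures. The short script below is only arithmetic on the numbers above; the helper name to_seconds and the variable names are chosen for illustration.

```python
def to_seconds(t):
    """Convert a time such as '4m04.22s' to seconds."""
    minutes, seconds = t.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


raw_size = 342_279_796
gzip_size = 31_471_465
pzip_size = 17_549_599

# Compression rate as defined by the table: raw size over compressed size.
gzip_rate = raw_size / gzip_size
pzip_rate = raw_size / pzip_size

# How much smaller the pzip output is than the gzip output.
size_gain = gzip_size / pzip_size

# Wall clock (REAL) speedups of pzip over gzip.
compress_speedup = to_seconds("4m04.22s") / to_seconds("2m26.95s")
decompress_speedup = to_seconds("0m29.82s") / to_seconds("0m10.52s")

print(f"gzip rate {gzip_rate:.1f}, pzip rate {pzip_rate:.1f}")
print(f"pzip output is {size_gain:.2f}x smaller than gzip output")
print(f"compression {compress_speedup:.2f}x faster, "
      f"decompression {decompress_speedup:.2f}x faster")
```

For this data set the pzip output is roughly 1.8 times smaller than the gzip output, and decompression is roughly 2.8 times faster in wall clock time.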


Publications


Examples

Generate a partition, letting pin determine the row size and high frequency cutoff:
     pin test.dat > test.prt
Generate a partition with a 10% high frequency cutoff for 100 byte record fixed length data and trace the progress:
     pin -v -r 100 -h 10% test.dat > test.prt
Compress the data:
     pzip -p test.prt test.dat > test.pz
Decompress the data:
     pzip test.pz > t
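
To try the commands above without real data, a file of 100 byte fixed length records can be generated synthetically. The generator below is a hypothetical helper, not part of the pzip package: the record layout, the 5% change probability, and the choice of ten high frequency columns are arbitrary assumptions meant only to produce data with mostly low frequency columns.

```python
import random


def make_records(n_rows, row_size=100, n_high=10, seed=0):
    """Build n_rows fixed length records of ASCII digits. The last
    n_high columns change on every row; the rest change rarely."""
    if not 0 <= n_high < row_size:
        raise ValueError("n_high must be in the range [0, row_size)")
    rng = random.Random(seed)
    row = bytearray(b"0" * row_size)
    out = bytearray()
    for _ in range(n_rows):
        # High frequency columns: a fresh random digit on every row.
        for c in range(row_size - n_high, row_size):
            row[c] = rng.randrange(48, 58)
        # Low frequency columns: occasionally change a single byte.
        if rng.random() < 0.05:
            row[rng.randrange(row_size - n_high)] = rng.randrange(48, 58)
        out += row
    return bytes(out)


def write_test_file(path="test.dat", n_rows=10000):
    """Write the synthetic records to a file usable as pin/pzip input."""
    with open(path, "wb") as f:
        f.write(make_records(n_rows))
```

Calling write_test_file() creates a test.dat whose row size matches the -r 100 option used in the second example.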


Download

pzip is part of the ast-open package posted at the AT&T Software Technology download site.


Usage

The binary tarballs contain executables for pop, pin, and pzip. The man page for each command can be listed on the standard error in text form using the --man option or in HTML form using the --html option. Use the --?help option for details on the help options.


Glenn Fowler
Information and Software Systems Research
AT&T Labs Research
Florham Park NJ
March 13, 2009