Revisiting compression software


  • Sun 29 November 2015
  • misc

Back in the dawn of time, there was a Huffman-coding-based compression program called pack (file suffix .z).

By the mid-80s, an improved, LZW-based compression program called compress (file suffix .Z) had gained popularity... and a patent. Remember the GIF patent? Yeah. That.

The patent situation was a motivating factor behind the development of the DEFLATE-based compression program gzip (file suffix .gz). As an added bonus, it compressed most data better than compress did. Gzip went on to become the go-to compression format for many applications, including HTTP. It's still a favorite today.

By the late 90s, the new hotness was the BWT-based (Burrows-Wheeler transform) compression program bzip2 (file suffix .bz2). It offers slightly better compression than gzip at a fairly substantial cost in CPU time. It's reasonably popular today.

A decade or more ago I became aware of the LZMA/LZMA2-based 7-Zip (file suffix .7z) because the Wikimedia Foundation was using it to distribute their snapshots of Wikipedia. It delivers substantially better compression than gzip or bzip2, but at a disproportionately higher cost in CPU time, which led me to dismiss it as unappealing except in the most extreme cases, where saving file size or bandwidth matters above all else. 7-Zip is not just a compression program; it's an integrated archiver and compressor (along the lines of tar + gzip in one program), more akin to WinZip than to any of the other software described here. If single-file compression is possible, I wasn't able to find it in the man page.

Another implementation of LZMA/LZMA2 compression is the more Unix-like xz. It shares command line flag semantics with the older Unix compression software, and shares its appetite for CPU resources with 7-Zip.
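As a rough illustration of that shared interface, the sketch below pushes one file through gzip, bzip2 and xz using only flags that all three accept: -9 for maximum compression, -c to write to stdout, and -d to decompress. The helper name roundtrip and the sample file are made up for the example, and any tool that isn't installed is simply skipped.

```shell
#!/bin/sh
# Hypothetical helper: compress a file with the given tool, decompress it
# again, and check that the bytes survive the trip unchanged.
# gzip, bzip2 and xz all understand -9, -c (stdout) and -d (decompress).
roundtrip() {
    tool=$1
    file=$2
    "$tool" -9 -c "$file" | "$tool" -d -c | cmp -s - "$file"
}

# Throwaway sample data (an assumption for the demo, not the author's archive).
sample=$(mktemp)
i=0
while [ "$i" -lt 1000 ]; do
    echo "line $i of some repetitive sample text"
    i=$((i + 1))
done > "$sample"

for tool in gzip bzip2 xz; do
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: not installed, skipping"
        continue
    fi
    if roundtrip "$tool" "$sample"; then
        echo "$tool: round trip OK"
    else
        echo "$tool: round trip FAILED"
    fi
done

rm -f "$sample"
```

The same three flags work unchanged across the tools, which is what makes xz an easy drop-in wherever gzip or bzip2 is already scripted.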

A funny thing happened in the past 10 years, though. Multi-core systems have become the rule rather than the exception, and systems have gratuitously large amounts of RAM by the standards of days gone by. Even the laptop on which I'm typing has four cores and 16 GB of RAM.

The popular LZMA/LZMA2 implementations are multithreaded. The older software is not. So we have an interesting situation wherein software that was previously way too slow is actually the fastest software available in terms of wall clock time when baked off against the older stuff.

I didn't bother with pack or compress, but I ran side-by-side tests of the modern stuff on a 9 Gbyte ZFS snapshot of a Maildir-formatted email archive. This very unscientific comparison was run under a SmartOS native instance (base-64 15.3.0) on an HP DL160G6 with 72 Gbytes of RAM and 8 cores of Xeon L5520 @ 2.27GHz. After each run I used dc to work out the compressed size in parts per thousand of the original, so lower is better. Here are the results, in descending order of wall clock time:

[root@sandbox31 ~]# time bzip2 --best < hmail-rs-test.zfssend > hmail-rs-test.zfssend.bz2

real    28m14.673s
user    28m8.995s
sys     0m5.621s
[root@sandbox31 ~]# ls -l hmail-rs-test.zfssend hmail-rs-test.zfssend.bz2
-rw-r--r-- 1 root root 9701109952 Nov 22 11:34 hmail-rs-test.zfssend
-rw-r--r-- 1 root root 3933718257 Nov 22 13:14 hmail-rs-test.zfssend.bz2
[root@sandbox31 ~]# dc
3933718257000 9701109952 / p
405
[root@sandbox31 ~]#



[root@sandbox31 ~]# time gzip --best < hmail-rs-test.zfssend > hmail-rs-test.zfssend.gz

real    14m15.291s
user    14m5.840s
sys     0m9.449s
[root@sandbox31 ~]# ls -l hmail-rs-test.zfssend hmail-rs-test.zfssend.gz
-rw-r--r-- 1 root root 9701109952 Nov 22 11:34 hmail-rs-test.zfssend
-rw-r--r-- 1 root root 4115925101 Nov 22 12:41 hmail-rs-test.zfssend.gz
[root@sandbox31 ~]# dc
4115925101000 9701109952 / p
424
[root@sandbox31 ~]#



[root@sandbox31 ~]# time xz -z -k -9 --threads=8 hmail-rs-test.zfssend

real    13m19.942s
user    96m45.039s
sys     0m27.977s
[root@sandbox31 ~]# ls -l hmail-rs-test.zfssend.xz hmail-rs-test.zfssend
-rw-r--r-- 1 root root 9701109952 Nov 22 11:34 hmail-rs-test.zfssend
-rw-r--r-- 1 root root 3228121140 Nov 22 11:34 hmail-rs-test.zfssend.xz
[root@sandbox31 ~]# dc
3228121140000 9701109952 / p
332
[root@sandbox31 ~]#



[root@sandbox31 ~]# time 7z a hmail-rs-test.zfssend.7z hmail-rs-test.zfssend

7-Zip (a) [64] 9.38 beta  Copyright (c) 1999-2014 Igor Pavlov  2015-01-03
p7zip Version 9.38.1 (locale=C,Utf16=off,HugeFiles=on,16 CPUs)
Scanning

Creating archive hmail-rs-test.zfssend.7z

Compressing  hmail-rs-test.zfssend

Everything is Ok

real    7m27.864s
user    92m36.357s
sys     0m45.193s
[root@sandbox31 ~]# ls -l hmail-rs-test.zfssend.7z hmail-rs-test.zfssend
-rw-r--r-- 1 root root 9701109952 Nov 22 11:34 hmail-rs-test.zfssend
-rw-r--r-- 1 root root 3351916628 Nov 22 11:44 hmail-rs-test.zfssend.7z
[root@sandbox31 ~]# dc
3351916628000 9701109952 / p
345
[root@sandbox31 ~]#
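
For anyone who wants to redo the arithmetic from the transcripts above, here is a minimal sketch. The helper names permille and busy_cores are hypothetical; the byte counts and timings are copied from the runs above, with the timings converted to seconds. It assumes a shell with 64-bit integer arithmetic.

```shell
#!/bin/sh
# Compressed size in parts per thousand of the original, truncated the
# same way the dc one-liners above truncate it.
permille() {
    compressed=$1
    original=$2
    echo $(( compressed * 1000 / original ))
}

# Average number of cores kept busy: (user + sys) / real, all in seconds.
busy_cores() {
    awk -v u="$1" -v s="$2" -v r="$3" 'BEGIN { printf "%.1f\n", (u + s) / r }'
}

orig=9701109952   # size of hmail-rs-test.zfssend in bytes

# Arguments: tool name, compressed bytes, user, sys and real time in seconds.
report() {
    echo "$1: $(permille "$2" "$orig") per mille, $(busy_cores "$3" "$4" "$5") cores busy"
}

report bzip2 3933718257 1688.995  5.621 1694.673
report gzip  4115925101  845.840  9.449  855.291
report xz    3228121140 5805.039 27.977  799.942
report 7z    3351916628 5556.357 45.193  447.864
```

By this measure bzip2 and gzip each kept about 1.0 core busy, xz about 7.3 and 7z about 12.5. That is where the wall clock advantage comes from: the LZMA tools burned several times more CPU time overall, but spread it across many cores. The 7z figure exceeds 8 because its banner reports 16 CPUs.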

It looks like 7z is the winner on time, but I may go with xz, the winner on space and Unix-ness (and still better than all the legacy stuff on time) for my off-site backup needs. Remember, first compress, then encrypt!
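
The ordering matters because good ciphertext is statistically close to random noise, which leaves a compressor nothing to work with. Here is a minimal sketch of that effect, using bytes from /dev/urandom as a stand-in for encrypted data (an assumption for the demo; no real cipher is involved) and a hypothetical helper called gz_size. It relies on head -c, which is widely available but not strictly POSIX.

```shell
#!/bin/sh
# Hypothetical helper: size in bytes of a file after gzip -9.
gz_size() {
    gzip -9 -c "$1" | wc -c | tr -d ' '
}

plain=$(mktemp)
noise=$(mktemp)

# 100000 bytes of highly repetitive "plaintext".
head -c 100000 /dev/zero | tr '\0' 'a' > "$plain"
# 100000 random bytes standing in for ciphertext.
head -c 100000 /dev/urandom > "$noise"

echo "plaintext: 100000 -> $(gz_size "$plain") bytes"
echo "noise:     100000 -> $(gz_size "$noise") bytes"

rm -f "$plain" "$noise"
```

The repetitive file shrinks to a tiny fraction of its size, while the random one does not shrink at all. In a real backup pipeline the compressor's output would therefore be fed to the encryption tool, never the other way around.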