This is a port of Jeff Bonwick's lzjb compression algorithm to pure Python. This compression scheme is used in the ZFS filesystem.
One of its main features is very small memory requirements for decompression. This can make it a suitable choice when adding compression in memory-constrained environments, such as in embedded development.
The name is perhaps not optimal. I didn't want to come up with a "fancy" name that has no meaning. I know of the pylzjb project, which provides Python bindings for a C implementation of lzjb.
This code is starting to feel quite mature and polished. That feeling is helped by the fact that it's very short: the core functions occupy less than 150 lines, including docstrings. The only remaining work I can think of is more profiling/optimization, but it already works well.
Like most Python packages, this one is installed using setup.py.
Installation is a two-step process:
$ ./setup.py build

$ ./setup.py install
Unlike pylzjb, the module install name for this project is simply lzjb.
I think this makes sense; it should be fairly obvious that the imported module is for Python.
This is open source, distributed under the BSD 2-clause license.
To ensure compatibility with the public C code for LZJB compression, automatic testing is performed. A simple shell script runs python-lzjb against both the C code and itself, on a set of 30 files. The test script emits a simple matrix which quickly shows when something breaks.
This package is designed to work with both Python 2.x and 3.x from the same source. It has been tested on Python 2.7.17 and Python 3.7.5, by running the test script. The test "framework" is rather Unix-centric, apologies. It could/should probably be rewritten in Python to be more portable.
The main goals when implementing this have been correctness and (a degree of) clarity, achieved by closely following the original C code. On my not-so-hot laptop (Intel® Core™ i5 M 480 @ 2.67GHz) it currently achieves around 1.1 MB/s when compressing.
The package's API is extremely simple.
Data is managed as Python bytearray objects.
There are two groups of functions: size encoding/decoding, and data compression/decompression.
The size functions are mainly intended to help with creating suitable header data for compressed data. They implement a simple variable-length integer encoding that can be used to prefix compressed data with the size of the original, uncompressed, data. The compression and decompression functions themselves neither emit nor expect any header; providing one is up to the application, as in the sketch below.
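As a minimal sketch of that framing pattern (using the functions documented below; the sample data is made up):

```python
import lzjb

original = bytearray(b"some data worth compressing " * 20)

# compress() appends to dst when one is given, so the bytearray returned
# by size_encode() can serve directly as the destination: the result is
# the encoded size immediately followed by the compressed payload.
blob = lzjb.compress(original, dst=lzjb.size_encode(len(original)))

# To unpack, decode the size prefix to learn where the payload begins.
size, header_len = lzjb.size_decode(blob)
restored = lzjb.decompress(blob[header_len:])
assert restored == original and size == len(restored)
```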
The text below is extracted from the source code's docstrings by the docbuilder.py program.
## Size encoding

- size_encode(size, dst = None)

  Encodes the given size in little-endian variable-length encoding.

  The dst argument can be an existing bytearray to append the size to. If it's omitted (or None), a new bytearray is created and used.

  Returns the destination bytearray.

- size_decode(src)

  Decodes a size (encoded with size_encode()) from the start of src.

  Returns a tuple (size, len) where size is the size that was decoded, and len is the number of bytes from src that were consumed.
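For example (a small sketch of the round trip; the magic string is invented for illustration):

```python
import lzjb

encoded = lzjb.size_encode(1000000)         # new bytearray holding the encoded size
size, consumed = lzjb.size_decode(encoded)
assert size == 1000000 and consumed == len(encoded)

# An existing bytearray can also be extended in place, e.g. when
# building a file header:
header = bytearray(b"MAGIC")                # hypothetical magic string
lzjb.size_encode(1000000, dst=header)       # returns the same bytearray, now longer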
## Compression

- compress(src, dst = None)

  Compresses src, the source bytearray.

  If dst is not None, it's assumed to be the output bytearray, and bytes are appended to it using dst.append(). If it is None, a new bytearray is created.

  The destination bytearray is returned.

- decompress(src, dst = None)

  Decompresses src, a bytearray of compressed data.

  The dst argument can be an optional bytearray to which the output will be appended. If it's None, a new bytearray is created.

  The output bytearray is returned.
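A simple round trip (a sketch based on the behaviour documented above; the sample data is made up):

```python
import lzjb

data = bytearray(b"the quick brown fox jumps over the lazy dog " * 10)

packed = lzjb.compress(data)        # returns a new bytearray of compressed bytes
unpacked = lzjb.decompress(packed)  # and back again
assert unpacked == data

# Both functions can append to an existing bytearray instead:
out = bytearray()
assert lzjb.compress(data, dst=out) is out
```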
This was ported to Python based on:
- The original C code
- The JavaScript port, which adds the uncompressed data size as a prefix
Thanks of course to these authors for contributing their code as open source.