# Overview

libdeflate is a library for fast, whole-buffer DEFLATE-based compression and
decompression.

The supported formats are:

- DEFLATE (raw)
- zlib (a.k.a. DEFLATE with a zlib wrapper)
- gzip (a.k.a. DEFLATE with a gzip wrapper)

libdeflate is heavily optimized.  It is significantly faster than the zlib
library, both for compression and decompression, and especially on x86
processors.  In addition, libdeflate offers optional high compression modes
that achieve a better compression ratio than zlib's "level 9".

libdeflate itself is a library, but the following command-line programs which
use this library are also provided:

* gzip (or gunzip), a program which mostly behaves like the standard equivalent,
  except that it does not yet have good streaming support and therefore does not
  yet support very large files
* benchmark, a program for benchmarking in-memory compression and decompression

## Table of Contents

- [Building](#building)
  - [For UNIX](#for-unix)
  - [For macOS](#for-macos)
  - [For Windows](#for-windows)
    - [Using Cygwin](#using-cygwin)
    - [Using MSYS2](#using-msys2)
- [API](#api)
- [Bindings for other programming languages](#bindings-for-other-programming-languages)
- [DEFLATE vs. zlib vs. gzip](#deflate-vs-zlib-vs-gzip)
- [Compression levels](#compression-levels)
- [Motivation](#motivation)
- [License](#license)


# Building

## For UNIX

Just run `make`, then (if desired) `make install`.  You need GNU Make and either
GCC or Clang.  GCC is recommended because it builds slightly faster binaries.

By default, the following targets are built: the static library `libdeflate.a`,
the shared library `libdeflate.so`, the `gzip` program, and the `gunzip` program
(which is actually just a hard link to `gzip`).  Benchmarking and test programs
such as `benchmark` are not built by default.  You can run `make help` to
display the available build targets.

There are also many options which can be set on the `make` command line, e.g. to
omit library features or to customize the directories into which `make install`
installs files.  See the Makefile for details.

## For macOS

Prebuilt macOS binaries can be installed with [Homebrew](https://brew.sh):

    brew install libdeflate

But if you need to build the binaries yourself, see the section for UNIX above.

## For Windows

Prebuilt Windows binaries can be downloaded from
https://github.com/ebiggers/libdeflate/releases.  But if you need to build the
binaries yourself, MinGW (gcc) is the recommended compiler to use.  If you're
performing the build *on* Windows (as opposed to cross-compiling for Windows on
Linux, for example), you'll need to follow the directions in **one** of the two
sections below to set up a minimal UNIX-compatible environment using either
Cygwin or MSYS2, then do the build.  (Other MinGW distributions may not work, as
they often omit basic UNIX tools such as `sh`.)

Alternatively, libdeflate may be built using the Visual Studio toolchain by
running `nmake /f Makefile.msc`.  However, while this is supported in the sense
that it will produce working binaries, it is not recommended because the
binaries built with MinGW will be significantly faster.

Also note that 64-bit binaries are faster than 32-bit binaries and should be
preferred whenever possible.

### Using Cygwin

Run the Cygwin installer, available from https://cygwin.com/setup-x86_64.exe.
When you get to the package selection screen, choose the following additional
packages from category "Devel":

- git
- make
- mingw64-i686-binutils
- mingw64-i686-gcc-g++
- mingw64-x86_64-binutils
- mingw64-x86_64-gcc-g++

(You may skip the mingw64-i686 packages if you don't need to build 32-bit
binaries.)

After the installation finishes, open a Cygwin terminal.  Then download
libdeflate's source code (if you haven't already) and `cd` into its directory:

    git clone https://github.com/ebiggers/libdeflate
    cd libdeflate

(Note that it's not required to use `git`; an alternative is to extract a .zip
or .tar.gz archive of the source code downloaded from the releases page.
Also, in case you need to find it in the file browser, note that your home
directory in Cygwin is usually located at `C:\cygwin64\home\<your username>`.)

Then, to build 64-bit binaries:

    make CC=x86_64-w64-mingw32-gcc

or to build 32-bit binaries:

    make CC=i686-w64-mingw32-gcc

### Using MSYS2

Run the MSYS2 installer, available from http://www.msys2.org/.  After
installing, open an MSYS2 shell and run:

    pacman -Syu

Say `y`, then when it's finished, close the shell window and open a new one.
Then run the same command again:

    pacman -Syu

Then, install the packages needed to build libdeflate:

    pacman -S git \
              make \
              mingw-w64-i686-binutils \
              mingw-w64-i686-gcc \
              mingw-w64-x86_64-binutils \
              mingw-w64-x86_64-gcc

(You may skip the mingw-w64-i686 packages if you don't need to build 32-bit
binaries.)

Then download libdeflate's source code (if you haven't already):

    git clone https://github.com/ebiggers/libdeflate

(Note that it's not required to use `git`; an alternative is to extract a .zip
or .tar.gz archive of the source code downloaded from the releases page.
Also, in case you need to find it in the file browser, note that your home
directory in MSYS2 is usually located at `C:\msys64\home\<your username>`.)

Then, to build 64-bit binaries, open "MSYS2 MinGW 64-bit" from the Start menu
and run the following commands:

    cd libdeflate
    make clean
    make

Or to build 32-bit binaries, do the same but use "MSYS2 MinGW 32-bit" instead.

# API

libdeflate has a simple API that is not zlib-compatible.  You can create
compressors and decompressors and use them to compress or decompress buffers.
See libdeflate.h for details.

There is currently no support for streaming.  Streaming has been considered, but
supporting it significantly increases complexity and slows down fast paths, so
for now it remains a future TODO.  So: if your application compresses data in
"chunks", say, less than 1 MB in size, then libdeflate is a great choice for
you; that's what it's designed to do.  This is perfect for certain use cases
such as transparent filesystem compression.  But if your application compresses
large files as a single compressed stream, as the `gzip` program does, then
libdeflate isn't for you.

Note that with chunk-based compression, you should generally store the
uncompressed size of each chunk outside of the compressed data itself.  This
lets you allocate an output buffer of the correct size without guessing.
However, libdeflate's decompression routines can optionally report the actual
number of output bytes in case you need it.
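
One way to keep each chunk's uncompressed size outside the compressed data is to
put a small fixed-size header in front of every chunk.  The sketch below is only
an illustration under stated assumptions: the `chunk_header` layout, the
little-endian encoding, and the function names are hypothetical and are not part
of libdeflate.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical header stored in front of each compressed chunk.
 * Not part of libdeflate; the layout is an assumption for illustration. */
struct chunk_header {
    uint32_t uncompressed_size; /* bytes to allocate before decompressing */
    uint32_t compressed_size;   /* bytes of compressed data that follow */
};

#define CHUNK_HEADER_SIZE 8

/* Store a 32-bit value in little-endian byte order. */
static void put_le32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v & 0xff);
    p[1] = (unsigned char)((v >> 8) & 0xff);
    p[2] = (unsigned char)((v >> 16) & 0xff);
    p[3] = (unsigned char)((v >> 24) & 0xff);
}

/* Load a 32-bit value stored in little-endian byte order. */
static uint32_t get_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Serialize a header into the first CHUNK_HEADER_SIZE bytes of buf. */
void chunk_header_write(unsigned char *buf, const struct chunk_header *h)
{
    put_le32(buf, h->uncompressed_size);
    put_le32(buf + 4, h->compressed_size);
}

/* Parse a header from buf, which holds `avail` bytes in total.
 * Returns 0 on success, or -1 if the header or the chunk data that
 * should follow it is truncated. */
int chunk_header_read(const unsigned char *buf, size_t avail,
                      struct chunk_header *h)
{
    if (avail < CHUNK_HEADER_SIZE)
        return -1;
    h->uncompressed_size = get_le32(buf);
    h->compressed_size = get_le32(buf + 4);
    if (h->compressed_size > avail - CHUNK_HEADER_SIZE)
        return -1;
    return 0;
}
```

After `chunk_header_read()` succeeds, a caller would allocate
`uncompressed_size` bytes and pass that as the output buffer size to the
decompression routine declared in libdeflate.h.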

Windows developers: note that the calling convention of libdeflate.dll is
"stdcall" -- the same as the Win32 API.  If you call into libdeflate.dll using a
non-C/C++ language, or dynamically using LoadLibrary(), make sure to use the
stdcall convention.  Using the wrong convention may crash your application.
(Note: older versions of libdeflate used the "cdecl" convention instead.)

# Bindings for other programming languages

The libdeflate project itself only provides a C library.  If you need to use
libdeflate from a programming language other than C or C++, consider using the
following bindings:

* C#: [LibDeflate.NET](https://github.com/jzebedee/LibDeflate.NET)
* Go: [go-libdeflate](https://github.com/4kills/go-libdeflate)
* Java: [libdeflate-java](https://github.com/astei/libdeflate-java)
* Julia: [LibDeflate.jl](https://github.com/jakobnissen/LibDeflate.jl)
* Python: [deflate](https://github.com/dcwatson/deflate)
* Ruby: [libdeflate-ruby](https://github.com/kaorimatz/libdeflate-ruby)
* Rust: [libdeflater](https://github.com/adamkewley/libdeflater)

Note: these are third-party projects which haven't necessarily been vetted by
the authors of libdeflate.  Please direct all questions, bugs, and improvements
for these bindings to their authors.

# DEFLATE vs. zlib vs. gzip

The DEFLATE format ([rfc1951](https://www.ietf.org/rfc/rfc1951.txt)), the zlib
format ([rfc1950](https://www.ietf.org/rfc/rfc1950.txt)), and the gzip format
([rfc1952](https://www.ietf.org/rfc/rfc1952.txt)) are commonly confused with
each other as well as with the [zlib software library](http://zlib.net), which
actually supports all three formats.  libdeflate (this library) also supports
all three formats.

Briefly, DEFLATE is a raw compressed stream, whereas zlib and gzip are different
wrappers for this stream.  Both zlib and gzip include checksums, but gzip can
include extra information such as the original filename.  Generally, you should
choose a format as follows:

- If you are compressing whole files with no subdivisions, similar to the `gzip`
  program, you probably should use the gzip format.
- Otherwise, if you don't need the features of the gzip header and footer but do
  still want a checksum for corruption detection, you probably should use the
  zlib format.
- Otherwise, you probably should use raw DEFLATE.  This is ideal if you don't
  need checksums, e.g. because they're simply not needed for your use case or
  because you already compute your own checksums that are stored separately from
  the compressed stream.
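
The rule of thumb above can be written down as a small decision function.  This
is only a sketch: `wrapper_format`, `choose_format()`, and its two boolean
parameters are hypothetical names for illustration and are not part of
libdeflate.

```c
#include <stdbool.h>

/* Hypothetical names for the three supported formats. */
enum wrapper_format { USE_GZIP, USE_ZLIB, USE_RAW_DEFLATE };

/* whole_file:    compressing a whole file with no subdivisions, like `gzip`
 * need_checksum: the format itself should detect corruption */
enum wrapper_format choose_format(bool whole_file, bool need_checksum)
{
    if (whole_file)
        return USE_GZIP;        /* gzip header and footer are useful here */
    if (need_checksum)
        return USE_ZLIB;        /* lighter wrapper, still has a checksum */
    return USE_RAW_DEFLATE;     /* no wrapper, no checksum */
}
```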

Note that gzip and zlib streams can be distinguished from each other based on
their starting bytes, but this is not necessarily true of raw DEFLATE streams.
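
As a rough sketch of that check, the function below inspects the first two bytes
of a buffer.  The gzip magic bytes come from RFC 1952 and the zlib header rules
from RFC 1950; the names `stream_format` and `sniff_format()` are hypothetical
and not part of libdeflate.  A raw DEFLATE stream can happen to begin with bytes
that pass the zlib test, so `FORMAT_ZLIB` here means "looks like zlib", not a
guarantee.

```c
#include <stddef.h>

enum stream_format {
    FORMAT_GZIP,
    FORMAT_ZLIB,
    FORMAT_UNKNOWN  /* possibly raw DEFLATE; the header cannot confirm it */
};

/* Guess the wrapper format from the first two bytes of a stream. */
enum stream_format sniff_format(const unsigned char *data, size_t size)
{
    if (size < 2)
        return FORMAT_UNKNOWN;

    /* RFC 1952: every gzip member starts with ID1 = 0x1f, ID2 = 0x8b. */
    if (data[0] == 0x1f && data[1] == 0x8b)
        return FORMAT_GZIP;

    /* RFC 1950: CM (low 4 bits of CMF) is 8 for DEFLATE, CINFO (high 4
     * bits of CMF) is at most 7, and CMF * 256 + FLG is a multiple of 31. */
    if ((data[0] & 0x0f) == 8 && (data[0] >> 4) <= 7 &&
        ((unsigned)data[0] * 256 + data[1]) % 31 == 0)
        return FORMAT_ZLIB;

    return FORMAT_UNKNOWN;
}
```

If your application already knows which format it wrote, prefer that knowledge
over sniffing; this kind of check is mainly useful for tools that accept input
from unknown sources.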

# Compression levels

An often-underappreciated fact of compression formats such as DEFLATE is that
there are an enormous number of different ways that a given input could be
compressed.  Different algorithms and different amounts of computation time will
result in different compression ratios, while remaining equally compatible with
the decompressor.

For this reason, the commonly used zlib library provides nine compression
levels.  Level 1 is the fastest but provides the worst compression; level 9
provides the best compression but is the slowest.  zlib defaults to level 6.
libdeflate follows the same scheme but aims to improve on both zlib's
performance *and* compression ratio at every compression level.  In addition,
libdeflate's levels go [up to 12](https://xkcd.com/670/) to make room for a
minimum-cost-path based algorithm (sometimes called "optimal parsing") that can
significantly improve on zlib's compression ratio.

If you are using DEFLATE (or zlib, or gzip) in your application, you should test
different levels to see which works best for your application.

# Motivation

Despite DEFLATE's widespread use mainly through the zlib library, in the
compression community this format from the early 1990s is often considered
obsolete.  And in a few significant ways, it is.

So why implement DEFLATE at all, instead of focusing entirely on
bzip2/LZMA/xz/LZ4/LZX/ZSTD/Brotli/LZHAM/LZFSE/[insert cool new format here]?

To do something better, you need to understand what came before.  And it turns
out that most ideas from DEFLATE are still relevant.  Many of the newer formats
share a structure similar to DEFLATE's, with different tweaks.  The effects of
trivial but very useful tweaks, such as increasing the sliding window size, are
often confused with the effects of nontrivial but less useful tweaks.  And
actually, many of these formats are similar enough that common algorithms and
optimizations (e.g. those dealing with LZ77 matchfinding) can be reused.

In addition, comparing compressors fairly is difficult because the performance
of a compressor depends heavily on optimizations which are not intrinsic to the
compression format itself.  In this respect, the zlib library sometimes compares
poorly to certain newer code because zlib is not well optimized for modern
processors.  libdeflate addresses this by providing an optimized DEFLATE
implementation which can be used for benchmarking purposes.  And, of course,
real applications can use it as well.

# License

libdeflate is [MIT-licensed](COPYING).

I am not aware of any patents or patent applications relevant to libdeflate.