LZ77 and LZ78 are the names for the two lossless data compression algorithms published in papers by Abraham Lempel and Jacob Ziv in 1977 and 1978. They are also known as LZ1 and LZ2 respectively. These two algorithms form the basis for most of the LZ variations including LZW, LZSS, LZMA and others.
They are both dictionary coders, unlike minimum redundancy coders. LZ77 is the "sliding window" compression algorithm, which was later shown to be equivalent to the explicit dictionary technique first given in LZ78. However, they are only equivalent when the entire data is decompressed. LZ78 decompression allows random access to the input as long as the entire dictionary is available, while LZ77 decompression must always start at the beginning of the input.
LZ77 algorithms achieve compression by replacing portions of the data with references to matching data that have already passed through both encoder and decoder. A match is encoded by a pair of numbers called a length-distance pair, which is equivalent to the statement "each of the next length characters is equal to the character exactly distance characters behind it in the uncompressed stream." (The "distance" is sometimes called the "offset" instead.)
The encoder and decoder must both keep track of some amount of the most recent data, such as the last 2 kB, 4 kB, or 32 kB. The structure in which this data is held is called a sliding window, which is why LZ77 is sometimes called sliding window compression. The encoder needs to keep this data to look for matches, and the decoder needs to keep this data to interpret the matches the encoder refers to. This is why the encoder can use a smaller size sliding window than the decoder, but not vice-versa.
Many documents which talk about LZ77 algorithms describe a length-distance pair as a command to "copy" data from the sliding window: "Go back distance characters in the buffer and copy length characters, starting from that point." While those used to imperative programming may find this model intuitive, it may also make it hard to understand a feature of LZ77 encoding: namely, that it is not only acceptable but frequently useful to have a length-distance pair where the length actually exceeds the distance. As a copy command, this is puzzling: "Go back one character in the buffer and copy seven characters, starting from that point." How can seven characters be copied from the buffer when only one of the specified characters is actually in the buffer? Looking at a length-distance pair as a statement of identity, however, clarifies the confusion: each of the next seven characters is identical to the character that comes one before it. This means that each character can be determined by looking back in the buffer – even if the character looked back to was not in the buffer when the decoding of the current pair began. Since by definition a pair like this will be repeating a sequence of distance characters multiple times, it means that LZ77 incorporates a flexible and easy form of run-length encoding.
Even though all LZ77 algorithms work by definition on the same basic principle, they can vary widely in how they output their encoded data – what values are possible for lengths and distances, for example, and how length-distance pairs are distinguished from literals (single characters encoded as themselves, rather than as part of a length-distance pair.) A few examples:
Full article ▸