The LZSS compression algorithm is one of a number of compression algorithms generally referred to as
"Dictionary Methods". These algorithms rely upon the fact that (in general) an input data buffer will contain repeating
"patterns" or matching sequences of bytes.
The algorithms fall into 2 groups. Systems like LZ78 and LZW scan an input buffer and construct a
"dictionary" of the most commonly occurring byte sequences or
"phrases". This dictionary is pre-pended with the compressed data and the compressed data comprises an array of indices into the dictionary.
A second set is a modification of this in that the data dictionary is implicit in the uncompressed data buffer. All are based upon an algorithm developed and published in 1977 by Abraham Lempel and Jakob Ziv LZ77. A refinement of this algorithm, which is the basis for practically all the later methods in this group, is the LZSS algorithm developed in 1982 by Storer and Szymanski. These methods try to find if the character sequence currently being compressed has already occurred earlier in the input data and then, instead of repeating it, output only a pointer to the earlier occurrence. This is illustrated in the following diagram:
The algorithm searches the window (a buffer moving back from the current position in the input data). It searches for the longest match with the beginning of the look-ahead buffer (a buffer moving forward from the current position in the input data) and outputs a pointer to that match. This pointer indicates a position and length of that data match. It is referred to here as a
"Slice Descriptor".
Since it is possible that not even a one-character match can be found, the output cannot contain just pointers. Accordingly at times it is necessary to write literal octets into the output buffer. A block of literal octets is preceded by a
"Literal Block Identifier" which indicates the length of the literal octet sequence that follows.
The following is provided as an informative example using the input buffer shown below.
Step 1.
Starting position is byte 1 in the input buffer. For octets 1 to 3 there are no octet matches in the window for the look-ahead buffer. So write a literal octet sequence of 3 octets following a literal block header.
Step 2.
Current position is octet 4. Examining the look-ahead buffer and the window a 3 octet match is found beginning 3 octets before (octet 1) and of 3 octets in length. A 2 octet slice descriptor is added to the output buffer. The current position moves to octet 7 of the input buffer.
Step 3.
Current position is octet 7 in the input buffer (0x04). There are no matches in the window for this value so a 2 octet literal sequence is written to the end of the output buffer. The current position moves to octet 8 of the input buffer.
Step 4.
Current position is octet 8 of the input buffer. Comparing the window with the look-ahead buffer reveals a octet match from the current position with octets 1 to 6 of the input buffer. That is a 6 octet sequence beginning 7 octets back from the current position. A two-octet slice descriptor for this match is added to the output buffer. The current position moves to octet 14 of the input buffer (6 octets further on).
Step 5.
Current position is octet 14 of the input buffer. Comparing the window with the look-ahead buffer reveals another 3 octet sequence match (0x01, 0x02, 0x03). This octet sequence occurs several times in the window within the 511 octets that the slice descriptor allows. Therefore several different (but valid) slice descriptors could be written (this would be implementation dependent). However in this example we will reference the initial 3 octets of the input buffer and write a slice descriptor indicating a 3 octet match beginning 13 octets behind the current position.