Technical Notes
OverviewThe speed/size specifier determines the relative importance of speed and compression when initializing the compression system. A value of 0 gives the tightest compression and 255 gives the fastest compression; a value 127 gives a good balance between speed and size (usually about 60% of the maximum speed and about 90% of the maximum compression ratio, or about three times the minimum speed and about twice the minimum compression ratio).
The window length determines the quantity of data that is searched when matching patterns. The data within the sliding window is referred to as the dictionary. After each byte is processed, this window advances or "slides" one byte to include the next byte in the data stream while the byte at the other end of the sliding window is removed from the dictionary. Increasing the window length by one byte increases the memory requirements of the compression system by approximately five bytes. A larger window size almost always gives better compression at the expense of processing speed. However, extending the window length is only helpful when the window length is less than the total number of bytes to be compressed. Data that is mostly cyclic ("AAAAA", for example) can achieve near maximum compression at maximum speed. If the data that will be compressed isknown to be highly cyclic (many databases and .EXE's have very long strings of NULLs), the window length can be set to its minimum value (512) with a minimal impact on the compression ratio. Window lengths from 512 bytes to 32K are supported.
The look-ahead length specifies the maximum number of bytes which can be matched in the dictionary when data patterns are compared. Normally, an excessively large look-ahead buffer is not helpful since most data types do not have many matches longer than 1K or 2K. For example, a book is full of duplicate words occurring in different combinations, but duplicate sentences are much less frequent, duplicate paragraphs are very rare, duplicate pages are almost non-existent, and duplicate chapters do not exist. Extending the look-ahead length by one byte only increases the memory consumption of the compression system by one byte and has no adverse affect on the compression ratio. Look-ahead lengths from 256 bytes to 16K are supported.
The maximum weight determines how often the match frequencies should be suppressed for Huffman encoding. This allows the compression system to adjust to changes in the input data mix. If the input data is expected to change rapidly throughout the compression (a mix of text and binary data, for example), the maximum weight should be set to a low value to allow the compression system to adjust more rapidly to the changes in the data. However, reducing this number will hurt compression on relatively homogeneous data (ASCII text, for example). Maximum weights from 512 to 32K are supported.
The hash size affects the speed of the compression. Increasing the number of bits in the hash speeds compression by distributing the organization of the dictionary (data in the sliding window) over a large buffer. This reduces the amount of data that needs to be processed when a pattern points to a location in the hash buffer. A 9-bit hash requires 1K of memory, and each increment in the number of bits in the hash doubles the memory requirements, with a maximum memory usage of 64K for a 15-bit hash. Hash sizes from 9 bits to 15 bits are supported.
Of these four variables, window length has the greatest effect on both speed and compression ratio. The hash size has the next greatest effect on speed but no effect whatsoever on the compression ratio. The maximum weight has a significant effect on the compression ratio and only a slight effect on compression speed (larger weights increase the speed), but the effect on the compression ratio is completely dependent on the characteristics of the compressed data. The look-ahead length has a minimal effect on compression ratio (it increases slightly as it is increased above 256 bytes) and almost no effect on compression speed.
These functions return three values representing the amount of additional memory needed by the compression system for the given speed/size specifier:
_pack_start and _pack_startc initialize the variables and buffers used by the compression system and return the address and size of the initial input buffer. Both functions perform this initialization based on a supplied speed/size specifier, and both functions require an externally-supplied work buffer. In addition, _pack_startc accepts a pdat structure pointer to specify custom configuration options. If these functions are not given enough memory to meet the requirements of the speed/size specifier, non-custom settings are adjusted (at the expense of the compression ratio) to initialize the compression system with the memory provided. If the work buffer is larger than the minimum required size, the extra memory is used for speed optimizations.
The pdat structure used by the custom initialization functions is defined in COMPRESS.H as follows:
typedef struct {
void far * buf1; /* pointer to user buffer 1 */
int buf1size; /* size of user buffer 1 (in bytes, 0=none) */
void far * buf2; /* pointer to user buffer 2 */
int buf2size; /* size of user buffer 2 (in bytes, 0=none) */
int winlen; /* size of sliding window (512-32k bytes) */
int looklen; /* size of look-ahead buffer (256-16k bytes) */
int maxweight; /* maximum weight of huffman tree (512-32k) */
char hash; /* hash setting (9-15 bits, 15=fastest) */
} pdat;
If winlen, looklen, maxweight, or hash is set to zero, that value is calculated based on the speed/size specifier. If buf1size and/or buf2size is non-zero, the compression system attempts to use the larger buffer as the input buffer and the smaller buffer (if any) for output of compressed data. If the larger buffer is too small to be used as the input buffer, it is used as the output buffer instead. User buffers must be at least 512 bytes in size to be used by the compression system.
To determine the amount of memory that will be required to decompress the data, _unpack_bufsuze or _unpack_bufsizec can be called in conjunction with _get_packprefix after the compression system has been initialized.
Because the compression prefix is required to initialize the decompression system, some applications choose to store the compression prefix separately from the compressed data. The _get_packprefix function is provided to make this possible. This function makes a copy of the compression prefix at a specified address but does not remove the compression prefix from the output data stream. _pack_start or _pack_startc must be called before this function may be used.
The compression prefix contains all the information which the decompression system needs to successfully decompress a compressed data stream. The compression prefix is a structure of type packp (defined in COMPRESS.H) as follows:
typedef struct {
char len; /* prefix length, in bytes */
int ver; /* compression system version number (varies) */
int winlen; /* window length */
int looklen; /* look-ahead length */
int maxweight; /* maximum Huffman tree weight */
} packp;
In addition, the compression prefix is always followed by a byte value of 0xFF in the compressed data stream. The fields in the packp structure are defined as follows:
Each time _pack_continue returns, it indicates one of three possible conditions: 1) more input is needed; 2) output needs to be flushed; or 3) compression is complete. When more input is needed or the output needs to be flushed, _pack_continue returns the address and size of the input or output buffer. After input data has been requested, _pack_continue accepts an input value which indicates the number of input bytes placed in the input buffer. When this input value is zero, all remaining data in the input buffer is compressed and flushed. The "compression complete" condition is then returned.
The following is an example of code which compresses a file:
int inhandle, outhandle, flag = 1;
void far *buffer;
unsigned int read_size;
... /* open input and output files, initialize compression system */
do {
switch (flag)
{
case 1:
read_size = _read_h (inhandle, buffer, read_size);
if (read_size == 0xffff)
{
_put_str ("Error reading from input file");
return;
}
break;
case -1:
read_size = _write_h (outhandle, buffer, read_size);
if (read_size == 0xffff)
{
_put_str ("Error writing to output file");
return;
}
}
flag = _pack_continue (&buffer, &read_size);
} while (flag);
The udat structure used by _unpack_bufsizec and _unpack_startc is defined in COMPRESS.H as follows:
typedef struct {
void far * buf1; /* pointer to user buffer 1 */
int buf1size; /* size of user buffer 1 (bytes) */
void far * buf2; /* pointer to user buffer 2 */
int buf2size; /* size of user buffer 2 (bytes) */
} udat;
Each time _unpack_continue returns, it indicates one of three possible conditions: 1) more input is needed; 2) output needs to be flushed; or 3) decompression is complete. When more input is needed or the output needs to be flushed, _unpack_continue returns the address and size of the input or output buffer. After input data has been requested, _unpack_continue accepts an input value which indicates the number of input bytes placed in the input buffer. When this input value is zero, all remaining data in the input buffer is decompressed and flushed. The "decompression complete" condition is then returned.
Following is an example of code that would decompress a file:
int inhandle, outhandle, flag = 1;
unsigned int read_size;
void far *buffer;
... /* open input and output files, initialize decompression */
do {
switch (flag)
{
case 1:
read_size = _read_h (inhandle, buffer, read_size);
if (read_size == 0xffff)
{
_put_str ("Error reading from input file");
return;
}
break;
case -1:
read_size = _write_h (outhandle, buffer, read_size);
if (read_size == 0xffff)
{
_put_str ("Error writing to output file");
return;
}
}
flag = _pack_continue (&buffer, &read_size);
} while (flag);
_pack_cmplinit initializes the compression system to allow the compression completion percentage to be calculated. _unpack_cmplinit initializes the decompression system to allow the decompression completion percentage to be calculated. Both functions require the total size of the data before compression, in bytes. Both initialization functions may be called again to reset the respective completion percentage to zero. Thereafter, _get_packcmpl (or _get_unpackcmpl) may be called to obtain the current compression (or decompression) completion percentage.
The completion percentages reported by _get_packcmpl and _get_unpackcmpl are integers between 0 and 100. These values are calculated by determining the total number of uncompressed (or decompressed) bytes processed since the last call to the initialization function and then dividing this value by the original, uncompressed size of the corresponding data stream.
100% of all single-bit errors, 100% of all two-bit errors, 100% of all odd numbers of errors, 100% of all burst errors less than 17 bits wide, 99.9969% of all burst errors 17 bits wide, and 99.9985% of all burst errors greater than 17 bits wide.
Microsoft C:
cl /c /I\msc\include /I\sa\include pack.c
link pack,pack,,\sa\lib\_sas \msc\lib\slibce
Borland C:
bcc /c /ms /I\bc\include /I\sa\include pack.c
tlink \bc\lib\c0s pack,pack,,\sa\lib\_sas \bc\lib\cs