Adaptive Predictor for Lossless Image Compression

Václav Hlavác, Jaroslav Fojtí k
Center for Machine Perception
Faculty of Electrical Engineering, Czech Technical University
121 35 Prague 2, Karlovo námestí 13, Czech Republic
keywords: image compression, lossless compression,
predictor-based compression


A new method for lossless image compression of grey-level images is proposed. The image is treated as a set of stacked bit planes. The compressed version of the image is represented by encoded residuals of a non-linear local predictor spanning the current bit plane as well as a few neighbouring ones. Predictor configurations are grouped in pairs differing in one bit of the representative point only. The frequency of predictor configurations is obtained from the input image. The predictor adapts automatically to the image, it is able to estimate the influence of neighbouring cells and thus copes even with complicated structure or fine texture.

The residuals between the original and the predicted image are those that correspond to the less frequent predictor configurations. Efficiently coded residuals constitute the output image. To our knowledge, the performance of the proposed compression algorithm is comparable to the current state of the art. Especially good results were obtained for binary images, grey-level cartoons and man-made drawings.

1  Introduction

Data compression can be viewed as a process producing a model of the input data together with associated residuals. An improved data model yields better compression results.

Redundant information in generic raster images is due to high correlation between neighbouring cells1. In other words, the local structure and similarity are the main source of redundancy in generic imagery. The proposed image compression method examines the neighbourhood of the current cell using an adaptive probe, creates a non-linear predictor for the current cell value, and stores residuals of the predictor in the resulting compressed image.

In the compression phase, an image is systematically traversed using an adaptive probe and, for the current cell, a non-linear predictor is created by examining the local neighbourhood. The estimated value is compared to the actual value of the cell. Where there is a discrepancy between the estimated and the actual value the residual (binary value one) is stored in the position of the current cell. If the predictor provides a good estimate, a precise match will be found for the majority of the cells and the resulting compressed image will be small. The description of the predictor and the positions of the residuals in the image are stored together in the output file. Three predictors with increasing complexity are described in this paper, i.e. (i) for binary images, (ii) for grey-level images, (iii) adaptive predictor for grey-level image.

The reduced data is optimally coded, and together with the predictor description, the compressed image is created.

The paper is organized as follows. Section  lists the current most efficient lossless compression methods and these are later used for comparison with the proposed methods. Section  explains the basic predictor as a set of probe pairs that differ in one bit only. Section  generalizes the predictor for the grey-level image that is treated as stacked bit planes. Section  further improves the predictor by allowing it to adapt to the actual image data. Section  proposes an efficient way to encode the sparse matrix of predictor residuals. In Section  we present experimental results. In the last section we draw conclusions.

A preliminary version of the paper was published in [].

2  Related works

There are several lossless compression methods used for 1D streams of data as text or executable programs. The ZIP or LZW are examples of commonly used packages. They are known and we do not describe them here. Let us begin with describing some methods similar to ours developed for 2D compression. Lossless compression techniques are usually divided into two generations. The methods in the first generation use quite simple algorithms. Let us just mention three representatives here.

The IBM Q80 coder (lossless JPEG) [] is obsolete and outperformed by the conceptually similar Portable Network Graphics (PNG).

PNG [] consists of two steps. In the first step, the difference image Dx (x,y) = f(x-1,y)-f(x,y) is computed and a matrix which contains only differences is created. In the second step, this matrix is encoded using the LZW method. Both parts may be connected by a pipe and these can run in parallel. The source code of the algorithm is freely available. The disadvantage is that the second step is not tuned to 2D image data. PNG is based on the assumption that the number of regions with limited 1st derivatives is smaller than the number of regions with higher 1st derivatives. The context of images is usually richer.

Paper [] discusses several simple compression methods PVRG-JPEG proposed by the Portable Video Research Group at Stanford. Each method was tried on the images and the best method was selected for actual compression.

The second generation lossless compression methods are usually divided into two subgroups:

Region based.
The image is divided into regions that have similar intensity or other properties. Regions are compressed separately and stored together with a description of the boundary of the region. The method Segment [] belongs to this group. The regions are grown from seeds. The resulting compression ratios are very good.
Data prediction based
methods try to create an internal data model. Residuals of the model are stored. A method which is performing similarly to ours that has similarly is called Callic []. It calculates several features from the pixels in a small neighbourhood. The pixels in the predictor are chosen adaptively using information from image directional derivatives.

The rounding transform from [] uses circular-shaped hierarchical estimators. The compression ratio does not rank the method among the best.

There is one more method which is relevant to ours. It treats the image as separate bit planes and is called the Logic coding method []. The method uses boolean logic for image compression.

3  Idea of image coding for binary image

The initial idea comes from Schlesinger [] who proposed a new representation scheme for binary images. Let us denote the method by Schl-2 . Number 2 comes from two degrees of freedom (DOF).

The Schl-2 method takes a binary image f as input. The support of the image is T={(x,y): 0 x M; 0 y N }, where M is the width of the image and N is the height. The image represents the mapping f(x,y) {0,1}. Without loss of generality, we assume that the leftmost column and the bottom line of the image have value zero, i.e. f(0,y) = f(x,0) = 0.

The Schl-2 method traverses the binary image f by the probe consisting of 2×2 cells. Its representative cell (x,y) is placed in the upper right corner of the probe. The value of the actual (estimated) cell is p4 = f(x,y). Let values of the probe placed in the particular position in the probe be denoted p1 = f(x,y-1); p2 = f(x-1,y-1); p3 = f(x-1,y); see Fig. .

Schlesinger designed the predictor (estimator, probe) e in the following way. All the 16 possible combinations of cells within the 2 ×2 probe are considered.

Probes are arranged into two rows and eight columns in such a way that probes in the same column differ in the upper right bit only, see Fig. . A frequency analysis of the predictor configurations was performed on many different images with the aim of finding out which configuration in each pairs occurs more often. The more frequent configurations are highlighted in Fig. . The probes in column 5 will be discussed soon.

The idea of this compression method is to store residuals of less frequent probes, i.e. value one in the representative point of the probe. The arrangement of the probes allows us to reconstruct the image from a sparse matrix of stored residuals only.

Schlesinger tested the predictor on many images and the more frequent probe of each pair was found through extensive experiments. In the compression phase, only the residuals between predicted and real values are stored. This typically yields a much sparser image.

Actually, the probe configurations proposed by Schlesinger are suboptimal with respect to the relative frequencies of different configurations. Probes in Fig. , column 5 would have to be highlighted in opposite manner if the probe statistics had to be optimal for binary images tested by Schlesinger. A substantial advantage of the suboptimal configurations independent of the image content is that standard set operations (intersection, union, negation) and binary morphology operations can be performed on the compressed images.

4  Compression of binary images

Our compression method for binary images abbreviated as FH-2 takes a step forward compared to Schlesinger's one. The probe frequency analysis is performed for each image. The result is then stored in a frequency table consisting of two rows and eight columns. A value in the table represents the number of occurrences of the corresponding probe in the image. The maxima in columns of the frequency table point to optimal predictor configurations. These are marked.

One bit is needed to distinguish between two probes in one column. Thus one byte describes all eight pairs of probes. This information will be added to the compressed image to store the predictor description.

In the compression phase , the image is traversed in a bottom-up manner and from right to left. In each position of the image, a check is made to determine whether a more frequent or less frequent probe appeared. In the former case, the value 0 is written to the representative cell. Whereas, in the latter case, the value one is stored (corresponding to the residual of the predictor). The algorithm traverses the whole image and produces an image of residuals. This reduced image is typically much sparser than the input image due to the estimation abilities of the predictor.

The decompression phase has at hand the description of the predictor (one byte) and the image of residuals. The left-most column and the bottom most row are known to be zero-valued. The probe starts at the bottom left corner with the values p1, p2, and p3 being known. These three values uniquely determine one of the eight columns of the predictor. The value of the stored residual distinguishes the correct probe in the pair. The decompressed value p4 is set accordingly and overwrites the residual in the position p4. The probe traverses the image row-wise from left to right and from bottom to top, until the original image is restored.

5  Compression of grey-level images

The grey-level image can be treated as stacked bit planes. The predictor described in the previous section was generalized to cope with the current bit plane and the closest bit plane upwards. The predictor context encompassed by the volumetric probe has three degrees of freedom (DOF). Therefore we abbreviate the proposed compression algorithm FH-3. The arrangement of cells in the probe is shown in Fig. . The processed bit plane is the lower one. Its representative point, i.e. the predicted cell, is denoted by p8 and is highlighted. The idea of the compression remains the same as in the binary case.

The frequency analysis of probe configurations is performed when passing through the image for the first time. The process is analog to the binary image case described in Section 3. Here 27 bits = 16 bytes are needed to store the frequency table of all possible configurations. The image of the processed bit-plane is reduced and residuals are stored in the same way as in the binary image case. This requires a second pass through the original image data. The result is a sparse matrix with a small number of residuals.

The original image can be uniquely reconstructed using the residuals and the description of the predictor stored in the compressed image. The compression process as described above starts from the top bit plane using the FH-2 algorithm (since there is no available data above the top plane). Then the FH-3 algorithm is applied to remaining bit planes.

More complicated probes spanning several bit planes can be also used, see Fig. . The rightmost probe is used for processing the top bit plane. The probe to the left is used for coding the bit plane below the top plane and so on.

6  Adaptive compression

The two previous predictors had a fixed number of probe cells (3 in the Schl-2 case and 7 in the FH-3 case) in a fixed arrangement. The desire is to exploit a more complex image context in the predictor. The support set of the predictor can vary in more dimensions according to the context of the image. Such a predictor can encode more complicated structures in the image, e.g. some fine texture.

The size of the predictor support is limited from a practical point of view by the size of the frequency table. The size of the frequency table grows exponentially and has 2k entries, where k is the number of cells of the predictor. The predicted representative cell of the probe is not counted. For example, for 20 cells, the frequency table size is 220 = 1Mbit. This would be too much. Of course, not all predictor cells contain the same amount of information. The trick we propose is to throw away some cells from a relatively large probe adaptively. This is done according to the content of the particular image. In the following, let us denote such an adaptive predictor by FH-Adapt in the sequel.

The construction of predictor probes is similar to the procedure described earlier. They have more cells, e.g. 20, and could span n-DOF (degrees of freedom). In the case of an RGB colour image, there are 3-DOF in the intensity image corresponding to one of the colour components. The two other DOF are added by the context in the same bit-plane in other two colour components. Probes are again arranged in pairs that differ only in the value of the actual (estimated) cell. This facilitates a unique reconstruction of the original image in the decompression phase.

The frequency analysis comes next. The result of the analysis is the frequency table with the size 2k+1, where k is the number of support cells of the predictor. Note that there is no need for the representative cell to be in the set of k elements in the probe. The frequency table is usually rather big. The aim is to select the subset of i cells, i < k, containing most of the information. This results in a reduction of the length of the frequency table to 2i.

The selection step does not need to pass through the original image another time and thus is independent of the size of the image. Predictor cells may be viewed as features in statistical pattern recognition task. The cells contribute to the estimate of the representative point of the probe. After the frequency analysis, it is known how often a particular predictor probe occurs in the image. Moreover, the cells can be ordered according to their importance for estimation of the representative cell. The proposed algorithm decouples the influences of individual cells from the frequency table.

Let us explain the decoupling in the specific case of a binary image. The aim is to reduce the number of probe cells by one to k = 3. All configurations of the probe were shown in Fig. . Assume for a moment that we would like to decouple the influence of cell p3 (the top left corner of the probe Fig. ). If this feature was omitted the probe pairs would merge as shown in Fig. . The frequency of the new probe configuration is the sum of the frequencies of its two constituting configurations, which differ only by the value of the removed cell.

An appropriate representation of the frequency table facilitates the use of very efficient bit operations. Recall that the representative cell has the highest index. Let the frequency table be stored in a vector q with 2k+1 elements. The first 2k elements of the vector q = (q0, ,qk) correspond to the frequencies of the top row of probe configurations.

Let eall be the number of occurrences of the less frequent configurations of the probes. Furthermore, let b be an index to the first element of the second half of the frequency table, b = 2k, and c the index of the removed cell. The value eall is calculated as

eall = b-1

i = 0 
(qi, qi+b)  .
Let ec denote a measure of the estimativness of a cell with index c.
ec = eall - 1

i = 0 
( qi + qi c,qi b + qi c b )  .

The factor 1/2 in front of the sum comes from the fact that the value within the sum is considered twice. The estimativness ec expresses the increase in the number of residuals when the cell c is omitted. The operator denotes logical eXclusive OR.

The cell with the lowest estimativness ec is omitted from the probe. This algorithm can be applied iteratively and in each iteration the cell corresponding to the minimal value ec is thrown away.

At last, we will define the termination criterion for removal of cells. The number of residuals eall increases monotonically if we throw away cells one by one. The number of entries in the frequency table decreases monotonically and thus its length decreases. We are trying to find the extremum of the difference between these two functions. If the increase of the length of the coded residuals (given by the value ec) is bigger than the decrease of the table length or the table is empty then the algorithm terminates.

7  Encoding residuals

To achieve high compression ratio the matrix of residuals should be efficiently coded. Consider Fig.  where a simple example of residuals is shown. The idea is to code residuals along line scans using distances between residuals. The counting starts from an hypothetical starting point positioned left of the first row. The counting finishes in the hypothetical end point positioned to the right of the last row. The distances l1, , lk obtained represent the residuals uniquely. Of course, the sum of all distances is constant, i.e.


i = 1 
li = M N + 1  .
This fact can be used for error checking, e.g. while loading stored data.

The question how to code distances l1, , lk efficiently is important. The traditional Huffman coding is not suitable because the maximal possible length lmax = M N+1 is too big. The time needed to compute the Huffman table is also important. The best algorithms can do it in O(n log(n)) time where the optimal complexity of sorting items is O(n log(n)) and the complexity of computing the Huffman table is O(n). The number O((M N+1) log(M N+1)) would be too high for all practical purposes.

It is desirable to find suitable code to represent big integer numbers. There is a need to encode large values very often. The Elias code [] inspired us. The Elias code is composed of two parts, the first uses unary and the second binary code. We propose a new code, called code with logarithmic growth , that was inspired by Elias' code. The length of this code grows logarithmically with the encoded number li.

We combined the Huffman coding and the prefix code with logarithmic growth for better compression. The combination of the Huffman code and the prefix code with logarithmic growth is illustrated in Fig. . The number K is the highest number coded by the Huffman code. If K 0 we chose a modified code with logarithmic growth. This code can be found in the column round(log2(K)).

There are three options, see Tab. : (a) K = 0, i.e. only the code with logarithmic growth is used (basic option); (b) K is fixed. Experiments have shown that the best values are typically K = 32 or K = 64; (c) K is set optimally according to the data to be compressed. This option is described next.


No. Binary (a) K = 0 (b) Fixed K (c) Optimal K
1 00 00 000 0000
2 01 01 001 0001
3 100 1000 010 0010
4 101 1001 011 0011
5 11000 1010 10000 0100
6 11001 1011 10001 0101
7 11010 110000 10010 0110
8 11011 110001 10011 0111
9 1110000 110010 10100 100000
10 1110001 110011 10101 100001
11 1110010 110100 10110 100010
12 1110011 110101 10111 100011
13 1110100 110110 1100000 100100
14 1110101 110111 1100001 100101
15 1110110 11100000 1100010 100110
16 1110111 11100001 1100011 100111
17 111100000 11100011 1100100 101000

Table 1: Code with logarithmic growth.

The length L1 of the Huffman code table is an increasing function

L1 = 4 + (n+1)  log2 S  [bits],
where S is the size of the Huffman code space (that corresponds to the longest coded symbol). Value (n+1) is the number of entries in the table, n entries are used for the numbers and 1 is used for the prefix. The log2 S term gives the length of the table entry in bits. The length of the coded residuals, (the small perturbations are caused mainly by the change of the exponent, e.g. from 23 to 24), decreases more or less monotonically. The difference between these two monotonic functions is the function we are looking for. The point of interest is the maximum of the function corresponding to the optimal of K.

The relationship between the number K and the length of encoded residuals is illustrated in Fig. . Eight curves, one for each bit plane of a test image, are shown.

8  Experiments

The goal of the experiments were to (1) test the properties of the proposed method, (2) compare it with the results of others, and (3) find possibilities of further improvements.

Let us illustrate the behaviour of the proposed FH-3 compression algorithm on the standard grey-level testing image Lena of size 256×256×8 bits, see Fig. . Selected bit-planes of the original and compressed image are shown in Fig. . Note that the compression is more effective on bit-planes corresponding to more significant bits as there are usually less changes there. Look how well the residuals express perceptually significant features in the image. The mentioned experiment is quantified in Tab. . Notice how the number of residuals increases from the most significant bit-plane to the least significant bit-plane. It is difficult to represent randomness by the estimator. The estimativness ec gives a clear insight into the behaviour of the predictor. Cell 6 and 4 from the neighbouring bit planes contain the highest amount of information.

Bit-plane bit 7 bit 6 bit 5 bit 4 bit 3 bit 2 bit 1 bit 0
FH-2 residuals 2010 5477 10177 15067 21865 28595 31771 32214
FH-3 residuals empty 3841 5935 9421 15045 21690 28622 31272
estimativness e1 empty 25 160 218 233 91 125 320
estimativness e2 empty 616 2165 3108 4930 4919 2521 457
estimativness e3 empty 52 180 219 233 91 128 352
estimativness e4 empty 1892 4101 5543 6611 6600 2813 447
estimativness e5 0 14 123 143 92 26 112 361
estimativness e6 1457 2097 3477 4169 5807 5031 2312 467
estimativness e7 0 23 118 136 170 85 71 220

Table 2: Quantitative description of the FH-2 and FH-3 methods.

The proposed adaptive compression method FH-Adapt was compared on several images from the standard JPEG image set []. The methods used for comparison were: Segment from [], Portable Video Research Group at Stanford PVRG-JPEG from [], Rounding transform from [], and Logic coding from []. Results were taken directly from the papers. The only implementation that was available was that of Callic [].

The quantitative measure used for comparison of the different methods is the compression efficiency CE []

CE = total number of input bytes - totalnumber of outputbytes
total number of input bytes
·100 %

The value CE in Tab.  shows how our method performs in comparison with others. The FH-3 method was outperformed only by Segment and PNG in the case of an image of a monkey (Baboon). The reason is that the predictor cannot cope with the complicated structure of the monkey's hair. Segment performs also better on the Lena image. Unfortunately we were not able to experiment with Segment because the code was not available. The PNG compression is very similar to the IBM Q80 coder []. IBM Q80 is obsolete and the implementation was not accessible.

Method/Image Lena Baboon Boats
PVRG-JPEG 22.7-29.4 6.7-13.7 24.9-33.1
Logic 26.1 12.6 29.3
Segment 54.6 36.3 empty
Rounding Tr. 37.6 empty empty
PNG 20.7 41.7 33.7
FH-3 34.0 19.1 41.2

Table 3: Compression coefficients compared with results of others

The Tab.  shows the comparison of our method FH- Adapt with other common methods on the Lena image and on a technical drawing image, see Fig. . The latter image is perfectly suited for our compression method as can be seen from the results.

Type of data Lena [Bytes] CE Lena [%] Drawing [bytes] CE Drawing [%]
Uncompressed 262656 0 106940 0
ZIP 225007 15.62 10224 90.44
ARJ 228389 14.35 10380 90.29
PCX 274256 -2.85 38624 63.88
GIF 278771 -4.54 17290 83.83
FH-Adapt. 175434 34.21 7297 93.18

Table 4: Comparison with typical compression methods.

Tab.  illustrates the influence of three residual coding methods on three different predictors. This gives us an understanding what improvement the individual method brings. The values in the table are the lengths of the compressed 512 × 512 image Lena in bytes.

Method/Coding Exp. code only Fixed Huffman Moving Huffman
FH-2 203000 200758 200599
FH-3 178655 176021 175922
FH-Adapt178130 175541 175432

Table 5: Comparison of three proposed predictors and three types of coding techniques. Values are given in bytes.

Tests were performed on an Intel Pentium 120MHz computer. The computation time was obtained using a RAM disk to avoid randomness in accessing the disk. The compression of the image Lena (512 × 512 × 8 Bits) takes 6.2 seconds using FH-3 method. The decompression is approximately twice as fast, i.e. in the Lena case it runs 2.7 seconds. Loading/Storing a data without compression takes 0.1 seconds.

9  Conclusions

We have proposed a new lossless image compression method that in many respects outperforms other methods considered as state of the art. The basic idea of the method is to predict differences in one bit only.

The advantages are:

  1. The performance in terms of compression efficiency is similar or better compared to the best methods used. Especially good results were obtained for binary images, grey-level cartoons and man-made drawings.
  2. The quick implementation is possible since the operations used are very simple, e.g. SHL, XOR, ADD, and the selection from a table.

  3. Substantial parts of the compression and decompression phases can run in parallel.

  4. The predictor that constitutes the core of the method can be extended to handle more degrees of the freedom, e.g. to colour images or image sequences.

  5. When the resolution of the image increases, the compression ratio does not degrade. This happens in the case of pyramidal algorithms, e.g. MLP (Multi Level Progressive) [].

The disadvantages of the method are:

  1. The compression algorithm needs at least two passes through the image data. When the Huffman coding is used to encode the residuals from the predictor an additional pass is needed.
  2. The decompression algorithm is very sensitive to errors in the compressed data. Of course, this disadvantage holds for most compression methods.

We encourage the reader to test the method, see the www page

10  Acknowledgements

We are grateful to M. I. Schlesinger from Institute of Cybernetics, Ukrainian Academy of Sciences Kiev, Ukraine. His series of lectures held in Prague in summer semester 1996 initiated research reported in this article. J. Matas suggested us a few improvements and K. Jonsson carefully read the manuscript.

This research was supported by the Czech Ministry of Education grant VS96049, the Grant Agency of the Czech Republic, grants 102/97/0480, 102/97/0855 and the European Union grant Copernicus CP941068.

Figure 1: 2 ×2 probe of the predictor.

Figure 2: Predictor used for binary image coding. All possible configurations are arranged into eight pairs differing in the upper right bit only. Highlighted configurations are treated as less frequent.



Figure 3: FH-3 predictor.

Figure 4: Modified probes spanning several bit planes.



Figure 5: Lengths to be coded.



Figure 6: K selects the length of the fixed part of the code.

Figure 7: Changes of the overall size in dependence of a K value.

Figure 8: Extracted cell p3.

Figure 9: Lena - grey level test image.


Figure 10: Comparison between bit-planes of the original image and bit-planes storing residuals.

Figure 11: Image of the scanned technical drawing.


1 The term cell is used instead of pixel common for 2D images as it copes with higher dimensions too.

File translated from TEX by TTH, version 2.33.
On 5 Aug 1999, 13:50.