An Introduction to Data Compression for IEs

                               
Timothy D. Greer

When you back up data on your computer using commercially available back-up programs, the program usually includes a feature for "compressing" your data. [ 1 ] Have you ever wondered how this works? Data compression is a feature that should interest an Industrial Engineer; it improves the efficiency of the back-up. Compression also improves efficiency in areas unrelated to back-up, such as data transmission. [ 2 ] Techniques have evolved along with the continued development of electronics and digital computing, so that today specific methods may be mathematically complex and not especially accessible to the layman. But the purpose and general principles are straightforward and appropriate for any engineer to understand.

How is it possible to take any collection of bits you may have on your computer, manipulate them in some way, and come up with another, smaller set of bits which can be un-compressed back into the original? This is essentially what compression programs seem to claim to do. "Give me your data, and I'll squish it down to half (or 1/4, or 1/10) the number of bits. Give me those bits back later and, knowing nothing else about where they came from, I'll unfold them back into the original data!" Stated that way, this is of course impossible. The key to compression is that we generally don't have just "any collection of bits". Most data has lots of internal correlation, and compression programs try to transform the data so that the correlation is eliminated. Data files will often be mostly zeros, for instance. An easy way to literally see this is to think of what a fax machine needs to do. Here is a sheet of paper it needs to transmit. The paper is almost completely white, but has some (very important!) black lines here and there. The fax machine could divide the paper up into many tiny squares and send a 1 for each black square and a 0 for each white square. But this requires a bit for each square. The machine can instead count the number of consecutive white squares, then the number of consecutive black squares, and so on, and transmit these counts. This is an example of run-length coding, and for the right kind of data it results in a substantial reduction in the number of bits required. Compression ratios up to about 5:1 have been reported for facsimile systems. [ 3 ]
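To make the idea concrete, here is a minimal sketch of run-length coding in Python. The pixel row is made up for illustration, and real facsimile standards use more elaborate codes built on the same principle.

    def run_lengths(row):
        # Encode a row of 0 (white) / 1 (black) pixels as alternating run counts.
        # The first count is always the number of leading white pixels (possibly
        # zero), so the decoder knows which color each count refers to.
        counts = []
        current, run = 0, 0
        for pixel in row:
            if pixel == current:
                run += 1
            else:
                counts.append(run)
                current, run = pixel, 1
        counts.append(run)
        return counts

    # A mostly-white scan line: 30 pixels collapse to 5 run counts.
    row = [0]*12 + [1]*3 + [0]*9 + [1]*2 + [0]*4
    print(run_lengths(row))   # prints [12, 3, 9, 2, 4]

Decoding simply reverses the process: read each count and write out that many white or black squares in turn.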

Of course in real life things aren't quite so simple -- the paper may be a little dirty, so it isn't an obvious decision whether a square should be black or white; the operator may feed the paper in sideways, so that instead of nice, long runs of white squares we get many more, shorter runs; and so on. And often we want to transmit higher-quality pictures than fax machines produce. For pictures, a great deal of experimentation has been done to discover what features are important to humans, and what sorts of correlation exist in typical images. [ 4 ]

For data already in a computer, the field isn't quite so wide open to experimentation and discovery. The format is, after all, man-made. But there is still lots of correlation to take advantage of. Compiled programs will contain certain common structures. Uncompiled programs will have lots of common strings. Text is represented in ASCII with 7-bit characters, but almost all the characters we ever use can be represented with five bits. (Five bits gives you 2**5 = 32 unique characters. You could use 26 for a-z and the other six to signal special cases like capitalization, numbers, commas, periods, etc.) More complex schemes can take advantage of less obvious features of ordinary text. Every cryptographer knows, for instance, not only that "e" appears 65 times as frequently as "q" in English [ 5 ], but also that combinations such as "ch" and "ng" are common, while "vv" and "qp" are pretty unlikely to appear. The number of bits used to encode letters, digraphs, and even whole words can be weighted according to their probability of use. This is not a technique that had to await the computer age, either. Morse code is an example of its application.
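A minimal sketch of such a probability-weighted code appears below, again in Python. The article does not commit to any particular method; this example uses Huffman coding, a standard technique that gives frequent characters short bit strings and rare characters long ones, and the sample sentence is invented for illustration.

    import heapq
    from collections import Counter

    def huffman_code(text):
        # Build a prefix code in which frequent characters get short codewords.
        freq = Counter(text)
        # Heap entries are (frequency, tie-breaker, tree); a tree is either a
        # single character or a pair of subtrees.
        heap = [(n, i, ch) for i, (ch, n) in enumerate(freq.items())]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            f1, _, t1 = heapq.heappop(heap)     # merge the two rarest trees
            f2, _, t2 = heapq.heappop(heap)
            heapq.heappush(heap, (f1 + f2, next_id, (t1, t2)))
            next_id += 1
        codes = {}
        def walk(tree, prefix=""):
            if isinstance(tree, tuple):
                walk(tree[0], prefix + "0")
                walk(tree[1], prefix + "1")
            else:
                codes[tree] = prefix or "0"     # lone-symbol edge case
        walk(heap[0][2])
        return codes

    text = "the quick brown fox jumps over the lazy dog"
    codes = huffman_code(text)
    coded_bits = sum(len(codes[ch]) for ch in text)
    print(coded_bits, "bits, versus", 7 * len(text), "bits as 7-bit ASCII")

The savings come entirely from the skewed frequencies; if every character were equally likely, such a code would do no better than a fixed-length one.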

What good is all this to an IE? First, it is important to realize that data compression exists and works. It won't take 50 1-MB diskettes to back up your full 50-MB hard drive. If you have an application transmitting data over a line that is at capacity, you may be able to compress your data rather than add a line to expand capacity. If you are evaluating competing communications or storage technologies and only one incorporates data compression, that feature needs to be given appropriate weight. A current version of IBM's Virtual Telecommunications Access Method (VTAM 3.4.1) includes a data compression feature. [ 6 ]

Next, it is useful to know something about the compression techniques used. At least consider the type of data the system is designed and rated for. Run-length encoding would be ridiculous to apply to data already in text format, for instance. How many long strings of "e", or any other letter, do you see in this article? On the other hand, if your "text" is screenfuls of data for a computer terminal, a modified run-length code may be just the ticket, because there will be long strings of blanks included in the data.
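A small, self-contained Python comparison (both strings are invented for illustration) shows why: ordinary prose breaks into almost as many runs as it has characters, while a blank-padded terminal row collapses into just a few.

    from itertools import groupby

    def runs(s):
        # Group consecutive identical characters; each group is one run.
        return [(ch, len(list(group))) for ch, group in groupby(s)]

    prose = "How many long strings of any letter do you see in this article?"
    screen_row = "TOTAL" + " " * 60 + "42" + " " * 13   # an 80-column terminal line

    print(len(runs(prose)), "runs in", len(prose), "characters of prose")
    print(len(runs(screen_row)), "runs in", len(screen_row), "characters of screen data")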

The most important division of compression types is between "lossless" and "lossy". Lossless types guarantee you that after compression and uncompression, the original data will be restored unaltered. Lossy compression will probably result in an imperfect restoration of the original data. Lossless compression is appropriate for things like computer programs and data, where one misplaced bit could create havoc. Lossy compression is appropriate for things like television or fax images, where a few misplaced specks will probably go unnoticed. Lossy compression can guarantee you a certain percentage reduction in the storage or transmission bandwidth required. Lossless compression cannot guarantee successful compression at all -- the data may already be in as small a form as the compression technique could produce. Compression techniques of both types encompass a wide range of speeds and memory requirements. [ 7 ]
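The difference in guarantees is easy to demonstrate. In the Python sketch below (the scheme and the numbers are illustrative, not drawn from any particular product), a lossy step simply discards the low-order bits of 8-bit samples, so exactly half the bits survive no matter what the data looks like, while a lossless run-length code applied to the same patternless data produces more output than input.

    import random

    samples = [random.randrange(256) for _ in range(1000)]   # 8-bit values, no structure

    # Lossy: keep only the top 4 bits of each sample.  The 2:1 reduction is
    # guaranteed, but the discarded detail can never be recovered.
    lossy = [s >> 4 for s in samples]
    print("lossy:", 4 * len(lossy), "bits instead of", 8 * len(samples))

    # Lossless: run-length code the same data.  Random bytes rarely repeat, so
    # nearly every "run" has length 1 and the encoding is bigger than the original.
    pairs = []
    for s in samples:
        if pairs and pairs[-1][0] == s:
            pairs[-1][1] += 1
        else:
            pairs.append([s, 1])
    print("lossless RLE:", len(pairs), "value/count pairs for", len(samples), "values")

On correlated data the roles reverse: the run-length code shrinks the file and loses nothing, which is why matching the technique to the data matters.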

Finally, just as it is good for engineers to know a little about electricity and fluid mechanics even when those subjects lie well outside their usual field of practice, recognizing data compression's spot in the technical milieu is worthwhile. A Human Factors engineer may find it interesting to know that ordinary pictures can have 90% of their intrinsic information removed and still be nearly indistinguishable from the original, even by experts. [ 8 ] An efficiency expert might like to know why two different back-up programs take different numbers of diskettes. You may even find you need to learn more -- about error correction, image quality, analog signal transmission, and diskette formats, to name a few related areas left untouched in this essay.


References

  1. Fogle, David. "Backup Software", PC World Vol. 10, No. 7 (July 1992) pp. 169-180.

  2. Jurgen, Ronald K. "Digital Video", IEEE Spectrum Vol. 29, No. 3 (March 1992) pp. 24-30.

  3. Pratt, William K. Digital Image Processing, J. Wiley & Sons, New York, 1978, page 635.

  4. Ibid., pp. 162-198.

  5. Denning, Dorothy. Cryptography and Data Security, Addison-Wesley, Reading, Mass. 1983, page 65.

  6. SNA Formats, IBM Corp. publication GA27-3136-12, 1992, pp. 4-9 through 4-12.

  7. Gonzalez, Rafael and Wintz, Paul. Digital Image Processing, 2nd edition, Addison-Wesley, Reading, Mass. 1987, pp. 255-329.

  8. Pratt, page 729.
