(Adapted from article 441158, published in Research Disclosure number 441 (January 2001), page 164.)

## Method for paring archived data to allow for smooth depreciation

Timothy D. Greer

Disclosed is a method for selecting time-sequenced archive data for replacement with newer data. The method allows for discounting in the same sense as financial depreciation, in order to favor more recent data, while systematically saving diminishing samples of older data. Time resolution of the data points is reduced according to the age of the data. The resulting non-uniform time resolution allows the archive to span more total time, without sacrificing high-resolution coverage of recent results.

This disclosure is concerned with time-sequenced archive data: data that becomes available at regular intervals and is saved for an extended time, considerably longer than the availability interval. For example, a test might be run daily and the results saved for years. Unless storage is unlimited, it will eventually be necessary to delete some old results to make room for newer results. Deletion may also be desirable to improve searching efficiency or for other reasons.

The question, then, is which data to delete. A reasonable approach is to delete the oldest data -- first in, first out. However, for some data this is not the best approach. While the most recent data is almost always considered the most important, it may be better to have access to data spanning a longer period of time than to have the same amount of data covering a shorter period in more detail. For example, archived daily test data might be examined to determine when certain tests began to fail. If the tests began to fail longer ago than the span of the archive, it will only be possible to place a lower bound on the time since the last successful run (namely, a lower bound equal to the span of the archive). If instead some data of intermediate age had been discarded to make room for new data, the time span covered by the archive could have been much longer. This improves the likelihood that both an upper and a lower bound can be placed on the time since the last success; failing that, at least the lower bound would be larger.

To extend the age of the oldest archive data without using more storage, time resolution must be traded for time span. For example, one might save only test results from every third day. But generally it will be desirable to keep as much resolution as possible in the recent data. As the data gets older, lower time resolution becomes appropriate. So some algorithm is needed to choose which data to discard, and not always the oldest.

Let:

- N = total number of archived records to be retained
- P0 = number of records to retain over the maximum-resolution span (the archive is to always have the P0 most recent records)

Partition N into subsizes P0, P1, P2, ... so that P0 + P1 + P2 + ... = N. The sizes of the various Pi reflect the value placed on resolution at different ages; generally one would expect N > P0 >= P1 >= P2 >= .... When the archive records are sorted in time order, youngest to oldest, they are partitioned into groups: the first P0 records go in the first group, the next P1 in the next group, and so on. When a record must be deleted to make room for the next incoming record, the group to delete from is chosen such that group Pi is selected Pi/N of the time. The oldest record from that group is then deleted.
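The deletion procedure above can be sketched in a few lines of Python. This is an illustrative implementation, not from the disclosure: the class name, the use of weighted random group selection, and the list representation (youngest record first) are all assumptions.

```python
import random

class ParedArchive:
    """Sketch of the disclosed paring method: keep the P0 most recent
    records at full resolution and progressively thin older records.
    Names and structure are illustrative, not from the disclosure."""

    def __init__(self, group_sizes, rng=None):
        self.sizes = list(group_sizes)    # P0, P1, P2, ...
        self.capacity = sum(self.sizes)   # N
        self.records = []                 # youngest record at index 0
        self.rng = rng or random.Random()

    def add(self, record):
        self.records.insert(0, record)    # newest record enters at the front
        if len(self.records) > self.capacity:
            # Choose group i with probability P_i / N ...
            i = self.rng.choices(range(len(self.sizes)), weights=self.sizes)[0]
            # ... and delete the oldest record in that group. After the
            # insert shifted everything by one, that record sits at index
            # P0 + ... + Pi in the (N+1)-element list.
            boundary = sum(self.sizes[: i + 1])
            del self.records[boundary]
```

Note that when group 0 is selected, the record deleted is the one that just aged out of the top-P0 span, so the P0 most recent records are always retained, as the method requires.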

How should the Pi numbers be chosen? Note that P0=N reduces to the first-in-first-out scheme. Another reasonable way to choose would be to keep the proportionality of the succeeding groups constant, e.g. P1/P0 = P2/P1 = P3/P2 = ... (Obviously this can only be achieved exactly for certain values of N and P0, but it can easily be approximated.) Likewise keeping the selection frequency of the Pi group to Pi/N can only be approximated over a short period of time, but a reasonable pattern of length N can be chosen. In addition, random selection with probability Pi/N may be suitable or even desirable in some circumstances.
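The constant-proportionality choice can only be hit exactly for certain N and P0, as noted above, but it is easy to approximate. The helper below is a hypothetical sketch (the function name and rounding strategy are assumptions): it generates group sizes with ratio Pi+1/Pi ≈ `ratio` and absorbs any rounding remainder into the last, lowest-resolution group.

```python
def geometric_groups(p0, ratio, total):
    """Approximate group sizes P0, P1, ... with a constant ratio
    P_{i+1}/P_i ~= ratio, summing to `total` (i.e. N).
    Illustrative helper, not part of the disclosure."""
    sizes = []
    p = float(p0)
    # Keep appending rounded sizes while they fit and remain at least 1.
    while round(p) >= 1 and sum(sizes) + round(p) <= total:
        sizes.append(round(p))
        p *= ratio
    if sizes:
        sizes[-1] += total - sum(sizes)   # absorb remainder in last group
    return sizes
```

For example, `geometric_groups(8, 0.5, 15)` yields the partition 8, 4, 2, 1 used in the example below.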

Example: N = 15, P0 = 8, P1 = 4, P2 = 2, P3 = 1.

Selection order: P0 P1 P0 P2 P0 P1 P0 P3 P0 P1 P0 P2 P0 P1 P0

Archiving of test data is an example of a case where a fixed elimination sequence is desirable. For an example where random selection is desirable, consider archiving of video surveillance for crime deterrence. If there is a fixed sequence which the criminal knows, the criminal might plan the crime so that it appears on a tape that will be quickly discarded.
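For power-of-two group sizes like those in the example, the fixed selection order coincides with the "ruler sequence": at step t, the group selected is the number of trailing zero bits in t. (This is an observation about the example, not a claim made in the disclosure.) A sketch:

```python
def selection_order(num_groups):
    """Deterministic selection pattern of length 2**num_groups - 1 for
    power-of-two group sizes P_i = 2**(num_groups - 1 - i). Step t
    selects the group equal to the count of trailing zero bits in t,
    so group i is selected exactly P_i times per cycle."""
    order = []
    for t in range(1, 2 ** num_groups):
        # t & -t isolates the lowest set bit; its bit_length minus one
        # is the number of trailing zeros.
        order.append((t & -t).bit_length() - 1)
    return order
```

`selection_order(4)` reproduces the pattern above: group 0 (P0) is chosen 8 times, group 1 four times, group 2 twice, and group 3 once per 15-step cycle.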

Besides proportionality (essentially a fractal view of the past), other schemes for choosing the size of each group may be appropriate. The group sizes reflect the value placed on the time resolution of archive data in the various age ranges. One could also vary the process by not adding data to the first group (P0) until it is beyond a certain age. (This is the same as increasing the size of P0, except that the proportionality rule P0/P1 = P1/P2, or whatever other rule decides the sizes of P1, P2, etc., would not apply.) Likewise, one might not immediately discard data points deleted from the last group; this would allow a low-resolution series of arbitrary length to be retained.

The indicated method generalizes readily. The low-resolution series retained from the last group can itself be pared using the method just described, yielding an even lower-resolution series of arbitrary length. Applied recursively, this extends the time horizon of the saved data ever farther, trading ever-lower resolution for span rather than retaining excess data.
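The recursive generalization can be pictured as a chain of pared archives, where a record deleted from the last group of one stage is handed to the next, lower-resolution stage instead of being discarded. The sketch below is self-contained and entirely illustrative; the function name, the stage representation, and the weighted random selection are assumptions, not from the disclosure.

```python
import random

def chained_add(stages, record, rng=random):
    """Cascade a new record through a chain of pared archives.
    Each stage is a dict with 'records' (youngest first) and 'sizes'
    (its P0, P1, ...). A record deleted from the LAST group of a stage
    is passed on to the next stage; deletions from earlier groups are
    ordinary thinning and are simply discarded."""
    for stage in stages:
        records, sizes = stage["records"], stage["sizes"]
        records.insert(0, record)
        if len(records) <= sum(sizes):
            return None                    # stage not full; nothing deleted
        i = rng.choices(range(len(sizes)), weights=sizes)[0]
        boundary = sum(sizes[: i + 1])     # oldest record of group i
        record = records.pop(boundary)
        if i < len(sizes) - 1:
            return None                    # thinning deletion: discard
        # else: fell off the last group; cascade into the next stage
    return record                          # fell off the end of the chain
```

Each successive stage sees records at ever-lower time resolution, so a short chain can cover an arbitrarily long history in bounded storage.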

**The information provided and views expressed on this site are my own and do not represent the IBM Corporation.**