to range selectivity queries, the V-optimal histograms capture important features of the data in a concise way [PIHS96].
To handle many base tables and many types of queries, a large number of synopses may be needed. Moreover, for fast response times that avoid disk access altogether, synopses that are frequently used to respond to queries should be memory-resident.² Thus we evaluate the effectiveness of a synopsis as a function of its footprint, i.e., the number of memory words to store the synopsis. For example, it is common practice to evaluate the effectiveness of a histogram in estimating range selectivities as a function of the histogram footprint (i.e., the number of histogram buckets and the storage requirement for each bucket). Although machines with large main memories are becoming increasingly commonplace, this memory remains a precious resource, as it is needed for query-processing working space (e.g., building hash tables for hash joins) and for caching disk blocks. Moreover, small footprints are more likely to lead to effective use of the processor's L1 and/or L2 cache; e.g., a synopsis that fits entirely in the processor's cache enables even faster response times.
The effectiveness of a synopsis can be measured by the accuracy of the answers it provides and by its response time. In order to keep a synopsis up-to-date, updates to the data warehouse must be propagated to the synopsis, as discussed above. Thus the final metric is the update time.
1.1 Concise samples and counting samples
This paper introduces two new sampling-based summary statistics, concise samples and counting samples, and presents new techniques for their fast incremental maintenance regardless of the data distribution.

Consider the class of queries that ask for the frequently occurring values for an attribute in a relation of size n. One possible synopsis data structure is the set of attribute values in a uniform random sample of the tuples in the relation: any value occurring frequently in the sample is returned in response to the query. However, note that any value occurring frequently in the sample is a wasteful use of the available space. We can represent k copies of the same value v as the pair (v, k), thereby freeing up space for k - 2 additional sample points.³ This simple observation leads to our first new sampling-based synopsis data structure:

Definition 1 A concise sample is a uniform random sample of the data set such that values appearing more than once in the sample are represented as a value and a count.

While using (value, count) pairs is common practice in various contexts, we apply it in the context of random samples, such that a concise sample of sample-size m will refer to a sample of m' ≥ m sample points whose concise representation (i.e., footprint) is size m. This simple idea is quite powerful and, to the best of our knowledge, has never before been studied.
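To make the footprint accounting concrete, the following is a minimal Python sketch of the representation in Definition 1, under footnote 3's convention that each value and each count occupies one word; the class and the skewed example data are ours, for illustration only.

```python
import random
from collections import Counter

class ConciseSample:
    """A uniform random sample stored concisely: values occurring once
    are kept as singletons; repeated values as (value, count) pairs."""

    def __init__(self, sample_points):
        counts = Counter(sample_points)
        self.singletons = [v for v, c in counts.items() if c == 1]
        self.pairs = {v: c for v, c in counts.items() if c > 1}

    def footprint(self):
        # One word per singleton, two words per (value, count) pair.
        return len(self.singletons) + 2 * len(self.pairs)

    def sample_size(self):
        # m': the number of sample points actually represented.
        return len(self.singletons) + sum(self.pairs.values())

# On skewed data, many sample points collapse into a few pairs,
# so m' can far exceed the footprint m.
data = [1] * 900 + list(range(2, 102))
sample = ConciseSample(random.sample(data, 200))
print(sample.sample_size(), "points in", sample.footprint(), "words")
```

On the skewed data above, roughly 180 of the 200 sample points are the value 1 and collapse into a single pair, so the 200-point sample occupies only about two dozen words.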
Concise samples are never worse than traditional samples and, depending on the data distribution, can be better by an exponential or greater factor. We quantify their advantages over traditional samples in terms of the number of additional sample points for the same footprint, and hence in providing more accurate query answers.

Since the number of sample points provided by a concise sample depends on the data distribution, the problem of maintaining a concise sample as new data arrives is more difficult than with ordinary samples. We present a fast algorithm for maintaining a concise sample within a given footprint bound, as new data is inserted into the data warehouse.

Counting samples are a variation on concise samples in which the counts are used to keep track of all occurrences of a value inserted into the relation since the value was selected for the sample. We discuss their relative merits as compared with concise samples, and present a fast algorithm for maintaining counting samples under insertions and deletions to the data warehouse.
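This excerpt defers the maintenance algorithms to later sections, but the defining property of counting samples (a count tracks every occurrence of a value since the value was selected) already suggests the shape of the update path. The sketch below is ours: the entry threshold tau, the probability-1/tau admission test, and the shrink policy are assumptions for illustration, not details fixed by the text above.

```python
import random

class CountingSample:
    """Sketch: each count tracks all occurrences of a value inserted
    since the value was selected for the sample."""

    def __init__(self, footprint_bound, tau=1.0):
        self.bound = footprint_bound   # max words for (value, count) pairs
        self.tau = tau                 # assumed admission threshold
        self.counts = {}               # value -> occurrences since selection

    def insert(self, v):
        if v in self.counts:
            self.counts[v] += 1                # already selected: just count
        elif random.random() < 1.0 / self.tau:
            self.counts[v] = 1                 # newly selected for the sample
            if 2 * len(self.counts) > self.bound:
                self._shrink()

    def delete(self, v):
        # One plausible policy: decrement, dropping values that reach zero.
        if v in self.counts:
            self.counts[v] -= 1
            if self.counts[v] == 0:
                del self.counts[v]

    def _shrink(self):
        # Assumed policy: double tau and re-toss count-1 entries so the
        # sample stays consistent with the higher threshold.
        self.tau *= 2
        for v in [u for u, c in self.counts.items() if c == 1]:
            if random.random() < 0.5:
                del self.counts[v]
```

Note that feeding a stream of inserts through insert() also computes the synopsis from scratch in one pass over the data, in line with the observation below.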
In most uses of random samples in estimation, whenever a sample of size n is needed it is extracted from the base data: either the entire relation is scanned to extract the sample, or n random disk blocks must be read (since tuples in a disk block may be highly correlated). With our approximate query set-up, as in [GMP97b], we maintain a random sample at all times. As argued in [GMP97b], maintaining a random sample allows for the sample to be packed into consecutive disk blocks or consecutive pages of memory. Moreover, for each tuple in the sample, only the attribute(s) of interest are retained, for an even smaller footprint and faster retrieval. Sampling-based estimation has been shown to be quite useful in the context of query processing and optimization (see, e.g., Chapter 9 in [BDF+97]). The accuracy of sampling-based estimation improves with the size of the sample. Since both concise and counting samples provide more sample points for the same footprint, they provide more accurate estimations.
Note that any algorithm for maintaining a synopsis in the presence of inserts without accessing the base data can also be used to compute the synopsis from scratch in one pass over the data, in limited memory.
²Various synopses can be swapped in and out of memory as needed. For persistence and recovery, combinations of snapshots and/or logs can be stored on disk; alternatively, the synopsis can often be recomputed in one pass over the base data. Such discussions are beyond the scope of this paper.
³We assume throughout this paper that values and counts use one "word" of memory each. In general, variable-length encoding could be used for the counts, so that only ⌈lg s⌉ bits are needed to store s as a count; this reduces the footprint but complicates the memory management.
1.2 Hot list queries
We consider an application of concise and counting samples to the problem of providing fast (approximate) answers to hot list queries. Specifically, we provide, to a certain accuracy, an ordered set of (value, count) pairs for the most frequently occurring "values" in a data set, in a potentially orders-of-magnitude smaller footprint than needed to maintain the counts for all values. An example hot list is the top selling items in a database of sales transactions. In various contexts, hot lists of m pairs are denoted as high-biased histograms [IC93] of m + 1 buckets, the first m mode statistics, or the m largest itemsets [AS94]. Hot lists can be maintained on singleton values, pairs of values, triples, etc.; e.g., they can be maintained on k-itemsets for any specified k, and used to produce association rules [AS94, BMUT97]. Hot lists capture the most skewed (i.e., popular) values in a relation, and hence have been shown to be quite useful for estimating predicate selectivities and join sizes (see [Ioa93, IC93, IP95]). In a mapping of values to parallel processors or disks, the most skewed values limit the number of processors or disks for which good load balance can be obtained. Hot lists are also quite useful in data mining contexts for real-time fraud detection in telecommunications traffic [Pre97], and in fact an early version of our algorithm described below has been in use in such contexts for over a year.
Note that the difficulty in incremental maintenance of hot lists is in detecting when itemsets that were small become large due to a shift in the distribution of the newer data. Such detection is difficult since no information is maintained on small itemsets, in order to remain within the footprint bound, and we do not access the base data.
Our solution can be viewed as using a probabilistic counting scheme to identify newly-popular itemsets: if τ is the estimated count of the smallest itemset in the hot list, then we add each new item with probability 1/τ. Thus, although we cannot afford to maintain counts that will detect when a newly-popular item-
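As a concrete reading of the probability-1/τ rule just described, here is a short Python sketch of a footprint-bounded hot list over singleton values; the initial count of 1 for a newly admitted value and the evict-smallest overflow policy are our assumptions, not details given in this excerpt.

```python
import random

class HotList:
    """Sketch of an approximate hot list of at most m (value, count) pairs."""

    def __init__(self, m):
        self.m = m
        self.counts = {}   # candidate (value, count) pairs

    def _tau(self):
        # tau: estimated count of the smallest itemset in the hot list.
        return min(self.counts.values()) if len(self.counts) >= self.m else 1

    def insert(self, v):
        if v in self.counts:
            self.counts[v] += 1
        elif random.random() < 1.0 / self._tau():
            self.counts[v] = 1     # assumed initial count for a new entry
            if len(self.counts) > self.m:
                # Assumed eviction: drop one smallest-count entry.
                del self.counts[min(self.counts, key=self.counts.get)]

    def top(self, k):
        return sorted(self.counts.items(), key=lambda p: -p[1])[:k]

# Usage: feed a stream of values, then read off the approximate hot list.
hl = HotList(m=10)
for v in (random.choice("aaaabbbccdefghij") for _ in range(10_000)):
    hl.insert(v)
print(hl.top(3))
```

Because every new occurrence of an untracked value is admitted with probability 1/τ, a value that becomes newly popular enters the hot list after roughly τ of its occurrences in expectation, which is how the scheme can detect shifts in the distribution without maintaining counts for all values.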