New Sampling-Based Summary Statistics
for Improving Approximate Query Answers
Phillip B. Gibbons
Information Sciences Research Center
Bell Laboratories

Yossi Matias
Department of Computer Science
Tel-Aviv University
Abstract
In large data recording and warehousing environments, it is of-
ten advantageous to provide fast, approximate answers to queries,
whenever possible. Before DBMSs providing highly-accurate ap-
proximate answers can become a reality, many new techniques for
summarizing data and for estimating answers from summarized
data must be developed. This paper introduces two new sampling-
based summary statistics, concise samples and counting samples,
and presents new techniques for their fast incremental maintenance
regardless of the data distribution. We quantify their advantages
over standard sample views in terms of the number of additional
sample points for the same view size, and hence in providing more
accurate query answers. Finally, we consider their application to
providing fast approximate answers to hot list queries. Our algo-
rithms maintain their accuracy in the presence of ongoing insertions
to the data warehouse.
1 Introduction
In large data recording and warehousing environments, it is often
advantageous to provide fast, approximate answers to queries. The
goal is to provide an estimated response in orders of magnitude
less time than the time to compute an exact answer, by avoiding or
minimizing the number of accesses to the base data.
In a traditional data warehouse set-up, depicted in Figure 1,
each query is answered exactly using the data warehouse. We con-
sider instead the set-up depicted in Figure 2, for providing very
fast approximate answers to queries. In this set-up, new data being
loaded into the data warehouse is also observed by an approximate
answer engine. This engine maintains various summary statistics,
which we denote synopsis data structures or synopses [GM97].
Queries are sent to the approximate answer engine. Whenever
possible, the engine uses its synopses to promptly return a query re-
sponse, consisting of an approximate answer and an accuracy mea-
sure (e.g., a 95% confidence interval for numerical answers). The
user can then decide whether or not to have an exact answer com-
puted from the base data, based on the user’s desire for the exact
answer and the estimated time for computing an exact answer as
determined by the query optimizer and/or the approximate answer
engine.1 There are a number of scenarios for which a user may prefer an approximate answer in a few seconds over an exact answer that requires tens of minutes or more to compute, e.g., during a drill-down query sequence in data mining [GM95, HHW97]. Moreover, as discussed by Faloutsos et al. [FJS97], sometimes the base data is remote and currently unavailable, so that an exact answer is not an option until the data again becomes available.

Figure 1: A traditional data warehouse.
Figure 2: Data warehouse set-up for providing approximate query answers.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMOD '98 Seattle, WA, USA
© 1998 ACM 0-89791-995-5/98/006...$5.00
Techniques for fast approximate answers can also be used in a
more traditional role within the query optimizer to estimate plan
costs, again with very fast response time.
The state-of-the-art in approximate query answers (e.g., [VL93,
HHW97, BDF+97]) is quite limited in its speed, scope and accu-
racy. Before DBMSs providing highly-accurate approximate an-
swers can become a reality, many new techniques for summarizing
data and for estimating answers from summarized data must be de-
veloped. The goal is to develop effective synopses that capture
important information about the data in a concise representation.
The important features of the data are determined by the types of
queries for which approximate answers are a desirable option. For
example, it has been shown that for providing approximate answers
to range selectivity queries, the V-optimal histograms capture important features of the data in a concise way [PIHS96].
1 This differs from the online aggregation approach in [HHW97], in which the base data is scanned and the approximate answer is updated as the scan proceeds.
To handle many base tables and many types of queries, a large
number of synopses may be needed. Moreover, for fast response
times that avoid disk access altogether, synopses that are frequently
used to respond to queries should be memory-resident.2 Thus we evaluate the effectiveness of a synopsis as a function of its footprint, i.e., the number of memory words to store the synopsis. For
example, it is common practice to evaluate the effectiveness of a
histogram in estimating range selectivities as a function of the his-
togram footprint (i.e., the number of histogram buckets and the
storage requirement for each bucket). Although machines with
large main memories are becoming increasingly commonplace, this
memory remains a precious resource, as it is needed for query-
processing working space (e.g., building hash tables for hash joins)
and for caching disk blocks. Moreover, small footprints are more
likely to lead to effective use of the processor's L1 and/or L2 cache; e.g., a synopsis that fits entirely in the processor's cache enables even faster response times.
The effectiveness of a synopsis can be measured by the accuracy of the answers it provides, and its response time. In order
to keep a synopsis up-to-date, updates to the data warehouse must
be propagated to the synopsis, as discussed above. Thus the final
metric is the update time.
1.1 Concise samples and counting samples
This paper introduces two new sampling-based summary statistics, concise samples and counting samples, and presents new techniques for their fast incremental maintenance regardless of the data distribution.
Consider the class of queries that ask for the frequently occurring values for an attribute in a relation of size n. One possible synopsis data structure is the set of attribute values in a uniform random sample of the tuples in the relation: any value occurring frequently in the sample is returned in response to the query. However, note that any value occurring frequently in the sample is a wasteful use of the available space. We can represent k copies of the same value v as the pair (v, k), thereby freeing up space for k - 2 additional sample points.3 This simple observation leads to our first new sampling-based synopsis data structure:
Definition 1 A concise sample is a uniform random sample of the data set such that values appearing more than once in the sample are represented as a value and a count.
While using (value, count) pairs is common practice in various contexts, we apply it in the context of random samples, such that a concise sample of sample-size m will refer to a sample of m' ≥ m sample points whose concise representation (i.e., footprint) is size m. This simple idea is quite powerful, and to the best of our knowledge, has never before been studied.
Concise samples are never worse than traditional samples, and can be exponentially better or more, depending on the data distribution. We quantify their advantages over traditional samples in terms of the number of additional sample points for the same footprint, and hence in providing more accurate query answers.
Since the number of sample points provided by a concise sample depends on the data distribution, the problem of maintaining a concise sample as new data arrives is more difficult than with ordinary samples. We present a fast algorithm for maintaining a concise sample within a given footprint bound, as new data is inserted into the data warehouse.
Counting samples are a variation on concise samples in which the counts are used to keep track of all occurrences of a value inserted into the relation since the value was selected for the sample. We discuss their relative merits as compared with concise samples, and present a fast algorithm for maintaining counting samples under insertions and deletions to the data warehouse.
In most uses of random samples in estimation, whenever a sample of size n is needed it is extracted from the base data: either the entire relation is scanned to extract the sample, or n random disk blocks must be read (since tuples in a disk block may be highly correlated). With our approximate query set-up, as in [GMP97b], we maintain a random sample at all times. As argued in [GMP97b], maintaining a random sample allows for the sample to be packed into consecutive disk blocks or in consecutive pages of memory. Moreover, for each tuple in the sample, only the attribute(s) of interest are retained, for an even smaller footprint and faster retrieval.
Sampling-based estimation has been shown to be quite useful in the context of query processing and optimization (see, e.g., Chapter 9 in [BDF+97]). The accuracy of sampling-based estimation improves with the size of the sample. Since both concise and counting samples provide more sample points for the same footprint, they provide more accurate estimations.
Note that any algorithm for maintaining a synopsis in the presence of inserts without accessing the base data can also be used to compute the synopsis from scratch in one pass over the data, in limited memory.
2 Various synopses can be swapped in and out of memory as needed. For persistence and recovery, combinations of snapshots and/or logs can be stored on disk; alternatively, the synopsis can often be recomputed in one pass over the base data. Such discussions are beyond the scope of this paper.
3 We assume throughout this paper that values and counts use one "word" of memory each. In general, variable-length encoding could be used for the counts, so that only ⌈lg x⌉ bits are needed to store x as a count; this reduces the footprint but complicates the memory management.
1.2 Hot list queries
We consider an application of concise and counting samples to the
problem of providing fast (approximate) answers to hot list queries.
Specifically, we provide, to a certain accuracy, an ordered set of
(value, count) pairs for the most frequently occurring “values” in a
data set, in potentially orders of magnitude smaller footprint than
needed to maintain the counts for all values. An example hot list
is the top selling items in a database of sales transactions. In var-
ious contexts, hot lists of m pairs are denoted as high-biased his-
tograms [IC93] of m + 1 buckets, the first m mode statistics, or
the m largest itemsets [AS94]. Hot lists can be maintained on sin-
gleton values, pairs of values, triples, etc.; e.g., they can be main-
tained on k-itemsets for any specified k, and used to produce asso-
ciation rules [AS94, BMUT97]. Hot lists capture the most skewed
(i.e., popular) values in a relation, and hence have been shown to
be quite useful for estimating predicate selectivities and join sizes
(see [Ioa93, IC93, IP95]). In a mapping of values to parallel pro-
cessors or disks, the most skewed values limit the number of pro-
cessors or disks for which good load balance can be obtained. Hot
lists are also quite useful in data mining contexts for real-time fraud
detection in telecommunications traffic [Pre97], and in fact an early
version of our algorithm described below has been in use in such
contexts for over a year.
Note that the difficulty in incremental maintenance of hot lists
is in detecting when itemsets that were small become large due to a
shift in the distribution of the newer data. Such detection is difficult
since no information is maintained on small itemsets, in order to
remain within the footprint bound, and we do not access the base
data.
Our solution can be viewed as using a probabilistic counting scheme to identify newly-popular itemsets: If τ is the estimated itemset count of the smallest itemset in the hot list, then we add each new item with probability 1/τ. Thus, although we cannot afford to maintain counts that will detect when a newly-popular itemset has now occurred τ or more times, we probabilistically expect to have τ occurrences of the itemset before we (tentatively) add the itemset to the hot list.
We present an algorithm based on concise samples and one
based on counting samples. The former has lower overheads but
the latter is more accurate. We provide accuracy guarantees for the
two methods, and experimental results demonstrating their (often
large) advantage over using a traditional random sample. Our algo-
rithms maintain their accuracy in the presence of ongoing insertions
to the data warehouse.
This work is part of the Approximate Query Answering (AQUA)
project at Bell Labs. Further details on the Aqua project can be
found in [GMP97a, GPA+98].
Outline. Section 2 discusses previous related work. Concise sam-
ples are studied in Section 3, and counting samples are studied in
Section 4. Finally, in Section 5, we describe their application to hot
list queries.
2 Previous related work
Hellerstein, Haas, and Wang [HHW97] proposed a framework for
approximate answers of aggregation queries called online aggrega-
tion, in which the base data is scanned in a random order at query
time and the approximate answer for an aggregation query is up-
dated as the scan proceeds. A graphical display depicts the answer
and a (decreasing) confidence interval as the scan proceeds, so that
the user may stop the process at any time. Our techniques do not
provide such continuously-refined approximations; instead we pro-
vide a single discrete step of approximation. Moreover, we do not
provide special treatment for small sets in group-by operations as
outlined by Hellerstein et al. Furthermore, since our synopses are
precomputed, we must know in advance what are the attribute(s) of
interest; online aggregation does not require such advance knowl-
edge (except for its group-by treatment). Finally, we do not con-
sider all types of aggregation queries, and instead study sampling-
based summary statistics which can be applied to give sampling-
based approximate answers. There are two main advantages of our
approach. First is the response time: our approach is many orders
of magnitude faster since we provide an approximate answer with-
out accessing the base data. Ours may respond without a single
disk access, as compared with the many disk accesses performed
by their approach. Second, we do not require that data be read in
a random order in order to obtain provable guarantees on the accu-
racy.
Other systems support limited on-line aggregation features; e.g.,
the Red Brick system supports running count, average, and sum
(see [HHW97]).
There have been several query processors designed to provide
approximate answers to set-valued queries (e.g., see [VL93] and the
references therein). These operate on the base data at query time
and typically define an approximate answer for set-valued queries
to be subsets and supersets that converge to the exact answer. There
have also been recent works on “fast-first” query processing, whose
goal is to quickly provide a few tuples of the query answer. Bayardo
and Miranker [BM96] devised techniques for optimizing and exe-
cuting queries using pipelined, nested-loops joins in order to mini-
mize the latency until the first answer is produced. The Oracle Rdb
system [AZ96] provides support for running multiple query plans
simultaneously, in order to provide for fast-first query processing.
Barbara et al. [BDF+97] presented a survey of data reduc-
tion techniques, including sampling-based techniques; these can be
used for a variety of purposes, including providing approximate
query answers. Olken and Rotem [OR92] presented techniques for maintaining random sample views. In [GMP97b], we advocated the use of a backing sample, a random sample of a relation that is kept up-to-date, and showed how it can be used for fast incremental maintenance of equi-depth and compressed histograms. A concise
sample could be used as a backing sample, for more sample points
for the same footprint.
Matias et al. [MVN93, MVY94, MSY96] proposed and studied approximate data structures that provide fast approximate answers. These data structures have linear space footprints.
A number of probabilistic techniques have been previously proposed for various counting problems. Morris [Mor78] (see also [Fla85], [HK95]) showed how to approximate the sum of a set of n values in [1..m] using only O(lg lg m + lg lg n) bits of memory. Flajolet and Martin [FM83, FM85] designed an algorithm for approximating the number of distinct values in a relation in a single pass through the data and using only O(lg n) bits of mem-
ory. Other algorithms for approximating the number of distinct
values in a relation include [WVZT90, HNSS95]. Alon, Matias
and Szegedy [AMS96] developed sublinear space randomized al-
gorithms for approximating various frequency moments, as well as
tight bounds on the minimum possible memory required to approx-
imate such frequency moments. Probabilistic techniques for fast
parallel estimation of the size of a set were studied in [Mat92].
None of this previous work considers concise or counting sam-
ples.
3 Concise samples
Consider a relation R with n tuples and an attribute A. Our goal is
to obtain a uniform random sample of R.A, i.e., the values of A for
a random subset of the tuples in R.4
Since a concise sample represents sample points occurring more
than once as (value, count) pairs, the true sample size may be much
larger than its footprint (it is never smaller).
Definition 2 Let S = {(v_1, c_1), ..., (v_j, c_j), v_{j+1}, ..., v_ℓ} be a concise sample. Then sample-size(S) = ℓ - j + Σ_{i=1}^{j} c_i, and footprint(S) = ℓ + j.
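For example (a toy illustration of ours, not taken from the paper), with a concise sample stored as a Python dict mapping each sampled value to its multiplicity, the two quantities of Definition 2 become:

    # ours: a concise sample as {value: multiplicity}; multiplicity 1 = singleton
    def sample_size(concise):
        return sum(concise.values())            # equals ℓ - j + Σ c_i

    def footprint(concise):
        # singletons use one word, (value, count) pairs use two: equals ℓ + j
        return sum(1 if c == 1 else 2 for c in concise.values())

    S = {12: 5, 7: 3, 4: 1, 9: 1}               # i.e., {(12,5), (7,3), 4, 9}
    assert sample_size(S) == 10 and footprint(S) == 6

A traditional sample with the same footprint of 6 would hold only 6 sample points.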
A concise sample S of R.A is a uniform random sample of size
sample-size(S), and hence can be used as a uniform random sample
in any sampling-based technique for providing approximate query
answers.
Note that if there are at most m/2 distinct values for R.A,
then a concise sample of sample-size n has a footprint at most
m (i.e., in this case, the concise sample is the exact histogram of
(value, count) pairs for R.A). Thus, the sample-size of a concise
sample may be arbitrarily larger than its footprint:
Lemma 1 For any footprint m > 2, there exist data sets for which the sample-size of a concise sample is n/m times larger than its footprint, where n is the size of the data set.
Since the sample-size of a traditional sample equals its foot-
print, Lemma 1 implies that for such data sets, the concise sample
has n/m times as many sample points as a traditional sample of
the same footprint.
Offline/static computation. We first describe an algorithm for extracting a concise sample of footprint m from a static relation residing on disk. First, repeat m times: select a random tuple from the relation (this typically takes multiple disk reads per tuple [OR89, Ant92]) and extract its value for attribute A. Next,
semi-sort the set of values, and replace every value occurring multi-
ple times with a (value, count) pair. Then, continue to sample until
either adding the sample point would increase the concise sample
footprint to m + 1 (in which case this last attribute value is ignored) or n samples have been taken. For each new value sampled, look up to see if it is already in the concise sample and then either add a new singleton value, convert a singleton to a (value, count) pair, or increment the count for a pair. To minimize the cost, sample points can be taken in batches and stored temporarily in the working space memory, and a look-up hash table can be constructed to enable constant-time look-ups; once the concise sample is constructed, only the concise sample itself is retained. If m' sample points are selected in all (i.e., the sample-size is m'), the cost is O(m') disk accesses. The incremental approach we describe next requires no disk accesses, given the set-up depicted in Figure 2. In general, it can also be used to compute a concise sample in one sequential pass over a relation.
4 For simplicity, we describe our algorithms here and in the remainder of the paper in terms of a single attribute, although the approaches apply equally well to pairs of attributes, etc.
3.1 Incremental maintenance of concise samples
We present a fast algorithm for maintaining a concise sample within
a given footprint bound as new data is inserted into the data ware-
house. Since the number of sample points provided by a concise
sample depends on the data distribution, the problem of maintain-
ing a concise sample as new data arrives is more difficult than
with traditional samples. The reservoir sampling algorithm of Vit-
ter [Vit85], that can be used to maintain a traditional sample in the
presence of insertions of new data (see [GMP97b] for extensions
to handle deletions), relies heavily on the fact that we know in ad-
vance the sample-size (which, for traditional samples, equals the
footprint size). With concise samples, the sample-size depends on
the data distribution to date, and any changes in the data distribu-
tion must be reflected in the sampling frequency.
Our maintenance algorithm is as follows. We set up an entry threshold τ (initially 1) for new tuples to be selected for the sample. Let S be the current concise sample and consider a new tuple t. With probability 1/τ, we add t.A to S. We do a look-up on t.A in S. If it is represented by a pair, we increment its count. Otherwise, if t.A is a singleton in S, we create a pair, or if it is not in S, we create a singleton. In these latter two cases we have increased the footprint by 1, so if the footprint for S was already equal to the prespecified footprint bound, then we need to evict existing sample points to create room.
In order to create room, we raise the threshold to some τ' and then subject each sample point in S to this higher threshold. Specifically, each of the sample-size(S) sample points is retained with probability τ/τ' and evicted with probability 1 - τ/τ'; we expect to have sample-size(S)·(1 - τ/τ') sample points evicted. Note that the footprint is only decreased when a (value, count) pair reverts to a singleton or when a value is removed altogether. If the footprint has not decreased, we raise the threshold and try again. Subsequent inserts are selected for the sample with probability 1/τ'.
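A minimal sketch of this maintenance loop in Python (our illustration, not the paper's code; the class and parameter names are ours, and the 1.1 raise factor mirrors the choice used in the experiments of Section 3.3). It flips a coin per occurrence; the skip-counting optimization discussed below would replace these flips in a tuned implementation:

    import random

    class ConciseSample:
        def __init__(self, max_footprint, raise_factor=1.1):
            self.max_footprint = max_footprint
            self.raise_factor = raise_factor
            self.tau = 1.0                      # entry threshold
            self.counts = {}                    # sampled value -> multiplicity

        def footprint(self):
            # recomputed for clarity; an implementation would track it incrementally
            return sum(1 if c == 1 else 2 for c in self.counts.values())

        def insert(self, value):
            if random.random() >= 1.0 / self.tau:
                return                          # not selected for the sample
            grows = value not in self.counts or self.counts[value] == 1
            self.counts[value] = self.counts.get(value, 0) + 1
            # footprint grows only for a new singleton or a singleton-to-pair change
            if grows and self.footprint() > self.max_footprint:
                self._create_room()

        def _create_room(self):
            while self.footprint() > self.max_footprint:
                new_tau = self.tau * self.raise_factor
                keep = self.tau / new_tau       # each sample point survives w.p. tau/tau'
                for value in list(self.counts):
                    kept = sum(1 for _ in range(self.counts[value])
                               if random.random() < keep)
                    if kept == 0:
                        del self.counts[value]
                    else:
                        self.counts[value] = kept
                self.tau = new_tau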
Theorem 2 For any sequence of insertions, the above algorithm maintains a concise sample.
Proof: Let τ be the current threshold. We maintain the invariant that each tuple in the relation has been treated as if the threshold were always τ. The crux of the proof is to show that this invariant is maintained when the threshold is raised to τ'. Each of the sample-size(S) sample points is retained with probability τ/τ'. If a tuple was not in S prior to creating room, then by the inductive invariant, a coin with heads probability 1/τ was flipped and failed to come up heads for this tuple. Thus the same probabilistic event would fail to come up heads with the new, stricter coin (with heads probability only 1/τ'). If it was in S prior to creating room, then by the inductive invariant, a coin with heads probability 1/τ came up heads. Since (1/τ)·(τ/τ') = 1/τ', the result is that the tuple is in the sample with probability 1/τ'. Thus the inductive invariant is indeed maintained.
.
The algorithm maintains a concise sample regardless of the se-
quence of increasing thresholds used. Thus, there is complete flexi-
bility in deciding, when raising the threshold, what the new thresh-
old should be. A large raise may evict more than is needed to reduce
the sample footprint below its upper bound, resulting in a smaller
sample-size than there would be if the sample footprint matches
the upper bound. On the other hand, evicting more than is needed
creates room for subsequent additions to the concise sample, so the
procedure for creating room runs less frequently. A small raise also
increases the likelihood that the footprint will not decrease at all,
and the procedure will need to be repeated with a higher threshold.
For simplicity in the experiments reported in Section 3.3, we
raised the threshold by 10% each time. Note that in general, one can improve threshold selection at a cost of a more elaborate algorithm, e.g., by using binary search to find a threshold that will create the desired decrease in the footprint or by setting the threshold so that (1 - τ/τ') times the number of singletons is a lower bound on the desired decrease in the footprint.
Note that instead of flipping a coin for each insert into the data warehouse, we can flip a coin that determines how many such inserts can be skipped before the next insert that must be placed in the sample (as in Vitter's reservoir sampling Algorithm X [Vit85]): the probability of skipping over exactly i elements is (1 - 1/τ)^i · (1/τ). As τ gets large, this results in a significant savings in the number of coin flips and hence the update time. Likewise, since the probability of evicting a sample point is typically small (i.e., τ'/τ is a small constant), we can save on coin flips and decrease the update time by using a similar approach when evicting.
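For instance, the skip length can be drawn directly by inverting the geometric distribution (a small sketch of ours illustrating the idea):

    import math, random

    def skips_until_next_selection(tau):
        # Pr(skip exactly i inserts) = (1 - 1/tau)**i * (1/tau)
        if tau <= 1.0:
            return 0
        u = 1.0 - random.random()               # uniform in (0, 1]
        return int(math.log(u) / math.log(1.0 - 1.0 / tau))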
Raising a threshold costs O(m'), where m' is the sample-size of the concise sample before the threshold was raised. For the case where the threshold is raised by a constant factor each time, we expect there to be a constant number of coin tosses resulting in sample points being retained for each sample point evicted. Thus we can amortize the retained against the evicted, and we can amortize the evicted against their insertion into the sample (each sample point is evicted only once). It follows that even taking into account the time for each threshold raise, we have an O(1) amortized expected update time per insert, regardless of the data distribution.
3.2 Quantifying the sample-size advantage of concise samples
The expected sample-size increases with the skew in the data. By
Lemma 1, the advantage is unbounded for certain distributions. We
show next that for exponential distributions, the advantage is expo-
nential:
Theorem 3 Consider the family of exponential distributions: for i = 1, 2, ..., Pr(v = i) = α^{-i}·(α - 1), for α > 1. For any footprint m > 2, the expected sample-size of a concise sample with footprint m is at least α^{m/2}.
Proof: The expected sample-size can be lower bounded by the expected number of randomly selected tuples before the first tuple whose attribute value v is greater than m/2. (When all values are at most m/2 then we can fit each value and its count, if any, within the footprint.) The probability of selecting a value greater than m/2 is
Σ_{i=m/2+1}^{∞} α^{-i}·(α - 1) = α^{-m/2},
so the expected number of tuples selected before such an event occurs is α^{m/2}.
.
Next, we evaluate the expected gain in using a concise sample
over a traditional sample for arbitrary data sets. The estimate is
given in terms of the frequency moment F_k, for k ≥ 2, of the data set, defined as F_k = Σ_j n_j^k, where j is taken over the values represented in the set and n_j is the number of set elements of value j.
Theorem 4 For any data set, when using a concise sample S with sample-size m, the expected gain is
E[m - number of distinct values in S] = Σ_{k=2}^{m} (-1)^k (m choose k) F_k / n^k.
Proof: Let p_j = n_j/n be the probability that an item selected at random from the set is of value j. Let X_i be an indicator random variable such that X_i = 1 if the ith item selected to be in the traditional sample has a value not represented as yet in the sample, and X_i = 0 otherwise. Then Pr(X_i = 1) = Σ_j p_j (1 - p_j)^{i-1}, where j is taken over the values represented in the set (since X_i = 1 if some value j is selected that has not been selected in any of the first i - 1 steps). Clearly, X = Σ_{i=1}^{m} X_i is the number of distinct values in the traditional sample. We can now evaluate E[number of distinct values] as
E[X] = Σ_{i=1}^{m} E[X_i] = Σ_{i=1}^{m} Σ_j p_j (1 - p_j)^{i-1} = Σ_j p_j · (1 - (1 - p_j)^m) / (1 - (1 - p_j)) = Σ_j (1 - (1 - p_j)^m).
Expanding (1 - p_j)^m with the binomial theorem and summing over j then gives m - E[X] = Σ_{k=2}^{m} (-1)^k (m choose k) F_k / n^k, as claimed.
.
Note that the footprint for a concise sample is at most twice the
number of distinct values.
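As a sanity check on Theorem 4 (our code, not the paper's), the closed form can be compared numerically against the direct expression E[number of distinct values] = Σ_j (1 - (1 - p_j)^m) from the proof:

    from math import comb

    def expected_gain_direct(freqs, m):
        n = sum(freqs)
        expected_distinct = sum(1 - (1 - f / n) ** m for f in freqs)
        return m - expected_distinct

    def expected_gain_moments(freqs, m):
        n = sum(freqs)
        # F_k = sum_j n_j^k ;  gain = sum_{k=2}^{m} (-1)^k C(m,k) F_k / n^k
        return sum((-1) ** k * comb(m, k) * sum(f ** k for f in freqs) / n ** k
                   for k in range(2, m + 1))

    freqs = [50, 30, 15, 5]                     # the n_j of four distinct values
    assert abs(expected_gain_direct(freqs, 10)
               - expected_gain_moments(freqs, 10)) < 1e-9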
3.3 Experimental evaluation
We conducted a number of experiments evaluating the gain in the
sample-size of concise samples over traditional samples. In each
experiment, 500K new values were inserted into an initially empty
data warehouse. Since the exact attribute values do not affect the
relative quality of our techniques, we chose the integer value do-
main from [1, D], where D, the potential number of distinct values,
was varied from 500 to 50K. We used a large variety of Zipf data
distributions. The zipf parameter was varied from 0 to 3 in incre-
ments of 0.25; this varies the skew from nonexistent (the case of
zipf parameter = 0 is the uniform distribution) to quite large. Most
of the experiments restricted each sample to footprint m = 1000.
However, to stress the algorithms, we also considered footprint
m = 100. Recall that if the ratio D/m is ≤ .5, then all values
inserted into the warehouse can be maintained in the concise sam-
ple. We consider D/m = 5, 50, and 500. Each data point plotted
is the average of 5 trials.
Each experiment compares the sample-size of the samples pro-
duced by three algorithms, with the same footprint m.
• traditional: a random sample of size m is maintained using reservoir sampling.
• concise online: the algorithm described in Section 3.1.
• concise offline: the offline/static algorithm described at the beginning of Section 3.
The offline algorithm is plotted to show the intrinsic sample-size
of concise samples for the given distribution. The gap between the
online and the offline is the penalty our online algorithm pays in
terms of loss in sample-size due to suboptimal adjustments in the
threshold. In the experiments plotted, whenever the threshold is raised, the new threshold is set to ⌈1.1 × τ⌉, where τ is the current threshold.
Figure 3 depicts the sample-size as a function of the zipf pa-
rameter for varying footprints and D/m ratios. First, in (a) and
(b), we compare footprint 100 and 1000, respectively, for the same
data sets. The sample-size for traditional samples, which equals the footprint, is so small that it is hidden by the x-axis in these plots. At the scale shown in these two plots, the other experiments we performed for footprint 100 and 1000 gave nearly identical results.
These results show that for high skew the sample-size for concise
samples grows up to 3 orders of magnitude larger than for tradi-
tional samples, as anticipated. Also, the online algorithm is within
15% of the offline algorithm for footprint 1000 and within 28%
when constrained to use only footprint 100.
Second, in (c) and (d), we show representative plots of our ex-
periments depicting how the gain in sample-size is affected by the
D/m ratio. In these plots, we compare D/m = 50 and D/m = 5,
respectively, for the same footprint 1000. We have truncated these
plots at zipf parameter 1.5, to permit a closer examination of
the sample-size gains for zipf parameters near 1.0. (In fact, Fig-
ure 3(d) is simply a more detailed look at the data points in Fig-
ure 3(b) up to zipf parameter 1.5.)
Recall that for D/m ≤ .5, the sample-size for concise samples
is a factor of n/m larger than that for traditional samples, regard-
less of the zipf parameter. These figures show that for D/m = 5,
there are no noticeable gains in sample-size for concise samples
until the zipf parameter is > 0.5, and for D/m = 50, there are no
noticeable gains until the zipf parameter is > 0.75. The improve-
ments with smaller D/m arise since m/D is the fraction of the
distinct values for which counts can be maintained.
Update time overheads. There are two main sources of update
time overheads associated with our (online) concise sampling al-
gorithm. First, there are the coin flips that must be performed to
decide which inserts are added to the concise sample and to evict
values from the concise sample when the threshold is raised. Recall
that we use the technique in [Vit85] that minimizes the number of
coin flips by computing, for a given coin bias, how many flips of the
coin until the next heads (or next tails, depending on which type of flip requires an action to be performed by the algorithm). Since the algorithm does work only when we have such a coin flip, the number of coin flips is a good measure of the update time overheads.
For each of the data distribution and footprint scenarios presented
in Figure 3, we report in Table 1 the average coin flips for each new
insert to the data warehouse.
Second, there are the lookups into the current concise sample to see if a value is already present in the sample. The coin flip
measure does not account for the work done in initially populating
the concise sample: on start-up, the algorithm places every insert
into the concise sample until it has exceeded its footprint. A lookup
is performed for each of these, so the lookup measure accounts for
this cost, as well as the lookups done when an insert is selected
for the concise sample due to a coin flip. For each of the data
distribution and footprint scenarios presented in Figure 3, we report
in Table 1 the number of lookups per insert to the data warehouse.
As can be seen from the table, the overheads are quite small.
The overheads are smallest for small zipf parameters. There is
very little dependence on the D/m ratio. An order of magnitude
decrease in the footprint results in roughly an order of magnitude
Figure 3: Comparing sample-sizes of concise and traditional samples as a function of skew, for varying footprints and D/m ratios. In (a) and (b), we compare footprint 100 and footprint 1000, respectively, for the same data sets. In (c) and (d), we compare D/m = 50 and D/m = 5, respectively, for the same footprint 1000. (Each panel plots sample-size against the zipf parameter for the concise offline, concise online, and traditional algorithms; data: 500000 values.)
Table 1: Coin flips and lookups per insert for the experiments in Figure 3. These are abstract measures of the computation costs: the number of instructions executed by the algorithm is directly proportional to the number of coin flips and lookups, and is dominated by these two factors.

 zipf      Fig. 3(a)         Figs. 3(b),(d)    Fig. 3(c)
 param   flips   lookups    flips   lookups    flips   lookups
 0.00    0.003   0.002      0.023   0.013      0.023   0.013
 0.25    0.003   0.002      0.023   0.013      0.023   0.013
 0.50    0.003   0.002      0.024   0.014      0.023   0.013
 0.75    0.003   0.002      0.027   0.016      0.024   0.014
 1.00    0.004   0.002      0.041   0.024      0.032   0.019
 1.25    0.006   0.003      0.079   0.049      0.066   0.040
 1.50    0.011   0.007      0.188   0.124      0.170   0.111
 1.75    0.023   0.013      0.426   0.333      0.406   0.306
 2.00    0.045   0.027      0.559   0.744      0.645   0.726
 2.25    0.097   0.061      0.000   1.000      0.000   1.000
 2.50    0.189   0.125      0.000   1.000      0.000   1.000
 2.75    0.363   0.271      0.000   1.000      0.000   1.000
 3.00    0.544   0.482      0.000   1.000      0.000   1.000
decrease in the overheads for zipf parameters below 2. For zipf pa-
rameters above 2, all values fit within the footprint 1000, so there is
exactly one lookup and zero coin flips per insert to the data ware-
house. Each of these results can be understood by observing that
for a given threshold, the expected number of Rips and lookups is
inversely proportional to the threshold. Moreover, the expectation
of the sample-size is equal to the number of inserts divided by the
current threshold. Thus the flips and lookups per insert increases
with increasing sample-size (except when the flips drop to zero as
discussed above).
Note that despite the procedure to revisit sample points and per-
form coin flips whenever the threshold is raised, the number of flips
per insert is at worst 0.645, and often orders of magnitude smaller.
This is due to a combination of two factors: if the threshold is raised
a large amount, then the procedure is done less often, and if it is
raised only a small amount, then very few flips are needed in the
procedure (since we are using [Vit85]).
4 Counting samples
In this section, we define counting samples, present an algorithm
for their incremental maintenance, and provide analytical guaran-
tees on their performance.
Consider a relation R with n tuples and an attribute A. Count-
ing samples are a variation on concise samples in which the counts
are used to keep track of all occurrences of a value inserted into the
relation since the value was selected for the sample.5 Their definition is motivated by a sampling-and-counting process of this type from a static data warehouse:
Definition 3 A counting sample for R.A with threshold τ is any subset of R.A obtained as follows:
1. For each value v occurring c > 0 times in R, we flip a coin with probability 1/τ of heads until the first heads, up to at most c coin tosses in all; if the ith coin toss is heads, then v occurs c - i + 1 times in the subset, else v is not in the subset.
2. Each value v occurring c > 1 times in the subset is represented as a pair (v, c), and each value v occurring exactly once is represented as a singleton v.
5 In other words, since we have set aside a memory word for a count, why not count the subsequent occurrences exactly?
Obtaining a concise sample from a counting sample. Although
counting samples are not uniform random samples of the base data,
they can be used to obtain such a sample without any further ac-
cess to the base data. Specifically, a concise sample can be ob-
tained from a counting sample by considering each pair (v, c) in the counting sample in turn, flipping a coin with probability 1/τ of heads c - 1 times, and reducing the count by the number of tails. The footprint decreases by one for each pair for which all its coins are tails.
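In code (a sketch of ours; the counting sample is assumed stored as a dict from value to its count), the conversion is one batch of coin flips per pair:

    import random

    def counting_to_concise(counting, tau):
        concise = {}
        for value, c in counting.items():
            # keep the occurrence whose coin came up heads; each of the other
            # c - 1 counted occurrences survives independently with probability 1/tau
            survivors = sum(1 for _ in range(c - 1) if random.random() < 1.0 / tau)
            concise[value] = 1 + survivors
        return concise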
4.1 Incremental maintenance of counting samples
Our incremental maintenance algorithm is as follows. We set up
an entry threshold τ (initially 1) for new tuples to be selected for the counting sample. Let S be the current counting sample and consider a new tuple t. We do a look-up on t.A in S. If t.A is represented by a (value, count) pair in S, we increment its count. If t.A is a singleton in S, we create a pair. Otherwise, t.A is not in S and we add it to S with probability 1/τ.
If the footprint for S now exceeds the prespecified footprint
bound, then we need to evict existing values to create room. As
with concise samples, we raise the threshold to some τ', and then
subject each value in S to this higher threshold. The process is
slightly different for counting samples, since the counts are differ-
ent.
For each value in the counting sample, we flip a biased coin,
decrementing its observed count on each flip of tails until either
the count reaches zero or a heads is flipped. The first coin toss
has probability of heads τ/τ', and each subsequent coin toss has probability of heads 1/τ'. Values with count zero are removed from
the counting sample; other values remain in the counting sample
with their (typically reduced) counts. (The overall number of coin
tosses can be reduced to a constant per value using an approach
similar to that described for concise samples, since we stop at the
first heads (if any) for each value.) Thus raising a threshold costs
O(m), where m is the number of distinct values in the counting
sample (which is at most the footprint). If the threshold is raised by a constant factor each time, we expect there to be a constant
number of sample points removed for each sample point flipping
a heads. Thus as in concise sampling, it follows that we have a
constant amortized expected update time per data warehouse insert,
regardless of the data distribution.
An advantage of counting samples over concise samples is that
we can maintain counting samples in the presence of deletions to
the data warehouse. Maintaining concise samples in the presence
of such deletions is difficult: If we fail to delete a sample point in
response to the delete operation, then we risk having the sample
fail to be a subset of the data set. On the other hand, if we always
delete a sample point, then the sample may no longer be a random
sample of the data set. With counting samples, we do not have this
difficulty. For a delete of a value v, we look-up to see if v is in the
counting sample (using a hash function), and decrement its count if
it is. Thus we have O(1) expected update time for deletions to the
data warehouse.
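A matching sketch for counting samples (again our code and naming, not the paper's), covering inserts, deletes, and the threshold-raising step described above:

    import random

    class CountingSample:
        def __init__(self, max_footprint, raise_factor=1.1):
            self.max_footprint = max_footprint
            self.raise_factor = raise_factor
            self.tau = 1.0
            self.counts = {}        # value -> count of occurrences since selection

        def footprint(self):
            return sum(1 if c == 1 else 2 for c in self.counts.values())

        def insert(self, value):
            if value in self.counts:
                grows = self.counts[value] == 1     # singleton becomes a pair
                self.counts[value] += 1             # later occurrences counted exactly
            elif random.random() < 1.0 / self.tau:
                grows = True
                self.counts[value] = 1
            else:
                return
            if grows and self.footprint() > self.max_footprint:
                self._create_room()

        def delete(self, value):
            if value in self.counts:                # otherwise nothing to do
                self.counts[value] -= 1
                if self.counts[value] == 0:
                    del self.counts[value]

        def _create_room(self):
            while self.footprint() > self.max_footprint:
                new_tau = self.tau * self.raise_factor
                for value in list(self.counts):
                    c = self.counts[value]
                    heads = self.tau / new_tau      # first toss: heads prob tau/tau'
                    while c > 0 and random.random() >= heads:
                        c -= 1                      # a tails decrements the count
                        heads = 1.0 / new_tau       # later tosses: heads prob 1/tau'
                    if c == 0:
                        del self.counts[value]
                    else:
                        self.counts[value] = c
                self.tau = new_tau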
Theorem 5 For any sequence of insertions and deletions, the
above algorithm maintains a counting sample.
Proof:
We must show that properties 1 and 2 of the definition
of a counting sample are preserved when an insert occurs, a delete
occurs, or the threshold is raised.
An insert of a value v increases by one its count in R. If the
value is in the counting sample, then one of its coin flips was heads,
and we increment the count in the counting sample. Otherwise,
none of its coin flips to date were heads, and the algorithm flips
a coin with the appropriate probability. All other values are un-
touched, so property 1 is preserved.
A delete of a value v decreases by one its count in R. If the
value is in the counting sample, then the algorithm decrements the
count (which may drop the count to 0). Otherwise, c coin flips
occurred to date and were tails, so the first c - 1 were also tails,
and the value remains omitted from the counting sample. All other
values are untouched, so property 1 is preserved.
Consider raising the threshold from τ to τ', and let v be a value occurring c > 0 times in R. If v is not in the counting sample, there were c coin flips with heads probability 1/τ that came up tails. Thus the same c probabilistic events would fail to come up heads with the new, stricter coin (with heads probability only 1/τ'). If v is in the counting sample with count c', then there were c - c' coin flips with heads probability 1/τ that came up tails, and these same probabilistic events would come up tails with the stricter coin. This was followed by a coin flip with heads probability 1/τ that came up heads, and the algorithm flips a coin with heads probability τ/τ', so that the result is the same as a coin flip with probability (1/τ)·(τ/τ') = 1/τ'. If this coin comes up tails, then subsequent coin flips for this value have heads probability 1/τ'. In this way, property 1 is preserved for all values.
In all cases, property 2 is immediate, and the theorem is proved.
.
Note that although both concise samples and counting samples
have O(1) amortized update times, counting samples are slower
to update than concise samples, since, unlike concise samples, they perform a look-up (into the counting sample) at each update to the data warehouse.
Theorem 6 Let R be an arbitrary relation, and let τ be the current threshold for a counting sample S. (i) Any value v that occurs at least τ times in R is expected to be in S. (ii) Any value v that occurs f_v times in R will be in S with probability 1 - (1 - 1/τ)^{f_v}. (iii) For all α > 1, if f_v ≥ ατ, then with probability ≥ 1 - e^{-α}, the value will be in S and its count will be at least f_v - ατ.
Proof: Claims (i) and (ii) follow immediately from property 1 of counting samples. As for (iii), Pr(v is in S with count ≥ f_v - ατ) = 1 - Pr(the first ατ coin tosses for v are all tails) = 1 - (1 - 1/τ)^{ατ} ≥ 1 - e^{-α}.
.
5 Hot list queries
In this section, we present new algorithms for providing approxi-
mate answers to hot list queries. Recall that hot list queries request
an ordered set of (value, count) pairs for the k most frequently oc-
curring data values, for some k.
5.1 Algorithms
We present four algorithms for providing fast approximate answers
to hot list queries for a relation R with n tuples, based on incre-
mentally maintained synopses with footprint bound m, m ≥ 2k.
Using traditional samples. A traditional sample of size m can be
maintained using Vitter’s reservoir sampling algorithm [Vit85]. To
report an approximate hot list, we first semi-sort by value, and re-
place every sample point occurring multiple times by a (value, count)
pair. We then compute the k'th largest count c_k, and report all pairs with counts at least max(c_k, δ), scaling the counts by n/m, where δ is a confidence threshold (discussed below). Note that there may be fewer than k distinct values in the sample, so fewer than k pairs may be reported (even when using the minimal confidence threshold δ = 1). The response time for reporting is O(m).
Using concise samples. A concise sample of footprint m can be maintained using the algorithm of Section 3. To report an approximate hot list, we first compute the k'th largest count c_k (using a linear time selection algorithm). We report all pairs with counts at least max(c_k, δ), scaling the counts by n/m', where δ is a confidence threshold and m' is the sample-size of the concise sample. Note that when δ = 1, we will report k pairs, but with larger δ, fewer than k may be reported. The response time for reporting is O(m). Alternatively, we can trade off update time vs. response time by keeping the concise sample sorted by counts. This allows for reporting in O(k) time.
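A sketch of this reporting step (our code and naming; concise maps each sampled value to its multiplicity, n is the relation size, and we sort rather than use linear-time selection, for brevity):

    def hot_list_from_concise(concise, k, n, delta=3):
        m_prime = sum(concise.values())                 # sample-size of the concise sample
        counts = sorted(concise.values(), reverse=True)
        ck = counts[k - 1] if len(counts) >= k else 0   # k-th largest sample count
        cutoff = max(ck, delta)
        scale = n / m_prime
        report = [(v, c * scale) for v, c in concise.items() if c >= cutoff]
        return sorted(report, key=lambda pair: pair[1], reverse=True)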
Using counting samples. A counting sample of footprint m can be maintained using the algorithm of Section 4. To report an approximate hot list, we use the same algorithm as described above for using concise samples, except that instead of scaling the counts, we add to the counts a compensation, ĉ, determined by the analysis below. This augmentation of the counts serves to compensate for inserts of a value into the data warehouse prior to the successful coin toss that placed it in the counting sample. Let τ be the current threshold. We report all pairs with counts at least max(c_k, τ - ĉ). Given the conversion of counting samples into concise samples discussed in Section 4, this can be seen to be similar to taking δ = τ - ĉ. (Using the value of ĉ determined below, δ = .582τ + 1.)
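The counting-sample variant differs only in how the reported estimates are formed (again a sketch with our names; c_hat is the compensation ĉ derived in Section 5.2 below):

    def hot_list_from_counting(counting, k, tau):
        c_hat = 0.418 * tau - 1                         # compensation from the analysis below
        counts = sorted(counting.values(), reverse=True)
        ck = counts[k - 1] if len(counts) >= k else 0
        cutoff = max(ck, tau - c_hat)
        report = [(v, c + c_hat) for v, c in counting.items() if c >= cutoff]
        return sorted(report, key=lambda pair: pair[1], reverse=True)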
Full histogram on disk. The last algorithm maintains a full his-
togram on disk, i.e., (value, count) pairs for all distinct values in
R, with a copy of the top m/2 pairs stored as a synopsis within the
approximate answer engine. This enables exact answers to hot list
queries. The main drawback of this approach is that each update to
R requires a separate disk access to update the histogram. More-
over, it incurs a (typically large) disk footprint that may be on the
order of n. Thus this approach is considered only as a baseline for
our accuracy comparisons.
5.2 Analysis
The confidence threshold δ. The threshold δ is used to bound the error. The larger the δ, the greater the probability that for reported values, the counts are quite accurate. On the other hand, the larger the δ the greater the probability that fewer than k pairs will be reported. For its use with traditional samples and concise samples, δ must be an integer (unlike with counting samples, where it need not be). We have found that δ = 3 is a good choice, and use that value in our experiments in Section 5.3.
To study the effect of 6 on the accuracy, we consider in what
follows hot list queries of the form “report all pairs that can be re-
ported with confidence”. That is, we report all values occurring at
least δ times in the traditional or concise sample. The accuracy of
the approximate hot list reported using concise sampling is sum-
marized in the following theorem:
Theorem 7 Let R be an arbitrary relation of size n, and let τ be the current threshold for a concise sample S. Then:
1. Frequent values will be reported: For any ε, 0 < ε < 1, any value v with f_v > τδ/(1 - ε) will be reported with probability at least 1 - e^{-δε²/(2(1-ε))}. As an example, when ε = 1/2, the reporting probability is at least 1 - e^{-δ/4}.
2. Infrequent values will not be reported: For any ε, 0 < ε ≤ 1, any value v with f_v ≤ τδ/(1 + ε) will be reported with probability less than e^{-δε²/(3(1+ε))}. As an example, when ε = 1, the (false) reporting probability is less than e^{-δ/6}.
Proof: These are shown by first reducing the problem to the case where the threshold has always been τ, and then applying a straightforward analysis using Chernoff bounds.
.
Determination of ĉ. The value of ĉ, used in reporting approximate hot lists using counting samples, is determined analytically as follows. Consider a value v in the counting sample S, with count c_v, and let f_v be the number of times the value occurs in R. Let Est_v = c_v + ĉ. We will select ĉ so that Est_v will be close to f_v. In particular, we want E(Est_v | v is in S) = f_v. We have that E(Est_v | v is in S) = ĉ + Σ_{i=1}^{f_v} (f_v - i + 1) · Pr(v was inserted at the ith occurrence | v is in S), which after a lengthy calculation equals ĉ + f_v - τ + 1 + f_v·φ^{f_v}/(1 - φ^{f_v}), where φ = 1 - 1/τ. Thus we need ĉ = τ - 1 - f_v·φ^{f_v}/(1 - φ^{f_v}). Since ĉ depends on f_v, which we do not know, we select ĉ so as to compensate exactly when f_v = τ (in this way, ĉ is the most accurate when it matters most: smaller f_v should not be reported and the value of ĉ is less important for larger f_v). Thus, using (1 - 1/τ)^τ ≈ e^{-1}, ĉ = τ·(1 - e^{-1}/(1 - e^{-1})) - 1 = τ·(1 - 2e^{-1})/(1 - e^{-1}) - 1 ≈ .418τ - 1.
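The constant .418 can be checked numerically (our snippet), using (1 - 1/τ)^τ → e^{-1} for large τ:

    from math import e
    # c_hat/tau -> 1 - (1/e)/(1 - 1/e) = (1 - 2/e)/(1 - 1/e)
    print(round((1 - 2 / e) / (1 - 1 / e), 3))          # prints 0.418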
Theorem 8 Let R be an arbitrary relation, and let τ be the current threshold for a counting sample S. (i) Any value v that occurs f_v < .582τ times in R will not be reported. (ii) For all α > 1, any value v that occurs f_v ≥ ατ times in R will be reported with probability ≥ 1 - e^{-(α-.582)}. (iii) If v is in S, its augmented count will be in [f_v - βτ, f_v + .418τ - 1] with probability ≥ 1 - e^{-(β+.418)}, for all β > 0.
Proof: The algorithm will fail to report v if its count is less than τ - ĉ, i.e., count ≤ .582τ. Claim (i) follows. For the case where f_v ≥ ατ, we have count ≤ .582τ only if the first f_v - .582τ coin tosses are all tails, which happens with probability (1 - 1/τ)^{f_v - .582τ}, which is less than e^{-(f_v/τ - .582)} ≤ e^{-(α - .582)}. Claim (ii) follows. The augmented count is at most f_v + ĉ. It is less than f_v - βτ only if the unaugmented count is at most f_v - (β + .418)τ, which happens only if the first (β + .418)τ coin tosses are all tails, which happens with probability ≤ e^{-(β+.418)}. Claim (iii) follows.
.
On reporting fewer than k values. Our algorithms report fewer than k values for certain data distributions. Alon et al. [AMS96] showed that any randomized online algorithm for approximating the frequency of the mode of a given data set to within a constant factor (with probability > 1/2) requires space linear in the number of distinct values D. This implies that even for k = 1, any algorithm for answering approximate hot list queries based on a synopsis whose footprint is sublinear in D will fail to be accurate for certain data distributions. Thus in order to report only highly-accurate answers, it is inevitable that fewer than k values are reported for certain distributions.
Note that the problematic data distributions are the nearly-uni-
form ones with relatively small maximum frequency (this is the
case in which the lower bound of Alon et al. is proved). Fortu-
nately, it is the skewed distributions, not the nearly-uniform ones,
that are of interest, and the algorithms report good results for skewed
distributions.
5.3 Experimental evaluation
We conducted a number of experiments comparing the accuracy
and overheads of the algorithms for approximate hot lists described
in Section 5.1. In each experiment, 500K new values were inserted into an initially empty data warehouse. Since the exact attribute values do not affect the relative quality of the techniques, we chose the integer value domain from [1, D], where D was varied from 500 to 50K. We used a variety of Zipf data distributions, focusing on the modest skew cases where the zipf parameter is 1.0, 1.25, or 1.5. Each of the three approximation algorithms is provided the same footprint m. Most of the experiments studied the footprint m = 1000 case. However, to stress the algorithms, we also considered footprint m = 100. Recall that if the ratio D/m is ≤ .5, then all values inserted into the warehouse can be maintained in both the concise sample and the counting sample. As before, we consider D/m = 5, 50, and 500.
Only the points reported by each algorithm are plotted. For the algorithms using traditional samples or concise samples, we use a confidence threshold δ = 3. Whenever the threshold is raised, the new threshold is set to ⌈1.1 × τ⌉, where τ is the current threshold.
These values gave better results than other choices we tried.
For the following explanation of the plots, we refer the reader
to Figure 4. This plots the most frequent values in the data warehouse in order of nonincreasing counts, together with their counts.
The x-axis depicts the rank of a value (the actual values are irrele-
vant here); the y-axis depicts the count for the value with that rank.
The k most frequent values are plotted, where k is the number of
values whose frequency matches or exceeds the minimum reported
count over the three approximation algorithms. Also plotted are
values reported by one or more of the approximation algorithms
that do not belong among the k most frequent values (to show false
positives). These values are tacked on at the right (after the short vertical line below the x-axis, e.g., between 22 and 23 in this fig-
ure) in nonincreasing order of their actual frequency; the x-axis
typically will not equal their rank since unreported values are not
plotted, creating gaps in the ranks. The exact counts are plotted as
histogram boxes.
The values and (estimated) counts reported by the three approx-
imation algorithms are plotted, one point per value reported. Any
gap in the values reported by an algorithm represents a false nega-
tive. For example, using traditional samples has false negatives for
the values with rank 7 and 8. The difference between a reported count and the top of the histogram box is the error in the reported count.
Figure 4: Comparison of hot-list algorithms, depicting the frequency of the most frequent values as reported by the four algorithms (using full histogram, concise samples, counting samples, and traditional samples). Data: 500000 values in [1,500], zipf parameter 1.5, footprint 100.
Figure 4 shows that even with a small footprint, good results are
obtained by the algorithms using concise samples and using count-
ing samples. Specifically, using counting samples accurately re-
ported the 15 most frequent values, 18 of the first 20, and had only two false positives (both of which were reported with a ≤ 37% overestimation in the counts). The count of the most frequent value was accurate to within .14%. Likewise, using concise samples did
almost as well as using counting samples, and much better than
using traditional samples. Using concise samples achieves better
results than using traditional samples because the sample-size was
over 3.8 times larger. Using counting samples achieves better re-
sults than using concise samples because the error in the counts is
only a one-time error arising prior to a value’s last tails flip with the
final threshold.
In order to depict plots from our experiments with footprint
1000, we needed to truncate the y-axis to improve readability. All
three approximation algorithms perform quite well at the handful
of (the most frequent) values not shown due to this truncation6, so
it is more revealing to focus on the rest of the plot.
Figure 5: Counting vs. traditional on a less skewed distribution (zipf parameter 1.0), using footprint 1000.
Figure 5 compares using counting samples versus using tradi-
tional samples on a less skewed distribution (zipf parameter equals
1.0). With a traditional sample of size 1000, there are only a hand-
ful of possible counts that can be reported, with each increment in
the number of sample points for a value adding 500 to the reported
count. This explains the horizontal rows of reported counts in the
figure. As in the previous plot, using counting samples performed
quite well, using concise samples (not shown to avoid cluttering
the plot) performed not quite as well, and using traditional samples
performed significantly worse.
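The granularity just described follows from the usual scale-up estimate for a traditional sample, in which each sample point stands for n/m occurrences; a tiny illustration (variable names ours):

    n, m = 500_000, 1000        # inserts seen so far and traditional sample size (footprint 1000)
    scale = n // m              # each sample point represents n/m = 500 occurrences
    for points in range(4):     # number of sample points held by some value
        print(points, points * scale)   # reportable counts: 0, 500, 1000, 1500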
Finally, in Figure 6, we plot the accuracy of the three approxi-
mation algorithms on an intermediate skewed distribution (zipf pa-
rameter equals 1.25). This plot also depicts the case of a larger
D/m ratio than the previous two plots. For readability, each algo-
rithm has its own plot, and the histogram boxes for the exact counts
have been replaced with a line connecting these counts. As above,
using counting samples is more accurate than using concise sam-
ples which is more accurate than using traditional samples. The
concise sample-size is nearly 3.5 times larger than the traditional
sample-size, leading to the differences between them shown in the
plots.
Table 2 reports the overheads of each approximation algorithm in terms of the number of coin flips and the number of lookups per new insert to the data warehouse. By these metrics, using traditional samples is better than using concise samples, which in turn is better than using counting samples, as anticipated. Also shown are the number of threshold raises, the final sample-size, the final threshold, and the number of values reported.
[Figure 6 plots (three panels, one per algorithm): counts (y-axis, up to 10,000) of roughly the 160 most frequent values (x-axis) as reported using the full histogram versus traditional samples, concise samples, and counting samples, respectively. Data: 500,000 values in [1,50000], Zipf parameter 1.25. Footprint: 1000.]
Figure 6: Comparison of traditional, concise, and counting
samples on a distribution with zipf parameter 1.25, using foot-
print 1000.
Table 2: Measured data for the hot-list algorithm experiments in Figures 4-6.

Figure 4                     flips   lookups   raises   sample-size   threshold   reported
Using concise samples        0.014   0.008     56       388           1283        18
Using counting samples       0.006   1.000     60       n/a           1881        20
Using traditional samples    0.003   0.000     n/a      100           n/a         9

Figure 5                     flips   lookups   raises   sample-size   threshold   reported
Using concise samples        0.040   0.024     40       1813          215         95
Using counting samples       0.053   1.000     47       n/a           541         92
Using traditional samples    0.025   0.000     n/a      1000          n/a         52

Figure 6                     flips   lookups   raises   sample-size   threshold   reported
Using concise samples        0.066   0.040     33       3498          140         108
Using counting samples       0.046   1.000     38       n/a           227         122
Using traditional samples    0.025   0.000     n/a      1000          n/a         38
The number of raises and the final threshold are larger when using counting samples than when using concise samples, since the counting sample tends to hold fewer values: because it counts all subsequent occurrences of a value once that value is in the sample, most values in the sample are represented as (value, count) pairs and not as singletons.
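The footprint accounting behind this observation can be sketched as follows (a sketch of ours, assuming the convention that a singleton costs one unit of footprint and a (value, count) pair costs two):

    def concise_footprint(sample_counts):
        # sample_counts: value -> number of sample points represented for that value.
        singletons = sum(1 for c in sample_counts.values() if c == 1)
        pairs = sum(1 for c in sample_counts.values() if c > 1)
        sample_size = sum(sample_counts.values())   # total sample points represented
        footprint = singletons + 2 * pairs          # storage actually consumed
        return sample_size, footprint

    print(concise_footprint({'a': 10, 'b': 3, 'c': 1, 'd': 1}))   # (15, 6)

A sample dominated by pairs spends two units of footprint per distinct value, so it holds fewer distinct values within the same footprint; this is the situation counting samples tend to be in.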
6 Conclusions
Providing an immediate, approximate answer to a query whose ex-
act answer takes many orders of magnitude longer to compute is
an attractive option in a number of scenarios. We have presented
a framework for an approximate query engine that observes new
data as it arrives and maintains small synopses on that data. We
have described metrics for evaluating such synopses.
We introduce and study two new sampling-based synopses: con-
cise samples and counting samples. We quantify their advantages
in sample-size over traditional samples with the same footprint in
the best case, in the general case, and in the case of exponential and
zipf distributions. We present an algorithm for the fast incremental
maintenance of concise samples regardless of the data distribution,
and experimental evidence that the algorithm achieves a sample-
size within 1%-28% of that of recomputing the concise sample from scratch at each insert to the data warehouse. The overheads
of the maintenance
algorithm are shown to be quite small. For
counting samples, we present an algorithm for the fast incremen-
tal maintenance under both insertions and deletions, with provable
guarantees regardless of the data distribution. Random samples are
useful in a number of approximate query answering scenarios. The
confidence for such an approximate answer increases with the size
of the samples, so using concise or counting samples can signifi-
cantly increase the confidence as compared with using traditional
samples.
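As a rough illustration of this point (not an analysis from the paper): for a sampled proportion, the half-width of a standard normal-approximation confidence interval shrinks like one over the square root of the number of sample points, so holding 388 points instead of 100 in the same footprint (as in the Figure 4 experiment) tightens the interval by roughly a factor of two.

    import math

    def ci_halfwidth(p_hat, n, z=1.96):
        # Normal-approximation 95% confidence interval half-width for a sampled proportion.
        return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

    print(ci_halfwidth(0.1, 100))   # traditional sample: 100 sample points in footprint 100
    print(ci_halfwidth(0.1, 388))   # concise sample: 388 sample points in the same footprint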
Finally, we consider the problem of providing fast approximate
answers to hot list queries. We present algorithms based on using
traditional samples, concise samples, and counting samples. These
are the first incremental algorithms for this problem; moreover, we
provide analysis and experiments showing their effectiveness and
overheads. Using counting samples is shown to be the most ac-
curate, and far superior to using traditional samples; using con-
cise samples falls in between: nearly matching counting samples
at high skew but nearly matching traditional samples at very low
skew. On the other hand, the overheads are the smallest using tra-
ditional samples, and the largest using counting samples. We show
both with analysis and experiments that the cost incurred when rais-
ing a threshold can be amortized across the entire sequence of data
warehouse updates. We believe that using concise samples may of-
fer the best choice when considering both accuracy and overheads.
In this paper, we have assumed a batch-like processing of data
warehouse inserts, in which inserts and queries do not intermix (the
common case in practice). Handling the more general case (which may soon become the common case) requires addressing potential concurrency bottlenecks.
Future work is to explore the effectiveness of using concise
samples and counting samples for other concrete approximate an-
swer scenarios. More generally, the area of approximate query
answers is in its infancy, and many new techniques are needed to
make it an effective alternative to traditional query answers.
In [GPA+98], we present some recent progress towards developing
an effective approximate query answering engine.
Acknowledgments
This work was done while the second author was a member of the
Information Sciences Research Center, Bell Laboratories, Murray
Hill, NJ USA. We thank Vishy Poosala for many discussions re-
lated to this work. We also thank S. Muthukrishnan, Rajeev Ras-
togi, Kyuseok Shim, Jeff Vitter and Andy Witkowski for helpful
discussions related to this work.
References
[AMS96] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proc. 28th ACM Symp. on the Theory of Computing, pages 20-29, May 1996.

[Ant92] G. Antoshenkov. Random sampling from pseudo-ranked B+ trees. In Proc. 18th International Conf. on Very Large Data Bases, pages 375-382, August 1992.

[AS94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. 20th International Conf. on Very Large Data Bases, pages 487-499, September 1994.

[AZ96] G. Antoshenkov and M. Ziauddin. Query processing and optimization in Oracle Rdb. VLDB Journal, 5(4):229-237, 1996.

[BDF+97] D. Barbara, W. DuMouchel, C. Faloutsos, P. J. Haas, J. M. Hellerstein, Y. Ioannidis, H. V. Jagadish, T. Johnson, R. Ng, V. Poosala, K. A. Ross, and K. C. Sevcik. The New Jersey data reduction report. Bulletin of the Technical Committee on Data Engineering, 20(4):3-45, 1997.

[BM96] R. J. Bayardo, Jr. and D. P. Miranker. Processing queries for first-few answers. In Proc. 5th International Conf. on Information and Knowledge Management, pages 45-52, 1996.

[BMUT97] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 255-264, May 1997.

[FJS97] C. Faloutsos, H. V. Jagadish, and N. D. Sidiropoulos. Recovering information from summary data. In Proc. 23rd International Conf. on Very Large Data Bases, pages 36-45, August 1997.

[Fla85] P. Flajolet. Approximate counting: a detailed analysis. BIT, 25:113-134, 1985.

[FM83] P. Flajolet and G. N. Martin. Probabilistic counting. In Proc. 24th IEEE Symp. on Foundations of Computer Science, pages 76-82, October 1983.

[FM85] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Computer and System Sciences, 31:182-209, 1985.

[GM95] P. B. Gibbons and Y. Matias, August 1995. Presentation and feedback during a Bell Labs-Teradata presentation to Walmart scientists and executives on proposed improvements to the Teradata DBS.

[GM97] P. B. Gibbons and Y. Matias. Synopsis data structures, concise samples, and mode statistics. Manuscript, July 1997.

[GMP97a] P. B. Gibbons, Y. Matias, and V. Poosala. Aqua project white paper. Technical report, Bell Laboratories, Murray Hill, New Jersey, December 1997.

[GMP97b] P. B. Gibbons, Y. Matias, and V. Poosala. Fast incremental maintenance of approximate histograms. In Proc. 23rd International Conf. on Very Large Data Bases, pages 466-475, August 1997.

[GPA+98] P. B. Gibbons, V. Poosala, S. Acharya, Y. Bartal, Y. Matias, S. Muthukrishnan, S. Ramaswamy, and T. Suel. AQUA: System and techniques for approximate query answering. Technical report, Bell Laboratories, Murray Hill, New Jersey, February 1998.

[HHW97] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 171-182, May 1997.

[HK95] M. Hofri and N. Kechris. Probabilistic counting of a large number of events. Manuscript, 1995.

[HNSS95] P. J. Haas, J. F. Naughton, S. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. In Proc. 21st International Conf. on Very Large Data Bases, pages 311-322, September 1995.

[IC93] Y. E. Ioannidis and S. Christodoulakis. Optimal histograms for limiting worst-case error propagation in the size of join results. ACM Transactions on Database Systems, 18(4):709-748, 1993.

[Ioa93] Y. E. Ioannidis. Universality of serial histograms. In Proc. 19th International Conf. on Very Large Data Bases, pages 256-267, August 1993.

[IP95] Y. E. Ioannidis and V. Poosala. Balancing histogram optimality and practicality for query result size estimation. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 233-244, May 1995.

[Mat92] Y. Matias. Highly Parallel Randomized Algorithmics. PhD thesis, Tel Aviv University, Israel, 1992.

[Mor78] R. Morris. Counting large numbers of events in small registers. Communications of the ACM, 21:840-842, 1978.

[MSY96] Y. Matias, S. C. Sahinalp, and N. E. Young. Performance evaluation of approximate priority queues. Presented at DIMACS Fifth Implementation Challenge: Priority Queues, Dictionaries, and Point Sets, organized by D. S. Johnson and C. McGeoch, October 1996.

[MVN93] Y. Matias, J. S. Vitter, and W.-C. Ni. Dynamic generation of discrete random variates. In Proc. 4th ACM-SIAM Symp. on Discrete Algorithms, pages 361-370, January 1993.

[MVY94] Y. Matias, J. S. Vitter, and N. E. Young. Approximate data structures with applications. In Proc. 5th ACM-SIAM Symp. on Discrete Algorithms, pages 187-194, January 1994.

[OR89] F. Olken and D. Rotem. Random sampling from B+ trees. In Proc. 15th International Conf. on Very Large Data Bases, pages 269-277, 1989.

[OR92] F. Olken and D. Rotem. Maintenance of materialized views of sampling queries. In Proc. 8th IEEE International Conf. on Data Engineering, pages 632-641, February 1992.

[PIHS96] V. Poosala, Y. E. Ioannidis, P. J. Haas, and E. J. Shekita. Improved histograms for selectivity estimation of range predicates. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 294-305, June 1996.

[Pre97] D. Pregibon. Mega-monitoring: Developing and using telecommunications signatures, October 1997. Invited talk at the DIMACS Workshop on Massive Data Sets in Telecommunications.

[Vit85] J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37-57, 1985.

[VL93] S. V. Vrbsky and J. W. S. Liu. APPROXIMATE: a query processor that produces monotonically improving approximate answers. IEEE Trans. on Knowledge and Data Engineering, 5(6):1056-1068, 1993.

[WVZT90] K.-Y. Whang, B. T. Vander-Zanden, and H. M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Transactions on Database Systems, 15(2):208-229, 1990.