Count-distinct problem: Difference between revisions

Revision as of 16:25, 17 October 2014

In computer science, the count-distinct problem ^[1] (also known in applied mathematics as the cardinality estimation problem) is the problem of finding the number of distinct elements in a data stream with repeated elements. This is a well-known problem with numerous applications. The elements might represent IP addresses of packets passing through a router, elements in a large database, motifs in a DNA sequence, or elements of RFID/sensor networks.

Formal Definition

Instance: A stream of elements $x_{1},x_{2},\ldots ,x_{s}$ with repetitions, and an integer $m$ . Let $n$ be the number of distinct elements, namely $n=|\left\{{x_{1},x_{2},\ldots ,x_{s}}\right\}|$ , and let these elements be $\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}$ .

Objective: Find an estimate ${\widehat {n}}$ of $n$ using only $m$ storage units, where $m\ll n$ .

An example of an instance for the cardinality estimation problem is the stream: $a,b,a,c,d,b,d$ . For this instance, $n=|\left\{{a,b,c,d}\right\}|=4$ .

Naive Solution

The naive solution to the problem is as follows:

 1. Initialize a counter, $c$ , to zero, $c\leftarrow 0$ .
 2. Initialize an efficient dictionary data structure, $D$ , such as a hash table or search tree, in which insertion and membership queries can be performed quickly.
 3. For each element $x_{i}$ , issue a membership query on $D$ :
    If $x_{i}$ is not a member of $D$ ( $x_{i}\notin D$ ), add $x_{i}$ to $D$ and increase $c$ by one, $c\leftarrow c+1$ .
    Otherwise ( $x_{i}\in D$ ), do nothing.
 4. Output $n=c$ .
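The steps above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation; the function name `count_distinct_exact` is chosen here for the example, and a built-in `set` plays the role of the dictionary $D$ .

```python
from typing import Hashable, Iterable


def count_distinct_exact(stream: Iterable[Hashable]) -> int:
    """Exact count of distinct elements; memory grows with the answer."""
    c = 0        # counter c, initialized to zero
    D = set()    # dictionary D with fast insertion and membership tests
    for x in stream:
        if x not in D:   # membership query
            D.add(x)
            c += 1
        # otherwise x was already seen: do nothing
    return c


# The example stream a, b, a, c, d, b, d from the formal definition
print(count_distinct_exact(["a", "b", "a", "c", "d", "b", "d"]))  # prints 4
```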

As long as the number of distinct elements is not too big, $D$ fits in main memory and an exact answer can be retrieved. However, the storage used by $D$ grows linearly with $n$ , so this approach does not scale when storage is bounded or when the computation performed for each element $x_{i}$ must be minimized. For such cases, several streaming algorithms have been proposed which use a fixed number of storage units.

Streaming Algorithms

State-of-the-art estimators hash every element $e_{j}$ into a low dimensional data sketch $h(e_{j})$ , which can be viewed as a random variable (RV). The different techniques can be classified according to the data sketches they store for future processing.

Min/max sketches ^[2] ^[3] store only the minimum/maximum hashed values. Examples of known min/max sketch estimators: Chassaing et al. ^[4] present a max sketch which is the minimum-variance unbiased estimator for the problem. The continuous max sketches estimator ^[5] is the maximum likelihood estimator. The best known estimator is the HyperLogLog algorithm ^[6] , which offers the best tradeoff between precision and storage size.
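To make the HyperLogLog idea more concrete, here is a compact Python sketch of the commonly described outline: the leading bits of each hash select a register, each register keeps the maximum position of the first 1-bit seen in the remaining bits, and the registers are combined through a bias-corrected harmonic mean. The class name, the SHA-256-derived 64-bit hash, and the default of $2^{10}$ registers are assumptions made for this illustration; the bias constant used is the usual approximation for 128 or more registers, and only the small-range (linear counting) correction is included.

```python
import hashlib
import math


def hash64(item) -> int:
    """64-bit hash derived from SHA-256 (an illustrative choice)."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HyperLogLog:
    def __init__(self, p: int = 10):
        self.p = p                 # number of index bits (assumes p >= 7)
        self.m = 1 << p            # number of registers
        self.registers = [0] * self.m

    def add(self, item) -> None:
        h = hash64(item)
        idx = h >> (64 - self.p)                  # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Position of the leftmost 1-bit in `rest`, counting from 1
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def estimate(self) -> float:
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # bias correction, m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return m * math.log(m / zeros)        # small-range correction
        return raw


if __name__ == "__main__":
    hll = HyperLogLog()
    for i in range(50000):
        hll.add(f"item-{i % 20000}")   # 20000 distinct values, with repeats
    print(round(hll.estimate()))
```

With $2^{10}$ registers the standard error is roughly 3%, so the printed value should land near 20000 without matching it exactly.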

The intuition behind such estimators is that each sketch carries information about the desired quantity. For example, when every element $e_{j}$ is associated with a uniform RV, $h(e_{j})\sim U(0,1)$ , the expected minimum value of $h(e_{1}),h(e_{2}),\ldots ,h(e_{n})$ is $1/(n+1)$ . The hash function guarantees that $h(e_{j})$ is identical for all the appearances of $e_{j}$ . Thus, the existence of duplicates does not affect the value of the extreme order statistics.
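Since the expected minimum is $1/(n+1)$ , inverting an observed minimum and subtracting one gives a rough estimate of $n$ . The Python sketch below illustrates this under stated assumptions: the hash is simulated with salted SHA-256, and because a single minimum is very noisy, the minima of several independent hash functions are averaged before inverting. The function names and the choice of 256 hash functions are hypothetical, and this simple averaging is not one of the cited estimators.

```python
import hashlib


def uniform_hash(item, salt: int) -> float:
    """Map (item, salt) to a value that behaves like a U(0, 1] sample."""
    digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)


def min_sketch(stream, num_hashes: int = 256) -> list:
    """Keep only the minimum hashed value for each hash function."""
    mins = [1.0] * num_hashes
    for x in stream:
        for k in range(num_hashes):
            v = uniform_hash(x, k)
            if v < mins[k]:
                mins[k] = v
    return mins


def estimate_from_mins(mins) -> float:
    """Invert E[min] = 1 / (n + 1) using the average observed minimum."""
    mean_min = sum(mins) / len(mins)
    return 1.0 / mean_min - 1.0


if __name__ == "__main__":
    stream = [i % 500 for i in range(2000)]   # 500 distinct values, repeated
    print(estimate_from_mins(min_sketch(stream)))
```

Because every appearance of an element hashes to the same value, feeding the stream with or without its duplicates produces an identical sketch. The per-element cost here is proportional to the number of hash functions, which is one reason practical schemes avoid this naive averaging.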

There are estimation techniques other than min/max sketches. The first paper on count-distinct estimation, by Flajolet et al. ^[7] , describes a bit pattern sketch. In this case, the elements are hashed into a bit vector and the sketch holds the logical OR of all hashed values. Bottom-m sketches ^[8] are a generalization of min sketches, which maintain the $m$ minimal values, where $m\geq 1$ . See Cosma et al. ^[2] for a theoretical overview of count-distinct estimation algorithms, and Metwally ^[9] for a practical overview with comparative simulation results.
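A bottom-m sketch can be illustrated with the short Python example below. It keeps the $m$ smallest distinct hash values and, as an assumption of this illustration rather than a detail from the cited paper, uses the widely used estimate $(m-1)/v_{(m)}$ , where $v_{(m)}$ is the $m$ -th smallest hash value. The function names and the SHA-256-based hash are hypothetical choices.

```python
import hashlib
import heapq


def uniform_hash(item) -> float:
    """Map an item to a value that behaves like a U(0, 1] sample."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)


def bottom_m_estimate(stream, m: int = 256) -> float:
    """Estimate the distinct count from the m smallest hash values."""
    heap = []        # max-heap via negation: -heap[0] is the largest kept value
    kept = set()     # values currently in the sketch, to skip duplicates
    for x in stream:
        v = uniform_hash(x)
        if v in kept:
            continue
        if len(heap) < m:
            heapq.heappush(heap, -v)
            kept.add(v)
        elif v < -heap[0]:
            evicted = -heapq.heappushpop(heap, -v)   # drop the largest value
            kept.discard(evicted)
            kept.add(v)
    if len(heap) < m:
        # Fewer than m distinct values seen: the sketch holds all of them
        return float(len(heap))
    return (m - 1) / (-heap[0])


if __name__ == "__main__":
    stream = [i % 5000 for i in range(15000)]   # 5000 distinct values
    print(bottom_m_estimate(stream))
```

The sketch stores at most $m$ values regardless of the stream length, which matches the storage constraint in the formal definition.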

Weighted Count-Distinct Problem

In its weighted version, each element is associated with a weight and the goal is to estimate the total weight of the distinct elements. Formally,

Instance: A stream of weighted elements $x_{1},x_{2},\ldots ,x_{s}$ with repetitions, and an integer $m$ . Let $n$ be the number of distinct elements, namely $n=|\left\{{x_{1},x_{2},\ldots ,x_{s}}\right\}|$ , and let these elements be $\left\{{e_{1},e_{2},\ldots ,e_{n}}\right\}$ . Finally, let $w_{j}$ be the weight of $e_{j}$ .

Objective: Find an estimate ${\widehat {w}}$ of $w=\sum _{j=1}^{n}w_{j}$ using only $m$ storage units, where $m\ll n$ .

An example of an instance for the weighted problem is: $a(3),b(4),a(3),c(2),d(3),b(4),d(3)$ . For this instance, $e_{1}=a,e_{2}=b,e_{3}=c,e_{4}=d$ , the weights are $w_{1}=3,w_{2}=4,w_{3}=2,w_{4}=3$ and $\sum {w_{j}}=12$ .
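For comparison with the estimators, the exact answer for this instance can be computed with a dictionary, in the same way as the naive solution for the unweighted problem. The function name and the representation of the stream as (element, weight) pairs are assumptions made for this illustration.

```python
def weighted_distinct_sum_exact(stream) -> float:
    """Exact sum of weights over distinct elements; memory grows with n."""
    weights = {}
    for element, weight in stream:
        # Record the weight on first appearance; repeats carry the same weight
        weights.setdefault(element, weight)
    return sum(weights.values())


stream = [("a", 3), ("b", 4), ("a", 3), ("c", 2), ("d", 3), ("b", 4), ("d", 3)]
print(weighted_distinct_sum_exact(stream))  # prints 12
```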

As an application example, $x_{1},x_{2},\ldots ,x_{s}$ could be IP packets received by a server. Each packet belongs to one of $n$ IP flows $e_{1},e_{2},\ldots ,e_{n}$ . The weight $w_{j}$ can be the load imposed by flow $e_{j}$ on the server. Thus, $\sum _{j=1}^{n}{w_{j}}$ represents the total load imposed on the server by all the flows to which packets $x_{1},x_{2},\ldots ,x_{s}$ belong.

Solving the Weighted Count-Distinct Problem

Any extreme order statistics estimator (min/max sketches) for the unweighted problem can be generalized to an estimator for the weighted problem ^[10] . For example, the weighted estimator proposed by Cohen et al. ^[5] is obtained by extending the continuous max sketches estimator to the weighted problem. In particular, the HyperLogLog algorithm ^[6] can be extended to solve the weighted problem. The extended HyperLogLog algorithm offers the best performance, in terms of statistical accuracy and memory usage, among all known algorithms for the weighted problem.
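One way to see why extreme order statistics carry over to weights is the following Python sketch, which is an assumption of this illustration and not a transcription of the cited schemes. Each element $e_{j}$ is hashed to an exponential variable with rate $w_{j}$ ; the minimum over the distinct elements is then exponential with rate $w=\sum _{j}w_{j}$ , and with $m$ independent hash functions the quantity $(m-1)/\sum _{k}\min _{k}$ is an unbiased estimate of $w$ . Function names and the salted SHA-256 hash are hypothetical.

```python
import hashlib
import math


def uniform_hash(item, salt: int) -> float:
    """Map (item, salt) to a value that behaves like a U(0, 1] sample."""
    digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") + 1) / (2**64 + 1)


def weighted_min_sketch(stream, num_hashes: int = 1024) -> list:
    """For each hash function, keep the minimum of Exp(rate=weight) values."""
    mins = [math.inf] * num_hashes
    for element, weight in stream:
        for k in range(num_hashes):
            # -ln(U) / w is exponentially distributed with rate w
            v = -math.log(uniform_hash(element, k)) / weight
            if v < mins[k]:
                mins[k] = v
    return mins


def estimate_total_weight(mins) -> float:
    """Each minimum is Exp(rate=w), so (m - 1) / sum(mins) estimates w."""
    return (len(mins) - 1) / sum(mins)


if __name__ == "__main__":
    stream = [("a", 3), ("b", 4), ("a", 3), ("c", 2), ("d", 3), ("b", 4), ("d", 3)]
    print(estimate_total_weight(weighted_min_sketch(stream)))
```

For the example instance the true total weight is 12; the printed estimate should be close to that value, with a relative error on the order of $1/{\sqrt {m}}$ .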

References

[1] Ullman, Jeff; Rajaraman, Anand; Leskovec, Jure. "Mining data streams" (PDF).
[2] Cosma, Ioana A.; Clifford, Peter (2011). "A statistical analysis of probabilistic counting algorithms". Scandinavian Journal of Statistics.
[3] Giroire, Frederic; Fusy, Eric (2007). "Estimating the Number of Active Flows in a Data Stream over a Sliding Window" (PDF). ANALCO.
[4] Chassaing, Philippe; Gerin, Lucas (2006). "Efficient estimation of the cardinality of large data sets". Proceedings of the 4th Colloquium on Mathematics and Computer Science.
[5] Cohen, Edith (1997). "Size-estimation framework with applications to transitive closure and reachability". J. Comput. Syst. Sci.
[6] Flajolet, Philippe; Fusy, Eric; Gandouet, Olivier; Meunier, Frederic (2007). "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm". Analysis of Algorithms (AofA) 2007.
[7] Flajolet, Philippe; Martin, G. Nigel (1985). "Probabilistic counting algorithms for data base applications". J. Comput. Syst. Sci. Academic Press, Inc.
[8] Cohen, Edith; Kaplan, Haim (2008). "Tighter estimation using bottom k sketches" (PDF). PVLDB. Academic Press, Inc.
[9] Metwally, Ahmed; Agrawal, Divyakant; Abbadi, Amr El (2008). "Why go logarithmic if we can go linear?: Towards effective distinct counting of search traffic". Proceedings of the 11th international conference on Extending Database Technology: Advances in Database Technology.
[10] Cohen, Reuven; Katzir, Liran; Yehezkel, Aviv (2014). "A Unified Scheme for Generalizing Cardinality Estimators to Sum Aggregation". Information Processing Letters.

@@ Line 13: / Line 13: @@
-==Trivial Solution==
+==Naive Solution==
+The naive solution to the problem is as follows:
-The obvious solution to the problem is to keep a list of all the elements seen so far in the stream. Keep them in an efficient search structure such as hash table or search tree, so one can quickly insert elements to the list and check whether or not an element that just arrived on the stream was already seen.
+# Numbered list item
-One can easily find the exact value of <math> n </math>, in the following way. When a new element <math> x_i </math>, is encountered, compare its value to every distinct (stored) value encountered so far. If the value of <math> x_i</math> has not been seen before, keep it in the storage as well. After all the elements are treated, count the number of stored elements.
+  Initialize a counter, <math> c </math>, to zero, <math> c \leftarrow 0 </math>.
+  Initialize an efficient dictionary data structure, <math> D </math>, such as hash table or search tree in which insertion and membership can be performed quickly.
-As long as the number of distinct elements is not too big, this structure can fit in main memory and there is no problem obtaining an exact answer to the question how many distinct elements appear in the stream.
+  For each element <math> x_i </math>, a membership query is issued.
-However, this simple approach does not scale if storage is limited, or if the computation performed for each element <math> x_i </math> should be minimized. In such a case, several [[streaming algorithms]] have been proposed to solve the count-distinct estimation problem.
+  If <math> x_i </math> is not a member of <math> D </math> (<math> x_i \notin D </math>)
+    Add <math> x_i </math> to <math> D </math>
+    Increase <math> c </math> by one, <math> c \leftarrow c + 1</math>
+  Otherwise (<math> x_i \in D </math>) do nothing.
+  Output <math> n = c </math>.
+As long as the number of distinct elements is not too big, <math> D </math> fits in main memory and an exact answer can be retrieved.
+However, this approach does not scale for bounded storage, or if the computation performed for each element <math> x_i </math> should be minimized. In such a case, several [[streaming algorithms]] have been proposed which use a fixed number of storage units.
 ==Streaming Algorithms==