Shattered set: Difference between revisions

Content deleted Content added

Inline

Revision as of 23:01, 30 November 2014

The concept of shattered sets plays an important role in Vapnik–Chervonenkis theory, also known as VC-theory. Shattering and VC-theory are used in the study of empirical processes as well as in statistical computational learning theory.

Definition

Suppose we have a class C of sets and a given set A. C is said to shatter A if, for each subset T of A, there is some element U of C such that

U\cap A=T.\,

Equivalently, C shatters A when the power set P(A) = { U ∩ A | U ∈ C }.

For example, the class C of all discs in the plane (two-dimensional space) cannot shatter every set A of four points, yet the class of all convex sets in the plane shatters every finite set on the (unit) circle. (For the latter result, connect the dots!)

We employ the letter C to refer to a "class" or "collection" of sets, as in a Vapnik–Chervonenkis class (VC-class). The set A is often assumed to be finite because, in empirical processes, we are interested in the shattering of finite sets of data points.

Example

Suppose we have a set A of four points on the unit circle, and we wish to know if it is shattered by the class C of all discs.

The set A of four particular points on the unit circle. (the unit circle is shown in purple))

To test this, we attempt to draw a disc around every subset of points in A. First, we draw a disc around the subsets of each isolated point. Next, we try to draw a disc around every subset of point pairs. This turns out to be doable for adjacent points, but impossible for points on opposite sides of the circle. As visualized below:

Each individual point can be isolated with a disc (showing all four).
Each subset of adjacent points can be isolated with a disc (showing one of four).
A subset of points on opposite sides of the unit circle can not be isolated with a disc.

Because there is some subset which can not be isolated by any disc in C, we conclude then that A is not shattered by C. And, with a bit of thought, we can prove that NO set of four points is shattered by this C.

However, if we redefine C to be the class of all elliptical discs, we find that we can still isolate all the subsets from above, as well as the points which were problems. Thus, this specific set of 4 points is shattered by the class of elliptical discs. Visualized below:

Opposite points of A are now separable by some ellipse (showing one of two)
Each subset of three points in A is also separable by some ellipse (showing one of four)

With a bit of thought, we could generalize that any set of finite points on a unit circle could be shattered by the class of all convex sets (visualize connecting the dots).

Shatter coefficient

To quantify the richness of a collection C of sets, we use the concept of shattering coefficients (also known as shatter coefficients or the growth function). For a collection C of sets s⊂Ω, Ω being any space, often a probability space, we define the n^th shattering coefficient of C as

S_{C}(n)=\max _{x_{1},x_{2},\dots ,x_{n}\in \Omega }\operatorname {card} \{\,\{\,x_{1},x_{2},\dots ,x_{n}\}\cap s,s\in C\}

where $\operatorname {card}$ denotes the cardinality of the set.

$S_{C}(n)$ is equal to the largest number of subsets of any set A of n points that can be formed by intersecting A with the sets in collection C.

Here are some facts about $S_{C}(n)$ :

1.

S_{C}(n)\leq 2^{n}

for all n because

\{s\cap A|s\in C\}\subseteq P(A)

for any

A\subseteq \Omega

.

2. If

S_{C}(n)=2^{n}

, that means there is a set of cardinality n, which can be shattered by C.

3. If

S_{C}(N)<2^{N}

for some

N>1

then

S_{C}(n)<2^{n}

for all

n\geq N

.

The third property means that if C cannot shatter any set of cardinality N then it can not shatter sets of larger cardinalities.

Vapnik–Chervonenkis class

The VC dimension of a class C is defined as

VC(C)={\underset {n}{\min }}\{n:S_{C}(n)<2^{n}\}\,

or, alternatively, as

VC_{0}(C)={\underset {n}{\max }}\{n:S_{C}(n)=2^{n}\}.\,

Note that $VC(C)=VC_{0}(C)+1.$

If for any n there is a set of cardinality n which can be shattered by C, then $S_{C}(n)=2^{n}$ for all n and the VC dimension of this class C is infinite.

A class with finite VC dimension is called a Vapnik–Chervonenkis class or VC class. A class C is uniformly Glivenko–Cantelli if and only if it is a VC class.

References

Wencour, R. S.; Dudley, R. M. (1981), "Some special Vapnik–Chervonenkis classes", Discrete Mathematics, 33 (3): 313–318, doi:10.1016/0012-365X(81)90274-0.
Steele, J. M. (1975), Combinatorial Entropy and Uniform Limit Laws, Ph.D. thesis, Stanford University, Mathematics Department
Steele, J. M. (1978), "Empirical discrepancies and subadditive processes", Annals of Probability, 6 (1): 118–227, doi:10.1214/aop/1176995615, JSTOR 2242865.

External links

Origin of "Shattered sets" terminology, by J. Steele

@@ Line 9: / Line 9: @@
 Equivalently, ''C'' shatters ''A'' when the [[power set]] ''P''(''A'') = { ''U'' ∩ ''A'' | ''U'' ∈ ''C'' }.
-For example, the class  ''C''  of all [[disc (geometry)|discs]] in the [[plane (geometry)|plane]] (two-dimensional space) cannot shatter every set ''A'' of four points, yet the class of all [[convex set]]s in the plane shatters every finite set on the (unit) circle. (For the collection of all convex sets, connect the dots!<!-- Is this metaphorical or literary?--><!-- Neither. It's a proof of the second assertion.-->)
+For example, the class  ''C''  of all [[disc (geometry)|discs]] in the [[plane (geometry)|plane]] (two-dimensional space) cannot shatter every set ''A'' of four points, yet the class of all [[convex set]]s in the plane shatters every finite set on the (unit) circle. (For the latter result, connect the dots!<!-- Is this metaphorical or literary?--><!-- Neither. It's a proof of the second assertion.-->)
 We employ the letter  ''C''  to refer to a "class" or "collection" of sets, as in a Vapnik&ndash;Chervonenkis class (VC-class).  The set ''A'' is often assumed to be [[finite set|finite]] because, in empirical processes, we are interested in the shattering of finite sets of data points.