Linear classifier

From Wikipedia, the free encyclopedia

Revision as of 21:48, 24 October 2006

A linear classifier is a classifier that bases its decision on a linear function of its inputs. That is, if the input feature vector to the classifier is a real vector x, then the estimated output score (or probability) is

    y = f(w · x) = f(Σ_j w_j x_j),

where w is a real vector of weights and f is a function that converts the dot product of the two vectors into the desired output. Often f is a simple function that maps all values above a certain threshold to "yes" and all other values to "no".
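
As a rough illustration of this decision rule, the following Python sketch computes the score w · x and applies a threshold; the weights, threshold, and feature vector are made up for the example and are not drawn from any particular training method.

    import numpy as np

    def linear_classify(x, w, threshold=0.0):
        """Return "yes" if the linear score w . x exceeds the threshold, else "no"."""
        score = np.dot(w, x)              # the linear score w . x fed into f
        return "yes" if score > threshold else "no"

    # Illustrative weights and a 3-dimensional feature vector (made up)
    w = np.array([0.5, -1.0, 2.0])
    x = np.array([1.0, 0.2, 0.4])
    print(linear_classify(x, w))          # score = 0.5 - 0.2 + 0.8 = 1.1 -> "yes"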

For a two-class classification problem, one can visualize the operation of a linear classifier as splitting a high-dimensional input space with a hyperplane: all points on one side of the hyperplane are classified as "yes", while the others are classified as "no".

A linear classifier is often used in situations where the speed of classification is an issue, since it is often the fastest classifier, especially when x is sparse. However, decision trees can be faster. Also, linear classifiers often work very well when the number of dimensions in x is large, as in document classification, where each element in x is typically the count of a word in a document (see document-term matrix). In such cases, the classifier should be well-regularized.
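
As a rough sketch of such count features (the vocabulary and document below are invented for illustration), each document becomes a mostly-zero vector of word counts:

    from collections import Counter

    vocabulary = ["classifier", "linear", "margin", "kernel"]   # hypothetical tiny vocabulary

    def count_vector(document, vocabulary):
        """Map a document to a vector of word counts over a fixed vocabulary."""
        counts = Counter(document.lower().split())
        return [counts[word] for word in vocabulary]

    print(count_vector("Linear classifier linear margin", vocabulary))   # -> [1, 2, 1, 0]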

Generative models vs. discriminative models

There are two main approaches for determining the parameters of a linear classifier w [1][2]. The first is by modeling conditional density functions P(x | class). Examples of such algorithms include:

  • Linear Discriminant Analysis (or Fisher's linear discriminator) (LDA) --- assumes Gaussian conditional density models
  • Naive Bayes classifier --- assumes independent binomial conditional density models.

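A hedged sketch of this first (generative) approach: assuming Gaussian class-conditional densities with a shared covariance matrix, as in LDA, the class posteriors reduce to a linear decision rule. The synthetic data and equal-prior assumption below are illustrative only.

    import numpy as np

    # Synthetic two-class data (illustrative only)
    rng = np.random.default_rng(0)
    X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))   # class 0
    X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(50, 2))   # class 1

    # Estimate the class means and a pooled (shared) covariance matrix
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    pooled_cov = (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)) / 2.0

    # With equal priors, comparing the two Gaussian densities gives a linear rule:
    # predict class 1 when w . x + b > 0
    w = np.linalg.solve(pooled_cov, mu1 - mu0)
    b = -0.5 * (mu0 + mu1) @ w

    x_new = np.array([1.5, 1.0])
    print("class 1" if w @ x_new + b > 0 else "class 0")
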
The second approach is called discriminative training, which attempts to maximize the quality of the output on a training set. Additional terms in the training cost function can easily regularize the final model. Examples of discriminative training of linear classifiers include:

  • Logistic regression --- maximum likelihood estimation of w assuming that the observed training set was generated by a binomial model that depends on the output of the classifier.
  • Perceptron --- an algorithm that attempts to fix all errors encountered in the training set (a minimal sketch is given after this list).
  • Support vector machine --- an algorithm that maximizes the margin between the decision hyperplane and the examples in the training set.
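
Below is a minimal sketch of the perceptron rule mentioned in the list above, assuming labels coded as +1/-1 and a small made-up training set:

    import numpy as np

    def train_perceptron(X, y, epochs=10):
        """Perceptron: whenever an example is misclassified, move the hyperplane toward it."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                if y_i * (w @ x_i + b) <= 0:   # misclassified (or exactly on the boundary)
                    w += y_i * x_i             # fix the error seen on this example
                    b += y_i
        return w, b

    # Toy, linearly separable data (made up for illustration)
    X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
    y = np.array([1, 1, -1, -1])
    w, b = train_perceptron(X, y)
    print(w, b)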

Note: Despite its name, LDA does not belong to the class of discriminative models in this taxonomy. However, its name makes sense when we compare LDA to the other main linear dimensionality reduction algorithm, Principal Components Analysis (PCA): LDA is a supervised learning algorithm that uses the labels of the data, while PCA is an unsupervised learning algorithm that ignores them. To summarize, the name is a historical artifact (see [3], p. 117).

Discriminative training often yields higher accuracy than modeling the conditional density functions. However, handling missing data is often easier with conditional density models.

All of the linear classifier algorithms listed above can be converted into non-linear algorithms operating on a different input space φ(x), using the kernel trick.
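
As a hedged sketch of what the kernel trick looks like in practice, the perceptron above can be rewritten so that it only uses kernel evaluations k(x_i, x) = φ(x_i) · φ(x) between examples; the Gaussian (RBF) kernel and the one-dimensional toy data below are illustrative choices, not part of the original algorithm descriptions.

    import numpy as np

    def rbf_kernel(a, b, gamma=1.0):
        """Gaussian (RBF) kernel: an implicit dot product phi(a) . phi(b)."""
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def train_kernel_perceptron(X, y, epochs=10, kernel=rbf_kernel):
        """Kernel perceptron: keep one coefficient per training example instead of w."""
        alpha = np.zeros(len(X))
        for _ in range(epochs):
            for i, (x_i, y_i) in enumerate(zip(X, y)):
                score = sum(alpha[j] * y[j] * kernel(X[j], x_i) for j in range(len(X)))
                if y_i * score <= 0:           # misclassified: strengthen this example
                    alpha[i] += 1.0
        return alpha

    def predict(x, X, y, alpha, kernel=rbf_kernel):
        score = sum(alpha[j] * y[j] * kernel(X[j], x) for j in range(len(X)))
        return 1 if score > 0 else -1

    # Toy one-dimensional data that no single threshold separates (made up)
    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([-1, 1, 1, -1])
    alpha = train_kernel_perceptron(X, y)
    print(predict(np.array([1.5]), X, y, alpha))   # the middle region is classified +1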

Notes

  1. T. Mitchell, Generative and Discriminative Classifiers: Naive Bayes and Logistic Regression. Draft Version, 2005. http://www.cs.cmu.edu/~tom/mlbook/NBayesLogReg.pdf
  2. A. Y. Ng and M. I. Jordan. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and Naive Bayes. In NIPS 14, 2002. http://www.cs.berkeley.edu/~jordan/papers/ng-jordan-nips01.ps.gz
  3. R. O. Duda, P. E. Hart, D. G. Stork, "Pattern Classification", Wiley, (2001). ISBN 0-471-05669-3

See also:

  1. Y. Yang, X. Liu, "A re-examination of text categorization", Proc. ACM SIGIR Conference, pp. 42-49, (1999). http://citeseer.ist.psu.edu/yang99reexamination.html
  2. R. Herbrich, "Learning Kernel Classifiers: Theory and Algorithms," MIT Press, (2001). ISBN 0-262-08306-X