
AdaBoost

From Wikipedia, the free encyclopedia
{{short description|Adaptive boosting based classification algorithm}}
{{Machine learning}}
'''AdaBoost''' (short for '''Ada'''ptive [[Boosting (meta-algorithm)|'''Boost'''ing]]) is a [[statistical classification]] [[meta-algorithm]] formulated by [[Yoav Freund]] and [[Robert Schapire]] in 1995, who won the 2003 [[Gödel Prize]] for their work. It can be used in conjunction with many types of learning algorithm to improve performance. The output of multiple ''weak learners'' is combined into a weighted sum that represents the final output of the boosted classifier. Usually, AdaBoost is presented for [[binary classification]], although it can be generalized to multiple classes or bounded intervals of [[Real number|real]] values.<ref>{{Citation |last1=Freund |first1=Yoav |title=A {{as written|desi|cion-theoretic [sic]}} generalization of on-line learning and an application to boosting |date=1995 |url=http://dx.doi.org/10.1007/3-540-59119-2_166 |series=Lecture Notes in Computer Science |pages=23–37 |place=Berlin, Heidelberg |publisher=Springer Berlin Heidelberg |isbn=978-3-540-59119-1 |access-date=2022-06-24 |last2=Schapire |first2=Robert E.|doi=10.1007/3-540-59119-2_166 }}</ref><ref>{{Cite journal |last1=Hastie |first1=Trevor |last2=Rosset |first2=Saharon |last3=Zhu |first3=Ji |last4=Zou |first4=Hui |date=2009 |title=Multi-class AdaBoost |journal=Statistics and Its Interface |volume=2 |issue=3 |pages=349–360 |doi=10.4310/sii.2009.v2.n3.a8 |issn=1938-7989|doi-access=free }}</ref>


AdaBoost is adaptive in the sense that subsequent weak learners (models) are adjusted in favor of instances misclassified by previous models. In some problems, it can be less susceptible to [[overfitting]] than other learning algorithms. The individual learners can be weak, but as long as the performance of each one is slightly better than random guessing, the final model can be proven to converge to a strong learner.


Although AdaBoost is typically used to combine weak base learners (such as [[decision stump]]s), it has been shown to also effectively combine strong base learners (such as deeper [[decision tree]]s), producing an even more accurate model.<ref>{{cite journal |last1=Wyner |first1=Abraham J. |last2=Olson |first2=Matthew |last3=Bleich |first3=Justin |last4=Mease |first4=David |title=Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers |journal=Journal of Machine Learning Research |date=2017 |volume=18 |issue=48 |pages=1–33 |url=http://jmlr.org/papers/v18/15-240.html |access-date=17 March 2022}}</ref>


Every learning algorithm tends to suit some problem types better than others, and typically has many different parameters and configurations to adjust before it achieves optimal performance on a [[dataset]]. AdaBoost (with decision trees as the weak learners) is often referred to as the best out-of-the-box classifier.<ref>{{cite arXiv |last1=Kégl|first1=Balázs|title=The return of AdaBoost.MH: multi-class Hamming trees |eprint=1312.6086 |date=20 December 2013|class=cs.LG}}</ref><ref>{{cite web|last1=Joglekar|first1=Sachin|title=adaboost – Sachin Joglekar's blog|url=https://codesachin.wordpress.com/tag/adaboost/|website=codesachin.wordpress.com|access-date=3 August 2016}}</ref> When used with decision tree learning, information gathered at each stage of the AdaBoost algorithm about the relative 'hardness' of each training sample is fed into the tree-growing algorithm such that later trees tend to focus on harder-to-classify examples.

==Training==
AdaBoost refers to a particular method of training a boosted classifier. A boosted classifier is a classifier of the form
<math display="block">F_T(x) = \sum_{t=1}^T f_t(x)</math>
where each <math>f_t</math> is a weak learner that takes an object <math>x</math> as input and returns a value indicating the class of the object. For example, in the two-class problem, the sign of the weak learner's output identifies the predicted object class and the absolute value gives the confidence in that classification.
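
For illustration, a minimal Python sketch of evaluating such a boosted classifier (the weak learners here are arbitrary stand-ins, not part of any particular AdaBoost formulation) might look as follows:

<syntaxhighlight lang="python">
# A boosted classifier is a list of weak learners f_t; each weak learner is any
# callable returning a real value whose sign encodes its vote for the class.
def boosted_output(weak_learners, x):
    """F_T(x) = sum over t of f_t(x)."""
    return sum(f(x) for f in weak_learners)

def predict(weak_learners, x):
    """The sign of F_T(x) is the predicted class; |F_T(x)| is the confidence."""
    score = boosted_output(weak_learners, x)
    return (1 if score >= 0 else -1), abs(score)

# Two hypothetical weak learners (e.g. already-scaled decision stumps).
f1 = lambda x: 0.8 if x[0] > 0.5 else -0.8
f2 = lambda x: 0.3 if x[1] > 2.0 else -0.3
print(predict([f1, f2], [0.7, 1.0]))  # approximately (1, 0.5)
</syntaxhighlight>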


Each weak learner produces an output hypothesis <math>h</math> which fixes a prediction <math>h(x_i)</math> for each sample in the training set. At each iteration <math>t</math>, a weak learner is selected and assigned a coefficient <math>\alpha_t</math> such that the total training error <math>E_t</math> of the resulting <math>t</math>-stage boosted classifier is minimized.


<math display="block">E_t = \sum_i E[F_{t-1}(x_i) + \alpha_t h(x_i)]</math>


Here <math>F_{t-1}(x)</math> is the boosted classifier that has been built up to the previous stage of training and <math>f_t(x) = \alpha_t h(x)</math> is the weak learner that is being considered for addition to the final classifier.


===Weighting===
At each iteration of the training process, a weight <math>w_{i,t}</math> is assigned to each sample in the training set equal to the current error <math>E(F_{t-1}(x_i))</math> on that sample. These weights can be used in the training of the weak learner. For instance, decision trees can be grown which favor the splitting of sets of samples with large weights.
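
With the exponential error function used in the derivation below, these weights are simply <math>e^{-y_i F_{t-1}(x_i)}</math>; a short illustrative sketch (the function and variable names are assumptions for the example) is:

<syntaxhighlight lang="python">
import math

def sample_weights(F_prev, X, y):
    """Weight of each sample = current exponential error e^{-y_i F_{t-1}(x_i)}.

    F_prev: callable giving the real-valued output of the boosted classifier so far.
    X, y:   training samples and their labels in {-1, +1}.
    """
    w = [math.exp(-yi * F_prev(xi)) for xi, yi in zip(X, y)]
    total = sum(w)
    return [wi / total for wi in w]  # normalized, so they can act as a distribution

# These weights can then be passed to any weak-learner trainer that accepts
# per-sample weights (e.g. a weighted decision-stump or tree learner).
</syntaxhighlight>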


==Derivation==
This derivation follows Rojas (2009):<ref>Rojas, Raúl (2009). ''AdaBoost and the super bowl of classifiers: a tutorial introduction to adaptive boosting''. Technical report, Freie University, Berlin.</ref>


Suppose we have a data set <math>\{(x_1, y_1), \ldots, (x_N, y_N)\}</math> where each item <math>x_i</math> has an associated class <math>y_i \in \{-1, 1\}</math>, and a set of weak classifiers <math>\{k_1, \ldots, k_L\}</math> each of which outputs a classification <math>k_j(x_i) \in \{-1, 1\}</math> for each item. After the <math>(m-1)</math>-th iteration our boosted classifier is a linear combination of the weak classifiers of the form:
<math display="block">C_{(m-1)}(x_i) = \alpha_1k_1(x_i) + \cdots + \alpha_{m-1}k_{m-1}(x_i),</math>

where the class will be the sign of <math>C_{(m-1)}(x_i)</math>. At the <math>m</math>-th iteration we want to extend this to a better boosted classifier by adding another weak classifier <math>k_m</math>, with another weight <math>\alpha_m</math>:
<math display="block">C_{m}(x_i) = C_{(m-1)}(x_i) + \alpha_m k_m(x_i)</math>



So it remains to determine which weak classifier is the best choice for <math>k_m</math>, and what its weight <math>\alpha_m</math> should be. We define the total error <math>E</math> of <math>C_m</math> as the sum of its [[Loss functions for classification|exponential loss]] on each data point, given as follows:
<math display="block">E = \sum_{i=1}^N e^{-y_i C_m(x_i)} = \sum_{i=1}^N e^{-y_i C_{(m-1)}(x_i)}e^{-y_i \alpha_m k_m(x_i)} </math>



Letting <math>w_i^{(1)} = 1</math> and <math>w_i^{(m)} = e^{-y_i C_{m-1}(x_i)}</math> for <math>m > 1</math>, we have:
<math display="block">E = \sum_{i=1}^N w_i^{(m)}e^{-y_i\alpha_m k_m(x_i)}</math>


We can split this summation between those data points that are correctly classified by <math>k_m</math> (so <math>y_i k_m(x_i) = 1</math>) and those that are misclassified (so <math>y_i k_m(x_i) = -1</math>):
<math display="block">\begin{align}
E &= \sum_{y_i = k_m(x_i)} w_i^{(m)}e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)}e^{\alpha_m} \\
&= \sum_{i=1}^N w_i^{(m)}e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)}\left(e^{\alpha_m} - e^{-\alpha_m}\right)
\end{align}</math>


Since the only part of the right-hand side of this equation that depends on <math>k_m</math> is <math display="inline">\sum_{y_i \neq k_m(x_i)} w_i^{(m)}</math>, we see that the <math>k_m</math> that minimizes <math>E</math> is the one in the set <math>\{k_1, \ldots, k_L\}</math> that minimizes <math display="inline">\sum_{y_i \neq k_m(x_i)} w_i^{(m)}</math> [assuming that <math>\alpha_m > 0</math>], i.e. the weak classifier with the lowest weighted error (with weights <math>w_i^{(m)} = e^{-y_i C_{m-1}(x_i)}</math>).
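
In code, this selection step is a scan over the candidate weak classifiers for the smallest weighted error; the following sketch (labels in <math>\{-1, 1\}</math>, candidates given as callables) is illustrative rather than canonical:

<syntaxhighlight lang="python">
def best_weak_classifier(candidates, X, y, w):
    """Pick the k_m in the candidate set with the lowest weighted error
    sum of w_i over the misclassified samples."""
    best_k, best_err = None, float("inf")
    for k in candidates:
        weighted_error = sum(wi for xi, yi, wi in zip(X, y, w) if k(xi) != yi)
        if weighted_error < best_err:
            best_k, best_err = k, weighted_error
    return best_k, best_err
</syntaxhighlight>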


To determine the desired weight <math>\alpha_m</math> that minimizes <math>E</math> with the <math>k_m</math> that we just determined, we differentiate:
<math display="block">\frac{d E}{d \alpha_m} = \frac{d (\sum_{y_i = k_m(x_i)} w_i^{(m)}e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)}e^{\alpha_m}) }{d \alpha_m}</math>


Since <math>E</math> is convex in <math>\alpha_m</math>, the minimum occurs where this derivative is zero; solving for <math>\alpha_m</math> yields:
<math display="block">\alpha_m = \frac{1}{2}\ln\left(\frac{\sum_{y_i = k_m(x_i)} w_i^{(m)}}{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}\right)</math>



{{Math proof|drop=hidden|
<math display="block">\frac{d E}{d \alpha_m} = -\sum_{y_i = k_m(x_i)} w_i^{(m)}e^{-\alpha_m} + \sum_{y_i \neq k_m(x_i)} w_i^{(m)}e^{\alpha_m} = 0</math>

Since <math>e^{-\alpha_m}</math> and <math>e^{\alpha_m}</math> do not depend on <math>i</math>, they can be factored out of the sums, giving
<math display="block">e^{-\alpha_m} \sum_{y_i = k_m(x_i)} w_i^{(m)} = e^{\alpha_m} \sum_{y_i \neq k_m(x_i)} w_i^{(m)}</math>
<math display="block">-\alpha_m + \ln\left( \sum_{y_i = k_m(x_i)} w_i^{(m)} \right) = \alpha_m + \ln\left(\sum_{y_i \neq k_m(x_i)} w_i^{(m)} \right) </math>

<math display="block">-2\alpha_m = \ln\left( \dfrac{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}{\sum_{y_i = k_m(x_i)} w_i^{(m)}} \right) </math>
<math display="block">\alpha_m = - \dfrac{1}{2} \ln\left( \dfrac{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}{\sum_{y_i = k_m(x_i)} w_i^{(m)}} \right) </math>

<math display="block">\alpha_m = \dfrac{1}{2} \ln\left( \dfrac{\sum_{y_i = k_m(x_i)} w_i^{(m)}}{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}} \right) </math>


}}


We calculate the weighted error rate of the weak classifier to be <math>\epsilon_m = \frac{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}{\sum_{i=1}^N w_i^{(m)}}</math>, so it follows that:
<math display="block">\alpha_m = \frac{1}{2}\ln\left( \frac{1 - \epsilon_m}{\epsilon_m}\right)</math>
which is the negative logit function multiplied by 0.5. Due to the convexity of <math>E</math> as a function of <math>\alpha_m</math>, this new expression for <math>\alpha_m</math> gives the global minimum of the loss function.
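
As a quick numerical check with arbitrary numbers, the weight-ratio form and the error-rate form of <math>\alpha_m</math> agree:

<syntaxhighlight lang="python">
import math

w_correct, w_wrong = 0.7, 0.3          # weight mass on correctly / incorrectly classified points
eps = w_wrong / (w_correct + w_wrong)  # weighted error rate epsilon_m = 0.3

alpha_from_weights = 0.5 * math.log(w_correct / w_wrong)
alpha_from_error   = 0.5 * math.log((1 - eps) / eps)
print(alpha_from_weights, alpha_from_error)  # both ~0.4236
</syntaxhighlight>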


Note: This derivation only applies when <math>k_m(x_i) \in \{-1, 1\}</math>, though it can be a good starting guess in other cases, such as when the weak learner is biased (<math>k_m(x) \in \{a,b\}, a \neq -b</math>), has multiple leaves (<math>k_m(x) \in \{a,b,\dots,n\}</math>) or is some other function <math>k_m(x) \in \mathbb{R}</math>.


Thus we have derived the AdaBoost algorithm: At each iteration, choose the classifier <math>k_m</math>, which minimizes the total weighted error <math display="inline">\sum_{y_i \neq k_m(x_i)} w_i^{(m)}</math>, use this to calculate the error rate <math>\epsilon_m = \frac{\sum_{y_i \neq k_m(x_i)} w_i^{(m)}}{\sum_{i=1}^N w_i^{(m)}}</math>, use this to calculate the weight <math>\alpha_m = \frac{1}{2}\ln\left( \frac{1 - \epsilon_m}{\epsilon_m}\right)</math>, and finally use this to improve the boosted classifier <math>C_{m-1}</math> to <math>C_{m} = C_{(m-1)} + \alpha_m k_m</math>.


==Statistical understanding of boosting==
Boosting is a form of linear [[Regression analysis|regression]] in which the features of each sample <math>x_i</math> are the outputs of some weak learner <math>h</math> applied to <math>x_i</math>.


While regression tries to fit <math>F(x)</math> to <math>y(x)</math> as precisely as possible without loss of generalization, typically using [[least squares|least-squares]] error <math>E(f) = (y(x) - f(x))^2</math>, the AdaBoost error function <math>E(f) = e^{-y(x)f(x)}</math> takes into account the fact that only the sign of the final result is used, so <math>|F(x)|</math> can be far larger than 1 without increasing the error. However, the exponential growth of the error for sample <math>x_i</math> as <math>-y(x_i)f(x_i)</math> increases results in excessive weight being assigned to outliers.


One feature of the choice of exponential error function is that the error of the final additive model is the product of the error of each stage, that is, <math>e^{\sum_i -y_i f(x_i)} = \prod_i e^{-y_i f(x_i)}</math>. Thus it can be seen that the weight update in the AdaBoost algorithm is equivalent to recalculating the error on <math>F_t(x)</math> after each stage.
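
Both observations can be checked numerically; in the sketch below (arbitrary values), a confidently correct prediction with <math>|F(x)| > 1</math> is penalized by squared error but adds almost nothing to the exponential loss, and the total exponential loss factors into per-stage terms:

<syntaxhighlight lang="python">
import math

y, F = 1, 3.0                        # correct prediction with |F(x)| > 1
print((y - F) ** 2)                  # squared error penalizes the confident prediction: 4.0
print(math.exp(-y * F))              # exponential loss barely does: ~0.0498

# The exponential loss of an additive model is the product of per-stage factors.
margins = [0.4, -0.1, 0.7]           # y * f_t(x) contributions of successive stages for one sample
total = math.exp(-sum(margins))
product = math.prod(math.exp(-m) for m in margins)
print(abs(total - product) < 1e-12)  # True
</syntaxhighlight>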


There is a lot of flexibility allowed in the choice of loss function. As long as the loss function is [[monotonic function|monotonic]] and [[continuously differentiable function|continuously differentiable]], the classifier is always driven toward purer solutions.<ref name="fht"/> Zhang (2004) provides a loss function based on least squares, a modified [[Huber loss function]]:
<math display="block">\phi(y,f(x)) = \begin{cases}
-4y f(x) & \text{if }y f(x) < -1, \\
(y f(x) - 1)^2 &\text{if } -1 \le y f(x) \le 1, \\
0 &\text{if } y f(x) > 1.
\end{cases}</math>


This function is better behaved than LogitBoost for <math>f(x)</math> close to 1 or −1; unlike unmodified least squares, it does not penalise ‘overconfident’ predictions (<math>y f(x) > 1</math>); and it penalises samples misclassified with confidence greater than 1 only linearly, as opposed to quadratically or exponentially, making it less susceptible to the effects of outliers.
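
A direct transcription of this piecewise loss into Python, useful for checking its behaviour on a few margins, could be:

<syntaxhighlight lang="python">
def phi(y, fx):
    """Zhang's modified (Huber-like) loss: linear for badly misclassified points,
    quadratic near the decision boundary, zero for confident correct predictions."""
    margin = y * fx
    if margin < -1:
        return -4 * margin            # linear penalty, less sensitive to outliers
    elif margin <= 1:
        return (margin - 1) ** 2      # quadratic near the boundary
    else:
        return 0.0                    # no penalty for 'overconfident' correct predictions

print(phi(1, -2.0), phi(1, 0.5), phi(1, 3.0))  # 8, 0.25, 0.0
</syntaxhighlight>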
==Boosting as gradient descent==
{{Main article|Gradient boosting}}
Boosting can be seen as minimization of a [[convex function|convex]] loss function over a [[convex set]] of functions.<ref>{{cite journal |first=T. |last=Zhang |title=Statistical behavior and consistency of classification methods based on convex risk minimization |journal=Annals of Statistics |volume=32 |issue=1 |pages=56–85 |year=2004 |doi=10.1214/aos/1079120130 |jstor=3448494 |doi-access=free }}</ref> Specifically, the loss being minimized by AdaBoost is the exponential loss
<math display="block">\sum_i \phi(i,y,f) = \sum_i e^{-y_i f(x_i)},</math>
whereas LogitBoost performs logistic regression, minimizing
<math display="block">\sum_i \phi(i,y,f) = \sum_i \ln\left(1+e^{-y_i f(x_i)}\right).</math>


In the gradient descent analogy, the output of the classifier for each training point is considered a point <math>\left(F_t(x_1), \dots, F_t(x_n)\right)</math> in n-dimensional space, where each axis corresponds to a training sample, each weak learner <math>h(x)</math> corresponds to a vector of fixed orientation and length, and the goal is to reach the target point <math>(y_1, \dots, y_n)</math> (or any region where the value of loss function <math>E_T(x_1, \dots, x_n)</math> is less than the value at that point), in the fewest steps. Thus AdaBoost algorithms perform either [[gradient descent|Cauchy]] (find <math>h(x)</math> with the steepest gradient, choose <math>\alpha</math> to minimize test error) or [[Newton's method|Newton]] (choose some target point, find <math>\alpha h(x)</math> that brings <math>F_t</math> closest to that point) optimization of training error.
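
In this analogy the negative gradient of the exponential loss with respect to the current outputs is <math>y_i e^{-y_i F_t(x_i)}</math>, i.e. the AdaBoost sample weights signed by the labels; a brief sketch (illustrative values) makes the correspondence concrete:

<syntaxhighlight lang="python">
import math

y = [1, -1, 1]
F = [0.2, 0.5, -1.0]   # current boosted outputs F_t(x_i) on three training samples

# Negative gradient of sum_i exp(-y_i F(x_i)) with respect to each F(x_i):
neg_grad = [yi * math.exp(-yi * Fi) for yi, Fi in zip(y, F)]

# AdaBoost's (unnormalized) sample weights are the same quantities up to the label sign:
weights = [math.exp(-yi * Fi) for yi, Fi in zip(y, F)]
print(neg_grad)   # approximately [0.82, -1.65, 2.72]
print(weights)    # approximately [0.82, 1.65, 2.72]
</syntaxhighlight>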
==Example algorithm (Discrete AdaBoost)==
With:
* Samples <math>x_1 \dots x_n</math>
* Desired outputs <math>y_1 \dots y_n, y \in \{-1, 1\}</math>
* Initial weights <math>w_{1,1} \dots w_{n,1}</math> set to <math>\frac{1}{n}</math>
* Error function <math>E(f(x), y_i) = e^{-y_i f(x_i)}</math>
* Weak learners <math>h\colon x \rightarrow \{-1, 1\}</math>
For <math>t</math> in <math>1 \dots T</math>:
* Choose <math>f_t(x)</math>:
** Find weak learner <math>h_t(x)</math> that minimizes <math>\epsilon_t</math>, the weighted sum error for misclassified points <math>\epsilon_t = \sum_{h_t(x_i) \neq y_i} w_{i,t}</math>
** Choose <math>\alpha_t = \frac{1}{2} \ln \left(\frac{1 - \epsilon_t}{\epsilon_t}\right)</math>
* Add to ensemble:
** <math>F_t(x) = F_{t-1}(x) + \alpha_t h_t(x)</math>
* Update weights:
** <math>w_{i,t+1} = w_{i,t} e^{-y_i \alpha_t h_t(x_i)}</math> for <math>i</math> in <math>1 \dots n</math>
** Renormalize <math>w_{i,t+1}</math> such that <math>\sum_i w_{i,t+1} = 1</math>
**(Note: It can be shown that <math>\sum_{h_t(x_i) = y_i} w_{i,t+1} = \sum_{h_t(x_i) \neq y_i} w_{i,t+1}</math> at every step, i.e. after the update the weighted error of the learner just added is exactly <math>\tfrac{1}{2}</math>. This can simplify the calculation of the new weights: the renormalization amounts to dividing the weight of each correctly classified sample by <math>2(1 - \epsilon_t)</math> and that of each misclassified sample by <math>2\epsilon_t</math>. A short Python sketch of this loop is given below.)
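
The listing above can be turned into a self-contained Python sketch; the one-dimensional decision stumps and the toy data are illustrative assumptions, not part of the algorithm itself:

<syntaxhighlight lang="python">
import math

def make_stumps(thresholds):
    """Build simple decision stumps h(x) in {-1, +1} on a 1-D feature."""
    stumps = []
    for thr in thresholds:
        stumps.append(lambda x, t=thr: 1 if x > t else -1)
        stumps.append(lambda x, t=thr: -1 if x > t else 1)
    return stumps

def discrete_adaboost(X, y, stumps, T):
    n = len(X)
    w = [1.0 / n] * n                       # initial weights 1/n
    ensemble = []                           # list of (alpha_t, h_t)
    for _ in range(T):
        # Find the weak learner minimizing the weighted error over misclassified points.
        h_t, eps = min(
            ((h, sum(wi for xi, yi, wi in zip(X, y, w) if h(xi) != yi)) for h in stumps),
            key=lambda pair: pair[1])
        if eps <= 0 or eps >= 0.5:          # perfect, or no better than chance: stop
            if eps <= 0:
                ensemble.append((1.0, h_t)) # arbitrary finite weight for a perfect learner
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        ensemble.append((alpha, h_t))
        # Reweight: increase weights of misclassified samples, then renormalize.
        w = [wi * math.exp(-yi * alpha * h_t(xi)) for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# Tiny illustrative data set: positive class above 0.
X = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y = [-1, -1, -1, 1, 1, 1]
model = discrete_adaboost(X, y, make_stumps([-1.5, 0.0, 1.5]), T=5)
print([predict(model, xi) for xi in X])     # matches y on this toy data
</syntaxhighlight>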

===Choosing {{mvar|&alpha;<sub>t</sub>}}===
The stage value <math>\alpha_t</math> is chosen because it can be shown analytically to be the minimizer of the exponential error function for Discrete AdaBoost.<ref name="ss">{{cite journal | citeseerx = 10.1.1.33.4002 | title = Improved Boosting Algorithms Using Confidence-rated Predictions | first1 = Robert | last1 = Schapire | first2 = Yoram | last2 = Singer | year = 1999 | pages = 80–91 }}</ref>

Minimize:

<math>\sum_i w_i e^{-y_i h_i \alpha_t}</math>

Using the convexity of the exponential function, and assuming that <math>\forall i, h_i \in \{-1, 1\}</math> we have:

<math>
\begin{align}
\sum_i w_i e^{-y_i h_i \alpha_t} &\leq \sum_i \left(\frac {1-y_i h_i} {2}\right) w_i e^{\alpha_t} + \sum_i \left(\frac {1+y_i h_i} {2}\right) w_i e^{-\alpha_t}\\
&= \left(\epsilon_t \right)e^{\alpha_t} + \left(1-\epsilon_t \right)e^{-\alpha_t}
\end{align}
</math>

We then differentiate that expression with respect to <math>\alpha_t</math> and set it to zero to find the minimum of the upper bound:

<math>
\begin{align}
\left(\epsilon_t \right)e^{\alpha_t} - \left (1-\epsilon_t \right)e^{-\alpha_t} &= 0\\
\alpha_t &= \frac {1} {2} \ln \left(\frac {1-\epsilon_t} {\epsilon_t}\right)
\end{align}
</math>

Note that this only applies when <math>h_i \in \{-1, 1\}</math>, though it can be a good starting guess in other cases, such as when the weak learner is biased (<math>h(x) \in \{a,b\}, a \neq -b</math>), has multiple leaves (<math>h(x) \in \{a,b,\dots,n\}</math>) or is some other function <math>h(x) \in \mathbb{R}</math>. In such cases the choice of weak learner and coefficient can be condensed to a single step in which <math>f_t = \alpha_t h_t(x)</math> is chosen from all possible <math>\alpha, h</math> as the minimizer of <math>\sum_i w_{i,t} e^{-y_i f_t(x_i)}</math> by some numerical searching routine.
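
A minimal sketch of such a numerical search, here a simple ternary search over <math>\alpha</math> for one fixed real-valued weak learner (repeating it over candidate learners gives the condensed single step mentioned above; the helper names are illustrative), could be:

<syntaxhighlight lang="python">
import math

def exp_loss(alpha, h_outputs, y, w):
    """sum_i w_i * exp(-y_i * alpha * h(x_i)) for one fixed real-valued weak learner."""
    return sum(wi * math.exp(-yi * alpha * hi) for hi, yi, wi in zip(h_outputs, y, w))

def best_alpha(h_outputs, y, w, lo=-10.0, hi=10.0, iters=100):
    """Ternary search: the exponential loss is convex in alpha, so this converges."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if exp_loss(m1, h_outputs, y, w) < exp_loss(m2, h_outputs, y, w):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

# Example: a biased weak learner with outputs in {0.2, 1.0} rather than {-1, +1}.
h_out = [1.0, 0.2, 1.0, 0.2]
y     = [1, -1, 1, 1]
w     = [0.25] * 4
print(best_alpha(h_out, y, w))
</syntaxhighlight>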


==Variants==
===Real AdaBoost===
The output of decision trees is a class probability estimate <math>p(x) = P(y=1 | x)</math>, the probability that <math>x</math> is in the positive class.<ref name="fht"/> Friedman, Hastie and Tibshirani derive an analytical minimizer for <math>e^{-y\left(F_{t-1}(x)+f_t(p(x))\right)}</math> for some fixed <math>p(x)</math> (typically chosen using weighted least squares error):
<math display="block">f_t(x) = \frac{1}{2} \ln\left( \frac{x}{1-x}\right).</math>

Thus, rather than multiplying the output of the entire tree by some fixed value, each leaf node is changed to output half the [[logit]] transform of its previous value.
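
For illustration, converting a leaf's probability estimate into the value it outputs under this scheme is a one-line transform; the clipping guard below is an implementation detail, not part of the formula:

<syntaxhighlight lang="python">
import math

def real_adaboost_leaf_value(p, eps=1e-10):
    """Half the logit transform of the leaf's class-probability estimate p = P(y=1|x)."""
    p = min(max(p, eps), 1 - eps)   # guard against log(0) for pure leaves
    return 0.5 * math.log(p / (1 - p))

print(real_adaboost_leaf_value(0.9))   # ~1.0986
</syntaxhighlight>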
===LogitBoost===
{{Main article|LogitBoost}}
LogitBoost represents an application of established [[logistic regression]] techniques to the AdaBoost method. Rather than minimizing error with respect to y, weak learners are chosen to minimize the (weighted least-squares) error of <math>f_t(x)</math> with respect to
<math display="block">z_t = \frac{y^* - p_t(x)}{2 p_t(x)(1 - p_t(x))},</math>

where
<math display="block">p_t(x) = \frac{e^{F_{t-1}(x)}}{e^{F_{t-1}(x)}+e^{-F_{t-1}(x)}},</math>
<math display="block">w_t = p_t(x)(1 - p_t(x))</math>
<math display="block">y^* = \frac{y+1} 2.</math>


That is, <math>z_t</math> is the [[Newton–Raphson]] approximation of the minimizer of the log-likelihood error at stage <math>t</math>, and the weak learner <math>f_t</math> is chosen as the learner that best approximates <math>z_t</math> by weighted least squares.
As <math>p</math> approaches either 1 or 0, the value of <math>p_t(x_i)(1 - p_t(x_i))</math> becomes very small and the ''z'' term, which is large for misclassified samples, can become numerically unstable, due to machine precision rounding errors. This can be overcome by enforcing some limit on the absolute value of ''z'' and the minimum value of ''w''.
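
A small sketch of this guarded computation (the clamping constants are illustrative choices, not prescribed values):

<syntaxhighlight lang="python">
import math

def logitboost_working_response(F_prev, y, z_max=4.0, w_min=1e-6):
    """Compute LogitBoost's working response z and weight w from the current score
    F_{t-1}(x), clamping to avoid numerical blow-up as p approaches 0 or 1."""
    p = math.exp(F_prev) / (math.exp(F_prev) + math.exp(-F_prev))
    y_star = (y + 1) / 2                 # map {-1, +1} labels to {0, 1}
    w = max(p * (1 - p), w_min)
    z = (y_star - p) / (2 * w)
    return max(-z_max, min(z, z_max)), w

print(logitboost_working_response(3.0, -1))   # z is clamped to -4.0; w is ~0.0025
</syntaxhighlight>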


===Gentle AdaBoost===
While previous boosting algorithms choose <math>f_t</math> greedily, minimizing the overall test error as much as possible at each step, GentleBoost features a bounded step size. <math>f_t</math> is chosen to minimize <math>\sum_i w_{t,i} (y_i-f_t(x_i))^2</math>, and no further coefficient is applied. Thus, in the case where a weak learner exhibits perfect classification performance, GentleBoost chooses <math>f_t(x) = \alpha_t h_t(x)</math> exactly equal to <math>y</math>, while steepest descent algorithms try to set <math>\alpha_t = \infty</math>. Empirical observations about the good performance of GentleBoost appear to back up Schapire and Singer's remark that allowing excessively large values of <math>\alpha</math> can lead to poor generalization performance.<ref name="ss"/><ref name="fs">{{cite web | url = http://www.site.uottawa.ca/~stan/csi5387/boost-tut-ppr.pdf | title = A Short Introduction to Boosting | last1 = Freund | last2 = Schapire | year = 1999 | postscript = : }}</ref>


===Early Termination===
===Early Termination===

A technique for speeding up processing of boosted classifiers, early termination refers to only testing each potential object with as many layers of the final classifier as are necessary to meet some confidence threshold, speeding up computation for cases where the class of the object can easily be determined. One such scheme is the object detection framework introduced by Viola and Jones:[11] in an application with significantly more negative samples than positive, a cascade of separate boost classifiers is trained, the output of each stage biased such that some acceptably small fraction of positive samples is mislabeled as negative, and all samples marked as negative after each stage are discarded. If 50% of negative samples are filtered out by each stage, only a very small number of objects would pass through the entire classifier, reducing computation effort. This method has since been generalized, with a formula provided for choosing optimal thresholds at each stage to achieve some desired false positive and false negative rate.[12]
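
A schematic sketch of cascade-style early termination; the per-stage classifiers and thresholds are placeholders, since a real Viola–Jones-style cascade tunes these per stage:

<syntaxhighlight lang="python">
def cascade_predict(stages, x):
    """stages: list of (stage_classifier, threshold) pairs; each classifier returns a real score.
    An object is rejected as soon as any stage's score falls below its threshold,
    so most negatives are discarded after evaluating only a few cheap stages."""
    for classifier, threshold in stages:
        if classifier(x) < threshold:
            return -1          # rejected early: classified as negative
    return 1                   # survived every stage: classified as positive
</syntaxhighlight>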

In the field of statistics, where AdaBoost is more commonly applied to problems of moderate dimensionality, early stopping is used as a strategy to reduce overfitting.[13] A validation set of samples is separated from the training set, performance of the classifier on the samples used for training is compared to performance on the validation samples, and training is terminated if performance on the validation sample is seen to decrease even as performance on the training set continues to improve.

===Totally corrective algorithms===

For steepest descent versions of AdaBoost, where <math>\alpha_t</math> is chosen at each layer ''t'' to minimize test error, the next layer added is said to be ''maximally independent'' of layer ''t'':[14] it is unlikely to choose a weak learner ''t''+1 that is similar to learner ''t''. However, there remains the possibility that ''t''+1 produces similar information to some other earlier layer. Totally corrective algorithms, such as [[LPBoost]], optimize the value of every coefficient after each step, such that new layers added are always maximally independent of every previous layer. This can be accomplished by backfitting, linear programming or some other method.

===Pruning===

Pruning is the process of removing poorly performing weak classifiers to reduce the memory and execution-time cost of the boosted classifier. The simplest methods, which can be particularly effective in conjunction with totally corrective training, are weight- or margin-trimming: when the coefficient, or the contribution to the total test error, of some weak classifier falls below a certain threshold, that classifier is dropped. Margineantu & Dietterich[15] suggested an alternative criterion for trimming: weak classifiers should be selected such that the diversity of the ensemble is maximized. If two weak learners produce very similar outputs, efficiency can be improved by removing one of them and increasing the coefficient of the remaining weak learner.[16]


==References==
  1. Freund, Yoav; Schapire, Robert E. (1995). "A desicion-theoretic [sic] generalization of on-line learning and an application to boosting". Lecture Notes in Computer Science. Berlin, Heidelberg: Springer. pp. 23–37. doi:10.1007/3-540-59119-2_166. ISBN 978-3-540-59119-1. Retrieved 2022-06-24.
  2. Hastie, Trevor; Rosset, Saharon; Zhu, Ji; Zou, Hui (2009). "Multi-class AdaBoost". Statistics and Its Interface. 2 (3): 349–360. doi:10.4310/sii.2009.v2.n3.a8. ISSN 1938-7989.
  3. Wyner, Abraham J.; Olson, Matthew; Bleich, Justin; Mease, David (2017). "Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers". Journal of Machine Learning Research. 18 (48): 1–33. Retrieved 17 March 2022.
  4. Kégl, Balázs (20 December 2013). "The return of AdaBoost.MH: multi-class Hamming trees". arXiv:1312.6086 [cs.LG].
  5. Joglekar, Sachin. "adaboost – Sachin Joglekar's blog". codesachin.wordpress.com. Retrieved 3 August 2016.
  6. Rojas, Raúl (2009). "AdaBoost and the super bowl of classifiers: a tutorial introduction to adaptive boosting" (Tech. Rep.). Freie University, Berlin.
  7. Friedman, Jerome; Hastie, Trevor; Tibshirani, Robert (1998). "Additive Logistic Regression: A Statistical View of Boosting". Annals of Statistics. 28: 2000. CiteSeerX 10.1.1.51.9525.
  8. Zhang, T. (2004). "Statistical behavior and consistency of classification methods based on convex risk minimization". Annals of Statistics. 32 (1): 56–85. doi:10.1214/aos/1079120130. JSTOR 3448494.
  9. Schapire, Robert; Singer, Yoram (1999). "Improved Boosting Algorithms Using Confidence-rated Predictions". pp. 80–91. CiteSeerX 10.1.1.33.4002.
  10. Freund; Schapire (1999). "A Short Introduction to Boosting" (PDF).
  11. Viola, Paul; Jones, Robert (2001). "Rapid Object Detection Using a Boosted Cascade of Simple Features". CiteSeerX 10.1.1.10.6807.
  12. McCane, Brendan; Novins, Kevin; Albert, Michael (2005). "Optimizing cascade classifiers".
  13. Hastie, Trevor; Tibshirani, Robert; Friedman, Jerome (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). New York: Springer. ISBN 978-0-387-84858-7.
  14. Šochman, Jan; Matas, Jiří (2004). Adaboost with Totally Corrective Updates for Fast Face Detection. ISBN 978-0-7695-2122-0.
  15. Margineantu, Dragos; Dietterich, Thomas (1997). "Pruning Adaptive Boosting". CiteSeerX 10.1.1.38.7017.
  16. Tamon, Christino; Xiang, Jie (2000). "On the Boosting Pruning Problem".
