Neural tangent kernel: Difference between revisions
No edit summary Tags: Mobile edit Mobile web edit |
No edit summary |
||
(47 intermediate revisions by 26 users not shown) | |||
Line 1: | Line 1: | ||
{{short description|Type of kernel induced by artificial neural networks}} |
|||
{{COI|date=December 2019}} |
|||
In the study of [[ |
In the study of [[Artificial neural network|artificial neural networks]] (ANNs), the '''neural tangent kernel''' ('''NTK''') is a [[Kernel method|kernel]] that describes the evolution of [[Deep learning|deep artificial neural networks]] during their training by [[gradient descent]]. It allows ANNs to be studied using theoretical tools from [[kernel methods]]. |
||
In general, a kernel is a [[Positive-definite kernel|positive-semidefinite]] [[Symmetric function|symmetric]] function of two inputs which represents some notion of similarity between the two inputs. The NTK is a specific kernel derived from a given neural network; in general, when the neural network parameters change during training, the NTK evolves as well. However, in the limit of large layer width the NTK becomes constant, revealing a duality between training the wide neural network and kernel methods: [[gradient descent]] in the [[Large width limits of neural networks|infinite-width limit]] is fully equivalent to kernel gradient descent with the NTK. As a result, using gradient descent to minimize least-square loss for neural networks yields the same mean estimator as ridgeless kernel regression with the NTK. This duality enables simple [[Closed-form expression|closed form]] equations describing the training dynamics, [[Generalization (machine learning)|generalization]], and predictions of wide neural networks. |
|||
For most common neural network architectures, in the limit of large layer width the NTK becomes constant. This enables simple [[Closed-form expression|closed form]] statements to be made about neural network predictions, training dynamics, generalization, and loss surfaces. For example, it guarantees that wide enough ANNs converge to a [[Maxima and minima|global minimum]] when trained to minimize an empirical loss. The NTK of large width networks is also related to several other [[large width limits of neural networks]]. |
|||
The NTK was introduced in 2018 by |
The NTK was introduced in 2018 by Arthur Jacot, Franck Gabriel and Clément Hongler,<ref name=":0">{{Citation |last1=Jacot |first1=Arthur |title=Neural Tangent Kernel: Convergence and Generalization in Neural Networks |date=2018 |url=http://papers.nips.cc/paper/8076-neural-tangent-kernel-convergence-and-generalization-in-neural-networks.pdf |work=Advances in Neural Information Processing Systems 31 |pages=8571–8580 |editor-last=Bengio |editor-first=S. |access-date=2019-11-27 |publisher=Curran Associates, Inc. |arxiv=1806.07572 |bibcode= |last2=Gabriel |first2=Franck |last3=Hongler |first3=Clement |editor2-last=Wallach |editor2-first=H. |editor3-last=Larochelle |editor3-first=H. |editor4-last=Grauman |editor4-first=K.}}</ref> who used it to study the convergence and generalization properties of fully connected neural networks. Later works<ref name=":3">{{Cite arXiv |last1=Arora |first1=Sanjeev |last2=Du |first2=Simon S. |last3=Hu |first3=Wei |last4=Li |first4=Zhiyuan |last5=Salakhutdinov |first5=Ruslan |last6=Wang |first6=Ruosong |date=2019-11-04 |title=On Exact Computation with an Infinitely Wide Neural Net |class=cs.LG |eprint=1904.11955 }}</ref><ref>{{Cite arXiv |last=Yang |first=Greg |date=2020-11-29 |title=Tensor Programs II: Neural Tangent Kernel for Any Architecture |class=stat.ML |eprint=2006.14548 }}</ref> extended the NTK results to other neural network architectures. In fact, the phenomenon behind NTK is not specific to neural networks and can be observed in generic nonlinear models, usually by a suitable scaling<ref>{{Citation |last1=Chizat |first1=Lénaïc |title=On lazy training in differentiable programming |date=2019-12-08 |url=https://dl.acm.org/doi/abs/10.5555/3454287.3454551 |work=Proceedings of the 33rd International Conference on Neural Information Processing Systems |pages=2937–2947 |access-date=2023-05-11 |place=Red Hook, NY, USA |publisher=Curran Associates Inc. |arxiv=1812.07956 |last2=Oyallon |first2=Edouard |last3=Bach |first3=Francis}}</ref>.{{Toclimit|3}} |
||
== |
== Main results (informal) == |
||
Let <math>f(x;\theta )</math> denote the scalar function computed by a given neural network with parameters <math>\theta</math> on input <math>x</math>. Then the neural tangent kernel is defined<ref name=":0" /> as<math display="block">\Theta (x,x';\theta )=\nabla _{\theta }f(x;\theta )\cdot \nabla _{\theta }f(x';\theta ).</math>Since it is written as a [[dot product]] between mapped inputs (with the [[gradient]] of the neural network function serving as the feature map), we are guaranteed that the NTK is [[Symmetric function|symmetric]] and [[Positive-semidefinite function|positive semi-definite]]. The NTK is thus a valid kernel function. |
|||
Consider a [[Multilayer perceptron|fully connected neural network]] whose parameters are chosen [[Independent and identically distributed random variables|i.i.d.]] according to any mean-zero distribution. This random initialization of <math>\theta</math> induces a distribution over <math>f(x;\theta )</math> whose statistics we will analyze, both at initialization and throughout training (gradient descent on a specified dataset). We can visualize this distribution via a neural network ensemble which is constructed by drawing many times from the initial distribution over <math>f(x;\theta )</math> and training each draw according to the same training procedure. |
|||
=== Scalar output case === |
|||
[[File:NTK evolution.gif|thumb|380x380px|At initialization, an ensemble of wide neural networks is a zero-mean [[Gaussian process]]; during training ([[gradient descent]] on [[Mean squared error|mean-square error]]), the ensemble evolves according to the neural tangent kernel. The converged ensemble is a Gaussian process whose mean is the ridgeless [[Kernel method|kernel regression]] estimator and whose variance vanishes on the training points. Here, the neural network is a scalar function trained on inputs drawn from the unit circle.]] |
|||
⚫ | |||
The number of neurons in each layer is called the layer’s width. Consider taking the width of every hidden layer to infinity and training the neural network with [[gradient descent]] (with a suitably small [[learning rate]]). In this [[Large width limits of neural networks|infinite-width limit]], several nice properties emerge: |
|||
* At initialization (before training), the neural network ensemble is a zero-mean [[Gaussian process]] (GP).<ref name=":4" /> This means that distribution of functions is the [[Maximum entropy distribution|maximum-entropy distribution]] with mean <math>\mathbb {E} _{\theta }[f(x;\theta )]=0</math> and covariance <math>\mathbb {E} _{\theta }[f(x;\theta )f(x';\theta )]=\Sigma (x,x')</math>, where the GP covariance <math>\Sigma (x,x')</math> can be computed from the network architecture. In other words, the distribution of neural network functions at initialization has no structure other than its first and second moments (mean and covariance). This follows from the central limit theorem. |
|||
The Neural Tangent Kernel (NTK) is a kernel <math>\Theta:\mathbb{R}^{n_{\mathrm{in}}}\times\mathbb{R}^{n_{\mathrm{in}}}\to\mathbb{R}</math> defined by<math display="block">\Theta\left(x,y;\theta\right)=\sum_{p=1}^{P}\partial_{\theta_{p}}f\left(x;\theta\right)\partial_{\theta_{p}}f\left(y;\theta\right).</math>In the language of [[kernel method]]s, the NTK <math>\Theta</math> is the kernel associated with the [[Feature (machine learning)|feature map]] <math>\left(x\mapsto\partial_{\theta_{p}}f\left(x;\theta\right)\right)_{p=1,\ldots,P}</math>. |
|||
⚫ | * The NTK is deterministic.<ref name=":0" /><ref name="Lee">{{Cite journal |last1=Lee |first1=Jaehoon |last2=Xiao |first2=Lechao |last3=Schoenholz |first3=Samuel S. |last4=Bahri |first4=Yasaman |last5=Novak |first5=Roman |last6=Sohl-Dickstein |first6=Jascha |last7=Pennington |first7=Jeffrey |year=2020 |title=Wide neural networks of any depth evolve as linear models under gradient descent |journal=Journal of Statistical Mechanics: Theory and Experiment |volume=2020 |issue=12 |page=124002 |arxiv=1902.06720 |bibcode=2020JSMTE2020l4002L |doi=10.1088/1742-5468/abc62b |s2cid=62841516}}</ref> In other words, the NTK is independent of the random parameter initialization. |
||
* The NTK does not change during training.<ref name=":0" /><ref name="Lee" /> |
|||
* Each parameter changes negligibly throughout training. As Lee ''et al.''<ref name="Lee" /> note, "although individual parameters move by a vanishingly small amount, they collectively conspire to provide a finite change in the final output of the network, as is necessary for training." |
|||
* During training, the neural network is linearized, i.e., its parameter dependence can be captured by its first-order [[Taylor expansion]]: <math>f(x;\theta_0 +\Delta \theta )=f(x;\theta_0 )+\Delta \theta \cdot \nabla _{\theta }f(x;\theta_0 )</math>, where <math>\theta_0</math> are the initial parameters.<ref name="Lee" /> This follows from the fact that each parameter changes negligibly during training. (The neural network remains nonlinear with respect to the inputs.) |
|||
* The training dynamics are equivalent to [[Kernel method|kernel gradient descent]] using the NTK as the kernel.<ref name=":0" /> If the [[loss function]] is [[Mean squared error|mean-squared error]], the final distribution over <math>f(x;\theta )</math> is still a Gaussian process, but with a new mean and covariance.<ref name=":0" /><ref name="Lee" /> In particular, the mean converges to the same estimator yielded by kernel regression with the NTK as kernel and zero [[Regularization (mathematics)|ridge regularization]], and the covariance is expressible in terms of the NTK and the initial GP covariance. It can be shown that the ensemble variance vanishes at the training points (in other words, the neural network always interpolates the training data, regardless of initialization). |
|||
From a physics point of view, the NTK can be understood as a type of [[Hamiltonian field theory|Hamiltonian]], since it generates the time-evolution of observables when the neural network is trained by gradient descent with infinitesimally small steps (the continuum limit).<ref>{{Cite book |last=Roberts |first=Daniel A. |title=The principles of deep learning theory: an effective theory approach to understanding neural networks |last2=Yaida |first2=Sho |date=2022 |publisher=Cambridge University Press |others=Boris Hanin |year=2022 |isbn=978-1-316-51933-2 |location=Cambridge New York, NY Port Melbourne, VIC New Delhi Singapore |pages=360 |chapter=∞. The End of Training}}</ref> |
|||
⚫ | |||
=== Vector output case === |
|||
⚫ | |||
=== Ridgeless kernel regression and kernel gradient descent === |
|||
In this case, the Neural Tangent Kernel <math>\Theta:\mathbb{R}^{n_{\mathrm{in}}}\times\mathbb{R}^{n_{\mathrm{in}}}\to\mathcal{M}_{n_{\mathrm{out}}}\left(\mathbb{R}\right)</math> is a [[Kernel methods for vector output|matrix-valued kernel]], with values in the space of <math>n_{\mathrm{out}}\times n_{\mathrm{out}}</math> matrices, defined by<math display="block">\Theta_{k,l}\left(x,y;\theta\right)=\sum_{p=1}^{P}\partial_{\theta_{p}}f_{k}\left(x;\theta\right)\partial_{\theta_{p}}f_{l}\left(y;\theta\right).</math> |
|||
[[Kernel method|Kernel methods]] are machine learning algorithms which use only pairwise relations between input points. Kernel methods do not depend on the concrete values of the inputs; they only depend on the relations between the inputs and other inputs (such as the training set). These pairwise relations are fully captured by the kernel function: a [[Symmetric function|symmetric]], [[Positive-semidefinite function|positive-semidefinite]] function of two inputs which represents some notion of similarity between the two inputs. A fully equivalent condition is that there exists some feature map <math>{\mathbf {x}}\mapsto \psi ({\mathbf {x}})</math> such that the kernel function can be written as a [[dot product]] of the mapped inputs<math display="block">K({\mathbf {x}},{\mathbf {x}}')=\psi ({\mathbf {x}})\cdot \psi ({\mathbf {x}}').</math>The properties of a kernel method depend on the choice of kernel function. (Note that <math>\psi ({\mathbf {x}})</math> may have higher dimension than <math>\mathbf{x}</math>.) As a relevant example, consider [[linear regression]]. This is the task of estimating <math>{\mathbf {w}}^{*}</math> given <math>N</math> samples <math>({\mathbf {x}}_{i},y_{i})</math> generated from <math>y^{*}({\mathbf {x}})={\mathbf {w}}^{*}\cdot {\mathbf {x}}</math>, where each <math>\mathbf {x}_{i}</math> is drawn according to some input data distribution. In this setup, <math>{\mathbf {w}}^{*}</math> is the weight vector which defines the true function <math>y^{*}</math>; we wish to use the training samples to develop a model <math>\mathbf {\hat {w}}</math> which approximates <math>{\mathbf {w}}^{*}</math>. We do this by minimizing the [[Mean squared error|mean-square error]] between our model and the training samples:<math display="block">{\mathbf {\hat {w}}}=\arg \min _{\mathbf {w}}{\frac {1}{N}}\sum_{i=0}^{N}||y^{*}({\mathbf {x}}_{i})-{\mathbf {w}}\cdot {\mathbf {x}}_{i}||^{2}</math>There exists an explicit solution for <math>\mathbf {\hat {w}}</math> which minimizes the squared error: <math>{\mathbf {\hat {w}}}=({\mathbf {X}}{\mathbf {X}}^{T})^{-1}{\mathbf {X}}{\mathbf {y}}</math>, where <math>{\mathbf {X}}</math> is the matrix whose columns are the training inputs, and <math>{\mathbf {y}}</math> is the vector of training outputs. Then, the model can make predictions on new inputs: <math>{\hat {y}}({\mathbf {x}})={\mathbf {\hat {w}}}\cdot {\mathbf {x}}</math>. |
|||
However, this result can be rewritten as: <math>{\hat {y}}({\mathbf {x}})=({\mathbf {x}}^{T}{\mathbf {X}})({\mathbf {X}}^{T}{\mathbf {X}})^{-1}{\mathbf {y}}</math>.<ref>{{Cite book |last1=Shawe-Taylor |first1=John |url=http://dx.doi.org/10.1017/cbo9780511809682 |title=Kernel Methods for Pattern Analysis |last2=Cristianini |first2=Nello |date=2004-06-28 |publisher=Cambridge University Press |doi=10.1017/cbo9780511809682 |isbn=978-0-521-81397-6}}</ref> Note that this dual solution is expressed solely in terms of the inner products between inputs. This motivates extending linear regression to settings in which, instead of directly taking inner products between inputs, we first transform the inputs according to a chosen feature map and then evaluate the inner products between the transformed inputs. As discussed above, this can be captured by a kernel function <math>K({\mathbf {x}},{\mathbf {x}}')</math>, since all kernel functions are inner products of feature-mapped inputs. This yields the ridgeless kernel regression estimator:<math display="block">{\hat {y}}({\mathbf {x}})=K({\mathbf {x}},{\mathbf {X}})\;K({\mathbf {X}},{\mathbf {X}})^{-1}\;{\mathbf {y}}.</math>If the kernel matrix <math>K({\mathbf {X}},{\mathbf {X}})</math> is [[Singular matrices|singular]], one uses the [[Moore-Penrose pseudoinverse]]. The regression equations are called "ridgeless" because they lack a ridge [[Regularization (mathematics)|regularization]] term. |
|||
== Derivation == |
|||
In this view, linear regression is a special case of kernel regression with the identity feature map: <math>\psi ({\mathbf {x}})={\mathbf {x}}</math>. Equivalently, kernel regression is simply linear regression in the feature space (i.e. the range of the feature map defined by the chosen kernel). Note that kernel regression is typically a ''nonlinear'' regression in the input space, which is a major strength of the algorithm. |
|||
Just as it’s possible to perform linear regression using iterative optimization algorithms such as gradient descent, one can perform kernel regression using kernel gradient descent. This is equivalent to performing gradient descent in the feature space. It’s known that if the weight vector is initialized close to zero, least-squares gradient descent converges to the minimum-norm solution, i.e., the final weight vector has the minimum [[Euclidean norm]] of all the interpolating solutions. In the same way, kernel gradient descent yields the minimum-norm solution with respect to the [[Reproducing kernel Hilbert space|RKHS norm]]. This is an example of the implicit regularization of gradient descent. |
|||
⚫ | The NTK gives a rigorous connection between the inference performed by infinite-width ANNs and that performed by [[Kernel method|kernel methods]]: when the loss function is the [[Quadratic loss function|least-squares loss]], the inference performed by an ANN is in expectation equal to ridgeless kernel regression with respect to the NTK. This suggests that the performance of large ANNs in the NTK parametrization can be replicated by kernel methods for suitably chosen kernels.<ref name=":0" /><ref name=":3" /> |
||
=== Overparametrization, interpolation, and generalization === |
|||
In overparametrized models, the number of tunable parameters exceeds the number of training samples. In this case, the model is able to memorize (perfectly fit) the training data. Therefore, overparametrized models interpolate the training data, achieving essentially zero training error.<ref name=":6">{{Cite arXiv |last=Belkin |first=Mikhail |date=2021-05-29 |title=Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation |class=stat.ML |eprint=2105.14368 }}</ref> |
|||
[[File:Double descent.png|thumb|345x345px|Modern overparametrized models achieve low generalization error despite having the capacity to interpolate (memorize) the training set.<ref name=":6" /> This phenomenon can be understood by studying the generalization properties of high-dimensional kernel regression.]] |
|||
Kernel regression is typically viewed as a non-parametric learning algorithm, since there are no explicit parameters to tune once a kernel function has been chosen. An alternate view is to recall that kernel regression is simply linear regression in feature space, so the “effective” number of parameters is the dimension of the feature space. Therefore, studying kernels with high-dimensional feature maps can provide insights about strongly overparametrized models. |
|||
As an example, consider the problem of generalization. According to classical statistics, memorization should cause models to fit noisy signals in the training data, harming their performance on unseen data. To mitigate this, machine learning algorithms often introduce regularization to mitigate noise-fitting tendencies. Surprisingly, modern neural networks (which tend to be strongly overparametrized) seem to generalize well, even in the absence of explicit regularization.<ref name=":6" /><ref>{{Cite arXiv|last1=Novak |first1=Roman |last2=Bahri |first2=Yasaman |last3=Abolafia |first3=Daniel A. |last4=Pennington |first4=Jeffrey |last5=Sohl-Dickstein |first5=Jascha |date=2018-02-15 |title=Sensitivity and Generalization in Neural Networks: an Empirical Study |class=stat.ML |eprint=1802.08760 }}</ref> To study the generalization properties of overparametrized neural networks, one can exploit the infinite-width duality with ridgeless kernel regression. Recent works<ref>{{Cite arXiv |last1=Jacot |first1=Arthur |last2=Şimşek |first2=Berfin |last3=Spadaro |first3=Francesco |last4=Hongler |first4=Clément |last5=Gabriel |first5=Franck |date=2020-06-17 |title=Kernel Alignment Risk Estimator: Risk Prediction from Training Data |class=stat.ML |eprint=2006.09796 }}</ref><ref>{{Cite journal |last1=Canatar |first1=Abdulkadir |last2=Bordelon |first2=Blake |last3=Pehlevan |first3=Cengiz |date=2021-05-18 |title=Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks |journal=Nature Communications |language=en |volume=12 |issue=1 |pages=2914 |doi=10.1038/s41467-021-23103-1 |issn=2041-1723 |pmc=8131612 |pmid=34006842|arxiv=2006.13198 |bibcode=2021NatCo..12.2914C }}</ref><ref>{{Cite arXiv |last1=Simon |first1=James B. |last2=Dickens |first2=Madeline |last3=Karkada |first3=Dhruva |last4=DeWeese |first4=Michael R. |date=2022-10-12 |title=The Eigenlearning Framework: A Conservation Law Perspective on Kernel Regression and Wide Neural Networks |class=cs.LG |eprint=2110.03922 }}</ref> have derived equations describing the expected generalization error of high-dimensional kernel regression; these results immediately explain the generalization of sufficiently wide neural networks trained to convergence on least-squares. |
|||
⚫ | |||
For a [[Convex function|convex]] loss functional <math>{\mathcal {C}}</math> with a [[global minimum]], if the NTK remains [[Positive-definite kernel|positive-definite]] during training, the loss of the ANN <math>{\mathcal {C}}\left(f\left(\cdot;\theta \left(t\right)\right)\right)</math> converges to that minimum as <math>t\to \infty</math>. This positive-definiteness property has been shown in a number of cases, yielding the first proofs that large-width ANNs converge to global minima during training.<ref name=":0" /><ref name=":2">{{cite arXiv |last1=Allen-Zhu |first1=Zeyuan |last2=Li |first2=Yuanzhi |last3=Song |first3=Zhao |date=2018 |title=A convergence theory for deep learning via overparameterization |class=cs.LG |eprint=1811.03962}}</ref><ref>{{cite arXiv|last1=Du |first1=Simon S |last2=Zhai |first2=Xiyu |last3=Poczos |first3=Barnabas |last4=Aarti |first4=Singh |date=2019 |title=Gradient descent provably optimizes over-parameterized neural networks |class=cs.LG |eprint=1810.02054}}</ref><ref>{{cite journal |last1=Zou |first1=Difan |last2=Cao |first2=Yuan |last3=Zhou |first3=Dongruo |last4=Gu |first4=Quanquan |date=2020 |title=Gradient descent optimizes over-parameterized deep ReLU networks |url=https://link.springer.com/article/10.1007/s10994-019-05839-6 |journal=Machine Learning |volume=109 |issue=3 |pages=467–492|doi=10.1007/s10994-019-05839-6 |s2cid=53752874 |doi-access=free }}</ref><ref>{{Cite arXiv |last1=Allen-Zhu |first1=Zeyuan |last2=Li |first2=Yuanzhi |last3=Song |first3=Zhao |date=2019-05-27 |title=On the Convergence Rate of Training Recurrent Neural Networks |class=cs.LG |eprint=1810.12065 }}</ref><ref name=":5">{{Cite arXiv|last1=Du |first1=Simon |last2=Lee |first2=Jason |last3=Li |first3=Haochuan |last4=Wang |first4=Liwei |last5=Zhai |first5=Xiyu |date=2019-05-24 |title=Gradient Descent Finds Global Minima of Deep Neural Networks |language=en |pages=1675–1685 |class=cs.LG |eprint=1811.03804}}</ref> |
|||
== Extensions and limitations == |
|||
⚫ | The NTK can be studied for various [[Artificial neural network|ANN architectures]],<ref name=":3" /> in particular [[convolutional neural networks]] (CNNs),<ref>{{cite arXiv |eprint=1902.04760 |class=cs.NE |first=Greg |last=Yang |title=Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation |date=2019-02-13}}</ref> [[recurrent neural networks]] (RNNs) and [[Transformer (machine learning model)|transformers]].<ref>{{Cite arXiv |last1=Hron |first1=Jiri |last2=Bahri |first2=Yasaman |last3=Sohl-Dickstein |first3=Jascha |last4=Novak |first4=Roman |date=2020-06-18 |title=Infinite attention: NNGP and NTK for deep attention networks |class=stat.ML |eprint=2006.10540 }}</ref> In such settings, the large-width limit corresponds to letting the number of parameters grow, while keeping the number of layers fixed: for [[Convolutional neural network|CNNs]], this involves letting the number of channels grow. |
||
Individual parameters of a wide neural network in the kernel regime change negligibly during training. However, this implies that infinite-width neural networks cannot exhibit [[feature learning]], which is widely considered to be an important property of realistic deep neural networks. This is not a generic feature of infinite-width neural networks and is largely due to a specific choice of the scaling by which the width is taken to the infinite limit; indeed several works<ref>{{Cite journal |last1=Mei |first1=Song |last2=Montanari |first2=Andrea |last3=Nguyen |first3=Phan-Minh |date=2018-08-14 |title=A mean field view of the landscape of two-layer neural networks |journal=Proceedings of the National Academy of Sciences |language=en |volume=115 |issue=33 |pages=E7665–E7671 |doi=10.1073/pnas.1806579115 |issn=0027-8424 |pmc=6099898 |pmid=30054315 |arxiv=1804.06561 |bibcode=2018PNAS..115E7665M |doi-access=free }}</ref><ref>{{Cite journal |last1=Chizat |first1=Lénaïc |last2=Bach |first2=Francis |date=2018-12-03 |title=On the global convergence of gradient descent for over-parameterized models using optimal transport |url=https://dl.acm.org/doi/10.5555/3327144.3327226 |journal=Proceedings of the 32nd International Conference on Neural Information Processing Systems |series=NIPS'18 |location=Red Hook, NY, USA |publisher=Curran Associates Inc. |pages=3040–3050 |arxiv=1805.09545 }}</ref><ref>{{Cite arXiv |last1=Nguyen |first1=Phan-Minh |last2=Pham |first2=Huy Tuan |date=2020-01-30 |title=A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks |class=cs.LG |eprint=2001.11443 }}</ref><ref>{{Cite arXiv |last1=Yang |first1=Greg |last2=Hu |first2=Edward J. |date=2022-07-15 |title=Feature Learning in Infinite-Width Neural Networks |class=cs.LG |eprint=2011.14522 }}</ref> have found alternate infinite-width scaling limits of neural networks in which there is no duality with kernel regression and feature learning occurs during training. Others<ref>{{cite arXiv |eprint=1909.08156 |class=cs.LG |first1=Jiaoyang |last1=Huang |first2=Horng-Tzer |last2=Yau |title=Dynamics of Deep Neural Networks and Neural Tangent Hierarchy |date=2019-09-17}}</ref> introduce a "neural tangent hierarchy" to describe finite-width effects, which may drive feature learning. |
|||
⚫ | Neural Tangents is a [[free and open-source]] [[Python (programming language)|Python]] library used for computing and doing inference with the infinite width NTK and [[neural network Gaussian process]] (NNGP) corresponding to various common ANN architectures.<ref>{{Citation |last1=Novak |first1=Roman |title=Neural Tangents: Fast and Easy Infinite Neural Networks in Python |date=2019-12-05 |work=International Conference on Learning Representations (ICLR) |volume=2020 |arxiv=1912.02803 |bibcode=2019arXiv191202803N |last2=Xiao |first2=Lechao |last3=Hron |first3=Jiri |last4=Lee |first4=Jaehoon |last5=Alemi |first5=Alexander A. |last6=Sohl-Dickstein |first6=Jascha |last7=Schoenholz |first7=Samuel S.}}</ref> In addition, there exists a [[scikit-learn]] compatible implementation of the infinite width NTK for [[Gaussian process|Gaussian processes]] called [https://github.com/392781/scikit-ntk scikit-ntk].<ref>{{Cite arXiv|last=Lencevicius |first=Ronaldas Paulius |date=2022 |title=An Empirical Analysis of the Laplace and Neural Tangent Kernels |class=stat.ML |eprint=2208.03761}}</ref> |
||
== Details == |
|||
When optimizing the parameters <math>\theta\in\mathbb{R}^{P}</math> of an ANN to minimize an empirical loss through [[gradient descent]], the NTK governs the dynamics of the ANN output function <math>f_{\theta}</math> throughout the training. |
When optimizing the parameters <math>\theta\in\mathbb{R}^{P}</math> of an ANN to minimize an empirical loss through [[gradient descent]], the NTK governs the dynamics of the ANN output function <math>f_{\theta}</math> throughout the training. |
||
=== Scalar output |
=== Case 1: Scalar output === |
||
⚫ | |||
⚫ | |||
⚫ | The NTK is a kernel <math>\Theta:\mathbb{R}^{n_{\mathrm{in}}}\times\mathbb{R}^{n_{\mathrm{in}}}\to\mathbb{R}</math> defined by<math display="block">\Theta\left(x,y;\theta\right)=\sum_{p=1}^{P}\partial_{\theta_{p}}f\left(x;\theta\right)\partial_{\theta_{p}}f\left(y;\theta\right).</math>In the language of [[kernel method]]s, the NTK <math>\Theta</math> is the kernel associated with the [[Feature (machine learning)|feature map]] <math>\left(x\mapsto\partial_{\theta_{p}}f\left(x;\theta\right)\right)_{p=1,\ldots,P}</math>. To see how this kernel drives the training dynamics of the ANN, consider a [[Dataset (machine learning)|dataset]] <math>\left(x_{i}\right)_{i=1,\ldots,n}\subset\mathbb{R}^{n_{\mathrm{in}}}</math> with scalar labels <math>\left(z_{i}\right)_{i=1,\ldots,n}\subset\mathbb{R}</math> and a [[loss function]] <math>c:\mathbb{R}\times\mathbb{R}\to\mathbb{R}</math>. Then the associated empirical loss, defined on functions <math>f:\mathbb{R}^{n_{\mathrm{in}}}\to\mathbb{R}</math>, is given by<math display="block">\mathcal{C}\left(f\right)=\sum_{i=1}^{n}c\left(f\left(x_{i}\right),z_{i}\right).</math>When the ANN <math>f\left(\cdot;\theta\right):\mathbb{R}^{n_{\mathrm{in}}}\to\mathbb{R}</math> is trained to fit the dataset (i.e. minimize <math>\mathcal{C}</math>) via continuous-time gradient descent, the parameters <math>\left(\theta\left(t\right)\right)_{t\geq0}</math> evolve through the [[ordinary differential equation]]: |
||
:<math display="block">\partial_{t}\theta\left(t\right)=-\nabla\mathcal{C}\left(f\left(\cdot;\theta\right)\right).</math> |
:<math display="block">\partial_{t}\theta\left(t\right)=-\nabla\mathcal{C}\left(f\left(\cdot;\theta\right)\right).</math> |
||
Line 28: | Line 63: | ||
During training the ANN output function follows an evolution differential equation given in terms of the NTK: |
During training the ANN output function follows an evolution differential equation given in terms of the NTK: |
||
:<math>\partial_{t}f\left(x;\theta\left(t\right)\right)=-\sum_{i=1}^{n}\Theta\left(x,x_{i};\theta\right)\partial_{w}c\left(w,z_{i}\right)\Big|_{w=f\left(x_{i};\theta\left(t\right)\right)}.</math> |
:<math>\partial_{t}f\left(x;\theta\left(t\right)\right)=-\sum_{i=1}^{n}\Theta\left(x,x_{i};\theta\right)\partial_{w}c\left(w,z_{i}\right)\Big|_{w=f\left(x_{i};\theta\left(t\right)\right)}.</math> |
||
This equation shows how the NTK drives the dynamics of <math>f\left(\cdot;\theta\left(t\right)\right)</math> in the space of functions <math>\mathbb{R}^{n_{\mathrm{in}}}\to\mathbb{R}</math> during training. |
This equation shows how the NTK drives the dynamics of <math>f\left(\cdot;\theta\left(t\right)\right)</math> in the space of functions <math>\mathbb{R}^{n_{\mathrm{in}}}\to\mathbb{R}</math> during training. |
||
=== Vector output |
=== Case 2: Vector output === |
||
⚫ | |||
⚫ | |||
⚫ | In this case, the NTK <math>\Theta:\mathbb{R}^{n_{\mathrm{in}}}\times\mathbb{R}^{n_{\mathrm{in}}}\to\mathcal{M}_{n_{\mathrm{out}}}\left(\mathbb{R}\right)</math> is a [[Kernel methods for vector output|matrix-valued kernel]], with values in the space of <math>n_{\mathrm{out}}\times n_{\mathrm{out}}</math> matrices, defined by<math display="block">\Theta_{k,l}\left(x,y;\theta\right)=\sum_{p=1}^{P}\partial_{\theta_{p}}f_{k}\left(x;\theta\right)\partial_{\theta_{p}}f_{l}\left(y;\theta\right).</math>[[Empirical risk minimization]] proceeds as in the scalar case, with the difference being that the loss function takes vector inputs <math>c:\mathbb{R}^{n_{\mathrm{out}}}\times\mathbb{R}^{n_{\mathrm{out}}}\to\mathbb{R}</math>. The training of <math>f_{\theta\left(t\right)}</math> through continuous-time gradient descent yields the following evolution in function space driven by the NTK:<math display="block">\partial_{t}f_{k}\left(x;\theta\left(t\right)\right)=-\sum_{i=1}^{n}\sum_{l=1}^{n_{\mathrm{out}}}\Theta_{k,l}\left(x,x_{i};\theta\right)\partial_{w_{l}}c\left(\left(w_{1},\ldots,w_{n_{\mathrm{out}}}\right),z_{i}\right)\Big|_{w=f\left(x_{i};\theta\left(t\right)\right)}.</math>This generalizes the equation shown in case 1 for scalar outputs. |
||
==== Interpretation ==== |
|||
⚫ | |||
== |
=== Interpretation === |
||
⚫ | Each data point <math>x_{i}</math> influences the evolution, of the output <math>f\left(x;\theta\right)</math> for each input <math>x</math>, throughout the training. More concretely, with respect to example <math>i</math>, the NTK value <math>\Theta\left(x,x_{i};\theta\right)</math> determines the influence of the loss gradient <math>\partial_{w}c\left(w,z_{i}\right)\big|_{w=f\left(x_{i};\theta\right)}</math> on the evolution of ANN output <math>f\left(x;\theta\right)</math> through a gradient descent step. In the scalar case, this reads<math display="block">f\left(x;\theta\left(t+\epsilon\right)\right)-f\left(x;\theta\left(t\right)\right)\approx\epsilon\sum_{i=1}^{n}\Theta\left(x,x_{i};\theta\left(t\right)\right)\partial_{w}c\left(w,z_{i}\right)\big|_{w=f\left(x_{i};\theta\right)}.</math> |
||
Recent theoretical and empirical work in Deep Learning has shown the performance of ANNs to strictly improve as their layer widths grow larger. |
|||
<ref>{{Cite journal|last=Novak|first=Roman|last2=Bahri|first2=Yasaman|last3=Abolafia|first3=Daniel A.|last4=Pennington|first4=Jeffrey|last5=Sohl-Dickstein|first5=Jascha|date=2018-02-15|title=Sensitivity and Generalization in Neural Networks: an Empirical Study|url=https://openreview.net/forum?id=HJC2SzZCW|bibcode=2018arXiv180208760N|arxiv=1802.08760}}</ref> |
|||
<ref>{{Cite journal|last=Canziani|first=Alfredo|last2=Paszke|first2=Adam|last3=Culurciello|first3=Eugenio|date=2016-11-04|title=An Analysis of Deep Neural Network Models for Practical Applications|url=https://openreview.net/forum?id=Bygq-H9eg|bibcode=2016arXiv160507678C|arxiv=1605.07678}}</ref> For various [[Artificial neural network|ANN architectures]], the NTK yields precise insight into the training in this large-width regime. |
|||
<ref name=":0"></ref> |
|||
<ref name=":2">{{Cite journal|last=Allen-Zhu|first=Zeyuan|last2=Li|first2=Yuanzhi|last3=Song|first3=Zhao|date=2018-11-09|title=A Convergence Theory for Deep Learning via Over-Parameterization|journal=International Conference on Machine Learning|language=en|pages=242–252|arxiv=1811.03962}}</ref> |
|||
<ref name=":5">{{Cite journal|last=Du|first=Simon|last2=Lee|first2=Jason|last3=Li|first3=Haochuan|last4=Wang|first4=Liwei|last5=Zhai|first5=Xiyu|date=2019-05-24|title=Gradient Descent Finds Global Minima of Deep Neural Networks|journal=International Conference on Machine Learning|language=en|pages=1675–1685|arxiv=1811.03804}}</ref> |
|||
⚫ | <ref name="Lee">{{Cite journal| |
||
<ref name=":1">{{Citation|last=Arora|first=Sanjeev|title=On Exact Computation with an Infinitely Wide Neural Net|date=2019| journal = NeurIPS |pages=8139–8148|last2=Du|first2=Simon S|last3=Hu|first3=Wei|last4=Li|first4=Zhiyuan|last5=Salakhutdinov|first5=Russ R|last6=Wang|first6=Ruosong|arxiv=1904.11955}}</ref> |
|||
<ref>{{cite arxiv|last=Huang|first=Jiaoyang|last2=Yau|first2=Horng-Tzer|date=2019-09-17|title=Dynamics of Deep Neural Networks and Neural Tangent Hierarchy| arxiv =1909.08156}}</ref> |
|||
=== Wide fully-connected ANNs have a deterministic NTK, which remains constant throughout training === |
=== Wide fully-connected ANNs have a deterministic NTK, which remains constant throughout training === |
||
Consider an ANN with [[Artificial neural network|fully-connected]] layers <math>\ell=0,\ldots,L</math> of widths <math>n_{0}=n_{\mathrm{in}},n_{1},\ldots,n_{L}=n_{\mathrm{out}}</math>, so that <math>f\left(\cdot;\theta\right)=R_{L-1}\circ\cdots\circ R_{0}</math>, where <math>R_{\ell}=\sigma\circ A_{\ell}</math> is the composition of an [[affine transformation]] <math>A_{i}</math> with the pointwise application of a [[Activation function|nonlinearity]] <math>\sigma:\mathbb{R}\to\mathbb{R}</math>, where <math>\theta</math> parametrizes the maps <math>A_{0},\ldots,A_{L-1}</math>. The parameters <math>\theta\in\mathbb{R}^{P}</math> are initialized randomly, in an [[Independent and identically distributed random variables|independent identically distributed]] way. |
Consider an ANN with [[Artificial neural network|fully-connected]] layers <math>\ell=0,\ldots,L</math> of widths <math>n_{0}=n_{\mathrm{in}},n_{1},\ldots,n_{L}=n_{\mathrm{out}}</math>, so that <math>f\left(\cdot;\theta\right)=R_{L-1}\circ\cdots\circ R_{0}</math>, where <math>R_{\ell}=\sigma\circ A_{\ell}</math> is the composition of an [[affine transformation]] <math>A_{i}</math> with the pointwise application of a [[Activation function|nonlinearity]] <math>\sigma:\mathbb{R}\to\mathbb{R}</math>, where <math>\theta</math> parametrizes the maps <math>A_{0},\ldots,A_{L-1}</math>. The parameters <math>\theta\in\mathbb{R}^{P}</math> are initialized randomly, in an [[Independent and identically distributed random variables|independent, identically distributed]] way. |
||
As the widths grow, the NTK's scale is affected by the exact parametrization of the <math>A_{i}</math>'s and by the parameter initialization. This motivates the so-called NTK parametrization <math>A_{\ell}\left(x\right)=\frac{1}{\sqrt{n_{\ell}}}W^{\left(\ell\right)}x+b^{\left(\ell\right)}</math>. This parametrization ensures that if the parameters <math>\theta\in\mathbb{R}^{P}</math> are initialized as [[Gaussian distribution|standard normal variables]], the NTK has a finite nontrivial limit. In the large-width limit, the NTK converges to a deterministic (non-random) limit <math>\Theta_{\infty}</math>, which stays constant in time. |
|||
The NTK <math>\Theta_{\infty}</math> is explicitly given by <math>\Theta_{\infty}=\Theta^{\left(L\right)}</math>, where <math>\Theta^{\left(L\right)}</math> is determined by the set of recursive equations: |
The NTK <math>\Theta_{\infty}</math> is explicitly given by <math>\Theta_{\infty}=\Theta^{\left(L\right)}</math>, where <math>\Theta^{\left(L\right)}</math> is determined by the set of recursive equations: |
||
Line 70: | Line 96: | ||
\end{pmatrix}\right)}\left[f\left(X\right)f\left(Y\right)\right].</math> |
\end{pmatrix}\right)}\left[f\left(X\right)f\left(Y\right)\right].</math> |
||
In this formula the kernels <math>\Sigma^{\left(\ell\right)}</math> are the so-called activation kernels |
In this formula the kernels <math>\Sigma^{\left(\ell\right)}</math> are the ANN's so-called activation kernels.<ref>{{Citation|last1=Cho|first1=Youngmin|title=Kernel Methods for Deep Learning|date=2009|url=http://papers.nips.cc/paper/3628-kernel-methods-for-deep-learning.pdf|work=Advances in Neural Information Processing Systems 22|pages=342–350|editor-last=Bengio|editor-first=Y.|publisher=Curran Associates, Inc.|access-date=2019-11-27|last2=Saul|first2=Lawrence K.|editor2-last=Schuurmans|editor2-first=D.|editor3-last=Lafferty|editor3-first=J. D.|editor4-last=Williams|editor4-first=C. K. I.}}</ref><ref>{{Citation|last1=Daniely|first1=Amit|title=Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity|date=2016|url=http://papers.nips.cc/paper/6427-toward-deeper-understanding-of-neural-networks-the-power-of-initialization-and-a-dual-view-on-expressivity.pdf|work=Advances in Neural Information Processing Systems 29|pages=2253–2261|editor-last=Lee|editor-first=D. D.|publisher=Curran Associates, Inc.|access-date=2019-11-27|last2=Frostig|first2=Roy|last3=Singer|first3=Yoram|editor2-last=Sugiyama|editor2-first=M.|editor3-last=Luxburg|editor3-first=U. V.|editor4-last=Guyon|editor4-first=I.|bibcode=2016arXiv160205897D|arxiv=1602.05897}}</ref><ref name=":4">{{Cite journal|last1=Lee|first1=Jaehoon|last2=Bahri|first2=Yasaman|last3=Novak|first3=Roman|last4=Schoenholz|first4=Samuel S.|last5=Pennington|first5=Jeffrey|last6=Sohl-Dickstein|first6=Jascha|date=2018-02-15|title=Deep Neural Networks as Gaussian Processes|url=https://openreview.net/forum?id=B1EA-M-0Z}}</ref> |
||
=== Wide fully connected networks are linear in their parameters throughout training === |
=== Wide fully connected networks are linear in their parameters throughout training === |
||
Line 78: | Line 104: | ||
.</math> |
.</math> |
||
== See also == |
|||
⚫ | The NTK can be studied for various [[Artificial neural network|ANN architectures]]<ref name=": |
||
* [[Large width limits of neural networks]] |
|||
⚫ | |||
⚫ | |||
For a [[Convex function|convex]] loss functional <math>\mathcal{C}</math> with a [[global minimum]], if the NTK remains [[Positive-definite kernel|positive-definite]] during training, the loss of the ANN <math>\mathcal{C}\left(f\left(\cdot;\theta\left(t\right)\right)\right)</math> converges to that minimum as <math>t\to\infty</math>. This positive-definiteness property has been shown in a number of cases, yielding the first proofs that large-width ANNs converge to global minima during training.<ref name=":0" /><ref name=":2" /><ref name=":3">{{Cite journal|last=Allen-Zhu|first=Zeyuan|last2=Li|first2=Yuanzhi|last3=Song|first3=Zhao|date=2018-10-29|title=On the convergence rate of training recurrent neural networks|journal=NeurIPS|language=en|arxiv=1810.12065}}</ref> |
|||
=== Kernel methods === |
|||
⚫ | The NTK gives a rigorous connection between the inference performed by infinite-width ANNs and that performed by [[ |
||
== Software libraries == |
|||
⚫ | |||
== References == |
== References == |
||
Line 97: | Line 113: | ||
[[Category:Kernel methods for machine learning]] |
[[Category:Kernel methods for machine learning]] |
||
==External links== |
|||
*{{Cite web|first=Anil|last=Ananthaswamy|author-link=Anil Ananthaswamy|date=2021-10-11|title=A New Link to an Old Model Could Crack the Mystery of Deep Learning|url=https://www.quantamagazine.org/a-new-link-to-an-old-model-could-crack-the-mystery-of-deep-learning-20211011/|website=[[Quanta Magazine]]}} |
Latest revision as of 02:32, 5 October 2024
In the study of artificial neural networks (ANNs), the neural tangent kernel (NTK) is a kernel that describes the evolution of deep artificial neural networks during their training by gradient descent. It allows ANNs to be studied using theoretical tools from kernel methods.
In general, a kernel is a positive-semidefinite symmetric function of two inputs which represents some notion of similarity between the two inputs. The NTK is a specific kernel derived from a given neural network; in general, when the neural network parameters change during training, the NTK evolves as well. However, in the limit of large layer width the NTK becomes constant, revealing a duality between training the wide neural network and kernel methods: gradient descent in the infinite-width limit is fully equivalent to kernel gradient descent with the NTK. As a result, using gradient descent to minimize least-square loss for neural networks yields the same mean estimator as ridgeless kernel regression with the NTK. This duality enables simple closed form equations describing the training dynamics, generalization, and predictions of wide neural networks.
The NTK was introduced in 2018 by Arthur Jacot, Franck Gabriel and Clément Hongler,[1] who used it to study the convergence and generalization properties of fully connected neural networks. Later works[2][3] extended the NTK results to other neural network architectures. In fact, the phenomenon behind NTK is not specific to neural networks and can be observed in generic nonlinear models, usually by a suitable scaling[4].
Main results (informal)
[edit]Let denote the scalar function computed by a given neural network with parameters on input . Then the neural tangent kernel is defined[1] asSince it is written as a dot product between mapped inputs (with the gradient of the neural network function serving as the feature map), we are guaranteed that the NTK is symmetric and positive semi-definite. The NTK is thus a valid kernel function.
Consider a fully connected neural network whose parameters are chosen i.i.d. according to any mean-zero distribution. This random initialization of induces a distribution over whose statistics we will analyze, both at initialization and throughout training (gradient descent on a specified dataset). We can visualize this distribution via a neural network ensemble which is constructed by drawing many times from the initial distribution over and training each draw according to the same training procedure.
The number of neurons in each layer is called the layer’s width. Consider taking the width of every hidden layer to infinity and training the neural network with gradient descent (with a suitably small learning rate). In this infinite-width limit, several nice properties emerge:
- At initialization (before training), the neural network ensemble is a zero-mean Gaussian process (GP).[5] This means that distribution of functions is the maximum-entropy distribution with mean and covariance , where the GP covariance can be computed from the network architecture. In other words, the distribution of neural network functions at initialization has no structure other than its first and second moments (mean and covariance). This follows from the central limit theorem.
- The NTK is deterministic.[1][6] In other words, the NTK is independent of the random parameter initialization.
- The NTK does not change during training.[1][6]
- Each parameter changes negligibly throughout training. As Lee et al.[6] note, "although individual parameters move by a vanishingly small amount, they collectively conspire to provide a finite change in the final output of the network, as is necessary for training."
- During training, the neural network is linearized, i.e., its parameter dependence can be captured by its first-order Taylor expansion: , where are the initial parameters.[6] This follows from the fact that each parameter changes negligibly during training. (The neural network remains nonlinear with respect to the inputs.)
- The training dynamics are equivalent to kernel gradient descent using the NTK as the kernel.[1] If the loss function is mean-squared error, the final distribution over is still a Gaussian process, but with a new mean and covariance.[1][6] In particular, the mean converges to the same estimator yielded by kernel regression with the NTK as kernel and zero ridge regularization, and the covariance is expressible in terms of the NTK and the initial GP covariance. It can be shown that the ensemble variance vanishes at the training points (in other words, the neural network always interpolates the training data, regardless of initialization).
From a physics point of view, the NTK can be understood as a type of Hamiltonian, since it generates the time-evolution of observables when the neural network is trained by gradient descent with infinitesimally small steps (the continuum limit).[7]
Applications
[edit]Ridgeless kernel regression and kernel gradient descent
[edit]Kernel methods are machine learning algorithms which use only pairwise relations between input points. Kernel methods do not depend on the concrete values of the inputs; they only depend on the relations between the inputs and other inputs (such as the training set). These pairwise relations are fully captured by the kernel function: a symmetric, positive-semidefinite function of two inputs which represents some notion of similarity between the two inputs. A fully equivalent condition is that there exists some feature map such that the kernel function can be written as a dot product of the mapped inputsThe properties of a kernel method depend on the choice of kernel function. (Note that may have higher dimension than .) As a relevant example, consider linear regression. This is the task of estimating given samples generated from , where each is drawn according to some input data distribution. In this setup, is the weight vector which defines the true function ; we wish to use the training samples to develop a model which approximates . We do this by minimizing the mean-square error between our model and the training samples:There exists an explicit solution for which minimizes the squared error: , where is the matrix whose columns are the training inputs, and is the vector of training outputs. Then, the model can make predictions on new inputs: .
However, this result can be rewritten as: .[8] Note that this dual solution is expressed solely in terms of the inner products between inputs. This motivates extending linear regression to settings in which, instead of directly taking inner products between inputs, we first transform the inputs according to a chosen feature map and then evaluate the inner products between the transformed inputs. As discussed above, this can be captured by a kernel function , since all kernel functions are inner products of feature-mapped inputs. This yields the ridgeless kernel regression estimator:If the kernel matrix is singular, one uses the Moore-Penrose pseudoinverse. The regression equations are called "ridgeless" because they lack a ridge regularization term.
In this view, linear regression is a special case of kernel regression with the identity feature map: . Equivalently, kernel regression is simply linear regression in the feature space (i.e. the range of the feature map defined by the chosen kernel). Note that kernel regression is typically a nonlinear regression in the input space, which is a major strength of the algorithm.
Just as it’s possible to perform linear regression using iterative optimization algorithms such as gradient descent, one can perform kernel regression using kernel gradient descent. This is equivalent to performing gradient descent in the feature space. It’s known that if the weight vector is initialized close to zero, least-squares gradient descent converges to the minimum-norm solution, i.e., the final weight vector has the minimum Euclidean norm of all the interpolating solutions. In the same way, kernel gradient descent yields the minimum-norm solution with respect to the RKHS norm. This is an example of the implicit regularization of gradient descent.
The NTK gives a rigorous connection between the inference performed by infinite-width ANNs and that performed by kernel methods: when the loss function is the least-squares loss, the inference performed by an ANN is in expectation equal to ridgeless kernel regression with respect to the NTK. This suggests that the performance of large ANNs in the NTK parametrization can be replicated by kernel methods for suitably chosen kernels.[1][2]
Overparametrization, interpolation, and generalization
[edit]In overparametrized models, the number of tunable parameters exceeds the number of training samples. In this case, the model is able to memorize (perfectly fit) the training data. Therefore, overparametrized models interpolate the training data, achieving essentially zero training error.[9]
Kernel regression is typically viewed as a non-parametric learning algorithm, since there are no explicit parameters to tune once a kernel function has been chosen. An alternate view is to recall that kernel regression is simply linear regression in feature space, so the “effective” number of parameters is the dimension of the feature space. Therefore, studying kernels with high-dimensional feature maps can provide insights about strongly overparametrized models.
As an example, consider the problem of generalization. According to classical statistics, memorization should cause models to fit noisy signals in the training data, harming their performance on unseen data. To mitigate this, machine learning algorithms often introduce regularization to mitigate noise-fitting tendencies. Surprisingly, modern neural networks (which tend to be strongly overparametrized) seem to generalize well, even in the absence of explicit regularization.[9][10] To study the generalization properties of overparametrized neural networks, one can exploit the infinite-width duality with ridgeless kernel regression. Recent works[11][12][13] have derived equations describing the expected generalization error of high-dimensional kernel regression; these results immediately explain the generalization of sufficiently wide neural networks trained to convergence on least-squares.
Convergence to a global minimum
[edit]For a convex loss functional with a global minimum, if the NTK remains positive-definite during training, the loss of the ANN converges to that minimum as . This positive-definiteness property has been shown in a number of cases, yielding the first proofs that large-width ANNs converge to global minima during training.[1][14][15][16][17][18]
Extensions and limitations
[edit]The NTK can be studied for various ANN architectures,[2] in particular convolutional neural networks (CNNs),[19] recurrent neural networks (RNNs) and transformers.[20] In such settings, the large-width limit corresponds to letting the number of parameters grow, while keeping the number of layers fixed: for CNNs, this involves letting the number of channels grow.
Individual parameters of a wide neural network in the kernel regime change negligibly during training. However, this implies that infinite-width neural networks cannot exhibit feature learning, which is widely considered to be an important property of realistic deep neural networks. This is not a generic feature of infinite-width neural networks and is largely due to a specific choice of the scaling by which the width is taken to the infinite limit; indeed several works[21][22][23][24] have found alternate infinite-width scaling limits of neural networks in which there is no duality with kernel regression and feature learning occurs during training. Others[25] introduce a "neural tangent hierarchy" to describe finite-width effects, which may drive feature learning.
Neural Tangents is a free and open-source Python library used for computing and doing inference with the infinite width NTK and neural network Gaussian process (NNGP) corresponding to various common ANN architectures.[26] In addition, there exists a scikit-learn compatible implementation of the infinite width NTK for Gaussian processes called scikit-ntk.[27]
Details
[edit]When optimizing the parameters of an ANN to minimize an empirical loss through gradient descent, the NTK governs the dynamics of the ANN output function throughout the training.
Case 1: Scalar output
[edit]An ANN with scalar output consists of a family of functions parametrized by a vector of parameters .
The NTK is a kernel defined byIn the language of kernel methods, the NTK is the kernel associated with the feature map . To see how this kernel drives the training dynamics of the ANN, consider a dataset with scalar labels and a loss function . Then the associated empirical loss, defined on functions , is given byWhen the ANN is trained to fit the dataset (i.e. minimize ) via continuous-time gradient descent, the parameters evolve through the ordinary differential equation:
During training the ANN output function follows an evolution differential equation given in terms of the NTK:
This equation shows how the NTK drives the dynamics of in the space of functions during training.
Case 2: Vector output
[edit]An ANN with vector output of size consists in a family of functions parametrized by a vector of parameters .
In this case, the NTK is a matrix-valued kernel, with values in the space of matrices, defined byEmpirical risk minimization proceeds as in the scalar case, with the difference being that the loss function takes vector inputs . The training of through continuous-time gradient descent yields the following evolution in function space driven by the NTK:This generalizes the equation shown in case 1 for scalar outputs.
Interpretation
[edit]Each data point influences the evolution, of the output for each input , throughout the training. More concretely, with respect to example , the NTK value determines the influence of the loss gradient on the evolution of ANN output through a gradient descent step. In the scalar case, this reads
Wide fully-connected ANNs have a deterministic NTK, which remains constant throughout training
[edit]Consider an ANN with fully-connected layers of widths , so that , where is the composition of an affine transformation with the pointwise application of a nonlinearity , where parametrizes the maps . The parameters are initialized randomly, in an independent, identically distributed way.
As the widths grow, the NTK's scale is affected by the exact parametrization of the 's and by the parameter initialization. This motivates the so-called NTK parametrization . This parametrization ensures that if the parameters are initialized as standard normal variables, the NTK has a finite nontrivial limit. In the large-width limit, the NTK converges to a deterministic (non-random) limit , which stays constant in time.
The NTK is explicitly given by , where is determined by the set of recursive equations:
where denotes the kernel defined in terms of the Gaussian expectation:
In this formula the kernels are the ANN's so-called activation kernels.[28][29][5]
Wide fully connected networks are linear in their parameters throughout training
[edit]The NTK describes the evolution of neural networks under gradient descent in function space. Dual to this perspective is an understanding of how neural networks evolve in parameter space, since the NTK is defined in terms of the gradient of the ANN's outputs with respect to its parameters. In the infinite width limit, the connection between these two perspectives becomes especially interesting. The NTK remaining constant throughout training at large widths co-occurs with the ANN being well described throughout training by its first order Taylor expansion around its parameters at initialization:[6]
See also
[edit]References
[edit]- ^ a b c d e f g h Jacot, Arthur; Gabriel, Franck; Hongler, Clement (2018), Bengio, S.; Wallach, H.; Larochelle, H.; Grauman, K. (eds.), "Neural Tangent Kernel: Convergence and Generalization in Neural Networks" (PDF), Advances in Neural Information Processing Systems 31, Curran Associates, Inc., pp. 8571–8580, arXiv:1806.07572, retrieved 2019-11-27
- ^ a b c Arora, Sanjeev; Du, Simon S.; Hu, Wei; Li, Zhiyuan; Salakhutdinov, Ruslan; Wang, Ruosong (2019-11-04). "On Exact Computation with an Infinitely Wide Neural Net". arXiv:1904.11955 [cs.LG].
- ^ Yang, Greg (2020-11-29). "Tensor Programs II: Neural Tangent Kernel for Any Architecture". arXiv:2006.14548 [stat.ML].
- ^ Chizat, Lénaïc; Oyallon, Edouard; Bach, Francis (2019-12-08), "On lazy training in differentiable programming", Proceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA: Curran Associates Inc., pp. 2937–2947, arXiv:1812.07956, retrieved 2023-05-11
- ^ a b Lee, Jaehoon; Bahri, Yasaman; Novak, Roman; Schoenholz, Samuel S.; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2018-02-15). "Deep Neural Networks as Gaussian Processes".
{{cite journal}}
: Cite journal requires|journal=
(help) - ^ a b c d e f Lee, Jaehoon; Xiao, Lechao; Schoenholz, Samuel S.; Bahri, Yasaman; Novak, Roman; Sohl-Dickstein, Jascha; Pennington, Jeffrey (2020). "Wide neural networks of any depth evolve as linear models under gradient descent". Journal of Statistical Mechanics: Theory and Experiment. 2020 (12): 124002. arXiv:1902.06720. Bibcode:2020JSMTE2020l4002L. doi:10.1088/1742-5468/abc62b. S2CID 62841516.
- ^ Roberts, Daniel A.; Yaida, Sho (2022). "∞. The End of Training". The principles of deep learning theory: an effective theory approach to understanding neural networks. Boris Hanin. Cambridge New York, NY Port Melbourne, VIC New Delhi Singapore: Cambridge University Press. p. 360. ISBN 978-1-316-51933-2.
{{cite book}}
: CS1 maint: date and year (link) - ^ Shawe-Taylor, John; Cristianini, Nello (2004-06-28). Kernel Methods for Pattern Analysis. Cambridge University Press. doi:10.1017/cbo9780511809682. ISBN 978-0-521-81397-6.
- ^ a b c Belkin, Mikhail (2021-05-29). "Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation". arXiv:2105.14368 [stat.ML].
- ^ Novak, Roman; Bahri, Yasaman; Abolafia, Daniel A.; Pennington, Jeffrey; Sohl-Dickstein, Jascha (2018-02-15). "Sensitivity and Generalization in Neural Networks: an Empirical Study". arXiv:1802.08760 [stat.ML].
- ^ Jacot, Arthur; Şimşek, Berfin; Spadaro, Francesco; Hongler, Clément; Gabriel, Franck (2020-06-17). "Kernel Alignment Risk Estimator: Risk Prediction from Training Data". arXiv:2006.09796 [stat.ML].
- ^ Canatar, Abdulkadir; Bordelon, Blake; Pehlevan, Cengiz (2021-05-18). "Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks". Nature Communications. 12 (1): 2914. arXiv:2006.13198. Bibcode:2021NatCo..12.2914C. doi:10.1038/s41467-021-23103-1. ISSN 2041-1723. PMC 8131612. PMID 34006842.
- ^ Simon, James B.; Dickens, Madeline; Karkada, Dhruva; DeWeese, Michael R. (2022-10-12). "The Eigenlearning Framework: A Conservation Law Perspective on Kernel Regression and Wide Neural Networks". arXiv:2110.03922 [cs.LG].
- ^ Allen-Zhu, Zeyuan; Li, Yuanzhi; Song, Zhao (2018). "A convergence theory for deep learning via overparameterization". arXiv:1811.03962 [cs.LG].
- ^ Du, Simon S; Zhai, Xiyu; Poczos, Barnabas; Aarti, Singh (2019). "Gradient descent provably optimizes over-parameterized neural networks". arXiv:1810.02054 [cs.LG].
- ^ Zou, Difan; Cao, Yuan; Zhou, Dongruo; Gu, Quanquan (2020). "Gradient descent optimizes over-parameterized deep ReLU networks". Machine Learning. 109 (3): 467–492. doi:10.1007/s10994-019-05839-6. S2CID 53752874.
- ^ Allen-Zhu, Zeyuan; Li, Yuanzhi; Song, Zhao (2019-05-27). "On the Convergence Rate of Training Recurrent Neural Networks". arXiv:1810.12065 [cs.LG].
- ^ Du, Simon; Lee, Jason; Li, Haochuan; Wang, Liwei; Zhai, Xiyu (2019-05-24). "Gradient Descent Finds Global Minima of Deep Neural Networks". pp. 1675–1685. arXiv:1811.03804 [cs.LG].
- ^ Yang, Greg (2019-02-13). "Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation". arXiv:1902.04760 [cs.NE].
- ^ Hron, Jiri; Bahri, Yasaman; Sohl-Dickstein, Jascha; Novak, Roman (2020-06-18). "Infinite attention: NNGP and NTK for deep attention networks". arXiv:2006.10540 [stat.ML].
- ^ Mei, Song; Montanari, Andrea; Nguyen, Phan-Minh (2018-08-14). "A mean field view of the landscape of two-layer neural networks". Proceedings of the National Academy of Sciences. 115 (33): E7665 – E7671. arXiv:1804.06561. Bibcode:2018PNAS..115E7665M. doi:10.1073/pnas.1806579115. ISSN 0027-8424. PMC 6099898. PMID 30054315.
- ^ Chizat, Lénaïc; Bach, Francis (2018-12-03). "On the global convergence of gradient descent for over-parameterized models using optimal transport". Proceedings of the 32nd International Conference on Neural Information Processing Systems. NIPS'18. Red Hook, NY, USA: Curran Associates Inc.: 3040–3050. arXiv:1805.09545.
- ^ Nguyen, Phan-Minh; Pham, Huy Tuan (2020-01-30). "A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks". arXiv:2001.11443 [cs.LG].
- ^ Yang, Greg; Hu, Edward J. (2022-07-15). "Feature Learning in Infinite-Width Neural Networks". arXiv:2011.14522 [cs.LG].
- ^ Huang, Jiaoyang; Yau, Horng-Tzer (2019-09-17). "Dynamics of Deep Neural Networks and Neural Tangent Hierarchy". arXiv:1909.08156 [cs.LG].
- ^ Novak, Roman; Xiao, Lechao; Hron, Jiri; Lee, Jaehoon; Alemi, Alexander A.; Sohl-Dickstein, Jascha; Schoenholz, Samuel S. (2019-12-05), "Neural Tangents: Fast and Easy Infinite Neural Networks in Python", International Conference on Learning Representations (ICLR), vol. 2020, arXiv:1912.02803, Bibcode:2019arXiv191202803N
- ^ Lencevicius, Ronaldas Paulius (2022). "An Empirical Analysis of the Laplace and Neural Tangent Kernels". arXiv:2208.03761 [stat.ML].
- ^ Cho, Youngmin; Saul, Lawrence K. (2009), Bengio, Y.; Schuurmans, D.; Lafferty, J. D.; Williams, C. K. I. (eds.), "Kernel Methods for Deep Learning" (PDF), Advances in Neural Information Processing Systems 22, Curran Associates, Inc., pp. 342–350, retrieved 2019-11-27
- ^ Daniely, Amit; Frostig, Roy; Singer, Yoram (2016), Lee, D. D.; Sugiyama, M.; Luxburg, U. V.; Guyon, I. (eds.), "Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity" (PDF), Advances in Neural Information Processing Systems 29, Curran Associates, Inc., pp. 2253–2261, arXiv:1602.05897, Bibcode:2016arXiv160205897D, retrieved 2019-11-27