
History of artificial neural networks

From Wikipedia, the free encyclopedia
{{Primary sources|date=August 2022}}
{{Update|date=September 2021}}
{{Duplicated citations|reason=[[User:Polygnotus/DuplicateReferences|DuplicateReferences]] detected:<br>
* https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf (refs: 5, 148)
* https://www.degruyter.com/view/books/9781400882618/9781400882618-002/9781400882618-002.xml (refs: 11, 12)
* https://arxiv.org/abs/2212.11279 (refs: 26, 70)
* https://arxiv.org/abs/1404.7828 (refs: 52, 96)
* https://arxiv.org/abs/1411.4555 (refs: 65, 103)
* https://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf (refs: 67, 76)
* https://arxiv.org/abs/1409.1556 (refs: 101, 106)
|date=September 2024}}
}}
{{Machine learning|Artificial neural network}}


[[Artificial neural networks]] (ANNs) are models created using [[machine learning]] to perform a [[Neural network (machine learning)#Applications|number of tasks]]. Their creation was inspired by biological [[Neural circuit|neural circuitry]].<ref name="rosenblatt-1959">{{cite journal|last=Rosenblatt|first=F.|year=1958|title=The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain|journal=Psychological Review|volume=65|issue=6|pages=386–408|citeseerx=10.1.1.588.3775|doi=10.1037/h0042519|pmid=13602029|s2cid=12781225 }}</ref>{{refn|group=lower-alpha|Neurons generate an [[action potential]]—the release of neurotransmitters that are chemical inputs to other neurons—based on the sum of their incoming chemical inputs.}} While some of the computational implementations of ANNs relate to earlier discoveries in mathematics, the first implementation of ANNs was by psychologist [[Frank Rosenblatt]], who developed the [[perceptron]].<ref name="rosenblatt-1959"/> Little research was conducted on ANNs in the 1970s and 1980s, with the [[Association for the Advancement of Artificial Intelligence|AAAI]] calling this period an "[[AI winter]]".<ref>{{Crevier 1993}}</ref>


Later, advances in hardware and the development of the [[backpropagation]] algorithm, as well as [[recurrent neural networks]] and [[convolutional neural networks]], renewed interest in ANNs. The 2010s saw the development of a deep neural network (i.e., one with many [[Hidden layer|layers]]) called [[AlexNet]].<ref>{{Cite journal|last1=Krizhevsky|first1=Alex|last2=Sutskever|first2=Ilya|last3=Hinton|first3=Geoffrey E.|date=2017-05-24|title=ImageNet classification with deep convolutional neural networks|url=https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf|journal=Communications of the ACM|volume=60|issue=6|pages=84–90|doi=10.1145/3065386|s2cid=195908774|issn=0001-0782|doi-access=free}}</ref> It greatly outperformed other [[Computer vision#Recognition|image recognition]] models, and is thought to have launched the ongoing [[AI spring]], further increasing interest in [[deep learning]].<ref name =":1">{{Cite web|url=https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/|title=The data that transformed AI research—and possibly the world|first=Dave|last=Gershgorn|website=Quartz|date=26 July 2017 }}</ref> The [[Transformer (deep learning architecture)|transformer architecture]] was first described in 2017 as a method to teach ANNs grammatical dependencies in language,<ref name="2017_Attention_Is_All_You_Need">{{cite journal |last1=Vaswani |first1=Ashish |last2=Shazeer |first2=Noam |last3=Parmar |first3=Niki |last4=Uszkoreit |first4=Jakob |last5=Jones |first5=Llion |last6=Gomez |first6=Aidan N |last7=Kaiser |first7=Łukasz |last8=Polosukhin |first8=Illia |date=2017 |title=Attention is All you Need |url=https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=30}}</ref> and is the predominant architecture used by [[large language models]] such as [[GPT-4]]. [[Diffusion models]] were first described in 2015, and became the basis of [[Image synthesis|image generation]] models such as [[DALL-E]] in the 2020s.{{cn|date=January 2024}}


== Perceptrons and other early neural networks ==
{{main|Perceptron}}
The simplest feedforward network consists of a single weight layer without activation functions. It would be just a linear map, and training it would be [[linear regression]], in which the weights are adjusted to minimize the [[mean squared error]] between the network's outputs and the given target values. Linear regression by the [[Least squares|least squares method]] was used by [[Adrien-Marie Legendre]] (1805) and [[Carl Friedrich Gauss]] (1795) for the prediction of planetary movement.<ref name="legendre18052">Merriman, Mansfield. ''A List of Writings Relating to the Method of Least Squares: With Historical and Critical Notes''. Vol. 4. Academy, 1877.</ref><ref name="gauss17952">{{cite journal |last=Stigler |first=Stephen M. |year=1981 |title=Gauss and the Invention of Least Squares |journal=Ann. Stat. |volume=9 |issue=3 |pages=465–474 |doi=10.1214/aos/1176345451 |doi-access=free}}</ref><ref name="brertscher2">{{cite book |last=Bretscher |first=Otto |title=Linear Algebra With Applications |publisher=Prentice Hall |year=1995 |edition=3rd |location=Upper Saddle River, NJ}}</ref><ref name="stigler2">{{cite book |last=Stigler |first=Stephen M. |author-link=Stephen Stigler |url=https://archive.org/details/historyofstatist00stig |title=The History of Statistics: The Measurement of Uncertainty before 1900 |publisher=Harvard |year=1986 |isbn=0-674-40340-1 |location=Cambridge |url-access=registration}}</ref>
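In modern notation (not that of Legendre or Gauss), such a linear network with weights <math>w_1, \dots, w_d</math> computes <math>\hat{y} = \textstyle\sum_{i=1}^{d} w_i x_i</math> for an input <math>(x_1, \dots, x_d)</math>, and least-squares training chooses the weights that minimize the mean squared error over <math>N</math> training pairs <math>(\mathbf{x}^{(n)}, t^{(n)})</math>:
<math display="block">E(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \left( \sum_{i=1}^{d} w_i x_i^{(n)} - t^{(n)} \right)^{2},</math>
which has the closed-form least-squares solution <math>\mathbf{w} = (X^\mathsf{T} X)^{-1} X^\mathsf{T} \mathbf{t}</math> when the [[design matrix]] <math>X</math> has full column rank.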


''[[A logical calculus of the ideas immanent in nervous activity]]'' ([[Warren McCulloch]] and [[Walter Pitts]], 1943) studied several abstract models of neural networks using the symbolic logic of [[Rudolf Carnap]] and ''[[Principia Mathematica]]''. The paper argued that these models (some learning, some not) have the same computational power as Turing machines.<ref>{{cite journal |last1=McCulloch |first1=Warren |author-link=Warren Sturgis McCulloch |last2=Pitts |first2=Walter |author-link2=Walter Pitts |year=1943 |title=A Logical Calculus of Ideas Immanent in Nervous Activity |url=https://link.springer.com/article/10.1007/BF02478259 |journal=Bulletin of Mathematical Biophysics |volume=5 |issue=4 |pages=115–133 |doi=10.1007/BF02478259 |pmid=<!--none-->}}</ref> This work paved the way for research to split into two approaches. One approach focused on biological processes while the other focused on the application of neural networks to [[artificial intelligence]]. It also led to work on nerve networks and their link to [[Finite state machine|finite automata]].<ref>{{Citation |last=Kleene |first=S. C. |title=Representation of Events in Nerve Nets and Finite Automata |date=1956-12-31 |work=Automata Studies. (AM-34) |pages=3–42 |editor-last=Shannon |editor-first=C. E. |url=https://www.degruyter.com/document/doi/10.1515/9781400882618-002/html |access-date=2024-10-14 |publisher=Princeton University Press |doi=10.1515/9781400882618-002 |isbn=978-1-4008-8261-8 |editor2-last=McCarthy |editor2-first=J.}}</ref>
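A McCulloch–Pitts unit, stated in modern notation rather than the logical calculus of the original paper, fires when the sum of its active excitatory inputs <math>x_i \in \{0, 1\}</math> reaches a threshold <math>\theta</math> and no inhibitory input is active:
<math display="block">y = \begin{cases} 1 & \text{if } \sum_{i} x_i \geq \theta \text{ and no inhibitory input is active} \\ 0 & \text{otherwise.} \end{cases}</math>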


In the early 1940s, [[Donald O. Hebb|D. O. Hebb]]<ref>{{cite book|url={{google books |plainurl=y |id=ddB4AgAAQBAJ}}|title=The Organization of Behavior|last=Hebb|first=Donald|publisher=Wiley|year=1949|isbn=978-1-135-63190-1|location=New York}}</ref> created a learning hypothesis based on the mechanism of [[Neuroplasticity|neural plasticity]] that became known as [[Hebbian learning]]. Hebbian learning is [[unsupervised learning]]. This evolved into models for [[long-term potentiation]]. Researchers started applying these ideas to computational models in 1948 with [[Unorganized machine|Turing's B-type machines]]. B. Farley and [[Wesley A. Clark]]<ref>{{cite journal|last=Farley|first=B.G.|author2=W.A. Clark|year=1954|title=Simulation of Self-Organizing Systems by Digital Computer|journal=IRE Transactions on Information Theory|volume=4|issue=4|pages=76–84|doi=10.1109/TIT.1954.1057468}}</ref> (1954) first used computational machines, then called "calculators", to simulate a Hebbian network. Other neural network computational machines were created by [[Nathaniel Rochester (computer scientist)|Rochester]], Holland, Habit and Duda (1956).<ref>{{cite journal|last=Rochester|first=N.|author2=J.H. Holland|author3=L.H. Habit|author4=W.L. Duda|year=1956|title=Tests on a cell assembly theory of the action of the brain, using a large digital computer|journal=IRE Transactions on Information Theory|volume=2|issue=3|pages=80–93|doi=10.1109/TIT.1956.1056810}}</ref>
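In modern notation, the Hebbian rule is commonly written as an update that strengthens the connection between two units in proportion to their simultaneous activity, with learning rate <math>\eta</math>:
<math display="block">\Delta w_{ij} = \eta \, x_i \, y_j ,</math>
where <math>x_i</math> is the activity of the presynaptic unit and <math>y_j</math> that of the postsynaptic unit.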
[[Frank Rosenblatt]]<ref name="rosenblatt-1959"/> (1958) created the [[perceptron]], an algorithm for pattern recognition. His [[multilayer perceptron]] (MLP) comprised three layers: an input layer, a hidden layer with randomized weights that did not learn, and an output layer. With mathematical notation, Rosenblatt described circuitry not in the basic perceptron, such as the [[exclusive-or]] circuit that could not be processed by neural networks at the time. His 1962 book introduced variants and computer experiments, including a version with four-layer perceptrons where the last two layers have learned weights (and thus a proper multilayer perceptron).<ref name="rosenblatt19622">{{cite book |last=Rosenblatt |first=Frank |author-link=Frank Rosenblatt |title=Principles of Neurodynamics |publisher=Spartan, New York |year=1962}}</ref>{{rp|section 16}} Some consider that the 1962 book developed and explored all of the basic ingredients of the deep learning systems of today.<ref name="Who Is the Father of Deep Learning?">{{cite book |last1=Tappert |first1=Charles C. |title=2019 International Conference on Computational Science and Computational Intelligence (CSCI) |publisher=IEEE |year=2019 |isbn=978-1-7281-5584-5 |pages=343–348 |chapter=Who Is the Father of Deep Learning? |doi=10.1109/CSCI49370.2019.00067 |access-date=31 May 2021 |chapter-url=https://ieeexplore.ieee.org/document/9070967 |s2cid=216043128}}</ref> In 1959, a biological model proposed by [[Nobel laureate]]s [[David H. Hubel|Hubel]] and [[Torsten Wiesel|Wiesel]] was based on their discovery of two types of cells in the [[primary visual cortex]]: [[simple cell]]s and [[complex cell]]s.<ref>{{cite book|url={{google books |plainurl=y |id=8YrxWojxUA4C|page=106}}|title=Brain and visual perception: the story of a 25-year collaboration|author=David H. Hubel and Torsten N. Wiesel|publisher=Oxford University Press US|year=2005|isbn=978-0-19-517618-6|page=106}}</ref>
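In modern notation, the basic perceptron can be written as a thresholded weighted sum of its inputs, with the perceptron learning rule adjusting the weights whenever the output <math>y</math> differs from the target <math>t \in \{0, 1\}</math>:
<math display="block">y = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b > 0 \\ 0 & \text{otherwise,} \end{cases} \qquad \mathbf{w} \leftarrow \mathbf{w} + \eta\,(t - y)\,\mathbf{x}, \quad b \leftarrow b + \eta\,(t - y).</math>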


Some say that research stagnated following the publication of ''[[Perceptrons (book)|Perceptrons]]'' (1969) by [[Marvin Minsky]] and [[Seymour Papert]],<ref>{{cite book|url={{google books |plainurl=y |id=Ow1OAQAAIAAJ}}|title=Perceptrons: An Introduction to Computational Geometry|last1=Minsky|first1=Marvin|last2=Papert|first2=Seymour|publisher=MIT Press|year=1969|isbn=978-0-262-63022-1}}</ref> who discovered that basic perceptrons were incapable of processing the exclusive-or circuit and that computers lacked sufficient power to process useful neural networks. However, by the time this book came out, methods for training [[multilayer perceptrons]] (MLPs) by deep learning were already known.<ref name=DLhistory />


The [[group method of data handling]], a method to train arbitrarily deep neural networks, was published by [[Alexey Ivakhnenko]] and Lapa in 1967. They regarded it as a form of polynomial regression,<ref name="ivak19652">{{cite book |last1=Ivakhnenko |first1=A. G. |url={{google books |plainurl=y |id=rGFgAAAAMAAJ}} |title=Cybernetics and Forecasting Techniques |last2=Lapa |first2=V. G. |publisher=American Elsevier Publishing Co. |year=1967 |isbn=978-0-444-00020-0}}</ref> or a generalization of Rosenblatt's perceptron.<ref>{{Cite journal |last=Ivakhnenko |first=A.G. |date=March 1970 |title=Heuristic self-organization in problems of engineering cybernetics |url=https://linkinghub.elsevier.com/retrieve/pii/0005109870900920 |journal=Automatica |language=en |volume=6 |issue=2 |pages=207–219 |doi=10.1016/0005-1098(70)90092-0}}</ref> The method trains networks incrementally, layer by layer, using [[regression analysis]], and prunes useless hidden units with the help of a validation set.<ref name="SCHIDHUB3" /> A 1971 paper described a deep network with eight layers trained by this method.<ref name="ivak1971">{{Cite journal |last=Ivakhnenko |first=Alexey |date=1971 |title=Polynomial theory of complex systems |url=http://gmdh.net/articles/history/polynomial.pdf |url-status=live |journal=IEEE Transactions on Systems, Man, and Cybernetics |volume=SMC-1 |issue=4 |pages=364–378 |doi=10.1109/TSMC.1971.4308320 |archive-url=https://web.archive.org/web/20170829230621/http://www.gmdh.net/articles/history/polynomial.pdf |archive-date=2017-08-29 |access-date=2019-11-05}}</ref>
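The partial models used in the group method of data handling are commonly low-order polynomials of pairs of inputs, for example the quadratic form
<math display="block">\hat{y} = a_0 + a_1 x_i + a_2 x_j + a_3 x_i x_j + a_4 x_i^2 + a_5 x_j^2 ,</math>
whose coefficients are fitted by [[least squares]]; the best partial models, judged on separate validation data, are retained as inputs to the next layer.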


The first deep learning [[multilayer perceptron]] trained by [[stochastic gradient descent]]<ref name="robbins19512">{{Cite journal |last1=Robbins |first1=H. |author-link=Herbert Robbins |last2=Monro |first2=S. |year=1951 |title=A Stochastic Approximation Method |journal=The Annals of Mathematical Statistics |volume=22 |issue=3 |pages=400 |doi=10.1214/aoms/1177729586 |doi-access=free}}</ref> was published in 1967 by [[Shun'ichi Amari]].<ref name="Amari19672">{{cite journal |last1=Amari |first1=Shun'ichi |author-link=Shun'ichi Amari |date=1967 |title=A theory of adaptive pattern classifier |journal=IEEE Transactions |volume=EC |issue=16 |pages=279–307}}</ref> In computer experiments conducted by Amari's student Saito, a five-layer MLP with two modifiable layers learned [[Knowledge representation|internal representations]] to classify non-linearly separable pattern classes.<ref name=DLhistory /> Subsequent developments in hardware and hyperparameter tuning have made end-to-end [[stochastic gradient descent]] the currently dominant training technique.
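In modern notation, stochastic gradient descent updates the network parameters <math>\theta</math> after each randomly chosen training example <math>(\mathbf{x}^{(n)}, t^{(n)})</math>, rather than after a full pass over the data:
<math display="block">\theta \leftarrow \theta - \eta \, \nabla_{\theta}\, L\!\left(\theta; \mathbf{x}^{(n)}, t^{(n)}\right),</math>
where <math>\eta</math> is the learning rate and <math>L</math> is the loss on that example.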


== Backpropagation ==
{{main|Backpropagation}}


[[Backpropagation]] is an efficient application of the [[chain rule]] derived by [[Gottfried Wilhelm Leibniz]] in 1673<ref name="leibniz16762">{{Cite book |last=Leibniz |first=Gottfried Wilhelm Freiherr von |url=https://books.google.com/books?id=bOIGAAAAYAAJ&q=leibniz+altered+manuscripts&pg=PA90 |title=The Early Mathematical Manuscripts of Leibniz: Translated from the Latin Texts Published by Carl Immanuel Gerhardt with Critical and Historical Notes (Leibniz published the chain rule in a 1676 memoir) |date=1920 |publisher=Open court publishing Company |isbn=9780598818461 |language=en}}</ref> to networks of differentiable nodes. It is also known as the reverse mode of [[automatic differentiation]] or [[reverse accumulation]].<ref name="grie2012">{{cite book |last=Griewank |first=Andreas |year=2012 |chapter=Who Invented the Reverse Mode of Differentiation? |title=Optimization Stories |series=Documenta Matematica, Extra Volume ISMP |pages=389–400 |s2cid=15568746 }}</ref><ref name="grie2008">{{cite book|url={{google books |plainurl=y |id=xoiiLaRxcbEC}}|title=Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation, Second Edition|last1=Griewank|first1=Andreas|last2=Walther|first2=Andrea|author2-link=Andrea Walther|publisher=SIAM|year=2008|isbn=978-0-89871-776-1}}</ref> The terminology "back-propagating errors" was introduced in 1962 by Rosenblatt,<ref name="rosenblatt19622"/> but he did not know how to implement this, although [[Henry J. Kelley]] had a continuous precursor of backpropagation in 1960 in the context of [[control theory]].<ref name="kelley19602">{{cite journal |last1=Kelley |first1=Henry J. |author-link=Henry J. Kelley |year=1960 |title=Gradient theory of optimal flight paths |journal=ARS Journal |volume=30 |issue=10 |pages=947–954 |doi=10.2514/8.5282}}</ref> The modern form of backpropagation was developed multiple times in the early 1970s. The earliest published instance was [[Seppo Linnainmaa]]'s master's thesis (1970).<ref name="lin19703">{{cite thesis |first=Seppo |last=Linnainmaa |author-link=Seppo Linnainmaa |year=1970 |type=Masters |title=The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors |language=fi |publisher=University of Helsinki |page=6–7}}</ref><ref name="lin19763">{{cite journal |last1=Linnainmaa |first1=Seppo |author-link=Seppo Linnainmaa |year=1976 |title=Taylor expansion of the accumulated rounding error |journal=BIT Numerical Mathematics |volume=16 |issue=2 |pages=146–160 |doi=10.1007/bf01931367 |s2cid=122357351}}</ref> [[Paul Werbos]] developed it independently in 1971,<ref name=":14">{{Cite book |url=https://direct.mit.edu/books/book/4886/Talking-NetsAn-Oral-History-of-Neural-Networks |title=Talking Nets: An Oral History of Neural Networks |date=2000 |publisher=The MIT Press |isbn=978-0-262-26715-1 |editor-last=Anderson |editor-first=James A. |language=en |doi=10.7551/mitpress/6626.003.0016 |editor-last2=Rosenfeld |editor-first2=Edward}}</ref> but had difficulty publishing it until 1982.<ref name="werbos19823">{{cite book |last=Werbos |first=Paul |author-link=Paul Werbos |title=System modeling and optimization |publisher=Springer |year=1982 |pages=762–770 |chapter=Applications of advances in nonlinear sensitivity analysis |access-date=2 July 2017 |chapter-url=http://werbos.com/Neural/SensitivityIFIPSeptember1981.pdf |archive-url=https://web.archive.org/web/20160414055503/http://werbos.com/Neural/SensitivityIFIPSeptember1981.pdf |archive-date=14 April 2016 |url-status=live}}</ref> In 1986, [[David E. Rumelhart]] et al. popularized backpropagation.<ref>{{Cite journal |last1=Rumelhart |first1=David E. |last2=Hinton |first2=Geoffrey E. |last3=Williams |first3=Ronald J. |date=October 1986 |title=Learning representations by back-propagating errors |url=https://www.nature.com/articles/323533a0 |journal=Nature |language=en |volume=323 |issue=6088 |pages=533–536 |doi=10.1038/323533a0 |bibcode=1986Natur.323..533R |issn=1476-4687}}</ref>
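In modern notation, for a feedforward network with pre-activations <math>\mathbf{a}^{(l)} = W^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}</math> and activations <math>\mathbf{h}^{(l)} = f(\mathbf{a}^{(l)})</math>, backpropagation applies the chain rule layer by layer, propagating the error signal <math>\boldsymbol{\delta}^{(l)} = \partial L / \partial \mathbf{a}^{(l)}</math> backwards from the output:
<math display="block">\boldsymbol{\delta}^{(l)} = \left( W^{(l+1)} \right)^{\mathsf{T}} \boldsymbol{\delta}^{(l+1)} \odot f'\!\left(\mathbf{a}^{(l)}\right), \qquad \frac{\partial L}{\partial W^{(l)}} = \boldsymbol{\delta}^{(l)} \left( \mathbf{h}^{(l-1)} \right)^{\mathsf{T}} .</math>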


== Recurrent network architectures ==
{{main|Recurrent neural network}}


One origin of RNN was [[statistical mechanics]]. The [[Ising model]] was developed by [[Wilhelm Lenz]]<ref name="lenz1920">{{Citation |last=Lenz |first=W. |title=Beiträge zum Verständnis der magnetischen Eigenschaften in festen Körpern |journal=Physikalische Zeitschrift |volume=21 |pages=613–615 |year=1920 |postscript=. |author-link=Wilhelm Lenz}}</ref> and [[Ernst Ising]]<ref name="ising1925">{{citation |last=Ising |first=E. |title=Beitrag zur Theorie des Ferromagnetismus |journal=Z. Phys. |volume=31 |issue=1 |pages=253–258 |year=1925 |bibcode=1925ZPhy...31..253I |doi=10.1007/BF02980577 |s2cid=122157319}}</ref> in the 1920s<ref>{{cite journal |last1=Brush |first1=Stephen G. |year=1967 |title=History of the Lenz-Ising Model |journal=Reviews of Modern Physics |volume=39 |issue=4 |pages=883–893 |bibcode=1967RvMP...39..883B |doi=10.1103/RevModPhys.39.883}}</ref> as a simple statistical mechanical model of magnets at equilibrium. [[Roy J. Glauber|Glauber]] in 1963 studied the Ising model evolving in time, as a process towards equilibrium ([[Glauber dynamics]]), adding in the component of time.<ref name=":22">{{cite journal |last1=Glauber |first1=Roy J. |date=February 1963 |title=Time-Dependent Statistics of the Ising Model |url=https://aip.scitation.org/doi/abs/10.1063/1.1703954 |journal=Journal of Mathematical Physics |volume=4 |issue=2 |pages=294–307 |doi=10.1063/1.1703954 |access-date=2021-03-21}}</ref> [[Shun'ichi Amari]] in 1972 proposed to modify the weights of an Ising model by the [[Hebbian theory|Hebbian learning]] rule as a model of associative memory, adding in the component of learning.<ref>{{Cite journal |last=Amari |first=S.-I. |date=November 1972 |title=Learning Patterns and Pattern Sequences by Self-Organizing Nets of Threshold Elements |url=https://ieeexplore.ieee.org/document/1672070 |journal=IEEE Transactions on Computers |volume=C-21 |issue=11 |pages=1197–1206 |doi=10.1109/T-C.1972.223477 |issn=0018-9340}}</ref> This was popularized as the [[Hopfield network]] (1982).<ref name="Hopfield19822">{{cite journal |last1=Hopfield |first1=J. J. |date=1982 |title=Neural networks and physical systems with emergent collective computational abilities |journal=Proceedings of the National Academy of Sciences |volume=79 |issue=8 |pages=2554–2558 |bibcode=1982PNAS...79.2554H |doi=10.1073/pnas.79.8.2554 |pmc=346238 |pmid=6953413 |doi-access=free}}</ref>
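In modern notation, a Hopfield network with symmetric weights <math>w_{ij} = w_{ji}</math>, no self-connections, and binary states <math>s_i \in \{-1, +1\}</math> updates one unit at a time in a way that never increases the network energy:
<math display="block">E = -\tfrac{1}{2} \sum_{i \neq j} w_{ij} s_i s_j , \qquad s_i \leftarrow \operatorname{sgn}\!\left( \sum_{j} w_{ij} s_j \right),</math>
so that stored patterns become stable low-energy states that act as an [[associative memory]].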


Another origin of RNN was neuroscience. The word "recurrent" is used to describe loop-like structures in anatomy. In 1901, [[Santiago Ramón y Cajal|Cajal]] observed "recurrent semicircles" in the [[Cerebellum|cerebellar cortex]].<ref>{{Cite journal |last1=Espinosa-Sanchez |first1=Juan Manuel |last2=Gomez-Marin |first2=Alex |last3=de Castro |first3=Fernando |date=2023-07-05 |title=The Importance of Cajal's and Lorente de Nó's Neuroscience to the Birth of Cybernetics |url=http://journals.sagepub.com/doi/10.1177/10738584231179932 |journal=The Neuroscientist |language=en |doi=10.1177/10738584231179932 |issn=1073-8584 |pmid=37403768 |hdl=10261/348372|hdl-access=free }}</ref> In 1933, [[Rafael Lorente de Nó|Lorente de Nó]] discovered "recurrent, reciprocal connections" by [[Golgi's method]], and proposed that excitatory loops explain certain aspects of the [[Vestibulo–ocular reflex|vestibulo-ocular reflex]].<ref>{{Cite journal |last=de NÓ |first=R. Lorente |date=1933-08-01 |title=Vestibulo-Ocular Reflex Arc |url=http://archneurpsyc.jamanetwork.com/article.aspx?doi=10.1001/archneurpsyc.1933.02240140009001 |journal=Archives of Neurology and Psychiatry |volume=30 |issue=2 |pages=245 |doi=10.1001/archneurpsyc.1933.02240140009001 |issn=0096-6754}}</ref><ref>{{Cite journal |last=Larriva-Sahd |first=Jorge A. |date=2014-12-03 |title=Some predictions of Rafael Lorente de Nó 80 years later |journal=Frontiers in Neuroanatomy |volume=8 |pages=147 |doi=10.3389/fnana.2014.00147 |issn=1662-5129 |pmc=4253658 |pmid=25520630 |doi-access=free}}</ref> [[Donald O. Hebb|Hebb]] considered the "reverberating circuit" as an explanation for short-term memory.<ref>{{Cite web |title=reverberating circuit |url=https://www.oxfordreference.com/display/10.1093/oi/authority.20110803100417461 |access-date=2024-07-27 |website=Oxford Reference}}</ref> {{Harvard citation|McCulloch|Pitts|1943}} considered neural networks that contain cycles, and noted that the current activity of such networks can be affected by activity indefinitely far in the past.


Two early influential works were the [[Recurrent neural network#Jordan network|Jordan network]] (1986) and the [[Recurrent neural network#Elman network|Elman network]] (1990), which applied RNNs to study [[cognitive psychology]].
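In modern notation, a simple (Elman-style) recurrent layer maintains a hidden state <math>\mathbf{h}_t</math> that is fed back at the next time step:
<math display="block">\mathbf{h}_t = f\!\left( W \mathbf{x}_t + U \mathbf{h}_{t-1} + \mathbf{b} \right), \qquad \mathbf{y}_t = g\!\left( V \mathbf{h}_t + \mathbf{c} \right).</math>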

=== LSTM ===
[[Sepp Hochreiter]]'s diploma thesis (1991)<ref name="HOCH19912">S. Hochreiter, "[http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf Untersuchungen zu dynamischen neuronalen Netzen]". {{Webarchive|url=https://web.archive.org/web/20150306075401/http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf|date=2015-03-06}}. ''Diploma thesis. Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber'', 1991.</ref> proposed the neural history compressor, and identified and analyzed the [[vanishing gradient problem]].<ref name="HOCH19912" /><ref name="HOCH20012">{{cite book |last=Hochreiter |first=S. |title=A Field Guide to Dynamical Recurrent Networks |date=15 January 2001 |publisher=John Wiley & Sons |isbn=978-0-7803-5369-5 |editor-last1=Kolen |editor-first1=John F. |chapter=Gradient flow in recurrent nets: the difficulty of learning long-term dependencies |display-authors=etal |editor-last2=Kremer |editor-first2=Stefan C. |chapter-url={{google books |plainurl=y |id=NWOcMVA64aAC}}}}</ref> In 1993, a neural history compressor system solved a "Very Deep Learning" task that required more than 1000 subsequent [[Layer (deep learning)|layers]] in an RNN unfolded in time.<ref name="schmidhuber19923">{{cite journal |last1=Schmidhuber |first1=Jürgen |year=1992 |title=Learning complex, extended sequences using the principle of history compression (based on TR FKI-148, 1991) |url=ftp://ftp.idsia.ch/pub/juergen/chunker.pdf |journal=Neural Computation |volume=4 |issue=2 |pages=234–242 |doi=10.1162/neco.1992.4.2.234 |s2cid=18271205}}{{Dead link|date=April 2024|bot=InternetArchiveBot|fix-attempted=yes}}</ref><ref name="schmidhuber19933">{{Cite book |last=Schmidhuber |first=Jürgen |url=ftp://ftp.idsia.ch/pub/juergen/habilitation.pdf |title=Habilitation thesis: System modeling and optimization |year=1993}}{{Dead link|date=June 2024|bot=InternetArchiveBot|fix-attempted=yes}} Page 150 ff demonstrates credit assignment across the equivalent of 1,200 layers in an unfolded RNN.</ref> Hochreiter proposed recurrent [[Residual neural network|residual]] connections to solve the vanishing gradient problem. This led to the [[long short-term memory]] (LSTM), published in 1995.<ref name="auto">{{Cite Q|Q98967430}}</ref> LSTM can learn "very deep learning" tasks<ref name="SCHIDHUB3">{{cite journal |last=Schmidhuber |first=J. |year=2015 |title=Deep Learning in Neural Networks: An Overview |journal=Neural Networks |volume=61 |pages=85–117 |arxiv=1404.7828 |doi=10.1016/j.neunet.2014.09.003 |pmid=25462637 |s2cid=11715509}}</ref> with long credit assignment paths that require memories of events that happened thousands of discrete time steps before. That LSTM was not yet the modern architecture, which required a "forget gate", introduced in 1999.<ref name="lstm19992">{{Cite book |last1=Gers |first1=Felix |title=9th International Conference on Artificial Neural Networks: ICANN '99 |last2=Schmidhuber |first2=Jürgen |last3=Cummins |first3=Fred |year=1999 |isbn=0-85296-721-7 |volume=1999 |pages=850–855 |chapter=Learning to forget: Continual prediction with LSTM |doi=10.1049/cp:19991218}}</ref> That version became the standard RNN architecture.
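In modern notation, the LSTM cell with a forget gate maintains a cell state <math>\mathbf{c}_t</math> whose update is additive rather than purely multiplicative, which is what mitigates the vanishing gradient problem:
<math display="block">\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t , \qquad \mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t),</math>
where the forget, input and output gates <math>\mathbf{f}_t, \mathbf{i}_t, \mathbf{o}_t</math> are sigmoid functions of the current input and the previous hidden state.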

LSTM networks, developed by [[Sepp Hochreiter|Hochreiter]] and [[Jürgen Schmidhuber|Schmidhuber]],<ref name="auto"/><ref name="lstm2">{{Cite journal |last1=Hochreiter |first1=Sepp |author-link=Sepp Hochreiter |last2=Schmidhuber |first2=Jürgen |date=1997-11-01 |title=Long Short-Term Memory |journal=Neural Computation |volume=9 |issue=8 |pages=1735–1780 |doi=10.1162/neco.1997.9.8.1735 |pmid=9377276 |s2cid=1915014}}</ref> set accuracy records in multiple application domains and became the default choice for RNN architecture.


Around 2006, LSTM started to revolutionize [[speech recognition]], outperforming traditional models in certain speech applications.<ref>{{Cite journal |last1=Graves |first1=Alex |last2=Schmidhuber |first2=Jürgen |date=2005-07-01 |title=Framewise phoneme classification with bidirectional LSTM and other neural network architectures |journal=Neural Networks |series=IJCNN 2005 |volume=18 |issue=5 |pages=602–610 |citeseerx=10.1.1.331.5800 |doi=10.1016/j.neunet.2005.06.042 |pmid=16112549 |s2cid=1856462}}</ref><ref name="fernandez2007keyword2">{{Cite conference |last1=Fernández |first1=Santiago |last2=Graves |first2=Alex |last3=Schmidhuber |first3=Jürgen |year=2007 |title=An Application of Recurrent Neural Networks to Discriminative Keyword Spotting |url=http://dl.acm.org/citation.cfm?id=1778066.1778092 |series=ICANN'07 |location=Berlin, Heidelberg |publisher=Springer-Verlag |pages=220–229 |isbn=978-3-540-74693-5 |book-title=Proceedings of the 17th International Conference on Artificial Neural Networks}}</ref> LSTM also improved large-vocabulary speech recognition<ref name="sak20142">{{Cite web |last1=Sak |first1=Haşim |last2=Senior |first2=Andrew |last3=Beaufays |first3=Françoise |year=2014 |title=Long Short-Term Memory recurrent neural network architectures for large scale acoustic modeling |url=https://research.google.com/pubs/archive/43905.pdf |publisher=Google Research}}</ref><ref name="liwu20152">{{cite arXiv |eprint=1410.4281 |class=cs.CL |first1=Xiangang |last1=Li |first2=Xihong |last2=Wu |title=Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition |date=2014-10-15}}</ref> and [[text-to-speech]] synthesis<ref name="fan20152">{{cite conference |last1=Fan |first1=Bo |last2=Wang |first2=Lijuan |last3=Soong |first3=Frank K. |last4=Xie |first4=Lei |date=2015 |title=Photo-Real Talking Head with Deep Bidirectional LSTM |pages=4884–8 |doi=10.1109/ICASSP.2015.7178899 |isbn=978-1-4673-6997-8 |chapter-url= |editor= |book-title=Proceedings of ICASSP 2015 IEEE International Conference on Acoustics, Speech and Signal Processing}}</ref> and was used in [[Google Voice Search|Google voice search]], and dictation on [[Android (operating system)|Android devices]].<ref name="sak20152">{{Cite web |last1=Sak |first1=Haşim |last2=Senior |first2=Andrew |last3=Rao |first3=Kanishka |last4=Beaufays |first4=Françoise |last5=Schalkwyk |first5=Johan |date=September 2015 |title=Google voice search: faster and more accurate |url=http://googleresearch.blogspot.ch/2015/09/google-voice-search-faster-and-more.html}}</ref>


LSTM broke records for improved [[machine translation]],<ref name="sutskever20142">{{Cite journal |last1=Sutskever |first1=Ilya |last2=Vinyals |first2=Oriol |last3=Le |first3=Quoc V. |year=2014 |title=Sequence to Sequence Learning with Neural Networks |url=https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf |journal=Electronic Proceedings of the Neural Information Processing Systems Conference |volume=27 |page=5346 |arxiv=1409.3215 |bibcode=2014arXiv1409.3215S}}</ref> [[Language Modeling|language modeling]]<ref name="vinyals20162">{{cite arXiv |eprint=1602.02410 |class=cs.CL |first1=Rafal |last1=Jozefowicz |first2=Oriol |last2=Vinyals |title=Exploring the Limits of Language Modeling |date=2016-02-07 |last3=Schuster |first3=Mike |last4=Shazeer |first4=Noam |last5=Wu |first5=Yonghui}}</ref> and multilingual language processing.<ref name="gillick20152">{{cite arXiv |eprint=1512.00103 |class=cs.CL |first1=Dan |last1=Gillick |first2=Cliff |last2=Brunk |title=Multilingual Language Processing From Bytes |date=2015-11-30 |last3=Vinyals |first3=Oriol |last4=Subramanya |first4=Amarnag}}</ref> LSTM combined with [[Convolutional neural network|convolutional neural networks]] (CNNs) improved [[automatic image captioning]].<ref name="vinyals20152">{{cite arXiv |eprint=1411.4555 |class=cs.CV |first1=Oriol |last1=Vinyals |first2=Alexander |last2=Toshev |title=Show and Tell: A Neural Image Caption Generator |date=2014-11-17 |last3=Bengio |first3=Samy |last4=Erhan |first4=Dumitru}}</ref>
== Self-organizing maps ==
{{main|Self-organizing map}}

[[Self-organizing map]]s (SOMs) were described by [[Teuvo Kohonen]] in 1982.<ref name="KohonenMap">{{cite journal |title= Kohonen Network |last1= Kohonen |first1= Teuvo |last2= Honkela |first2= Timo |year= 2007 |journal= Scholarpedia |volume= 2 |issue= 1 |pages= 1568 |doi= 10.4249/scholarpedia.1568 |bibcode= 2007SchpJ...2.1568K |doi-access= free }}</ref><ref>{{cite journal |last= Kohonen |first= Teuvo |year= 1982 |title= Self-Organized Formation of Topologically Correct Feature Maps |journal= Biological Cybernetics |volume= 43 |number= 1 |pages= 59–69 |doi= 10.1007/bf00337288|s2cid= 206775459 }}</ref> SOMs are neurophysiologically inspired<ref>{{cite journal | last1 = Von der Malsburg | first1 = C | year = 1973 | title = Self-organization of orientation sensitive cells in the striate cortex | journal = Kybernetik | volume = 14 | issue = 2| pages = 85–100 | doi=10.1007/bf00288907| pmid = 4786750 | s2cid = 3351573 }}</ref> [[artificial neural network]]s that learn [[dimensionality reduction|low-dimensional]] representations of high-dimensional data while preserving the [[topology|topological structure]] of the data. They are trained using [[competitive learning]].

SOMs create internal representations reminiscent of the [[cortical homunculus]],<ref>{{Cite web|title=Homunculus {{!}} Meaning & Definition in UK English {{!}} Lexico.com|url=https://www.lexico.com/definition/homunculus|archive-url=https://web.archive.org/web/20210518054743/https://www.lexico.com/definition/homunculus|url-status=dead|archive-date=May 18, 2021|access-date=6 February 2022|website=Lexico Dictionaries {{!}} English|language=en}}</ref> a distorted representation of the [[human body]], based on a neurological "map" of the areas and proportions of the [[human brain]] dedicated to processing [[Sensory processing|sensory function]]s, for different parts of the body.
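In modern notation, each training step of a SOM finds the best-matching unit <math>u = \arg\min_{v} \lVert \mathbf{x} - \mathbf{w}_v \rVert</math> for an input <math>\mathbf{x}</math> and moves the weight vectors of <math>u</math> and its map neighbours towards the input:
<math display="block">\mathbf{w}_v \leftarrow \mathbf{w}_v + \eta(t)\, h_{u,v}(t)\, \left( \mathbf{x} - \mathbf{w}_v \right),</math>
where <math>\eta(t)</math> is a decreasing learning rate and <math>h_{u,v}(t)</math> is a neighbourhood function that shrinks over time.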


== Convolutional neural networks (CNNs) ==


The origin of the CNN architecture is the "[[neocognitron]]"<ref name=fukuneoscholar>{{cite journal |last1=Fukushima |first1=K. |year=2007 |title=Neocognitron |journal=Scholarpedia |volume=2 |issue=1 |page=1717 |doi=10.4249/scholarpedia.1717 |bibcode=2007SchpJ...2.1717F |doi-access=free}}</ref> introduced by [[Kunihiko Fukushima]] in 1980.<ref name="intro">{{cite journal |last=Fukushima |first=Kunihiko |title=Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position |journal=Biological Cybernetics |year=1980 |volume=36 |issue=4 |pages=193–202 |url=https://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf |access-date=16 November 2013 |doi=10.1007/BF00344251 |pmid=7370364 |s2cid=206775608}}</ref><ref>{{cite journal |first1=Yann |last1=LeCun |first2=Yoshua |last2=Bengio |first3=Geoffrey |last3=Hinton |title=Deep learning |journal=Nature |volume=521 |issue=7553 |year=2015 |pages=436–444 |doi=10.1038/nature14539 |pmid=26017442 |bibcode=2015Natur.521..436L |s2cid=3074096|url=https://hal.science/hal-04206682/file/Lecun2015.pdf }}</ref>
It was inspired by work of [[David H. Hubel|Hubel]] and [[Torsten Wiesel|Wiesel]] in the 1950s and 1960s which showed that cat [[visual cortex|visual cortices]] contain neurons that individually respond to small regions of the [[visual field]].
The neocognitron introduced the two basic types of layers in CNNs: convolutional layers, and downsampling layers. A convolutional layer contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters. Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. This downsampling helps to correctly classify objects in visual scenes even when the objects are shifted.
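In modern notation, a convolutional unit with activation function <math>f</math>, a shared <math>k \times k</math> filter <math>w</math> and bias <math>b</math> computes its output from a patch of the previous layer <math>x</math>, and an average-downsampling unit with window side <math>s</math> simply averages a patch of the resulting feature map <math>y</math>:
<math display="block">y_{i,j} = f\!\left( b + \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} w_{m,n}\, x_{i+m,\, j+n} \right), \qquad p_{i,j} = \frac{1}{s^2} \sum_{m=0}^{s-1} \sum_{n=0}^{s-1} y_{s i + m,\, s j + n} .</math>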


In 1969, [[Kunihiko Fukushima]] also introduced the [[rectifier (neural networks)|ReLU]] (rectified linear unit) [[activation function]].<ref name="Fukushima1969">{{cite journal |first1=K. |last1=Fukushima |title=Visual feature extraction by a multilayered network of analog threshold elements |journal=IEEE Transactions on Systems Science and Cybernetics |volume=5 |issue=4 |date=1969 |pages=322–333 |doi=10.1109/TSSC.1969.300225}}</ref><ref name="DLhistory">{{cite arXiv |eprint=2212.11279 |class=cs.NE |first=Juergen |last=Schmidhuber |author-link=Juergen Schmidhuber |title=Annotated History of Modern AI and Deep Learning |date=2022}}</ref> The rectifier has become the most popular activation function for CNNs and [[deep learning|deep neural networks]] in general.<ref>{{cite arXiv |last1=Ramachandran |first1=Prajit |last2=Barret |first2=Zoph |last3=Quoc |first3=V. Le |date=October 16, 2017 |title=Searching for Activation Functions |eprint=1710.05941 |class=cs.NE}}</ref>
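The rectified linear unit passes positive inputs unchanged and clips negative inputs to zero:
<math display="block">f(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases}</math>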


The [[time delay neural network]] (TDNN) was introduced in 1987 by [[Alex Waibel]] and was one of the first CNNs, as it achieved shift invariance.<ref name=Waibel1987>{{cite conference |title=Phoneme Recognition Using Time-Delay Neural Networks |last1=Waibel |first1=Alex |date=December 1987 |location=Tokyo, Japan |conference=Meeting of the Institute of Electrical, Information and Communication Engineers (IEICE)}}</ref> It did so by utilizing weight sharing in combination with [[backpropagation]] training.<ref name="speechsignal">[[Alex Waibel|Alexander Waibel]] et al., ''[http://www.inf.ufrgs.br/~engel/data/media/file/cmp121/waibel89_TDNN.pdf Phoneme Recognition Using Time-Delay Neural Networks]'' IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume 37, No. 3, pp. 328. – 339 March 1989.</ref> Thus, while also using a pyramidal structure as in the neocognitron, it performed a global optimization of the weights instead of a local one.<ref name=Waibel1987/>


In 1988, Wei Zhang et al. applied [[backpropagation]] to a CNN (a simplified Neocognitron with convolutional interconnections between the image feature layers and the last fully connected layer) for alphabet recognition. They also proposed an implementation of the CNN with an optical computing system.<ref name="wz1988">{{cite journal |last=Zhang |first=Wei |date=1988 |title=Shift-invariant pattern recognition neural network and its optical architecture |url=https://drive.google.com/file/d/1nN_5odSG_QVae54EsQN_qSz-0ZsX6wA0/view?usp=sharing |journal=Proceedings of Annual Conference of the Japan Society of Applied Physics}}</ref><ref name="wz1990">{{cite journal |last=Zhang |first=Wei |date=1990 |title=Parallel distributed processing model with local space-invariant interconnections and its optical architecture |url=https://drive.google.com/file/d/0B65v6Wo67Tk5ODRzZmhSR29VeDg/view?usp=sharing |journal=Applied Optics |volume=29 |issue=32 |pages=4790–7 |doi=10.1364/AO.29.004790 |pmid=20577468 |bibcode=1990ApOpt..29.4790Z}}</ref>


[[Kunihiko Fukushima]] published the [[neocognitron]] in 1980.<ref name="intro2">{{cite journal |last=Fukushima |first=Kunihiko |year=1980 |title=Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position |url=https://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf |url-status=live |journal=Biological Cybernetics |volume=36 |issue=4 |pages=193–202 |doi=10.1007/BF00344251 |pmid=7370364 |s2cid=206775608 |archive-url=https://web.archive.org/web/20140603013137/http://www.cs.princeton.edu/courses/archive/spr08/cos598B/Readings/Fukushima1980.pdf |archive-date=3 June 2014 |access-date=16 November 2013}}</ref> [[Max pooling]] appears in a 1982 publication on the neocognitron.<ref>{{Cite journal |last1=Fukushima |first1=Kunihiko |last2=Miyake |first2=Sei |date=1982-01-01 |title=Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position |url=https://www.sciencedirect.com/science/article/abs/pii/0031320382900243 |journal=Pattern Recognition |volume=15 |issue=6 |pages=455–469 |doi=10.1016/0031-3203(82)90024-3 |bibcode=1982PatRe..15..455F |issn=0031-3203}}</ref> In 1989, [[Yann LeCun]] et al. trained a CNN with the purpose of [[Handwriting recognition|recognizing handwritten ZIP code]]s on mail. While the algorithm worked, training required 3 days.<ref name="LECUN1989">LeCun ''et al.'', "Backpropagation Applied to Handwritten Zip Code Recognition," ''Neural Computation'', 1, pp. 541–551, 1989.</ref><ref>{{Cite journal |last1=LeCun |first1=Yann |last2=Boser |first2=Bernhard |last3=Denker |first3=John |last4=Henderson |first4=Donnie |last5=Howard |first5=R. |last6=Hubbard |first6=Wayne |last7=Jackel |first7=Lawrence |date=1989 |title=Handwritten Digit Recognition with a Back-Propagation Network |url=https://proceedings.neurips.cc/paper/1989/hash/53c3bce66e43be4f209556518c2fcb54-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Morgan-Kaufmann |volume=2}}</ref> It used max pooling. Learning was fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types.
Subsequently, Wei Zhang et al. modified their model by removing the last fully connected layer and applied it to medical image object segmentation in 1991<ref>{{cite journal |last=Zhang |first=Wei |date=1991 |title=Image processing of human corneal endothelium based on a learning network |url=https://drive.google.com/file/d/0B65v6Wo67Tk5cm5DTlNGd0NPUmM/view?usp=sharing |journal=Applied Optics |volume=30 |issue=29 |pages=4211–7 |doi=10.1364/AO.30.004211 |pmid=20706526 |bibcode=1991ApOpt..30.4211Z}}</ref> and to breast cancer detection in mammograms in 1994.<ref>{{cite journal |last=Zhang |first=Wei |date=1994 |title=Computerized detection of clustered microcalcifications in digital mammograms using a shift-invariant artificial neural network |url=https://drive.google.com/file/d/0B65v6Wo67Tk5Ml9qeW5nQ3poVTQ/view?usp=sharing |journal=Medical Physics |volume=21 |issue=4 |pages=517–24 |doi=10.1118/1.597177 |pmid=8058017 |bibcode=1994MedPh..21..517Z}}</ref>


In 1990 Yamaguchi et al. introduced max-pooling, a fixed filtering operation that calculates and propagates the maximum value of a given region. They combined TDNNs with max-pooling in order to realize a speaker independent isolated word recognition system.<ref name=Yamaguchi111990>{{cite conference |title=A Neural Network for Speaker-Independent Isolated Word Recognition |last1=Yamaguchi |first1=Kouichi |last2=Sakamoto |first2=Kenji |last3=Akabane |first3=Toshio |last4=Fujimoto |first4=Yoshiji |date=November 1990 |location=Kobe, Japan |conference=First International Conference on Spoken Language Processing (ICSLP 90) |url=https://www.isca-speech.org/archive/icslp_1990/i90_1077.html |access-date=2019-09-04 |archive-date=2021-03-07 |archive-url=https://web.archive.org/web/20210307233750/https://www.isca-speech.org/archive/icslp_1990/i90_1077.html |url-status=dead }}</ref>
In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging, J. Weng et al. also used max-pooling where a downsampling unit computes the maximum of the activations of the units in its patch.<ref name="Weng1992">J. Weng, N. Ahuja and T. S. Huang, "[http://www.cse.msu.edu/~weng/research/CresceptronIJCNN1992.pdf Cresceptron: a self-organizing neural network which grows adaptively]," ''Proc. International Joint Conference on Neural Networks'', Baltimore, Maryland, vol I, pp. 576–581, June, 1992.</ref><ref name="Weng19932">J. Weng, N. Ahuja and T. S. Huang, "[http://www.cse.msu.edu/~weng/research/CresceptronICCV1993.pdf Learning recognition and segmentation of 3-D objects from 2-D images]," ''Proc. 4th International Conf. Computer Vision'', Berlin, Germany, pp. 121–128, May, 1993.</ref><ref name="Weng1997">J. Weng, N. Ahuja and T. S. Huang, "[http://www.cse.msu.edu/~weng/research/CresceptronIJCV.pdf Learning recognition and segmentation using the Cresceptron]," ''International Journal of Computer Vision'', vol. 25, no. 2, pp. 105–139, Nov. 1997.</ref><ref name="weng1993">{{cite book |first1=J |last1=Weng |first2=N |last2=Ahuja |first3=TS |last3=Huang |title=1993 (4th) International Conference on Computer Vision |chapter=Learning recognition and segmentation of 3-D objects from 2-D images |s2cid=8619176 |journal=Proc. 4th International Conf. Computer Vision |year=1993 |pages=121–128 |doi=10.1109/ICCV.1993.378228 |isbn=0-8186-3870-2}}</ref> Max-pooling is often used in modern CNNs.<ref name="schdeepscholar">{{cite journal |last1=Schmidhuber |first1=Jürgen |title=Deep Learning |journal=Scholarpedia |url=http://www.scholarpedia.org/article/Deep_Learning |date=2015 |volume=10 |issue=11 |pages=1527–54 |pmid=16764513 |doi=10.1162/neco.2006.18.7.1527 |citeseerx=10.1.1.76.1541 |s2cid=2309950}}</ref>
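
A minimal NumPy sketch of 2×2 max-pooling as described above; the array values are illustrative:

<syntaxhighlight lang="python">
import numpy as np

def max_pool_2x2(feature_map):
    # Each non-overlapping 2x2 patch is reduced to the maximum of its activations.
    h, w = feature_map.shape
    patches = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return patches.max(axis=(1, 3))

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 2, 7, 8]], dtype=float)
print(max_pool_2x2(x))  # [[4. 2.]
                        #  [2. 8.]]
</syntaxhighlight>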


LeNet-5, a 7-level CNN by [[Yann LeCun]] et al. in 1998,<ref name="lecun98">{{cite journal |last=LeCun |first=Yann |author2=Léon Bottou |author3=Yoshua Bengio |author4=Patrick Haffner |title=Gradient-based learning applied to document recognition |journal=Proceedings of the IEEE |year=1998 |volume=86 |issue=11 |pages=2278–2324 |url=http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf |access-date=October 7, 2016 |doi=10.1109/5.726791 |citeseerx=10.1.1.32.9552|s2cid=14542261 }}</ref> which classifies digits, was applied by several banks to recognize hand-written numbers on checks ({{Langx|en-GB|cheques}}) digitized in 32x32 pixel images. Processing higher-resolution images requires larger and deeper CNNs, so the technique is constrained by the availability of computing resources.


In 2010, backpropagation training through [[Convolutional neural network#Pooling layer|max-pooling]] was accelerated by GPUs and shown to perform better than other pooling variants.<ref name="Scherer2010">Dominik Scherer, Andreas C. Müller, and Sven Behnke: "[https://www.ais.uni-bonn.de/papers/icann2010_maxpool.pdf Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition]," ''In 20th International Conference Artificial Neural Networks (ICANN)'', pp. 92–101, 2010. {{doi|10.1007/978-3-642-15825-4_10}}.</ref>
Behnke (2003) relied only on the sign of the gradient ([[Rprop]])<ref>{{cite book|url=http://www.ais.uni-bonn.de/books/LNCS2766.pdf|title=Hierarchical Neural Networks for Image Interpretation.|author=Sven Behnke|publisher=Springer|year=2003|series=Lecture Notes in Computer Science|volume=2766}}</ref> on problems such as image reconstruction and face localization. Rprop is a [[First-order approximation|first-order]] [[optimization (mathematics)|optimization]] [[algorithm]] created by Martin Riedmiller and Heinrich Braun in 1992.<ref name="riedmiller1992">Martin Riedmiller und Heinrich Braun: Rprop – A Fast Adaptive Learning Algorithm. Proceedings of the International Symposium on Computer and Information Science VII, 1992</ref>
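
A minimal sketch of the sign-based update at the core of Rprop, assuming per-parameter step sizes and the commonly used increase/decrease factors of 1.2 and 0.5; refinements such as weight backtracking are omitted:

<syntaxhighlight lang="python">
import numpy as np

def rprop_step(params, grad, prev_grad, step,
               eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    # Only the sign of the gradient is used; its magnitude is ignored.
    same_sign = grad * prev_grad
    step = np.where(same_sign > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(same_sign < 0, np.maximum(step * eta_minus, step_min), step)
    return params - np.sign(grad) * step, step

# Toy quadratic loss 0.5 * ||params||^2, whose gradient is simply params.
params, step, prev_grad = np.array([3.0, -2.0]), np.full(2, 0.1), np.zeros(2)
for _ in range(20):
    grad = params
    params, step = rprop_step(params, grad, prev_grad, step)
    prev_grad = grad
print(params)  # close to the minimum at [0, 0]
</syntaxhighlight>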


== Deep learning ==
The deep learning revolution started around CNN- and GPU-based computer vision.


Although CNNs trained by backpropagation had been around for decades and GPU implementations of NNs for years,<ref name="jung2004">{{cite journal |last1=Oh |first1=K.-S. |last2=Jung |first2=K. |year=2004 |title=GPU implementation of neural networks |journal=Pattern Recognition |volume=37 |issue=6 |pages=1311–1314 |bibcode=2004PatRe..37.1311O |doi=10.1016/j.patcog.2004.01.013}}</ref> including CNNs,<ref name="chellapilla2006">{{Citation |last1=Chellapilla |first1=Kumar |title=High performance convolutional neural networks for document processing |date=2006 |url=https://hal.inria.fr/inria-00112631/document |access-date=2021-02-14 |archive-url=https://web.archive.org/web/20200518193413/https://hal.inria.fr/inria-00112631/document |archive-date=2020-05-18 |url-status=live |last2=Puri |first2=Sidd |last3=Simard |first3=Patrice}}</ref> faster implementations of CNNs on GPUs were needed to progress on computer vision. Later, as deep learning became widespread, specialized hardware and algorithmic optimizations were developed specifically for deep learning.<ref name="sze2017">{{cite arXiv |eprint=1703.09039 |class=cs.CV |first1=Vivienne |last1=Sze |first2=Yu-Hsin |last2=Chen |author1-link=Vivienne Sze |title=Efficient Processing of Deep Neural Networks: A Tutorial and Survey |last3=Yang |first3=Tien-Ju |last4=Emer |first4=Joel |year=2017}}</ref>
ANNs were able to guarantee shift invariance for both small and large natural objects in large cluttered scenes only when invariance extended beyond shift to all ANN-learned concepts, such as location, type (object class label), scale, and lighting. This was realized in Developmental Networks (DNs)<ref name="Weng2011">J. Weng, "[http://www.cse.msu.edu/~weng/research/WhyPass-Weng-NI-2011.pdf Why Have We Passed 'Neural Networks Do not Abstract Well'?]," ''Natural Intelligence: the INNS Magazine'', vol. 1, no.1, pp. 13–22, 2011.</ref> whose embodiments are Where-What Networks, WWN-1 (2008)<ref name="Weng08">Z. Ji, J. Weng, and D. Prokhorov, "[http://www.cse.msu.edu/~weng/research/ICDL08_0077.pdf Where-What Network 1: Where and What Assist Each Other Through Top-down Connections]," ''Proc. 7th International Conference on Development and Learning (ICDL'08)'', Monterey, CA, Aug. 9–12, pp. 1–6, 2008.</ref> through WWN-7 (2013).<ref name="Weng13">X. Wu, G. Guo, and J. Weng, "[http://www.cse.msu.edu/~weng/research/WWN7-Wu-ICBM-2013.pdf Skull-closed Autonomous Development: WWN-7 Dealing with Scales]," ''Proc. International Conference on Brain-Mind'', July 27–28, East Lansing, Michigan, pp. 1–9, 2013.</ref>


A key advance for the deep learning revolution was progress in hardware, especially GPUs. Some early work dated back to 2004.<ref name="jung2004" /><ref name="chellapilla2006" /> In 2009, Raina, Madhavan, and [[Andrew Ng]] reported a 100-million-parameter deep belief network trained on 30 Nvidia [[GeForce GTX 280]] GPUs, an early demonstration of GPU-based deep learning. They reported up to 70 times faster training.<ref>{{Cite book |last1=Raina |first1=Rajat |last2=Madhavan |first2=Anand |last3=Ng |first3=Andrew Y. |chapter=Large-scale deep unsupervised learning using graphics processors |date=2009-06-14 |title=Proceedings of the 26th Annual International Conference on Machine Learning |chapter-url=https://doi.org/10.1145/1553374.1553486 |series=ICML '09 |location=New York, NY, USA |publisher=Association for Computing Machinery |pages=873–880 |doi=10.1145/1553374.1553486 |isbn=978-1-60558-516-1}}</ref>

In 2011, a CNN named ''DanNet<ref name=":32">{{Cite journal |last1=Cireşan |first1=Dan Claudiu |last2=Meier |first2=Ueli |last3=Gambardella |first3=Luca Maria |last4=Schmidhuber |first4=Jürgen |date=21 September 2010 |title=Deep, Big, Simple Neural Nets for Handwritten Digit Recognition |journal=Neural Computation |volume=22 |issue=12 |pages=3207–3220 |arxiv=1003.0358 |doi=10.1162/neco_a_00052 |issn=0899-7667 |pmid=20858131 |s2cid=1918673}}</ref>''<ref name=":62">{{Cite journal |last1=Ciresan |first1=D. C. |last2=Meier |first2=U. |last3=Masci |first3=J. |last4=Gambardella |first4=L.M. |last5=Schmidhuber |first5=J. |date=2011 |title=Flexible, High Performance Convolutional Neural Networks for Image Classification |url=http://ijcai.org/papers11/Papers/IJCAI11-210.pdf |url-status=live |journal=International Joint Conference on Artificial Intelligence |doi=10.5591/978-1-57735-516-8/ijcai11-210 |archive-url=https://web.archive.org/web/20140929094040/http://ijcai.org/papers11/Papers/IJCAI11-210.pdf |archive-date=2014-09-29 |access-date=2017-06-13}}</ref> by Dan Ciresan, Ueli Meier, Jonathan Masci, [[Luca Maria Gambardella]], and [[Jürgen Schmidhuber]] achieved for the first time superhuman performance in a visual pattern recognition contest, outperforming traditional methods by a factor of 3.<ref name="SCHIDHUB3"/> It then won more contests.<ref name=":82">{{Cite book |last1=Ciresan |first1=Dan |url=http://papers.nips.cc/paper/4741-deep-neural-networks-segment-neuronal-membranes-in-electron-microscopy-images.pdf |title=Advances in Neural Information Processing Systems 25 |last2=Giusti |first2=Alessandro |last3=Gambardella |first3=Luca M. |last4=Schmidhuber |first4=Jürgen |date=2012 |publisher=Curran Associates, Inc. |editor-last=Pereira |editor-first=F. |pages=2843–2851 |access-date=2017-06-13 |editor-last2=Burges |editor-first2=C. J. C. |editor-last3=Bottou |editor-first3=L. |editor-last4=Weinberger |editor-first4=K. Q. |archive-url=https://web.archive.org/web/20170809081713/http://papers.nips.cc/paper/4741-deep-neural-networks-segment-neuronal-membranes-in-electron-microscopy-images.pdf |archive-date=2017-08-09 |url-status=live}}</ref><ref name="ciresan2013miccai">{{Cite book |last1=Ciresan |first1=D. |title=Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013 |last2=Giusti |first2=A. |last3=Gambardella |first3=L.M. |last4=Schmidhuber |first4=J. |date=2013 |isbn=978-3-642-38708-1 |series=Lecture Notes in Computer Science |volume=7908 |pages=411–418 |chapter=Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks |doi=10.1007/978-3-642-40763-5_51 |pmid=24579167 |issue=Pt 2}}</ref> They also showed how [[Max pooling|max-pooling]] CNNs on GPU improved performance significantly.<ref name=":9">{{Cite book |last1=Ciresan |first1=D. |title=2012 IEEE Conference on Computer Vision and Pattern Recognition |last2=Meier |first2=U. |last3=Schmidhuber |first3=J. |year=2012 |isbn=978-1-4673-1228-8 |pages=3642–3649 |chapter=Multi-column deep neural networks for image classification |doi=10.1109/cvpr.2012.6248110 |arxiv=1202.2745 |s2cid=2161592}}</ref>

Many discoveries were empirical and focused on engineering. For example, in 2011, Xavier Glorot, Antoine Bordes and [[Yoshua Bengio]] found that the [[rectifier (neural networks)|ReLU]]<ref name="Fukushima1969" /> worked better than widely used activation functions prior to 2011.

In October 2012, [[AlexNet]] by [[Alex Krizhevsky]], [[Ilya Sutskever]], and [[Geoffrey Hinton]]<ref name="krizhevsky20122">{{cite journal |last1=Krizhevsky |first1=Alex |last2=Sutskever |first2=Ilya |last3=Hinton |first3=Geoffrey |date=2012 |title=ImageNet Classification with Deep Convolutional Neural Networks |url=https://www.cs.toronto.edu/~kriz/imagenet_classification_with_deep_convolutional.pdf |url-status=live |journal=NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada |archive-url=https://web.archive.org/web/20170110123024/http://www.cs.toronto.edu/~kriz/imagenet_classification_with_deep_convolutional.pdf |archive-date=2017-01-10 |access-date=2017-05-24}}</ref> won the large-scale [[ImageNet competition]] by a significant margin over shallow machine learning methods. Further incremental improvements included the VGG-16 network by [[Karen Simonyan]] and [[Andrew Zisserman]]<ref name="VGG">{{cite arXiv |eprint=1409.1556 |class=cs.CV |first1=Karen |last1=Simonyan |first2=Andrew |last2=Zisserman |title=Very Deep Convolutional Networks for Large-Scale Image Recognition |year=2014}}</ref> and Google's [[Inceptionv3]].<ref name="szegedy">{{Cite journal |last=Szegedy |first=Christian |date=2015 |title=Going deeper with convolutions |url=https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/43022.pdf |journal=Cvpr2015|arxiv=1409.4842 }}</ref>

The success in image classification was then extended to the more challenging task of [[Automatic image annotation|generating descriptions]] (captions) for images, often as a combination of CNNs and LSTMs.<ref name="1411.4555">{{cite arXiv |eprint=1411.4555 |class=cs.CV |first1=Oriol |last1=Vinyals |first2=Alexander |last2=Toshev |title=Show and Tell: A Neural Image Caption Generator |last3=Bengio |first3=Samy |last4=Erhan |first4=Dumitru |year=2014}}.</ref><ref name="1411.4952">{{cite arXiv |eprint=1411.4952 |class=cs.CV |first1=Hao |last1=Fang |first2=Saurabh |last2=Gupta |title=From Captions to Visual Concepts and Back |last3=Iandola |first3=Forrest |last4=Srivastava |first4=Rupesh |last5=Deng |first5=Li |last6=Dollár |first6=Piotr |last7=Gao |first7=Jianfeng |last8=He |first8=Xiaodong |last9=Mitchell |first9=Margaret |last10=Platt |first10=John C |last11=Lawrence Zitnick |first11=C |last12=Zweig |first12=Geoffrey |year=2014}}.</ref><ref name="1411.2539">{{cite arXiv |eprint=1411.2539 |class=cs.LG |first1=Ryan |last1=Kiros |first2=Ruslan |last2=Salakhutdinov |title=Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models |last3=Zemel |first3=Richard S |year=2014}}.</ref>

In 2014, the state of the art was training "very deep" neural networks with 20 to 30 layers.<ref>{{Citation |last1=Simonyan |first1=Karen |title=Very Deep Convolutional Networks for Large-Scale Image Recognition |date=2015-04-10 |arxiv=1409.1556 |last2=Zisserman |first2=Andrew}}</ref> Stacking too many layers led to a steep reduction in [[Training, validation, and test data sets|training]] accuracy,<ref name="prelu2">{{cite arXiv |eprint=1502.01852 |class=cs.CV |first1=Kaiming |last1=He |first2=Xiangyu |last2=Zhang |title=Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification |last3=Ren |first3=Shaoqing |last4=Sun |first4=Jian |year=2016}}</ref> known as the "degradation" problem.<ref name="resnet2">{{Cite conference |last1=He |first1=Kaiming |last2=Zhang |first2=Xiangyu |last3=Ren |first3=Shaoqing |last4=Sun |first4=Jian |date=10 Dec 2015 |title=Deep Residual Learning for Image Recognition |arxiv=1512.03385}}</ref> In 2015, two techniques were developed concurrently to train very deep networks: the [[highway network]]<ref name="highway20153">{{cite arXiv |eprint=1505.00387 |class=cs.LG |first1=Rupesh Kumar |last1=Srivastava |first2=Klaus |last2=Greff |title=Highway Networks |date=2 May 2015 |last3=Schmidhuber |first3=Jürgen}}</ref> and the [[residual neural network]] (ResNet).<ref name="resnet20153">{{Cite conference |last1=He |first1=Kaiming |last2=Zhang |first2=Xiangyu |last3=Ren |first3=Shaoqing |last4=Sun |first4=Jian |date=2016 |title=Deep Residual Learning for Image Recognition |url=https://ieeexplore.ieee.org/document/7780459 |location=Las Vegas, NV, USA |publisher=IEEE |pages=770–778 |arxiv=1512.03385 |doi=10.1109/CVPR.2016.90 |isbn=978-1-4673-8851-1 |journal=2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}}</ref> The ResNet team empirically tested various tricks for training deeper networks until they discovered the deep residual architecture.<ref>{{Cite web |last=Linn |first=Allison |date=2015-12-10 |title=Microsoft researchers win ImageNet computer vision challenge |url=https://blogs.microsoft.com/ai/microsoft-researchers-win-imagenet-computer-vision-challenge/ |access-date=2024-06-29 |website=The AI Blog |language=en-US}}</ref>
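
A minimal NumPy sketch of the residual idea behind ResNet: the skip connection adds the block's input back to its output, <math>y = x + F(x)</math>, so extra blocks can default to the identity instead of degrading accuracy. The two-layer block and its dimensions are illustrative:

<syntaxhighlight lang="python">
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, w1, w2):
    # y = x + F(x): the block only has to learn the residual F,
    # and reduces to the identity mapping when F is zero.
    return x + w2 @ relu(w1 @ x)

x = np.arange(4.0)
w1 = np.zeros((4, 4))   # with zero weights the block is exactly the identity
w2 = np.zeros((4, 4))
print(np.allclose(residual_block(x, w1, w2), x))  # True
</syntaxhighlight>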

== Generative adversarial networks ==


{{main|Generative adversarial network}}


In 1991, [[Jürgen Schmidhuber|Juergen Schmidhuber]] published "artificial curiosity": two [[Neural network|neural networks]] contesting with each other in a [[zero-sum game]], where one network's gain is the other network's loss.<ref name="curiosity19912">{{cite conference |last1=Schmidhuber |first1=Jürgen |author-link=Juergen Schmidhuber |date=1991 |title=A possibility for implementing curiosity and boredom in model-building neural controllers |publisher=MIT Press/Bradford Books |pages=222–227 |book-title=Proc. SAB'1991}}</ref> The first network is a [[generative model]] that models a [[probability distribution]] over output patterns. The second network learns by [[gradient descent]] to predict the reactions of the environment to these patterns. GANs can be regarded as a case where the environmental reaction is 1 or 0 depending on whether the first network's output is in a given set.<ref name="gancurpm20202">{{Cite journal |last=Schmidhuber |first=Jürgen |author-link=Juergen Schmidhuber |date=2020 |title=Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991) |journal=Neural Networks |language=en |volume=127 |pages=58–66 |arxiv=1906.04493 |doi=10.1016/j.neunet.2020.04.008 |pmid=32334341 |s2cid=216056336}}</ref> It was extended to "predictability minimization" to create disentangled representations of input patterns.<ref name="pm1992">{{Cite journal |last=Schmidhuber |first=Jürgen |author-link=Juergen Schmidhuber |date=November 1992 |title=Learning Factorial Codes by Predictability Minimization |journal=Neural Computation |language=en |volume=4 |issue=6 |pages=863–879 |doi=10.1162/neco.1992.4.6.863 |s2cid=42023620}}</ref><ref name="pm1996">{{Cite journal |last1=Schmidhuber |first1=Jürgen |last2=Eldracher |first2=Martin |last3=Foltin |first3=Bernhard |date=1996 |title=Semilinear predictability minimzation produces well-known feature detectors |journal=Neural Computation |language=en |volume=8 |issue=4 |pages=773–786 |doi=10.1162/neco.1996.8.4.773 |s2cid=16154391}}</ref>


Other people had similar ideas but did not develop them similarly. An idea involving adversarial networks was published in a 2010 blog post by Olli Niemitalo.<ref name="olli2010">{{cite web |last1=Niemitalo |first1=Olli |date=February 24, 2010 |title=A method for training artificial neural networks to generate missing data within a variable context |url=http://yehar.com:80/blog/?p=167 |url-status=live |archive-url=https://web.archive.org/web/20120312111546/http://yehar.com/blog/?p=167 |archive-date=March 12, 2012 |access-date=February 22, 2019 |newspaper=Internet Archive (Wayback Machine)}}</ref> This idea was never implemented and did not involve [[stochasticity]] in the generator and thus was not a generative model. It is now known as a conditional GAN or cGAN.<ref name="reddit3">{{cite web |year=2019 |title=GANs were invented in 2010? |url=https://www.reddit.com/r/MachineLearning/comments/bnqm0p/d_gans_were_invented_in_2010/ |access-date=2019-05-28 |website=reddit r/MachineLearning |language=en-US}}</ref> An idea similar to GANs was used to model animal behavior by Li, Gauci and Gross in 2013.<ref name="Li-etal-GECCO2013">{{cite conference |last1=Li |first1=Wei |last2=Gauci |first2=Melvin |last3=Gross |first3=Roderich |date=July 6, 2013 |title=Proceeding of the fifteenth annual conference on Genetic and evolutionary computation conference - GECCO '13 |location=Amsterdam, the Netherlands |publisher=ACM |pages=223–230 |doi=10.1145/2463372.2465801 |isbn=9781450319638 |chapter=A Coevolutionary Approach to Learn Animal Behavior Through Controlled Interaction |book-title=Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO 2013)}}</ref>


Another inspiration for GANs was noise-contrastive estimation,<ref>{{cite journal |last1=Gutmann |first1=Michael |last2=Hyvärinen |first2=Aapo |title=Noise-Contrastive Estimation |url=http://proceedings.mlr.press/v9/gutmann10a/gutmann10a.pdf |journal=International Conference on AI and Statistics}}</ref> which uses the same loss function as GANs and which Goodfellow studied during his PhD in 2010–2014.


The [[generative adversarial network]] (GAN) of [[Ian Goodfellow]] et al. (2014)<ref name="GANnips2">{{cite conference |last1=Goodfellow |first1=Ian |last2=Pouget-Abadie |first2=Jean |last3=Mirza |first3=Mehdi |last4=Xu |first4=Bing |last5=Warde-Farley |first5=David |last6=Ozair |first6=Sherjil |last7=Courville |first7=Aaron |last8=Bengio |first8=Yoshua |year=2014 |title=Generative Adversarial Networks |url=https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf |conference=Proceedings of the International Conference on Neural Information Processing Systems (NIPS 2014) |pages=2672–2680 |archive-url=https://web.archive.org/web/20191122034612/http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf |archive-date=22 November 2019 |access-date=20 August 2019 |url-status=live}}</ref> became the state of the art in generative modeling during the 2014–2018 period. Excellent image quality was achieved by [[Nvidia]]'s [[StyleGAN]] (2018),<ref name="SyncedReview20182">{{Cite web |date=December 14, 2018 |title=GAN 2.0: NVIDIA's Hyperrealistic Face Generator |url=https://syncedreview.com/2018/12/14/gan-2-0-nvidias-hyperrealistic-face-generator/ |access-date=October 3, 2019 |website=SyncedReview.com}}</ref> based on the Progressive GAN by Tero Karras et al.,<ref name="progressiveGAN20172">{{cite arXiv |eprint=1710.10196 |class=cs.NE |first1=T. |last1=Karras |first2=T. |last2=Aila |title=Progressive Growing of GANs for Improved Quality, Stability, and Variation |date=26 February 2018 |last3=Laine |first3=S. |last4=Lehtinen |first4=J.}}</ref> in which the GAN generator is grown from small to large scale in a pyramidal fashion. GAN-based image generation achieved popular success and provoked discussions concerning [[Deepfake|deepfakes]].<ref>{{Cite web |title=Prepare, Don't Panic: Synthetic Media and Deepfakes |url=https://lab.witness.org/projects/synthetic-media-and-deep-fakes/ |url-status=live |archive-url=https://web.archive.org/web/20201202231744/https://lab.witness.org/projects/synthetic-media-and-deep-fakes/ |archive-date=2 December 2020 |access-date=25 November 2020 |publisher=witness.org}}</ref> [[Diffusion model|Diffusion models]] (2015)<ref>{{Cite journal |last1=Sohl-Dickstein |first1=Jascha |last2=Weiss |first2=Eric |last3=Maheswaranathan |first3=Niru |last4=Ganguli |first4=Surya |date=2015-06-01 |title=Deep Unsupervised Learning using Nonequilibrium Thermodynamics |url=http://proceedings.mlr.press/v37/sohl-dickstein15.pdf |journal=Proceedings of the 32nd International Conference on Machine Learning |language=en |publisher=PMLR |volume=37 |pages=2256–2265|arxiv=1503.03585 }}</ref> have since eclipsed GANs in generative modeling, with systems such as [[DALL-E|DALL·E 2]] (2022) and [[Stable Diffusion]] (2022).
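
For reference, the zero-sum game played by the two networks of a GAN is usually written as a minimax objective over the generator ''G'' and the discriminator ''D'' (following Goodfellow et al., 2014, cited above):

<math display="block">\min_G \max_D \; \mathbb{E}_{x \sim p_\text{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]</math>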


== Attention mechanism and Transformer ==
{{Main|Attention (machine learning)|Transformer (deep learning architecture)}}
Human [[Attentional control|selective attention]] had been studied in neuroscience and cognitive psychology.<ref>{{Cite book |last1=Kramer |first1=Arthur F. |url=http://www.oxfordscholarship.com/view/10.1093/acprof:oso/9780195305722.001.0001/acprof-9780195305722 |title=Attention: From Theory to Practice |last2=Wiegmann |first2=Douglas A. |last3=Kirlik |first3=Alex |date=2006-12-28 |publisher=Oxford University Press |isbn=978-0-19-530572-2 |chapter=1 Attention: From History to Application |doi=10.1093/acprof:oso/9780195305722.003.0001}}</ref> Selective attention in audition was studied in the [[cocktail party effect]] ([[Colin Cherry]], 1953).<ref name="Cherry 1953">{{cite journal |vauthors=Cherry EC |year=1953 |title=Some Experiments on the Recognition of Speech, with One and with Two Ears |url=http://www.ee.columbia.edu/~dpwe/papers/Cherry53-cpe.pdf |journal=The Journal of the Acoustical Society of America |volume=25 |issue=5 |pages=975–79 |bibcode=1953ASAJ...25..975C |doi=10.1121/1.1907229 |issn=0001-4966 |hdl-access=free |hdl=11858/00-001M-0000-002A-F750-3}}</ref> [[Donald Broadbent]] (1958) proposed the [[Broadbent's filter model of attention|filter model of attention]].<ref name="Broadbent">{{cite book |last=Broadbent |first=D |author-link=Donald Broadbent |title=Perception and Communication |publisher=Pergamon Press |year=1958 |location=London}}</ref> Selective attention in vision was studied in the 1960s using [[George Sperling]]'s [[Iconic memory#Sperling's partial report procedure|partial report paradigm]]. It was also noticed that [[Saccade|saccade control]] is modulated by cognitive processes, in that the eye moves preferentially towards areas of high [[Salience (neuroscience)|salience]]. As the fovea of the eye is small, the eye cannot sharply resolve the whole visual field at once. The use of saccade control allows the eye to quickly scan important features of a scene.<ref>{{Cite journal |last1=Kowler |first1=Eileen |last2=Anderson |first2=Eric |last3=Dosher |first3=Barbara |last4=Blaser |first4=Erik |date=1995-07-01 |title=The role of attention in the programming of saccades |url=https://dx.doi.org/10.1016/0042-6989%2894%2900279-U |journal=Vision Research |volume=35 |issue=13 |pages=1897–1916 |doi=10.1016/0042-6989(94)00279-U |pmid=7660596 |issn=0042-6989}}</ref>


This research inspired algorithms such as a variant of the [[Neocognitron]].<ref>{{Cite journal |last=Fukushima |first=Kunihiko |date=1987-12-01 |title=Neural network model for selective attention in visual pattern recognition and associative recall |url=https://opg.optica.org/abstract.cfm?URI=ao-26-23-4985 |journal=Applied Optics |language=en |volume=26 |issue=23 |pages=4985–4992 |doi=10.1364/AO.26.004985 |pmid=20523477 |bibcode=1987ApOpt..26.4985F |issn=0003-6935}}</ref><ref>{{cite arXiv|last1=Ba |first1=Jimmy |title=Multiple Object Recognition with Visual Attention |date=2015-04-23 |last2=Mnih |first2=Volodymyr |last3=Kavukcuoglu |first3=Koray|class=cs.LG |eprint=1412.7755 }}</ref> Conversely, developments in neural networks inspired circuit models of biological visual attention.<ref>{{Citation |last1=Koch |first1=Christof |title=Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry |date=1987 |work=Matters of Intelligence: Conceptual Structures in Cognitive Neuroscience |pages=115–141 |editor-last=Vaina |editor-first=Lucia M. |url=https://doi.org/10.1007/978-94-009-3833-5_5 |access-date=2024-08-06 |place=Dordrecht |publisher=Springer Netherlands |language=en |doi=10.1007/978-94-009-3833-5_5 |isbn=978-94-009-3833-5 |last2=Ullman |first2=Shimon}}</ref><ref name=":12">{{Cite journal |last=Soydaner |first=Derya |date=August 2022 |title=Attention mechanism in neural networks: where it comes and where it goes |url=https://link.springer.com/10.1007/s00521-022-07366-3 |journal=Neural Computing and Applications |language=en |volume=34 |issue=16 |pages=13371–13385 |doi=10.1007/s00521-022-07366-3 |issn=0941-0643}}</ref>


A key aspect of attention mechanism is the use of multiplicative operations, which had been studied under the names of ''[[Higher-order neural network|higher-order neural networks]]'',<ref>{{Cite journal |last1=Giles |first1=C. Lee |last2=Maxwell |first2=Tom |date=1987-12-01 |title=Learning, invariance, and generalization in high-order neural networks |url=https://opg.optica.org/abstract.cfm?URI=ao-26-23-4972 |journal=Applied Optics |language=en |volume=26 |issue=23 |pages=4972–4978 |doi=10.1364/AO.26.004972 |pmid=20523475 |issn=0003-6935}}</ref> ''multiplication units'',<ref>{{Cite journal |last1=Feldman |first1=J. A. |last2=Ballard |first2=D. H. |date=1982-07-01 |title=Connectionist models and their properties |url=https://www.sciencedirect.com/science/article/pii/S0364021382800013 |journal=Cognitive Science |volume=6 |issue=3 |pages=205–254 |doi=10.1016/S0364-0213(82)80001-3 |issn=0364-0213}}</ref> ''sigma-pi units'',<ref name="PDP">{{Cite book |last1=Rumelhart |first1=David E. |url=https://stanford.edu/~jlmcc/papers/PDP/Chapter2.pdf |title=Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations, Chapter 2 |last2=Mcclelland |first2=James L. |last3=Group |first3=PDP Research |date=1987-07-29 |publisher=Bradford Books |isbn=978-0-262-68053-0 |location=Cambridge, Mass |language=en}}</ref> ''fast weight controllers'',<ref>{{Cite journal |last=Schmidhuber |first=Jürgen |date=January 1992 |title=Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks |url=https://direct.mit.edu/neco/article/4/1/131-139/5620 |journal=Neural Computation |language=en |volume=4 |issue=1 |pages=131–139 |doi=10.1162/neco.1992.4.1.131 |issn=0899-7667}}</ref> and ''hyper-networks''.<ref>{{cite arXiv|last1=Ha |first1=David |title=HyperNetworks |date=2016-12-01 |eprint=1609.09106 |last2=Dai |first2=Andrew |last3=Le |first3=Quoc V.|class=cs.LG }}</ref>
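
A second-order example of such a multiplicative unit, sketched in NumPy with illustrative weights: the output is a sum over weighted products of input pairs rather than a single weighted sum.

<syntaxhighlight lang="python">
import numpy as np

def multiplicative_unit(x, w):
    # sum_ij w[i, j] * x[i] * x[j]: pairs of inputs gate each other multiplicatively.
    return np.einsum('i,ij,j->', x, w, x)

x = np.array([1.0, 2.0, 3.0])
w = np.eye(3) * 0.1
print(multiplicative_unit(x, w))  # 0.1 * (1 + 4 + 9) = 1.4
</syntaxhighlight>
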
Many modern large language models such as [[ChatGPT]], [[GPT-4]], and [[BERT (language model)|BERT]] use a [[feedforward neural network]] called [[Transformer (machine learning model)|Transformer]], introduced by Ashish Vaswani et al. in their 2017 paper "Attention Is All You Need."<ref name="vaswani2017">{{cite arXiv|last8=Polosukhin|first8=Illia|last7=Kaiser|first7=Lukasz|last6=Gomez|first6=Aidan N.|last5=Jones|first5=Llion|last4=Uszkoreit|first4=Jakob|last3=Parmar|first3=Niki|last2=Shazeer|first2=Noam|last1=Vaswani|first1=Ashish|date=2017-06-12|title=Attention Is All You Need|eprint=1706.03762|class=cs.CL}}</ref>
Transformers have increasingly become the model of choice for [[natural language processing]] problems,<ref name="wolf2020">{{cite book|last1=Wolf|first1=Thomas|last2=Debut|first2=Lysandre|last3=Sanh|first3=Victor|last4=Chaumond|first4=Julien|last5=Delangue|first5=Clement|last6=Moi|first6=Anthony|last7=Cistac|first7=Pierric|last8=Rault|first8=Tim|last9=Louf|first9=Remi|last10=Funtowicz|first10=Morgan|last11=Davison|first11=Joe|last12=Shleifer|first12=Sam|last13=von Platen|first13=Patrick|last14=Ma|first14=Clara|last15=Jernite|first15=Yacine|last16=Plu|first16=Julien|last17=Xu|first17=Canwen|last18=Le Scao|first18=Teven|last19=Gugger|first19=Sylvain|last20=Drame|first20=Mariama|last21=Lhoest|first21=Quentin|last22=Rush|first22=Alexander|title=Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations|chapter=Transformers: State-of-the-Art Natural Language Processing|year=2020|pages=38–45|doi=10.18653/v1/2020.emnlp-demos.6|s2cid=208117506}}</ref> replacing [[recurrent neural network]]s (RNNs) such as [[long short-term memory]] (LSTM).<ref name="lstm1997">{{Cite journal |last1=Hochreiter|first1=Sepp|author-link=Sepp Hochreiter|last2=Schmidhuber|first2=Jürgen|s2cid=1915014|author-link2=Jürgen Schmidhuber|date=1 November 1997|title=Long Short-Term Memory|journal=Neural Computation|volume=9|issue=8 |pages=1735–1780 |doi=10.1162/neco.1997.9.8.1735|pmid=9377276|issn=0899-7667}}</ref>


=== Recurrent attention ===
During the deep learning era, the attention mechanism was developed to solve similar problems in encoding-decoding.<ref name=":03">{{Cite journal |last1=Niu |first1=Zhaoyang |last2=Zhong |first2=Guoqiang |last3=Yu |first3=Hui |date=2021-09-10 |title=A review on the attention mechanism of deep learning |url=https://www.sciencedirect.com/science/article/pii/S092523122100477X |journal=Neurocomputing |volume=452 |pages=48–62 |doi=10.1016/j.neucom.2021.03.091 |issn=0925-2312}}</ref>

Basic ideas for this go back a long way: in 1992, [[Juergen Schmidhuber]] published the Transformer with "linearized self-attention" (save for a normalization operator),<ref name="transform1992">{{Cite journal |last1=Schmidhuber|first1=Jürgen|author-link1=Jürgen Schmidhuber|date=1 November 1992|title=Learning to control fast-weight memories: an alternative to recurrent nets.|journal=Neural Computation|volume=4|issue=1 |pages=131–139|doi=10.1162/neco.1992.4.1.131 |s2cid=16683347 }}</ref> which is also called the "linear Transformer."<ref name="choromanski2020">{{cite arXiv|eprint=2009.14794|class=cs.CL|last1=Choromanski |first1=Krzysztof |last2=Likhosherstov |first2=Valerii |last3=Dohan |first3=David |last4=Song |first4=Xingyou |last5=Gane |first5=Andreea |last6=Sarlos |first6=Tamas |last7=Hawkins |first7=Peter |last8=Davis |first8=Jared |last9=Mohiuddin |first9=Afroz |last10=Kaiser |first10=Lukasz |last11=Belanger |first11=David |last12=Colwell |first12=Lucy |last13=Weller |first13=Adrian |title=Rethinking Attention with Performers |year=2020 }}</ref><ref name="schlag2021"/><ref name=DLhistory/> He advertised it as an "alternative to RNNs"<ref name="transform1992"/> that can learn "internal spotlights of attention,"<ref name="attention1993">{{Cite conference | last1=Schmidhuber|first1=Jürgen|author-link1=Jürgen Schmidhuber|title= Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets|publisher=Springer|date=1993 |pages=460–463 |book-title=ICANN 1993}}</ref> and experimentally applied it to problems of variable binding.<ref name="transform1992"/> Here a slow [[feedforward neural network]] learns by [[gradient descent]] to control the fast weights of another neural network through [[outer product]]s of self-generated activation patterns called "FROM" and "TO", which in Transformer terminology are called "key" and "value" for "[[Attention (machine learning)|self-attention]]."<ref name="schlag2021">{{Cite conference | last1=Schlag|first1=Imanol| last2=Irie|first2=Kazuki| last3=Schmidhuber|first3=Jürgen|author-link1=Juergen Schmidhuber|title= Linear Transformers Are Secretly Fast Weight Programmers|publisher=Springer|date=2021 |pages=9355–9366 |book-title= ICML 2021}}</ref> This fast weight "attention mapping" is applied to queries. The 2017 Transformer<ref name="vaswani2017"/> combines this with a [[softmax]] operator and a projection matrix.<ref name=DLhistory/>
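
A minimal NumPy sketch of this correspondence, with illustrative dimensions: outer products of "TO" (value) and "FROM" (key) patterns accumulate into a fast weight matrix, which is then applied to a query; normalizing the key-query scores with a softmax gives the 2017 form.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d = 4
keys = rng.normal(size=(3, d))      # "FROM" patterns
values = rng.normal(size=(3, d))    # "TO" patterns
query = rng.normal(size=d)

# Fast-weight programming: accumulate outer products value_i * key_i^T ...
fast_weights = np.zeros((d, d))
for k, v in zip(keys, values):
    fast_weights += np.outer(v, k)
out_fast = fast_weights @ query     # ... and apply the resulting mapping to the query.

# Equivalent linearized-attention form: unnormalized key-query scores weight the values.
scores = keys @ query
out_linear = values.T @ scores
print(np.allclose(out_fast, out_linear))  # True

# The 2017 Transformer instead normalizes the scores with a softmax.
weights = np.exp(scores) / np.exp(scores).sum()
out_softmax = values.T @ weights
</syntaxhighlight>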


The idea of encoder-decoder sequence transduction had been developed in the early 2010s. The papers most commonly cited as the origin of seq2seq are two papers from 2014.<ref name=":2">{{Cite journal |last1=Cho |first1=Kyunghyun |last2=van Merrienboer |first2=Bart |last3=Gulcehre |first3=Caglar |last4=Bahdanau |first4=Dzmitry |last5=Bougares |first5=Fethi |last6=Schwenk |first6=Holger |last7=Bengio |first7=Yoshua |date=2014-06-03 |title=Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation |arxiv=1406.1078 }}</ref><ref name="sequence">{{cite arXiv |eprint=1409.3215 |class=cs.CL |first1=Ilya |last1=Sutskever |first2=Oriol |last2=Vinyals |title=Sequence to sequence learning with neural networks |date=14 Dec 2014 |last3=Le |first3=Quoc Viet}}</ref> A [[seq2seq]] architecture employs two RNNs, typically LSTMs, an "encoder" and a "decoder", for sequence transduction, such as machine translation. They became state of the art in machine translation and were instrumental in the development of the [[Attention (machine learning)|attention mechanism]] and the [[Transformer (deep learning architecture)|Transformer]].
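
A schematic NumPy sketch of the encoder-decoder idea, using toy dimensions and plain, untrained recurrent cells; production seq2seq systems used trained LSTMs:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_out = 5, 8, 5
W_enc, U_enc = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h))
W_dec, U_dec = rng.normal(size=(d_h, d_out)), rng.normal(size=(d_h, d_h))
V_out = rng.normal(size=(d_out, d_h))

def encode(source_tokens):
    # The encoder compresses the whole input sequence into its final hidden state.
    h = np.zeros(d_h)
    for x in source_tokens:
        h = np.tanh(W_enc @ x + U_enc @ h)
    return h

def decode(h, steps):
    # The decoder unrolls from that state, feeding back its own previous output.
    y, outputs = np.zeros(d_out), []
    for _ in range(steps):
        h = np.tanh(W_dec @ y + U_dec @ h)
        y = V_out @ h
        outputs.append(y)
    return outputs

source = [np.eye(d_in)[t] for t in (0, 3, 1)]   # three one-hot "tokens"
print(len(decode(encode(source), steps=4)))     # 4 output vectors
</syntaxhighlight>
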
An image captioning model was proposed in 2015, citing inspiration from the seq2seq model; it would encode an input image into a fixed-length vector.<ref>{{Cite journal |last1=Vinyals |first1=Oriol |last2=Toshev |first2=Alexander |last3=Bengio |first3=Samy |last4=Erhan |first4=Dumitru |date=2015 |title=Show and Tell: A Neural Image Caption Generator |url=https://www.cv-foundation.org/openaccess/content_cvpr_2015/html/Vinyals_Show_and_Tell_2015_CVPR_paper.html |pages=3156–3164|arxiv=1411.4555 }}</ref> Xu et al. (2015),<ref>{{Cite journal |last1=Xu |first1=Kelvin |last2=Ba |first2=Jimmy |last3=Kiros |first3=Ryan |last4=Cho |first4=Kyunghyun |last5=Courville |first5=Aaron |last6=Salakhudinov |first6=Ruslan |last7=Zemel |first7=Rich |last8=Bengio |first8=Yoshua |date=2015-06-01 |title=Show, Attend and Tell: Neural Image Caption Generation with Visual Attention |url=https://proceedings.mlr.press/v37/xuc15.html |journal=Proceedings of the 32nd International Conference on Machine Learning |language=en |publisher=PMLR |pages=2048–2057|arxiv=1502.03044 }}</ref> citing Bahdanau et al. (2014),<ref name=":23">{{cite arXiv |last1=Bahdanau |first1=Dzmitry |title=Neural Machine Translation by Jointly Learning to Align and Translate |date=2016-05-19 |last2=Cho |first2=Kyunghyun |last3=Bengio |first3=Yoshua|class=cs.CL |eprint=1409.0473 }}</ref> applied the attention mechanism as used in the seq2seq model to image captioning.


=== Transformer ===
One problem with seq2seq models was their use of recurrent neural networks, which are not parallelizable, as both the encoder and the decoder process the sequence token-by-token; this prevented them from being accelerated on GPUs. In 2016, ''decomposable attention'' attempted to solve this problem by applying the attention mechanism to a [[Feedforward neural network|feedforward network]], which is easy to parallelize: the input sequence is processed in parallel before a "soft alignment matrix" is computed ("alignment" is the terminology used by Bahdanau et al. 2014<ref name=":23" />).<ref>{{cite arXiv |eprint=1606.01933 |class=cs.CL |first1=Ankur P. |last1=Parikh |first2=Oscar |last2=Täckström |title=A Decomposable Attention Model for Natural Language Inference |date=2016-09-25 |last3=Das |first3=Dipanjan |last4=Uszkoreit |first4=Jakob}}</ref> One of its authors, Jakob Uszkoreit, suspected that attention ''without'' recurrence is sufficient for language translation, thus the title "attention is ''all'' you need".<ref>{{Cite magazine |last=Levy |first=Steven |title=8 Google Employees Invented Modern AI. Here's the Inside Story |url=https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/ |url-status=live |archive-url=https://web.archive.org/web/20240320101528/https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/ |archive-date=20 March 2024 |access-date=2024-08-06 |magazine=Wired |language=en-US |issn=1059-1028}}</ref>


The idea of using the attention mechanism for self-attention over a single sequence, instead of in an encoder-decoder (cross-attention), was also proposed during this period, such as in [[Differentiable neural computer|differentiable neural computers]] and [[Neural Turing machine|neural Turing machines]].<ref>{{cite arXiv |last1=Graves |first1=Alex |title=Neural Turing Machines |date=2014-12-10 |last2=Wayne |first2=Greg |last3=Danihelka |first3=Ivo|class=cs.NE |eprint=1410.5401 }}</ref> It was termed ''intra-attention''<ref name="parikh">{{cite arXiv |last1=Cheng |first1=Jianpeng |title=Long Short-Term Memory-Networks for Machine Reading |date=2016-09-20 |eprint=1601.06733 |last2=Dong |first2=Li |last3=Lapata |first3=Mirella|class=cs.CL }}</ref> where an LSTM is augmented with a memory network as it encodes an input sequence.
[[Geoffrey Hinton]] et al. (2006) proposed learning a high-level [[Knowledge representation|internal representation]] using successive layers of binary or real-valued [[latent variable]]s with a [[restricted Boltzmann machine]]<ref name="smolensky1986">{{cite book|title=Parallel Distributed Processing: Explorations in the Microstructure of Cognition|last1=Smolensky|first1=P.|year=1986|isbn=9780262680530|editor=D. E. Rumelhart|volume=1|pages=[https://archive.org/details/paralleldistribu00rume/page/194 194–281]|chapter=Information processing in dynamical systems: Foundations of harmony theory.|author-link1=Paul Smolensky|editor2=J. L. McClelland|editor3=PDP Research Group|chapter-url=http://portal.acm.org/citation.cfm?id=104290|url=https://archive.org/details/paralleldistribu00rume/page/194}}</ref> to model each layer. This RBM is a [[generative model|generative]] [[stochastic neural network|stochastic]] [[feedforward neural network]] that can learn a [[probability distribution]] over its set of inputs. Once sufficiently many layers have been learned, the deep architecture may be used as a [[generative model]] by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations.<ref name="hinton2006">{{cite journal|last1=Hinton|first1=G. E.|last2=Osindero|first2=S.|last3=Teh|first3=Y.|year=2006|title=A fast learning algorithm for deep belief nets|url=http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf|journal=[[Neural Computation (journal)|Neural Computation]]|volume=18|issue=7|pages=1527–1554|citeseerx=10.1.1.76.1541|doi=10.1162/neco.2006.18.7.1527|pmid=16764513|s2cid=2309950|author-link1=Geoffrey Hinton}}</ref><ref name="hinton2009">{{Cite journal|last=Hinton|first=Geoffrey|date=2009-05-31|title=Deep belief networks|journal=Scholarpedia|volume=4|issue=5|pages=5947|bibcode=2009SchpJ...4.5947H|doi=10.4249/scholarpedia.5947 |doi-access=free |issn=1941-6016}}</ref> In 2012, [[Andrew Ng]] and [[Jeff Dean (computer scientist)|Jeff Dean]] created an FNN that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from [[YouTube]] videos.<ref name="ng2012">{{cite arXiv|eprint=1112.6209|class=cs.LG|first1=Andrew|last1=Ng|first2=Jeff|last2=Dean|title=Building High-level Features Using Large Scale Unsupervised Learning|year=2012}}</ref>


These strands of development were combined in the Transformer architecture, published in ''[[Attention Is All You Need]]'' (2017). Subsequently, attention mechanisms were extended within the framework of Transformer architecture.
== The vanishing gradient problem and its solutions==


Seq2seq models with attention still suffered from the same issue with recurrent networks, which is that they are hard to parallelize, which prevented them to be accelerated on GPUs. In 2016, ''decomposable attention'' applied attention mechanism to the [[Feedforward neural network|feedforward network]], which are easy to parallelize.<ref>{{cite arXiv |eprint=1606.01933 |class=cs.CL |first1=Ankur P. |last1=Parikh |first2=Oscar |last2=Täckström |title=A Decomposable Attention Model for Natural Language Inference |date=2016-09-25 |last3=Das |first3=Dipanjan |last4=Uszkoreit |first4=Jakob}}</ref> One of its authors, Jakob Uszkoreit, suspected that attention ''without'' recurrence is sufficient for language translation, thus the title "attention is ''all'' you need".<ref>{{Cite magazine |last=Levy |first=Steven |title=8 Google Employees Invented Modern AI. Here's the Inside Story |url=https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/ |url-status=live |archive-url=https://web.archive.org/web/20240320101528/https://www.wired.com/story/eight-google-employees-invented-modern-ai-transformers-paper/ |archive-date=20 March 2024 |access-date=2024-08-06 |magazine=Wired |language=en-US |issn=1059-1028}}</ref>
{{main|Long short-term memory}}


In 2017, the original (100M-sized) encoder-decoder transformer model was proposed in the "[[Attention is all you need]]" paper. At the time, the focus of the research was on improving [[seq2seq]] for [[machine translation]], by removing its recurrence to processes all tokens in parallel, but preserving its dot-product attention mechanism to keep its text processing performance.<ref name="2017_Attention_Is_All_You_Need2">{{cite journal |last1=Vaswani |first1=Ashish |author1-link=Ashish Vaswani |last2=Shazeer |first2=Noam |last3=Parmar |first3=Niki |last4=Uszkoreit |first4=Jakob |last5=Jones |first5=Llion |last6=Gomez |first6=Aidan N |author6-link=Aidan Gomez |last7=Kaiser |first7=Łukasz |last8=Polosukhin |first8=Illia |date=2017 |title=Attention is All you Need |url=https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf |journal=Advances in Neural Information Processing Systems |publisher=Curran Associates, Inc. |volume=30}}</ref> Its parallelizability was an important factor to its widespread use in large neural networks.<ref>{{cite arXiv |last1=Peng |first1=Bo |title=RWKV: Reinventing RNNs for the Transformer Era |date=2023-12-10 |eprint=2305.13048 |last2=Alcaide |first2=Eric |last3=Anthony |first3=Quentin |last4=Albalak |first4=Alon |last5=Arcadinho |first5=Samuel |last6=Biderman |first6=Stella |last7=Cao |first7=Huanqi |last8=Cheng |first8=Xin |last9=Chung |first9=Michael|class=cs.CL }}</ref>
[[Sepp Hochreiter]]'s diploma thesis (1991)<ref name="HOCH1991">S. Hochreiter., "[http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf Untersuchungen zu dynamischen neuronalen Netzen] {{Webarchive|url=https://web.archive.org/web/20150306075401/http://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdvisorSchmidhuber.pdf |date=2015-03-06 }}," ''Diploma thesis. Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber'', 1991.</ref> was called "one of the most important documents in the history of machine learning" by his supervisor [[Juergen Schmidhuber]].<ref name=DLhistory/> Hochreiter not only tested the neural history compressor,<ref name="schmidhuber1992"/> but also identified and analyzed the [[vanishing gradient problem]].<ref name="HOCH1991"/><ref name="HOCH2001">{{cite book|chapter-url={{google books |plainurl=y |id=NWOcMVA64aAC}}|title=A Field Guide to Dynamical Recurrent Networks|last=Hochreiter|first=S.|display-authors=etal|date=15 January 2001|publisher=John Wiley & Sons|isbn=978-0-7803-5369-5|chapter=Gradient flow in recurrent nets: the difficulty of learning long-term dependencies|editor-last2=Kremer|editor-first2=Stefan C.|editor-first1=John F.|editor-last1=Kolen}}</ref> He proposed recurrent [[Residual neural network|residual]] connections to solve this problem. This led to the deep learning method called [[long short-term memory]] (LSTM), published in 1997.<ref name=":0">{{Cite journal|last1=Hochreiter|first1=Sepp|last2=Schmidhuber|first2=Jürgen|s2cid=1915014|date=1 November 1997|title=Long Short-Term Memory|journal=Neural Computation|volume=9|issue=8|pages=1735–1780|doi=10.1162/neco.1997.9.8.1735|issn=0899-7667|pmid=9377276}}</ref> LSTM [[recurrent neural network]]s can learn "very deep learning" tasks<ref name="SCHIDHUB" /> with long credit assignment paths that require memories of events that happened thousands of discrete time steps before. The "vanilla LSTM" with forget gate was introduced in 1999 by [[Felix Gers]], [[Juergen Schmidhuber|Schmidhuber]] and Fred Cummins.<ref name="lstm1999">{{Cite book |doi = 10.1049/cp:19991218|chapter = Learning to forget: Continual prediction with LSTM|title = 9th International Conference on Artificial Neural Networks: ICANN '99|volume = 1999|pages = 850–855|year = 1999|last1 = Gers|first1 = Felix| last2 = Schmidhuber|first2 = Jürgen| last3 = Cummins|first3 = Fred| isbn = 0-85296-721-7}}</ref> [[LSTM]] has become the most cited neural network of the 20th century.<ref name=DLhistory/>


== Unsupervised and self-supervised learning ==
In 2015, Rupesh Kumar Srivastava, Klaus Greff, and Schmidhuber used [[LSTM]] principles to create the [[Highway network]], a [[feedforward neural network]] with hundreds of layers, much deeper than previous networks.<ref name="highway2015">{{cite arXiv|last1=Srivastava|first1=Rupesh Kumar|last2=Greff|first2=Klaus|last3=Schmidhuber|first3=Jürgen|title=Highway Networks|eprint=1505.00387|date=2 May 2015|class=cs.LG}}</ref><ref name="highway2015neurips">{{cite journal|last1=Srivastava|first1=Rupesh K|last2=Greff|first2=Klaus|last3=Schmidhuber|first3=Juergen|title=Training Very Deep Networks|journal=Advances in Neural Information Processing Systems |date=2015|volume=28|pages=2377–2385|url=http://papers.nips.cc/paper/5850-training-very-deep-networks|publisher=Curran Associates, Inc.}}</ref> 7 months later, Kaiming He, Xiangyu Zhang; Shaoqing Ren, and Jian Sun won the ImageNet 2015 competition with an open-gated or gateless [[Highway network]] variant called [[Residual neural network]].<ref name="resnet2015">{{Cite conference|last1=He|first1=Kaiming|last2=Zhang|first2=Xiangyu|last3=Ren|first3=Shaoqing|last4=Sun|first4=Jian|date=2016|title=Deep Residual Learning for Image Recognition|url=https://ieeexplore.ieee.org/document/7780459|journal=2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)|location=Las Vegas, NV, USA|publisher=IEEE|pages=770–778|arxiv=1512.03385|doi=10.1109/CVPR.2016.90|isbn=978-1-4673-8851-1}}</ref> This has become the most cited neural network of the 21st century.<ref name=DLhistory />


=== Self-organizing maps ===
In 2011, Xavier Glorot, Antoine Bordes and [[Yoshua Bengio]] found that the [[rectifier (neural networks)|ReLU]]<ref name="Fukushima1969"/> of [[Kunihiko Fukushima]] also helps to overcome the vanishing gradient problem,<ref name="glorot2011">{{cite conference |author1=Xavier Glorot |author2=Antoine Bordes |author3=[[Yoshua Bengio]] |year=2011 |title=Deep sparse rectifier neural networks |url=http://jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf |conference=AISTATS |quote=Rectifier and softplus activation functions. The second one is a smooth version of the first. |access-date=2023-04-14 |archive-date=2016-12-13 |archive-url=https://web.archive.org/web/20161213022121/http://jmlr.org/proceedings/papers/v15/glorot11a/glorot11a.pdf |url-status=dead }}</ref> compared to widely used activation functions prior to 2011.
{{main|Self-organizing map}}


[[Self-organizing map]]s (SOMs) were described by [[Teuvo Kohonen]] in 1982.<ref name="KohonenMap">{{cite journal |last1=Kohonen |first1=Teuvo |last2=Honkela |first2=Timo |year=2007 |title=Kohonen Network |journal=Scholarpedia |volume=2 |issue=1 |pages=1568 |bibcode=2007SchpJ...2.1568K |doi=10.4249/scholarpedia.1568 |doi-access=free}}</ref><ref>{{cite journal |last=Kohonen |first=Teuvo |year=1982 |title=Self-Organized Formation of Topologically Correct Feature Maps |journal=Biological Cybernetics |volume=43 |pages=59–69 |doi=10.1007/bf00337288 |s2cid=206775459 |number=1}}</ref> SOMs are neurophysiologically inspired<ref>{{cite journal |last1=Von der Malsburg |first1=C |year=1973 |title=Self-organization of orientation sensitive cells in the striate cortex |journal=Kybernetik |volume=14 |issue=2 |pages=85–100 |doi=10.1007/bf00288907 |pmid=4786750 |s2cid=3351573}}</ref> [[artificial neural network]]s that learn [[dimensionality reduction|low-dimensional]] representations of high-dimensional data while preserving the [[topology|topological structure]] of the data. They are trained using [[competitive learning]].
== Hardware-based designs ==
The development of [[metal–oxide–semiconductor]] (MOS) [[very-large-scale integration]] (VLSI), combining millions or [[List of Nvidia graphics processing units#Volta series|billions]] of [[MOS transistor]]s onto a single chip in the form of [[complementary MOS]] (CMOS) technology, enabled the development of practical artificial neural networks in the 1980s.<ref name="Mead">{{cite book|url=http://fennetic.net/irc/Christopher%20R.%20Carroll%20Carver%20Mead%20Mohammed%20Ismail%20Analog%20VLSI%20Implementation%20of%20Neural%20Systems.pdf|title=Analog VLSI Implementation of Neural Systems|date=8 May 1989|publisher=[[Kluwer Academic Publishers]]|isbn=978-1-4613-1639-8|last1=Mead|first1=Carver A.|author1-link=Carver Mead|last2=Ismail|first2=Mohammed|series=The Kluwer International Series in Engineering and Computer Science|volume=80|location=Norwell, MA|doi=10.1007/978-1-4613-1639-8}}</ref>


SOMs create internal representations reminiscent of the [[cortical homunculus]], a distorted representation of the [[human body]], based on a neurological "map" of the areas and proportions of the [[human brain]] dedicated to processing [[Sensory processing|sensory function]]s, for different parts of the body.
Computational devices were created in [[CMOS]], for both biophysical simulation and [[neuromorphic computing]] inspired by the structure and function of the human brain. [[Nanodevice]]s<ref>{{cite journal|last1=Yang|first1=J. J.|last2=Pickett|first2=M. D.|last3=Li|first3=X. M.|last4=Ohlberg|first4=D. A. A.|last5=Stewart|first5=D. R.|last6=Williams|first6=R. S.|year=2008|title=Memristive switching mechanism for metal/oxide/metal nanodevices|journal=Nat. Nanotechnol.|volume=3|issue=7|pages=429–433|doi=10.1038/nnano.2008.160|pmid=18654568}}</ref> for very large scale [[principal component]]s analyses and [[convolution]] may create a new class of neural computing because they are fundamentally [[Analog signal|analog]] rather than [[Digital data|digital]] (even though the first implementations may use digital devices).<ref>{{cite journal|last1=Strukov|first1=D. B.|last2=Snider|first2=G. S.|last3=Stewart|first3=D. R.|last4=Williams|first4=R. S.|year=2008|title=The missing memristor found|journal=Nature|volume=453|issue=7191|pages=80–83|bibcode=2008Natur.453...80S|doi=10.1038/nature06932|pmid=18451858|s2cid=4367148}}</ref> Ciresan and colleagues (2010)<ref name=":3">{{Cite journal|last1=Cireşan|first1=Dan Claudiu|last2=Meier|first2=Ueli|last3=Gambardella|first3=Luca Maria|last4=Schmidhuber|first4=Jürgen|date=2010-09-21|title=Deep, Big, Simple Neural Nets for Handwritten Digit Recognition|journal=Neural Computation|volume=22|issue=12|pages=3207–3220|arxiv=1003.0358|doi=10.1162/neco_a_00052|issn=0899-7667|pmid=20858131|s2cid=1918673}}</ref> showed that despite the vanishing gradient problem, GPUs make backpropagation feasible for many-layered feedforward neural networks.


== Contests ==
=== Boltzmann machines ===
During 1985–1995, inspired by statistical mechanics, several architectures and methods were developed by [[Terry Sejnowski]], [[Peter Dayan]], [[Geoffrey Hinton]], etc., including the [[Boltzmann machine]],<ref>{{Cite journal |last1=Ackley |first1=David H. |last2=Hinton |first2=Geoffrey E. |last3=Sejnowski |first3=Terrence J. |date=1985-01-01 |title=A learning algorithm for boltzmann machines |url=https://www.sciencedirect.com/science/article/pii/S0364021385800124 |journal=Cognitive Science |volume=9 |issue=1 |pages=147–169 |doi=10.1016/S0364-0213(85)80012-4 |issn=0364-0213}}</ref> [[restricted Boltzmann machine]],<ref>{{cite book |last=Smolensky |first=Paul |title=Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations |title-link=Connectionism |publisher=MIT Press |year=1986 |isbn=0-262-68053-X |editor1-last=Rumelhart |editor1-first=David E. |pages=[https://archive.org/details/paralleldistribu00rume/page/194 194–281] |chapter=Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory |editor2-last=McLelland |editor2-first=James L. |chapter-url=https://stanford.edu/~jlmcc/papers/PDP/Volume%201/Chap6_PDP86.pdf}}</ref> [[Helmholtz machine]],<ref name="“nc95“">{{Cite journal |last1=Peter |first1=Dayan |author-link1=Peter Dayan |last2=Hinton |first2=Geoffrey E. |author-link2=Geoffrey Hinton |last3=Neal |first3=Radford M. |author-link3=Radford M. Neal |last4=Zemel |first4=Richard S. |author-link4=Richard Zemel |date=1995 |title=The Helmholtz machine. |journal=Neural Computation |volume=7 |issue=5 |pages=889–904 |doi=10.1162/neco.1995.7.5.889 |pmid=7584891 |s2cid=1890561 |hdl-access=free |hdl=21.11116/0000-0002-D6D3-E}} {{closed access}}</ref> and the [[wake-sleep algorithm]].<ref name=":13">{{Cite journal |last1=Hinton |first1=Geoffrey E. |author-link=Geoffrey Hinton |last2=Dayan |first2=Peter |author-link2=Peter Dayan |last3=Frey |first3=Brendan J. |author-link3=Brendan Frey |last4=Neal |first4=Radford |date=1995-05-26 |title=The wake-sleep algorithm for unsupervised neural networks |journal=Science |volume=268 |issue=5214 |pages=1158–1161 |bibcode=1995Sci...268.1158H |doi=10.1126/science.7761831 |pmid=7761831 |s2cid=871473}}</ref> These were designed for unsupervised learning of deep generative models. However, those were more computationally expensive compared to backpropagation. Boltzmann machine learning algorithm, published in 1985, was briefly popular before being eclipsed by the backpropagation algorithm in 1986. (p.&nbsp;112 <ref>{{Cite book |last=Sejnowski |first=Terrence J. |title=The deep learning revolution |date=2018 |publisher=The MIT Press |isbn=978-0-262-03803-4 |location=Cambridge, Massachusetts}}</ref>).
Between 2009 and 2012, [[recurrent neural network]]s and deep feedforward neural networks developed in [[Jürgen Schmidhuber|Schmidhuber]]'s research group won eight international competitions in [[pattern recognition]] and [[machine learning]].<ref>[http://www.kurzweilai.net/how-bio-inspired-deep-learning-keeps-winning-competitions 2012 Kurzweil AI Interview] {{Webarchive|url=https://web.archive.org/web/20180831075249/http://www.kurzweilai.net/how-bio-inspired-deep-learning-keeps-winning-competitions |date=2018-08-31 }} with [[Jürgen Schmidhuber]] on the eight competitions won by his Deep Learning team 2009–2012</ref><ref>{{Cite web|url=http://www.kurzweilai.net/how-bio-inspired-deep-learning-keeps-winning-competitions|title=How bio-inspired deep learning keeps winning competitions {{!}} KurzweilAI|website=www.kurzweilai.net|access-date=2017-06-16|archive-url=https://web.archive.org/web/20180831075249/http://www.kurzweilai.net/how-bio-inspired-deep-learning-keeps-winning-competitions|archive-date=2018-08-31|url-status=dead}}</ref> For example, the bi-directional and multi-dimensional [[long short-term memory]] (LSTM)<ref>Graves, Alex; and Schmidhuber, Jürgen; ''[http://www.idsia.ch/~juergen/nips2009.pdf Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks]'', in ''Advances in Neural Information Processing Systems 22 (NIPS'22), 7–10 December 2009, Vancouver, BC'', Neural Information Processing Systems (NIPS) Foundation, 2009, pp. 545–552.</ref><ref name="graves 855">{{cite journal|last1=Graves|first1=A.|last2=Liwicki|first2=M.|last3=Fernandez|first3=S.|last4=Bertolami|first4=R.|last5=Bunke|first5=H.|last6=Schmidhuber|first6=J.|year=2009|title=A Novel Connectionist System for Improved Unconstrained Handwriting Recognition|url=http://www.idsia.ch/~juergen/tpami_2008.pdf|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|volume=31|issue=5|pages=855–868|citeseerx=10.1.1.139.4502|doi=10.1109/tpami.2008.137|pmid=19299860|s2cid=14635907}}</ref><ref name="graves20093">{{Cite journal|last1=Graves|first1=Alex|last2=Schmidhuber|first2=Jürgen|date=2009|editor-last=Bengio|editor-first=Yoshua|title=Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks|url=https://papers.nips.cc/paper/3449-offline-handwriting-recognition-with-multidimensional-recurrent-neural-networks|journal=Neural Information Processing Systems (NIPS) Foundation|volume=21 |publisher=Curran Associates, Inc|pages=545–552|editor-last2=Schuurmans|editor-first2=Dale|editor-last3=Lafferty|editor-first3=John|editor-last4=Williams|editor-first4=Chris|editor-last5=Culotta|editor-first5=Aron}}</ref><ref>{{Cite journal|last1=Graves|first1=A.|last2=Liwicki|first2=M.|last3=Fernández|first3=S.|last4=Bertolami|first4=R.|last5=Bunke|first5=H.|last6=Schmidhuber|first6=J.|date=May 2009|title=A Novel Connectionist System for Unconstrained Handwriting Recognition|journal=IEEE Transactions on Pattern Analysis and Machine Intelligence|volume=31|issue=5|pages=855–868|citeseerx=10.1.1.139.4502|doi=10.1109/tpami.2008.137|issn=0162-8828|pmid=19299860|s2cid=14635907}}</ref> of [[Alex Graves (computer scientist)|Graves]] et al. won three competitions in connected handwriting recognition at the 2009 [[International Conference on Document Analysis and Recognition]] (ICDAR), without any prior knowledge about the three languages to be learned.<ref name="graves20093" /><ref name="graves 855" />


[[Geoffrey Hinton]] et al. (2006) proposed learning a high-level [[Knowledge representation|internal representation]] using successive layers of binary or real-valued [[latent variable]]s with a [[restricted Boltzmann machine]]<ref name="smolensky1986">{{cite book |last1=Smolensky |first1=P. |author-link1=Paul Smolensky |url=https://archive.org/details/paralleldistribu00rume/page/194 |title=Parallel Distributed Processing: Explorations in the Microstructure of Cognition |year=1986 |isbn=9780262680530 |editor=D. E. Rumelhart |volume=1 |pages=[https://archive.org/details/paralleldistribu00rume/page/194 194–281] |chapter=Information processing in dynamical systems: Foundations of harmony theory. |publisher=MIT Press |editor2=J. L. McClelland |editor3=PDP Research Group |chapter-url=http://portal.acm.org/citation.cfm?id=104290}}</ref> to model each layer. This RBM is a [[generative model|generative]] [[stochastic neural network|stochastic]] [[feedforward neural network]] that can learn a [[probability distribution]] over its set of inputs. Once sufficiently many layers have been learned, the deep architecture may be used as a [[generative model]] by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations.<ref name="hinton2006">{{cite journal |last1=Hinton |first1=G. E. |author-link1=Geoffrey Hinton |last2=Osindero |first2=S. |last3=Teh |first3=Y. |year=2006 |title=A fast learning algorithm for deep belief nets |url=http://www.cs.toronto.edu/~hinton/absps/fastnc.pdf |journal=[[Neural Computation (journal)|Neural Computation]] |volume=18 |issue=7 |pages=1527–1554 |citeseerx=10.1.1.76.1541 |doi=10.1162/neco.2006.18.7.1527 |pmid=16764513 |s2cid=2309950}}</ref><ref name="hinton2009">{{Cite journal |last=Hinton |first=Geoffrey |date=2009-05-31 |title=Deep belief networks |journal=Scholarpedia |volume=4 |issue=5 |pages=5947 |bibcode=2009SchpJ...4.5947H |doi=10.4249/scholarpedia.5947 |issn=1941-6016 |doi-access=free}}</ref>
Ciresan and colleagues won [[pattern recognition]] contests, including the IJCNN 2011 Traffic Sign Recognition Competition,<ref name=":72">{{Cite journal|last1=Cireşan|first1=Dan|last2=Meier|first2=Ueli|last3=Masci|first3=Jonathan|last4=Schmidhuber|first4=Jürgen|date=August 2012|title=Multi-column deep neural network for traffic sign classification|journal=Neural Networks|series=Selected Papers from IJCNN 2011|volume=32|pages=333–338|citeseerx=10.1.1.226.8219|doi=10.1016/j.neunet.2012.02.023|pmid=22386783}}</ref> the ISBI 2012 Segmentation of Neuronal Structures in Electron Microscopy Stacks challenge<ref name=":8">{{Cite book|url=http://papers.nips.cc/paper/4741-deep-neural-networks-segment-neuronal-membranes-in-electron-microscopy-images.pdf|title=Advances in Neural Information Processing Systems 25|last1=Ciresan|first1=Dan|last2=Giusti|first2=Alessandro|last3=Gambardella|first3=Luca M.|last4=Schmidhuber|first4=Juergen|date=2012|publisher=Curran Associates, Inc.|editor-last=Pereira|editor-first=F.|pages=2843–2851|editor-last2=Burges|editor-first2=C. J. C.|editor-last3=Bottou|editor-first3=L.|editor-last4=Weinberger|editor-first4=K. Q.}}</ref> and others. Their neural networks were the first pattern recognizers to achieve human-competitive/superhuman performance<ref name=":92">{{Cite book|last1=Ciresan|first1=Dan|last2=Meier|first2=U.|last3=Schmidhuber|first3=J.|title=2012 IEEE Conference on Computer Vision and Pattern Recognition |chapter=Multi-column deep neural networks for image classification |date=June 2012|isbn=978-1-4673-1228-8|pages=3642–3649|arxiv=1202.2745|bibcode=2012arXiv1202.2745C|citeseerx=10.1.1.300.3283|doi=10.1109/cvpr.2012.6248110|s2cid=2161592}}</ref> on benchmarks such as traffic sign recognition (IJCNN 2012), or the [[MNIST database|MNIST handwritten digits problem]].


=== Deep learning ===
Researchers demonstrated (2010) that deep neural networks interfaced to a [[hidden Markov model]] with context-dependent states that define the neural network output layer can drastically reduce errors in large-vocabulary [[speech recognition]] tasks such as voice search.{{Citation needed|date=August 2019}}
In 2012, [[Andrew Ng]] and [[Jeff Dean (computer scientist)|Jeff Dean]] created an FNN that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from [[YouTube]] videos.<ref name="ng2012">{{cite arXiv |eprint=1112.6209 |class=cs.LG |first1=Andrew |last1=Ng |first2=Jeff |last2=Dean |title=Building High-level Features Using Large Scale Unsupervised Learning |year=2012}}</ref>


== Other aspects ==
GPU-based implementations<ref name=":6">{{Cite journal|last1=Ciresan|first1=D. C.|last2=Meier|first2=U.|last3=Masci|first3=J.|last4=Gambardella|first4=L. M.|last5=Schmidhuber|first5=J.|date=2011|title=Flexible, High Performance Convolutional Neural Networks for Image Classification|url=http://ijcai.org/papers11/Papers/IJCAI11-210.pdf|journal=International Joint Conference on Artificial Intelligence|doi=10.5591/978-1-57735-516-8/ijcai11-210}}</ref> of this approach won many pattern recognition contests, including the IJCNN 2011 Traffic Sign Recognition Competition,<ref name=":72" /> the ISBI 2012 Segmentation of neuronal structures in EM stacks challenge,<ref name=":8" /> the [[ImageNet Competition]]<ref name="krizhevsky2012">{{cite journal|last1=Krizhevsky|first1=Alex|last2=Sutskever|first2=Ilya|last3=Hinton|first3=Geoffry|date=2012|title=ImageNet Classification with Deep Convolutional Neural Networks|url=https://www.cs.toronto.edu/~kriz/imagenet_classification_with_deep_convolutional.pdf|journal=NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada}}</ref> and others.

=== Knowledge distillation ===
[[Knowledge distillation]] or model distillation is the process of transferring knowledge from a large [[Statistical model|model]] to a smaller one. The idea of using the output of one neural network to train another neural network was studied as the teacher-student network configuration.<ref>{{Cite journal |last1=Watkin |first1=Timothy L. H. |last2=Rau |first2=Albrecht |last3=Biehl |first3=Michael |date=1993-04-01 |title=The statistical mechanics of learning a rule |url=https://link.aps.org/doi/10.1103/RevModPhys.65.499 |journal=Reviews of Modern Physics |volume=65 |issue=2 |pages=499–556 |doi=10.1103/RevModPhys.65.499|bibcode=1993RvMP...65..499W }}</ref> In 1992, several papers studied the statistical mechanics of teacher-student network configuration, where both networks are committee machines<ref>{{Cite journal |last1=Schwarze |first1=H |last2=Hertz |first2=J |date=1992-10-15 |title=Generalization in a Large Committee Machine |url=https://iopscience.iop.org/article/10.1209/0295-5075/20/4/015 |journal=Europhysics Letters (EPL) |volume=20 |issue=4 |pages=375–380 |doi=10.1209/0295-5075/20/4/015 |bibcode=1992EL.....20..375S |issn=0295-5075}}</ref><ref>{{Cite journal |last1=Mato |first1=G |last2=Parga |first2=N |date=1992-10-07 |title=Generalization properties of multilayered neural networks |url=https://iopscience.iop.org/article/10.1088/0305-4470/25/19/017 |journal=Journal of Physics A: Mathematical and General |volume=25 |issue=19 |pages=5047–5054 |doi=10.1088/0305-4470/25/19/017 |bibcode=1992JPhA...25.5047M |issn=0305-4470}}</ref> or both are parity machines.<ref>{{Cite journal |last1=Hansel |first1=D |last2=Mato |first2=G |last3=Meunier |first3=C |date=1992-11-01 |title=Memorization Without Generalization in a Multilayered Neural Network |url=https://iopscience.iop.org/article/10.1209/0295-5075/20/5/015 |journal=Europhysics Letters (EPL) |volume=20 |issue=5 |pages=471–476 |doi=10.1209/0295-5075/20/5/015 |bibcode=1992EL.....20..471H |issn=0295-5075}}</ref>

Another early example of network distillation was also published in 1992, in the field of [[Recurrent neural network|recurrent neural networks]] (RNNs). The problem was sequence prediction. It was solved by two RNNs. One of them ("atomizer") predicted the sequence, and another ("chunker") predicted the errors of the atomizer. Simultaneously, the atomizer predicted the internal states of the chunker. After the atomizer manages to predict the chunker's internal states well, it would start fixing the errors, and soon the chunker is obsoleted, leaving just one RNN in the end.<ref name="schmidhuber19922">{{cite journal |last1=Schmidhuber |first1=Jürgen |year=1992 |title=Learning complex, extended sequences using the principle of history compression |url=ftp://ftp.idsia.ch/pub/juergen/chunker.pdf |journal=Neural Computation |volume=4 |issue=2 |pages=234–242 |doi=10.1162/neco.1992.4.2.234 |s2cid=18271205 }}{{Dead link|date=August 2024 |bot=InternetArchiveBot |fix-attempted=yes }}</ref>

A related methodology was ''model compression'' or ''pruning'', where a trained network is reduced in size. It was inspired by neurobiological studies showing that the human brain is resistant to damage, and was studied in the 1980s, via methods such as Biased Weight Decay<ref>{{Cite journal |last1=Hanson |first1=Stephen |last2=Pratt |first2=Lorien |date=1988 |title=Comparing Biases for Minimal Network Construction with Back-Propagation |url=https://proceedings.neurips.cc/paper/1988/hash/1c9ac0159c94d8d0cbedc973445af2da-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Morgan-Kaufmann |volume=1}}</ref> and Optimal Brain Damage.<ref>{{Cite journal |last1=LeCun |first1=Yann |last2=Denker |first2=John |last3=Solla |first3=Sara |date=1989 |title=Optimal Brain Damage |url=https://proceedings.neurips.cc/paper/1989/hash/6c9882bbac1c7093bd25041881277658-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Morgan-Kaufmann |volume=2}}</ref>

== Hardware-based designs ==
The development of [[metal–oxide–semiconductor]] (MOS) [[very-large-scale integration]] (VLSI), combining millions or [[List of Nvidia graphics processing units#Volta series|billions]] of [[MOS transistor]]s onto a single chip in the form of [[complementary MOS]] (CMOS) technology, enabled the development of practical artificial neural networks in the 1980s.<ref name="Mead">{{cite book|url=http://fennetic.net/irc/Christopher%20R.%20Carroll%20Carver%20Mead%20Mohammed%20Ismail%20Analog%20VLSI%20Implementation%20of%20Neural%20Systems.pdf|title=Analog VLSI Implementation of Neural Systems|date=8 May 1989|publisher=[[Kluwer Academic Publishers]]|isbn=978-1-4613-1639-8|last1=Mead|first1=Carver A.|author1-link=Carver Mead|last2=Ismail|first2=Mohammed|series=The Kluwer International Series in Engineering and Computer Science|volume=80|location=Norwell, MA|doi=10.1007/978-1-4613-1639-8}}</ref>


Computational devices were created in [[CMOS]], for both biophysical simulation and [[neuromorphic computing]] inspired by the structure and function of the human brain. [[Nanodevice]]s<ref>{{cite journal|last1=Yang|first1=J. J.|last2=Pickett|first2=M. D.|last3=Li|first3=X. M.|last4=Ohlberg|first4=D. A. A.|last5=Stewart|first5=D. R.|last6=Williams|first6=R. S.|year=2008|title=Memristive switching mechanism for metal/oxide/metal nanodevices|journal=Nat. Nanotechnol.|volume=3|issue=7|pages=429–433|doi=10.1038/nnano.2008.160|pmid=18654568}}</ref> for very large scale [[principal component]]s analyses and [[convolution]] may create a new class of neural computing because they are fundamentally [[Analog signal|analog]] rather than [[Digital data|digital]] (even though the first implementations may use digital devices).<ref>{{cite journal|last1=Strukov|first1=D. B.|last2=Snider|first2=G. S.|last3=Stewart|first3=D. R.|last4=Williams|first4=R. S.|year=2008|title=The missing memristor found|journal=Nature|volume=453|issue=7191|pages=80–83|bibcode=2008Natur.453...80S|doi=10.1038/nature06932|pmid=18451858|s2cid=4367148}}</ref>
Deep, highly nonlinear neural architectures similar to the [[neocognitron]]<ref name="K. Fukushima. Neocognitron 1980">{{cite journal|author=Fukushima, K.|year=1980|title=Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position|journal=Biological Cybernetics|volume=36|issue=4|pages=93–202|doi=10.1007/BF00344251|pmid=7370364|s2cid=206775608}}</ref> and the "standard architecture of vision",<ref>{{cite journal|last1=Riesenhuber|first1=M|last2=Poggio|first2=T|year=1999|title=Hierarchical models of object recognition in cortex|journal=Nature Neuroscience|volume=2|issue=11|pages=1019–1025|doi=10.1038/14819|pmid=10526343|s2cid=8920227}}</ref> inspired by [[Simple cell|simple]] and [[complex cell]]s, were pre-trained with unsupervised methods by Hinton.<ref name="hinton2009" /><ref name="hinton2006" /> A team from his lab won a 2012 contest sponsored by [[Merck & Co.|Merck]] to design software to help find molecules that might identify new drugs.<ref>{{cite news|url=https://www.nytimes.com/2012/11/24/science/scientists-see-advances-in-deep-learning-a-part-of-artificial-intelligence.html|title=Scientists See Promise in Deep-Learning Programs|last=Markoff|first=John|date=November 23, 2012|newspaper=New York Times}}</ref>


== Notes ==
== Notes ==

Latest revision as of 05:49, 20 November 2024

Artificial neural networks (ANNs) are models created using machine learning to perform a number of tasks. Their creation was inspired by biological neural circuitry.[1][a] While some of the computational implementations ANNs relate to earlier discoveries in mathematics, the first implementation of ANNs was by psychologist Frank Rosenblatt, who developed the perceptron.[1] Little research was conducted on ANNs in the 1970s and 1980s, with the AAAI calling this period an "AI winter".[2]

Later, advances in hardware and the development of the backpropagation algorithm, as well as recurrent neural networks and convolutional neural networks, renewed interest in ANNs. The 2010s saw the development of a deep neural network (i.e., one with many layers) called AlexNet.[3] It greatly outperformed other image recognition models, and is thought to have launched the ongoing AI spring, and further increasing interest in deep learning.[4] The transformer architecture was first described in 2017 as a method to teach ANNs grammatical dependencies in language,[5] and is the predominant architecture used by large language models such as GPT-4. Diffusion models were first described in 2015, and became the basis of image generation models such as DALL-E in the 2020s.[citation needed]

Perceptrons and other early neural networks

[edit]

The simplest feedforward network consists of a single weight layer without activation functions. It would be just a linear map, and training it would be linear regression. Linear regression by least squares method was used by Adrien-Marie Legendre (1805) and Carl Friedrich Gauss (1795) for the prediction of planetary movement.[6][7][8][9]

A logical calculus of the ideas immanent in nervous activity (Warren McCulloch and Walter Pitts, 1943) studied several abstract models for neural networks using symbolic logic of Rudolf Carnap and Principia Mathematica. The paper argued that several abstract models of neural networks (some learning, some not learning) have the same computational power as Turing machines.[10] This model paved the way for research to split into two approaches. One approach focused on biological processes while the other focused on the application of neural networks to artificial intelligence. This work led to work on nerve networks and their link to finite automata.[11]

In the early 1940s, D. O. Hebb[12] created a learning hypothesis based on the mechanism of neural plasticity that became known as Hebbian learning. Hebbian learning is unsupervised learning. This evolved into models for long-term potentiation. Researchers started applying these ideas to computational models in 1948 with Turing's B-type machines. B. Farley and Wesley A. Clark[13] (1954) first used computational machines, then called "calculators", to simulate a Hebbian network. Other neural network computational machines were created by Rochester, Holland, Habit and Duda (1956).[14]

Frank Rosenblatt[1] (1958) created the perceptron, an algorithm for pattern recognition. A multilayer perceptron (MLP) comprised 3 layers: an input layer, a hidden layer with randomized weights that did not learn, and an output layer. With mathematical notation, Rosenblatt described circuitry not in the basic perceptron, such as the exclusive-or circuit that could not be processed by neural networks at the time. In 1959, a biological model proposed by Nobel laureates Hubel and Wiesel was based on their discovery of two types of cells in the primary visual cortex: simple cells and complex cells.[15] He later published a 1962 book also introduced variants and computer experiments, including a version with four-layer perceptrons where the last two layers have learned weights (and thus a proper multilayer perceptron).[16]: section 16  Some consider that the 1962 book developed and explored all of the basic ingredients of the deep learning systems of today.[17]

Some say that research stagnated following Marvin Minsky and Papert Perceptrons (1969).[18]

Group method of data handling, a method to train arbitrarily deep neural networks was published by Alexey Ivakhnenko and Lapa in 1967, which they regarded as a form of polynomial regression,[19] or a generalization of Rosenblatt's perceptron.[20] A 1971 paper described a deep network with eight layers trained by this method.[21]

The first deep learning multilayer perceptron trained by stochastic gradient descent[22] was published in 1967 by Shun'ichi Amari.[23] In computer experiments conducted by Amari's student Saito, a five layer MLP with two modifiable layers learned internal representations to classify non-linearily separable pattern classes.[24] Subsequent developments in hardware and hyperparameter tunings have made end-to-end stochastic gradient descent the currently dominant training technique.

Backpropagation

[edit]

Backpropagation is an efficient application of the chain rule derived by Gottfried Wilhelm Leibniz in 1673[25] to networks of differentiable nodes. The terminology "back-propagating errors" was actually introduced in 1962 by Rosenblatt,[16] but he did not know how to implement this, although Henry J. Kelley had a continuous precursor of backpropagation in 1960 in the context of control theory.[26] The modern form of backpropagation was developed multiple times in early 1970s. The earliest published instance was Seppo Linnainmaa's master thesis (1970).[27][28] Paul Werbos developed it independently in 1971,[29] but had difficulty publishing it until 1982.[30] In 1986, David E. Rumelhart et al. popularized backpropagation.[31]

Recurrent network architectures

[edit]

One origin of RNN was statistical mechanics. The Ising model was developed by Wilhelm Lenz[32] and Ernst Ising[33] in the 1920s[34] as a simple statistical mechanical model of magnets at equilibrium. Glauber in 1963 studied the Ising model evolving in time, as a process towards equilibrium (Glauber dynamics), adding in the component of time.[35] Shun'ichi Amari in 1972 proposed to modify the weights of an Ising model by Hebbian learning rule as a model of associative memory, adding in the component of learning.[36] This was popularized as the Hopfield network (1982).[37]

Another origin of RNN was neuroscience. The word "recurrent" is used to describe loop-like structures in anatomy. In 1901, Cajal observed "recurrent semicircles" in the cerebellar cortex.[38] In 1933, Lorente de Nó discovered "recurrent, reciprocal connections" by Golgi's method, and proposed that excitatory loops explain certain aspects of the vestibulo-ocular reflex.[39][40] Hebb considered "reverberating circuit" as an explanation for short-term memory.[41] (McCulloch & Pitts 1943) considered neural networks that contains cycles, and noted that the current activity of such networks can be affected by activity indefinitely far in the past.

Two early influential works were the Jordan network (1986) and the Elman network (1990), which applied RNN to study cognitive psychology. In 1993, a neural history compressor system solved a "Very Deep Learning" task that required more than 1000 subsequent layers in an RNN unfolded in time.[42]

LSTM

[edit]

Sepp Hochreiter's diploma thesis (1991)[43] proposed the neural history compressor, and identified and analyzed the vanishing gradient problem.[43][44] In 1993, a neural history compressor system solved a "Very Deep Learning" task that required more than 1000 subsequent layers in an RNN unfolded in time.[45][42] Hochreiter proposed recurrent residual connections to solve the vanishing gradient problem. This led to the long short-term memory (LSTM), published in 1995.[46] LSTM can learn "very deep learning" tasks[47] with long credit assignment paths that require memories of events that happened thousands of discrete time steps before. That LSTM was not yet the modern architecture, which required a "forget gate", introduced in 1999,[48] which became the standard RNN architecture.

Long short-term memory (LSTM) networks were invented by Hochreiter and Schmidhuber in 1995 and set accuracy records in multiple applications domains.[46][49] It became the default choice for RNN architecture.

Around 2006, LSTM started to revolutionize speech recognition, outperforming traditional models in certain speech applications.[50][51] LSTM also improved large-vocabulary speech recognition[52][53] and text-to-speech synthesis[54] and was used in Google voice search, and dictation on Android devices.[55]

LSTM broke records for improved machine translation,[56] language modeling[57] and Multilingual Language Processing.[58] LSTM combined with convolutional neural networks (CNNs) improved automatic image captioning.[59]

Convolutional neural networks (CNNs)

[edit]

The origin of the CNN architecture is the "neocognitron"[60] introduced by Kunihiko Fukushima in 1980.[61][62] It was inspired by work of Hubel and Wiesel in the 1950s and 1960s which showed that cat visual cortices contain neurons that individually respond to small regions of the visual field. The neocognitron introduced the two basic types of layers in CNNs: convolutional layers, and downsampling layers. A convolutional layer contains units whose receptive fields cover a patch of the previous layer. The weight vector (the set of adaptive parameters) of such a unit is often called a filter. Units can share filters. Downsampling layers contain units whose receptive fields cover patches of previous convolutional layers. Such a unit typically computes the average of the activations of the units in its patch. This downsampling helps to correctly classify objects in visual scenes even when the objects are shifted.

In 1969, Kunihiko Fukushima also introduced the ReLU (rectified linear unit) activation function.[63][64] The rectifier has become the most popular activation function for CNNs and deep neural networks in general.[65]

The time delay neural network (TDNN) was introduced in 1987 by Alex Waibel and was one of the first CNNs, as it achieved shift invariance.[66] It did so by utilizing weight sharing in combination with backpropagation training.[67] Thus, while also using a pyramidal structure as in the neocognitron, it performed a global optimization of the weights instead of a local one.[66]

In 1988, Wei Zhang et al. applied backpropagation to a CNN (a simplified Neocognitron with convolutional interconnections between the image feature layers and the last fully connected layer) for alphabet recognition. They also proposed an implementation of the CNN with an optical computing system.[68][69]

Kunihiko Fukushima published the neocognitron in 1980.[70] Max pooling appears in a 1982 publication on the neocognitron.[71] In 1989, Yann LeCun et al. trained a CNN with the purpose of recognizing handwritten ZIP codes on mail. While the algorithm worked, training required 3 days.[72][73] It used max pooling. Learning was fully automatic, performed better than manual coefficient design, and was suited to a broader range of image recognition problems and image types. Subsequently, Wei Zhang, et al. modified their model by removing the last fully connected layer and applied it for medical image object segmentation in 1991[74] and breast cancer detection in mammograms in 1994.[75]

In a variant of the neocognitron called the cresceptron, instead of using Fukushima's spatial averaging, J. Weng et al. also used max-pooling where a downsampling unit computes the maximum of the activations of the units in its patch.[76][77][78][79]

LeNet-5, a 7-level CNN by Yann LeCun et al. in 1998,[80] that classifies digits, was applied by several banks to recognize hand-written numbers on checks (British English: cheques) digitized in 32x32 pixel images. The ability to process higher-resolution images requires larger and more layers of CNNs, so this technique is constrained by the availability of computing resources.

In 2010, Backpropagation training through max-pooling was accelerated by GPUs and shown to perform better than other pooling variants.[81] Behnke (2003) relied only on the sign of the gradient (Rprop)[82] on problems such as image reconstruction and face localization. Rprop is a first-order optimization algorithm created by Martin Riedmiller and Heinrich Braun in 1992.[83]

Deep learning

[edit]

The deep learning revolution started around CNN- and GPU-based computer vision.

Although CNNs trained by backpropagation had been around for decades and GPU implementations of NNs for years,[84] including CNNs,[85] faster implementations of CNNs on GPUs were needed to progress on computer vision. Later, as deep learning becomes widespread, specialized hardware and algorithm optimizations were developed specifically for deep learning.[86]

A key advance for the deep learning revolution was hardware advances, especially GPU. Some early work dated back to 2004.[84][85] In 2009, Raina, Madhavan, and Andrew Ng reported a 100M deep belief network trained on 30 Nvidia GeForce GTX 280 GPUs, an early demonstration of GPU-based deep learning. They reported up to 70 times faster training.[87]

In 2011, a CNN named DanNet[88][89] by Dan Ciresan, Ueli Meier, Jonathan Masci, Luca Maria Gambardella, and Jürgen Schmidhuber achieved for the first time superhuman performance in a visual pattern recognition contest, outperforming traditional methods by a factor of 3.[47] It then won more contests.[90][91] They also showed how max-pooling CNNs on GPU improved performance significantly.[92]

Many discoveries were empirical and focused on engineering. For example, in 2011, Xavier Glorot, Antoine Bordes and Yoshua Bengio found that the ReLU[63] worked better than widely used activation functions prior to 2011.

In October 2012, AlexNet by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton[93] won the large-scale ImageNet competition by a significant margin over shallow machine learning methods. Further incremental improvements included the VGG-16 network by Karen Simonyan and Andrew Zisserman[94] and Google's Inceptionv3.[95]

The success in image classification was then extended to the more challenging task of generating descriptions (captions) for images, often as a combination of CNNs and LSTMs.[96][97][98]

In 2014, the state of the art was training “very deep neural network” with 20 to 30 layers.[99] Stacking too many layers led to a steep reduction in training accuracy,[100] known as the "degradation" problem.[101] In 2015, two techniques were developed concurrently to train very deep networks: highway network[102] and residual neural network (ResNet).[103] The ResNet research team attempted to train deeper ones by empirically testing various tricks for training deeper networks until they discovered the deep residual network architecture.[104]

Generative adversarial networks

[edit]

In 1991, Juergen Schmidhuber published "artificial curiosity", neural networks in a zero-sum game.[105] The first network is a generative model that models a probability distribution over output patterns. The second network learns by gradient descent to predict the reactions of the environment to these patterns. GANs can be regarded as a case where the environmental reaction is 1 or 0 depending on whether the first network's output is in a given set.[106] It was extended to "predictability minimization" to create disentangled representations of input patterns.[107][108]

Other people had similar ideas but did not develop them similarly. An idea involving adversarial networks was published in a 2010 blog post by Olli Niemitalo.[109] This idea was never implemented and did not involve stochasticity in the generator and thus was not a generative model. It is now known as a conditional GAN or cGAN.[110] An idea similar to GANs was used to model animal behavior by Li, Gauci and Gross in 2013.[111]

Another inspiration for GANs was noise-contrastive estimation,[112] which uses the same loss function as GANs and which Goodfellow studied during his PhD in 2010–2014.

Generative adversarial network (GAN) by (Ian Goodfellow et al., 2014)[113] became state of the art in generative modeling during 2014-2018 period. Excellent image quality is achieved by Nvidia's StyleGAN (2018)[114] based on the Progressive GAN by Tero Karras et al.[115] Here the GAN generator is grown from small to large scale in a pyramidal fashion. Image generation by GAN reached popular success, and provoked discussions concerning deepfakes.[116] Diffusion models (2015)[117] eclipsed GANs in generative modeling since then, with systems such as DALL·E 2 (2022) and Stable Diffusion (2022).

Attention mechanism and Transformer

[edit]

The human selective attention had been studied in neuroscience and cognitive psychology.[118] Selective attention of audition was studied in the cocktail party effect (Colin Cherry, 1953).[119] (Donald Broadbent, 1958) proposed the filter model of attention.[120] Selective attention of vision was studied in the 1960s by George Sperling's partial report paradigm. It was also noticed that saccade control is modulated by cognitive processes, in that the eye moves preferentially towards areas of high salience. As the fovea of the eye is small, the eye cannot sharply resolve all of the visual field at once. The use of saccade control allows the eye to quickly scan important features of a scene.[121]

These researches inspired algorithms, such as a variant of the Neocognitron.[122][123] Conversely, developments in neural networks had inspired circuit models of biological visual attention.[124][125]

A key aspect of attention mechanism is the use of multiplicative operations, which had been studied under the names of higher-order neural networks,[126] multiplication units,[127] sigma-pi units,[128] fast weight controllers,[129] and hyper-networks.[130]

Recurrent attention

[edit]

During the deep learning era, attention mechanism was developed solve similar problems in encoding-decoding.[131]

The idea of encoder-decoder sequence transduction had been developed in the early 2010s. The papers most commonly cited as the originators that produced seq2seq are two papers from 2014.[132][133] A seq2seq architecture employs two RNN, typically LSTM, an "encoder" and a "decoder", for sequence transduction, such as machine translation. They became state of the art in machine translation, and was instrumental in the development of attention mechanism and Transformer.

An image captioning model was proposed in 2015, citing inspiration from the seq2seq model.[134] that would encode an input image into a fixed-length vector. (Xu et al. 2015),[135] citing (Bahdanau et al. 2014),[136] applied the attention mechanism as used in the seq2seq model to image captioning.

Transformer

[edit]

One problem with seq2seq models was their use of recurrent neural networks, which are not parallelizable as both the encoder and the decoder processes the sequence token-by-token. The decomposable attention attempted to solve this problem by processing the input sequence in parallel, before computing a "soft alignment matrix" ("alignment" is the terminology used by (Bahdanau et al. 2014)[136]). This allowed parallel processing.

The idea of using attention mechanism for self-attention, instead of in an encoder-decoder (cross-attention), was also proposed during this period, such as in differentiable neural computers and neural Turing machines.[137] It was termed intra-attention[138] where an LSTM is augmented with a memory network as it encodes an input sequence.

These strands of development were combined in the Transformer architecture, published in Attention Is All You Need (2017). Subsequently, attention mechanisms were extended within the framework of Transformer architecture.

Seq2seq models with attention still suffered from the same issue with recurrent networks, which is that they are hard to parallelize, which prevented them to be accelerated on GPUs. In 2016, decomposable attention applied attention mechanism to the feedforward network, which are easy to parallelize.[139] One of its authors, Jakob Uszkoreit, suspected that attention without recurrence is sufficient for language translation, thus the title "attention is all you need".[140]

In 2017, the original (100M-sized) encoder-decoder transformer model was proposed in the "Attention is all you need" paper. At the time, the focus of the research was on improving seq2seq for machine translation, by removing its recurrence to processes all tokens in parallel, but preserving its dot-product attention mechanism to keep its text processing performance.[141] Its parallelizability was an important factor to its widespread use in large neural networks.[142]

Unsupervised and self-supervised learning

[edit]

Self-organizing maps

[edit]

Self-organizing maps (SOMs) were described by Teuvo Kohonen in 1982.[143][144] SOMs are neurophysiologically inspired[145] artificial neural networks that learn low-dimensional representations of high-dimensional data while preserving the topological structure of the data. They are trained using competitive learning.

SOMs create internal representations reminiscent of the cortical homunculus, a distorted representation of the human body, based on a neurological "map" of the areas and proportions of the human brain dedicated to processing sensory functions, for different parts of the body.

Boltzmann machines

[edit]

During 1985–1995, inspired by statistical mechanics, several architectures and methods were developed by Terry Sejnowski, Peter Dayan, Geoffrey Hinton, etc., including the Boltzmann machine,[146] restricted Boltzmann machine,[147] Helmholtz machine,[148] and the wake-sleep algorithm.[149] These were designed for unsupervised learning of deep generative models. However, those were more computationally expensive compared to backpropagation. Boltzmann machine learning algorithm, published in 1985, was briefly popular before being eclipsed by the backpropagation algorithm in 1986. (p. 112 [150]).

Geoffrey Hinton et al. (2006) proposed learning a high-level internal representation using successive layers of binary or real-valued latent variables with a restricted Boltzmann machine[151] to model each layer. This RBM is a generative stochastic feedforward neural network that can learn a probability distribution over its set of inputs. Once sufficiently many layers have been learned, the deep architecture may be used as a generative model by reproducing the data when sampling down the model (an "ancestral pass") from the top level feature activations.[152][153]

Deep learning

[edit]

In 2012, Andrew Ng and Jeff Dean created an FNN that learned to recognize higher-level concepts, such as cats, only from watching unlabeled images taken from YouTube videos.[154]

Other aspects

[edit]

Knowledge distillation

[edit]

Knowledge distillation or model distillation is the process of transferring knowledge from a large model to a smaller one. The idea of using the output of one neural network to train another neural network was studied as the teacher-student network configuration.[155] In 1992, several papers studied the statistical mechanics of teacher-student network configuration, where both networks are committee machines[156][157] or both are parity machines.[158]

Another early example of network distillation was also published in 1992, in the field of recurrent neural networks (RNNs). The problem was sequence prediction, and it was solved by two RNNs. One of them (the "automatizer") predicted the sequence, while the other (the "chunker") predicted the automatizer's errors. Simultaneously, the automatizer was trained to predict the internal states of the chunker. Once the automatizer could predict the chunker's internal states well, it began to correct the errors itself, eventually making the chunker obsolete and leaving a single RNN.[159]

A related methodology was model compression or pruning, where a trained network is reduced in size. It was inspired by neurobiological studies showing that the human brain is resistant to damage, and was studied in the 1980s, via methods such as Biased Weight Decay[160] and Optimal Brain Damage.[161]
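A minimal sketch of compressing a trained weight matrix by pruning; a simple magnitude criterion stands in here for Optimal Brain Damage's second-order saliency estimate, and the pruning fraction is an arbitrary illustrative choice:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((8, 8))                  # a trained weight matrix (toy stand-in)

    prune_fraction = 0.3
    threshold = np.quantile(np.abs(W), prune_fraction)
    mask = np.abs(W) >= threshold                    # keep only the larger-magnitude weights
    W_pruned = W * mask

    print(f"kept {mask.mean():.0%} of the weights")
    # In practice the pruned network is then retrained or fine-tuned to recover accuracy.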

Hardware-based designs


The development of metal–oxide–semiconductor (MOS) very-large-scale integration (VLSI), combining millions or billions of MOS transistors onto a single chip in the form of complementary MOS (CMOS) technology, enabled the development of practical artificial neural networks in the 1980s.[162]

Computational devices were created in CMOS for both biophysical simulation and neuromorphic computing inspired by the structure and function of the human brain. Nanodevices[163] for very-large-scale principal component analysis and convolution may enable a new class of neural computing, because they are fundamentally analog rather than digital (even though the first implementations may use digital devices).[164]

Notes

  1. ^ Neurons generate an action potential—the release of neurotransmitters that are chemical inputs to other neurons—based on the sum of its incoming chemical inputs.

References

  1. ^ a b c Rosenblatt, F. (1958). "The Perceptron: A Probabilistic Model For Information Storage And Organization In The Brain". Psychological Review. 65 (6): 386–408. CiteSeerX 10.1.1.588.3775. doi:10.1037/h0042519. PMID 13602029. S2CID 12781225.
  2. ^ Crevier, Daniel (1993). AI: The Tumultuous Search for Artificial Intelligence. New York, NY: BasicBooks. ISBN 0-465-02997-3.
  3. ^ Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. (2017-05-24). "ImageNet classification with deep convolutional neural networks" (PDF). Communications of the ACM. 60 (6): 84–90. doi:10.1145/3065386. ISSN 0001-0782. S2CID 195908774.
  4. ^ Gershgorn, Dave (26 July 2017). "The data that transformed AI research—and possibly the world". Quartz.
  5. ^ Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need" (PDF). Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
  6. ^ Merriman, Mansfield. A List of Writings Relating to the Method of Least Squares: With Historical and Critical Notes. Vol. 4. Academy, 1877.
  7. ^ Stigler, Stephen M. (1981). "Gauss and the Invention of Least Squares". Ann. Stat. 9 (3): 465–474. doi:10.1214/aos/1176345451.
  8. ^ Bretscher, Otto (1995). Linear Algebra With Applications (3rd ed.). Upper Saddle River, NJ: Prentice Hall.
  9. ^ Stigler, Stephen M. (1986). The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge: Harvard. ISBN 0-674-40340-1.
  10. ^ McCulloch, Warren; Pitts, Walter (1943). "A Logical Calculus of Ideas Immanent in Nervous Activity". Bulletin of Mathematical Biophysics. 5 (4): 115–133. doi:10.1007/BF02478259.
  11. ^ Kleene, S. C. (1956-12-31), Shannon, C. E.; McCarthy, J. (eds.), "Representation of Events in Nerve Nets and Finite Automata", Automata Studies. (AM-34), Princeton University Press, pp. 3–42, doi:10.1515/9781400882618-002, ISBN 978-1-4008-8261-8, retrieved 2024-10-14
  12. ^ Hebb, Donald (1949). The Organization of Behavior. New York: Wiley. ISBN 978-1-135-63190-1.
  13. ^ Farley, B.G.; W.A. Clark (1954). "Simulation of Self-Organizing Systems by Digital Computer". IRE Transactions on Information Theory. 4 (4): 76–84. doi:10.1109/TIT.1954.1057468.
  14. ^ Rochester, N.; J.H. Holland; L.H. Habit; W.L. Duda (1956). "Tests on a cell assembly theory of the action of the brain, using a large digital computer". IRE Transactions on Information Theory. 2 (3): 80–93. doi:10.1109/TIT.1956.1056810.
  15. ^ David H. Hubel and Torsten N. Wiesel (2005). Brain and visual perception: the story of a 25-year collaboration. Oxford University Press US. p. 106. ISBN 978-0-19-517618-6.
  16. ^ a b Rosenblatt, Frank (1962). Principles of Neurodynamics. Spartan, New York.
  17. ^ Tappert, Charles C. (2019). "Who Is the Father of Deep Learning?". 2019 International Conference on Computational Science and Computational Intelligence (CSCI). IEEE. pp. 343–348. doi:10.1109/CSCI49370.2019.00067. ISBN 978-1-7281-5584-5. S2CID 216043128. Retrieved 31 May 2021.
  18. ^ Minsky, Marvin; Papert, Seymour (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press. ISBN 978-0-262-63022-1.
  19. ^ Ivakhnenko, A. G.; Lapa, V. G. (1967). Cybernetics and Forecasting Techniques. American Elsevier Publishing Co. ISBN 978-0-444-00020-0.
  20. ^ Ivakhnenko, A.G. (March 1970). "Heuristic self-organization in problems of engineering cybernetics". Automatica. 6 (2): 207–219. doi:10.1016/0005-1098(70)90092-0.
  21. ^ Ivakhnenko, Alexey (1971). "Polynomial theory of complex systems" (PDF). IEEE Transactions on Systems, Man, and Cybernetics. SMC-1 (4): 364–378. doi:10.1109/TSMC.1971.4308320. Archived (PDF) from the original on 2017-08-29. Retrieved 2019-11-05.
  22. ^ Robbins, H.; Monro, S. (1951). "A Stochastic Approximation Method". The Annals of Mathematical Statistics. 22 (3): 400. doi:10.1214/aoms/1177729586.
  23. ^ Amari, Shun'ichi (1967). "A theory of adaptive pattern classifier". IEEE Transactions. EC (16): 279–307.
  24. ^ Schmidhuber, Jürgen (2022). "Annotated History of Modern AI and Deep Learning". arXiv:2212.11279 [cs.NE].
  25. ^ Leibniz, Gottfried Wilhelm Freiherr von (1920). The Early Mathematical Manuscripts of Leibniz: Translated from the Latin Texts Published by Carl Immanuel Gerhardt with Critical and Historical Notes (Leibniz published the chain rule in a 1676 memoir). Open court publishing Company. ISBN 9780598818461.
  26. ^ Kelley, Henry J. (1960). "Gradient theory of optimal flight paths". ARS Journal. 30 (10): 947–954. doi:10.2514/8.5282.
  27. ^ Linnainmaa, Seppo (1970). The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors (Masters) (in Finnish). University of Helsinki. p. 6–7.
  28. ^ Linnainmaa, Seppo (1976). "Taylor expansion of the accumulated rounding error". BIT Numerical Mathematics. 16 (2): 146–160. doi:10.1007/bf01931367. S2CID 122357351.
  29. ^ Anderson, James A.; Rosenfeld, Edward, eds. (2000). Talking Nets: An Oral History of Neural Networks. The MIT Press. doi:10.7551/mitpress/6626.003.0016. ISBN 978-0-262-26715-1.
  30. ^ Werbos, Paul (1982). "Applications of advances in nonlinear sensitivity analysis" (PDF). System modeling and optimization. Springer. pp. 762–770. Archived (PDF) from the original on 14 April 2016. Retrieved 2 July 2017.
  31. ^ Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (October 1986). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536. Bibcode:1986Natur.323..533R. doi:10.1038/323533a0. ISSN 1476-4687.
  32. ^ Lenz, W. (1920), "Beiträge zum Verständnis der magnetischen Eigenschaften in festen Körpern", Physikalische Zeitschrift, 21: 613–615.
  33. ^ Ising, E. (1925), "Beitrag zur Theorie des Ferromagnetismus", Z. Phys., 31 (1): 253–258, Bibcode:1925ZPhy...31..253I, doi:10.1007/BF02980577, S2CID 122157319
  34. ^ Brush, Stephen G. (1967). "History of the Lenz-Ising Model". Reviews of Modern Physics. 39 (4): 883–893. Bibcode:1967RvMP...39..883B. doi:10.1103/RevModPhys.39.883.
  35. ^ Glauber, Roy J. (February 1963). "Roy J. Glauber "Time-Dependent Statistics of the Ising Model"". Journal of Mathematical Physics. 4 (2): 294–307. doi:10.1063/1.1703954. Retrieved 2021-03-21.
  36. ^ Amari, S.-I. (November 1972). "Learning Patterns and Pattern Sequences by Self-Organizing Nets of Threshold Elements". IEEE Transactions on Computers. C-21 (11): 1197–1206. doi:10.1109/T-C.1972.223477. ISSN 0018-9340.
  37. ^ Hopfield, J. J. (1982). "Neural networks and physical systems with emergent collective computational abilities". Proceedings of the National Academy of Sciences. 79 (8): 2554–2558. Bibcode:1982PNAS...79.2554H. doi:10.1073/pnas.79.8.2554. PMC 346238. PMID 6953413.
  38. ^ Espinosa-Sanchez, Juan Manuel; Gomez-Marin, Alex; de Castro, Fernando (2023-07-05). "The Importance of Cajal's and Lorente de Nó's Neuroscience to the Birth of Cybernetics". The Neuroscientist. doi:10.1177/10738584231179932. hdl:10261/348372. ISSN 1073-8584. PMID 37403768.
  39. ^ de NÓ, R. Lorente (1933-08-01). "Vestibulo-Ocular Reflex Arc". Archives of Neurology and Psychiatry. 30 (2): 245. doi:10.1001/archneurpsyc.1933.02240140009001. ISSN 0096-6754.
  40. ^ Larriva-Sahd, Jorge A. (2014-12-03). "Some predictions of Rafael Lorente de Nó 80 years later". Frontiers in Neuroanatomy. 8: 147. doi:10.3389/fnana.2014.00147. ISSN 1662-5129. PMC 4253658. PMID 25520630.
  41. ^ "reverberating circuit". Oxford Reference. Retrieved 2024-07-27.
  42. ^ a b Schmidhuber, Jürgen (1993). Habilitation thesis: System modeling and optimization (PDF).[permanent dead link] Page 150 ff demonstrates credit assignment across the equivalent of 1,200 layers in an unfolded RNN.
  43. ^ a b S. Hochreiter., "Untersuchungen zu dynamischen neuronalen Netzen". Archived 2015-03-06 at the Wayback Machine. Diploma thesis. Institut f. Informatik, Technische Univ. Munich. Advisor: J. Schmidhuber, 1991.
  44. ^ Hochreiter, S.; et al. (15 January 2001). "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies". In Kolen, John F.; Kremer, Stefan C. (eds.). A Field Guide to Dynamical Recurrent Networks. John Wiley & Sons. ISBN 978-0-7803-5369-5.
  45. ^ Schmidhuber, Jürgen (1992). "Learning complex, extended sequences using the principle of history compression (based on TR FKI-148, 1991)" (PDF). Neural Computation. 4 (2): 234–242. doi:10.1162/neco.1992.4.2.234. S2CID 18271205.[permanent dead link]
  46. ^ a b Sepp Hochreiter; Jürgen Schmidhuber (21 August 1995), Long Short Term Memory, Wikidata Q98967430
  47. ^ a b Schmidhuber, J. (2015). "Deep Learning in Neural Networks: An Overview". Neural Networks. 61: 85–117. arXiv:1404.7828. doi:10.1016/j.neunet.2014.09.003. PMID 25462637. S2CID 11715509.
  48. ^ Gers, Felix; Schmidhuber, Jürgen; Cummins, Fred (1999). "Learning to forget: Continual prediction with LSTM". 9th International Conference on Artificial Neural Networks: ICANN '99. Vol. 1999. pp. 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.
  49. ^ Hochreiter, Sepp; Schmidhuber, Jürgen (1997-11-01). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276. S2CID 1915014.
  50. ^ Graves, Alex; Schmidhuber, Jürgen (2005-07-01). "Framewise phoneme classification with bidirectional LSTM and other neural network architectures". Neural Networks. IJCNN 2005. 18 (5): 602–610. CiteSeerX 10.1.1.331.5800. doi:10.1016/j.neunet.2005.06.042. PMID 16112549. S2CID 1856462.
  51. ^ Fernández, Santiago; Graves, Alex; Schmidhuber, Jürgen (2007). "An Application of Recurrent Neural Networks to Discriminative Keyword Spotting". Proceedings of the 17th International Conference on Artificial Neural Networks. ICANN'07. Berlin, Heidelberg: Springer-Verlag. pp. 220–229. ISBN 978-3-540-74693-5.
  52. ^ Sak, Haşim; Senior, Andrew; Beaufays, Françoise (2014). "Long Short-Term Memory recurrent neural network architectures for large scale acoustic modeling" (PDF). Google Research.
  53. ^ Li, Xiangang; Wu, Xihong (2014-10-15). "Constructing Long Short-Term Memory based Deep Recurrent Neural Networks for Large Vocabulary Speech Recognition". arXiv:1410.4281 [cs.CL].
  54. ^ Fan, Bo; Wang, Lijuan; Soong, Frank K.; Xie, Lei (2015). "Photo-Real Talking Head with Deep Bidirectional LSTM". Proceedings of ICASSP 2015 IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4884–8. doi:10.1109/ICASSP.2015.7178899. ISBN 978-1-4673-6997-8.
  55. ^ Sak, Haşim; Senior, Andrew; Rao, Kanishka; Beaufays, Françoise; Schalkwyk, Johan (September 2015). "Google voice search: faster and more accurate".
  56. ^ Sutskever, Ilya; Vinyals, Oriol; Le, Quoc V. (2014). "Sequence to Sequence Learning with Neural Networks" (PDF). Electronic Proceedings of the Neural Information Processing Systems Conference. 27: 5346. arXiv:1409.3215. Bibcode:2014arXiv1409.3215S.
  57. ^ Jozefowicz, Rafal; Vinyals, Oriol; Schuster, Mike; Shazeer, Noam; Wu, Yonghui (2016-02-07). "Exploring the Limits of Language Modeling". arXiv:1602.02410 [cs.CL].
  58. ^ Gillick, Dan; Brunk, Cliff; Vinyals, Oriol; Subramanya, Amarnag (2015-11-30). "Multilingual Language Processing From Bytes". arXiv:1512.00103 [cs.CL].
  59. ^ Vinyals, Oriol; Toshev, Alexander; Bengio, Samy; Erhan, Dumitru (2014-11-17). "Show and Tell: A Neural Image Caption Generator". arXiv:1411.4555 [cs.CV].
  60. ^ Fukushima, K. (2007). "Neocognitron". Scholarpedia. 2 (1): 1717. Bibcode:2007SchpJ...2.1717F. doi:10.4249/scholarpedia.1717.
  61. ^ Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position" (PDF). Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608. Retrieved 16 November 2013.
  62. ^ LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (2015). "Deep learning" (PDF). Nature. 521 (7553): 436–444. Bibcode:2015Natur.521..436L. doi:10.1038/nature14539. PMID 26017442. S2CID 3074096.
  63. ^ a b Fukushima, K. (1969). "Visual feature extraction by a multilayered network of analog threshold elements". IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322–333. doi:10.1109/TSSC.1969.300225.
  64. ^ Schmidhuber, Juergen (2022). "Annotated History of Modern AI and Deep Learning". arXiv:2212.11279 [cs.NE].
  65. ^ Ramachandran, Prajit; Barret, Zoph; Quoc, V. Le (October 16, 2017). "Searching for Activation Functions". arXiv:1710.05941 [cs.NE].
  66. ^ a b Waibel, Alex (December 1987). Phoneme Recognition Using Time-Delay Neural Networks. Meeting of the Institute of Electrical, Information and Communication Engineers (IEICE). Tokyo, Japan.
  67. ^ Alexander Waibel et al., "Phoneme Recognition Using Time-Delay Neural Networks", IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume 37, No. 3, pp. 328–339, March 1989.
  68. ^ Zhang, Wei (1988). "Shift-invariant pattern recognition neural network and its optical architecture". Proceedings of Annual Conference of the Japan Society of Applied Physics.
  69. ^ Zhang, Wei (1990). "Parallel distributed processing model with local space-invariant interconnections and its optical architecture". Applied Optics. 29 (32): 4790–7. Bibcode:1990ApOpt..29.4790Z. doi:10.1364/AO.29.004790. PMID 20577468.
  70. ^ Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position" (PDF). Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608. Archived (PDF) from the original on 3 June 2014. Retrieved 16 November 2013.
  71. ^ Fukushima, Kunihiko; Miyake, Sei (1982-01-01). "Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position". Pattern Recognition. 15 (6): 455–469. Bibcode:1982PatRe..15..455F. doi:10.1016/0031-3203(82)90024-3. ISSN 0031-3203.
  72. ^ LeCun et al., "Backpropagation Applied to Handwritten Zip Code Recognition," Neural Computation, 1, pp. 541–551, 1989.
  73. ^ LeCun, Yann; Boser, Bernhard; Denker, John; Henderson, Donnie; Howard, R.; Hubbard, Wayne; Jackel, Lawrence (1989). "Handwritten Digit Recognition with a Back-Propagation Network". Advances in Neural Information Processing Systems. 2. Morgan-Kaufmann.
  74. ^ Zhang, Wei (1991). "Image processing of human corneal endothelium based on a learning network". Applied Optics. 30 (29): 4211–7. Bibcode:1991ApOpt..30.4211Z. doi:10.1364/AO.30.004211. PMID 20706526.
  75. ^ Zhang, Wei (1994). "Computerized detection of clustered microcalcifications in digital mammograms using a shift-invariant artificial neural network". Medical Physics. 21 (4): 517–24. Bibcode:1994MedPh..21..517Z. doi:10.1118/1.597177. PMID 8058017.
  76. ^ J. Weng, N. Ahuja and T. S. Huang, "Cresceptron: a self-organizing neural network which grows adaptively," Proc. International Joint Conference on Neural Networks, Baltimore, Maryland, vol I, pp. 576–581, June, 1992.
  77. ^ J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation of 3-D objects from 2-D images," Proc. 4th International Conf. Computer Vision, Berlin, Germany, pp. 121–128, May, 1993.
  78. ^ J. Weng, N. Ahuja and T. S. Huang, "Learning recognition and segmentation using the Cresceptron," International Journal of Computer Vision, vol. 25, no. 2, pp. 105–139, Nov. 1997.
  79. ^ Weng, J; Ahuja, N; Huang, TS (1993). "Learning recognition and segmentation of 3-D objects from 2-D images". 1993 (4th) International Conference on Computer Vision. pp. 121–128. doi:10.1109/ICCV.1993.378228. ISBN 0-8186-3870-2. S2CID 8619176.
  80. ^ LeCun, Yann; Léon Bottou; Yoshua Bengio; Patrick Haffner (1998). "Gradient-based learning applied to document recognition" (PDF). Proceedings of the IEEE. 86 (11): 2278–2324. CiteSeerX 10.1.1.32.9552. doi:10.1109/5.726791. S2CID 14542261. Retrieved October 7, 2016.
  81. ^ Dominik Scherer, Andreas C. Müller, and Sven Behnke: "Evaluation of Pooling Operations in Convolutional Architectures for Object Recognition," In 20th International Conference Artificial Neural Networks (ICANN), pp. 92–101, 2010. doi:10.1007/978-3-642-15825-4_10.
  82. ^ Sven Behnke (2003). Hierarchical Neural Networks for Image Interpretation (PDF). Lecture Notes in Computer Science. Vol. 2766. Springer.
  83. ^ Martin Riedmiller und Heinrich Braun: Rprop – A Fast Adaptive Learning Algorithm. Proceedings of the International Symposium on Computer and Information Science VII, 1992
  84. ^ a b Oh, K.-S.; Jung, K. (2004). "GPU implementation of neural networks". Pattern Recognition. 37 (6): 1311–1314. Bibcode:2004PatRe..37.1311O. doi:10.1016/j.patcog.2004.01.013.
  85. ^ a b Chellapilla, Kumar; Puri, Sidd; Simard, Patrice (2006), High performance convolutional neural networks for document processing, archived from the original on 2020-05-18, retrieved 2021-02-14
  86. ^ Sze, Vivienne; Chen, Yu-Hsin; Yang, Tien-Ju; Emer, Joel (2017). "Efficient Processing of Deep Neural Networks: A Tutorial and Survey". arXiv:1703.09039 [cs.CV].
  87. ^ Raina, Rajat; Madhavan, Anand; Ng, Andrew Y. (2009-06-14). "Large-scale deep unsupervised learning using graphics processors". Proceedings of the 26th Annual International Conference on Machine Learning. ICML '09. New York, NY, USA: Association for Computing Machinery. pp. 873–880. doi:10.1145/1553374.1553486. ISBN 978-1-60558-516-1.
  88. ^ Cireşan, Dan Claudiu; Meier, Ueli; Gambardella, Luca Maria; Schmidhuber, Jürgen (21 September 2010). "Deep, Big, Simple Neural Nets for Handwritten Digit Recognition". Neural Computation. 22 (12): 3207–3220. arXiv:1003.0358. doi:10.1162/neco_a_00052. ISSN 0899-7667. PMID 20858131. S2CID 1918673.
  89. ^ Ciresan, D. C.; Meier, U.; Masci, J.; Gambardella, L.M.; Schmidhuber, J. (2011). "Flexible, High Performance Convolutional Neural Networks for Image Classification" (PDF). International Joint Conference on Artificial Intelligence. doi:10.5591/978-1-57735-516-8/ijcai11-210. Archived (PDF) from the original on 2014-09-29. Retrieved 2017-06-13.
  90. ^ Ciresan, Dan; Giusti, Alessandro; Gambardella, Luca M.; Schmidhuber, Jürgen (2012). Pereira, F.; Burges, C. J. C.; Bottou, L.; Weinberger, K. Q. (eds.). Advances in Neural Information Processing Systems 25 (PDF). Curran Associates, Inc. pp. 2843–2851. Archived (PDF) from the original on 2017-08-09. Retrieved 2017-06-13.
  91. ^ Ciresan, D.; Giusti, A.; Gambardella, L.M.; Schmidhuber, J. (2013). "Mitosis Detection in Breast Cancer Histology Images with Deep Neural Networks". Medical Image Computing and Computer-Assisted Intervention – MICCAI 2013. Lecture Notes in Computer Science. Vol. 7908. pp. 411–418. doi:10.1007/978-3-642-40763-5_51. ISBN 978-3-642-38708-1. PMID 24579167.
  92. ^ Ciresan, D.; Meier, U.; Schmidhuber, J. (2012). "Multi-column deep neural networks for image classification". 2012 IEEE Conference on Computer Vision and Pattern Recognition. pp. 3642–3649. arXiv:1202.2745. doi:10.1109/cvpr.2012.6248110. ISBN 978-1-4673-1228-8. S2CID 2161592.
  93. ^ Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey (2012). "ImageNet Classification with Deep Convolutional Neural Networks" (PDF). NIPS 2012: Neural Information Processing Systems, Lake Tahoe, Nevada. Archived (PDF) from the original on 2017-01-10. Retrieved 2017-05-24.
  94. ^ Simonyan, Karen; Andrew, Zisserman (2014). "Very Deep Convolution Networks for Large Scale Image Recognition". arXiv:1409.1556 [cs.CV].
  95. ^ Szegedy, Christian (2015). "Going deeper with convolutions" (PDF). Cvpr2015. arXiv:1409.4842.
  96. ^ Vinyals, Oriol; Toshev, Alexander; Bengio, Samy; Erhan, Dumitru (2014). "Show and Tell: A Neural Image Caption Generator". arXiv:1411.4555 [cs.CV].
  97. ^ Fang, Hao; Gupta, Saurabh; Iandola, Forrest; Srivastava, Rupesh; Deng, Li; Dollár, Piotr; Gao, Jianfeng; He, Xiaodong; Mitchell, Margaret; Platt, John C; Lawrence Zitnick, C; Zweig, Geoffrey (2014). "From Captions to Visual Concepts and Back". arXiv:1411.4952 [cs.CV].
  98. ^ Kiros, Ryan; Salakhutdinov, Ruslan; Zemel, Richard S (2014). "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models". arXiv:1411.2539 [cs.LG].
  99. ^ Simonyan, Karen; Zisserman, Andrew (2015-04-10), Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv:1409.1556
  100. ^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". arXiv:1502.01852 [cs.CV].
  101. ^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (10 Dec 2015). Deep Residual Learning for Image Recognition. arXiv:1512.03385.
  102. ^ Srivastava, Rupesh Kumar; Greff, Klaus; Schmidhuber, Jürgen (2 May 2015). "Highway Networks". arXiv:1505.00387 [cs.LG].
  103. ^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). Deep Residual Learning for Image Recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas, NV, USA: IEEE. pp. 770–778. arXiv:1512.03385. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1.
  104. ^ Linn, Allison (2015-12-10). "Microsoft researchers win ImageNet computer vision challenge". The AI Blog. Retrieved 2024-06-29.
  105. ^ Schmidhuber, Jürgen (1991). "A possibility for implementing curiosity and boredom in model-building neural controllers". Proc. SAB'1991. MIT Press/Bradford Books. pp. 222–227.
  106. ^ Schmidhuber, Jürgen (2020). "Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991)". Neural Networks. 127: 58–66. arXiv:1906.04493. doi:10.1016/j.neunet.2020.04.008. PMID 32334341. S2CID 216056336.
  107. ^ Schmidhuber, Jürgen (November 1992). "Learning Factorial Codes by Predictability Minimization". Neural Computation. 4 (6): 863–879. doi:10.1162/neco.1992.4.6.863. S2CID 42023620.
  108. ^ Schmidhuber, Jürgen; Eldracher, Martin; Foltin, Bernhard (1996). "Semilinear predictability minimzation produces well-known feature detectors". Neural Computation. 8 (4): 773–786. doi:10.1162/neco.1996.8.4.773. S2CID 16154391.
  109. ^ Niemitalo, Olli (February 24, 2010). "A method for training artificial neural networks to generate missing data within a variable context". Internet Archive (Wayback Machine). Archived from the original on March 12, 2012. Retrieved February 22, 2019.
  110. ^ "GANs were invented in 2010?". reddit r/MachineLearning. 2019. Retrieved 2019-05-28.
  111. ^ Li, Wei; Gauci, Melvin; Gross, Roderich (July 6, 2013). "Proceeding of the fifteenth annual conference on Genetic and evolutionary computation conference - GECCO '13". Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation (GECCO 2013). Amsterdam, the Netherlands: ACM. pp. 223–230. doi:10.1145/2463372.2465801. ISBN 9781450319638.
  112. ^ Gutmann, Michael; Hyvärinen, Aapo. "Noise-Contrastive Estimation" (PDF). International Conference on AI and Statistics.
  113. ^ Goodfellow, Ian; Pouget-Abadie, Jean; Mirza, Mehdi; Xu, Bing; Warde-Farley, David; Ozair, Sherjil; Courville, Aaron; Bengio, Yoshua (2014). Generative Adversarial Networks (PDF). Proceedings of the International Conference on Neural Information Processing Systems (NIPS 2014). pp. 2672–2680. Archived (PDF) from the original on 22 November 2019. Retrieved 20 August 2019.
  114. ^ "GAN 2.0: NVIDIA's Hyperrealistic Face Generator". SyncedReview.com. December 14, 2018. Retrieved October 3, 2019.
  115. ^ Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. (26 February 2018). "Progressive Growing of GANs for Improved Quality, Stability, and Variation". arXiv:1710.10196 [cs.NE].
  116. ^ "Prepare, Don't Panic: Synthetic Media and Deepfakes". witness.org. Archived from the original on 2 December 2020. Retrieved 25 November 2020.
  117. ^ Sohl-Dickstein, Jascha; Weiss, Eric; Maheswaranathan, Niru; Ganguli, Surya (2015-06-01). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics" (PDF). Proceedings of the 32nd International Conference on Machine Learning. 37. PMLR: 2256–2265. arXiv:1503.03585.
  118. ^ Kramer, Arthur F.; Wiegmann, Douglas A.; Kirlik, Alex (2006-12-28). "1 Attention: From History to Application". Attention: From Theory to Practice. Oxford University Press. doi:10.1093/acprof:oso/9780195305722.003.0001. ISBN 978-0-19-530572-2.
  119. ^ Cherry EC (1953). "Some Experiments on the Recognition of Speech, with One and with Two Ears" (PDF). The Journal of the Acoustical Society of America. 25 (5): 975–79. Bibcode:1953ASAJ...25..975C. doi:10.1121/1.1907229. hdl:11858/00-001M-0000-002A-F750-3. ISSN 0001-4966.
  120. ^ Broadbent, D (1958). Perception and Communication. London: Pergamon Press.
  121. ^ Kowler, Eileen; Anderson, Eric; Dosher, Barbara; Blaser, Erik (1995-07-01). "The role of attention in the programming of saccades". Vision Research. 35 (13): 1897–1916. doi:10.1016/0042-6989(94)00279-U. ISSN 0042-6989. PMID 7660596.
  122. ^ Fukushima, Kunihiko (1987-12-01). "Neural network model for selective attention in visual pattern recognition and associative recall". Applied Optics. 26 (23): 4985–4992. Bibcode:1987ApOpt..26.4985F. doi:10.1364/AO.26.004985. ISSN 0003-6935. PMID 20523477.
  123. ^ Ba, Jimmy; Mnih, Volodymyr; Kavukcuoglu, Koray (2015-04-23). "Multiple Object Recognition with Visual Attention". arXiv:1412.7755 [cs.LG].
  124. ^ Koch, Christof; Ullman, Shimon (1987), Vaina, Lucia M. (ed.), "Shifts in Selective Visual Attention: Towards the Underlying Neural Circuitry", Matters of Intelligence: Conceptual Structures in Cognitive Neuroscience, Dordrecht: Springer Netherlands, pp. 115–141, doi:10.1007/978-94-009-3833-5_5, ISBN 978-94-009-3833-5, retrieved 2024-08-06
  125. ^ Soydaner, Derya (August 2022). "Attention mechanism in neural networks: where it comes and where it goes". Neural Computing and Applications. 34 (16): 13371–13385. doi:10.1007/s00521-022-07366-3. ISSN 0941-0643.
  126. ^ Giles, C. Lee; Maxwell, Tom (1987-12-01). "Learning, invariance, and generalization in high-order neural networks". Applied Optics. 26 (23): 4972–4978. doi:10.1364/AO.26.004972. ISSN 0003-6935. PMID 20523475.
  127. ^ Feldman, J. A.; Ballard, D. H. (1982-07-01). "Connectionist models and their properties". Cognitive Science. 6 (3): 205–254. doi:10.1016/S0364-0213(82)80001-3. ISSN 0364-0213.
  128. ^ Rumelhart, David E.; McClelland, James L.; PDP Research Group (1987-07-29). Parallel Distributed Processing, Volume 1: Explorations in the Microstructure of Cognition: Foundations, Chapter 2 (PDF). Cambridge, Mass: Bradford Books. ISBN 978-0-262-68053-0.
  129. ^ Schmidhuber, Jürgen (January 1992). "Learning to Control Fast-Weight Memories: An Alternative to Dynamic Recurrent Networks". Neural Computation. 4 (1): 131–139. doi:10.1162/neco.1992.4.1.131. ISSN 0899-7667.
  130. ^ Ha, David; Dai, Andrew; Le, Quoc V. (2016-12-01). "HyperNetworks". arXiv:1609.09106 [cs.LG].
  131. ^ Niu, Zhaoyang; Zhong, Guoqiang; Yu, Hui (2021-09-10). "A review on the attention mechanism of deep learning". Neurocomputing. 452: 48–62. doi:10.1016/j.neucom.2021.03.091. ISSN 0925-2312.
  132. ^ Cho, Kyunghyun; van Merrienboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014-06-03). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". arXiv:1406.1078.
  133. ^ Sutskever, Ilya; Vinyals, Oriol; Le, Quoc Viet (14 Dec 2014). "Sequence to sequence learning with neural networks". arXiv:1409.3215 [cs.CL].
  134. ^ Vinyals, Oriol; Toshev, Alexander; Bengio, Samy; Erhan, Dumitru (2015). "Show and Tell: A Neural Image Caption Generator": 3156–3164. arXiv:1411.4555.
  135. ^ Xu, Kelvin; Ba, Jimmy; Kiros, Ryan; Cho, Kyunghyun; Courville, Aaron; Salakhudinov, Ruslan; Zemel, Rich; Bengio, Yoshua (2015-06-01). "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention". Proceedings of the 32nd International Conference on Machine Learning. PMLR: 2048–2057. arXiv:1502.03044.
  136. ^ a b Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (2016-05-19). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473 [cs.CL].
  137. ^ Graves, Alex; Wayne, Greg; Danihelka, Ivo (2014-12-10). "Neural Turing Machines". arXiv:1410.5401 [cs.NE].
  138. ^ Cheng, Jianpeng; Dong, Li; Lapata, Mirella (2016-09-20). "Long Short-Term Memory-Networks for Machine Reading". arXiv:1601.06733 [cs.CL].
  139. ^ Parikh, Ankur P.; Täckström, Oscar; Das, Dipanjan; Uszkoreit, Jakob (2016-09-25). "A Decomposable Attention Model for Natural Language Inference". arXiv:1606.01933 [cs.CL].
  140. ^ Levy, Steven. "8 Google Employees Invented Modern AI. Here's the Inside Story". Wired. ISSN 1059-1028. Archived from the original on 20 March 2024. Retrieved 2024-08-06.
  141. ^ Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need" (PDF). Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
  142. ^ Peng, Bo; Alcaide, Eric; Anthony, Quentin; Albalak, Alon; Arcadinho, Samuel; Biderman, Stella; Cao, Huanqi; Cheng, Xin; Chung, Michael (2023-12-10). "RWKV: Reinventing RNNs for the Transformer Era". arXiv:2305.13048 [cs.CL].
  143. ^ Kohonen, Teuvo; Honkela, Timo (2007). "Kohonen Network". Scholarpedia. 2 (1): 1568. Bibcode:2007SchpJ...2.1568K. doi:10.4249/scholarpedia.1568.
  144. ^ Kohonen, Teuvo (1982). "Self-Organized Formation of Topologically Correct Feature Maps". Biological Cybernetics. 43 (1): 59–69. doi:10.1007/bf00337288. S2CID 206775459.
  145. ^ Von der Malsburg, C (1973). "Self-organization of orientation sensitive cells in the striate cortex". Kybernetik. 14 (2): 85–100. doi:10.1007/bf00288907. PMID 4786750. S2CID 3351573.
  146. ^ Ackley, David H.; Hinton, Geoffrey E.; Sejnowski, Terrence J. (1985-01-01). "A learning algorithm for boltzmann machines". Cognitive Science. 9 (1): 147–169. doi:10.1016/S0364-0213(85)80012-4. ISSN 0364-0213.
  147. ^ Smolensky, Paul (1986). "Chapter 6: Information Processing in Dynamical Systems: Foundations of Harmony Theory" (PDF). In Rumelhart, David E.; McLelland, James L. (eds.). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press. pp. 194–281. ISBN 0-262-68053-X.
  148. ^ Dayan, Peter; Hinton, Geoffrey E.; Neal, Radford M.; Zemel, Richard S. (1995). "The Helmholtz machine". Neural Computation. 7 (5): 889–904. doi:10.1162/neco.1995.7.5.889. hdl:21.11116/0000-0002-D6D3-E. PMID 7584891. S2CID 1890561.
  149. ^ Hinton, Geoffrey E.; Dayan, Peter; Frey, Brendan J.; Neal, Radford (1995-05-26). "The wake-sleep algorithm for unsupervised neural networks". Science. 268 (5214): 1158–1161. Bibcode:1995Sci...268.1158H. doi:10.1126/science.7761831. PMID 7761831. S2CID 871473.
  150. ^ Sejnowski, Terrence J. (2018). The deep learning revolution. Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03803-4.
  151. ^ Smolensky, P. (1986). "Information processing in dynamical systems: Foundations of harmony theory.". In D. E. Rumelhart; J. L. McClelland; PDP Research Group (eds.). Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1. MIT Press. pp. 194–281. ISBN 9780262680530.
  152. ^ Hinton, G. E.; Osindero, S.; Teh, Y. (2006). "A fast learning algorithm for deep belief nets" (PDF). Neural Computation. 18 (7): 1527–1554. CiteSeerX 10.1.1.76.1541. doi:10.1162/neco.2006.18.7.1527. PMID 16764513. S2CID 2309950.
  153. ^ Hinton, Geoffrey (2009-05-31). "Deep belief networks". Scholarpedia. 4 (5): 5947. Bibcode:2009SchpJ...4.5947H. doi:10.4249/scholarpedia.5947. ISSN 1941-6016.
  154. ^ Ng, Andrew; Dean, Jeff (2012). "Building High-level Features Using Large Scale Unsupervised Learning". arXiv:1112.6209 [cs.LG].
  155. ^ Watkin, Timothy L. H.; Rau, Albrecht; Biehl, Michael (1993-04-01). "The statistical mechanics of learning a rule". Reviews of Modern Physics. 65 (2): 499–556. Bibcode:1993RvMP...65..499W. doi:10.1103/RevModPhys.65.499.
  156. ^ Schwarze, H; Hertz, J (1992-10-15). "Generalization in a Large Committee Machine". Europhysics Letters (EPL). 20 (4): 375–380. Bibcode:1992EL.....20..375S. doi:10.1209/0295-5075/20/4/015. ISSN 0295-5075.
  157. ^ Mato, G; Parga, N (1992-10-07). "Generalization properties of multilayered neural networks". Journal of Physics A: Mathematical and General. 25 (19): 5047–5054. Bibcode:1992JPhA...25.5047M. doi:10.1088/0305-4470/25/19/017. ISSN 0305-4470.
  158. ^ Hansel, D; Mato, G; Meunier, C (1992-11-01). "Memorization Without Generalization in a Multilayered Neural Network". Europhysics Letters (EPL). 20 (5): 471–476. Bibcode:1992EL.....20..471H. doi:10.1209/0295-5075/20/5/015. ISSN 0295-5075.
  159. ^ Schmidhuber, Jürgen (1992). "Learning complex, extended sequences using the principle of history compression" (PDF). Neural Computation. 4 (2): 234–242. doi:10.1162/neco.1992.4.2.234. S2CID 18271205.[permanent dead link]
  160. ^ Hanson, Stephen; Pratt, Lorien (1988). "Comparing Biases for Minimal Network Construction with Back-Propagation". Advances in Neural Information Processing Systems. 1. Morgan-Kaufmann.
  161. ^ LeCun, Yann; Denker, John; Solla, Sara (1989). "Optimal Brain Damage". Advances in Neural Information Processing Systems. 2. Morgan-Kaufmann.
  162. ^ Mead, Carver A.; Ismail, Mohammed (8 May 1989). Analog VLSI Implementation of Neural Systems (PDF). The Kluwer International Series in Engineering and Computer Science. Vol. 80. Norwell, MA: Kluwer Academic Publishers. doi:10.1007/978-1-4613-1639-8. ISBN 978-1-4613-1639-8.
  163. ^ Yang, J. J.; Pickett, M. D.; Li, X. M.; Ohlberg, D. A. A.; Stewart, D. R.; Williams, R. S. (2008). "Memristive switching mechanism for metal/oxide/metal nanodevices". Nat. Nanotechnol. 3 (7): 429–433. doi:10.1038/nnano.2008.160. PMID 18654568.
  164. ^ Strukov, D. B.; Snider, G. S.; Stewart, D. R.; Williams, R. S. (2008). "The missing memristor found". Nature. 453 (7191): 80–83. Bibcode:2008Natur.453...80S. doi:10.1038/nature06932. PMID 18451858. S2CID 4367148.