Chinchilla (language model)
Chinchilla is a family of large language models (LLMs) developed by the research team at Google DeepMind, presented in March 2022.[1]
Models
It is named "chinchilla" because it is a further development of a previous model family named Gopher. Both model families were trained in order to investigate the scaling laws of large language models.[2]
Chinchilla was claimed to outperform GPT-3. It considerably simplifies downstream use because it requires far less compute for inference and fine-tuning. From the training runs of earlier language models, DeepMind concluded that for every doubling of model size the number of training tokens should also be doubled, and Chinchilla was trained according to this hypothesis. Trained at roughly the same cost as Gopher, Chinchilla has 70 billion parameters and was trained on four times as much data.[3]
Chinchilla has an average accuracy of 67.5% on the Measuring Massive Multitask Language Understanding (MMLU) benchmark, which is 7% higher than Gopher's performance. Chinchilla was still in the testing phase as of January 12, 2023.[4]
Chinchilla contributed to developing an effective training paradigm for large autoregressive language models with limited compute resources. The Chinchilla team recommends doubling the number of training tokens for every doubling of model size, meaning that training on larger, higher-quality datasets can lead to better results on downstream tasks.[5][6]
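This scaling rule can be illustrated with a short calculation. The sketch below is a minimal illustration written for this article, not DeepMind's code; it assumes the widely used approximation that training a dense transformer costs about C ≈ 6·N·D floating-point operations for N parameters and D training tokens, and applies the rule that N and D should be scaled in equal proportion.

```python
# Minimal sketch of the Chinchilla scaling rule: scale parameters (N) and
# training tokens (D) in equal proportion as the compute budget grows.
# Assumes the common approximation C ≈ 6 * N * D training FLOPs.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute for a dense transformer."""
    return 6.0 * n_params * n_tokens

# Hypothetical starting point: a 10B-parameter model trained on 200B tokens.
n, d = 10e9, 200e9
for step in range(4):
    print(f"N = {n / 1e9:6.0f}B params, D = {d / 1e9:6.0f}B tokens, "
          f"C ~ {training_flops(n, d):.2e} FLOPs")
    # Doubling the model size calls for doubling the training tokens,
    # which quadruples the compute budget.
    n, d = 2 * n, 2 * d
```

Each doubling of both quantities multiplies the required compute by four, which is why compute-optimal training shifts budget toward data rather than ever-larger models.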
It has been used for the Flamingo vision-language model.[7]
Architecture
Both the Gopher and Chinchilla families are transformer models. In particular, they are essentially the same as GPT-2, with different sizes and minor modifications: the Gopher family uses RMSNorm instead of LayerNorm, and relative rather than absolute positional encoding. The Chinchilla family is the same as the Gopher family, but trained with the AdamW optimizer instead of the Adam optimizer.
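As an illustration of the normalization difference, the following sketch contrasts LayerNorm and RMSNorm on a toy batch of activation vectors. It is a minimal NumPy implementation written for this article, not code from either paper; in practice the gain g (and the LayerNorm bias b) are learned parameters.

```python
import numpy as np

def layer_norm(x: np.ndarray, g: np.ndarray, b: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """LayerNorm: subtract the mean, divide by the standard deviation, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return g * (x - mu) / np.sqrt(var + eps) + b

def rms_norm(x: np.ndarray, g: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """RMSNorm: rescale by the root mean square only; no mean subtraction and no bias."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return g * x / rms

x = np.random.randn(4, 512)   # a toy batch of 4 activation vectors
g = np.ones(512)
b = np.zeros(512)
print(layer_norm(x, g, b).shape, rms_norm(x, g).shape)  # (4, 512) (4, 512)
```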
The Gopher family contains six models of increasing size, from 44 million to 280 billion parameters; the largest is referred to simply as "Gopher". The same naming convention applies to the Chinchilla family.
Table 1 of [2] shows the entire Gopher family:
| Parameter count | Layers | Number of heads | Key/value size | Internal dimension | Max learning rate | Batch size (tokens) |
|---|---|---|---|---|---|---|
| 44M | 8 | 16 | 32 | 512 | 6 × 10⁻⁴ | 0.25M |
| 117M | 12 | 12 | 64 | 768 | 6 × 10⁻⁴ | 0.25M |
| 417M | 12 | 12 | 128 | 1,536 | 2 × 10⁻⁴ | 0.25M |
| 1.4B | 24 | 16 | 128 | 2,048 | 2 × 10⁻⁴ | 0.25M |
| 7.1B | 32 | 32 | 128 | 4,096 | 1.2 × 10⁻⁴ | 2M |
| Gopher 280B | 80 | 128 | 128 | 16,384 | 4 × 10⁻⁵ | 3M → 6M |
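The parameter counts in the table can be roughly reproduced from the layer count and internal dimension. The sketch below is a back-of-the-envelope estimate written for this article, assuming a standard GPT-2-style block (about 12·d² parameters per layer for attention plus feed-forward) and a hypothetical vocabulary size of 32,000 for the embedding matrix; it ignores biases, normalization parameters, and other small terms, so it undershoots the reported figures by a few percent.

```python
# Rough parameter-count estimate for a GPT-2-style decoder-only transformer.
# Per layer: ~4*d^2 for attention (Q, K, V, output projections) and ~8*d^2
# for a feed-forward block with a 4*d hidden width, i.e. ~12*d^2 in total.
VOCAB = 32_000  # assumed vocabulary size, for illustration only

def approx_params(n_layers: int, d_model: int, vocab: int = VOCAB) -> float:
    per_layer = 12 * d_model ** 2
    embedding = vocab * d_model
    return n_layers * per_layer + embedding

for name, layers, d in [("44M", 8, 512), ("7.1B", 32, 4096), ("Gopher 280B", 80, 16_384)]:
    print(f"{name:12s} ~ {approx_params(layers, d) / 1e9:7.2f}B parameters")
```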
Table 4 of [1] compares the 70-billion-parameter Chinchilla with Gopher 280B.
| Parameter count | Layers | Number of heads | Key/value size | Internal dimension | Max learning rate | Batch size (tokens) |
|---|---|---|---|---|---|---|
| Gopher 280B | 80 | 128 | 128 | 16,384 | 4 × 10⁻⁵ | 3M → 6M |
| Chinchilla 70B | 80 | 64 | 128 | 8,192 | 1 × 10⁻⁴ | 1.5M → 3M |
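The claim that Chinchilla and Gopher were trained at a similar cost can be checked with the same C ≈ 6·N·D approximation used above. The figures of roughly 300 billion training tokens for Gopher and 1.4 trillion for Chinchilla are taken from the respective papers; the sketch is an order-of-magnitude check, not an exact accounting.

```python
# Order-of-magnitude check: Gopher 280B and Chinchilla 70B use a similar
# training compute budget under the C ≈ 6 * N * D approximation.
models = {
    "Gopher 280B":    (280e9, 300e9),    # (parameters, training tokens)
    "Chinchilla 70B": (70e9, 1.4e12),
}
for name, (n_params, n_tokens) in models.items():
    flops = 6 * n_params * n_tokens
    print(f"{name:15s} ~ {flops:.1e} training FLOPs")
# Both come out at roughly 5-6 × 10^23 FLOPs.
```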
References
1. Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Buchatskaya, Elena; Cai, Trevor; Rutherford, Eliza; Casas, Diego de Las; Hendricks, Lisa Anne; Welbl, Johannes; Clark, Aidan; Hennigan, Tom; Noland, Eric; Millican, Katie; Driessche, George van den; Damoc, Bogdan (2022-03-29). "Training Compute-Optimal Large Language Models". arXiv:2203.15556 [cs.CL].
2. Rae, Jack W.; Borgeaud, Sebastian; Cai, Trevor; Millican, Katie; Hoffmann, Jordan; Song, Francis; Aslanides, John; Henderson, Sarah; Ring, Roman; Young, Susannah; Rutherford, Eliza; Hennigan, Tom; Menick, Jacob; Cassirer, Albin; Powell, Richard (2022-01-21). "Scaling Language Models: Methods, Analysis & Insights from Training Gopher". arXiv:2112.11446 [cs.CL].
3. Eliaçık, Eray (January 12, 2023). "Chinchilla AI is coming for the GPT-3's throne". Dataconomy. Archived from the original on March 26, 2023.
4. Hendrycks, Dan (2023-03-14). Measuring Massive Multitask Language Understanding. Archived from the original on 2023-03-15. Retrieved 2023-03-15.
5. Chaithali, G. (April 9, 2022). "Check Out This DeepMind's New Language Model, Chinchilla (70B Parameters), Which Significantly Outperforms Gopher (280B) and GPT-3 (175B) on a Large Range of Downstream Evaluation Tasks". Archived from the original on March 27, 2023. Retrieved January 15, 2023.
6. Wali, Kartik (April 12, 2022). "DeepMind launches GPT-3 rival, Chinchilla". Analytics India Magazine. Archived from the original on March 26, 2023. Retrieved January 15, 2023.
7. Alayrac, Jean-Baptiste; Donahue, Jeff; Luc, Pauline; Miech, Antoine; Barr, Iain; Hasson, Yana; Lenc, Karel; Mensch, Arthur; Millican, Katherine; Reynolds, Malcolm; Ring, Roman; Rutherford, Eliza; Cabi, Serkan; Han, Tengda; Gong, Zhitao (2022-12-06). "Flamingo: a Visual Language Model for Few-Shot Learning". Advances in Neural Information Processing Systems. 35: 23716–23736. arXiv:2204.14198.