Anscombe's quartet: Difference between revisions

From Wikipedia, the free encyclopedia
Tybugg (talk | contribs)
remove {{clear}}. I don't know why this was used in the first place but it inserts a dramatically large space.
 
(27 intermediate revisions by 17 users not shown)
Line 1: Line 1:
{{Short description|Four data sets with the same descriptive statistics, yet very different distributions}}
{{Short description|Four data sets with the same descriptive statistics, yet very different distributions}}
[[File:Anscombe's quartet 3.svg|right|425px|thumb|All four sets are identical when examined using simple summary statistics, but vary considerably when graphed]]
[[File:Anscombe's quartet 3.svg|right|425px|thumb|The four [[dataset]]s composing Anscombe's quartet. All four sets have identical statistical parameters, but the graphs show them to be considerably different]]


'''Anscombe's quartet''' comprises four [[data set]]s that have nearly identical simple [[descriptive statistics]], yet have very different [[probability distribution|distributions]] and appear very different when [[Plot (graphics)|graphed]]. Each dataset consists of eleven [[Cartesian coordinate system|(''x'',''y'') points]]. They were constructed in 1973 by the [[statistician]] [[Francis Anscombe]] to demonstrate both the importance of graphing data when analyzing it, and the effect of [[outlier]]s and other [[influential observations]] on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough."<ref name="Anscombe">{{cite journal |last=Anscombe |first=F. J. |authorlink=Frank Anscombe |title=Graphs in Statistical Analysis |journal=[[American Statistician]] |volume=27 |year=1973 |issue=1 |pages=17–21 |jstor=2682899|doi=10.1080/00031305.1973.10478966}}</ref>
'''Anscombe's quartet''' comprises four [[dataset]]s that have nearly identical simple [[descriptive statistics]], yet have very different [[probability distribution|distributions]] and appear very different when [[Plot (graphics)|graphed]]. Each dataset consists of eleven [[Cartesian coordinate system|(''x'',&nbsp;''y'') points]]. They were constructed in 1973 by the [[statistician]] [[Francis Anscombe]] to demonstrate both the importance of graphing data when analyzing it, and the effect of [[outlier]]s and other [[influential observations]] on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough".<ref name="Anscombe">{{cite journal |last=Anscombe |first=F. J. |authorlink=Frank Anscombe |title=Graphs in Statistical Analysis |journal=[[American Statistician]] |volume=27 |year=1973 |issue=1 |pages=17–21 |jstor=2682899 |doi=10.1080/00031305.1973.10478966}}</ref>


==Data==
== Data ==


For all four datasets:
For all four datasets:
Line 16: Line 16:
| exact
| exact
|-
|-
| Sample [[variance]] of ''x'' : ''s''{{supsub|2|''x''}}
| Sample [[variance]] of ''x'': ''s''{{supsub|2|''x''}}
| 11
| 11
| exact
| exact
Line 24: Line 24:
| to 2 decimal places
| to 2 decimal places
|-
|-
| Sample variance of ''y'' : ''s''{{supsub|2|''y''}}
| Sample variance of ''y'': ''s''{{supsub|2|''y''}}
| 4.125
| 4.125
| ±0.003
| ±0.003
Line 36: Line 36:
| to 2 and 3 decimal places, respectively
| to 2 and 3 decimal places, respectively
|-
|-
| [[Coefficient of determination]] of the linear regression : <math>R^2</math>
| [[Coefficient of determination]] of the linear regression: <math>R^2</math>
| 0.67
| 0.67
| to 2 decimal places
| to 2 decimal places
|}
|}
<!-- to be added to table above:
<!-- to be added to table above:
sums of squared errors (about the mean) = 110.0 <br />
sums of squared errors (about the mean) = 110.0
regression sums of squared errors (variance accounted for by ''x'') = 27.5 <br />
regression sums of squared errors (variance accounted for by ''x'') = 27.5
residual sums of squared errors (about the regression line) = 13.75 <br />
residual sums of squared errors (about the regression line) = 13.75
-->
coefficient of determination = 0.67 <br />
-->* The first [[scatter plot]] (top left) appears to be a simple [[linear relationship]], corresponding to two [[variable (mathematics)|variable]]s correlated where y could be modelled as [[normal distribution|gaussian]] with mean linearly dependent on&nbsp;''x''.
* The first [[scatter plot]] (top left) appears to be a simple [[linear relationship]], corresponding to two correlated [[variable (mathematics)|variables]], where ''y'' could be modelled as [[normal distribution|gaussian]] with mean linearly dependent on&nbsp;''x''.
* The second graph (top right); while a relationship between the two variables is obvious, it is not linear, and the [[Pearson correlation coefficient]] is not relevant. A more general regression and the corresponding [[coefficient of determination]] would be more appropriate.
* For the second graph (top right), while a relationship between the two variables is obvious, it is not linear, and the [[Pearson correlation coefficient]] is not relevant. A more general regression and the corresponding [[coefficient of determination]] would be more appropriate.
* In the third graph (bottom left), the modelled relationship is linear, but should have a different [[regression line]] (a [[robust regression]] would have been called for). The calculated regression is offset by the one [[outlier]] which exerts enough influence to lower the correlation coefficient from 1 to 0.816.
* In the third graph (bottom left), the modelled relationship is linear, but should have a different [[regression line]] (a [[robust regression]] would have been called for). The calculated regression is offset by the one [[outlier]], which exerts enough influence to lower the correlation coefficient from 1 to 0.816.
* Finally, the fourth graph (bottom right) shows an example when one [[high-leverage point]] is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables.
* Finally, the fourth graph (bottom right) shows an example when one [[high-leverage point]] is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables.


The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze according to a particular type of relationship, and the inadequacy of basic statistic properties for describing realistic datasets.<ref>{{cite journal| url=http://physics.info/linear-regression/practice.shtml#4 |title=Linear Regression |journal=The Physics Hypertextbook |last=Elert |first=Glenn|year=2021 }}</ref><ref>{{cite book |last=Janert |first=Philipp K. |title=Data Analysis with Open Source Tools |year=2010 |publisher=[[O'Reilly Media]] |pages=[https://archive.org/details/isbn_9780596802356/page/65 65–66] |isbn=978-0-596-80235-6 |url=https://archive.org/details/isbn_9780596802356/page/65 }}</ref><ref>{{cite book |last1=Chatterjee |first1=Samprit |last2=Hadi |first2=Ali S. |year=2006 |title=Regression Analysis by Example |publisher=John Wiley and Sons |page=91 |isbn=0-471-74696-7}}</ref><ref>{{cite book |last1=Saville |first1=David J. |last2=Wood |first2=Graham R. |year=1991 |title=Statistical Methods: The geometric approach |publisher=[[Springer Science+Business Media|Springer]] |page=418 |isbn=0-387-97517-9}}</ref><ref>{{cite book |last=Tufte |first=Edward R. |authorlink=Edward Tufte |year=2001 |title=The Visual Display of Quantitative Information |edition=2nd |location=Cheshire, CT |publisher=Graphics Press |isbn=0-9613921-4-2 |url=https://archive.org/details/visualdisplayofq00tuft }}</ref>
The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze according to a particular type of relationship, and the inadequacy of basic statistic properties for describing realistic datasets.<ref>{{cite journal |url=http://physics.info/linear-regression/practice.shtml#4 |title=Linear Regression |journal=The Physics Hypertextbook |last=Elert |first=Glenn |year=2021 |access-date=2017-02-23 |archive-date=2020-10-01 |archive-url=https://web.archive.org/web/20201001193224/http://physics.info/linear-regression/practice.shtml#4 |url-status=live }}</ref><ref>{{cite book |last=Janert |first=Philipp K. |title=Data Analysis with Open Source Tools |year=2010 |publisher=[[O'Reilly Media]] |pages=[https://archive.org/details/isbn_9780596802356/page/65 65–66] |isbn=978-0-596-80235-6 |url=https://archive.org/details/isbn_9780596802356/page/65 }}</ref><ref>{{cite book |last1=Chatterjee |first1=Samprit |last2=Hadi |first2=Ali S. |year=2006 |title=Regression Analysis by Example |publisher=John Wiley and Sons |page=91 |isbn=0-471-74696-7}}</ref><ref>{{cite book |last1=Saville |first1=David J. |last2=Wood |first2=Graham R. |year=1991 |title=Statistical Methods: The geometric approach |publisher=[[Springer Science+Business Media|Springer]] |page=418 |isbn=0-387-97517-9}}</ref><ref>{{cite book |last=Tufte |first=Edward R. |authorlink=Edward Tufte |year=2001 |title=The Visual Display of Quantitative Information |edition=2nd |location=Cheshire, CT |publisher=Graphics Press |isbn=0-9613921-4-2 |url=https://archive.org/details/visualdisplayofq00tuft }}</ref>


The datasets are as follows. The ''x'' values are the same for the first three datasets.<ref name="Anscombe"/>
The datasets are as follows. The ''x'' values are the same for the first three datasets.<ref name="Anscombe"/>
<div class="center" style="width:auto; margin-left:auto; margin-right:auto;">

{| class="wikitable" style="text-align: center; margin-left:auto; margin-right:auto;" border="1"
{| class="wikitable" style="text-align: center"
|+ Anscombe's quartet
|+ Anscombe's quartet
|-
|-
! colspan="2"| I
! colspan="2"| Dataset I
! colspan="2"| II
! colspan="2"| Dataset II
! colspan="2"| III
! colspan="2"| Dataset III
! colspan="2"| IV
! colspan="2"| Dataset IV
|-
|-
| ''x''
| ''x''
Line 93: Line 93:
| 5.0 || 5.68 || 5.0 || 4.74 || 5.0 || 5.73 || 8.0 || 6.89
| 5.0 || 5.68 || 5.0 || 4.74 || 5.0 || 5.73 || 8.0 || 6.89
|}
|}
</div>


It is not known how Anscombe created his datasets.<ref name="ChatterjeeFirat">{{cite journal |last1=Chatterjee |first1=Sangit |last2=Firat |first2=Aykut |year=2007 |title=Generating Data with Identical Statistics but Dissimilar Graphics: A follow up to the Anscombe dataset |journal=[[The American Statistician]] |volume=61 |issue=3 |pages=248–254 |doi=10.1198/000313007X220057| jstor=27643902|s2cid=121163371 }}</ref> Since its publication, several methods to generate similar data sets with identical statistics and dissimilar graphics have been developed.<ref name="ChatterjeeFirat"/><ref>{{cite journal |last1=Matejka |first1=Justin |last2=Fitzmaurice |first2=George |year=2017 |title=Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing |journal=[[Conference on Human Factors in Computing Systems|Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems]] |pages=1290–1294 |doi=10.1145/3025453.3025912|s2cid=9247543 }}</ref>
It is not known how Anscombe created his datasets.<ref name="ChatterjeeFirat">{{cite journal |last1=Chatterjee |first1=Sangit |last2=Firat |first2=Aykut |year=2007 |title=Generating Data with Identical Statistics but Dissimilar Graphics: A follow up to the Anscombe dataset |journal=[[The American Statistician]] |volume=61 |issue=3 |pages=248–254 |doi=10.1198/000313007X220057| jstor=27643902|s2cid=121163371 }}</ref> Since its publication, several methods to generate similar datasets with identical statistics and dissimilar graphics have been developed.<ref name="ChatterjeeFirat"/><ref>{{cite book |last1=Matejka |first1=Justin |last2=Fitzmaurice |first2=George |title=Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems |chapter=Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing |year=2017 |pages=1290–1294 |doi=10.1145/3025453.3025912|isbn=9781450346559 |s2cid=9247543 }}</ref>
One of these, the ''Datasaurus Dozen'', consists of points tracing out the outline of a dinosaur, plus twelve other data sets that have the same summary statistics.<ref>{{Cite journal|last1=Murray|first1=Lori L.|last2=Wilson|first2=John G.|date=April 2021|title=Generating data sets for teaching the importance of regression analysis|url=https://onlinelibrary.wiley.com/doi/10.1111/dsji.12233|journal=Decision Sciences Journal of Innovative Education|language=en|volume=19|issue=2|pages=157–166|doi=10.1111/dsji.12233|s2cid=233609149|issn=1540-4595}}</ref><ref>{{Citation|last1=Andrienko|first1=Natalia|title=Visual Analytics for Investigating and Processing Data|date=2020|url=http://link.springer.com/10.1007/978-3-030-56146-8_5|work=Visual Analytics for Data Scientists|pages=151–180|place=Cham|publisher=Springer International Publishing|language=en|doi=10.1007/978-3-030-56146-8_5|isbn=978-3-030-56145-1|access-date=2021-04-20|last2=Andrienko|first2=Gennady|last3=Fuchs|first3=Georg|last4=Slingsby|first4=Aidan|last5=Turkay|first5=Cagatay|last6=Wrobel|first6=Stefan|s2cid=226648414}}</ref><ref>{{Cite web|last1=Matejka|first1=Justin|last2=Fitzmaurice|first2=George|date=2017|title=Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing|url=https://www.autodesk.com/research/publications/same-stats-different-graphs|url-status=live|access-date=2021-04-20|website=Autodesk Research|language=en-US|archive-url=https://web.archive.org/web/20201004003855/https://www.autodesk.com/research/publications/same-stats-different-graphs |archive-date=2020-10-04 }}</ref> ''Datasaurus Dozen'' was created by Justin Matejka and George Fitzmaurice. The process is described in their paper “Same stats, different graphs: generating datasets with varied appearence and identical statistics through simulated annealing“.
One of these, the ''[[Datasaurus dozen]]'', consists of points tracing out the outline of a dinosaur, plus twelve other datasets that have the same summary statistics.<ref>{{Cite web |last1=Matejka |first1=Justin |last2=Fitzmaurice |first2=George |date=2017 |title=Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing |url=https://www.autodesk.com/research/publications/same-stats-different-graphs |url-status=live |access-date=2021-04-20 |website=Autodesk Research |language=en-US |archive-url=https://web.archive.org/web/20201004003855/https://www.autodesk.com/research/publications/same-stats-different-graphs |archive-date=2020-10-04 }}</ref><ref>{{Cite journal |last1=Murray |first1=Lori L. |last2=Wilson |first2=John G. |date=April 2021 |title=Generating data sets for teaching the importance of regression analysis |url=https://onlinelibrary.wiley.com/doi/10.1111/dsji.12233 |journal=Decision Sciences Journal of Innovative Education |language=en |volume=19 |issue=2 |pages=157–166 |doi=10.1111/dsji.12233 |s2cid=233609149 |issn=1540-4595 |access-date=2021-04-20 |archive-date=2021-04-23 |archive-url=https://web.archive.org/web/20210423155254/https://onlinelibrary.wiley.com/doi/10.1111/dsji.12233 |url-status=live }}</ref><ref>{{Citation |last1=Andrienko |first1=Natalia |author1-link=Natalia Andrienko |title=Visual Analytics for Investigating and Processing Data |date=2020 |url=http://link.springer.com/10.1007/978-3-030-56146-8_5 |work=Visual Analytics for Data Scientists |pages=151–180 |place=Cham |publisher=Springer International Publishing |language=en |doi=10.1007/978-3-030-56146-8_5 |isbn=978-3-030-56145-1 |access-date=2021-04-20 |last2=Andrienko |first2=Gennady |last3=Fuchs |first3=Georg |last4=Slingsby |first4=Aidan |last5=Turkay |first5=Cagatay |last6=Wrobel |first6=Stefan |s2cid=226648414 |postscript=. 
|archive-date=2024-10-03 |archive-url=https://web.archive.org/web/20241003162552/https://link.springer.com/chapter/10.1007/978-3-030-56146-8_5 |url-status=live }}</ref>

The datasaurus Dozen proves us as much as Anscombe Quartet why visualizing our data is important as summary statistics can be the same while distributions can be very different.


==See also==
== See also ==
* [[Datasaurus dozen]]
*[[Exploratory data analysis]]
*[[Goodness of fit]]
* [[Exploratory data analysis]]
*[[Regression validation]]
* [[Goodness of fit]]
*[[Simpson's paradox]]
* [[Regression validation]]
* [[Simpson's paradox]]
*[[Statistical model validation]]
* [[Statistical model validation]]


==References==
== References ==
{{Reflist}}
{{Reflist}}


==External links==
== External links ==
*[http://www.upscale.utoronto.ca/GeneralInterest/Harrison/Visualisation/Visualisation.html Department of Physics, University of Toronto]
* [http://www.upscale.utoronto.ca/GeneralInterest/Harrison/Visualisation/Visualisation.html Department of Physics, University of Toronto]
*[https://www.geogebra.org/m/tbwXxySn Dynamic Applet] made in [[GeoGebra]] showing the data & statistics and also allowing the points to be dragged (Set 5).
* [https://www.geogebra.org/m/tbwXxySn Dynamic Applet] made in [[GeoGebra]] showing the data & statistics and also allowing the points to be dragged (Set 5).
*[https://www.autodeskresearch.com/publications/samestats Animated examples from Autodesk] called the "Datasaurus Dozen".
* [https://www.autodeskresearch.com/publications/samestats Animated examples from Autodesk] called the "Datasaurus Dozen".
*[https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/anscombe.html Documentation] for the datasets in [[R (programming language)|R]].
* [https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/anscombe.html Documentation] for the datasets in [[R (programming language)|R]].


[[Category:Misuse of statistics]]
[[Category:Misuse of statistics]]

Latest revision as of 02:00, 12 October 2024

The four datasets composing Anscombe's quartet. All four sets have identical statistical parameters, but the graphs show them to be considerably different

Anscombe's quartet comprises four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x, y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it, and the effect of outliers and other influential observations on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough".[1]

Data

For all four datasets:

Property                                                    Value              Accuracy
Mean of x                                                   9                  exact
Sample variance of x: s_x^2                                 11                 exact
Mean of y                                                   7.50               to 2 decimal places
Sample variance of y: s_y^2                                 4.125              ±0.003
Correlation between x and y                                 0.816              to 3 decimal places
Linear regression line                                      y = 3.00 + 0.500x  to 2 and 3 decimal places, respectively
Coefficient of determination of the linear regression: R^2  0.67               to 2 decimal places

  • The first scatter plot (top left) appears to show a simple linear relationship, corresponding to two correlated variables, where y could be modelled as Gaussian with mean linearly dependent on x.
  • For the second graph (top right), while a relationship between the two variables is obvious, it is not linear, and the Pearson correlation coefficient is not relevant. A more general regression and the corresponding coefficient of determination would be more appropriate.
  • In the third graph (bottom left), the modelled relationship is linear, but should have a different regression line (a robust regression would have been called for). The calculated regression is offset by the one outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816.
  • Finally, the fourth graph (bottom right) shows an example in which one high-leverage point is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables.
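
The effect of the single outlier in Dataset III and of the single high-leverage point in Dataset IV can be checked numerically. The following is a minimal sketch in plain Python using the values from the quartet's data table; the helper names `pearson` and `drop_point` are illustrative assumptions, not part of any cited source.

```python
import math


def pearson(xs, ys):
    """Pearson correlation, or None when either variable has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a constant variable
    return sxy / math.sqrt(sxx * syy)


def drop_point(xs, ys, index):
    """Return copies of xs and ys with the point at `index` removed."""
    return xs[:index] + xs[index + 1:], ys[:index] + ys[index + 1:]


# Dataset III: one outlier at (13.0, 12.74).
x3 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

# Dataset IV: one high-leverage point at (19.0, 12.50); every other x is 8.0.
x4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

r3_full = pearson(x3, y3)
r3_trimmed = pearson(*drop_point(x3, y3, y3.index(12.74)))

r4_full = pearson(x4, y4)
r4_trimmed = pearson(*drop_point(x4, y4, x4.index(19.0)))

print("Dataset III:", r3_full, "->", r3_trimmed)
print("Dataset IV: ", r4_full, "->", r4_trimmed)
```

Without the outlier, the remaining Dataset III points lie almost exactly on a straight line, so the correlation is very close to 1. Without the high-leverage point, every x in Dataset IV equals 8.0, so the correlation is undefined and the sketch returns `None`.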

The quartet is still often used to illustrate the importance of examining a dataset graphically before analyzing it according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets.[2][3][4][5][6]

The datasets are as follows. The x values are the same for the first three datasets.[1]

Anscombe's quartet
Dataset I Dataset II Dataset III Dataset IV
x y x y x y x y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
9.0 8.81 9.0 8.77 9.0 7.11 8.0 8.84
11.0 8.33 11.0 9.26 11.0 7.81 8.0 8.47
14.0 9.96 14.0 8.10 14.0 8.84 8.0 7.04
6.0 7.24 6.0 6.13 6.0 6.08 8.0 5.25
4.0 4.26 4.0 3.10 4.0 5.39 19.0 12.50
12.0 10.84 12.0 9.13 12.0 8.15 8.0 5.56
7.0 4.82 7.0 7.26 7.0 6.42 8.0 7.91
5.0 5.68 5.0 4.74 5.0 5.73 8.0 6.89
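
The summary statistics given for all four datasets can be recomputed from the table above. The following is a minimal sketch in plain Python; the names `QUARTET`, `summarize` and `SUMMARY` are illustrative assumptions, and the printed values are rounded only for display.

```python
import math

# Anscombe's quartet, transcribed from the table above.
X123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
X4 = [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (X4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Compute the simple descriptive statistics for one dataset."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # least-squares slope of y on x
    intercept = mean_y - slope * mean_x    # least-squares intercept
    r = sxy / math.sqrt(sxx * syy)         # Pearson correlation
    return {
        "mean_x": mean_x,
        "var_x": sxx / (n - 1),            # sample variance (n - 1 denominator)
        "mean_y": mean_y,
        "var_y": syy / (n - 1),
        "r": r,
        "slope": slope,
        "intercept": intercept,
        "r2": r * r,                       # equals R^2 for simple linear regression
    }


SUMMARY = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in SUMMARY.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

The four rows of output agree closely but not exactly, which is why the accuracy column of the summary table states tolerances rather than exact equality.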

It is not known how Anscombe created his datasets.[7] Since its publication, several methods to generate similar datasets with identical statistics and dissimilar graphics have been developed.[7][8] One of these, the Datasaurus dozen, consists of points tracing out the outline of a dinosaur, plus twelve other datasets that have the same summary statistics.[9][10][11]
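
As a rough illustration of how datasets with matching statistics can be produced, the sketch below rescales an arbitrary pattern so that y has a chosen mean, sample variance and correlation with a fixed x. This is a simple linear-algebra construction chosen for illustration; it is not Anscombe's procedure (which is unknown) and not the simulated-annealing approach of the cited papers. The function name `match_statistics` and the `shape` argument are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_cov(a, b):
    """Sample covariance (n - 1 denominator); sample_cov(a, a) is the variance."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (len(a) - 1)


def match_statistics(x, shape, target_mean, target_var, target_r):
    """Return y with the requested mean, sample variance and correlation with x.

    `shape` is any pattern of values; only the part of it that is
    uncorrelated with x is used to give the result its visual form.
    """
    mean_x = mean(x)
    sd_x = math.sqrt(sample_cov(x, x))
    x_centered = [xi - mean_x for xi in x]

    # Residual of `shape` after removing its mean and its linear trend in x.
    mean_s = mean(shape)
    beta = sample_cov(shape, x) / sample_cov(x, x)
    resid = [si - mean_s - beta * xc for si, xc in zip(shape, x_centered)]
    sd_resid = math.sqrt(sample_cov(resid, resid))
    if sd_resid < 1e-12:
        raise ValueError("shape must not be an exact linear function of x")

    sd_y = math.sqrt(target_var)
    weight = math.sqrt(1.0 - target_r ** 2)
    # Mix the standardized x and the standardized residual.
    return [
        target_mean + sd_y * (target_r * xc / sd_x + weight * e / sd_resid)
        for xc, e in zip(x_centered, resid)
    ]


x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
parabola = [-(xi - 11.0) ** 2 for xi in x]  # an arbitrary curved pattern
y_new = match_statistics(x, parabola, 7.5, 4.125, 0.816)

r_new = sample_cov(x, y_new) / math.sqrt(sample_cov(x, x) * sample_cov(y_new, y_new))
print(round(mean(y_new), 3), round(sample_cov(y_new, y_new), 3), round(r_new, 3))
```

Because x is left unchanged and the residual is uncorrelated with x, the mean and variance of x are preserved and the mixing weights fix the variance of y and the correlation. Passing a different `shape` gives a differently shaped scatter plot with the same summary statistics.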

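See also

  • Datasaurus dozen
  • Exploratory data analysis
  • Goodness of fit
  • Regression validation
  • Simpson's paradox
  • Statistical model validation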

References

  1. ^ a b Anscombe, F. J. (1973). "Graphs in Statistical Analysis". The American Statistician. 27 (1): 17–21. doi:10.1080/00031305.1973.10478966. JSTOR 2682899.
  2. ^ Elert, Glenn (2021). "Linear Regression". The Physics Hypertextbook. Archived from the original on 2020-10-01. Retrieved 2017-02-23.
  3. ^ Janert, Philipp K. (2010). Data Analysis with Open Source Tools. O'Reilly Media. pp. 65–66. ISBN 978-0-596-80235-6.
  4. ^ Chatterjee, Samprit; Hadi, Ali S. (2006). Regression Analysis by Example. John Wiley and Sons. p. 91. ISBN 0-471-74696-7.
  5. ^ Saville, David J.; Wood, Graham R. (1991). Statistical Methods: The geometric approach. Springer. p. 418. ISBN 0-387-97517-9.
  6. ^ Tufte, Edward R. (2001). The Visual Display of Quantitative Information (2nd ed.). Cheshire, CT: Graphics Press. ISBN 0-9613921-4-2.
  7. ^ a b Chatterjee, Sangit; Firat, Aykut (2007). "Generating Data with Identical Statistics but Dissimilar Graphics: A follow up to the Anscombe dataset". The American Statistician. 61 (3): 248–254. doi:10.1198/000313007X220057. JSTOR 27643902. S2CID 121163371.
  8. ^ Matejka, Justin; Fitzmaurice, George (2017). "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing". Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems. pp. 1290–1294. doi:10.1145/3025453.3025912. ISBN 9781450346559. S2CID 9247543.
  9. ^ Matejka, Justin; Fitzmaurice, George (2017). "Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing". Autodesk Research. Archived from the original on 2020-10-04. Retrieved 2021-04-20.
  10. ^ Murray, Lori L.; Wilson, John G. (April 2021). "Generating data sets for teaching the importance of regression analysis". Decision Sciences Journal of Innovative Education. 19 (2): 157–166. doi:10.1111/dsji.12233. ISSN 1540-4595. S2CID 233609149. Archived from the original on 2021-04-23. Retrieved 2021-04-20.
  11. ^ Andrienko, Natalia; Andrienko, Gennady; Fuchs, Georg; Slingsby, Aidan; Turkay, Cagatay; Wrobel, Stefan (2020), "Visual Analytics for Investigating and Processing Data", Visual Analytics for Data Scientists, Cham: Springer International Publishing, pp. 151–180, doi:10.1007/978-3-030-56146-8_5, ISBN 978-3-030-56145-1, S2CID 226648414, archived from the original on 2024-10-03, retrieved 2021-04-20.