Mathematical statistics: Difference between revisions

From Wikipedia, the free encyclopedia
Citation bot (talk | contribs)
Altered url. URLs might have been anonymized. | Use this bot. Report bugs. | Suggested by Jay8g | #UCB_toolbar
 
(79 intermediate revisions by 50 users not shown)
Line 1: Line 1:
{{Short description|Branch of statistics}}
{{confused|Mathematics and statistics|Mathematics|Statistics}}
[[Image:Linear regression.svg|thumb|right|300px|Illustration of linear regression on a data set. [[Regression analysis]] is an important part of mathematical statistics.]]
[[Image:Linear regression.svg|thumb|right|300px|Illustration of linear regression on a data set. [[Regression analysis]] is an important part of mathematical statistics.]]
{{Statistics topics sidebar}}
{{Math topics TOC}}


'''Mathematical statistics''' is the application of [[mathematics]] to [[statistics]], which was originally conceived as the science of the state{{citation needed|date=March 2018}} the collection and analysis of facts about a country: its economy, land, military, population, and so on. Mathematical techniques which are used for this include [[mathematical analysis]], [[linear algebra]], [[stochastic analysis]], [[differential equations]], and [[measure-theoretic probability theory]].<ref>{{cite book|last=Lakshmikantham,|first=ed. by D. Kannan,... V.|title=Handbook of stochastic analysis and applications|date=2002|publisher=M. Dekker|location=New York|isbn=0824706609}}</ref><ref>{{cite book|last=Schervish|first=Mark J.|title=Theory of statistics|date=1995|publisher=Springer|location=New York|isbn=0387945466|edition=Corr. 2nd print.}}</ref>
'''Mathematical statistics''' is the application of [[probability theory]] and other mathematical concepts to [[statistics]], as opposed to techniques for collecting statistical data.<ref>{{Cite book |last=Shao |first=Jun |url=https://books.google.com/books?id=_bEPBwAAQBAJ |title=Mathematical Statistics |date=2008-02-03 |publisher=Springer Science & Business Media |isbn=978-0-387-21718-5 |language=en}}</ref> Specific mathematical techniques that are commonly used in statistics include [[mathematical analysis]], [[linear algebra]], [[stochastic analysis]], [[differential equations]], and [[measure theory]].<ref>{{cite book|editor1-last=Kannan|editor1-first=D.|editor2-last=Lakshmikantham|editor2-first=V.|title=Handbook of stochastic analysis and applications|date=2002|publisher=M. Dekker|location=New York|isbn=0824706609}}</ref><ref>{{cite book|last=Schervish|first=Mark J.|title=Theory of statistics|date=1995|publisher=Springer|location=New York|isbn=0387945466|edition=Corr. 2nd print.}}</ref>


==Introduction==
==Introduction==
Statistical science{{clarify|reason=What is "statistical science"?|date=March 2018}} is concerned with the planning of studies, especially with the [[design of experiments|design of randomized experiments]] and with the planning of [[statistical survey|surveys]] using [[random sampling]]. The initial analysis of the data from properly randomized studies{{clarify|reason=What are "randomized studies"? What do we mean by "properly randomized studies"? What makes the studies properly randomized?|date=March 2018}} often follows the study protocol{{which|date=March 2018}}{{clarify|reason=What is a study protocol in this context?|date=March 2018}}. The data from a randomized study can be analyzed to consider secondary hypotheses{{clarify|reason=Are we talking about hypothesis testing?|date=March 2018}} or to suggest new ideas{{clarify|reason=Regarding what?|date=March 2018}}. A secondary analysis of the data from a planned study{{clarify|reason=What exactly is this planned study?|date=March 2018}} uses tools from [[data analysis]].
Statistical data collection is concerned with the planning of studies, especially with the [[design of experiments|design of randomized experiments]] and with the planning of [[statistical survey|surveys]] using [[random sampling]]. The initial analysis of the data often follows the study protocol specified prior to the study being conducted. The data from a study can also be analyzed to consider secondary hypotheses inspired by the initial results, or to suggest new studies. A secondary analysis of the data from a planned study uses tools from [[data analysis]], and the process of doing this is mathematical statistics.


Data analysis is divided into:
Data analysis is divided into:


* [[descriptive statistics]] - the part of statistics that describes data, i.e. summarises the data and their typical properties.
* [[descriptive statistics]] the part of statistics that describes data, i.e. summarises the data and their typical properties.
* [[inferential statistics]] - the part of statistics that draws conclusions from data (using some model for the data): For example, inferential statistics involves selecting a model for the data, checking whether the data fulfill the conditions of a particular model, and with quantifying the involved uncertainty (e.g. using [[confidence interval]]s).
* [[inferential statistics]] the part of statistics that draws conclusions from data (using some model for the data): For example, inferential statistics involves selecting a model for the data, checking whether the data fulfill the conditions of a particular model, and with quantifying the involved uncertainty (e.g. using [[confidence interval]]s).


While the tools of data analysis work best on data from randomized studies, they are also applied to other kinds of data. For example, from [[natural experiments]] and [[observational studies]], in which case the inference is dependent on the model chosen by the statistician, and so subjective.<ref>[[David A. Freedman (statistician)|Freedman, D.A.]] (2005) ''Statistical Models: Theory and Practice'', Cambridge University Press. {{isbn|978-0-521-67105-7}}</ref>
While the tools of data analysis work best on data from randomized studies, they are also applied to other kinds of data. For example, from [[natural experiments]] and [[observational studies]], in which case the inference is dependent on the model chosen by the statistician, and so subjective.<ref>[[David A. Freedman (statistician)|Freedman, D.A.]] (2005) ''Statistical Models: Theory and Practice'', Cambridge University Press. {{isbn|978-0-521-67105-7}}</ref><ref name=Freedman>{{cite book |last1=Freedman |first1=David A. |editor1-last=Collier |editor1-first=David |editor2-last=Sekhon |editor2-first=Jasjeet S. |editor3-last=Stark |editor3-first=Philp B. |title=Statistical Models and Causal Inference: A Dialogue with the Social Sciences |date=2010 |publisher=Cambridge University Press |isbn=978-0-521-12390-7 |url=http://www.cambridge.org/9780521123907}}</ref>

Mathematical statistics has been inspired by and has extended many options{{clarify|reason=What exactly are options?|date=March 2018}} in applied statistics{{citation needed|date=March 2018}}.


==Topics==
==Topics==
Line 19: Line 21:


===Probability distributions===
===Probability distributions===
{{main article|Probability distribution}}
{{main|Probability distribution}}
A [[probability distribution]] assigns a [[probability]] to each [[measure (mathematics)|measurable subset]] of the possible outcomes of a random [[Experiment (probability theory)|experiment]], [[Survey methodology|survey]], or procedure of [[statistical inference]]. Examples are found in experiments whose [[sample space]] is non-numerical, where the distribution would be a [[categorical distribution]]; experiments whose sample space is encoded by discrete [[random variables]], where the distribution can be specified by a [[probability mass function]]; and experiments with sample spaces encoded by continuous random variables, where the distribution can be specified by a [[probability density function]]. More complex experiments, such as those involving [[stochastic processes]] defined in [[continuous time]], may demand the use of more general [[probability measure]]s.
A [[probability distribution]] is a [[function (mathematics)|function]] that assigns a [[probability]] to each [[measure (mathematics)|measurable subset]] of the possible outcomes of a random [[Experiment (probability theory)|experiment]], [[Survey methodology|survey]], or procedure of [[statistical inference]]. Examples are found in experiments whose [[sample space]] is non-numerical, where the distribution would be a [[categorical distribution]]; experiments whose sample space is encoded by discrete [[random variables]], where the distribution can be specified by a [[probability mass function]]; and experiments with sample spaces encoded by continuous random variables, where the distribution can be specified by a [[probability density function]]. More complex experiments, such as those involving [[stochastic processes]] defined in [[continuous time]], may demand the use of more general [[probability measure]]s.


A probability distribution can either be [[Univariate distribution|univariate]] or [[Multivariate distribution|multivariate]]. A univariate distribution gives the probabilities of a single [[random variable]] taking on various alternative values; a multivariate distribution (a joint probability distribution) gives the probabilities of a [[random vector]]—a set of two or more random variables—taking on various combinations of values. Important and commonly encountered univariate probability distributions include the [[binomial distribution]], the [[hypergeometric distribution]], and the [[normal distribution]]. The [[multivariate normal distribution]] is a commonly encountered multivariate distribution.
A probability distribution can either be [[Univariate distribution|univariate]] or [[Multivariate distribution|multivariate]]. A univariate distribution gives the probabilities of a single [[random variable]] taking on various alternative values; a multivariate distribution (a [[joint probability distribution]]) gives the probabilities of a [[random vector]]—a set of two or more random variables—taking on various combinations of values. Important and commonly encountered univariate probability distributions include the [[binomial distribution]], the [[hypergeometric distribution]], and the [[normal distribution]]. The [[multivariate normal distribution]] is a commonly encountered multivariate distribution.


====Special distributions====
====Special distributions====
*[[Normal distribution|Normal distribution]], the most common continuous distribution
*[[Normal distribution]], the most common continuous distribution
*[[Bernoulli distribution]], for the outcome of a single Bernoulli trial (e.g. success/failure, yes/no)
*[[Bernoulli distribution]], for the outcome of a single Bernoulli trial (e.g. success/failure, yes/no)
*[[Binomial distribution]], for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed total number of [[independent (statistics)|independent]] occurrences
*[[Binomial distribution]], for the number of "positive occurrences" (e.g. successes, yes votes, etc.) given a fixed total number of [[independent (statistics)|independent]] occurrences
*[[Negative binomial distribution]], for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs
*[[Negative binomial distribution]], for binomial-type observations but where the quantity of interest is the number of failures before a given number of successes occurs
*[[Geometric distribution]], for binomial-type observations but where the quantity of interest is the number of failures before the first success; a special c*[[Discrete uniform distribution]], for a finite set of values (e.g. the outcome of a fair die)
*[[Geometric distribution]], for binomial-type observations but where the quantity of interest is the number of failures before the first success; a special case of the negative binomial distribution, where the number of successes is one.
*[[Discrete uniform distribution]], for a finite set of values (e.g. the outcome of a fair die)
*[[Continuous uniform distribution]], for continuously distributed values
*[[Continuous uniform distribution]], for continuously distributed values
*[[Poisson distribution]], for the number of occurrences of a Poisson-type event in a given period of time
*[[Poisson distribution]], for the number of occurrences of a Poisson-type event in a given period of time
Line 39: Line 42:


===Statistical inference===
===Statistical inference===
{{main article|Statistical inference}}
{{main|Statistical inference}}
[[Statistical inference]] is the process of drawing conclusions from data that are subject to random variation, for example, observational errors or sampling variation.<ref name="Oxford">Upton, G., Cook, I. (2008) ''Oxford Dictionary of Statistics'', OUP. {{isbn|978-0-19-954145-4}}</ref> Initial requirements of such a system of procedures for [[inference]] and [[Inductive reasoning|induction]] are that the system should produce reasonable answers when applied to well-defined situations and that it should be general enough to be applied across a range of situations. Inferential statistics are used to test hypotheses and make estimations using sample data. Whereas [[descriptive statistics]] describe a sample, inferential statistics infer predictions about a larger population that the sample represents.
[[Statistical inference]] is the process of drawing conclusions from data that are subject to random variation, for example, observational errors or sampling variation.<ref name="Oxford">Upton, G., Cook, I. (2008) ''Oxford Dictionary of Statistics'', OUP. {{isbn|978-0-19-954145-4}}</ref> Initial requirements of such a system of procedures for [[inference]] and [[Inductive reasoning|induction]] are that the system should produce reasonable answers when applied to well-defined situations and that it should be general enough to be applied across a range of situations. Inferential statistics are used to test hypotheses and make estimations using sample data. Whereas [[descriptive statistics]] describe a sample, inferential statistics infer predictions about a larger population that the sample represents.


Line 48: Line 51:


===Regression===
===Regression===
{{main article|Regression analysis}}
{{main|Regression analysis}}


In [[statistics]], '''regression analysis''' is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a [[dependent variable]] and one or more [[independent variable]]s. More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the [[conditional expectation]] of the dependent variable given the independent variables – that is, the [[average value]] of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a [[quantile]], or other [[location parameter]] of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a [[function (mathematics)|function]] of the independent variables called the '''regression function'''. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function which can be described by a [[probability distribution]].
In [[statistics]], '''regression analysis''' is a statistical process for estimating the relationships among variables. It includes many ways for modeling and analyzing several variables, when the focus is on the relationship between a [[dependent variable]] and one or more [[independent variable]]s. More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed. Most commonly, regression analysis estimates the [[conditional expectation]] of the dependent variable given the independent variables – that is, the [[average value]] of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a [[quantile]], or other [[location parameter]] of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a [[function (mathematics)|function]] of the independent variables called the '''regression function'''. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function which can be described by a [[probability distribution]].


Many techniques for carrying out regression analysis have been developed. Familiar methods, such as [[linear regression]], are [[parametric statistics|parametric]], in that the regression function is defined in terms of a finite number of unknown [[parameter]]s that are estimated from the [[data]] (e.g. using [[ordinary least squares]]). [[Nonparametric regression]] refers to techniques that allow the regression function to lie in a specified set of [[function (mathematics)|functions]], which may be [[dimension|infinite-dimensional]].
Many techniques for carrying out regression analysis have been developed. Familiar methods, such as [[linear regression]], are [[parametric statistics|parametric]], in that the regression function is defined in terms of a finite number of unknown [[parameter]]s that are estimated from the [[data]] (e.g. using [[ordinary least squares]]). [[Nonparametric regression]] refers to techniques that allow the regression function to lie in a specified set of [[function (mathematics)|functions]], which may be [[dimension|infinite-dimensional]].


===Nonparametric statistics===
===Nonparametric statistics===
{{main article|Nonparametric statistics}}
{{main|Nonparametric statistics}}
'''Nonparametric statistics''' are [[statistics]] not based on [[parametrization|parameterized]] families of [[probability distribution]]s. They include both [[descriptive statistics|descriptive]] and [[statistical inference|inferential]] statistics. The typical parameters are the mean, variance, etc. Unlike [[parametric statistics]], nonparametric statistics make no assumptions about the [[probability distribution]]s of the variables being assessed{{citation needed|date=March 2018}}.
'''Nonparametric statistics''' are values calculated from data in a way that is not based on [[Statistical parameter|parameterized]] families of [[probability distribution]]s. They include both [[descriptive statistics|descriptive]] and [[statistical inference|inferential]] statistics. The typical parameters are the expectations, variance, etc. Unlike [[parametric statistics]], nonparametric statistics make no assumptions about the [[probability distribution]]s of the variables being assessed.<ref>{{Cite web |title=Research Nonparametric Methods |url=https://d8.stat.cmu.edu/research-areas/nonparametric-methods |access-date=August 30, 2022 |website=Carnegie Mellon University}}</ref>


Non-parametric methods are widely used for studying populations that take on a ranked order (such as movie reviews receiving one to four stars). The use of non-parametric methods may be necessary when data have a [[ranking]] but no clear numerical interpretation, such as when assessing [[preferences]]. In terms of [[level of measurement|levels of measurement]], non-parametric methods result in "ordinal" data.
Non-parametric methods are widely used for studying populations that take on a ranked order (such as movie reviews receiving one to four stars). The use of non-parametric methods may be necessary when data have a [[ranking]] but no clear numerical interpretation, such as when assessing [[preferences]]. In terms of [[level of measurement|levels of measurement]], non-parametric methods result in "ordinal" data.


As non-parametric methods make fewer assumptions, their applicability is much wider than the corresponding parametric methods. In particular, they may be applied in situations where less is known about the application in question. Also, due to the reliance on fewer assumptions, non-parametric methods are more [[Robust statistics#Introduction|robust]].
As non-parametric methods make fewer assumptions, their applicability is much wider than the corresponding parametric methods. In particular, they may be applied in situations where less is known about the application in question. Also, due to the reliance on fewer assumptions, non-parametric methods are more [[Robust statistics#Introduction|robust]].

One drawback of non-parametric methods is that since they do not rely on assumptions, they are generally less [[Power of a test|powerful]] than their parametric counterparts.<ref name=":0">{{Cite web |title=Nonparametric Tests |url=https://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Nonparametric/BS704_Nonparametric_print.html |access-date=2022-08-31 |website=sphweb.bumc.bu.edu}}</ref> Low power non-parametric tests are problematic because a common use of these methods is for when a sample has a low sample size.<ref name=":0" /> Many parametric methods are proven to be the most powerful tests through methods such as the [[Neyman–Pearson lemma]] and the [[Likelihood-ratio test]].


Another justification for the use of non-parametric methods is simplicity. In certain cases, even when the use of parametric methods is justified, non-parametric methods may be easier to use. Due both to this simplicity and to their greater robustness, non-parametric methods are seen by some statisticians as leaving less room for improper use and misunderstanding.
Another justification for the use of non-parametric methods is simplicity. In certain cases, even when the use of parametric methods is justified, non-parametric methods may be easier to use. Due both to this simplicity and to their greater robustness, non-parametric methods are seen by some statisticians as leaving less room for improper use and misunderstanding.


==Statistics, mathematics, and mathematical statistics==
==Statistics, mathematics, and mathematical statistics==
Mathematical statistics has substantial overlap with the discipline of [[statistics]]. [[Statisticians|Statistical theorists]] study and improve statistical procedures with mathematics, and statistical research often raises mathematical questions. Statistical theory relies on [[Probability theory|probability]] and [[optimal decision|decision theory]].
Mathematical statistics is a key subset of the discipline of [[statistics]]. [[Statisticians|Statistical theorists]] study and improve statistical procedures with mathematics, and statistical research often raises mathematical questions.


Mathematicians and statisticians like [[Gauss]], [[Laplace]], and [[Charles Sanders Peirce|C. S. Peirce]] used [[optimal decision|decision theory]] with [[probability distribution]]s and [[loss function]]s (or [[utility function]]s). The decision-theoretic approach to statistical inference was reinvigorated by [[Abraham Wald]] and his successors,<ref>{{Cite book
Mathematicians and statisticians like [[Gauss]], [[Laplace]], and [[Charles Sanders Peirce|C. S. Peirce]] used [[optimal decision|decision theory]] with [[probability distribution]]s and [[loss function]]s (or [[utility function]]s). The decision-theoretic approach to statistical inference was reinvigorated by [[Abraham Wald]] and his successors<ref>{{Cite book
| first = Abraham
| first = Abraham
| last = Wald |authorlink=Abraham Wald
| last = Wald |author-link=Abraham Wald
| title = Sequential analysis
| title = Sequential analysis
| year = 1947
| year = 1947
Line 79: Line 84:
|first=Abraham
|first=Abraham
|last=Wald
|last=Wald
|authorlink=Abraham Wald
|author-link=Abraham Wald
|title=Statistical Decision Functions
|title=Statistical Decision Functions
|year=1950
|year=1950
|publisher=John Wiley and Sons, New York
|publisher=John Wiley and Sons, New York
}}</ref><ref>{{cite book|last=Lehmann|first=Erich|authorlink=Erich Leo Lehmann
}}</ref><ref>{{cite book|last=Lehmann|first=Erich|author-link=Erich Leo Lehmann
| title=Testing Statistical Hypotheses|year=1997 |edition=2nd
| title=Testing Statistical Hypotheses|year=1997 |edition=2nd
|isbn=0-387-94919-4 }}</ref><ref>
|isbn=0-387-94919-4 }}</ref><ref>
Line 91: Line 96:
| last2=Cassella
| last2=Cassella
| first2=George
| first2=George
| authorlink1=Erich Leo Lehmann
| author-link1=Erich Leo Lehmann
| title=Theory of Point Estimation
| title=Theory of Point Estimation
| year=1998 |edition=2nd|isbn= 0-387-98502-6}}</ref><ref>
| year=1998 |edition=2nd|isbn= 0-387-98502-6}}</ref><ref>
{{cite book
{{cite book
| last1=Bickel|first1= Peter J.|last2=Doksum|first2=Kjell A.
| last1=Bickel|first1= Peter J.|last2=Doksum|first2=Kjell A.
| authorlink1=Peter J. Bickel
| author-link1=Peter J. Bickel
|title=Mathematical Statistics: Basic and Selected Topics
|title=Mathematical Statistics: Basic and Selected Topics
|volume=1
|volume=1
Line 105: Line 110:
|first=Lucien
|first=Lucien
|last=Le Cam
|last=Le Cam
|authorlink=Lucien Le Cam
|author-link=Lucien Le Cam
|title=Asymptotic Methods in Statistical Decision Theory
|title=Asymptotic Methods in Statistical Decision Theory
|year=1986
|year=1986
Line 111: Line 116:
}}</ref><ref>{{cite book
}}</ref><ref>{{cite book
|author1=Liese, Friedrich |author2=Miescke, Klaus-J.
|author1=Liese, Friedrich |author2=Miescke, Klaus-J.
|lastauthoramp=yes |title=Statistical Decision Theory: Estimation, Testing, and Selection
|name-list-style=amp |title=Statistical Decision Theory: Estimation, Testing, and Selection
|year=2008
|year=2008
|publisher=Springer
|publisher=Springer
}}
}}
</ref> and makes extensive use of [[scientific computing]], [[mathematical analysis|analysis]], and [[Optimization (mathematics)|optimization]]; for the [[design of experiments]], statisticians use [[Algebraic statistics|algebra]] and [[Combinatorial design|combinatorics]].
</ref> and makes extensive use of [[scientific computing]], [[mathematical analysis|analysis]], and [[Optimization (mathematics)|optimization]]; for the [[design of experiments]], statisticians use [[Algebraic statistics|algebra]] and [[Combinatorial design|combinatorics]]. But while statistical practice often relies on [[Probability theory|probability]] and [[optimal decision|decision theory]], their application can be controversial <ref name=Freedman/>


==See also==
==See also==
{{portal|Mathematics}}
*[[Asymptotic theory (statistics)]]
*[[Asymptotic theory (statistics)]]


Line 123: Line 129:
<references/>
<references/>


== Additional reading ==
== Further reading ==
* Borovkov, A. A. (1999). ''Mathematical Statistics''. CRC Press. {{isbn|90-5699-018-7}}
* [[Aleksandr Alekseevich Borovkov|Borovkov, A. A.]] (1999). ''Mathematical Statistics''. CRC Press. {{isbn|90-5699-018-7}}
* [http://www.math.uah.edu/stat/ Virtual Laboratories in Probability and Statistics (Univ. of Ala.-Huntsville)]
* [http://www.math.uah.edu/stat/ Virtual Laboratories in Probability and Statistics (Univ. of Ala.-Huntsville)]
* [http://www.trigonella.ch/statibot/english/ StatiBot], interactive online expert system on statistical tests.
* [http://www.trigonella.ch/statibot/english/ StatiBot], interactive online expert system on statistical tests.
* {{Cite book|last1=Ray|first1=Manohar|url=https://books.google.com/books?id=NXGpYgEACAAJ|title=Mathematical Statistics|last2=Sharma|first2=Har Swarup|date=1966|publisher=Ram Prasad & Sons}} {{ISBN|978-9383385188}}


{{Statistics}}
{{Statistics}}

Latest revision as of 07:44, 30 December 2024

Illustration of linear regression on a data set. Regression analysis is an important part of mathematical statistics.

Mathematical statistics is the application of probability theory and other mathematical concepts to statistics, as opposed to techniques for collecting statistical data.[1] Specific mathematical techniques that are commonly used in statistics include mathematical analysis, linear algebra, stochastic analysis, differential equations, and measure theory.[2][3]

Introduction

Statistical data collection is concerned with the planning of studies, especially with the design of randomized experiments and with the planning of surveys using random sampling. The initial analysis of the data often follows the study protocol specified prior to the study being conducted. The data from a study can also be analyzed to consider secondary hypotheses inspired by the initial results, or to suggest new studies. A secondary analysis of the data from a planned study uses tools from data analysis, and the process of doing this is mathematical statistics.

Data analysis is divided into:

  • descriptive statistics – the part of statistics that describes data, i.e. summarises the data and their typical properties.
  • inferential statistics – the part of statistics that draws conclusions from data (using some model for the data). For example, inferential statistics involves selecting a model for the data, checking whether the data fulfill the conditions of a particular model, and quantifying the uncertainty involved (e.g. using confidence intervals).

While the tools of data analysis work best on data from randomized studies, they are also applied to other kinds of data, for example data from natural experiments and observational studies. In those cases the inference depends on the model chosen by the statistician, and so is subjective.[4][5]
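
The distinction between descriptive and inferential statistics can be illustrated with a minimal Python sketch. The data are made up, the function names `describe` and `mean_confidence_interval` are hypothetical, and the interval uses a large-sample normal approximation, which is only an assumption here; for a sample this small a t-based interval would be somewhat wider.

```python
import math
import statistics


def describe(sample):
    """Descriptive statistics: summarise the observed data only."""
    return {
        "n": len(sample),
        "mean": statistics.mean(sample),
        "median": statistics.median(sample),
        "stdev": statistics.stdev(sample),  # sample standard deviation (n - 1)
    }


def mean_confidence_interval(sample, level=0.95):
    """Inferential statistics: approximate interval for the population mean.

    Assumes the sample mean is approximately normally distributed.
    """
    n = len(sample)
    centre = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return centre - z * std_error, centre + z * std_error


heights_cm = [171.2, 168.4, 175.9, 180.1, 165.3, 172.8, 169.7, 177.4]  # made-up data
summary = describe(heights_cm)
low, high = mean_confidence_interval(heights_cm)
print(summary)
print(f"approximate 95% interval for the mean: ({low:.1f}, {high:.1f})")
```

The first function only describes the eight observations; the second makes a statement about the population they were drawn from, and that statement is only as good as the model assumption behind it.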

Topics

The following are some of the important topics in mathematical statistics:[6][7]

Probability distributions

A probability distribution is a function that assigns a probability to each measurable subset of the possible outcomes of a random experiment, survey, or procedure of statistical inference. Examples are found in experiments whose sample space is non-numerical, where the distribution would be a categorical distribution; experiments whose sample space is encoded by discrete random variables, where the distribution can be specified by a probability mass function; and experiments with sample spaces encoded by continuous random variables, where the distribution can be specified by a probability density function. More complex experiments, such as those involving stochastic processes defined in continuous time, may demand the use of more general probability measures.

A probability distribution can either be univariate or multivariate. A univariate distribution gives the probabilities of a single random variable taking on various alternative values; a multivariate distribution (a joint probability distribution) gives the probabilities of a random vector—a set of two or more random variables—taking on various combinations of values. Important and commonly encountered univariate probability distributions include the binomial distribution, the hypergeometric distribution, and the normal distribution. The multivariate normal distribution is a commonly encountered multivariate distribution.
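
As a concrete illustration of a univariate distribution specified by a probability mass function, the following sketch tabulates a binomial distribution using only the Python standard library. The parameters `n = 10` and `p = 0.3` are arbitrary choices for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 10, 0.3  # illustrative parameters
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

total = sum(pmf)                              # equals 1 up to rounding
mean = sum(k * q for k, q in enumerate(pmf))  # equals n * p up to rounding
prob_at_most_2 = sum(pmf[:3])                 # probability of the subset {0, 1, 2}

print(f"total probability: {total:.6f}")
print(f"mean: {mean:.6f}")
print(f"P(X <= 2): {prob_at_most_2:.6f}")
```

The probability of any subset of outcomes is obtained by summing the mass function over that subset, which is how the distribution assigns a probability to each measurable set in the discrete case.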

Special distributions

Statistical inference

Statistical inference is the process of drawing conclusions from data that are subject to random variation, for example, observational errors or sampling variation.[8] Initial requirements of such a system of procedures for inference and induction are that the system should produce reasonable answers when applied to well-defined situations and that it should be general enough to be applied across a range of situations. Inferential statistics are used to test hypotheses and make estimations using sample data. Whereas descriptive statistics describe a sample, inferential statistics infer predictions about a larger population that the sample represents.

The outcome of statistical inference may be an answer to the question "what should be done next?", where this might be a decision about making further experiments or surveys, or about drawing a conclusion before implementing some organizational or governmental policy. For the most part, statistical inference makes propositions about populations, using data drawn from the population of interest via some form of random sampling. More generally, data about a random process is obtained from its observed behavior during a finite period of time. Given a parameter or hypothesis about which one wishes to make inference, statistical inference most often uses:

  • a statistical model of the random process that is supposed to generate the data, which is known when randomization has been used, and
  • a particular realization of the random process; i.e., a set of data.
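
The two ingredients listed above can be sketched in Python as follows. This is a minimal example under stated assumptions: the population is simulated (normally distributed values with mean 170 and standard deviation 10, chosen arbitrarily), the data are a simple random sample from it, and the interval uses the large-sample normal approximation. The helper name `mean_confidence_interval` is hypothetical.

```python
import math
import random
import statistics

def mean_confidence_interval(sample, level=0.95):
    """Sample mean and an approximate confidence interval for the
    population mean, using the large-sample normal approximation."""
    n = len(sample)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean, (mean - z * standard_error, mean + z * standard_error)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed population: simulated values, for illustration only.
population = [rng.gauss(170.0, 10.0) for _ in range(100_000)]

# The realization: a simple random sample drawn without replacement.
sample = rng.sample(population, 200)

estimate, (low, high) = mean_confidence_interval(sample)
print(f"estimate = {estimate:.2f}, 95% interval = ({low:.2f}, {high:.2f})")
```

Because the sample is drawn at random, the sampling model is known, which is what justifies attaching a confidence level to the interval.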

Regression

In statistics, regression analysis is a statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied while the other independent variables are held fixed. Most commonly, regression analysis estimates the conditional expectation of the dependent variable given the independent variables, that is, the average value of the dependent variable when the independent variables are fixed. Less commonly, the focus is on a quantile or other location parameter of the conditional distribution of the dependent variable given the independent variables. In all cases, the estimation target is a function of the independent variables called the regression function. In regression analysis, it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.

Many techniques for carrying out regression analysis have been developed. Familiar methods, such as linear regression, are parametric, in that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data (e.g. using ordinary least squares). Nonparametric regression refers to techniques that allow the regression function to lie in a specified set of functions, which may be infinite-dimensional.
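
As a minimal sketch of the parametric case, the following Python code fits a straight line with one independent variable by ordinary least squares, using the closed-form estimates for the two unknown parameters (intercept and slope). The function name `fit_line` and the data values are made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns the (intercept, slope) pair that minimizes the sum of
    squared residuals for a single independent variable."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Illustrative data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)

# Residuals describe the variation of the data around the fitted line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

With an intercept in the model, the least-squares residuals sum to zero up to rounding error, which gives a quick sanity check on the fit.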

Nonparametric statistics

Nonparametric statistics are values calculated from data in a way that is not based on parameterized families of probability distributions. They include both descriptive and inferential statistics. In parametric statistics, the typical parameters are quantities such as the expectation and the variance. Unlike parametric statistics, nonparametric statistics do not assume that the variables being assessed follow a particular parametric family of probability distributions.[9]

Non-parametric methods are widely used for studying populations that take on a ranked order (such as movie reviews receiving one to four stars). The use of non-parametric methods may be necessary when data have a ranking but no clear numerical interpretation, such as when assessing preferences. In terms of levels of measurement, non-parametric methods are appropriate for data measured on an ordinal scale.
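
As one example of a rank-based method for such data, the following Python sketch computes the Mann–Whitney U statistics for two sets of star ratings. It is a minimal illustration: the ratings are invented, the helper names are hypothetical, ties receive the average of the ranks they span, and no p-value is computed.

```python
def midranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U_a, U_b), the Mann-Whitney U statistics of the two samples."""
    ranks = midranks(list(a) + list(b))
    rank_sum_a = sum(ranks[: len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return u_a, u_b

# Invented ratings on a one-to-four star scale.
film_a = [4, 3, 4, 2, 3]
film_b = [1, 2, 2, 3, 1]

u_a, u_b = mann_whitney_u(film_a, film_b)
print(u_a, u_b)  # 22.0 3.0
```

Only the ordering of the ratings enters the calculation, so the result would be unchanged if the stars were relabeled by any increasing transformation.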

As non-parametric methods make fewer assumptions, their applicability is much wider than that of the corresponding parametric methods. In particular, they may be applied in situations where less is known about the application in question. Also, due to the reliance on fewer assumptions, non-parametric methods are more robust.

One drawback of non-parametric methods is that, because they rely on fewer assumptions, they are generally less powerful than their parametric counterparts when the parametric assumptions hold.[10] Low power is a particular problem because non-parametric tests are commonly used when the sample size is small.[10] Many parametric tests can be shown to be most powerful under their assumptions by results such as the Neyman–Pearson lemma, which concerns the likelihood-ratio test.

Another justification for the use of non-parametric methods is simplicity. In certain cases, even when the use of parametric methods is justified, non-parametric methods may be easier to use. Due both to this simplicity and to their greater robustness, non-parametric methods are seen by some statisticians as leaving less room for improper use and misunderstanding.

Statistics, mathematics, and mathematical statistics

Mathematical statistics is a key subset of the discipline of statistics. Statistical theorists study and improve statistical procedures with mathematics, and statistical research often raises mathematical questions.

Mathematicians and statisticians such as Gauss, Laplace, and C. S. Peirce used decision theory with probability distributions and loss functions (or utility functions). The decision-theoretic approach to statistical inference was reinvigorated by Abraham Wald and his successors[11][12][13][14][15][16][17] and makes extensive use of scientific computing, analysis, and optimization; for the design of experiments, statisticians use algebra and combinatorics. But while statistical practice often relies on probability and decision theory, their application can be controversial.[5]
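
The role of a loss function in decision theory can be illustrated with a minimal Python sketch: given a probability distribution over outcomes, the chosen action is the one with the smallest expected loss. The distribution, the grid of candidate actions, and the function names below are assumptions made for the example.

```python
def expected_loss(action, outcomes, probabilities, loss):
    """Expected loss of an action under a discrete probability distribution."""
    return sum(p * loss(action, x) for x, p in zip(outcomes, probabilities))

def best_action(actions, outcomes, probabilities, loss):
    """Action with the smallest expected loss among the candidates."""
    return min(actions, key=lambda a: expected_loss(a, outcomes, probabilities, loss))

def squared_loss(action, outcome):
    return (action - outcome) ** 2

def absolute_loss(action, outcome):
    return abs(action - outcome)

# Assumed discrete distribution with mean 1.7 and median 1.
outcomes = [0, 1, 2, 10]
probabilities = [0.4, 0.3, 0.2, 0.1]

# Candidate point estimates on a grid from 0.0 to 10.0 in steps of 0.1.
candidates = [i / 10 for i in range(101)]

best_squared = best_action(candidates, outcomes, probabilities, squared_loss)
best_absolute = best_action(candidates, outcomes, probabilities, absolute_loss)
print(best_squared, best_absolute)  # 1.7 1.0
```

The two loss functions lead to different decisions for the same distribution: squared-error loss selects the mean, while absolute-error loss selects the median.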

See also

References

  1. ^ Shao, Jun (2008-02-03). Mathematical Statistics. Springer Science & Business Media. ISBN 978-0-387-21718-5.
  2. ^ Kannan, D.; Lakshmikantham, V., eds. (2002). Handbook of stochastic analysis and applications. New York: M. Dekker. ISBN 0824706609.
  3. ^ Schervish, Mark J. (1995). Theory of statistics (Corr. 2nd print. ed.). New York: Springer. ISBN 0387945466.
  4. ^ Freedman, D.A. (2005) Statistical Models: Theory and Practice, Cambridge University Press. ISBN 978-0-521-67105-7
  5. ^ a b Freedman, David A. (2010). Collier, David; Sekhon, Jasjeet S.; Stark, Philip B. (eds.). Statistical Models and Causal Inference: A Dialogue with the Social Sciences. Cambridge University Press. ISBN 978-0-521-12390-7.
  6. ^ Hogg, R. V., A. Craig, and J. W. McKean. "Intro to Mathematical Statistics." (2005).
  7. ^ Larsen, Richard J. and Marx, Morris L. "An Introduction to Mathematical Statistics and Its Applications" (2012). Prentice Hall.
  8. ^ Upton, G., Cook, I. (2008) Oxford Dictionary of Statistics, OUP. ISBN 978-0-19-954145-4
  9. ^ "Research Nonparametric Methods". Carnegie Mellon University. Retrieved August 30, 2022.
  10. ^ a b "Nonparametric Tests". sphweb.bumc.bu.edu. Retrieved 2022-08-31.
  11. ^ Wald, Abraham (1947). Sequential analysis. New York: John Wiley and Sons. ISBN 0-471-91806-7. See Dover reprint, 2004: ISBN 0-486-43912-7
  12. ^ Wald, Abraham (1950). Statistical Decision Functions. John Wiley and Sons, New York.
  13. ^ Lehmann, Erich (1997). Testing Statistical Hypotheses (2nd ed.). ISBN 0-387-94919-4.
  14. ^ Lehmann, Erich; Casella, George (1998). Theory of Point Estimation (2nd ed.). ISBN 0-387-98502-6.
  15. ^ Bickel, Peter J.; Doksum, Kjell A. (2001). Mathematical Statistics: Basic and Selected Topics. Vol. 1 (Second (updated printing 2007) ed.). Pearson Prentice-Hall.
  16. ^ Le Cam, Lucien (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag. ISBN 0-387-96307-3.
  17. ^ Liese, Friedrich & Miescke, Klaus-J. (2008). Statistical Decision Theory: Estimation, Testing, and Selection. Springer.

Further reading
