German tank problem
In the statistical theory of estimation, estimating the maximum of a uniform distribution is a common illustration of differences between estimation methods. The specific case of sampling without replacement from a discrete uniform distribution is known, in the English-speaking world, as the German tank problem, due to its application in World War II to the estimation of the number of German tanks.
Estimating the population maximum based on a single sample raises philosophical points about evaluation of estimators and likelihood (particularly bias in maximum likelihood estimators) and yields divergent results, depending on approach, while the estimation based on multiple samples is used in elementary statistics education as an instructive practical estimation question whose answer is simple but not obvious.
The problem is usually framed for a discrete distribution, but virtually identical analysis holds for a continuous distribution.
Exposition
One may frame the question of estimation of the population maximum as:
- Suppose one is an Allied intelligence analyst during World War II, and one has some serial numbers of captured German tanks. Further, assume that the tanks are numbered sequentially from 1 to N. How does one estimate the total number of tanks?
For point estimation (estimating a single value for the total), the minimum-variance unbiased estimator (MVUE, or UMVU estimator) is given by:[notes 1]
where m is the largest serial number observed (sample maximum) and k is the number of tanks observed (sample size).[1][2][3] Note that once a serial number has been observed, it is no longer in the pool and will not be observed again.
This has a variance of[1]
so a standard deviation of approximately N/k, the (population) average size of a gap between samples; compare m/k above.
Example
Suppose there are 15 tanks, numbered 1, 2, 3, ..., 15. An intelligence officer has spotted tanks 2, 6, 7, 14. Using the above formula, the values for m and k would be 14 and 4, respectively. The formula gives a value 16.5, which is close to the actual number of tanks, 15. Then suppose the officer spots an additional 2 tanks, neither of them #15. Now k = 6 and m remains 14. The formula gives a better estimate of 15.333...
Intuition
The formula may be understood intuitively as:
- "The sample maximum plus the average gap between observations in the sample",
the gap being added to compensate for the negative bias of the sample maximum as an estimator for the population maximum,[notes 2] and written as
This can be visualized by imagining that the samples are evenly spaced throughout the range, with additional samples just outside the range at 0 and N + 1. If one starts with an initial gap being between 0 and the lowest sample (sample minimum), then the average gap between samples is m/k − 1; the −1 being because one does not count the samples themselves in computing the gap between samples.[notes 3]
This philosophy is formalized and generalized in the method of maximum spacing estimation.
Derivation
More formally, the probability that the sample maximum m is less than or equal to i (i ≥ k) is
- (where is the binomial coefficient)
so the probability that m equals i is
The expected value of m is the sum of these multiplied by i:
Using the fact that
we find that the expected value of m is:[1]
Solving for N gives
Since , we find that ((k + 1)/k)m − 1 is an unbiased estimator of N.
To show that this is the UMVU estimator:
- One first shows that the sample maximum is a sufficient statistic for the population maximum, using a method similar to that detailed at sufficiency: uniform distribution (but for the German tank problem we must exclude the outcomes in which a serial number occurs twice in the sample);
- Next, one shows that it is a complete statistic.
- Then the Lehmann–Scheffé theorem states that the sample maximum, corrected for bias as above to be unbiased, is the UMVU estimator.
Variant: Unknown lower bound
If the lower bound is also unknown, one can perform a similar analysis (both for upper and lower bound), where the sufficient statistics are now the sample max and sample min.[1] The estimators can again be interpreted as "sample max (resp. sample min) corrected by an estimate of the gap".
Order of magnitude
Note that for order of magnitude estimates, the sample maximum or the refined estimator above are sufficiently accurate: even a single sample is likely to give you a correct estimate of order of magnitude, as you are 90% unlikely to fall in the bottom decile in one sample, and more than 99% unlikely to fall in the bottom decile with two samples.
Confidence intervals
Instead of, or in addition to, point estimation, one can do interval estimation, such as confidence intervals. These are easily computed, based on the observation that the odds that k samples will fall in an interval covering p of the range (0 ≤ p ≤ 1) is pk (assuming in this section that draws are with replacement, to simplify computations; if draws are without replacement, this overstates the likelihood, and intervals will be overly conservative).
Thus the sampling distribution of the quantile of the sample maximum is the graph x1/k from 0 to 1: the pth to qth quantile of the sample maximum m are the interval [p1/kN, q1/kN]. Inverting this yields the corresponding confidence interval for the population maximum of [m/q1/k, m/p1/k].
For example, taking the symmetric 95% interval p = 2.5% and q = 97.5% for k = 5 yields , so a confidence interval of approximately . The lower bound is very close to m, so more informative is the asymmetric confidence interval from p = 5% to 100%; for k = 5 this yields so the interval [m, 1.82m].
More generally, the (downward biased) 95% confidence interval is For a range of k, with the UMVU point estimator (plus 1 for legibility) for reference, this yields:
k | point estimate | confidence interval |
---|---|---|
1 | ||
2 | ||
5 | ||
10 | ||
20 |
Immediate observations are:
- For small sample sizes, the confidence interval is very wide, reflecting great uncertainty in the estimate.
- The range shrinks rapidly, reflecting the exponentially decaying likelihood that all samples will be significantly below the maximum.
- The confidence interval exhibits positive skew, as N can never be below the sample maximum, but can potentially be arbitrarily high above it.
Note that one cannot naively use m/k (or rather (m + m/k − 1)/k) as an estimate of the standard error SE, as the standard error of an estimator is based on the population maximum (a parameter), and using an estimate to estimate the error in that very estimate is circular reasoning. Further, because the sampling distribution of the sample maximum is not normal (and hence the distribution of the estimator is not normal), one cannot use the 68-95-99.7 rule to conclude that an interval of is a 95% confidence interval, as demonstrated in the table above.
Historical problem
In wartime, a key goal of military intelligence is to determine the strength of an enemy force: in World War II, the Western Allies wanted to estimate the number of tanks the Germans had, and approached this in two major ways: conventional intelligence gathering, and statistical estimation. The statistical approach proved to be far more accurate than the conventional intelligence; the primary reference for the statistical approach is Ruggles & Brodie (1947).[4][notes 4] In some cases statistical analysis contradicted and substantially improved on conventional intelligence; in others, conventional intelligence and the statistical approach worked together, as in estimation of Panther tank production, discussed below. Estimating production was not the only use of this serial number analysis; it was used to understand German production more generally, including number of factories, relative importance of factories, length of supply chain (based on lag between production and use), changes in production, and use of resources such as rubber.
To estimate the number of tanks produced up to a certain point, the Allies used the serial numbers on tanks (most significantly gearbox numbers, as these fell in two unbroken sequences; also chassis and engine, though these were more complicated – various other components helped cross-check the analysis; similar analyses were done on tires),[4] which were observed to be sequentially numbered (i.e. 1, 2, 3, ..., N).[notes 5][5][6]
Specific data
The Allied conventional intelligence estimates believed the number of tanks the Germans were producing between June 1940 and September 1942 was around 1,400 a month. Using the above formula on the serial numbers of captured German tanks, (both serviceable and destroyed) the number was calculated to be 256 a month. After the war captured German production figures from the ministry of Albert Speer show the actual number to be 255.[5]
Estimates for some specific months is given as:[7][8]
Month | Statistical estimate | Intelligence estimate | German records |
June 1940 | 169 | 1000 | 122 |
June 1941 | 244 | 1550 | 271 |
August 1942 | 327 | 1550 | 342 |
Shortly before D-Day, following rumors of large Panther tank production collected by conventional intelligence, analysis of road wheels from two tanks (consisting of 48 wheels each, for 96 wheels total) yielded an estimate of 270 Panthers produced in February 1944, substantially more than had previously been suspected; German records after the war showed production for that month was 276.[9] Specifically, analysis of the wheels yielded an estimate for the number of wheel molds; discussion with British road wheel makers then estimated the number of wheels that could be produced from this many molds.
Similar analyses
Similar serial number analysis was used for other military equipment during World War II, most successfully for the V-2 rocket.[10]
During WWII, German intelligence analyzed factory markings on Soviet military equipment, and during the Korean war, factory markings on Soviet equipment was again analyzed. Soviets also estimated German tank production.[11]
In the 1980s, a similar situation occurred where Americans were given access to the production line of Israel’s Merkava tanks. The production numbers were classified, but the tanks had serial numbers, allowing estimation of production.[1]
Application to iPhone production estimation
In 2008 "A London investor called Tommo_UK started asking for people to post the serial numbers of the phone and the date they bought it so that he could decipher how many phones Apple is distributing."[12][13] from this information and using the above formula he was able to calculate that Apple Inc had sold 9.1 million iPhones to the end of September 2008,[14] which meant that they would probably sell more than 10 million in the year.[12]
Countermeasures
To prevent serial number analysis, one can most simply not include serial numbers, or reduce auxiliary information that can be usable. Alternatively, one can design serial numbers that resist cryptanalysis, most effectively by randomly choosing numbers without replacement from a list that is much larger than the number of objects you produce (compare one-time pad), or simply produce random numbers and check them against the list of already assigned numbers; collisions are likely to occur unless the number of digits possible is more than twice as many as the number of digits in the number of objects produced (where the serial number can be in decimal, hexadecimal or indeed in any base); see birthday problem.[notes 6] For this, one may use a cryptographically secure pseudorandom number generator. Less securely, to avoid lookup problems, one may use any pseudorandom number generator with large period, which is guaranteed to avoid collisions. All these methods require a lookup table (or breaking the cypher) to back out from serial number to production order, which complicates use of serial numbers: one cannot simply recall a range of serial numbers, for instance, but instead must look them each up individually, or generate a list.
Alternatively, one can simply use sequential serial numbers and encrypt them, which allows easy decoding, but then there is a known-plaintext attack: even if one starts from an arbitrary point, the plaintext has a pattern (namely, numbers are in sequence).
Theoretical discussion
Formally, the question is the estimation of the maximum in a finite population consisting of one element at each integer up to the unknown maximum. Observing the special case of a sample of size 1, this approach demonstrates the features and limitations of maximum likelihood estimation and likelihood functions.
Consider a jar containing N lottery tickets numbered from 1 through N, where N is unknown. Given a single ticket n, drawn from the jar, how does one estimate the maximum N?
Three approaches to point estimation give radically different answers: n, 2n − 1, and ∞, accordingly as one uses the maximum likelihood estimator (MLE), minimum-variance unbiased estimator (MVUE), and naive use of likelihood functions. It illustrates dangers and difficulties of naive use of maximum likelihood estimators, which may have significant bias, and of likelihood functions, which cannot simply be conflated with probability distribution functions.
The divergence between these answers is reflected in the corresponding interval estimation for the problem, which yields very large (potentially infinite) confidence intervals and credible intervals.
Approaches
Point estimation
- Maximum likelihood
- The maximum likelihood estimator (MLE) for N is n; however, this is a biased estimator and very likely an underestimate,[notes 7] as N could be higher than n but can not be lower; see analysis of bias.
- Natural unbiased estimation
- The unbiased estimator is 2n − 1 (this is the unique unbiased estimator, thus also the MVUE). This can be derived in two ways:
- The expected value of n is (N + 1)/2. Solving this[notes 8] for N shows that N = 2n − 1 is the (unique) unbiased estimator. Equivalently, take the sample mean as the population mean (the method of moments).
- Alternatively, one can use order statistics: to consider the sample median (in this case, n) as the population median — i.e., to take n as the half-way point. In that case, the estimate for N is 2n − 1 This is equally likely to be an under-estimate or over-estimate, as a given sample is equally likely to be above or below the median.
- These criteria — unbiased and equal odds of over- or under-estimation — coincide because the population mean and median of the uniform distribution are equal, and because the sample mean and median of a single sample are equal, being the sample itself.
- Uniform prior
- Naively applying Bayesian inference, one can attempt to take as one's prior distribution a uniform distribution on 1, 2, 3, ..., which is an improper prior and compute the expected value of the posterior distribution. However, as elaborated below, the sums diverge. Properly, taking a uniform distribution on 1, 2, 3, ..., M for large M, calculating the (finite) expected value of the posterior, and taking the limit as M → ∞ yields infinity. Note that if one has some prior knowledge or belief about N (such as an upper limit), one can apply other estimation; this discussion assumes no knowledge about N.
Interval estimation
- Confidence intervals
- Confidence intervals are easily calculated, and very wide: as n has a 95% chance of falling between the 2.5% and 97.5% quantiles, this shows that with 95% confidence, n/0.975 ≤ N ≤ n/0.025, or approximately, 1.025n ≤ N ≤ 40n, with equal probabilities of being above or below. If one instead takes the asymmetric interval that n has a 95% chance of falling between the 0% and 95% quantiles, this yields the confidence interval of approximately 1.05n ≤ N ≤ ∞, which is otherwise an overestimate. Conversely, taking the asymmetric interval that n has a 95% chance of falling between the 5% and 100% quantiles yields the confidence interval of n ≤ N ≤ 20n, which is otherwise an underestimate; the analogous (possibly underestimating) 99% confidence interval is n ≤ N ≤ 100n.
- Credible intervals
- As above in "point estimation", the likelihood function has infinite integral, so naive use of the likelihood function (with some improper priors) yields infinite credible intervals. Similarly, the Bayes factor between any finite interval and an infinite interval (unbounded above) is infinite, as the former has finite integral and the latter infinite integral. However, the Bayes factor between any two finite intervals can be computed, regardless of prior, as the integrals are both finite.
Analysis
All that one can know for sure is that N is at least n; however, it could be arbitrarily higher. The divergent answers reflect the following:
- the likeliest maximum is n itself, as the higher N is, the less probable any particular n is to arise: one is more likely to draw n if N = n (probability 1/n) than if N = n + 1 or more (probability 1/(n + 1)).
- if N is equally likely to be any positive number, potentially very large,[notes 9] then it is more likely that N is greater than any particular bound K, and n happens to have been drawn from the bottom of the range. This follows from the divergence of the harmonic series: the tail odds 1/K + 1/(K + 1) + ... corresponding to "N is large and n is small" diverge to infinity. This is reflected in the likelihood function having infinite integral, as discussed below.
- One may criticize the choice of uniform prior as placing undue weight on large numbers; formally, it is not scale invariant, and thus because there are far more numbers between 1,000,000 and 10,000,000 than between 100 and 1,000, large numbers dominate, making the expectation infinite: the (linear) uniform distribution is a poor prior for estimation of scale. A scale invariant choice of prior, like the logarithmic prior (uniform distribution on the logarithmic scale), places less weight on large numbers and thus gives more controlled likelihood functions: it yields a likelihood function with finite integral, but still with infinite expectation.
- the unbiased estimate of N = 2m − 1 (taking m to be the population median) has equal odds of being an underestimate or an overestimate, regardless of the distribution of N (it is a nonparametric estimator, being an order statistic). However, since N is bounded below (by n) but unbounded above, in expectation this may be biased low (if we take a uniform prior distribution on N).[notes 10]
Thus the different estimates reflect different facets of what can be inferred:
- n is most likely;
- 2m − 1 is unbiased and also has even odds of under/over estimation;
- but if all N are equally likely, expectation is infinite. However, none of these is categorically better.
The confidence interval computation shows that, based on a single observation, one cannot estimate with high confidence (95%) better than a range of 1:20; with 99% confidence one can give the range [n, 100n], so 10n to within an order of magnitude. This range (interval estimation) demonstrates the uncertainty giving but a single sample, which is not evident from considering only point estimation.
Improved estimates
Additional samples
If one has additional samples, one can refine the analysis:
- the maximum likelihood estimate for N is the sample maximum, for the same reason as above;
- the MVUE is ((k + 1)/k)m − 1), with the sample maximum m, as discussed above: the sample maximum is a sufficient statistic.
- The estimators 2μ − 1, where μ is the mean (the method of moments estimator), or 2m − 1 for the sample median m are unbiased but inefficient: other than having higher variance than the MVUE, they may even give a lower estimate than the sample maximum, which cannot be correct. Precise variances can be computed;[1]
- the computation of the expectation of N can be refined, as the likelihood function becomes more controlled, as discussed below.
Prior distribution
If one has a prior which is either bounded above or more generally decays sufficiently fast, the expected value of N (using the likelihood function) will be finite.
If one has a prior, one can replace maximum likelihood estimation with maximum a posteriori estimation; recall that these coincide if the prior is uniform, as assumed above. If one has no prior beliefs about the scale, one will likely choose a scale invariant measure such as the logarithmic prior.
Using a logarithmic prior, p(θ) 1/θ (which is an improper prior), modifies the above by multiplying by a factor of 1/x, making the tails decay faster (as x−(k+1) rather than x−k). Thus with a sample of one observation, the likelihood has finite integral but infinite expectation; with a sample of two observations, the likelihood has finite expectation; and with a sample of three observations, the likelihood has finite expectation and variance.
Likelihood function
Consider a jar containing N lottery tickets numbered from 1 through N. (Or N tanks numbered from 1 through N). Take a sample of k tickets. Let the sample maximum be m.
The likelihood functions, interpreted as discrete distributions, are the heavy-tailed distributions
where the Iverson bracket [N ≥ m] is 1 when N ≥ m and 0 otherwise.
Series involving these distributions are simplified using the nice formula (14) from here:
If you pick 3 or more tickets the likelihood function has a well defined mean value, which is larger than the maximum likelihood estimate (as the likelihood function is at least as big as the sample maximum). If you pick 4 or more tickets the likelihood function has a well defined standard deviation too, and likewise for larger samples having higher moments.
A single ticket
If you pick a ticket randomly you get number n with probability 1/N if n ≤ N, and zero if n > N. This is written
When considered a function of n for fixed N this is the probability distribution, but when considered a function of N for fixed n this is a likelihood function.
The maximum likelihood estimate for N is N0 = n.
This likelihood function is not proportional to any probability mass function, because the total
is a divergent series, being a tail of the harmonic series.
However, the sum over finite intervals is defined, and thus the relative likelihood of N being in one of two intervals, such as 100 to 1,000 or 1,000 to 10,000, is defined, because it is a ratio of two finite sums.
Two tickets
Suppose, however, that you pick two tickets rather than one.
The probability of the outcome {n1, n2},
When considered a function of N for fixed sample maximum m = n2, where n1 < n2, this is a likelihood function
The maximum likelihood estimate for N is N0 = m.
This time the total likelihood is a convergent series,
and so this likelihood function can be normalized into a probability distribution. This means that the median (together with other quantiles) is defined:
But the mean value of this distribution is undefined.
k tickets
In general, if n1 < n2 < ... < nk are the observations in the sample, and the sample maximum m = nk, then the likelihood function is given by:
The total likelihood is finite for k ≥ 2:
The weighted sum is finite for k ≥ 3 :
so the mean value is
The weighted sum of squares is finite for k ≥ 4:
so the variance is
and the standard deviation is
The estimate is N ≈ μ ± σ.
See also
- Capture-recapture, other method of estimating population size
- Maximum spacing estimation, which generalizes the intuition of "assume uniformly distributed"
Other discussions of the estimation
- Maximum likelihood#Bias
- Bias of an estimator#Maximum of a discrete uniform distribution
- Likelihood function#Example 5
Notes
- ^ In a continuous distribution, there is no −1.
- ^ The sample maximum is never more than the population maximum, but can be less, hence it is a biased estimator: it will tend to underestimate the population maximum.
- ^ For example, the gap between 2 and 7 is (7 − 2) − 1 = 4, consisting of 3, 4, 5, and 6.
- ^ Ruggles & Brodie is largely a practical analysis and summary, not a mathematical one – the estimation problem is only mentioned in footnote 3 on page 82, where they estimate the maximum as "sample maximum + average gap".
- ^ In fact, the lower bound was also unknown, but to simplify the discussion this detail is generally omitted, taking the lower bound as known to be 1.
- ^ As discussed in birthday attack, one can expect a collision after 1.25√H numbers, if choosing from H possible outputs. This square root corresponds to half the digits. For example, the square root of a number with 100 digits is approximately a number with 50 digits in any base.
- ^ (N − 1)/N ≥ (n − 1)/n likely to be an underestimate.
- ^ so
- ^ Properly, equally likely to be any positive number up to a very large upper bound.
- ^ This statement can be confusing: from a frequentist analysis, it is an unbiased estimator, but from a Bayesian analysis, assuming a prior, in expectation it may be low.
References
- ^ a b c d e f Johnson, Roger (1994), "Estimating the Size of a Population", Teaching Statistics, 16 (2 (Summer)): 50, doi:10.1111/j.1467-9639.1994.tb00688.x
{{citation}}
: External link in
(help)|journal=
- ^ Johnson, Roger (2006), "Estimating the Size of a Population" (PDF), Getting the Best from Teaching Statistics
- ^ Joyce Smart. German Tank Problem Logan High School cites Activity Based Statistics [by Richard L. Schaeffer] p. 148-150. Exploring Surveys and Information from Samples, [by James M. Landwehr] Section IX, p. 75–83. Statistical Reasoning, Gary Smith, p. 148-149
- ^ a b Ruggles; Brodie, Henry (1947), "An empirical approach to economic intelligence in WWII", Journal of the American Statistical Association, 42 (237), American Statistical Association: 72–91, doi:10.2307/2280189
{{citation}}
: More than one of|author1=
and|last=
specified (help); More than one of|number=
and|issue=
specified (help); Unknown parameter|month=
ignored (help) - ^ a b Gavyn Davies. How a statistical formula won the war The Guardian, 20 July 2006
- ^ Matthews, Robert (23 May 1998), "Data sleuths go to war, sidebar in feature 'Hidden truths'", New Scientist, archived from the original on 18 April 2001
- ^ Ruggles & Brodie, p. 89
- ^ Order Statistics, in Virtual Laboratories in Probability and Statistics
- ^ Ruggles & Brodie, pp. 82–83
- ^ Ruggles & Brodie, pp. 90–91
- ^ Volz, Arthur G. (2008), "A Soviet Estimate of German Tank Production", The Journal of Slavic Military Studies, 21 (3): 588–590, doi:10.1080/13518040802313902
{{citation}}
: Unknown parameter|month=
ignored (help) - ^ a b Jemima Kiss 10m iPhones sold this year? PDA TheDigitalContentBlog, The Guardian Media, 7 October 2008
- ^ Tommo_UK TMO Forum Index → Apple Finance Board, The Mac Observer Forums, 1 August 2008
- ^ Charles Arthur. Why iPhones are just like German tanks The Guardian online 8 October 2008
- General
- Goodman, L. A. (1954), "Some Practical Techniques in Serial Number Analysis", Journal of the American Statistical Association, 49 (265), American Statistical Association: 97–112, doi:10.2307/2281038