{{Short description|Mathematical problem}}
{{Use dmy dates|date=January 2020}}
[[File:Bundesarchiv Bild 101I-635-3966-27, Panzerfabrik in Deutschland.jpg|thumb|During [[World War II]], production of German tanks such as the [[Panther tank|Panther]] was accurately estimated by Allied intelligence using statistical methods.]]
In the [[statistical theory]] of [[estimation theory|estimation]], the '''German tank problem''' consists of estimating the maximum of a [[discrete uniform distribution]] from [[sampling without replacement]]. In simple terms, suppose there exists an unknown number of items which are sequentially numbered from 1 to ''N''. A random sample of these items is taken and their sequence numbers observed; the problem is to estimate ''N'' from these observed numbers.


The problem can be approached using either [[frequentist inference]] or [[Bayesian inference]], leading to different results. Estimating the population maximum based on a ''single'' sample yields divergent results, whereas estimation based on ''multiple'' samples is a practical estimation question whose answer is simple (especially in the frequentist setting) but not obvious (especially in the Bayesian setting).


The problem is named after its historical application by Allied forces in [[World War II]] to the estimation of the monthly rate of German tank production from very limited data. This exploited the manufacturing practice of assigning and attaching ascending sequences of serial numbers to tank components (chassis, gearbox, engine, wheels), with some of the tanks eventually being captured in battle by Allied forces.


== Suppositions ==
The adversary is presumed to have manufactured a series of tanks marked with consecutive whole numbers, beginning with serial number 1. Additionally, regardless of a tank's date of manufacture, history of service, or the serial number it bears, the distribution over serial numbers becoming revealed to analysis is uniform, up to the point in time when the analysis is conducted.


== Example ==
[[File:German tank problem graphs.svg|thumb|Estimated population size ''N''. The number of observations in the sample is ''k''. The largest sample serial number is ''m''. Frequentist analysis is shown with dotted lines. Bayesian analysis has solid yellow lines with mean and shading to show the range from the minimum possible value to the mean plus one standard deviation. The example shows that if four tanks are observed and the highest serial number is "60", frequentist analysis predicts 74, whereas Bayesian analysis predicts a mean of 88.5 and a standard deviation of 138.72&nbsp;−&nbsp;88.5&nbsp;=&nbsp;50.22, and a minimum of 60 tanks.
In [https://upload.wikimedia.org/wikipedia/commons/3/3e/German_tank_problem_graphs.svg the SVG file], hover over a graph to highlight it.]]


Assuming tanks are assigned sequential serial numbers starting with 1, suppose that four tanks are captured and that they have the serial numbers: 19, 40, 42 and 60.


A ''frequentist'' approach (using the [[minimum-variance unbiased estimator]]) predicts the total number of tanks produced will be:


:<math>N \approx 74</math>


A ''Bayesian'' approach (using a uniform [[prior distribution|prior]] over the integers in <math>[4,\Omega]</math> for any suitably large <math>\Omega</math>) predicts that the [[median]] number of tanks produced will be very similar to the frequentist prediction:


:<math>N_{med} \approx 74.5</math>


whereas the Bayesian [[mean]] predicts that the number of tanks produced would be:


:<math>N_{av} \approx 88.5 </math>


Let {{mvar|N}} equal the total number of tanks predicted to have been produced, {{mvar|m}} equal the highest serial number observed and {{mvar|k}} equal the number of tanks captured.


The frequentist prediction is calculated as:


:<math>N \approx m + \frac{m}{k} - 1=74 </math>


The Bayesian median is calculated as:


:<math>N_{med} \approx m + \frac{m \ln(2)}{k-1} =74.5 </math>


The Bayesian mean is calculated as:


:<math>N_{av} \approx (m - 1)\frac{k - 1}{k - 2} = 88.5 </math>


These Bayesian quantities are derived from the Bayesian posterior distribution:


:<math>\Pr(N=n) = \begin{cases}
0 &\text{if } n < m \\
\frac {k - 1}{k}\frac{\binom{m - 1}{k - 1}}{\binom n k} &\text{if } n \ge m,
\end{cases}</math>


This [[probability mass function]] has a positive [[skewness]], related to the fact that there are at least 60 tanks. Because of this skewness, the mean may not be the most meaningful estimate. The [[median]] in this example is 74.5, in close agreement with the frequentist formula. Using [[Stirling's approximation]], the posterior may be approximated by an exponentially decaying function of ''n'',


:<math>\Pr(N=n) \approx \begin{cases}
0 &\text{if } n < m \\
(k-1)m^{k-1}n^{-k} &\text{if } n \ge m,
\end{cases}</math>


which results in the following approximation for the median:


:<math>N_{med} \approx m + \frac{m \ln(2)}{k-1} </math>


and the following approximations for the mean and standard deviation:


:<math>\begin{align}
N &\approx \mu \pm \sigma = 89 \pm 50, \\[5pt]
\mu &= (m - 1)\frac{k - 1}{k - 2}, \\[5pt]
\sigma &= \sqrt{\frac{(k-1)(m-1)(m-k+1)}{(k-3)(k-2)^2}}.
\end{align}</math>
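
These formulas can be checked numerically. The following sketch (illustrative only; the variable names and layout are ours, not from the article's sources) evaluates them for the example sample 19, 40, 42, 60:

<syntaxhighlight lang="python">
# Illustrative sketch: point estimates for the sample {19, 40, 42, 60}.
import math

sample = [19, 40, 42, 60]
k = len(sample)   # number of tanks observed (sample size)
m = max(sample)   # largest serial number observed (sample maximum)

mvue = m * (1 + 1 / k) - 1                     # frequentist MVUE
median_approx = m + m * math.log(2) / (k - 1)  # approximation to the Bayesian median
mean = (m - 1) * (k - 1) / (k - 2)             # Bayesian posterior mean (k >= 3)
sd = math.sqrt((k - 1) * (m - 1) * (m - k + 1)
               / ((k - 3) * (k - 2) ** 2))     # posterior standard deviation (k >= 4)

print(mvue, median_approx, mean, sd)
# 74.0 73.86... 88.5 50.22... (the exact posterior median is 74.5)
</syntaxhighlight>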


== Historical example of the problem ==
[[File:Bundesarchiv Bild 183-H26258, Panzer V "Panther".jpg|thumb|Panther tanks are loaded for transport to frontline units, 1943.]]
During the course of the Second World War, the [[Allies of World War II|Western Allies]] made sustained efforts to determine the extent of German production and approached this in two major ways: conventional intelligence gathering and statistical estimation. In many cases, statistical analysis substantially improved on conventional intelligence. In some cases, conventional intelligence was used in conjunction with statistical methods, as was the case in estimation of [[German tanks in World War II|Panther tank]] production just prior to [[D-Day]].


The allied command structure had thought the [[Panzer V]] (Panther) tanks seen in Italy, with their high velocity, long-barreled 75&nbsp;mm/L70 guns, were unusual heavy tanks and would only be seen in northern France in small numbers, much the same way as the [[Tiger I]] was seen in Tunisia. The US Army was confident that the [[Sherman tank]] would continue to perform well, as it had versus the [[Panzer III]] and [[Panzer IV]] tanks in North Africa and Sicily.{{refn|An Armored Ground Forces policy statement of November&nbsp;1943 concluded: "The recommendation of a limited proportion of tanks carrying a 90&nbsp;mm gun is not concurred in for the following reasons: The M4 tank has been hailed widely as the best tank of the battlefield today. ... There appears to be no fear on the part of our forces of the German Mark&nbsp;VI (Tiger) tank. There can be no basis for the T26 tank other than the conception of a tank-vs.-tank duel – which is believed to be unsound and unnecessary."<ref>AGF policy statement. Chief of staff AGF. November 1943. MHI</ref>|group=lower-alpha}} Shortly before D-Day, rumors indicated that large numbers of Panzer V tanks were being used.


To determine whether this was true, the Allies attempted to estimate the number of tanks being produced. To do this, they used the serial numbers on captured or destroyed tanks. The principal numbers used were gearbox numbers, as these fell in two unbroken sequences. Chassis and engine numbers were also used, though their use was more complicated. Various other components were used to cross-check the analysis. Similar analyses were done on wheels, which were observed to be sequentially numbered (i.e., 1,&nbsp;2,&nbsp;3,&nbsp;...,&nbsp;''N'').{{sfn|Ruggles|Brodie|1947|pp=73–74}}{{efn|The lower bound was unknown, but to simplify the discussion, this detail is generally omitted, taking the lower bound as known to be 1.}}<ref name=Davies-2006-07-20>{{cite web
|title=Gavyn Davies does the maths – How a statistical formula won the war
|date=20 July 2006
|url=https://www.theguardian.com/world/2006/jul/20/secondworldwar.tvandradio
|website=[[The Guardian]]
|access-date=6 July 2014
}}</ref><ref>{{citation
|title=Data sleuths go to war, sidebar in feature "Hidden truths"
|date=23 May 1998
|journal=[[New Scientist]]
|last=Matthews
|first=Robert
|author-link=Robert Matthews (scientist)
|url=https://www.newscientist.com/article/mg15821355.000-hidden-truths.html
|archive-url=https://web.archive.org/web/20010418025817/http://www.newscientist.com/ns/980523/features.html#data
|archive-date=18 April 2001
}}</ref>

The analysis of tank wheels yielded an estimate for the number of wheel molds that were in use. A discussion with British road wheel makers then estimated the number of wheels that could be produced from this many molds, which yielded the number of tanks that were being produced each month. Analysis of wheels from two tanks (32 road wheels each, 64 road wheels total) yielded an estimate of 270 tanks produced in February 1944, substantially more than had previously been suspected.<ref name="Carruthers">{{cite book
|author=Bob Carruthers
|title=Panther V in Combat
|url=https://books.google.com/books?id=99JRxKz4Da4C&pg=PT94
|publisher=Coda Books
|isbn=978-1-908538-15-4
|pages=94–
|date=2012-03-01
}}</ref>

German records after the war showed production for the month of February&nbsp;1944 was 276.{{sfn|Ruggles|Brodie|1947|pp=82–83}}{{refn|Ruggles & Brodie is largely a practical analysis and summary, not a mathematical one – the estimation problem is only mentioned in footnote&nbsp;3 on page&nbsp;82, where they estimate the maximum as "sample maximum + average gap".|group=lower-alpha}} The statistical approach proved to be far more accurate than conventional intelligence methods, and the phrase "German tank problem" became accepted as a descriptor for this type of statistical analysis.

Estimating production was not the only use of this serial-number analysis. It was also used to understand German production more generally, including number of factories, relative importance of factories, length of supply chain (based on lag between production and use), changes in production, and use of resources such as rubber.

=== Specific data ===
According to conventional Allied intelligence estimates, the Germans were producing around 1,400&nbsp;tanks a month between June&nbsp;1940 and September&nbsp;1942. Applying the formula below to the serial numbers of captured tanks, the number was calculated to be 246 a month. After the war, captured German production figures from the ministry of [[Albert Speer]] showed the actual number to be 245.<ref name="Davies-2006-07-20" />

Estimates for some specific months are given as:{{sfn|Ruggles|Brodie|1947|p=89}}
{| class="wikitable"
|- align=center
! Month !! Statistical estimate !! Intelligence estimate !! German records
|- align=center
| June 1940 || 169 || 1,000 || 122
|- align=center
| June 1941 || 244 || 1,550 || 271
|- align=center
| August 1942 || 327 || 1,550 || 342
|}

=== Similar analyses ===
[[File:Fusée V2.jpg|thumb|upright|[[V-2]] rocket production was accurately estimated by statistical methods.]]
Similar serial-number analysis was used for other military equipment during World War&nbsp;II, most successfully for the [[V-2]] rocket.{{sfn|Ruggles|Brodie|1947|pp=90–91}}

Factory markings on Soviet military equipment were analyzed during the [[Korean War]], and by German intelligence during World War&nbsp;II.{{sfn|Volz|2008}}

In the 1980s, some Americans were given access to the production line of Israel's [[Merkava]] tanks. The production numbers were classified, but the tanks had serial numbers, allowing estimation of production.{{sfn|Johnson|1994}}

The formula has been used in non-military contexts, for example to estimate the number of [[Commodore 64]] computers built, where the result (12.5&nbsp;million) matches the low-end estimates.<ref name="pagetable.com">{{cite web
|title=How many Commodore&nbsp;64 computers were really sold?
|date=1 February 2011
|url=http://www.pagetable.com/?p=547
|website=pagetable.com
|access-date=6 July 2014
|archive-url=https://web.archive.org/web/20160306232450/http://www.pagetable.com/?p=547
|archive-date=6 March 2016
}}</ref>

== Countermeasures ==
{{unreferenced section|date=January 2013}}
To confound serial-number analysis, serial numbers can be excluded, or usable auxiliary information reduced. Alternatively, serial numbers that resist cryptanalysis can be used, most effectively by randomly choosing numbers without replacement from a list that is much larger than the number of objects produced, or by producing random numbers and checking them against the list of already assigned numbers; collisions are likely to occur unless the number of digits possible is more than twice the number of digits in the number of objects produced (where the serial number can be in any base); see [[birthday problem]].{{efn|As discussed in [[birthday attack]], one can expect a collision after 1.25{{radic|''H''}} numbers, if choosing from ''H'' possible outputs. This square root corresponds to half the digits. For example, in any base, the square root of a number with 100 digits is approximately a number with 50 digits.}} For this, a [[cryptographically secure pseudorandom number generator]] may be used. All these methods require a lookup table (or breaking the cypher) to back out from serial number to production order, which complicates use of serial numbers: a range of serial numbers cannot be recalled, for instance, but each must be looked up individually, or a list generated.
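
As a concrete illustration of the random-assignment approach, the following sketch (ours; the 12-digit number space is an arbitrary assumption for demonstration) draws serial numbers at random and rejects collisions against the list of already assigned numbers:

<syntaxhighlight lang="python">
# Sketch: collision-checked random serial numbers from a space much larger than production.
import secrets

assigned = set()  # the lookup table needed later to map serials back to production order

def next_serial(digits=12):  # 10**12 possibilities, far more than objects produced
    while True:
        s = secrets.randbelow(10 ** digits)
        if s not in assigned:        # reject collisions with already assigned numbers
            assigned.add(s)
            return f"{s:0{digits}d}"

print(next_serial(), next_serial())
</syntaxhighlight>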

Alternatively, sequential serial numbers can be encrypted with a simple [[substitution cipher]], which allows easy decoding, but is also easily broken by [[frequency analysis]]: even if starting from an arbitrary point, the plaintext has [[Benford's law|a pattern]] (namely, numbers are in sequence). One example is given in [[Ken Follett]]'s novel ''[[Code to Zero]]'', where the encryption of the [[Jupiter-C]] rocket serial numbers is given by:
{| class="wikitable" style="margin:1em auto;"
|- align=center
! H !! U !! N !! T !! S !! V !! I !! L !! E !! X
|- align=center
| 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 0
|}

The code word here is [[Huntsville, Alabama|Huntsville]] (with repeated letters omitted), padded with "X", to get a 10-letter key.<ref>{{Cite web|url=https://www.spaceline.org/rocketsum/jupiter-c.html|title=Rockets and Missiles|website=www.spaceline.org}}</ref> The rocket number&nbsp;13 was therefore "HN", and the rocket number&nbsp;24 was "UT".

== Frequentist analysis ==
=== Minimum-variance unbiased estimator ===
For [[point estimation]] (estimating a single value for the total, <math>\widehat{N}</math>), the [[minimum-variance unbiased estimator]] (MVUE, or UMVU estimator) is given by:{{efn|In a continuous distribution, there is no −1 term.}}

:<math>\widehat{N} = m(1 + k^{-1}) - 1,</math>

where ''m'' is the largest serial number observed ([[sample maximum]]) and ''k'' is the number of tanks observed ([[sample size]]).{{sfn|Johnson|1994}}<ref>{{cite web
|last1=Smart
|first1=Joyce
|title=German Tank Problem
|url=http://www.lhs.logan.k12.ut.us/~jsmart/tank.htm
|publisher=[[Logan High School (Utah)|Logan High School]]
|access-date=8 July 2014
|archive-url=https://web.archive.org/web/20120424231135/http://www.lhs.logan.k12.ut.us/~jsmart/tank.htm
|archive-date=2012-04-24 <!-- 231135 -->
}}</ref> Note that once a serial number has been observed, it is no longer in the pool and will not be observed again.

This has a variance{{sfn|Johnson|1994}}

:<math> \operatorname{var}\left(\widehat{N}\right) = \frac{1}{k}\frac{(N-k)(N+1)}{(k+2)} \approx \frac{N^2}{k^2} \text{ for small samples } k \ll N,</math>

so the [[standard deviation]] is approximately ''N''/''k'', the expected size of the gap between sorted observations in the sample.

The formula may be understood intuitively as the sample maximum plus the average gap between observations in the sample, the sample maximum being chosen as the initial estimator, due to being the [[maximum likelihood estimator]],{{efn|Given a particular set of observations, this set is most likely to occur if the population maximum is the sample maximum, not a higher value (it cannot be lower).}} with the gap being added to compensate for the negative bias of the sample maximum as an estimator for the population maximum,{{efn|The sample maximum is never more than the population maximum, but can be less, hence it is a [[biased estimator]]: it will tend to ''underestimate'' the population maximum.}} and written as

:<math>\widehat{N} = m + \frac{m - k}{k}= m + mk^{-1} - 1 = m(1 + k^{-1}) - 1.</math>

This can be visualized by imagining that the observations in the sample are evenly spaced throughout the range, with additional observations just outside the range at 0 and ''N''&nbsp;+&nbsp;1. If starting with an initial gap between 0 and the lowest observation in the sample (the sample minimum), the average gap between consecutive observations in the sample is <math>(m - k)/k</math>; the <math>-k</math> being because the observations themselves are not counted in computing the gap between observations.{{efn|1=For example, the gap between 2 and 7 is (7&nbsp;−&nbsp;2)&nbsp;−&nbsp;1&nbsp;=&nbsp;4, consisting of 3, 4, 5, and&nbsp;6.}} A derivation of the expected value and the variance of the sample maximum is given at [[discrete uniform distribution]].
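
A minimal simulation (a sketch under arbitrary assumed values ''N'' = 1000 and ''k'' = 5, not from the sources) illustrates the negative bias of the sample maximum and how the correction removes it:

<syntaxhighlight lang="python">
# Sketch: compare the sample maximum with the corrected estimator m(1 + 1/k) - 1.
import random

N, k, trials = 1000, 5, 100_000  # assumed values for illustration
total_max = total_mvue = 0.0
for _ in range(trials):
    m = max(random.sample(range(1, N + 1), k))  # sampling without replacement
    total_max += m
    total_mvue += m * (1 + 1 / k) - 1

print(total_max / trials)   # ≈ k(N+1)/(k+1) ≈ 834: biased low
print(total_mvue / trials)  # ≈ 1000: unbiased
</syntaxhighlight>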


This philosophy is formalized and generalized in the method of [[maximum spacing estimation]]; a similar heuristic is used for [[plotting position]] in a [[Q–Q plot]], plotting sample points at {{math|''k'' / (''n'' + 1)}}, which is evenly spaced on the uniform distribution, with a gap at the end.


=== Confidence intervals ===
Instead of, or in addition to, ''point'' estimation, [[interval estimation|''interval'' estimation]] can be carried out, such as [[confidence interval]]s.
These are easily computed, based on the observation that the probability that ''k'' observations in the sample will fall in an interval covering ''p'' of the range (0&nbsp;≤&nbsp;''p''&nbsp;≤&nbsp;1) is ''p''<sup>''k''</sup> (assuming in this section that draws are ''with'' replacement, to simplify computations; if draws are without replacement, this overstates the likelihood, and intervals will be overly conservative).

Thus the [[sampling distribution]] of the quantile of the sample maximum is the graph ''x''<sup>1/''k''</sup> from 0 to 1: the ''p''-th to ''q''-th quantiles of the sample maximum ''m'' are the interval [''p''<sup>1/''k''</sup>''N'',&nbsp;''q''<sup>1/''k''</sup>''N'']. Inverting this yields the corresponding confidence interval for the population maximum of [''m''/''q''<sup>1/''k''</sup>,&nbsp;''m''/''p''<sup>1/''k''</sup>].


For example, taking the symmetric 95% interval ''p'' = 2.5% and ''q'' = 97.5% for ''k'' = 5 yields <math>0.025^{1/5} \approx 0.48,\ 0.975^{1/5} \approx 0.995</math>, so the confidence interval is approximately <math>[1.005m, 2.08m]</math>. The lower bound is very close to ''m'', thus more informative is the asymmetric confidence interval from ''p'' = 5% to 100%; for ''k'' = 5 this yields <math>0.05^{1/5} \approx 0.55</math> and the interval [''m'',&nbsp;1.82''m''].


More generally, the (downward biased) 95% confidence interval is [''m'', ''m''/0.05<sup>1/''k''</sup>] = [''m'', ''m''·20<sup>1/''k''</sup>]. For a range of ''k'' values, with the UMVU point estimator (plus 1 for legibility) for reference, this yields:
{| class="wikitable"
! ''k'' !! Point estimate !! Confidence interval
|-
| 1 || 2''m'' || [''m'', 20''m'']
|-
| 2 || 1.5''m'' || [''m'', 4.5''m'']
|-
| 5 || 1.2''m'' || [''m'', 1.82''m'']
|-
| 10 || 1.1''m'' || [''m'', 1.35''m'']
|-
| 20 || 1.05''m'' || [''m'', 1.16''m'']
|}
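
The upper factors in the table follow directly from the formula; a short check (an illustrative sketch) reproduces them:

<syntaxhighlight lang="python">
# Sketch: upper factor 20**(1/k) of the one-sided 95% confidence interval [m, m * 20**(1/k)].
for k in (1, 2, 5, 10, 20):
    print(k, round(20 ** (1 / k), 2))  # 20.0, 4.47, 1.82, 1.35, 1.16
</syntaxhighlight>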

Immediate observations are:
* For small sample sizes, the confidence interval is very wide, reflecting great uncertainty in the estimate.
* The range shrinks rapidly, reflecting the exponentially decaying probability that ''all'' observations in the sample will be significantly below the maximum.
* The confidence interval exhibits positive skew, as ''N'' can never be below the sample maximum, but can potentially be arbitrarily high above it.


Note that ''m''/''k'' cannot be used naively (or rather (''m''&nbsp;+&nbsp;''m''/''k''&nbsp;&minus;&nbsp;1)/''k'') as an estimate of the [[standard error (statistics)|standard error]] ''SE'', as the standard error of an estimator is based on the ''population'' maximum (a parameter), and using an estimate to estimate the error in that very estimate is [[circular reasoning]]. Further, because the sampling distribution of the sample maximum is ''not'' normal (and hence the distribution of the estimator is not normal), one cannot use the [[68-95-99.7 rule]] to conclude that an interval of <math>\hat N \pm 2\widehat{SE}</math> is a 95% confidence interval, as demonstrated in the table above.


== Bayesian analysis ==
The Bayesian approach to the German tank problem<ref>{{cite journal |last1=Simon |first1=Cory |title=A Bayesian Treatment of the German Tank Problem |journal=The Mathematical Intelligencer |date=2023 |volume=46 |issue=2 |pages=117–127 |doi=10.1007/s00283-023-10274-6 |doi-access=free |pmid=38841650 |pmc=11147940 |arxiv=2301.00046 }}</ref> is to consider the posterior probability <math>(N=n\mid M=m, K=k)</math> that the number of enemy tanks <math>N</math> is <math>n</math>, when the number of observed tanks <math>K</math> is <math>k</math>, and the maximum observed serial number <math>M</math> is <math>m</math>.


The answer to this problem depends on the choice of prior for <math>N</math>. One can proceed using a proper prior over the positive integers, e.g., the Poisson or Negative Binomial distribution, where a closed formula for the posterior mean and posterior variance can be obtained.<ref>{{Cite journal|last1=Höhle|first1=M.|last2=Held|first2=L.|date=2006|title=Bayesian Estimation of the Size of a Population|url=https://epub.ub.uni-muenchen.de/2094/1/paper_499.pdf|journal=Technical Report SFB 386, No. 399, Department of Statistics, University of Munich|access-date=2016-04-17}}</ref> Below, we will instead adopt a bounded uniform prior.


For brevity, in what follows, <math>(N=n\mid M=m,K=k)</math> is written <math>(n\mid m,k)</math>.


=== Conditional probability ===
The rule for [[conditional probability]] gives
:<math>(n\mid m,k)(m\mid k) = (m\mid n,k)(n\mid k)= (m,n\mid k)</math>


=== Probability of ''M'' knowing ''N'' and ''K'' ===
The expression
:<math> (m \mid n,k)=(M=m \mid N=n,K=k)</math>
is the conditional probability that the maximum serial number observed, <math>M</math>, is equal to <math>m</math>, when the number of enemy tanks, <math>N</math>, is known to be equal to <math>n</math>, and the number of enemy tanks observed, <math>K</math>, is known to be equal to <math>k</math>.


It is
:<math>
(m\mid n,k) =
\binom{m - 1}{k - 1}\binom{n}{k}^{-1}[k\le m][m\le n]
</math>
where <math>\binom n k</math> is a [[binomial coefficient]] and <math>[k\le n]</math> is an [[Iverson bracket]].


The expression can be derived as follows:
<math>(m\mid n,k)</math> answers the question: "What is the probability of a specific serial number <math>m</math> being the highest number observed in a sample of <math>k</math> tanks, given there are <math>n</math> tanks in total?"


One can think of the sample of size <math>k</math> to be the result of <math>k</math> individual draws without replacement. Assume <math>m</math> is observed on draw number <math>d</math>. The probability of this occurring is:
<math display="block">
\underbrace{ \frac{m-1}{n}\cdot \frac{m-2}{n-1}\cdot \frac{m-3}{n-2}\cdots \frac{m-d+1}{n-d+2}}_{d-1 \text{ times}} \cdot \underbrace{\frac{1}{n-d+1}}_{\text{draw no. }d} \cdot \underbrace{\frac{m-d}{n-d}\cdot \frac{m-d-1}{n-d-1}\cdots \frac{m-d-(k-d-1)}{n-d-(k-d-1)}}_{k-d\text{ times}} = \frac{(n-k)!}{n!} \cdot \frac{(m-1)!}{(m-k)!}.
</math>


As can be seen from the right-hand side, this expression is independent of <math>d</math> and therefore the same for each <math>d\leq k</math>. As <math>m</math> can be drawn on <math>k</math> different draws, the probability of any specific <math>m</math> being the largest one observed is <math>k</math> times the above probability:
<math display="block">
(m\mid n,k) =
k\cdot \frac{(n-k)!}{n!}\cdot \frac{(m-1)!}{(m-k)!} = \binom{m - 1}{k - 1}\binom{n}{k}^{-1}.
</math>
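
Because the sample space is small, the result can also be checked by brute-force enumeration. The sketch below (ours; the values ''n'' = 10, ''k'' = 3 are arbitrary assumptions for demonstration) confirms the formula:

<syntaxhighlight lang="python">
# Sketch: verify P(M = m | N = n, K = k) = C(m-1, k-1) / C(n, k) by enumeration.
from itertools import combinations
from math import comb

n, k = 10, 3  # arbitrary small values for illustration
counts = {}
for draw in combinations(range(1, n + 1), k):  # all C(n, k) equally likely samples
    m = max(draw)
    counts[m] = counts.get(m, 0) + 1

for m, c in sorted(counts.items()):
    assert c == comb(m - 1, k - 1)  # the numerator of the probability above
</syntaxhighlight>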


=== Probability of ''M'' knowing only ''K'' ===
The expression <math>(m\mid k)=(M=m\mid K=k)</math> is the probability that the maximum serial number is equal to <math>m</math> once <math>k</math> tanks have been observed but before the serial numbers have actually been observed.


The expression <math>(m\mid k)</math> can be re-written in terms of the other quantities by marginalizing over all possible <math>n</math>.
:<math>\begin{align}
(m\mid k)
&=\sum_{n=0}^\infty(m,n\mid k) \\
&=\sum_{n=0}^\infty(m\mid n,k)(n\mid k)
\end{align}</math>


=== Prior probability of ''N'' knowing only ''K'' ===
We assume that <math>k</math> is fixed in advance so that we do not have to consider any distribution over <math>k</math>. Thus, our prior can depend on <math>k</math>.


The expression
:<math>(n\mid k)=(N=n\mid K=k)</math>
is the credibility that the total number of tanks, <math>N</math>, is equal to <math>n</math> when the number <math>K</math> tanks observed is known to be <math>k</math>, but before the serial numbers have been observed. Assume that it is some [[discrete uniform distribution]]
:<math> (n\mid k) = (\Omega - k)^{-1}[k \le n][n < \Omega]</math>


The upper limit <math>\Omega</math> must be finite, because the function
:<math>f(n) =\lim_{\Omega\rightarrow\infty}(\Omega - k)^{-1}[k \le n][n < \Omega]=0</math>
is not a [[probability mass function]]. Our result below will not depend on <math>\Omega</math>.


=== Posterior probability of ''N'' knowing ''M'' and ''K'' ===
Provided that <math>\Omega > m</math>, so that the prior is consistent with the observed data:
:<math> (n\mid m,k) = (m\mid n,k)\left(\sum_{n=m}^{\Omega - 1} (m\mid n,k)\right)^{-1}[m \le n][n < \Omega]</math>


As <math>\Omega \rightarrow \infty</math>, the summation approaches <math>\sum_{n=m}^\infty(m\mid n,k)</math> (which is finite if ''k''&nbsp;≥&nbsp;2). Thus, for suitably large <math>\Omega</math>, we have


:<math> (n\mid m,k) \approx (m\mid n,k)\left(\sum_{n=m}^{\infty} (m\mid n,k)\right)^{-1}[m \le n ]</math>
For ''k''&nbsp;≥&nbsp;1 the [[mode (statistics)|mode]] of the distribution of the number of enemy tanks is ''m''.


For ''k''&nbsp;≥&nbsp;2, the credibility that the number of enemy tanks is ''equal to'' <math>n</math>, is
:<math>
(N=n\mid m,k) = (k - 1)\binom{m - 1}{k - 1}k^{-1}\binom n k^{-1}[m \le n ]
</math>


The credibility that the number of enemy tanks, ''N'', is ''greater than n'', is
:<math>
(N>n\mid m,k)= \begin{cases}
1 &\text{if } n < m \\
\frac {\binom{m - 1}{k - 1}}{\binom n {k - 1}} &\text{if } n \ge m
\end{cases}
</math>
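
These posterior formulas can be evaluated directly. The sketch below (ours, not from the cited sources) computes the posterior mean and median for the running example ''m'' = 60, ''k'' = 4:

<syntaxhighlight lang="python">
# Sketch: posterior summaries for m = 60, k = 4 from the formulas above.
from math import comb

m, k = 60, 4
pmf = lambda n: (k - 1) / k * comb(m - 1, k - 1) / comb(n, k)  # Pr(N = n | m, k), n >= m
tail = lambda n: comb(m - 1, k - 1) / comb(n, k - 1)           # Pr(N > n | m, k), n >= m

mean = sum(n * pmf(n) for n in range(m, 100_000))  # truncated series; the tail is negligible
median = next(n for n in range(m, 10_000) if tail(n) <= 0.5)
print(round(mean, 1), median)
# 88.5 75 (the half-point lies between 74 and 75, reported above as 74.5)
</syntaxhighlight>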


=== Mean value and standard deviation ===
For ''k''&nbsp;≥&nbsp;3, ''N'' has the finite [[mean value]]:
:<math>(m - 1)(k - 1)(k - 2)^{-1}</math>


For ''k''&nbsp;≥&nbsp;4, ''N'' has the finite [[standard deviation]]:
:<math>(k - 1)^{1/2}(k - 2)^{-1}(k - 3)^{-1/2}(m - 1)^{1/2}(m + 1 - k)^{1/2}</math>
These formulas are derived below.


=== Summation formula ===
The following [[Binomial coefficient#Identities involving binomial coefficients|binomial coefficient identity]] is used below for simplifying [[series (mathematics)|series]] relating to the German Tank Problem.
:<math>\sum_{n=m}^\infty \frac 1 {\binom n k} = \frac k{k - 1}\frac 1 {\binom{m - 1}{k - 1}}</math>


This sum formula is somewhat analogous to the integral formula
===Likelihood function===
:<math>\int_{n=m}^\infty \frac {dn}{n^k} = \frac 1{k - 1}\frac 1{m^{k - 1}}</math>
Consider a jar containing ''N'' lottery tickets numbered from 1 through ''N''. (Or ''N'' tanks numbered from 1 through ''N''). Take a sample of ''k'' tickets. Let the sample maximum be ''m''.


These formulas apply for ''k''&nbsp;>&nbsp;1.
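The identity is easy to sanity-check numerically; a short sketch (Python; ''m''&nbsp;=&nbsp;12, ''k''&nbsp;=&nbsp;4 are arbitrary):

<syntaxhighlight lang="python">
from math import comb

m, k = 12, 4
lhs = sum(1 / comb(n, k) for n in range(m, 200_000))   # truncated series
rhs = k / ((k - 1) * comb(m - 1, k - 1))
print(round(lhs, 8), round(rhs, 8))   # both 0.00808081: agreement to precision shown
</syntaxhighlight>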
The likelihood functions, interpreted as discrete distributions, are the [[heavy-tailed distribution]]s


=== One tank ===
Observing one tank randomly out of a population of ''n'' tanks gives the serial number ''m'' with probability 1/''n'' for ''m''&nbsp;≤&nbsp;''n'', and zero probability for ''m''&nbsp;>&nbsp;''n''. Using [[Iverson bracket]] notation this is written
:<math>(M=m\mid N=n,K=1) = (m\mid n) = \frac{[m \le n]}{n}</math>

This is the conditional probability mass distribution function of <math>m</math>.

When considered a function of ''n'' for fixed ''m'' this is a likelihood function:
:<math>\mathcal{L}(n) = \frac{[n\ge m]}{n}</math>

The [[maximum likelihood]] estimate for the total number of tanks is ''N''<sub>0</sub>&nbsp;=&nbsp;''m'', clearly a biased estimate, since the true number can be more than this, potentially many more, but cannot be fewer.

The marginal likelihood (i.e. marginalized over all models) is [[Infinity|infinite]], being a tail of the [[harmonic series (mathematics)|harmonic series]]:
:<math>\sum_n \mathcal{L}(n) = \sum_{n=m}^\infty \frac{1}{n} = \infty</math>

but
:<math>\begin{align}
\sum_n \mathcal{L}(n)[n < \Omega]
&= \sum_{n=m}^{\Omega - 1} \frac{1}{n} \\[5pt]
&= H_{\Omega-1} - H_{m - 1}
\end{align}</math>

where <math>H_n</math> is the [[harmonic number]].

The credibility mass distribution function depends on the prior limit <math>\Omega</math>:
:<math>\begin{align}
&(N=n\mid M=m,K=1) \\[5pt]
= {} &(n\mid m) = \frac{[m\le n]}{n} \frac{[n<\Omega]}{H_{\Omega - 1} - H_{m - 1}}
\end{align}</math>

The mean value of <math>N</math> is
:<math>\begin{align}
\sum_n n\cdot(n\mid m) &= \sum_{n=m}^{\Omega - 1} \frac{1}{H_{\Omega - 1} - H_{m - 1}} \\[5pt]
&= \frac{\Omega - m}{H_{\Omega - 1} - H_{m - 1}} \\[5pt]
&\approx \frac{\Omega - m}{\log\left(\frac{\Omega - 1}{m - 1} \right)}
\end{align}</math>
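A short sketch evaluating this single-observation posterior mean for a given prior bound (Python; ''m''&nbsp;=&nbsp;60 and Ω&nbsp;=&nbsp;1000 are illustrative choices):

<syntaxhighlight lang="python">
from math import log

m, omega = 60, 1000        # observed serial number, prior upper limit

# Posterior (n | m) is proportional to 1/n, normalized over m <= n < omega.
norm = sum(1 / n for n in range(m, omega))             # H_{omega-1} - H_{m-1}
mean = sum(n * (1 / n) / norm for n in range(m, omega))

print(round(mean, 2))                                      # about 333
print(round((omega - m) / log((omega - 1) / (m - 1)), 2))  # about 332, log approximation
</syntaxhighlight>

Note how strongly the answer depends on the prior bound Ω: with only one observation, the data barely constrain the estimate.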


=== Two tanks ===
If two tanks rather than one are observed, then the probability that the larger of the two observed serial numbers is equal to ''m'' is
:<math>(M=m\mid N=n,K=2) = (m\mid n) = [m \le n]\frac{m - 1}{\binom{n}{2}}</math>

When considered a function of ''n'' for fixed ''m'' this is a [[likelihood function]]:
:<math>\mathcal{L}(n) = [n \ge m]\frac{m - 1}{\binom{n}{2}}</math>

The total likelihood is
:<math>\begin{align}
\sum_{n}\mathcal{L}(n) &= \frac{m - 1}{1} \sum_{n=m}^\infty \frac{1}{\binom n 2} \\[4pt]
&= \frac{m - 1}{1} \cdot \frac{2}{2 - 1} \cdot \frac{1}{\binom {m - 1}{2 - 1}} \\[4pt]
&= 2
\end{align}</math>

and the credibility mass distribution function is
:<math>\begin{align}
&(N=n\mid M=m,K=2) \\[4pt]
= {} &(n\mid m) \\[4pt]
= {} &\frac{\mathcal{L}(n)}{\sum_n \mathcal{L}(n)} \\[4pt]
= {} &[n \ge m]\frac{m - 1}{n(n - 1)}
\end{align}</math>

The [[median]] <math>\tilde{N}</math> satisfies
:<math>\sum_n [n \ge \tilde{N}](n\mid m) = \frac{1}{2}</math>

so
:<math>\frac{m - 1}{\tilde N - 1} = \frac{1}{2}</math>

and so the median is
:<math>\tilde{N} = 2m - 1</math>

but the mean value of <math>N</math> is infinite:
:<math>\mu = \sum_n n \cdot (n\mid m) = \frac{m - 1}1\sum_{n=m}^\infty \frac{1}{n - 1} = \infty</math>
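A numerical sketch of the two-tank posterior (Python; ''m''&nbsp;=&nbsp;60 is illustrative): the tail credibility at 2''m''&nbsp;&minus;&nbsp;1 is one half, while the truncated mean keeps growing with the cutoff, reflecting the infinite expectation.

<syntaxhighlight lang="python">
m = 60                                    # larger of the two observed serials

def pmf(n):                               # (n | m) = (m-1)/(n(n-1)) for n >= m
    return (m - 1) / (n * (n - 1)) if n >= m else 0.0

median = 2 * m - 1
print(sum(pmf(n) for n in range(median, 10**6)))   # ~0.5: half the mass lies at n >= 2m-1

for cutoff in (10**3, 10**4, 10**5):
    # Partial means grow like 59 * log(cutoff): about 167, 303, 439.
    print(cutoff, sum(n * pmf(n) for n in range(m, cutoff)))
</syntaxhighlight>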


=== Many tanks ===

==== Credibility mass distribution function ====
The conditional probability that the largest of ''k'' observations taken from the serial numbers {1,&nbsp;...,&nbsp;''n''} is equal to ''m'' is
:<math>\begin{align}
&(M=m\mid N=n,K=k\ge 2) \\
= {} &(m\mid n,k) \\
= {} &[m\le n]\frac{\binom{m - 1}{k - 1}}{\binom{n}{k}}
\end{align}</math>

The likelihood function of ''n'' is the same expression:
:<math>\mathcal{L}(n) = [n \ge m]\frac{\binom{m - 1}{k - 1}}{\binom{n}{k}}</math>

The total likelihood is finite for ''k''&nbsp;≥&nbsp;2:
:<math>\begin{align}
\sum_n \mathcal{L}(n)
&= \frac{\binom{m - 1}{k - 1}}{1} \sum_{n=m}^\infty {1 \over \binom n k} \\
&= \frac{\binom{m - 1}{k - 1}}{1} \cdot \frac{k}{k-1} \cdot \frac{1}{\binom{m - 1}{k - 1}} \\
&= \frac k{k - 1}
\end{align}</math>


The credibility mass distribution function is
:<math>\begin{align}
&(N=n\mid M=m,K=k \ge 2) = (n\mid m,k) \\
= {} &\frac{\mathcal{L}(n)}{\sum_n \mathcal{L}(n)} \\
= {} &[n\ge m]\frac{k-1}{k} \frac{\binom{m - 1}{k - 1}}{\binom n k} \\
= {} &[n\ge m]\frac{m-1}{n} \frac{\binom{m - 2}{k - 2}}{\binom{n - 1}{k - 1}} \\
= {} &[n\ge m]\frac{m-1}{n} \frac{m - 2}{n - 1} \frac{k - 1}{k - 2} \frac{\binom{m - 3}{k - 3}}{\binom{n-2}{k-2}}
\end{align}</math>


The [[complementary cumulative distribution function]] is the credibility that ''N''&nbsp;>&nbsp;''x'':
:<math>\begin{align}
&(N>x\mid M=m,K=k) \\[4pt]
= {} &\begin{cases}
1 &\text{if }x < m \\
\sum_{n=x+1}^\infty (n\mid m,k) &\text{if }x \ge m
\end{cases} \\
= {} &[x<m] + [x \ge m]\sum_{n=x+1}^\infty \frac{k - 1}{k}\frac{\binom{m - 1}{k - 1}}{\binom{n}{k}} \\[4pt]
= {} &[x<m] + [x \ge m]\frac{k - 1}{k} \frac{\binom{m - 1}{k - 1}}{1} \sum_{n=x+1}^\infty \frac{1}{\binom{n}{k}} \\[4pt]
= {} &[x<m] + [x \ge m]\frac{k - 1}{k} \frac{\binom{m - 1}{k - 1}}{1} \cdot \frac{k}{k - 1} \frac{1}{\binom{x}{k - 1}} \\[4pt]
= {} &[x<m] + [x \ge m]\frac{\binom{m - 1}{k - 1}}{\binom{x}{k - 1}}
\end{align}</math>

The [[cumulative distribution function]] is the credibility that ''N''&nbsp;≤&nbsp;''x'':
:<math>\begin{align}
&(N\le x\mid M=m,K=k) \\[4pt]
= {} &1 - (N>x\mid M=m,K=k) \\[4pt]
= {} &[x \ge m]\left(1 - \frac{\binom{m - 1}{k - 1}}{\binom{x}{k - 1}}\right)
\end{align}</math>


==== Order of magnitude ====
The order of magnitude of the number of enemy tanks is
:<math>\begin{align}
\mu &= \sum_n n\cdot(N=n\mid M=m,K=k) \\[4pt]
& = \sum_n n [n\ge m]\frac {m-1}n \frac {\binom{m-2}{k-2}}{\binom{n-1}{k-1}} \\[4pt]
& = \frac{m-1}1 \frac{\binom{m-2}{k-2}}1\sum_{n=m}^\infty \frac 1{\binom{n-1}{k-1}}\\[4pt]
& = \frac{m-1}1 \frac{\binom{m-2}{k-2}}1 \cdot \frac{k-1}{k-2}\frac {1}{\binom{m-2}{k-2}}\\[4pt]
& = \frac{m-1}1 \frac{k-1}{k-2}
\end{align}</math>


==== Statistical uncertainty ====
The statistical uncertainty is the standard deviation <math>\sigma</math>, satisfying the equation
:<math>\sigma^2 + \mu^2 = \sum_n n^2 \cdot (N=n\mid M=m,K=k)</math>

So
:<math>\begin{align}
\sigma^2+\mu^2-\mu & = \sum_n n(n-1)\cdot(N=n\mid M=m,K=k)\\[4pt]
& = \sum_{n=m}^\infty n(n-1)\frac{m-1}n \frac{m-2}{n-1} \frac{k-1}{k-2} \frac{\binom{m-3}{k-3}}{\binom{n-2}{k-2}}\\[4pt]
& = \frac{m-1}1 \frac{m-2}1 \frac{k-1}{k-2} \cdot \frac{\binom{m-3}{k-3}}1 \sum_{n=m}^\infty \frac 1{\binom{n-2}{k-2}}\\[4pt]
& = \frac{m-1}1 \frac{m-2}1 \frac{k-1}{k-2} \frac{\binom{m-3}{k-3}}1 \frac{k-2}{k-3} \frac 1{\binom{m-3}{k-3}}\\[4pt]
& = \frac{m-1}1 \frac{m-2}1 \frac{k-1}{k-3}
\end{align}</math>

and
:<math>\begin{align}
\sigma &= \sqrt{\frac{m-1}1 \frac{m-2}1 \frac{k-1}{k-3}+\mu-\mu^2} \\[4pt]
&= \sqrt{\frac{(k-1)(m-1)(m-k+1)}{(k-3)(k-2)^2}}
\end{align}</math>

The [[variance-to-mean ratio]] is simply
:<math>\frac{\sigma^2}\mu = \frac{m - k + 1}{(k - 3)(k - 2)}</math>
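These closed forms can be checked against direct summation of the credibility mass function; a sketch (Python; ''m''&nbsp;=&nbsp;60, ''k''&nbsp;=&nbsp;4 match the worked example later in the article):

<syntaxhighlight lang="python">
from math import comb, sqrt

m, k = 60, 4

def pmf(n):   # (n | m, k) = ((k-1)/k) * C(m-1,k-1)/C(n,k) for n >= m
    return (k - 1) * comb(m - 1, k - 1) / (k * comb(n, k))

mean = second = 0.0
for n in range(m, 2_000_000):       # truncated, but the tail is negligible here
    p = pmf(n)
    mean += n * p
    second += n * n * p

print(round(mean, 2), (m - 1) * (k - 1) / (k - 2))        # ~88.5 vs 88.5
print(round(sqrt(second - mean**2), 2),
      round(sqrt((k - 1) * (m - 1) * (m - k + 1)
                 / ((k - 3) * (k - 2)**2)), 2))           # ~50.22 vs 50.22
</syntaxhighlight>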

== See also ==
* [[Mark and recapture]], another method of estimating population size
* [[Maximum spacing estimation]], which generalizes the intuition of "assume uniformly distributed"
* [[Copernican principle]] and [[Lindy effect]], analogous predictions of lifetime assuming just one observation in the sample (current age)
** The [[Doomsday argument]], an application to estimating the expected survival time of the human race
* [[Generalized extreme value distribution]], possible limit distributions of the sample maximum (the opposite question)
* [[Maximum likelihood#Bias|Maximum likelihood]]
* [[Bias of an estimator#Maximum of a discrete uniform distribution|Bias of an estimator]]
* [[Likelihood function#Example|Likelihood function]]

== Further reading ==
* {{cite journal | last = Goodman | first = L. A. | year = 1954 | title = Some Practical Techniques in Serial Number Analysis | journal = [[Journal of the American Statistical Association]] | publisher = American Statistical Association | volume = 49 | issue = 265 | pages = 97–112 | doi = 10.2307/2281038 | jstor = 2281038}}
* {{cite web |last1=Grime |first1=James |title=The Clever Way to Count Tanks - Numberphile |url=https://www.youtube.com/watch?v=WLCwMRJBhuI |website=YouTube |publisher=[[Brady Haran]] |access-date=18 October 2024 |format=video}}
== Notes ==
{{Reflist|group=notes}}
{{notelist}}

== References ==
{{reflist|2}}

=== Works cited ===
{{Refbegin}}
* {{Cite journal | last = Johnson | first = R. W. | date = Summer 1994 | title = Estimating the Size of a Population | journal = Teaching Statistics | volume = 16 | issue = 2 | pages = 50–52 | doi = 10.1111/j.1467-9639.1994.tb00688.x | url = http://www.rsscse-edu.org.uk/tsj/wp-content/uploads/2011/03/johnson.pdf | archive-url = https://web.archive.org/web/20140223104835/http://www.rsscse-edu.org.uk/tsj/wp-content/uploads/2011/03/johnson.pdf | archive-date = 23 February 2014}}
* {{Cite journal | last1 = Ruggles | first1 = R. | author-link1 = Richard Ruggles | last2 = Brodie | first2 = H. | year = 1947 | title = An Empirical Approach to Economic Intelligence in World War II | journal = [[Journal of the American Statistical Association]] | volume = 42 | issue = 237 | pages = 72–91 | doi = 10.1080/01621459.1947.10501915 | jstor = 2280189}}
* {{Cite journal | last = Volz | first = A. G. | date = July 2008 | title = A Soviet Estimate of German Tank Production | journal = The Journal of Slavic Military Studies | volume = 21 | issue = 3 | pages = 588–590 | doi = 10.1080/13518040802313902 | s2cid = 144483708}}
{{refend}}


{{Probability distributions|discrete-infinite}}
{{DEFAULTSORT:German Tank Problem}}

[[Category:Estimation for specific distributions]]
[[Category:Estimation methods]]
[[Category:World War II tanks of Germany]]
[[Category:Applied mathematics]]
[[Category:Data analysis]]
[[Category:Bayesian statistics]]
[[Category:Probability problems]]
[[Category:Discrete distributions]]
[[Category:Theory of probability distributions]]
[[Category:Parametric statistics]]
[[Category:Serial numbers]]
[[Category:Combat modeling]]

Latest revision as of 15:35, 18 October 2024


Example

Estimated population size N, for k = 4 observed tanks and largest observed serial number m = 60. Frequentist analysis is shown with dotted lines; Bayesian analysis with solid yellow lines (the mean, with shading from the minimum possible value to the mean plus one standard deviation). If four tanks are observed and the highest serial number is 60, frequentist analysis predicts 74, whereas Bayesian analysis predicts a mean of 88.5 with standard deviation 138.72 − 88.5 = 50.22 and a minimum of 60 tanks.

Assuming tanks are assigned sequential serial numbers starting with 1, suppose that four tanks are captured and that they have the serial numbers: 19, 40, 42 and 60.

Let N equal the total number of tanks predicted to have been produced, m equal the highest serial number observed (here 60) and k equal the number of tanks captured (here 4).

A frequentist approach (using the minimum-variance unbiased estimator) predicts the total number of tanks produced will be:

    N ≈ m + m/k − 1 = 60 + 60/4 − 1 = 74

A Bayesian approach (using a uniform prior over the integers in [0, Ω] for any suitably large Ω) predicts that the median number of tanks produced will be very similar to the frequentist prediction, about 74.5, whereas the Bayesian mean predicts that the number of tanks produced would be:

    μ = (m − 1)(k − 1)/(k − 2) = 59 × 3/2 = 88.5

These Bayesian quantities are derived from the Bayesian posterior distribution derived above:

    (N = n | m, k) = ((k − 1)/k) · C(m − 1, k − 1)/C(n, k)   for n ≥ m

This probability mass function has a positive skewness, related to the fact that there are at least 60 tanks. Because of this skewness, the mean may not be the most meaningful estimate. The median in this example is 74.5, in close agreement with the frequentist formula. Using Stirling's approximation, the posterior may be approximated by an exponentially decaying function of n, under which the tail credibility C(m − 1, k − 1)/C(x, k − 1) behaves as (m/x)^(k−1). This gives the approximation for the median

    Ñ ≈ m · 2^(1/(k−1)) ≈ m + m ln(2)/(k − 1)

and, for the mean and standard deviation, the large-m approximations

    μ ≈ m(k − 1)/(k − 2),    σ ≈ m · sqrt((k − 1)/(k − 3)) / (k − 2)
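A small script reproducing these numbers (Python; only the four serial numbers are given data):

<syntaxhighlight lang="python">
from math import comb

serials = [19, 40, 42, 60]
m, k = max(serials), len(serials)

print(m + m / k - 1)                     # frequentist MVUE: 74.0
print((m - 1) * (k - 1) / (k - 2))       # Bayesian posterior mean: 88.5

def ccdf(x):                             # credibility that N > x, for x >= m
    return comb(m - 1, k - 1) / comb(x, k - 1)

# Bayesian median: the tail credibility crosses 1/2 between 74 and 75.
print(ccdf(74), ccdf(75))                # ~0.5015 and ~0.4814, so median ~74.5
</syntaxhighlight>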

Historical example of the problem

Panther tanks are loaded for transport to frontline units, 1943.

During the course of the Second World War, the Western Allies made sustained efforts to determine the extent of German production and approached this in two major ways: conventional intelligence gathering and statistical estimation. In many cases, statistical analysis substantially improved on conventional intelligence. In some cases, conventional intelligence was used in conjunction with statistical methods, as was the case in estimation of Panther tank production just prior to D-Day.

The Allied command structure had thought the Panzer V (Panther) tanks seen in Italy, with their high-velocity, long-barreled 75 mm L/70 guns, were unusual heavy tanks and would be seen in northern France only in small numbers, much the same way as the Tiger I was seen in Tunisia. The US Army was confident that the Sherman tank would continue to perform well, as it had versus the Panzer III and Panzer IV tanks in North Africa and Sicily.[a] Shortly before D-Day, rumors indicated that large numbers of Panzer V tanks were being used.

To determine whether this was true, the Allies attempted to estimate the number of tanks being produced. To do this, they used the serial numbers on captured or destroyed tanks. The principal numbers used were gearbox numbers, as these fell in two unbroken sequences. Chassis and engine numbers were also used, though their use was more complicated. Various other components were used to cross-check the analysis. Similar analyses were done on wheels, which were observed to be sequentially numbered (i.e., 1, 2, 3, ..., N).[2][b][3][4]

The analysis of tank wheels yielded an estimate for the number of wheel molds that were in use. A discussion with British road wheel makers then estimated the number of wheels that could be produced from this many molds, which yielded the number of tanks that were being produced each month. Analysis of wheels from two tanks (32 road wheels each, 64 road wheels total) yielded an estimate of 270 tanks produced in February 1944, substantially more than had previously been suspected.[5]

German records after the war showed production for the month of February 1944 was 276.[6][c] The statistical approach proved to be far more accurate than conventional intelligence methods, and the phrase "German tank problem" became accepted as a descriptor for this type of statistical analysis.

Estimating production was not the only use of this serial-number analysis. It was also used to understand German production more generally, including number of factories, relative importance of factories, length of supply chain (based on lag between production and use), changes in production, and use of resources such as rubber.

Specific data


According to conventional Allied intelligence estimates, the Germans were producing around 1,400 tanks a month between June 1940 and September 1942. Applying the formula below to the serial numbers of captured tanks, the number was calculated to be 246 a month. After the war, captured German production figures from the ministry of Albert Speer showed the actual number to be 245.[3]

Estimates for some specific months are given as:[7]

    Month         Statistical estimate   Intelligence estimate   German records
    June 1940     169                    1,000                   122
    June 1941     244                    1,550                   271
    August 1942   327                    1,550                   342

Similar analyses

V-2 rocket production was accurately estimated by statistical methods.

Similar serial-number analysis was used for other military equipment during World War II, most successfully for the V-2 rocket.[8]

Factory markings on Soviet military equipment were analyzed during the Korean War, and by German intelligence during World War II.[9]

In the 1980s, some Americans were given access to the production line of Israel's Merkava tanks. The production numbers were classified, but the tanks had serial numbers, allowing estimation of production.[10]

The formula has been used in non-military contexts, for example to estimate the number of Commodore 64 computers built, where the result (12.5 million) matches the low-end estimates.[11]

Countermeasures


To confound serial-number analysis, serial numbers can be excluded entirely, or the usable auxiliary information reduced. Alternatively, serial numbers that resist cryptanalysis can be used. Most effectively, numbers can be chosen randomly without replacement from a list that is much larger than the number of objects produced, or random numbers can be generated and checked against the list of already assigned numbers; collisions are likely to occur unless the number of possible digits is more than twice the number of digits in the number of objects produced (where the serial number can be in any base); see birthday problem.[d] For this, a cryptographically secure pseudorandom number generator may be used; a minimal sketch follows below. All these methods require a lookup table (or breaking the cypher) to recover the production order from a serial number, which complicates the use of serial numbers: a range of serial numbers cannot be recalled, for instance; each must be looked up individually, or a list generated.
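A minimal sketch of random assignment without replacement (Python; the counts are illustrative, and a real scheme would use a cryptographically secure generator rather than the default one):

<syntaxhighlight lang="python">
import random

PRODUCED = 1_000          # illustrative production count
SERIAL_SPACE = 10**8      # far larger than the number of objects produced

# Draw distinct serials at random; keep a lookup table from serial to
# production order, since the numbers themselves no longer encode it.
serials = random.sample(range(1, SERIAL_SPACE + 1), PRODUCED)
order_of = {s: i for i, s in enumerate(serials, start=1)}

# An observer's sample maximum now reflects SERIAL_SPACE, not PRODUCED.
print(max(random.sample(serials, 5)))
</syntaxhighlight>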

Alternatively, sequential serial numbers can be encrypted with a simple substitution cipher, which allows easy decoding, but is also easily broken by frequency analysis: even if starting from an arbitrary point, the plaintext has a pattern (namely, numbers are in sequence). One example is given in Ken Follett's novel Code to Zero, where the encryption of the Jupiter-C rocket serial numbers is given by:

H U N T S V I L E X
1 2 3 4 5 6 7 8 9 0

The code word here is Huntsville (with repeated letters omitted, padded with X) to get a 10-letter key.[12] Rocket number 13 was therefore "HN", and rocket number 24 was "UT".
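A sketch of this cipher (Python; a toy illustration of why such schemes are weak, since the mapping is fixed and easily inverted):

<syntaxhighlight lang="python">
# Digits 1-9,0 map to the letters of HUNTSVILEX in order.
key = "HUNTSVILEX"                      # position i encodes digit (i+1) mod 10
encode = {str((i + 1) % 10): c for i, c in enumerate(key)}

def encrypt(serial: int) -> str:
    return "".join(encode[d] for d in str(serial))

print(encrypt(13))  # HN
print(encrypt(24))  # UT
</syntaxhighlight>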

Frequentist analysis

Minimum-variance unbiased estimator

For point estimation (estimating a single value for the total, N̂), the minimum-variance unbiased estimator (MVUE, or UMVU estimator) is given by:[e]

    N̂ = m(1 + 1/k) − 1 = m + m/k − 1

where m is the largest serial number observed (sample maximum) and k is the number of tanks observed (sample size).[10][13] Note that once a serial number has been observed, it is no longer in the pool and will not be observed again.

This has a variance[10]

    Var(N̂) = (1/k) · (N − k)(N + 1)/(k + 2) ≈ N²/k²   for small samples k ≪ N

so the standard deviation is approximately N/k, the expected size of the gap between sorted observations in the sample.

The formula may be understood intuitively as the sample maximum plus the average gap between observations in the sample: the sample maximum is chosen as the initial estimator, due to being the maximum likelihood estimator,[f] with the gap being added to compensate for the negative bias of the sample maximum as an estimator for the population maximum,[g] and written as

    N̂ = m + (m − k)/k = m + m/k − 1

This can be visualized by imagining that the observations in the sample are evenly spaced throughout the range, with additional observations just outside the range at 0 and N + 1. If starting with an initial gap between 0 and the lowest observation in the sample (the sample minimum), the average gap between consecutive observations in the sample is (m − k)/k; the −k being because the observations themselves are not counted in computing the gap between observations.[h] A derivation of the expected value and the variance of the sample maximum is given at the page on the discrete uniform distribution.

This philosophy is formalized and generalized in the method of maximum spacing estimation; a similar heuristic is used for plotting position in a Q–Q plot, plotting sample points at k / (n + 1), which is evenly on the uniform distribution, with a gap at the end.

Confidence intervals


Instead of, or in addition to, point estimation, interval estimation can be carried out, such as confidence intervals. These are easily computed, based on the observation that the probability that k observations in the sample will fall in an interval covering p of the range (0 ≤ p ≤ 1) is pk (assuming in this section that draws are with replacement, to simplify computations; if draws are without replacement, this overstates the likelihood, and intervals will be overly conservative).

Thus the sampling distribution of the quantile of the sample maximum is the graph x^(1/k) from 0 to 1: the p-th to q-th quantile of the sample maximum m are the interval [p^(1/k)N, q^(1/k)N]. Inverting this yields the corresponding confidence interval for the population maximum of [m/q^(1/k), m/p^(1/k)].

For example, taking the symmetric 95% interval p = 2.5% and q = 97.5% for k = 5 yields 0.025^(1/5) ≈ 0.48, 0.975^(1/5) ≈ 0.995, so the confidence interval is approximately [1.005m, 2.08m]. The lower bound is very close to m; thus more informative is the asymmetric confidence interval from p = 5% to 100%, which for k = 5 yields 0.05^(1/5) ≈ 0.55 and the interval [m, 1.82m].

More generally, the (downward biased) 95% confidence interval is [m, m/0.05^(1/k)] = [m, m·20^(1/k)]. For a range of k values, with the UMVU point estimator (plus 1 for legibility) for reference, this yields:

    k     Point estimate   Confidence interval
    1     2m               [m, 20m]
    2     1.5m             [m, 4.5m]
    5     1.2m             [m, 1.82m]
    10    1.1m             [m, 1.35m]
    20    1.05m            [m, 1.16m]

Immediate observations are:

  • For small sample sizes, the confidence interval is very wide, reflecting great uncertainty in the estimate.
  • The range shrinks rapidly, reflecting the exponentially decaying probability that all observations in the sample will be significantly below the maximum.
  • The confidence interval exhibits positive skew, as N can never be below the sample maximum, but can potentially be arbitrarily high above it.

Note that m/k cannot be used naively (or rather (m + m/k − 1)/k) as an estimate of the standard error SE, as the standard error of an estimator is based on the population maximum (a parameter), and using an estimate to estimate the error in that very estimate is circular reasoning.
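A small sketch of the one-sided interval [m, m·20^(1/k)] described above (Python; m = 100 is an arbitrary illustration, and the with-replacement approximation from the text is assumed):

<syntaxhighlight lang="python">
def confidence_interval(m: int, k: int, alpha: float = 0.05):
    # P(all k draws fall below fraction p of the range) = p**k,
    # so the upper endpoint is m / alpha**(1/k).
    return m, m / alpha ** (1 / k)

for k in (1, 2, 5, 10, 20):
    lo, hi = confidence_interval(100, k)
    print(k, lo, round(hi, 1))   # e.g. k=1 -> (100, 2000.0), k=5 -> (100, 182.1)
</syntaxhighlight>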

Bayesian analysis


The Bayesian approach to the German tank problem[14] is to consider the posterior probability that the number of enemy tanks is N = n, when the number of observed tanks is K = k and the maximum observed serial number is M = m.

The answer to this problem depends on the choice of prior for N. One can proceed using a proper prior over the positive integers, e.g., the Poisson or Negative Binomial distribution, where a closed formula for the posterior mean and posterior variance can be obtained.[15] Below, we will instead adopt a bounded uniform prior.

For brevity, in what follows, P(N = n | M = m, K = k) is written (n | m, k), and similarly for the other conditional probabilities.

Conditional probability


The rule for conditional probability gives

    (n | m, k) (m | k) = (m | n, k) (n | k)

so that

    (n | m, k) = (m | n, k) (n | k) / (m | k)

Probability of M knowing N and K


The expression

    (m | n, k) = P(M = m | N = n, K = k)

is the conditional probability that the maximum serial number observed, M, is equal to m, when the number of enemy tanks, N, is known to be equal to n, and the number of enemy tanks observed, K, is known to be equal to k.

It is

    (m | n, k) = C(m − 1, k − 1) / C(n, k)   for m ≤ n, and zero otherwise

where C(·, ·) is a binomial coefficient and the restriction m ≤ n plays the role of an Iverson bracket.

The expression can be derived as follows: (m | n, k) answers the question: "What is the probability of a specific serial number m being the highest number observed in a sample of k tanks, given there are n tanks in total?"

One can think of the sample of size k as the result of k individual draws without replacement. Assume m is observed on draw number d. The probability of this occurring is:

    (m − 1)(m − 2)⋯(m − k + 1) / (n(n − 1)⋯(n − k + 1)) = C(m − 1, k − 1) / (k · C(n, k))

As can be seen from the right-hand side, this expression is independent of d and therefore the same for each d. As m can be drawn on k different draws, the probability of any specific m being the largest one observed is k times the above probability, giving (m | n, k) = C(m − 1, k − 1)/C(n, k).
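A quick Monte Carlo check of this sampling formula (Python; n = 100, k = 5, m = 80 are arbitrary illustrations):

<syntaxhighlight lang="python">
import random
from math import comb

n, k, m, trials = 100, 5, 80, 200_000
random.seed(0)

hits = sum(max(random.sample(range(1, n + 1), k)) == m for _ in range(trials))
print(hits / trials)                    # empirical P(M = m), ~0.02
print(comb(m - 1, k - 1) / comb(n, k))  # closed form, ~0.01996
</syntaxhighlight>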

Probability of M knowing only K


The expression (m | k) = P(M = m | K = k) is the probability that the maximum serial number is equal to m once k tanks have been observed but before the serial numbers have actually been observed.

The expression can be re-written in terms of the other quantities by marginalizing over all possible n:

    (m | k) = Σ_n (m | n, k) (n | k)

Prior probability of N knowing only K


We assume that k is fixed in advance, so that we do not have to consider any distribution over K; thus our prior can depend on k.

The expression

    (n | k) = P(N = n | K = k)

is the credibility that the total number of tanks, N, is equal to n when the number of tanks observed is known to be k, but before the serial numbers have been observed. Assume that it is a discrete uniform distribution over the integers below some bound Ω:

    (n | k) = [n < Ω] / (Ω − 1)

The upper limit Ω must be finite, because a constant function over all of the infinitely many integers is not a mass distribution function (it cannot sum to 1). Our result below will not depend on Ω.

Posterior probability of N knowing M and K


Provided that m ≤ n < Ω, so that the prior is consistent with the observed data, the uniform prior cancels from Bayes' rule and

    (n | m, k) = (m | n, k) / Σ_{n′=m}^{Ω−1} (m | n′, k)

As Ω → ∞, the summation approaches Σ_{n′=m}^∞ (m | n′, k) = k/(k − 1) (which is finite if k ≥ 2). Thus, for suitably large Ω, we have

    (N = n | M = m, K = k) ≈ ((k − 1)/k) · C(m − 1, k − 1)/C(n, k)   for n ≥ m

as derived in detail above. For k ≥ 1 the mode of the distribution of the number of enemy tanks is m.


Notes

  1. ^ An Armored Ground Forces policy statement of November 1943 concluded: "The recommendation of a limited proportion of tanks carrying a 90 mm gun is not concurred in for the following reasons: The M4 tank has been hailed widely as the best tank of the battlefield today. ... There appears to be no fear on the part of our forces of the German Mark VI (Tiger) tank. There can be no basis for the T26 tank other than the conception of a tank-vs.-tank duel – which is believed to be unsound and unnecessary."[1]
  2. ^ The lower bound was unknown, but to simplify the discussion, this detail is generally omitted, taking the lower bound as known to be 1.
  3. ^ Ruggles & Brodie is largely a practical analysis and summary, not a mathematical one – the estimation problem is only mentioned in footnote 3 on page 82, where they estimate the maximum as "sample maximum + average gap".
  4. ^ As discussed in birthday attack, one can expect a collision after 1.25H numbers, if choosing from H possible outputs. This square root corresponds to half the digits. For example, in any base, the square root of a number with 100 digits is approximately a number with 50 digits.
  5. ^ In a continuous distribution, there is no −1 term.
  6. ^ Given a particular set of observations, this set is most likely to occur if the population maximum is the sample maximum, not a higher value (it cannot be lower).
  7. ^ The sample maximum is never more than the population maximum, but can be less, hence it is a biased estimator: it will tend to underestimate the population maximum.
  8. ^ For example, the gap between 2 and 7 is (7 − 2) − 1 = 4, consisting of 3, 4, 5, and 6.

References

  1. ^ AGF policy statement. Chief of staff AGF. November 1943. MHI
  2. ^ Ruggles & Brodie 1947, pp. 73–74.
  3. ^ a b "Gavyn Davies does the maths – How a statistical formula won the war". The Guardian. 20 July 2006. Retrieved 6 July 2014.
  4. ^ Matthews, Robert (23 May 1998), "Data sleuths go to war, sidebar in feature "Hidden truths"", New Scientist, archived from the original on 18 April 2001
  5. ^ Bob Carruthers (1 March 2012). Panther V in Combat. Coda Books. pp. 94–. ISBN 978-1-908538-15-4.
  6. ^ Ruggles & Brodie 1947, pp. 82–83.
  7. ^ Ruggles & Brodie 1947, p. 89.
  8. ^ Ruggles & Brodie 1947, pp. 90–91.
  9. ^ Volz 2008.
  10. ^ a b c Johnson 1994.
  11. ^ "How many Commodore 64 computers were really sold?". pagetable.com. 1 February 2011. Archived from the original on 6 March 2016. Retrieved 6 July 2014.
  12. ^ "Rockets and Missiles". www.spaceline.org.
  13. ^ Joyce, Smart. "German Tank Problem". Logan High School. Archived from the original on 24 April 2012. Retrieved 8 July 2014.
  14. ^ Simon, Cory (2023). "A Bayesian Treatment of the German Tank Problem". The Mathematical Intelligencer. 46 (2): 117–127. arXiv:2301.00046. doi:10.1007/s00283-023-10274-6. PMC 11147940. PMID 38841650.
  15. ^ Höhle, M.; Held, L. (2006). "Bayesian Estimation of the Size of a Population" (PDF). Technical Report SFB 386, No. 399, Department of Statistics, University of Munich. Retrieved 17 April 2016.
