Inception score

The Inception Score (IS) is an algorithm used to assess the quality of images created by a generative model, like a generative adversarial network (GAN).^[1] It has been somewhat superseded by the Fréchet inception distance.^[2]

Definition

Let there be two spaces, the space of images $\Omega _{X}$ and the space of labels $\Omega _{Y}$ . The space of labels is finite.

Let $p_{gen}$ be a probability distribution over $\Omega _{X}$ that we wish to judge.

Let a discriminator be a function of type $p_{dis}:\Omega _{X}\to M(\Omega _{Y})$ where $M(\Omega _{Y})$ is the set of all probability distributions on $\Omega _{Y}$ . For any image $x$ , and any label $y$ , let $p_{dis}(y|x)$ be the probability that image $x$ has label $y$ , according to the discriminator. It is usually implemented as an Inception-v3 network trained on ImageNet.

The Inception Score of $p_{gen}$ relative to $p_{dis}$ is

$IS(p_{gen},p_{dis}):=\exp \left(\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\int p_{dis}(\cdot |x)p_{gen}(x)dx\right)\right]\right)$

Equivalent rewrites include

$\ln IS(p_{gen},p_{dis})=\mathbb {E} _{x\sim p_{gen}}\left[D_{KL}\left(p_{dis}(\cdot |x)\|\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]\right)\right]$

$\ln IS(p_{gen},p_{dis})=H[\mathbb {E} _{x\sim p_{gen}}[p_{dis}(\cdot |x)]]-\mathbb {E} _{x\sim p_{gen}}[H[p_{dis}(\cdot |x)]]$

That $\ln IS(p_{gen},p_{dis})$ is nonnegative follows from Jensen's inequality, because entropy is a concave function of the distribution.
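As a numerical check on the two rewrites, the following minimal sketch computes $\ln IS$ both as a mean KL divergence and as a difference of entropies. It assumes the discriminator outputs are already available as a matrix `probs` with one row per image. The function names, the smoothing constant `eps`, and the random softmax outputs that stand in for a real classifier are illustrative assumptions, not part of the original definition.

```python
import numpy as np

def log_is_kl(probs, eps=1e-12):
    """ln IS as the mean KL divergence of each row from the marginal."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0)  # estimate of E_x[p_dis(.|x)]
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(kl.mean())

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) along the last axis."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def log_is_entropy(probs):
    """ln IS as H[marginal] minus the mean conditional entropy."""
    probs = np.asarray(probs, dtype=float)
    return float(entropy(probs.mean(axis=0)) - entropy(probs).mean())

# Stand-in for discriminator outputs: softmax of random logits (assumption).
rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# The two forms should agree up to floating-point error.
print(log_is_kl(probs), log_is_entropy(probs))
```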

Pseudocode:

INPUT discriminator $p_{dis}$ .
INPUT generator $g$ .
Sample images $x_{i}$ from generator.
Compute $p_{dis}(\cdot |x_{i})$ , the probability distribution over labels conditional on image $x_{i}$ .
Average the results to obtain ${\hat {p}}$ , an empirical estimate of $\int p_{dis}(\cdot |x)p_{gen}(x)dx$ .
Sample more images $x_{i}$ from generator, and for each, compute $D_{KL}\left(p_{dis}(\cdot |x_{i})\|{\hat {p}}\right)$ .
Average the KL divergences, and take the exponential of the average.
Return the result.
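The steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `sample_images` and `classify` are hypothetical callables standing in for the generator and the discriminator, and the toy versions below use random vectors and a random linear softmax classifier rather than real images and an Inception-v3 network.

```python
import numpy as np

def inception_score(sample_images, classify, n_marginal=5000, n_kl=5000, eps=1e-12):
    """Estimate the Inception Score following the pseudocode.

    sample_images(n) -> array of n generated samples (hypothetical interface)
    classify(x)      -> array of shape (n, num_labels) whose rows sum to 1
    """
    # Sample images, classify them, and average to estimate the marginal p_hat.
    p_hat = classify(sample_images(n_marginal)).mean(axis=0)
    # Sample more images and compute KL(p_dis(.|x_i) || p_hat) for each.
    probs = classify(sample_images(n_kl))
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_hat + eps)), axis=1)
    # Average the KL divergences and exponentiate.
    return float(np.exp(kl.mean()))

# Toy stand-ins (assumptions): 8-dimensional "images" and 5 labels.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 5))

def toy_sample_images(n):
    return rng.normal(size=(n, 8))

def toy_classify(x):
    logits = x @ W
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

score = inception_score(toy_sample_images, toy_classify)
print(score)
```

Practical implementations often reuse a single batch of samples for both the marginal estimate and the KL terms; the sketch keeps them separate only to mirror the pseudocode.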

Interpretation

A higher inception score is interpreted as "better", as it means that $p_{gen}$ is a "sharp and distinct" collection of pictures.

$\ln IS(p_{gen},p_{dis})\in [0,\ln N]$ , where $N$ is the total number of possible labels.
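One way to see both bounds is through the entropy form of the score. The short derivation below is a sketch; the symbol $\bar{p}$ for the marginal label distribution is notation introduced here for convenience.

```latex
\begin{align*}
\bar{p} &:= \mathbb{E}_{x\sim p_{gen}}\left[p_{dis}(\cdot|x)\right], \\
\ln IS(p_{gen},p_{dis}) &= H[\bar{p}] - \mathbb{E}_{x\sim p_{gen}}\left[H[p_{dis}(\cdot|x)]\right]. \\
\intertext{Entropy is concave, so Jensen's inequality gives the lower bound:}
H[\bar{p}] &\ge \mathbb{E}_{x\sim p_{gen}}\left[H[p_{dis}(\cdot|x)]\right]
  \quad\Longrightarrow\quad \ln IS(p_{gen},p_{dis}) \ge 0. \\
\intertext{Entropy is nonnegative and at most $\ln N$ on $N$ labels, which gives the upper bound:}
\ln IS(p_{gen},p_{dis}) &\le H[\bar{p}] \le \ln N.
\end{align*}
```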

$\ln IS(p_{gen},p_{dis})=0$ if and only if, for almost all $x\sim p_{gen}$ , $p_{dis}(\cdot |x)=\int p_{dis}(\cdot |x)p_{gen}(x)dx$ . In that case $p_{gen}$ is completely "indistinct": for any image $x$ sampled from $p_{gen}$ , the discriminator returns exactly the same label predictions $p_{dis}(\cdot |x)$ .

The highest inception score, $N$ , is achieved if and only if both of the following conditions hold:

For almost all $x\sim p_{gen}$ , the distribution $p_{dis}(y|x)$ is concentrated on one label. That is, $H_{y}[p_{dis}(y|x)]=0$ . That is, every image sampled from $p_{gen}$ is exactly classified by the discriminator.
For every label $y$ , the proportion of generated images labelled as $y$ is exactly $\mathbb {E} _{x\sim p_{gen}}[p_{dis}(y|x)]={\frac {1}{N}}$ . That is, the generated images are equally distributed over all labels.
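The extreme cases can be checked on small hand-built prediction matrices. The sketch below is illustrative only: the helper `inception_score_from_probs`, the choice of $N=4$ labels, and the three example matrices are assumptions made for the demonstration.

```python
import numpy as np

def inception_score_from_probs(probs, eps=1e-12):
    """Inception Score of a matrix of label distributions (one row per image)."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl.mean()))

N = 4

# Every image gets the same prediction: "indistinct", so the score is about 1.
indistinct = np.tile([0.1, 0.2, 0.3, 0.4], (8, 1))

# One-hot predictions spread evenly over all labels: the score is about N.
sharp_balanced = np.tile(np.eye(N), (2, 1))

# One-hot predictions that all pick the same label: sharp but not
# diverse, so the score falls back to about 1.
sharp_collapsed = np.tile(np.eye(N)[0], (8, 1))

for name, p in [("indistinct", indistinct),
                ("sharp_balanced", sharp_balanced),
                ("sharp_collapsed", sharp_collapsed)]:
    print(name, inception_score_from_probs(p))
```

The third case shows why both conditions are needed: confident predictions alone do not raise the score unless the labels are also evenly covered.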

References

[1] Salimans, Tim; Goodfellow, Ian; Zaremba, Wojciech; Cheung, Vicki; Radford, Alec; Chen, Xi (2016). "Improved Techniques for Training GANs". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc.

[2] Borji, Ali (2022). "Pros and cons of GAN evaluation measures: New developments". Computer Vision and Image Understanding. 215: 103329. doi:10.1016/j.cviu.2021.103329.
