User:Citing/sandbox3: Difference between revisions

Revision as of 22:17, 13 June 2021

Description

The basic cause of population structure in sexually reproducing species is non-random mating between groups: if all individuals within a population mate randomly, then the allele frequencies should be similar between groups. Population structure commonly arises from physical separation by distance or barriers, like mountains and rivers, followed by genetic drift. Other causes include gene flow from migrations, population bottlenecks and expansions, founder effects, evolutionary pressure, random chance, and (in humans) cultural factors. Even in the absence of these factors, individuals tend to stay close to where they were born, which means that alleles will not be distributed at random with respect to the full range of the species.^[1]^[2] There are many methods to measure and capture population structure.

Models

Heterozygosity measures

One of the results of population structure is a reduction in heterozygosity compared to a population mating totally at random. When populations split, alleles have a higher chance of reaching fixation within subpopulations, especially if the subpopulations are small or have been isolated for long periods of time. This reduction in heterozygosity can be thought of as an extension of measures of inbreeding or overlapping sets of pedigrees, with individuals in subpopulations being more likely to share a recent common ancestor.^[3] The scale at which allele frequencies are measured is important: an individual with both parents born in the United Kingdom is not inbred relative to that country's population, but is more inbred than the offspring of two humans selected at random from the entire world. This motivates the derivation of Wright's F-statistics (also called "fixation indices"), which measure inbreeding through observed versus expected heterozygosity. For example, $F_{IS}$ measures the inbreeding coefficient at a single locus for an individual $I$ relative to some subpopulation $S$:^[4]

$F_{IS}=1-{\frac {H_{I}}{H_{S}}}$

Here, $H_{I}$ is the observed fraction of individuals in subpopulation $S$ that are heterozygous. Assuming there are two alleles, $A_{1}$ and $A_{2}$, that occur at respective frequencies $p_{S}$ and $q_{S}$, it is expected that under random mating the subpopulation $S$ will have a heterozygosity rate of $H_{S}=2p_{S}(1-p_{S})=2p_{S}q_{S}$. Then:

$F_{IS}=1-{\frac {H_{I}}{2p_{S}q_{S}}}$

Similarly, for the total population $T$, we can define $H_{T}=2p_{T}q_{T}$, which lets us compare the expected heterozygosity of subpopulation $S$ with that of the total population and define $F_{ST}$ as:^[4]

$F_{ST}=1-{\frac {H_{S}}{H_{T}}}=1-{\frac {2p_{S}q_{S}}{2p_{T}q_{T}}}$

If $F_{ST}$ is 0, then the allele frequencies between populations are the same, suggesting no structure. The theoretical maximum value is 1, but most observed maximum values are far lower.^[3] $F_{ST}$ is one of the most common measures of population structure and there are several different formulations depending on the number of populations and the alleles of interest. Although it is sometimes used as a genetic distance between populations, it does not always satisfy the triangle inequality and thus is not a metric.^[5]
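
As an illustration of the two formulas above, here is a minimal Python sketch for a single biallelic locus. The function names and the input numbers are hypothetical and chosen only for the example; the code follows the single-subpopulation formulas given in this section rather than any of the other formulations of $F_{ST}$.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2pq at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def f_is(h_observed, p_s):
    """F_IS = 1 - H_I / H_S, where H_S = 2 * p_S * q_S."""
    return 1.0 - h_observed / expected_heterozygosity(p_s)


def f_st(p_s, p_t):
    """F_ST = 1 - H_S / H_T for one subpopulation S within the total population T."""
    return 1.0 - expected_heterozygosity(p_s) / expected_heterozygosity(p_t)


# Hypothetical inputs: allele frequency 0.2 in S, 0.5 in T,
# and 24% of individuals in S observed to be heterozygous.
p_s, p_t, h_i = 0.2, 0.5, 0.24

print(f_is(h_i, p_s))  # approximately 0.25, since H_S = 0.32
print(f_st(p_s, p_t))  # approximately 0.36, since H_T = 0.5
```

With these inputs both statistics are positive: the subpopulation has fewer heterozygotes than random mating would predict, and its expected heterozygosity is lower than that of the total population.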

Admixture inference

One family of methods models the genotype of an individual by assuming that it is made up of contributions from K discrete clusters of populations.^[4] Each cluster is defined by the frequencies of its own genotypes, and the contribution of a cluster to an individual's genotypes is measured via an estimator. In 2000, Jonathan K. Pritchard introduced the STRUCTURE algorithm to estimate these proportions via Markov chain Monte Carlo. Since then, algorithms (such as ADMIXTURE) have been developed using other methods.^[6] Estimated contributions can be visualized using bar plots — each bar represents an individual, and is subdivided to represent the proportion of an individual's genetic ancestry from one of the K populations.^[4]
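
The mixture idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the STRUCTURE or ADMIXTURE implementation: it assumes each cluster is summarized by allele frequencies at biallelic loci, that an individual's allele frequency at a locus is the ancestry-weighted average of the cluster frequencies, and that genotypes are binomial draws. The matrices `Q` and `P` and all of their values are hypothetical.

```python
import numpy as np

# Hypothetical toy setup: 3 individuals, K = 2 clusters, 4 biallelic loci.
# Q[i, k]: proportion of individual i's ancestry from cluster k (rows sum to 1).
Q = np.array([[1.0, 0.0],
              [0.5, 0.5],
              [0.2, 0.8]])

# P[k, l]: frequency of the non-reference allele at locus l in cluster k.
P = np.array([[0.1, 0.9, 0.5, 0.3],
              [0.7, 0.2, 0.5, 0.6]])

# Each individual's ancestry proportions must sum to 1.
assert np.allclose(Q.sum(axis=1), 1.0)

# Individual-specific allele frequencies: weighted average over clusters.
indiv_freq = Q @ P                      # shape (individuals, loci)

# Expected number of non-reference alleles per genotype (between 0 and 2).
expected_genotype = 2.0 * indiv_freq

# Simulated genotypes: two independent allele draws per locus (assumption).
rng = np.random.default_rng(0)
genotypes = rng.binomial(2, indiv_freq)

print(indiv_freq)
print(genotypes)
```

Inference runs in the opposite direction: given only `genotypes`, the methods estimate `Q` and `P`. The rows of an estimated `Q` are what the bar plots display.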

Varying K can illustrate different scales of population structure; using a small K for the entire human population will subdivide people roughly by continent, while using large K will partition populations into finer subgroups.^[4] Though the methods are popular, they are open to misinterpretation: for non-simulated data, there is never a "true" value of K, but rather an approximation considered useful for the study.^[7] They are sensitive to sample size and close relatives in data sets; there may be no discrete populations at all; there may be hierarchical structure where subpopulations are nested.^[7] Clusters may be admixed themselves — analyses have presented Europeans as one cluster, though ancient DNA studies demonstrate that they descended from multiple distinct populations.^[4]

Population structure of humans in Northern Africa and neighboring populations modelled using ADMIXTURE and assuming K=2,4,6,8 populations (Figure B, top to bottom). As the value of K is increased, the number of inferred ancestral groups for some individuals rises.^[8]

Dimensionality reduction

Genetic data are high dimensional and dimensionality reduction techniques can capture population structure. Principal component analysis (PCA) was first applied in population genetics in 1978 by Cavalli-Sforza and colleagues, and its use resurged when new technology made high-throughput sequencing possible.^[4]^[10] Initially PCA was used on allele frequencies at known genetic markers for populations, though later it was found that by coding SNPs as integers (for example, as the number of non-reference alleles) and normalizing the values, PCA could be applied at the level of individuals. One formulation considers $N$ individuals and $S$ bi-allelic SNPs. For each individual $i$, the value $g_{i,l}$ at locus $l$ is the number of non-reference alleles (one of $0,1,2$). If the allele frequency at $l$ is $p_{l}$, then the resulting $N\times S$ matrix of normalized genotypes has entries:^[4]

${\frac {g_{i,l}-2p_{l}}{\sqrt {2p_{l}(1-p_{l})}}}$

PCA projects the data onto orthogonal axes that successively capture the greatest variance; given enough data, when each individual is visualized as a point on a plot, discrete clusters can form.^[6] Individuals with admixed ancestries will tend to fall between clusters, and when there is homogeneous isolation by distance in the data, the top PC vectors will reflect geographic variation.^[6]^[11]
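
The normalization and projection described above can be sketched with NumPy as follows. The simulated genotype matrix, the group sizes, and the variable names are assumptions made for the example. Allele frequencies are estimated from the sample itself, and monomorphic SNPs are dropped so that the denominator is never zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: two groups of 20 individuals typed at 200 SNPs,
# with allele frequencies that differ somewhat between the groups.
n_per_group, n_snps = 20, 200
freqs_a = rng.uniform(0.1, 0.9, n_snps)
freqs_b = np.clip(freqs_a + rng.normal(0.0, 0.2, n_snps), 0.05, 0.95)
G = np.vstack([
    rng.binomial(2, freqs_a, (n_per_group, n_snps)),
    rng.binomial(2, freqs_b, (n_per_group, n_snps)),
]).astype(float)                         # entries are 0, 1, or 2

# Sample allele frequency at each SNP; drop SNPs with no variation.
p = G.mean(axis=0) / 2.0
keep = (p > 0.0) & (p < 1.0)
G, p = G[:, keep], p[keep]

# Normalized genotypes: (g - 2p) / sqrt(2p(1 - p)).
X = (G - 2.0 * p) / np.sqrt(2.0 * p * (1.0 - p))

# PCA via singular value decomposition of the column-centered matrix.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
pcs = U * s                              # coordinates of individuals on each PC
explained = s**2 / np.sum(s**2)          # fraction of variance per PC

print(pcs[:, :2])                        # first two PCs, one row per individual
print(explained[:2])
```

Plotting the first two columns of `pcs` gives the familiar scatter plot of individuals. In this toy setup the two simulated groups would be expected to separate along the first component, although that depends on how different the simulated frequencies happen to be.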

Add note about geography being a big source of the endpoints of PC plots

HWE

Under Hardy-Weinberg equilibrium, the genotype frequencies of a population that mates completely at random with respect to genotype can be derived from its allele frequencies. If there are two alleles, $A_{1}$ and $A_{2}$, that occur at frequencies $p$ and $q$ where $p+q=1$, then the frequencies of the genotypes will be:

Genotype	$f_{11}=A_{1}A_{1}$	$f_{22}=A_{2}A_{2}$	$f_{12}=A_{1}A_{2}$
Frequency	$p^{2}$	$q^{2}$	$2pq$
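
A short Python sketch of the table above, assuming a single biallelic locus; the function name and the example frequency are hypothetical.

```python
def hwe_genotype_frequencies(p):
    """Expected genotype frequencies under Hardy-Weinberg equilibrium.

    p is the frequency of allele A1; the frequency of A2 is q = 1 - p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    q = 1.0 - p
    return {"A1A1": p * p, "A2A2": q * q, "A1A2": 2.0 * p * q}


# Hypothetical example with p = 0.3, so q = 0.7.
freqs = hwe_genotype_frequencies(0.3)
print(freqs)                # roughly 0.09, 0.49 and 0.42
print(sum(freqs.values()))  # the three frequencies sum to 1 (up to rounding)
```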

Refs

^[4]

^[12]

^[13]

^[14]

^[15]

^[3]

[1] Cardon LR, Palmer LJ (February 2003). "Population stratification and spurious allelic association". Lancet. 361 (9357): 598–604. doi:10.1016/S0140-6736(03)12520-2. PMID 12598158. S2CID 14255234.
[2] McVean G (2001). "Population Structure" (PDF). Archived from the original (PDF) on 2018-11-23. Retrieved 2020-11-14.
[3] Hartl, Daniel L.; Clark, Andrew G. (1997). Principles of population genetics (3rd ed.). Sunderland, MA: Sinauer Associates. pp. 111–163. ISBN 0-87893-306-9. OCLC 37481398.
[4] Coop, Graham (2019). Population and Quantitative Genetics. pp. 22–44.
[5] Arbisser, Ilana M.; Rosenberg, Noah A. (2020). "FST and the triangle inequality for biallelic markers". Theoretical Population Biology. 133: 117–129. doi:10.1016/j.tpb.2019.05.003. ISSN 0040-5809.
[6] Novembre, John; Ramachandran, Sohini (2011). "Perspectives on Human Population Structure at the Cusp of the Sequencing Era". Annual Review of Genomics and Human Genetics. 12 (1): 245–274. doi:10.1146/annurev-genom-090810-183123. ISSN 1527-8204.
[7] Lawson, Daniel J.; van Dorp, Lucy; Falush, Daniel (2018). "A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots". Nature Communications. 9 (1). doi:10.1038/s41467-018-05257-7. ISSN 2041-1723.
[8] Schierup, Mikkel H.; Henn, Brenna M.; Botigué, Laura R.; Gravel, Simon; Wang, Wei; Brisbin, Abra; Byrnes, Jake K.; Fadhlaoui-Zid, Karima; Zalloua, Pierre A.; Moreno-Estrada, Andres; Bertranpetit, Jaume; Bustamante, Carlos D.; Comas, David (2012). "Genomic Ancestry of North Africans Supports Back-to-Africa Migrations". PLoS Genetics. 8 (1): e1002397. doi:10.1371/journal.pgen.1002397. ISSN 1553-7404.
[9] Williams, Scott M.; Wang, Chaolong; Zöllner, Sebastian; Rosenberg, Noah A. (2012). "A Quantitative Comparison of the Similarity between Genes and Geography in Worldwide Human Populations". PLoS Genetics. 8 (8): e1002886. doi:10.1371/journal.pgen.1002886. ISSN 1553-7404.
[10] Menozzi, P; Piazza, A; Cavalli-Sforza, L (1978). "Synthetic maps of human gene frequencies in Europeans". Science. 201 (4358): 786–792. doi:10.1126/science.356262. ISSN 0036-8075.
[11] Novembre, John; Johnson, Toby; Bryc, Katarzyna; Kutalik, Zoltán; Boyko, Adam R.; Auton, Adam; Indap, Amit; King, Karen S.; Bergmann, Sven; Nelson, Matthew R.; Stephens, Matthew; Bustamante, Carlos D. (2008). "Genes mirror geography within Europe". Nature. 456 (7218): 98–101. doi:10.1038/nature07331. ISSN 0028-0836.
[12] Meirmans, Patrick G.; Hedrick, Philip W. (2010). "Assessing population structure: FST and related measures". Molecular Ecology Resources. 11 (1): 5–18. doi:10.1111/j.1755-0998.2010.02927.x. ISSN 1755-098X.
[13] Barroso, Gustavo V.; Moutinho, Ana Filipa; Dutheil, Julien Y. (2020), Dutheil, Julien Y. (ed.), "A Population Genomics Lexicon", Statistical Population Genomics, vol. 2090, New York, NY: Springer US, pp. 3–17, doi:10.1007/978-1-0716-0199-0_1, ISBN 978-1-0716-0198-3, retrieved 2021-05-31
[14] Liu, Chi-Chun; Shringarpure, Suyash; Lange, Kenneth; Novembre, John (2020), Dutheil, Julien Y. (ed.), "Exploring Population Structure with Admixture Models and Principal Component Analysis", Statistical Population Genomics, vol. 2090, New York, NY: Springer US, pp. 67–86, doi:10.1007/978-1-0716-0199-0_4, ISBN 978-1-0716-0198-3, retrieved 2021-05-31
[15] Gillespie, John H. (1998). "4". Population genetics : a concise guide. Baltimore, Md: The Johns Hopkins University Press. ISBN 0-8018-5754-6. OCLC 36817311.

@@ Line 37: / Line 37: @@
 PCA transforms data to maximize variance; given enough data, when each individual is visualized as point on a plot, discrete clusters can form.<ref name="NovembreRamachandran2011"/> Individuals with admixed ancestries will tend to fall between clusters, and when there is homogenous [[isolation by distance]] in the data, the top PC vectors will reflect geographic variation.<ref name="NovembreRamachandran2011"/><ref name="NovembreJohnson2008">{{cite journal|last1=Novembre|first1=John|last2=Johnson|first2=Toby|last3=Bryc|first3=Katarzyna|last4=Kutalik|first4=Zoltán|last5=Boyko|first5=Adam R.|last6=Auton|first6=Adam|last7=Indap|first7=Amit|last8=King|first8=Karen S.|last9=Bergmann|first9=Sven|last10=Nelson|first10=Matthew R.|last11=Stephens|first11=Matthew|last12=Bustamante|first12=Carlos D.|title=Genes mirror geography within Europe|journal=Nature|volume=456|issue=7218|year=2008|pages=98–101|issn=0028-0836|doi=10.1038/nature07331}}</ref>
+* Add note about geography being a big source of the endpoints of PC plots
 === HWE ===