
User:Citing/sandbox3

From Wikipedia, the free encyclopedia

This is an old revision of this page, as edited by Citing (talk | contribs) at 22:20, 14 June 2021 (refresh). The present address (URL) is a permanent link to this revision, which may differ significantly from the current revision.

Measures

Population structure is a complex phenomenon and no single measure captures it entirely. Understanding a population's structure requires a combination of methods and measures.[1][2]

Heterozygosity

A population bottleneck can result in a loss of heterozygosity. In this hypothetical population, an allele has become fixed after the population repeatedly dropped from 10 to 3.

One of the results of population structure is a reduction in heterozygosity. When populations split, alleles have a higher chance of reaching fixation within subpopulations, especially if the subpopulations are small or have been isolated for long periods. This reduction in heterozygosity can be thought of as an extension of inbreeding, with individuals in subpopulations being more likely to share a recent common ancestor.[3] The scale is important: an individual with both parents born in the United Kingdom is not inbred relative to that country's population, but is more inbred than the offspring of two humans selected from the entire world. This motivates the derivation of Wright's F-statistics (also called "fixation indices"), which measure inbreeding through observed versus expected heterozygosity.[4] For example, $F_{IS}$ measures the inbreeding coefficient at a single locus for an individual $I$ relative to some subpopulation $S$:[5]

$$F_{IS} = 1 - \frac{H_I}{H_S}$$

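Here, $H_I$ is the fraction of individuals in subpopulation $S$ that are heterozygous. Assuming there are two alleles, $A_1$ and $A_2$, that occur at respective frequencies $p_S$ and $q_S$, it is expected that under random mating the subpopulation $S$ will have a heterozygosity rate of $H_S = 2 p_S q_S$. Then:

$$F_{IS} = 1 - \frac{H_I}{H_S} = 1 - \frac{H_I}{2 p_S q_S}$$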

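Similarly, for the total population $T$, we can define $H_T = 2 p_T q_T$, allowing us to compute the expected heterozygosity of subpopulation $S$ and the value $F_{ST}$ as:[5]

$$F_{ST} = 1 - \frac{H_S}{H_T} = 1 - \frac{2 p_S q_S}{2 p_T q_T}$$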


If $F_{ST}$ is 0, then the allele frequencies in the populations are identical, suggesting no structure. The theoretical maximum value of 1 is attained when subpopulations are fixed for different alleles, but most observed maximum values are far lower.[3] $F_{ST}$ is one of the most common measures of population structure and there are several different formulations depending on the number of populations and the alleles of interest. Although it is sometimes used as a genetic distance between populations, it does not always satisfy the triangle inequality and thus is not a metric.[6] It also depends on within-population diversity, which makes interpretation and comparison difficult.[2]
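The heterozygosity-based definitions above can be sketched in a few lines of Python. This is a minimal illustration for a single bi-allelic locus, not a reference implementation: the function names and the input format (genotype counts per subpopulation) are assumptions made for this example. It also averages the expected within-subpopulation heterozygosity across subpopulations, weighted by sample size. That is only one of the several formulations mentioned above.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2pq at a bi-allelic locus under random mating."""
    return 2.0 * p * (1.0 - p)


def f_statistics(subpops):
    """Return (list of F_IS per subpopulation, F_ST) for one bi-allelic locus.

    `subpops` is a list of (n_AA, n_Aa, n_aa) genotype counts, one tuple per
    subpopulation. This input format is hypothetical, chosen for the sketch.
    """
    freqs, sizes, f_is = [], [], []
    for n_AA, n_Aa, n_aa in subpops:
        n = n_AA + n_Aa + n_aa
        p = (2 * n_AA + n_Aa) / (2 * n)       # frequency of allele A
        h_obs = n_Aa / n                      # observed heterozygosity (H_I)
        h_exp = expected_heterozygosity(p)    # expected heterozygosity (H_S)
        # F_IS is undefined when the subpopulation is fixed for one allele
        f_is.append(1.0 - h_obs / h_exp if h_exp > 0 else float("nan"))
        freqs.append(p)
        sizes.append(n)

    total = sum(sizes)
    # Allele frequency and expected heterozygosity in the pooled population (H_T)
    p_total = sum(p * n for p, n in zip(freqs, sizes)) / total
    h_t = expected_heterozygosity(p_total)
    # Assumption: H_S is the size-weighted mean over subpopulations
    h_s = sum(expected_heterozygosity(p) * n for p, n in zip(freqs, sizes)) / total
    f_st = 1.0 - h_s / h_t if h_t > 0 else float("nan")
    return f_is, f_st


# Two subpopulations in Hardy-Weinberg proportions with different allele frequencies
f_is, f_st = f_statistics([(25, 50, 25), (81, 18, 1)])
# Identical allele frequencies: no structure
_, f_st_same = f_statistics([(25, 50, 25), (25, 50, 25)])
# Subpopulations fixed for different alleles: maximal structure
_, f_st_fixed = f_statistics([(10, 0, 0), (0, 0, 10)])
```

In this toy input both subpopulations are individually in Hardy-Weinberg proportions, so each $F_{IS}$ is close to 0, while the difference in allele frequencies (0.5 versus 0.9) gives a positive $F_{ST}$. The last two calls reproduce the boundary cases of 0 and 1.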

Admixture inference

An individual's genotype can be modelled as an admixture between K discrete clusters of populations.[5] Each cluster is defined by the frequencies of its genotypes, and the contribution of a cluster to an individual's genotypes is measured via an estimator. In 2000, Jonathan K. Pritchard and colleagues introduced the STRUCTURE algorithm to estimate these proportions via Markov chain Monte Carlo.[7] Since then, algorithms (such as ADMIXTURE) have been developed using other estimation techniques.[8][9] Estimated proportions can be visualized using bar plots: each bar represents an individual, and is subdivided to represent the proportion of that individual's genetic ancestry from each of the K populations.[5]
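The core of the admixture model can be sketched as follows. This is a simplified illustration under stated assumptions, not the STRUCTURE or ADMIXTURE implementation: clusters are described by allele frequencies at bi-allelic loci, an individual's allele frequency at each locus is the ancestry-weighted average of the cluster frequencies, and genotypes are treated as binomial draws (the constant binomial coefficient is dropped). The names `G`, `Q`, `P`, and `admixture_log_likelihood` are chosen for this example.

```python
import numpy as np


def admixture_log_likelihood(G, Q, P):
    """Binomial log-likelihood (up to a constant) under a K-cluster admixture model.

    G: (N, L) counts of the non-reference allele per individual and locus (0, 1, 2)
    Q: (N, K) ancestry proportions; each row sums to 1
    P: (K, L) non-reference allele frequency of each cluster at each locus
    """
    eps = 1e-12
    # Individual-specific allele frequencies: ancestry-weighted cluster frequencies
    freq = np.clip(Q @ P, eps, 1 - eps)
    return float(np.sum(G * np.log(freq) + (2 - G) * np.log(1 - freq)))


# Toy data: two clusters with very different allele frequencies at three loci
P = np.array([[0.9, 0.9, 0.9],
              [0.1, 0.1, 0.1]])
G = np.array([[2, 2, 2],    # looks like cluster 0
              [0, 0, 0]])   # looks like cluster 1
Q_good = np.array([[1.0, 0.0], [0.0, 1.0]])
Q_bad = np.array([[0.0, 1.0], [1.0, 0.0]])

ll_good = admixture_log_likelihood(G, Q_good, P)
ll_bad = admixture_log_likelihood(G, Q_bad, P)
```

Estimation methods search for the `Q` and `P` that make this kind of likelihood (or a posterior based on it) large. The rows of the estimated `Q` are what the bar plots display, one bar per individual.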

Varying K can illustrate different scales of population structure; using a small K for the entire human population will subdivide people roughly by continent, while using a large K will partition populations into finer subgroups.[5] Though clustering methods are popular, they are open to misinterpretation: for non-simulated data, there is never a "true" value of K, but rather an approximation considered useful for a given question.[1] The results are sensitive to sampling strategies, sample size, and close relatives in data sets; there may be no discrete populations at all; and there may be hierarchical structure where subpopulations are nested.[1] Clusters may be admixed themselves,[5] and may not have a useful interpretation as source populations.[10]


A study of population structure of humans in Northern Africa and neighboring populations modelled using ADMIXTURE and assuming K=2,4,6,8 populations (Figure B, top to bottom). Varying K changes the scale of clustering. At K=2, 80% of the inferred ancestry for most North Africans is assigned to a cluster that is common to Basque, Tuscan, and Qatari Arab individuals (in purple). At K=4, clines of North African ancestry appear (in light blue). At K=6, opposite clines of Near Eastern (Qatari) ancestry appear (in green). At K=8, Tunisian Berbers appear as a cluster (in dark blue).[11]

Dimensionality reduction

A map of the locations of genetic samples of several African populations (left) and principal components 1 and 2 of the data superimposed on the map (right). The principal coordinate plane has been rotated 16.11° to align with the map. It corresponds to the east-west and north-south distributions of the populations.[12]

Genetic data are high dimensional and dimensionality reduction techniques can capture population structure. Principal component analysis (PCA) was first applied in population genetics in 1978 by Cavalli-Sforza and colleagues and resurged with high-throughput sequencing.[5][13]

Initially PCA was used on allele frequencies at known genetic markers for populations, though later it was found that by coding SNPs as integers (for example, as the number of non-reference alleles) and normalizing the values, PCA could be applied at the level of individuals.[9] One formulation considers $N$ individuals and $S$ bi-allelic SNPs. For each individual $i$, the value at locus $\ell$ is $g_{i,\ell}$, the number of non-reference alleles (one of $0, 1, 2$). If the allele frequency at $\ell$ is $p_\ell$, then the resulting $N \times S$ matrix of normalized genotypes has entries:[5]

$$\frac{g_{i,\ell} - 2 p_\ell}{\sqrt{2 p_\ell (1 - p_\ell)}}$$

PCA transforms data to maximize variance; given enough data, when each individual is visualized as a point on a plot, discrete clusters can form.[9] Individuals with admixed ancestries will tend to fall between clusters, and when there is homogeneous isolation by distance in the data, the top PC vectors will reflect geographic variation.[9][14] The eigenvectors generated by PCA can be explicitly written in terms of the mean coalescent times for pairs of individuals, making PCA useful for inference about the population histories of groups in a given sample. PCA cannot, however, distinguish between different processes that lead to the same mean coalescent times.[15]
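The individual-level PCA described above can be sketched with NumPy. This is a minimal illustration under stated assumptions rather than a production pipeline: genotypes are coded as counts of the non-reference allele (0, 1, or 2), each SNP's allele frequency is estimated from the sample itself, monomorphic SNPs are dropped because their variance is zero, and the principal components are obtained from a singular value decomposition of the normalized matrix. The function name `genotype_pca` and the simulated data are hypothetical.

```python
import numpy as np


def genotype_pca(G, n_components=2):
    """PCA on an (individuals x SNPs) matrix of non-reference allele counts."""
    G = np.asarray(G, dtype=float)
    p = G.mean(axis=0) / 2.0                       # estimated allele frequency per SNP
    keep = (p > 0) & (p < 1)                       # drop monomorphic SNPs
    G, p = G[:, keep], p[keep]
    X = (G - 2 * p) / np.sqrt(2 * p * (1 - p))     # normalized genotypes
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * s[:n_components]  # PC coordinates per individual


# Simulated example: two strongly differentiated groups of 20 individuals each
rng = np.random.default_rng(0)
freq_a = np.full(200, 0.1)   # allele frequencies in group A (assumed for the demo)
freq_b = np.full(200, 0.9)   # allele frequencies in group B (assumed for the demo)
G = np.vstack([
    rng.binomial(2, freq_a, size=(20, 200)),
    rng.binomial(2, freq_b, size=(20, 200)),
])
pcs = genotype_pca(G)
```

With groups this differentiated, the first principal component places the two groups on opposite sides of zero. Real data usually show far weaker separation, and an individual simulated with mixed ancestry would be expected to fall between the two groups.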

Applications

  • Ancient stuff
  • Medical
  • Population histories
  • Conservation
  • Descriptive
  • In humans, non-human animals, plants, bacteria, etc

Refs

Non-human

  1. Lawson, Daniel J.; van Dorp, Lucy; Falush, Daniel (2018). "A tutorial on how not to over-interpret STRUCTURE and ADMIXTURE bar plots". Nature Communications. 9 (1). doi:10.1038/s41467-018-05257-7. ISSN 2041-1723. PMC 6092366.
  2. Meirmans, Patrick G.; Hedrick, Philip W. (2010). "Assessing population structure: FST and related measures". Molecular Ecology Resources. 11 (1): 5–18. doi:10.1111/j.1755-0998.2010.02927.x. ISSN 1755-098X.
  3. Hartl, Daniel L.; Clark, Andrew G. (1997). Principles of population genetics (3rd ed.). Sunderland, MA: Sinauer Associates. pp. 111–163. ISBN 0-87893-306-9. OCLC 37481398.
  4. Wright, Sewall (1949). "THE GENETICAL STRUCTURE OF POPULATIONS". Annals of Eugenics. 15 (1): 323–354. doi:10.1111/j.1469-1809.1949.tb02451.x. ISSN 2050-1420.
  5. Coop, Graham (2019). Population and Quantitative Genetics. pp. 22–44.
  6. Arbisser, Ilana M.; Rosenberg, Noah A. (2020). "FST and the triangle inequality for biallelic markers". Theoretical Population Biology. 133: 117–129. doi:10.1016/j.tpb.2019.05.003. ISSN 0040-5809.
  7. Pritchard, Jonathan K; Stephens, Matthew; Donnelly, Peter (2000). "Inference of Population Structure Using Multilocus Genotype Data". Genetics. 155 (2): 945–959. doi:10.1093/genetics/155.2.945. ISSN 1943-2631.
  8. Alexander, D. H.; Novembre, J.; Lange, K. (2009). "Fast model-based estimation of ancestry in unrelated individuals". Genome Research. 19 (9): 1655–1664. doi:10.1101/gr.094052.109. ISSN 1088-9051. PMC 2752134.
  9. Novembre, John; Ramachandran, Sohini (2011). "Perspectives on Human Population Structure at the Cusp of the Sequencing Era". Annual Review of Genomics and Human Genetics. 12 (1): 245–274. doi:10.1146/annurev-genom-090810-183123. ISSN 1527-8204.
  10. Novembre, John (2016). "Pritchard, Stephens, and Donnelly on Population Structure". Genetics. 204 (2): 391–393. doi:10.1534/genetics.116.195164. ISSN 1943-2631.
  11. Henn BM, Botigué LR, Gravel S, Wang W, Brisbin A, Byrnes JK, Fadhlaoui-Zid K, Zalloua PA, Moreno-Estrada A, Bertranpetit J, Bustamante CD, Comas D (January 2012). "Genomic ancestry of North Africans supports back-to-Africa migrations". PLoS Genet. 8 (1): e1002397. doi:10.1371/journal.pgen.1002397. PMC 3257290. PMID 22253600.
  12. Wang C, Zöllner S, Rosenberg NA (August 2012). "A quantitative comparison of the similarity between genes and geography in worldwide human populations". PLoS Genet. 8 (8): e1002886. doi:10.1371/journal.pgen.1002886. PMC 3426559. PMID 22927824.
  13. Menozzi, P; Piazza, A; Cavalli-Sforza, L (1978). "Synthetic maps of human gene frequencies in Europeans". Science. 201 (4358): 786–792. doi:10.1126/science.356262. ISSN 0036-8075.
  14. Novembre, John; Johnson, Toby; Bryc, Katarzyna; Kutalik, Zoltán; Boyko, Adam R.; Auton, Adam; Indap, Amit; King, Karen S.; Bergmann, Sven; Nelson, Matthew R.; Stephens, Matthew; Bustamante, Carlos D. (2008). "Genes mirror geography within Europe". Nature. 456 (7218): 98–101. doi:10.1038/nature07331. ISSN 0028-0836.
  15. McVean, Gil (2009). "A Genealogical Interpretation of Principal Components Analysis". PLoS Genetics. 5 (10): e1000686. doi:10.1371/journal.pgen.1000686. ISSN 1553-7404.
  16. Barroso, Gustavo V.; Moutinho, Ana Filipa; Dutheil, Julien Y. (2020), Dutheil, Julien Y. (ed.), "A Population Genomics Lexicon", Statistical Population Genomics, vol. 2090, New York, NY: Springer US, pp. 3–17, doi:10.1007/978-1-0716-0199-0_1, ISBN 978-1-0716-0198-3, retrieved 2021-05-31
  17. Liu, Chi-Chun; Shringarpure, Suyash; Lange, Kenneth; Novembre, John (2020), Dutheil, Julien Y. (ed.), "Exploring Population Structure with Admixture Models and Principal Component Analysis", Statistical Population Genomics, vol. 2090, New York, NY: Springer US, pp. 67–86, doi:10.1007/978-1-0716-0199-0_4, ISBN 978-1-0716-0198-3, retrieved 2021-05-31
  18. Gillespie, John H. (1998). "4". Population genetics : a concise guide. Baltimore, Md: The Johns Hopkins University Press. ISBN 0-8018-5754-6. OCLC 36817311.
  19. Kittayapong, Pattamaporn; Carvajal, Thaddeus M.; Ogishi, Kohei; Yaegeshi, Sakiko; Hernandez, Lara Fides T.; Viacrusis, Katherine M.; Ho, Howell T.; Amalin, Divina M.; Watanabe, Kozo (2020). "Fine-scale population genetic structure of dengue mosquito vector, Aedes aegypti, in Metropolitan Manila, Philippines". PLOS Neglected Tropical Diseases. 14 (5): e0008279. doi:10.1371/journal.pntd.0008279. ISSN 1935-2735.
  20. Tunstall, Tate; Kock, Richard; Vahala, Jiri; Diekhans, Mark; Fiddes, Ian; Armstrong, Joel; Paten, Benedict; Ryder, Oliver A.; Steiner, Cynthia C. (2018). "Evaluating recovery potential of the northern white rhinoceros from cryopreserved somatic cells". Genome Research. 28 (6): 780–788. doi:10.1101/gr.227603.117. ISSN 1088-9051.