# Population genetics of pan-genome

I am beginning to work in the field of human gut microbiome, and wondering how (and if) the concepts of population genetics could be applied there.

• Considering the competition between the species seems straightforward, although ecological models are probably more appropriate in this case.
• Considering the pan-genome of a single species, however, seems next to impossible, as it cannot be divided into independent alleles. One could probably define the number of alleles as the number of copies of a particular gene in the community. However, as the gene confers fitness on the whole community, it is not clear how to define selection. The notion of genetic drift also seems vague.

I will appreciate clarifications and references to relevant literature.

Update
To clarify the question:

• The question is rather theoretical, motivated more by my previous work than by what I am about to do in the field of metagenomics. The latter will be analysis of shotgun sequencing data, where the focus is on identifying new species and comparing the species composition in many patients. I doubt that we will have any data that permit us to study intra-species evolution within a single patient. However, understanding how bacteria evolve to be a particular strain in a particular patient could be a plus: if members of a bacterial community with different genomes live as a group, we can't meaningfully say that some of them are more fit or less fit.
• My past experience with population genetics is in the field of HIV, which for most purposes can be treated as a diploid hermaphrodite (each virion carries two copies of the genome, and the genomes are shuffled when a cell is co-infected with several virions). Thus, many basic population genetics concepts are easily applied, although the speed of evolution and the need for constant adaptation add other difficulties (motivating such approaches as fitness edge).
• Finally, as was pointed out in the comments, the problem is not specific to bacteria: altruism does take place in many communities, and should raise the same kinds of issues as I mentioned previously. I admit that I am not familiar with the necessary theoretical frameworks, and this can be considered a part of the question. I would guess, however, that most such frameworks do not account for direct gene exchange.

Update 2
A relatively recent article by Eduardo Rocha (2018) discusses the shortcomings of the neutral model (as the principal null model for modern genomics) when applied to prokaryotes. This has clarified a lot for me, and I give below a short list of its points:

1. Mutations in bacteria cannot be treated as neutral.
2. The concept of polymorphism is poorly defined for genomes of varying length, which calls for a new mathematical framework.
3. The promiscuity of bacteria, which freely recombine and exchange genes via horizontal gene transfer, makes it difficult to define species and to apply many concepts of traditional population genetics.
4. Difficulties in separating demographics and genetics (essentially, one needs a combined genetico-ecological approach).

Update 3
Inclusive fitness theory / kin selection theory probably fits into this context.

It is possible to perform gene-level population genetics analyses on microbiome-derived bacterial species.1,2 However, the biology of bacteria and the nature of metagenomic sequencing data make such analyses difficult, and many of the well-established tools of population genetics cannot be applied directly. Here are some of the things you should keep in mind as you build a conceptual framework for understanding the field:

## Bacterial genomes are flexible

The term "allele" connotes gene variants, and the underlying assumption of analyses involving alleles is that each (diploid) individual in a population has one or two alleles of a given gene. When considering a bacterial population, specifically all the cells that constitute a species, some genes will not be present in all cells. For this reason, "allele" is not frequently used in discussions of complex bacterial communities. For well-studied species like E. coli and K. pneumoniae with many complete strain genomes, plotting the number of genes against the number of genomes in which they are found gives a characteristic U-shaped distribution.3 For a given species, this suggests that many genes are present in most strains, and many genes are present in only one or two strains. The simple thing to do is to constrain your analysis to the gene set that you observe in all individuals in a population (the core genome, to use a pangenomics term), but this likely excludes genes that are important for niche-specific fitness, such as virulence factors, antibiotic resistance genes, and carbohydrate degradation operons. Recently, this problem has been addressed computationally by representing pangenomes as weighted graphs with genomic features at nodes 4 and by building composite reference genomes from strain genome data encoded as reference graphs.3
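As a rough illustration of the core/accessory split described above, here is a minimal Python sketch. The strain names, gene names, and the `core_threshold` parameter are hypothetical choices for the example, not taken from any particular tool; real pangenome software works from clustered gene families rather than exact gene names.

```python
from collections import Counter

def gene_frequencies(genomes):
    """Return the fraction of genomes in which each gene is present."""
    counts = Counter(gene for genes in genomes.values() for gene in set(genes))
    n_genomes = len(genomes)
    return {gene: count / n_genomes for gene, count in counts.items()}

def split_pangenome(genomes, core_threshold=0.9):
    """Split genes into core (frequency >= threshold) and accessory sets."""
    freqs = gene_frequencies(genomes)
    core = {gene for gene, freq in freqs.items() if freq >= core_threshold}
    accessory = set(freqs) - core
    return core, accessory

# Hypothetical toy data: gene content of four strains of one species.
genomes = {
    "strain1": {"dnaA", "gyrB", "blaTEM"},
    "strain2": {"dnaA", "gyrB"},
    "strain3": {"dnaA", "gyrB", "susC"},
    "strain4": {"dnaA", "gyrB"},
}

core, accessory = split_pangenome(genomes)
print(sorted(core))       # genes found in at least 90% of strains
print(sorted(accessory))  # genes found in fewer strains
```

In this toy set the resistance gene and the carbohydrate-utilization gene both land in the accessory set, which is exactly the niche-specific content a core-only analysis would drop.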

## Metagenomes are messy (but getting cleaner!)

Horizontal gene transfer is likely a frequent occurrence in microbiomes,5 confounding analyses that rely on strict inheritance. Since microbiome sequencing is often short-read, the problem of horizontal gene transfer is exacerbated by the difficulty in reliably assembling transferred genes with specific metagenome-assembled genomes. As noted by Maximillian Press, there are several ways to address this issue. At the bench, proximity ligation sequencing can supplement traditional metagenomic assembly-by-alignment using the ground truth of physical closeness.5,6 Likewise, single-cell isolation combined with multiplexed whole-genome amplification greatly reduces complexity of assembly.7 At the computer, sequence composition (i.e. kmer content) and coverage profiles can be leveraged to associate metagenomic contigs in bins,8,9 which are like pseudo-genomes.§ If you're lucky enough to have long-read sequencing data, many of the problems associated with assembly of short reads are no longer an issue,10 with the added bonus that SMRT and nanopore sequencing produce methylation data -- unique methylation profiles can be used to infer which contigs likely belong to the same genome or to associate extragenomic contigs (e.g. plasmids) with their hosts.11,12 All of this to say that getting complete genomes from microbiome sequencing is non-trivial, but there are many techniques and tools at your disposal.

## Microbiome data is relative

Much of population genetics deals in counted data -- $n$ individuals with allele $a$, $m$ individuals with allele $b$, and so on. Concerning microbiomes, the absolute abundances of individual microbes are not recoverable from sequencing alone. Microbiome data is therefore said to be compositional. Like RNA-seq, gene or organism abundances derived from metagenomic sequencing are relative, and, importantly, many of the assumptions underlying the statistical analyses applied to absolute counts do not hold for compositional data.13,14 Thankfully, statistical tools for population-level analysis of microbiomes have been developed,15,16 though their widespread adoption has been slow.
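One common way to work with compositional data is a log-ratio transform. The sketch below shows the centered log-ratio (CLR) transform in plain Python. The read counts and the `pseudocount` value are assumptions made for the example, and this is one option among several rather than a recommendation from the cited papers.

```python
import math

def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample.

    Each value becomes the log of the count minus the mean log count,
    so the result depends only on ratios between taxa, not on depth.
    The pseudocount (an assumed value) avoids log(0) for absent taxa.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]

# Hypothetical read counts for four taxa in one sample.
sample = [120, 30, 0, 850]
transformed = clr(sample)

# Without zeros and without a pseudocount, sequencing depth cancels out:
shallow = clr([12, 3, 85], pseudocount=0)
deep = clr([120, 30, 850], pseudocount=0)
```

The transformed values always sum to zero, and `shallow` and `deep` agree because only ratios between taxa carry information.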

## Microbiome members are often interdependent

As hinted at in the question update, bacteria display a type of community altruism, where an individual cell with a specific gene can influence the fitness of neighboring cells that lack the gene. For an example, see my answer to "Will all bacteria become resistant against all antibiotics in the long term?" concerning secreted β-lactamases. Therefore, spatial association is an added factor when considering population dynamics of a genetically heterogeneous bacterial species. Some methods that address cell-cell spatial proximity in microbiomes include sequencing of cryofractured fragments 17 and probe-based spectral imaging.18 Even if microbes are not spatially associated, different microbes may play complementary roles in the iterative metabolism of large carbohydrates into small metabolites.19,20

Surely, this discussion is incomplete, though I hope my answer has given you the footing you need to continue your own exploration to find the appropriate resources for your research. For a more in-depth discussion of the points I've addressed here, see What Is Metagenomics Teaching Us, and What Is Missed?,21 particularly the sections titled Strain-Level Analyses and Ecoevolutionary Modeling.

† I recommend keeping up with the publications of Ran Blekhman, Peer Bork, and Katie Pollard.

‡ Core genome membership seldom requires that 100% of individuals represented by that genome contain the gene in question, and setting such a strict cutoff would likely exclude truly essential genes that were not found associated with a given strain due to gaps. In practice, core genome cutoff thresholds are on the order of 90%.

§ Contigs and Bins and MAGs, oh my! As with any field, metagenomics has field-specific jargon. Coursera seems to have a good lecture that includes these concepts and how they relate to each other.

References

1. Garud NR, Pollard KS. Population Genetics in the Human Microbiome. Trends Genet. 2020 Jan;36(1):53-67. doi: 10.1016/j.tig.2019.10.010.
2. Priya S, Blekhman R. Population dynamics of the human gut microbiome: change is the only constant. Genome Biol. 2019 Jul 31;20(1):150.
3. Colquhoun RM et al. Nucleotide-resolution bacterial pan-genomics with reference graphs. bioRxiv 2020.11.12.380378.
4. Gautreau G et al. PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph. PLoS Comput Biol. 2020;16(3):e1007732.
5. Kent AG et al. Widespread transfer of mobile antibiotic resistance genes within individual gut microbiomes revealed through bacterial Hi-C. Nat Commun. 2020 Sep 1;11(1):4379.
6. Stalder T et al. Linking the resistome and plasmidome to the microbiome. ISME J. 2019 Oct;13(10):2437-2446.
7. Chijiiwa R et al. Single-cell genomics of uncultured bacteria reveals dietary fiber responders in the mouse gut microbiota. Microbiome 8, 5 (2020).
8. Alneberg J et al. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014 Nov;11(11):1144-6.
9. Kang DD et al. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ. 2015 Aug 27;3:e1165.
10. Kolmogorov M et al. metaFlye: scalable long-read metagenome assembly using repeat graphs. Nat Methods 17, 1103-1110 (2020).
11. Beaulaurier J et al. Metagenomic binning and association of plasmids with bacterial host genomes using DNA methylation. Nat Biotechnol. 2018;36(1):61-69.
12. Tourancheau A et al. Discovering and exploiting multiple types of DNA methylation from individual bacteria and microbiome using nanopore sequencing. bioRxiv. 2020.02.18.954636.
13. Tsilimigras MC and Fodor AA. Compositional data analysis of the microbiome: fundamentals, tools, and challenges. Ann Epidemiol. 2016 May;26(5):330-5.
14. Gloor GB et al. Microbiome Datasets Are Compositional: And This Is Not Optional. Front Microbiol. 2017 Nov 15;8:2224.
15. Shi P et al. Regression analysis for microbiome compositional data. Ann. Appl. Stat. 10 (2016), no. 2, 1019--1040.
16. Kurtz ZD et al. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput Biol. 2015 May 7;11(5):e1004226.
17. Sheth RU et al. Spatial metagenomic characterization of microbial biogeography in the gut. Nat Biotechnol. 2019 Aug;37(8):877-883.
18. Shi H et al. Highly multiplexed spatial mapping of microbial communities. Nature. 2020 Dec.
19. Kundu P et al. Species-wide Metabolic Interaction Network for Understanding Natural Lignocellulose Digestion in Termite Gut Microbiota. Sci Rep 9, 16329 (2019).
20. Selber-Hnatiw S et al. Metabolic networks of the human gut microbiota. Microbiology. 2020 Feb;166(2):96-119.
21. New F, Brito IL. What Is Metagenomics Teaching Us, and What Is Missed? Annual Review of Microbiology. 2020 74:1, 117-135.

I will answer this as best as I am able; the updates are helpful for providing context. I'll also note that there are several questions packed in here, which makes it a bit more laborious to answer. I am going to try to answer the following questions:

1. How can I infer selection on microbiota within single hosts (patients)?
2. How can I resolve genomes of related organisms from a complex mixture?
3. How can I do population genetics in the presence of horizontal gene transfer (HGT)?

Question 1

I think that there are some assumptions here that rest on methodological tractability that are, as far as I can tell, not accurate. For example: "… if members of a bacterial community with different genomes live as a group, we can't meaningfully say that some of them are more fit or less fit."

This seems to be the crux of the issue. I would argue that if you are willing to take time-courses from the same patient, it is not at all difficult to measure changes in abundance. Causal inference of e.g. selection is a somewhat harder problem, but one that has been treated exhaustively elsewhere.

If you are willing to postulate that you can deconvolute lineages from a single sample, and measure multiple samples from the same patient, then you can try to measure selection over time. Some steps of this are hard (e.g. if lineages are closely related), but those are merely technical problems that can be solved.
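To make the time-course idea concrete, here is a minimal sketch, assuming two deconvoluted lineages that each grow exponentially with a constant growth-rate difference. Under that assumption the log of their abundance ratio changes linearly in time, and the slope estimates the selection coefficient per unit time. The function name and the numbers are hypothetical, and drift, sampling noise, and measurement error are ignored.

```python
import math

def selection_from_timecourse(times, freq_a, freq_b):
    """Least-squares slope of ln(freq_a / freq_b) against time.

    Using a ratio of two lineages sidesteps the compositional problem:
    whatever happens to the rest of the community cancels out.
    """
    log_ratios = [math.log(a / b) for a, b in zip(freq_a, freq_b)]
    n = len(times)
    t_mean = sum(times) / n
    y_mean = sum(log_ratios) / n
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, log_ratios))
    sxx = sum((t - t_mean) ** 2 for t in times)
    return sxy / sxx

# Hypothetical patient sampled on four days; lineage A has an advantage
# of 0.1 per day over lineage B, and together they make up 40% of reads.
times = [0, 7, 14, 21]
ratios = [0.25 * math.exp(0.1 * t) for t in times]
freq_a = [0.4 * r / (1 + r) for r in ratios]
freq_b = [0.4 / (1 + r) for r in ratios]

s_hat = selection_from_timecourse(times, freq_a, freq_b)
print(round(s_hat, 3))
```

With real data the fit would be noisy, and a slope measured per day still has to be converted to a per-generation value using an estimate of generation time.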

There are also methods that try to infer bacterial growth rates from a single sample, but they're a bit more complicated.

Question 2

Resolution of complex mixed populations into genomes is a well-studied problem. There are a number of approaches to do this (all of which require relatively high investment and, probably, sequence assembly):

• Long-read single-molecule sequencing: simply use Nanopore or high-coverage PacBio reads to directly ascertain the genome. Expensive (especially for complex samples) but perfectly possible. If you use Nanopore you may not even need to assemble!
• Assembly binning: Statistical approaches such as MetaBAT2 attempt to deconvolute metagenomic assemblies based on sequence composition and contig coverage. Limited by the resolution of your assembly: if a contig is collapsed in the assembly, you can't put it in two places (as in the case of very closely related genomes).
• Proximity-ligation binning: similar to assembly binning, but using complementary single-cell information to place contigs into the same bin. More accurate and higher-resolution, but involves an additional data type/expense. Full disclosure: I work for a company that provides this as a service and sells kits to do it.

All of these methods, in principle, could be applied to a subset of samples (e.g. beginning and end) and reference genomes could be imputed to the other samples, where you can just measure them by low-coverage shotgun.

Question 3

This is a big one. I would suggest simply googling it to learn more than I could hope to tell you about population genetic theory in the presence of HGT.

The shorter version is that detectable HGT is not super common at the time scales that you are likely to be dealing with. Most HGT ends quickly with the transferred DNA getting digested for energy or selected out as genomic junk.

Nonetheless there is a clear case that selection of lineages can be driven by HGT events, especially under very strong challenges like antibiotics.

I would argue that if you can measure abundance of a lineage and assign HGTs to genomes in specific samples (readily possible with proximity ligation at least), there is no barrier to doing population genetics and measuring selection. It's just another kind of mutation that happens to have a high rate and to occur in parallel. People measure the selective advantage of e.g. antibiotic resistance all the time.

You may be using heuristics and ad hoc methods for some steps, but genomics is just heuristics and ad hoc methods anyways.

## Study Notes on Population Genetics

Study of the frequency of genes and genotypes in a mendelian population is known as population genetics. In other words, it is a branch of genetics which deals with the frequency of genes and genotypes in mendelian populations. Before dealing with population genetics, it is essential to define mendelian population, gene frequency and genotype frequency.

There are two important features of mendelian population, viz:

(1) Random mating, and

(2) Equal survival of all genotypes.

In case of random mating, each individual of one sex has equal chance of mating with every individual of opposite sex. In other words, there is no restriction on mating of one individual with other individuals. Such inter-mating populations are also known as panmictic populations.

Random mating populations maintain high level of variability and adaptability. Random mating individuals belong to the same species and same gene pool. The gene pool refers to the sum total of genes in a mendelian population.

Gene frequency refers to the proportion of different alleles of a gene in a random mating population. It is also known as genetic frequency. In other words, the proportion of each type of allele at a particular locus in a random mating population is referred to as gene frequency. The composition of a population is described in terms of gene frequencies.

Estimation of gene frequencies in a population consists of three important steps as given below:

1. Sampling:

First, a random sample of individuals is drawn from the random mating population under study. The size of the sample should be adequate to represent all the individuals of a population.

2. Grouping and counting:

After sampling, the individuals are grouped into different classes for a gene and their number is counted.

3. Calculation of gene frequency:

Suppose a random sample of 100 individuals was drawn from a random mating population of the four o'clock plant (Mirabilis jalapa). Out of 100 plants, 30 had red flowers, 40 had pink flowers and 30 had white flowers.

Now, the allele frequency will be worked out as follows:

(a) In the four o'clock plant, a cross between red and white flowered strains produces pink flowers in F1 and red, pink and white flowered plants in a 1 : 2 : 1 ratio in the F2 generation. Thus, plants with red flowers are homozygous for the dominant allele (RR) and plants with white flowers are homozygous for the recessive allele (rr).

(b) Each heterozygous individual with pink colour will have dominant (R) and recessive (r) alleles in equal number.

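1. Number of R alleles in the sample (100 individuals)

= 2 (No. of red individuals) + No. of pink individuals = 2(30) + 40 = 100

2. Proportion of R alleles in the sample

= Number of R alleles / 2 (Total plants in the sample) = 100/200 = 0.50

Similarly, the number of r alleles = 2 (No. of white individuals) + No. of pink individuals = 2(30) + 40 = 100, and the proportion of r alleles = 100/200 = 0.50.

Therefore, the frequency of the R and r alleles is 0.50 each.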
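The counting rule for this example can be sketched in a few lines of Python. The function and argument names are illustrative only; the rule itself is that each homozygote contributes two copies of its allele and each heterozygote contributes one copy of each.

```python
def allele_frequencies(n_RR, n_Rr, n_rr):
    """Allele frequencies from genotype counts at a two-allele locus."""
    total_alleles = 2 * (n_RR + n_Rr + n_rr)  # diploid: two alleles per plant
    freq_R = (2 * n_RR + n_Rr) / total_alleles
    freq_r = (2 * n_rr + n_Rr) / total_alleles
    return freq_R, freq_r

# Sample from the text: 30 red (RR), 40 pink (Rr), 30 white (rr).
freq_R, freq_r = allele_frequencies(30, 40, 30)
print(freq_R, freq_r)  # 0.5 0.5
```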

Genotype frequency refers to the ratio of different genotypes in a mendelian population. Genotypic frequency is also known as zygotic frequency. The estimation of genotypic frequency for a gene in a population also consists of the three steps mentioned above.

Thus, the genotypic frequency of the three types of individuals in the above sample is calculated as the ratio of the number of individuals in each genotype class to the total number of individuals in the sample:

1. Frequency of Red (RR) individuals = 30/100 = 0.30

2. Frequency of Pink (Rr) individuals = 40/100 = 0.40

3. Frequency of White (rr) individuals = 30/100 = 0.30

Hardy-Weinberg Law:

Foundation of population genetics was laid by G.H. Hardy, an English mathematician and W. Weinberg, a German physician in 1908. They independently discovered a principle concerned with the frequency of genes (alleles) in a population. Their principle is commonly known as Hardy-Weinberg Law.

The Hardy-Weinberg Law states that:

1. In a random mating population, the frequency of genes and genotypes remains constant generation after generation, if there is no selection, mutation, migration and random genetic drift.

2. They also developed a mathematical relationship to describe the equilibrium between alleles. According to this relationship, the frequencies of the three genotypes for a single locus with two alleles (A and a) are in the ratio P² AA : 2Pq Aa : q² aa, where P and q are the frequencies of alleles A and a, respectively. P + q always equals 1; P = q = 0.50 is only the special case in which the two alleles are equally frequent.

P + q = 1, so P = 1 – q and q = 1 – P

Effect of Random Mating:

Random mating maintains the equilibrium of gene frequencies in a population. For example, suppose the frequency of allele A is P and that of a is q. A cross between AA and aa produces Aa. If individuals with the Aa genotype are allowed to mate randomly, the frequencies of the three genotypes will be in the ratio P² AA : 2Pq Aa : q² aa (Fig. 30.1).

When gene frequencies are in equilibrium, it indicates absence of mutation, selection, migration and genetic drift in a population.
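A quick way to check whether a sample is consistent with Hardy-Weinberg proportions is to compare observed genotype counts with the counts expected from the estimated allele frequencies. The sketch below applies a chi-square statistic to the four o'clock sample from earlier in these notes; the function names are illustrative, and the chi-square test is a standard technique chosen here, not one named in the text.

```python
def hardy_weinberg_expected(n_AA, n_Aa, n_aa):
    """Expected genotype counts (AA, Aa, aa) under Hardy-Weinberg."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele a
    return [p * p * n, 2 * p * q * n, q * q * n]

def chi_square(observed, expected):
    """Pearson chi-square statistic for observed vs expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [30, 40, 30]  # red, pink, white plants
expected = hardy_weinberg_expected(*observed)
stat = chi_square(observed, expected)
print(expected)  # [25.0, 50.0, 25.0]
print(stat)      # 4.0
```

With one degree of freedom, a statistic of 4.0 is just above the conventional 5% cutoff of about 3.84, so this particular sample shows a mild deficit of heterozygotes relative to the expected 25 : 50 : 25.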

Factors Affecting Gene Frequency:

Hardy-Weinberg principle is based on three main assumptions, viz:

(1) Random mating,

(2) Equal survival of all genotypes, and

(3) Absence of evolutionary forces like selection, mutation, migration and random genetic drift. Non fulfillment of these assumptions will lead to alteration in gene and genotype frequencies in a population.

However, the last assumption is seldom fulfilled. Selection, mutation, migration and genetic drift change gene frequencies in a population. These factors are also known as forces of evolution because they play a key role in natural evolution.

These are briefly discussed below:

Selection refers to a process which favours the survival and reproduction of some individuals in a population. The process of evolution in nature in which the fittest individuals survive and the rest are wiped out is known as natural selection. Natural selection favours those characters which are advantageous for survival.

The selection by human efforts is known as artificial selection. Such selection favours those plant characters which are useful for mankind like productivity. Before discussing the effect of various types of selection, it is necessary to give brief account of fitness and selection coefficient.

The relative reproductive success of different genotypes of a population in the same environment under natural selection is known as fitness or selective value or adaptive value or selective advantage. It is denoted by W. If the value of W is unity (W = 1), there is 100 per cent survival and if this value is 0 (W = 0), the genotype is completely lethal.

Survival depends on two main factors, viz:

(i) The number of seeds produced by each genotype, and

(ii) The proportion of seeds of each genotype which reaches maturity and produces offspring.

The reproductive rate of different genotypes is estimated in relation to the most fit genotype. If the reproductive rate of the most fit genotype is X1 and that of the other genotypes is X2 and X3, then the fitness values are W1 = X1/X1 = 1, W2 = X2/X1, W3 = X3/X1, etc. The value of W varies between 0 and 1.

Selection Coefficient:

Selection coefficient is a measure of the rate of elimination of different genotypes from a population under natural selection in a particular environment. In other words, it is the measure of the rate of reduction in the adaptive value of a genotype in relation to standard or the most favoured genotype. It is also known as selective disadvantage and is represented by S.

The greater the value of the selection coefficient, the lower the survival rate; the smaller the value of S, the greater the survival rate. The value of S varies between 0 and 1. If S = 1 there is no survival at all; if S = 0 there is 100 per cent survival.

There is a close relationship between fitness (W) and selection coefficient (S) as given below:

W = 1 – S and S = 1 – W. Thus, selection coefficient is estimated with the help of fitness value or for the estimation of selection coefficient first the value of fitness (W) is estimated. Selection coefficient differs from selection differential in three ways (Table 30.1).

Thus, selection coefficient measures the rate of elimination of different genotypes from a population under natural selection, whereas the selection differential is a measure of difference between the mean phenotypic value of selected plants and mean phenotypic value of parental population under human selection.
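The relationship between reproductive rate, fitness (W) and selection coefficient (S) can be sketched as follows. The reproductive rates are made-up numbers for illustration; the only rules used are W = X / X(most fit) and S = 1 – W, as given above.

```python
def relative_fitness(reproductive_rates):
    """Scale each genotype's reproductive rate by that of the fittest."""
    best = max(reproductive_rates.values())
    return {genotype: rate / best for genotype, rate in reproductive_rates.items()}

def selection_coefficients(fitness):
    """Selection coefficient S = 1 - W for each genotype."""
    return {genotype: 1 - w for genotype, w in fitness.items()}

# Hypothetical numbers of surviving offspring per genotype.
rates = {"AA": 100, "Aa": 80, "aa": 50}
W = relative_fitness(rates)
S = selection_coefficients(W)
print(W)
print(S)
```

The most fit genotype always gets W = 1 and S = 0, so the other values read directly as a proportional disadvantage.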

Selection may operate at any stage (gametic or zygotic) of the life cycle of an individual. Sometimes, selection acts at the gametic stage, which is referred to as gametic selection. Such selection acts mostly in haploid organisms and in some higher organisms. The tendency of higher organisms to exhibit differential survival rates of gametes is termed segregation distortion or meiotic drive.

Meiotic drive is generally restricted to either male or female sex in a species. The zygotic selection operates generally in higher organisms. When certain genotypes are favoured by selection, the Hardy-Weinberg equilibrium will be disturbed. In such situation, frequency of some alleles in the population will increase while those of others will decrease.

Zygotic selection may act in three ways, viz:

(i) Against dominant phenotypes,

(ii) Against recessive phenotypes, and

(iii) In favour of heterozygotes.

(i) Selection against Dominant Phenotypes:

When selection acts against dominant phenotypes, it will eliminate both AA and Aa individuals from a population and favour only recessive phenotypes (aa). The elimination process will continue till the entire population is converted into homozygous recessive (aa) phenotypes.

Such selection, leads to fixation of recessive genes and elimination of dominant genes in a population. Since the phenotypes of both homozygous dominant (AA) and heterozygous dominant (Aa) are same, the allele A cannot be protected from elimination even in the heterozygous condition. In such situation, the value of S is 1 for AA and Aa genotypes.

(ii) Selection against Recessive Phenotypes:

Such type of selection leads to elimination of homozygous recessive phenotypes (aa) from a population. Under such type of selection, the value of coefficient of selection (S) is 1 for aa phenotypes. Such selection will lead to increase of AA and Aa genotypes in a population. However, Aa genotypes will continuously produce aa phenotypes due to segregation.

(iii) Selection in Favour of Heterozygotes:

Such type of selection leads to elimination of both dominant and recessive homozygotes (AA and aa). The value of S in such situation is 1 for AA and aa genotypes.

An excess of heterozygotes in a population is an indication of selection in favour of heterozygotes, or against both homozygotes: the frequency of homozygotes decreases sharply and the population is dominated by heterozygotes. Such heterozygotes are found in Oenothera.
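The three cases of zygotic selection can be explored with the standard one-locus recursion for viability selection under random mating. This is a textbook model added here as a sketch, with the genotype fitnesses passed in as assumed parameters; it is not a formula given in these notes.

```python
def next_allele_freq(p, w_AA, w_Aa, w_aa):
    """Frequency of allele A after one generation of viability selection.

    Genotypes start at Hardy-Weinberg proportions (p^2, 2pq, q^2) and
    survive in proportion to their fitness values.
    """
    q = 1 - p
    mean_w = p * p * w_AA + 2 * p * q * w_Aa + q * q * w_aa
    return (p * p * w_AA + p * q * w_Aa) / mean_w

p0 = 0.5

# (i) Complete selection against the dominant phenotype (S = 1 for AA, Aa).
p_dominant = next_allele_freq(p0, 0.0, 0.0, 1.0)

# (ii) Complete selection against the recessive phenotype (S = 1 for aa).
p_recessive = next_allele_freq(p0, 1.0, 1.0, 0.0)

# (iii) Only heterozygotes survive (S = 1 for AA and aa).
p_hetero = next_allele_freq(0.9, 0.0, 1.0, 0.0)

print(p_dominant, p_recessive, p_hetero)
```

In case (i) allele A is lost in a single generation. In case (ii) allele a drops from 0.5 to 1/3 but persists, because heterozygotes shelter it. In case (iii) the allele frequency returns to 0.5 whatever the starting value.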

Genetic Polymorphism:

The regular occurrence of several phenotypes in a genetic population is known as genetic polymorphism. The genetic polymorphism is usually maintained due to superiority of heterozygotes over both homozygotes. When polymorphism is maintained as a result of heterozygote advantage, it is known as balanced polymorphism.

Sometimes it is difficult to identify the polymorphic allelic forms by visual observation. The best way of detecting polymorphic alleles is by isozyme or gel electrophoretic studies. It has been reported that two-thirds of the loci in a population exhibit polymorphism.

Genetic polymorphism increases the adaptive value or buffering capacity of a population by providing increased diversity of genotypes in a population. Thus, genetic polymorphism enhances the adaptability of a population, because heterozygotes are more adaptable than homozygotes.

Mutation refers to a sudden heritable change in the features of an organism. Mutations differ from segregants in terms of their extremely low frequency. Gene mutations are ultimate sources of new alleles and thus of genetic variability.

The new mutation which we observe today would have originated long ago. Mutations lead to alteration of gene frequencies in a population. Alleles change from one form to another by way of mutation. Mutations may occur in both forward and reverse directions, but the frequency of forward mutations is much higher than reverse mutations.

When there is mutation in both the directions the equilibrium condition can be expressed as follows:

The equilibrium is attained very slowly.
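The expression itself is not reproduced in these notes. The standard textbook form is given below as a sketch under assumed notation: u is the forward mutation rate (A → a), v is the reverse rate (a → A), and q is the frequency of allele a. These symbols are assumptions for the illustration, not definitions taken from the passage.

```latex
\Delta q = u(1 - q) - vq = 0
\quad\Longrightarrow\quad
\hat{q} = \frac{u}{u + v},
\qquad
\hat{p} = \frac{v}{u + v}
```

Because u and v are both very small, the change per generation is tiny, which is why the equilibrium is approached so slowly.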

Joint Effects of Mutation and Selection:

The rate of change in gene frequency will increase, if mutation and selection are in the same direction. However, if they are in opposite direction which is the usual case, a stable equilibrium may be observed. If a dominant allele arises by mutation at the rate u per generation and is opposed by selection at the rate S, the equilibrium frequency of mutant q will be as follows:

If s is the selection coefficient against the recessive homozygote and u is the mutation rate from A → a, then the equilibrium value for a harmful recessive would be:
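Neither equilibrium expression is reproduced in these notes. The usual textbook approximations for mutation-selection balance are sketched below, assuming u is the mutation rate towards the harmful allele, s is the selection coefficient against the genotypes that express it, and u is much smaller than s.

```latex
% Harmful dominant allele (selection acts on every carrier):
\hat{q} \approx \frac{u}{s}

% Harmful recessive allele (selection acts only on the homozygote):
\hat{q} \approx \sqrt{\frac{u}{s}}
```

For the same u and s, the recessive case gives a much higher equilibrium frequency, because most copies of the allele sit in heterozygotes where selection cannot act on them.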

Gene flow or migration can also change frequencies of alleles in populations. Migration includes both immigration (incoming) and emigration (outgoing) of alleles in a population. Mass immigration and emigration have tremendous potential to change allelic frequencies in populations.

Migration generally refers to the movement of individuals into a population from a different population. Migration may introduce new alleles into the population. Once the migrants mate with individuals of the original population, these new alleles may alter gene and genotype frequencies in the population.

The rate of change in gene frequency, through migration depends on the number of migrants. If the number of migrants is high, the rate of change will be rapid and vice versa. Emigration of some individuals from a population results in decrease in the frequency of alleles migrated to another population.

Random drift or genetic drift refers to random change in gene frequency due to sampling error. Random drift is generally stronger when the sample (population) size is small. A large sample provides a truly representative value of a population, that is, a value nearer to the population mean.

Therefore, sample size should be adequate to avoid sampling error. Three forces of evolution, viz., selection, mutation and migration, alter gene and genotype frequency in a particular direction and are called directional factors. However, random genetic drift is a non-directional factor because it does not change the gene frequency in a particular direction.

The direction of change in the gene frequency may differ from generation to generation. In one generation, the change of gene frequency may be in one direction, which may change to opposite direction in the next generation.
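The non-directional character of drift, and its dependence on population size, can be illustrated with a small Wright-Fisher style simulation. This is a standard model chosen for the sketch, not one named in these notes; the population sizes, number of generations, and random seed are arbitrary.

```python
import random

def wright_fisher(n_individuals, p0, generations, rng):
    """Allele-frequency trajectory under pure drift (no selection).

    Each generation, 2N gene copies are drawn at random from the
    previous generation's allele frequency.
    """
    n_copies = 2 * n_individuals
    p = p0
    trajectory = [p]
    for _ in range(generations):
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        trajectory.append(p)
    return trajectory

def final_freq_variance(n_individuals, replicates=100, generations=20, seed=1):
    """Variance of the final allele frequency across replicate populations."""
    rng = random.Random(seed)
    finals = [wright_fisher(n_individuals, 0.5, generations, rng)[-1]
              for _ in range(replicates)]
    mean = sum(finals) / len(finals)
    return sum((f - mean) ** 2 for f in finals) / len(finals)

var_small = final_freq_variance(10)    # 10 diploid individuals
var_large = final_freq_variance(500)   # 500 diploid individuals
print(var_small, var_large)
```

Replicate populations of 10 individuals scatter widely around the starting frequency of 0.5, while populations of 500 stay close to it, even though the expected change in every generation is zero in both cases.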

Sometimes a new population is established by a single individual or a few individuals from the main population. Such individuals are referred to as founders, and the effect of such individuals on the gene frequency of a population is known as the founder effect. The founder effect is an important factor which sometimes results in the formation of new species.

Significance of Population Genetics:

1. Knowledge of gene and genotype frequency in a population is useful for a plant breeder in the assessment of competitive ability of various genotypes in varietal mixtures. Such studies help in identification of genotypes with high adaptive value.

If such studies are conducted over multiplications, the varietal flexibility or stability can also be assessed in varietal blends. Hardy-Weinberg Law operates in random mating or panmictic species.

2. Study of gene frequency in a population also reveals significance of various factors in natural evolution. In cross pollinated crops, development of composite and synthetic varieties is based on Hardy-Weinberg principle.

## Background

Population dynamics dictate the evolution of species, such that organisms with large effective population sizes (Ne) evolve under effective selection, preventing most deleterious alleles from reaching fixation in the population, and those with small Ne are more susceptible to genetic drift, whereby alleles can sometimes reach fixation irrespective of their adaptive value. Like other traits, the structure of genomes is shaped by selection and drift, such that organisms with smaller Ne accumulate weakly deleterious sequences, such as mobile elements, intergenic DNA, and introns [1]. Conversely, in species with large Ne, deleterious sequences have a low probability of reaching fixation through stochastic processes and are eliminated by selection. Thus, the genomes of species with large Ne would be expected to lack slightly deleterious, non-functional sequences, and the genomes of species with small population sizes would possess such sequences [1, 2]. For these reasons, Ne is thought to be the main parameter driving the evolution of genome size in eukaryotes and in bacteria [1,2,3].

Multiple parameters contribute to differences in Ne across organisms. Naturally, census population size and its fluctuation over time are the primary determinants of Ne. Population substructure can reduce Ne through non-random breeding in sexual species, such that Ne in animals is largely governed by parental investment and fecundity rather than geographic range or demographic perturbations [4]. In contrast, the determinants of Ne remain largely enigmatic for microbial organisms. Whereas microbes often reach enormous census population sizes, estimates of their effective population sizes are usually many orders of magnitude lower [5]. This discrepancy between predicted and observed population sizes suggests that demographic fluctuations and other mechanisms contribute to the loss of a large part of their genetic diversity.

Estimating the effective population sizes of bacterial species has been considered problematic for several reasons: (i) Genomic-based methods used to estimate Ne rely on segregating alleles at neutral sites, but since selection might potentially be acting on every nucleotide position in bacterial genomes [6], identification of strictly neutral sites is challenging. Moreover, the imprint of selection is a time-dependent process [7], so Ne estimates that consider any non-neutral sites must be adjusted for divergence time. (ii) Due to clonality and genomic linkage, both background selection against deleterious alleles and selective sweeps of beneficial alleles result in the loss of polymorphism. These processes, better known as Hill-Robertson effects [8], are thought to strongly impair most common estimators of Ne in asexual or variably recombining organisms [9]. (iii) Ne estimates depend on the population in question—typically entire species—and the delineation of species boundaries in bacteria has been fraught with difficulties [10].

In this study, we apply a standardized framework that uniformly defines species borders to derive relative and absolute estimates of Ne across Bacteria and Archaea. We examine multiple traits that can potentially affect Ne across a set of 153 prokaryotic species, and the relationship between Ne and genome size and pan-genome size. By further analyzing the relationship between drift and population size on the complete gene repertoires of bacterial species, we show that pan-genome size—rather than absolute genome size—is likely shaped by the effectiveness of selection across species.

## Building a pan-genome reference for a population

A reference genome is a high quality individual genome that is used as a coordinate system for the genomes of a population, or genomes of closely related subspecies. Given a set of genomes partitioned by homology into alignment blocks we formalize the problem of ordering and orienting the blocks such that the resulting ordering maximally agrees with the underlying genomes' ordering and orientation, creating a pan-genome reference ordering. We show this problem is NP-hard, but also demonstrate, empirically and within simulations, the performance of heuristic algorithms based upon a cactus graph decomposition to find locally maximal solutions. We describe an extension of our Cactus software to create a pan-genome reference for whole genome alignments, and demonstrate how it can be used to create novel genome browser visualizations using human variation data as a test. In addition, we test the use of a pan-genome for describing variations and as a reference for read mapping.

Keywords: algorithms computational molecular biology genomics molecular evolution sequence analysis.

### Figures

An illustration of a pan-genome reference on a sequence graph. (A) A bidirected…

Prototype UCSC pangenome reference browser screenshots. (Top) Indels. (Middle) A segregating…

## Genetics Proves Indian Population Mixture

Between 4,000 and 2,000 years ago, intermarriage in India was rampant. Figure by Thangaraj Kumarasamy

Scientists from Harvard Medical School and the CSIR-Centre for Cellular and Molecular Biology in Hyderabad, India, provide evidence that modern-day India is the result of recent population mixture among divergent demographic groups.

The findings, published August 8 in the American Journal of Human Genetics, describe how India transformed from a country where mixture between different populations was rampant to one where endogamy—that is, marrying within the local community and a key attribute of the caste system—became the norm.

“Only a few thousand years ago, the Indian population structure was vastly different from today,” said co–senior author David Reich, professor of genetics at Harvard Medical School. “The caste system has been around for a long time, but not forever.”

In 2009, Reich and colleagues published a paper based on an analysis of 25 different Indian population groups. The paper described how all populations in India show evidence of a genetic mixture of two ancestral groups: Ancestral North Indians (ANI), who are related to Central Asians, Middle Easterners, Caucasians, and Europeans and Ancestral South Indians (ASI), who are primarily from the subcontinent.

However, the researchers wanted to glean clearer data as to when in history such admixture occurred. For this, the international research team broadened their study pool from 25 to 73 Indian groups.

The researchers took advantage of the fact that the genomes of Indian people are a mosaic of chromosomal segments of ANI and ASI descent. Originally when the ANI and ASI populations mixed, these segments would have been extremely long, extending the entire lengths of chromosomes. However, after mixture these segments would have been broken up at one or two places per chromosome, per generation, by the recombination of maternal and paternal genetic material that occurs during the production of egg and sperm.

By measuring the lengths of the segments of ANI and ASI ancestry in Indian genomes, the authors were thus able to obtain precise estimates of the age of population mixture, which they infer varied from about 1,900 to 4,200 years ago, depending on the population analyzed.
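The logic of dating a mixture event from segment lengths can be sketched with a deliberately crude model. It is not the method used in the study. It assumes a single pulse of mixture, roughly one crossover per Morgan per generation, and a mean tract length of about 1/((1 - m) t) Morgans for ancestry that makes up a fraction m of the genome after t generations. The function names, the example tract length and the generation time are all hypothetical.

```python
def generations_since_admixture(mean_tract_morgans, ancestry_fraction):
    """Crude single-pulse estimate of the number of generations since mixture.

    mean_tract_morgans: mean length (in Morgans) of tracts of one ancestry.
    ancestry_fraction:  genome-wide fraction m contributed by that ancestry.
    Assumes mean tract length ~ 1 / ((1 - m) * t), so t ~ 1 / ((1 - m) * L).
    """
    if mean_tract_morgans <= 0:
        raise ValueError("mean tract length must be positive")
    if not 0.0 <= ancestry_fraction < 1.0:
        raise ValueError("ancestry fraction must lie in [0, 1)")
    return 1.0 / ((1.0 - ancestry_fraction) * mean_tract_morgans)


def years_since_admixture(mean_tract_morgans, ancestry_fraction, years_per_generation):
    """Convert the generation estimate to years with an assumed generation time."""
    t = generations_since_admixture(mean_tract_morgans, ancestry_fraction)
    return t * years_per_generation


if __name__ == "__main__":
    # Hypothetical inputs: 2 cM mean tracts, 50% ancestry, 29-year generations.
    print(years_since_admixture(0.02, 0.5, 29))
```

Shorter tracts imply more generations of recombination and therefore an older mixture date, which is the qualitative relationship the researchers exploited.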

While the findings show that no groups in India are free of such mixture, the researchers did identify a geographic element. “Groups in the north tend to have more recent dates and southern groups have older dates,” said co-first author Priya Moorjani, a graduate student in Reich’s lab at Harvard Medical School. “This is likely because the northern groups have multiple mixtures.”

“This genetic data tells us a three-part cultural and historical story,” said Reich, who is also an associate member of the Broad Institute. “Prior to about 4000 years ago there was no mixture. After that, widespread mixture affected almost every group in India, even the most isolated tribal groups. And finally, endogamy set in and froze everything in place.”

“The fact that every population in India evolved from randomly mixed populations suggests that social classifications like the caste system are not likely to have existed in the same way before the mixture,” said co–senior author Lalji Singh, currently of Banaras Hindu University, in Varanasi, India, and formerly of the CSIR-Centre for Cellular and Molecular Biology. “Thus, the present-day structure of the caste system came into being only relatively recently in Indian history.”*

But once established, the caste system became genetically effective, the researchers observed. Mixture across groups became very rare.

“An important consequence of these results is that the high incidence of genetic and population-specific diseases that is characteristic of present-day India is likely to have increased only in the last few thousand years when groups in India started following strict endogamous marriage,” said co–first author Kumarasamy Thangaraj, of the CSIR-Centre for Cellular and Molecular Biology, Hyderabad, India.**

Mohan Rao, Director, CSIR-CCMB said, “CCMB's continuing efforts over a decade on this field had helped in understanding the complexity of Indian population history and social structure, such as caste systems.”

This study was funded by the NIH (GM100233); NSF (HOMINID grant 1032255); a UKIERI Major Award (RG-4772); the Network Project (GENESIS: BSC0121) fund from the Council of Scientific and Industrial Research, Government of India; a Bhatnagar Fellowship grant from the Council of Scientific and Industrial Research of the Government of India; and a J.C. Bose Fellowship from the Department of Science and Technology, Government of India.

*, ** Quotes adapted from American Journal of Human Genetics news release.

## What Use Is Population Genetics?

The Genetics Society of America’s Thomas Hunt Morgan Medal is awarded to an individual GSA member for lifetime achievement in the field of genetics. For over 40 years, 2015 recipient Brian Charlesworth has been a leader in both theoretical and empirical evolutionary genetics, making substantial contributions to our understanding of how evolution acts on genetic variation. Some of the areas in which Charlesworth’s research has been most influential are the evolution of sex chromosomes, transposable elements, deleterious mutations, sexual reproduction, and life history. He also developed the influential theory of background selection, whereby the recurrent elimination of deleterious mutations reduces variation at linked sites, providing a general explanation for the correlation between recombination rate and genetic variation.

I am grateful to the Genetics Society of America for honoring me with the Thomas Hunt Morgan Medal and for inviting me to contribute this essay. I have spent nearly 50 years doing research in population genetics. This branch of genetics uses knowledge of the rules of inheritance to predict how the genetic composition of a population will change under the forces of evolution and compares the predictions to relevant data. As our knowledge of how genomes are organized and function has increased, so has the range of problems confronted by population geneticists. We are, however, a relatively small part of the genetics community, and sometimes it seems that our field is regarded as less important than those branches of genetics concerned with the properties of cells and individual organisms.

I will take this opportunity to explain why I believe that population genetics is useful to a broad range of biologists. The fundamental importance of population genetics is the basic insights it provides into the mechanisms of evolution, some of which are far from intuitively obvious. Many of these insights came from the work of the first generation of population geneticists, notably Fisher, Haldane, and Wright. Their mathematical models showed that, contrary to what was believed by the majority of biologists in the 1920s, natural selection operating on Mendelian variation can cause evolutionary change at rates sufficient to explain historical patterns of evolution. This led to the modern synthesis of evolution (Provine 1971). No one can claim to understand how evolution works without some basic understanding of classical population genetics; those who do run the risk of making mistakes such as asserting that rapid evolutionary change is most likely to occur in small founder populations (Mayr 1954).

The modern synthesis is getting on for 80 years old, so this argument will probably not convince skeptical molecular geneticists that population genetics has a lot to offer the modern biologist. I provide two examples of the useful role that population genetic studies can play. First, one of the most notable discoveries of the past 40 years was the finding that the genomes of most species contain families of transposable elements (TEs) with the capacity to make new copies that insert elsewhere in the genome (Shapiro 1983). This led to two schools of thought about why they are present in the genome. One claimed that TEs are maintained because they confer benefits on the host by producing adaptively useful mutations (Syvanen 1984); the other believed that they are parasites, maintained by their ability to replicate within the genome despite potentially deleterious fitness effects of TE insertions (Doolittle and Sapienza 1980; Orgel and Crick 1980).

The second hypothesis can be tested by comparing population genetic predictions with the results of TE surveys within populations. In the early 1980s, Chuck Langley, myself and several collaborators tried to do just this, using populations of Drosophila melanogaster (Charlesworth and Langley 1989). The models predicted that most Drosophila TEs should be found at low population frequencies at their insertion sites. This is so because D. melanogaster populations have large effective sizes (Ne). Ne is essentially the number of individuals that genetically contribute to the next generation. Large Ne means that a very small selection pressure can keep deleterious elements at low frequencies. This is a consequence of one of the most important findings of classical population genetics: the fate of a variant in a population is determined by the product of Ne and the strength of selection (Fisher 1930; Kimura 1962). If, for example, Ne is 1000, a mutation that reduces fitness relative to wild type by 0.001 will be eliminated from the population with near certainty.
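The dependence on the product of Ne and the strength of selection can be made concrete with Kimura's (1962) diffusion approximation for the fixation probability of a new mutation. The sketch below is illustrative only. It assumes a diploid population, a single new copy at initial frequency 1/(2N), and additive (genic) selection with coefficient `s` per copy; the function name is hypothetical.

```python
import math


def fixation_probability(ne, s, census_n=None):
    """Kimura's diffusion approximation for a single new mutation.

    ne:       effective population size (diploid individuals)
    s:        additive selection coefficient per copy (negative = deleterious)
    census_n: census size setting the initial frequency 1/(2N); defaults to ne
    Note: math.expm1 can overflow when -4 * ne * s is very large.
    """
    n = ne if census_n is None else census_n
    p0 = 1.0 / (2.0 * n)
    if abs(s) < 1e-12:
        return p0  # neutral limit: fixation probability equals initial frequency
    # u(p0) = (1 - exp(-4 Ne s p0)) / (1 - exp(-4 Ne s)), written with expm1
    return math.expm1(-4.0 * ne * s * p0) / math.expm1(-4.0 * ne * s)


if __name__ == "__main__":
    for s in (-0.001, 0.0, 0.001):
        print(s, fixation_probability(1000, s))
```

Under these assumptions, the deleterious mutation in the example above (Ne = 1000, s = -0.001) fixes less often than a neutral one, whose fixation probability is 1/(2N), and the gap widens rapidly as the magnitude of Ne times s grows.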

Using the crude tools then available (restriction mapping of cloned genomic regions and in situ hybridization of labeled TE probes to polytene chromosomes), we found that nearly all TEs are indeed present at low frequencies in the population (Charlesworth and Langley 1989). Most of the exceptions to this rule were found in genomic regions in which little crossing over occurs (Maside et al. 2005). This is consistent with Chuck’s proposal that a major contributor to the removal of TEs from the population is selection against aneuploid progeny created by crossing over among homologous TEs at different locations in the genome (Langley et al. 1988). It is now a familiar finding that nonrecombining genomes or genomic regions tend to be full of TEs and other kinds of repetitive sequences; the population genetic reasons for this, discussed by Charlesworth et al. (1994), are perhaps not so familiar.

Modern genomic methods provide much more powerful means for identifying TE insertions. Recent population surveys using these methods have confirmed the older findings: most TEs in Drosophila are present at low frequencies, and there is statistical evidence for selection against insertions (Barron et al. 2014). This is consistent with the existence of elaborate molecular mechanisms for repressing TE activity, such as the Piwi-interacting RNA (piRNA) pathway of animals (Senti and Brennecke 2010); there would be no reason to evolve such mechanisms if TEs were harmless. In a few cases, TEs have swept to high frequencies or fixation, and there is convincing evidence that at least some of these events are associated with increased fitness caused by the TE insertions themselves (Barron et al. 2014). These cases do not contradict the intragenomic parasite hypothesis for the maintenance of TEs; favorable mutations induced by TEs are too rare to outweigh the elimination of deleterious insertions unless new insertions continually replace those that are lost.

From the theory of aging, to the degeneration of Y chromosomes, to the dynamics of transposable elements, our understanding of the genetic basis of evolution is deeper and richer as a result of Charlesworth’s many contributions to the field. —Charles Langley, University of California, Davis

My other example is a population genetics discovery about a fundamental biological process: the PRDM9 protein involved in establishing recombination hot spots in humans. This was enabled by the revolution in population genetics brought about by coalescence theory (Hudson 1990), which is a powerful tool for looking at the statistical properties of a sample from a population under the hypothesis of selective neutrality. The basic idea is simple: if we sample two homologous, nonrecombining haploid genomes (e.g., mitochondrial DNA) from a large population, there is a probability of 1/(2Ne) that they are derived from the same parental genome in the preceding generation, i.e., they coalesce (Ne is the effective population size for the genome region in question). If they fail to coalesce in that generation, there is a probability of 1/(2Ne) that they coalesce one generation further back, and so on. If n genomes are sampled, there is a bifurcating tree connecting them back to their common ancestor. The size and shape of this tree are highly random, so genetically independent components of the genome experience different trees, even if they share the same Ne. The properties of sequence variability in the sample can be modeled by throwing mutations at random onto the tree (Hudson 1990).
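The pairwise coalescence process described above can be simulated directly. This minimal sketch assumes exactly the per-generation coalescence probability of 1/(2Ne) given in the text and nothing else (no recombination, no selection, constant Ne). The function names and parameter values are hypothetical.

```python
import random


def pairwise_coalescence_time(ne, rng):
    """Generations back until two sampled genomes share a parental genome."""
    p = 1.0 / (2.0 * ne)  # chance of coalescing in any one generation
    generations = 1
    while rng.random() >= p:  # no coalescence yet: go one generation further back
        generations += 1
    return generations


def mean_coalescence_time(ne, replicates=10000, seed=1):
    """Average coalescence time over independent replicates."""
    rng = random.Random(seed)
    total = sum(pairwise_coalescence_time(ne, rng) for _ in range(replicates))
    return total / replicates


if __name__ == "__main__":
    print(mean_coalescence_time(50))
```

Because the waiting time is geometric, its mean is 2Ne generations, while individual replicates scatter widely around that value. That scatter is the randomness in tree size that the essay refers to.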

Recombination causes different sites in the genome to experience different trees, but closely linked sites have much more similar trees than independent sites. At the level of sequence variability, close linkage results in nonrandom associations between neutral variants, known as linkage disequilibrium (LD). The extent of LD among neutral variants at different sites is determined by the product of Ne and the frequency of recombination between them, c (Ohta and Kimura 1971; McVean 2002). Richard Hudson proposed a statistical method for estimating Nec from data on variants at multiple sites across the genome (Hudson 2001) that was implemented in a widely used computer program, LDhat, by Gil McVean and colleagues (McVean et al. 2002). Applications to large data sets on human sequence variability showed that the genome is full of recombination hot spots and cold spots, consistent with previous molecular genetic studies of specific loci (Myers et al. 2005). Most recombination occurs in hot spots and very little in between them, accounting for the fact that there is almost complete LD over tens or even hundreds of kilobases in humans. The identification of a large number of hot spots led to the discovery of a sequence motif bound by a zinc finger protein, PRDM9, at about the same time that mouse geneticists also discovered that PRDM9 promotes recombination (McVean and Myers 2010; Baudat et al. 2014). These discoveries have led to many interesting observations, such as associations between PRDM9 variants in humans and individual variation in recombination rates, generating an ongoing research program of great scientific interest (Baudat et al. 2014).
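How LD depends on the product of Ne and c can be illustrated with a simple textbook approximation, E[r²] ≈ 1/(1 + 4Ne·c), usually attributed to Sved (1971). This is a stand-in chosen for brevity, not the formula from the works cited above and not what LDhat computes. The function name and the example values are hypothetical.

```python
def expected_r2(ne, c):
    """Approximate expected squared correlation between two neutral sites.

    ne: effective population size
    c:  recombination frequency between the sites per generation
    Uses E[r^2] ~ 1 / (1 + 4 * Ne * c); ignores sample-size and mutation effects.
    """
    if ne <= 0 or c < 0:
        raise ValueError("ne must be positive and c non-negative")
    return 1.0 / (1.0 + 4.0 * ne * c)


if __name__ == "__main__":
    # Hypothetical values: Ne = 10,000 and a range of recombination frequencies.
    for c in (0.0, 1e-6, 1e-5, 1e-4, 1e-3):
        print(c, round(expected_r2(10000, c), 4))
```

In this approximation, LD stays high only while 4Ne·c is small, so long stretches of near-complete LD imply very little recombination between hot spots.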

With the ever-increasing use of genomic data, I am confident that many more such fruitful interactions between molecular and population genetics will take place. A take-home message is that more needs to be done to integrate training in population, molecular, and computational approaches to provide the next generation of researchers with the broad range of knowledge they will need.

## Genetic Evolution of Species | Cell Biology

The concept of ‘organic evolution’ envisages that all the living forms of today developed from a common ancestor. That is, the various life forms are related by descent, which accounts for the similarities among them. The idea of organic evolution was not widely accepted until 1859 when Darwin published his classic work ‘The Origin of Species’.

This work contained a large body of evidence in favour of the idea that evolution is continuous, and it provided an attractive hypothesis to explain the mode of evolution.

Subsequently, various concepts regarding the mechanism of evolution were developed by Haldane, Fisher, Wright and several others. Information from diverse areas of study, such as geology, palaeontology, taxonomy, population genetics, biochemistry, molecular genetics and others, has been collated and resynthesized to understand evolution.

#### Present Status of Genetic Evolution of Species:

The modality of evolution of species in the plant kingdom involves a combination of processes and phenomena in nature. The processes cover all the changes inherent in the concepts of Darwin, de Vries, and, more recently, Stebbins.

The basic materials bringing about changes in the individuals of a population are the genes and their alterations. In fact, random gene changes provide the basic raw material for the evolutionary process.

Such changes may be major or minor, involving alterations in the structure and number of genes as well as of chromosomes and chromosome segments. In short, genic and chromosomal alterations occurring at random in the individuals of a population provide the basic materials for evolution.

The next step in the evolutionary process at the population level is the recombination of genes between different individuals. Random hybridization between different individuals containing different genetic changes leads to the origin of new individuals with newer gene combinations. At this step, the population may represent a heterogeneous mass of individuals containing different gene combinations.

The next step in evolution is the operation of natural selection in the struggle for existence among the heterogeneous recombinants, for optimum utilization of the resources in their specific environments. Ultimately, through natural selection, certain individuals with altered gene complements occupy the environmental niche with the gradual exclusion of others.

Through crossbreeding amongst themselves, such a population ultimately becomes stable with specific altered gene combinations and becomes a stable genotype.

The stable population characterized by a particular gene combination, stands apart from the parental species to which the population initially belonged. Such a stabilized population, characterizing a genotype differing in phenotype from its predecessors, is often considered as attaining an incipient species level.

Such an incipient species can even undergo intercrossing with individuals of the parental population and may lose identity.

Allopatric Speciation:

As such, the attainment of species status from the level of incipient species would require a compatibility barrier between the new and the old populations. Without this barrier, despite phenotypic differences, the identity of the new population cannot be maintained.

In the absence of a barrier, there is every possibility of its merger with the parental species through breeding, leading to the origin of a series of graded phenotypes. The barrier to compatibility, essential for attaining species status, can be achieved through different means.

The method that leads to a compatibility barrier without involving any genic changes is migration. The migration of the new population to a new environment, far removed from the original, leads to geographical isolation. Such geographical isolation enables a population to develop its own phenotypic characteristics adapted to the changed environment.

Such species are also termed allopatric species.

Sympatric Speciation:

The common method, other than migration and consequent geographical isolation, is genic change or mutation leading to a barrier to fertilization.

Such a barrier to fertilization between species occupying the same geographical area, otherwise termed sympatric species, can be achieved through seasonal isolation, i.e., blooming in different seasons caused by genic changes in the individual.

The barrier need not be seasonal: it may be present even between two species maintaining their individuality, occupying the same habitat and blooming in the same season. The compatibility barrier between the two species, original and derived, can also be due to incompatibility of the germinal line, the pollen and ovule.

Such genic sterility may be manifested either in the absence of fertilization or as a barrier to post-fertilization embryonic development. Such a sterility barrier at the genic level is the principal factor in the stabilization, and as such the evolution, of species.

## Section Summary

Both genetic and environmental factors can cause phenotypic variation in a population. Different alleles can confer different phenotypes, and different environments can also cause individuals to look or act differently. Only those differences encoded in an individual’s genes, however, can be passed to its offspring and, thus, be a target of natural selection. Natural selection works by selecting for alleles that confer beneficial traits or behaviors, while selecting against those that confer deleterious qualities. Genetic drift stems from the chance occurrence that some individuals have more offspring than others. When individuals leave or join the population, allele frequencies can change as a result of gene flow. Mutations to an individual’s DNA may introduce new variation into a population. Allele frequencies can also be altered when individuals do not randomly mate with others in the group.
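Genetic drift, as summarized above, can be illustrated with a minimal Wright-Fisher simulation. The sketch assumes a closed diploid population of constant size with one neutral, biallelic locus and no mutation, selection or gene flow. The function name and parameter values are hypothetical.

```python
import random


def wright_fisher(p0, n_individuals, generations, seed=0):
    """Allele-frequency trajectory under pure drift.

    p0:            starting frequency of the focal allele
    n_individuals: number of diploid individuals (2N gene copies per generation)
    Returns a list of frequencies of length generations + 1.
    """
    rng = random.Random(seed)
    copies = 2 * n_individuals
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Each gene copy in the next generation is drawn at random from the
        # current gene pool, so the count of the focal allele is binomial.
        count = sum(1 for _ in range(copies) if rng.random() < p)
        p = count / copies
        trajectory.append(p)
    return trajectory


if __name__ == "__main__":
    print(wright_fisher(0.5, 20, 30, seed=3))
```

Running it with small and large `n_individuals` shows the usual pattern: frequencies wander far more, and alleles are lost or fixed sooner, in small populations.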

## Environmental Variance

Genes are not the only players involved in determining population variation. Other factors, such as the environment (Figure), also influence phenotypes. A beachgoer is likely to have darker skin than a city dweller, for example, due to regular exposure to the sun, an environmental factor. For some species, the environment determines some major characteristics, such as sex. For example, some turtles and other reptiles have temperature-dependent sex determination (TSD). TSD means that individuals develop into males if their eggs are incubated within a certain temperature range, or females at a different temperature range.

The temperature at which the eggs are incubated determines the sex of the American alligator (Alligator mississippiensis). Eggs incubated at 30°C produce females, and eggs incubated at 33°C produce males. (credit: Steve Hillebrand, USFWS)

If there is gene flow between the populations, the individuals will likely show gradual differences in phenotype along the cline. Restricted gene flow, on the other hand, can lead to abrupt differences, or even speciation.