# H is the set of all possible Diplotypes that are consistent with genotype data


I am new to Biology.SX. I have a statistics background, and have almost zero knowledge in genetics. I am trying to understand some things related to genetics in a certain paper of biostatistics.

1. Haplotype data were simulated using the haplotype patterns and frequencies (shown below) for 5 SNPs along a diabetes susceptibility region on chromosome 22, reported in the FUSION study.

I know that haplotypes can be represented as binary sequences. I wonder why not all $$2^5$$ possible haplotypes are present here.

The paper also says that

1. Let $$G=(g_1,\ldots,g_M)$$ denote the unphased genotype data for the $$M$$ loci. $$\mathcal{H}_G$$ denote the set of all possible diplotypes that are consistent with the genotype data $$G$$.

If a subject carries at least one copy of the causal haplotype '01100', it belongs to the dominant genetic model; if it carries two copies of this haplotype, it belongs to the recessive model.

The set $$\mathcal{H}_G$$ is not clear to me. To compute the likelihood, I need to know $$\mathcal{H}_G$$. Any help or suggestions?

Disclaimer: Like the OP, I know very little about genetics and I suppose other people in the site can give better answers than me. Anyway, since the question has been unanswered for months I'm posting my answer. Hopefully someone will improve it.

Just beware that "possible" in this context doesn't mean all the haplotypes that we can imagine. As the paper says, "$$\mathcal{H}_G$$ denote the set of all possible diplotypes that are consistent with the genotype data $$G$$", and that is an additional restriction that allows only some diplotypes: those built from the seven haplotypes listed, not from the 32 we can imagine.

Furthermore, note that the frequencies in the table sum to 1, so there can be no other haplotypes in the set.

However, it would help if you linked the source where you got those data and quotes.
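To make the definition concrete, here is a minimal sketch of how $$\mathcal{H}_G$$ could be enumerated. It assumes that each genotype is coded as the number of '1' alleles per locus (0, 1 or 2), that haplotypes are binary strings, and that Hardy-Weinberg equilibrium holds when computing the likelihood. The function names and the toy frequencies are made up for illustration; they are not the FUSION values.

```python
from itertools import combinations_with_replacement, product

def consistent_diplotypes(genotype, haplotypes):
    """Unordered haplotype pairs whose per-locus allele counts equal `genotype`."""
    pairs = []
    for h1, h2 in combinations_with_replacement(sorted(haplotypes), 2):
        if all(int(a) + int(b) == g for a, b, g in zip(h1, h2, genotype)):
            pairs.append((h1, h2))
    return pairs

def genotype_likelihood(genotype, freqs):
    """P(G) = sum over H_G of p(h1) * p(h2), doubled when h1 != h2 (HWE assumed)."""
    total = 0.0
    for h1, h2 in consistent_diplotypes(genotype, freqs):
        total += freqs[h1] * freqs[h2] * (1 if h1 == h2 else 2)
    return total

# With all 2^5 haplotypes allowed, a genotype heterozygous at every locus
# is consistent with 2^(5-1) = 16 diplotypes.
all_haps = ["".join(bits) for bits in product("01", repeat=5)]
G = (1, 1, 1, 1, 1)
print(len(consistent_diplotypes(G, all_haps)))

# Restricting to a (hypothetical) short list of haplotypes shrinks H_G.
toy_freqs = {"00000": 0.4, "01100": 0.3, "10011": 0.2, "11111": 0.1}
print(consistent_diplotypes(G, toy_freqs))
print(genotype_likelihood(G, toy_freqs))
```

In this toy example only two diplotypes remain once the haplotype list is restricted, which is the sense in which the table of observed haplotypes limits $$\mathcal{H}_G$$.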

## Genotype imputation using the Positional Burrows Wheeler Transform

Genotype imputation is the process of predicting unobserved genotypes in a sample of individuals using a reference panel of haplotypes. In the last 10 years reference panels have increased in size by more than 100 fold. Increasing reference panel size improves accuracy of markers with low minor allele frequencies but poses ever increasing computational challenges for imputation methods. Here we present IMPUTE5, a genotype imputation method that can scale to reference panels with millions of samples. This method continues to refine the observation made in the IMPUTE2 method, that accuracy is optimized via use of a custom subset of haplotypes when imputing each individual. It achieves fast, accurate, and memory-efficient imputation by selecting haplotypes using the Positional Burrows Wheeler Transform (PBWT). By using the PBWT data structure at genotyped markers, IMPUTE5 identifies locally best matching haplotypes and long identical by state segments. The method then uses the selected haplotypes as conditioning states within the IMPUTE model. Using the HRC reference panel, which has ∼65,000 haplotypes, we show that IMPUTE5 is up to 30x faster than MINIMAC4 and up to 3x faster than BEAGLE5.1, and uses less memory than both these methods. Using simulated reference panels we show that IMPUTE5 scales sub-linearly with reference panel size. For example, keeping the number of imputed markers constant, increasing the reference panel size from 10,000 to 1 million haplotypes requires less than twice the computation time. As the reference panel increases in size IMPUTE5 is able to utilize a smaller number of reference haplotypes, thus reducing computational cost.
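As a rough illustration of the data structure named above, the sketch below builds the positional prefix ordering at the heart of the PBWT: at each site, haplotypes are kept sorted by their reversed prefixes, so haplotypes sharing a long match ending at that site sit next to each other. This is only the basic ordering step, written under the assumption of biallelic haplotypes given as '0'/'1' strings; it is not IMPUTE5's haplotype selection procedure, and the function name is hypothetical.

```python
def prefix_orders(haplotypes):
    """Return the haplotype ordering before each site and after the last one.

    orders[k] lists haplotype indices sorted by their reversed prefix
    haplotypes[i][:k]; ties keep the original index order.
    """
    order = list(range(len(haplotypes)))
    orders = []
    for k in range(len(haplotypes[0])):
        orders.append(order)
        # Stable partition by the allele at site k gives the next ordering.
        zeros = [i for i in order if haplotypes[i][k] == "0"]
        ones = [i for i in order if haplotypes[i][k] == "1"]
        order = zeros + ones
    orders.append(order)
    return orders

panel = ["0101", "1101", "0100", "1100"]
print(prefix_orders(panel)[-1])
```

Because each update is a stable partition, the whole construction is linear in the number of haplotypes times the number of sites.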

## Background

Reference-based genetic variant identification comprises two related processes: genotyping and phasing. Genotyping is the process of determining which genetic variants are present in an individual’s genome. A genotype at a given site describes whether both chromosomal copies carry a variant allele, whether only one of them carries it, or whether the variant allele is not present at all. Phasing refers to determining an individual’s haplotypes, which consist of variants that lie near each other on the same chromosome and are inherited together. To completely describe all of the genetic variation in an organism, both genotyping and phasing are needed. Together, the two processes are called diplotyping.

Many existing variant analysis pipelines are designed for short DNA sequencing reads [1, 2]. Though short reads are very accurate at a per-base level, they can suffer from being difficult to unambiguously align to the genome, especially in repetitive or duplicated regions [3]. The result is that millions of bases of the reference human genome are not currently reliably genotyped by short reads, primarily in multi-megabase gaps near the centromeres and short arms of chromosomes [4]. While short reads are unable to uniquely map to these regions, long reads can potentially span into or even across them. Long reads have already proven useful for read-based haplotyping, large structural variant detection, and de novo assembly [5–8]. Here, we demonstrate the utility of long reads for more comprehensive genotyping. Due to the historically greater relative cost and higher sequencing error rates of these technologies, little attention has been given thus far to this problem. However, long-read DNA sequencing technologies are rapidly falling in price and increasing in general availability. Such technologies include single-molecule real-time (SMRT) sequencing by Pacific Biosciences (PacBio) and nanopore sequencing by Oxford Nanopore Technologies (ONT), both of which we assess here.

The genotyping problem is related to the task of inferring haplotypes from long-read sequencing data, on which a rich literature and many tools exist [8–14], including our own software WhatsHap [15, 16]. The most common formalization of haplotype reconstruction is the minimum error correction (MEC) problem. The MEC problem seeks to partition the reads by haplotype such that a minimum number of errors need to be corrected in order to make the reads from the same haplotype consistent with each other. In principle, this problem formulation could serve to infer genotypes, but in practice, the “all heterozygous” assumption is made: tools for haplotype reconstruction generally assume that a set of heterozygous positions is given as input and exclusively work on these sites.
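The MEC objective can be sketched in a few lines. The example below assumes biallelic sites, reads represented as dictionaries from position to allele (0 or 1), and a brute-force search over all bipartitions, which is only feasible for a handful of reads. It illustrates the objective and is not the algorithm used by WhatsHap or any other tool; all names are hypothetical.

```python
from collections import Counter
from itertools import product

def mec_cost(reads, assignment):
    """Number of allele corrections needed so reads in each group agree.

    reads: list of {position: allele} dicts; assignment: 0/1 label per read.
    """
    cost = 0
    for group in (0, 1):
        counts = {}
        for read, label in zip(reads, assignment):
            if label != group:
                continue
            for pos, allele in read.items():
                counts.setdefault(pos, Counter())[allele] += 1
        # At each position, flip the minority allele to match the majority.
        cost += sum(min(c[0], c[1]) for c in counts.values())
    return cost

def solve_mec_bruteforce(reads):
    """Try every bipartition (exponential in the number of reads)."""
    return min((mec_cost(reads, a), a) for a in product((0, 1), repeat=len(reads)))

reads = [
    {0: 0, 1: 0},
    {0: 0, 1: 0, 2: 1},
    {1: 1, 2: 0},
    {0: 1, 1: 1, 2: 0},
    {0: 0, 1: 1},          # conflicts with both haplotypes at one site
]
print(solve_mec_bruteforce(reads))
```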

Motivation and overview of diplotyping. a Gray sequences illustrate the haplotypes; the reads are shown in red and blue. The red reads originate from the upper haplotype, the blue ones from the lower. Genotyping each SNV individually would lead to the conclusion that all of them are heterozygous. Using the haplotype context reveals uncertainty about the genotype of the second SNV. b Clockwise starting top left: first, sequencing reads aligned to a reference genome are given as input; second, the read alignments are used to nominate candidate variants (red vertical bars), which are characterized by the differences to the reference genome; third, a hidden Markov model (HMM) is constructed where each candidate variant gives rise to one “row” of states, representing possible ways of assigning each read to one of the two haplotypes as well as possible genotypes (see the “Methods” section for details); fourth, the HMM is used to perform diplotyping, i.e., we infer genotypes of each candidate variant as well as how the alleles are assigned to haplotypes.

### Contributions

In this paper, we show that for contemporary long read technologies, read-based phase inference can be simultaneously combined with the genotyping process for SNVs to produce accurate diplotypes and to detect variants in regions not mappable by short reads. We show that key to this inference is the detection of linkage relationships between heterozygous sites within the reads. To do this, we describe a novel algorithm to accurately predict diplotypes from noisy long reads that scales to deeply sequenced human genomes.

We then apply this algorithm to diplotype one individual from the 1000 Genomes Project, NA12878, using long reads from both PacBio and ONT. NA12878 has been extensively sequenced and studied, and the Genome in a Bottle Consortium has published sets of high confidence regions and a corresponding set of highly confident variant calls inside these genomic regions [20]. We demonstrate that our method is accurate, that it can be used to confirm variants in regions of uncertainty, and that it allows for the discovery of variants in regions which are unmappable using short DNA read sequencing technologies.

## H is the set of all possible Diplotypes that are consistent with genotype data - Biology

Copy of snphap-1.3.1 source, written by David Clayton, put here because existing website is closing.

I am not a maintainer, but a user who wants to keep a copy where I can install from in future. What follows is David’s documentation.

A program for estimating frequencies of large haplotypes of SNPs

Department of Medical Genetics, Cambridge Institute for Medical Research, Wellcome Trust/MRC Building, Addenbrooke’s Hospital, Cambridge, CB2 2XY

This program has now had a fair bit of use by our group and others. However it comes with no guarantees. I’d like to know of any difficulties/bugs that people experience and will try to deal with them, but this may take some time.

This program implements a fairly standard method for estimating haplotype frequencies using data from unrelated individuals. It uses an EM algorithm to calculate ML estimates of haplotype frequencies given genotype measurements which do not specify phase. The algorithm also allows for some genotype measurements to be missing (due, for example, to PCR failure). It also allows multiple imputation of individual haplotypes.

The well-known algorithm first expands the data for each subject into the complete set of (fully phased) pairs of haplotypes consistent with the observed data (the “haplotype instances”). The EM algorithm then proceeds as follows:

• E step: Given current estimates of haplotype probabilities and assuming Hardy-Weinberg equilibrium, calculate the probability of each phased and complete genotype assignment for each subject. Scale these so that they sum to 1.0 within each subject. These are then the POSTERIOR PROBABILITIES of the genotype assignments.
• M step: Calculate the next set of estimates of haplotype probabilities by summing posterior probabilities over instances of each distinct haplotype. After scaling to sum to 1.0, these provide the new estimates of the PRIOR PROBABILITIES.
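The two steps above can be sketched as follows. This is a minimal illustration in Python, not the snphap C implementation: it assumes biallelic loci coded as the count of allele '1' (0, 1 or 2), no missing data, a uniform starting point, and full expansion of every subject (so it does not scale to many loci). All function names are hypothetical.

```python
from itertools import product

def expand(genotype):
    """All ordered, fully phased haplotype pairs consistent with a genotype."""
    options = {0: [("0", "0")], 1: [("0", "1"), ("1", "0")], 2: [("1", "1")]}
    pairs = []
    for choice in product(*(options[g] for g in genotype)):
        h1 = "".join(a for a, _ in choice)
        h2 = "".join(b for _, b in choice)
        pairs.append((h1, h2))
    return pairs

def em_haplotype_freqs(genotypes, n_iter=50):
    instances = [expand(g) for g in genotypes]
    haps = sorted({h for inst in instances for pair in inst for h in pair})
    freqs = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = dict.fromkeys(haps, 0.0)
        for inst in instances:
            # E step: posterior of each instance under HWE, scaled to sum to 1.
            weights = [freqs[h1] * freqs[h2] for h1, h2 in inst]
            total = sum(weights)
            for (h1, h2), w in zip(inst, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M step: expected haplotype counts, scaled to sum to 1.
        n_chrom = 2 * len(genotypes)
        freqs = {h: c / n_chrom for h, c in counts.items()}
    return freqs

# Four 00/00 subjects, four 11/11 subjects and two double heterozygotes,
# whose phase (00/11 or 01/10) is ambiguous.
data = [(0, 0)] * 4 + [(2, 2)] * 4 + [(1, 1)] * 2
print(em_haplotype_freqs(data))
```

In this toy data set the unambiguous subjects pull the double heterozygotes towards the 00/11 phasing, so the estimated frequencies of 01 and 10 shrink towards zero.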

This algorithm is widely used and not novel. However, for a large number of loci, the number of possible haplotype instances rapidly becomes impossibly large, even though the eventual solution may only give appreciable support to a rather limited set of haplotypes. This program avoids this difficulty by starting by fitting 2-locus haplotypes and extending the solution by one locus at a time. As each new locus is added, the number of haplotype instances to be considered is first expanded, by considering all possible larger haplotypes. Then, after applying the EM algorithm to estimate the “prior” haplotype probabilities and the posterior probabilities of haplotype instances, the haplotype instances are culled in two ways:

• Posterior trimming: Any genotype assignment whose posterior probability falls below a given threshold is deleted and the posterior probabilities of assignments of genotype to the subject are recomputed.
• Prior trimming: All instances of any haplotype whose prior probability falls below a threshold are removed. This option is not used by default since it can lead to difficulties in comparing likelihoods (see below).

We add one locus at a time until completion.

The process of culling haplotype assignments at early stages can lead to solutions which are not optimal. For example, haplotype 1.1.1 may have zero estimated frequency in the maximum likelihood analysis of the three-locus haplotype, while 1.1.1.x may have non-zero estimated frequency in the ML solution to the four-locus problem. It is not clear how often this will be a problem. A partial solution is to try including loci in different orders, seeing if the solution obtained varies. A further protection is not to cull haplotypes after inclusion of every locus, but only every k loci, although there will be a penalty both in computer time and use of memory incurred by choice of large values of k. Note that you will only get the benefit of choosing k>1 if each EM algorithm is started from a random starting point (see -rs option).

Sampling the (Bayesian) posterior distribution of individual haplotype data is conveniently carried out using a Gibbs sampler. This mimics the EM algorithm but uses stochastic steps rather than deterministic ones. It has been termed the IP (Imputation/Posterior sampling) algorithm. In our case the algorithm works as follows:

I-step (replaces the E-step): For each subject, pick a haplotype assignment from the possible instances, with probability given by the current posterior, or “full conditional” distribution.

P-step (replaces the M-step): Sample the haplotype population frequencies from their full conditional Bayesian posterior. If the prior is a Dirichlet distribution with constant df on all possible haplotypes*, the full conditional posterior distribution is also Dirichlet. To obtain the set of df parameters for this posterior Dirichlet, we simply add the constant prior df to the number of chromosomes currently assigned to each haplotype.

(* i.e. if the unknown population haplotype relative frequencies are denoted p_1, p_2, …, p_i, …, p_n, then their prior density is assumed to be proportional to

$$\prod_{i=1}^{n} p_i^{d-1}$$

where d is the prior degree of freedom parameter)

There can be difficulties in sampling the entire space using this algorithm if the prior Dirichlet df is taken as zero: if a haplotype is not assigned to any individual at one step, then the full conditional posterior is improper and the haplotype will be given zero probability at the next step. Thereafter it can never be sampled again. Also, when there are multiple maxima in the likelihood, the algorithm may become “stuck” under one peak. To avoid these difficulties, provision is made to start the prior df parameter at a relatively large value, thereby giving all haplotypes an appreciable probability of being sampled. Thereafter the prior df parameter is reduced at each step. This algorithm is repeated for a fixed number of steps to obtain a single imputation. The prior df parameter is then set back up to the high value, the population haplotype frequencies restored to their MLE’s, and the process repeated to obtain the next imputation. And so on.
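A single imputation along these lines might be sketched as below. This is an illustrative Python version, not the snphap code: it assumes that each subject's haplotype instances are given as ordered (h1, h2) pairs, draws the Dirichlet sample by normalising Gamma variates, and uses a made-up schedule for the decreasing prior df. The function names and the schedule are hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalising independent Gamma(alpha, 1) variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def ip_step(instances, freqs, prior_df, rng):
    """One I-step followed by one P-step; returns new haplotype frequencies."""
    counts = dict.fromkeys(freqs, 0)
    for inst in instances:
        # I-step: pick one instance per subject from its full conditional.
        weights = [freqs[h1] * freqs[h2] for h1, h2 in inst]
        h1, h2 = rng.choices(inst, weights=weights, k=1)[0]
        counts[h1] += 1
        counts[h2] += 1
    # P-step: posterior Dirichlet df = prior df + chromosomes assigned.
    haps = list(freqs)
    sampled = sample_dirichlet([prior_df + counts[h] for h in haps], rng)
    return dict(zip(haps, sampled))

rng = random.Random(1)
het = [("00", "11"), ("11", "00"), ("01", "10"), ("10", "01")]
instances = [[("00", "00")]] * 3 + [[("11", "11")]] * 3 + [het] * 2
freqs = {h: 0.25 for h in ("00", "01", "10", "11")}
for prior_df in (2.0, 1.0, 0.5, 0.1):  # hypothetical decreasing schedule
    freqs = ip_step(instances, freqs, prior_df, rng)
print(freqs)
```

Keeping the prior df strictly positive, as here, avoids the improper posterior described above, since every Gamma shape parameter stays above zero.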

Warning: Although multiple imputation using the IP algorithm is an established technique (see Schafer J.L. “Analysis of Incomplete Multivariate Data” Chapman and Hall: London, 1997), it remains to be rigorously validated in this application.

It is well known that the likelihood surface for this problem may have multiple maxima and that the EM algorithm will only converge to a local maximum. After all loci have been added and a final trimmed list of haplotype instances has been computed, the EM algorithm may be repeated multiple times from random starting points in order to search for the global maximum. The random starting points may be chosen in one of two ways: (a) from randomly chosen values for the prior haplotype probabilities, or (b) from randomly chosen posterior probabilities for each haplotype assignment. Random starting points can also be chosen in the first set of EM iterations and, in this case, method (b) is used.

The program is invoked from the command line by

The input file should contain the data in subject order, with a subject identifier followed by pairs of alleles of each locus. The subject identifier need not be numeric, but must not include “white space” (blanks or tabs). The alleles should either be coded 1, 2 (numeric coding), or A,C,G or T (“nucleotide” coding). Missing data is indicated by 0 in numeric coding and, for nucleotide coding, by any character not hitherto mentioned. Data fields should be separated by any “white space” (any number of blanks, tabs or new-line characters).
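A reader for this layout could look roughly like the sketch below. It makes two simplifying assumptions that go beyond the description above: each subject occupies exactly one line (the program itself treats new-lines as ordinary white space), and numeric and nucleotide codes are accepted together rather than as separate modes. The function name is hypothetical.

```python
def parse_subject_line(line):
    """Split one record into (subject_id, [(allele1, allele2), ...]).

    Missing alleles (0, or any unrecognised character) are returned as None.
    """
    fields = line.split()  # any run of blanks or tabs separates fields
    subject, alleles = fields[0], fields[1:]
    if len(alleles) % 2:
        raise ValueError("expected a pair of alleles for each locus")
    valid = {"1", "2", "A", "C", "G", "T"}
    coded = [a if a in valid else None for a in alleles]
    return subject, list(zip(coded[0::2], coded[1::2]))

print(parse_subject_line("subj1  1 2\t0 0  2 2"))
print(parse_subject_line("subj2 A G N N T T"))
```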

By default, loci are added in the same order that they appear in the input file but, optionally, they may be added in

1. Reverse order
2. Random order
3. In decreasing order of completeness of data
4. In decreasing order of minor allele frequency (MAF)

I have little experience yet of the effect of changing the order of inclusion. The idea of (3) is to stop too much proliferation of possible haplotypes early on in the process, when there is little data on which to reliably pick rare haplotypes to cull. The idea of (4) is to concentrate first on older haplotypes.

The log likelihood output from this program should be used with some caution, particularly when prior trimming has been applied, since likelihoods which do not consider the same subsets of possible haplotypes may not be comparable.

The optional flags allow one to set the following parameters:

Multiple imputation options:

If the command is issued without options or arguments, a brief description of available options is written to the screen.

1. Iteration progress reports (written to the screen). Note that some terminal emulators which provide “scrolling” may seriously slow down operation of the program. In this case you should either use a standard non-scrolling xterm, or invoke the -q option which suppresses this output.
2. A file listing the haplotypes found, and their probabilities (output-file-1). The list is in descending order of probability and a cumulative probability is also listed. The cumulative probability is suppressed if the -ss option is in force.
3. A file listing assignments of haplotypes to subjects (output-file-2). This file contains all assignments whose posterior probability exceeds a multiple of that of the most probable assignment (see -th option).
4. A file (named “snphap-warnings”) which contains any warning messages.

Output files output-file-1 and output-file-2 are in a compressed and easily readable format. Alternatively they can be saved as tab-delimited text files suitable for reading into a spreadsheet program, or a statistical program such as “Stata”. Both file names are optional and a missing argument can be indicated with a single “.” (period or full-stop) character. But since it must be assumed that you want SOME output, omission of both file names causes the program to default to “snphap.out” for output-file-1.

In multiple imputation mode, an additional series of files is created. Each imputation causes a fresh file (or pair of files) to be written. The file names are as specified on the command line, but the strings .001, .002, .003 … etc. are appended.

A primitive Makefile is supplied. This uses the gcc compiler and will need to be edited if a different C compiler is to be used. You may also need to edit the CMP_FLAGS and LD_FLAGS options (which provide flags used by the compiler at compile and load stages respectively)

For Microsoft Windows users, I suggest use of the “Cygwin” Unix emulation package. See

I found that setting LD_FLAGS to -lm worked for me on both Linux and Solaris (this is the default setting), but on Cygwin I had to omit this flag.

The default uniform random number generator (UNIFORM_RANDOM) is set to be the standard 48-bit function `drand48`, and the corresponding seeding function (RANDOM_SEED) is `srand48`. However, for systems which do not support the 48-bit functions (this includes Cygwin), the 32-bit versions can be chosen:

`drand()` is defined as a macro evaluating to (0.5+rand())/(1+RAND_MAX).

A short test data file is also included. This contains typings of 100 subjects for 51 SNPs in a small region. To test the program:

Alternatively, if you wish to incorporate locus names in the output,

Thanks to Newton Morton and Nikolas Maniatis for their helpful comments and suggestions on an earlier version. Thanks also to anyone who has pointed out bugs in earlier versions.

## HaplotypeCaller

The HaplotypeCaller is capable of calling SNPs and indels simultaneously via local de-novo assembly of haplotypes in an active region. In other words, whenever the program encounters a region showing signs of variation, it discards the existing mapping information and completely reassembles the reads in that region. This allows the HaplotypeCaller to be more accurate when calling regions that are traditionally difficult to call, for example when they contain different types of variants close to each other. It also makes the HaplotypeCaller much better at calling indels than position-based callers like UnifiedGenotyper.

In the GVCF workflow used for scalable variant calling in DNA sequence data, HaplotypeCaller runs per-sample to generate an intermediate GVCF (not to be used in final analysis), which can then be used in GenotypeGVCFs for joint genotyping of multiple samples in a very efficient way. The GVCF workflow enables rapid incremental processing of samples as they roll off the sequencer, as well as scaling to very large cohort sizes (e.g. the 92K exomes of ExAC).

In addition, HaplotypeCaller is able to handle non-diploid organisms as well as pooled experiment data. Note however that the algorithm used to calculate variant likelihoods is not well suited to extreme allele frequencies (relative to ploidy) so its use is not recommended for somatic (cancer) variant discovery. For that purpose, use Mutect2 instead.

Finally, HaplotypeCaller is also able to correctly handle the splice junctions that make RNAseq a challenge for most variant callers, on the condition that the input read data has previously been processed according to our recommendations as documented here.

### How HaplotypeCaller works

#### 1. Define active regions

The program determines which regions of the genome it needs to operate on (active regions), based on the presence of evidence for variation.

#### 2. Determine haplotypes by assembly of the active region

For each active region, the program builds a De Bruijn-like graph to reassemble the active region and identifies what are the possible haplotypes present in the data. The program then realigns each haplotype against the reference haplotype using the Smith-Waterman algorithm in order to identify potentially variant sites.

#### 3. Determine likelihoods of the haplotypes given the read data

For each active region, the program performs a pairwise alignment of each read against each haplotype using the PairHMM algorithm. This produces a matrix of likelihoods of haplotypes given the read data. These likelihoods are then marginalized to obtain the likelihoods of alleles for each potentially variant site given the read data.

#### 4. Assign sample genotypes

For each potentially variant site, the program applies Bayes' rule, using the likelihoods of alleles given the read data to calculate the likelihoods of each genotype per sample given the read data observed for that sample. The most likely genotype is then assigned to the sample.
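The genotype assignment step can be illustrated with a simplified diploid model. The sketch below assumes that per-read allele likelihoods are already available, that each read is drawn from either chromosome copy with probability 0.5, and that the genotype prior is flat unless one is supplied. It is an illustration of Bayes' rule at a single site, not GATK's actual implementation; the function name and the numbers are made up.

```python
from itertools import combinations_with_replacement
from math import prod

def genotype_posteriors(read_allele_liks, alleles, prior=None):
    """Posterior over diploid genotypes at one site.

    read_allele_liks: one {allele: P(read | allele)} dict per read.
    """
    genotypes = list(combinations_with_replacement(alleles, 2))
    if prior is None:
        prior = {g: 1.0 / len(genotypes) for g in genotypes}  # flat prior (assumption)
    unnormalised = {}
    for a1, a2 in genotypes:
        # A read comes from either chromosome copy with probability 0.5.
        lik = prod(0.5 * r[a1] + 0.5 * r[a2] for r in read_allele_liks)
        unnormalised[(a1, a2)] = prior[(a1, a2)] * lik
    total = sum(unnormalised.values())
    return {g: v / total for g, v in unnormalised.items()}

# Five reads supporting the reference allele and five supporting the alternate.
reads = [{"ref": 0.99, "alt": 0.01}] * 5 + [{"ref": 0.01, "alt": 0.99}] * 5
post = genotype_posteriors(reads, ["ref", "alt"])
print(max(post, key=post.get), post)
```

With balanced read support the heterozygous genotype dominates, because each homozygous genotype must explain half of the reads as errors.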

### Input

Input bam file(s) from which to make variant calls

### Output

Either a VCF or GVCF file with raw, unfiltered SNP and indel calls. Regular VCFs must be filtered either by variant recalibration (Best Practice) or hard-filtering before use in downstream analyses. If using the GVCF workflow, the output is a GVCF file that must first be run through GenotypeGVCFs and then filtering before further analysis.

### Usage examples

These are example commands that show how to run HaplotypeCaller for typical use cases. Have a look at the method documentation for the basic GVCF workflow.

### Caveats

• We have not yet fully tested the interaction between the GVCF-based calling or the multisample calling and the RNAseq-specific functionalities. Use those in combination at your own risk.

### Special note on ploidy

This tool is able to handle many non-diploid use cases; the desired ploidy can be specified using the -ploidy argument. Note however that very high ploidies (such as are encountered in large pooled experiments) may cause performance challenges including excessive slowness. We are working on resolving these limitations.

• When working with PCR-free data, be sure to set `-pcr_indel_model NONE` (see argument below).
• When running in `-ERC GVCF` or `-ERC BP_RESOLUTION` modes, the confidence threshold is automatically set to 0. This cannot be overridden by the command line. The threshold can be set manually to the desired level in the next step of the workflow (GenotypeGVCFs)
• We recommend using a list of intervals to speed up analysis. See this document for details.

## INDUSTRIAL APPLICABILITY

According to the present invention, it is possible to perform maximum likelihood estimation of haplotype frequencies in a population, diplotype distributions of individuals (posterior probability distributions of diplotype configurations) and penetrances by using genotype data and phenotype data, with no need to definitely determine diplotype configurations of the individuals. If the algorithm in accordance with the present invention is used, the association between the existence of a haplotype and one phenotype can be tested by using genotype data and phenotype data obtained as a result of, for example, a cohort study, a clinical trial or a case-control study.

This work was supported by the National Natural Science Foundation of China [grant numbers 81460169 and 8196030236] and the “Medical Excellence Award” Funded by the Creative Research Development Grant from the First Affiliated Hospital of Guangxi Medical University. This work was made possible through my IACN-ISN-HKSN funded scholarship.

L.P: study design, data collection and analysis, sample collection, experiment conduction, and drafting of the manuscript; R.X.Y: study design, data analysis, and revision of the manuscript; Y.H.L: study design, data and sample collection, revision of the manuscript; M.Q.M: sample collection, experiment conduction, and data collection; and Q.H.Z: data analysis. All authors read and approved the final manuscript.

## Methods

### Seed and soil collection

During the summer of 2015 we visited 192 wild populations spanning the native distribution of H. annuus, H. petiolaris, and H. argophyllus, and collected seeds from 21-37 individuals from each population. Seeds from ten additional populations of H. annuus had been previously collected in the summer of 2011. Three to five soil samples (0 - 25 cm depth) were collected with a corer at each population, from across the area in which seeds were collected. Soils were air dried in the field, further dried at 60 °C in the lab, and passed through a 2 mm sieve to remove roots and rocks. Soils were then submitted to Midwest Laboratories Inc. (Omaha, NE, USA) for analysis.

### Common garden

Ten plants from each of 151 selected populations were grown at the Totem Plant Science Field Station of the University of British Columbia (Vancouver, Canada) in the summer of 2016. Pairs of plants from the same population of origin were sown using a completely randomized design. At least three flowers from each plant were bagged before anthesis to prevent pollination, and manually crossed to an individual from the same population of origin. Phenotypic measurements were performed throughout plant growth, and leaves, stem, inflorescences and seeds were collected and digitally imaged to extract relevant morphometric data (see Supplementary Table 1).

### Library preparation and sequencing

Whole-genome shotgun (WGS) sequencing libraries were prepared for 719 H. annuus, 488 H. petiolaris, 299 H. argophyllus individuals, and twelve additional samples from annual and perennial sunflowers (Supplementary Table 1). Genomic DNA was sheared to ∼400 bp fragments using a Covaris M220 ultrasonicator (Covaris, Woburn, Massachusetts, USA) and libraries were prepared using a protocol largely based on Rowan et al., 2015 51, the TruSeq DNA Sample Preparation Guide from Illumina (Illumina, San Diego, CA, USA) and Rohland et al., 2012 52. In order to reduce the proportion of repetitive sequences, libraries were treated with a Duplex-Specific Nuclease (DSN; Evrogen, Moscow, Russia), following the protocols reported in Shagina et al. 2010 10 and Matvienko et al. 2013 53, with modifications (see Supplementary Methods for details). All libraries were sequenced at the McGill University and Génome Québec Innovation Center on HiSeq2500, HiSeq4000 and HiSeqX instruments (Illumina, San Diego, CA, USA), to produce paired end, 150 bp reads. Libraries with fewer reads were re-sequenced to increase genome coverage. After quality filtering (see below), a total of 60.7 billion read pairs were retained, equivalent to 14.5 Tbp of sequence data.

### Variant calling

The call set included the 1518 samples described above, the Sunflower Association Mapping (SAM) population (a set of cultivated H. annuus lines 54 ), and wild Helianthus samples previously sequenced for other projects 54–56 , for a total of 2392 samples (Supplementary Table 1). The additional samples were included to improve SNP calling, and to identify haploblock genotypes. Sequences were trimmed for low quality using Trimmomatic 57 (v0.36) and aligned to the H. annuus XRQv1 genome 9 using NextGenMap 58 (v0.5.3). We followed the best practices recommendations of The Genome Analysis ToolKit (GATK) 59 , and executed steps documented in GATK’s germline short variant discovery pipeline (for GATK 4.0.1.2). During genotyping, to reduce computational time and improve variant quality, genomic regions containing transposable elements were excluded 9 . Since performing joint-genotyping on the whole ensemble of samples would have been computationally impractical, genotyping was performed independently on three per-species cohorts (H. annuus, H. argophyllus and H. petiolaris).

### Variant quality filtering

Genotyping produced VCF files featuring an extremely large number of variant sites (222M, 78M and 167M SNPs and indels for H. annuus, H. argophyllus and H. petiolaris, respectively). Over the called portion of the genome, this corresponds to 0.07 to 0.2 variants per bp, with 30-47% of variable sites being indel variation. To remove low-quality calls and produce a dataset of a more manageable size, we used GATK’s VariantRecalibrator (v4.0.1.2), which filters variants in the call set according to a machine learning model inferred from a small set of “true” variants. In the absence of an externally-validated set of known sunflower variants to use as calibration, we computed a stringently-filtered set from top-N samples with highest sequencing coverage for each species (N=67 (SAM) samples for H. annuus, and N=20 otherwise). The stringency of the algorithm in classifying true/false variants was adjusted by comparing variant sets produced for different parameter values (tranche 100.0, 99.0, 90.0, 70.0, and 50.0). For each cohort, results for tranche = 90.0 were chosen for downstream analysis, based on heuristics: the number of novel SNPs identified, and improvements to the transition/transversion ratio (towards GATK’s default target of 2.15).
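For readers unfamiliar with the transition/transversion ratio used as a heuristic here, the following sketch shows how it can be computed from a list of biallelic SNPs given as (reference, alternate) base pairs. It is an illustration only, not the code used in the study, and the function names are hypothetical.

```python
PURINES = {"A", "G"}  # C and T are the pyrimidines

def is_transition(ref, alt):
    """Purine<->purine or pyrimidine<->pyrimidine changes are transitions."""
    return (ref in PURINES) == (alt in PURINES)

def ts_tv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) pairs with ref != alt."""
    ts = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")

snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
print(ts_tv_ratio(snps))
```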

### Remapping sites to the HA412-HOv2 reference genome

Our initial analysis of haploblocks (see section “Population genomic detection of haploblocks”), as well as GWA/GEA results for haploblock regions, found many instances of disconnected haploblocks and high linkage between distant parts of the genome, suggesting problems in contig ordering. We remapped genomic locations from XRQv1 9 to HA412-HOv2 11 using BWA 60 . Measures of LD using vcftools 61 showed that remapping significantly improved LD decay (Extended Data Fig. 1a) and produced more contiguous haploblocks (Extended Data Fig. 1b), supporting the accuracy of the new genome assembly and our remapping procedure. While we recognize that this approach reduces accuracy at the local scale, and would not be appropriate, for example, for determining the effects of variants on coding sequences, it produces a more accurate reflection of the genome and linkage structure.

### Phylogenetic analysis

Variants were called for 20 windows of 1 Mbp, randomly selected across the genome. Indels were removed and SNP sites were filtered for <20% missing data and minor allele frequency >0.1%. All sites were then concatenated and analyzed using IQtree 62–64 with ascertainment bias and otherwise default parameters.

### Genome-wide association mapping

Genome-wide association analyses were performed for 86, 30 and 69 phenotypic traits in H. annuus, H. argophyllus and H. petiolaris, respectively, using EMMAX (v07Mar2010) or the EMMAX module in EasyGWAS 65 . An annotated list of candidate genes is reported in Supplementary Table 2. Inflorescence and seed traits could not be collected for H. argophyllus, since most plants of this species flowered very late in our common garden, and failed to form fully-developed inflorescences and set seeds before temperatures became too low for their survival.

### Genome-environment association analyses

Twenty-four topo-climatic factors were extracted from climate data collected over a 30-year period (1961-1990) for the geographic coordinates of the population collection sites, using the software package Climate NA 66 . Soil samples from each population were also analyzed for 15 soil properties (Supplementary Table 1). The effects of each environmental variable were analyzed using BayPass 67 version 2.1. Following Gautier, 2015 67 , we employed Jeffreys’ rule 68 , and quantified the strength of associations between SNPs and variables as “strong” (10 dB ≤ BFis < 15 dB), “very strong” (15 dB ≤ BFis < 20 dB) and “decisive” (BFis ≥ 20 dB). An annotated list of candidate genes from GEA analyses is reported in Supplementary Table 2.

### Transgenes and expression assays

The complete coding sequences (CDS) of HaFT1, HaFT2 and HaFT6 were amplified from complementary DNA (cDNA) from H. argophyllus individuals carrying the early and late haplotypes for arg06.01. Two alleles of the HaFT2 CDS were identified in late-flowering H. argophyllus plants (one of them identical to the HaFT2 CDS from early-flowering individuals), differing only by two synonymous substitutions at positions 285 and 288. All alleles were placed under control of the constitutive CaMV 35S promoter in pFK210 derived from pGREEN 69 . Constructs were introduced into plants by Agrobacterium tumefaciens-mediated transformation 70 . Col-0 and ft-10 seeds were obtained from the Arabidopsis Biological Resource Center. All primer sequences are reported in Supplementary Table 3.

### Population genomic detection of haploblocks

The program lostruct (local PCA/population structure) was used to detect genomic regions with abnormal population structure 28 . Lostruct divides the genome into non-overlapping windows and calculates a PCA for each window. It then compares the PCAs derived from each window and calculates a similarity score. The matrix of similarity scores is then visualized using a multidimensional scaling (MDS) transformation. Lostruct analyses were performed on the H. annuus, H. argophyllus, H. petiolaris petiolaris, and H. petiolaris fallax datasets, as well as in a H. petiolaris dataset including both H. petiolaris petiolaris and H. petiolaris fallax individuals. For each dataset, lostruct was run with 100 SNP-wide windows and independently for each chromosome. Each MDS axis was then visualized by plotting the MDS score against the position of each window in the chromosome.

Many localized regions of extreme MDS values with high variation in MDS scores and sharp boundaries were detected (Fig. 4a, Extended Data Fig. 4). Localized changes to population structure could occur due to selection or introgression, but both the size and discrete nature of the regions are consistent with underlying structural changes defining the boundaries and preventing recombination. For example, inversions prevent recombination between orientations and, if inversion haplotypes are diverged enough, they will show up in lostruct scans 28 . Since we are interested in recombination suppression in the context of adaptation, we focused on regions that had the following features: (1) a PCA in the region should divide samples into three groups representing 0/0, 0/1 and 1/1 genotypes, (2) the middle 0/1 genotype should have higher average heterozygosity and (3) there should be high linkage disequilibrium (LD) within the region.

The combined evidence of PCA and linkage suggests that the lostruct outlier regions are characterized by long haplotypes with little or no recombination between haplotypes. We refer to these as haploblocks. To explore the haplotype structure underlying the haploblocks, sites correlated (R² > 0.8) with PC1 in the PCA of the haploblock were extracted as haplotype diagnostic sites and used to genotype the haploblocks. Since there is seemingly little recombination between haplotypes, this is conceptually similar to a hybrid index, and we expect all samples to be consistently homozygous for one haplotype's alleles or heterozygous at all sites (i.e. similar to an F1 hybrid). Haploblock genotypes were assigned to all samples using equation (1), where p is the proportion of haplotype 1 alleles and h is the observed heterozygosity. The haplotype structure was also visualized by plotting diagnostic SNP genotypes for each sample, with samples ordered by the proportion of alleles from haplotype 1 (e.g. Fig. 2f).
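The two quantities that enter the genotype assignment, p and h, can be computed per sample from the diagnostic sites. The following is a minimal sketch, not the authors' code: the function name and the 0/1/2 encoding (number of haplotype 1 alleles per site, `None` for a missing call) are assumptions made for illustration, and the assignment rule of equation (1) itself is not reproduced here.

```python
def haploblock_scores(genotypes):
    """Return (p, h) for one sample across the diagnostic sites of a haploblock.

    genotypes: list of counts of haplotype 1 alleles per site (0, 1 or 2),
    with None marking a missing call. This encoding is an assumption.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes for this sample")
    # p: proportion of haplotype 1 alleles among all called alleles
    p = sum(called) / (2 * len(called))
    # h: observed heterozygosity, the fraction of called sites that are heterozygous
    h = sum(1 for g in called if g == 1) / len(called)
    return p, h
```

Under this encoding a sample homozygous for haplotype 1 gives p near 1 and h near 0, while an F1-like heterozygote gives p near 0.5 and h near 1.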

Lostruct was run in SNP datasets containing H. petiolaris petiolaris, H. petiolaris fallax, and both subspecies together. Although each dataset produced a collection of haploblocks, they were not identical. Some haploblocks were identified in one subspecies, but not the other, and some were only identified when both subspecies were analyzed together. In some cases, it was clear that haploblocks identified in both subspecies represented the same underlying haploblock because they physically overlapped and had overlapping diagnostic markers. We manually curated the list of haploblocks and merged those found in multiple datasets. We set the boundaries of these merged haploblocks to be inclusive (i.e. include windows found in either) and the diagnostic markers to be exclusive (i.e. only include sites found in both). For this merged set of haploblocks, all H. petiolaris samples were genotyped using diagnostic markers.

### Design of genetic markers for haploblock screening

Diagnostic SNPs for haploblocks were extracted from filtered VCF files. The resulting Cleaved Amplified Polymorphic Sequence (CAPS) or direct sequencing markers were tested on representative subsets of individuals included in the original local PCA analysis (Fig. 4a, Extended Data Fig. 4), for which the genotype at haploblocks of interest was known. Marker information is reported in Supplementary Table 3.

### Sequencing coverage analysis

To detect the presence of potential deletions in the late-flowering allele of arg06.01, SNPs in the haploblock region with average coverage of at least 4 across at least one of the genotypic classes were selected (in order to exclude positions with overall low mapping quality). SNP positions with coverage 0 or 1 in one genotypic class were counted as missing data for that genotypic class (Extended Data Fig. 2c).

### H. annuus reference assemblies comparisons

Masked reference sequences for the H. annuus cultivars HA412-HOv2 and PSC8 11, 12 were aligned using MUMmer 71 (v4.0.0b2). The programs nucmer (parameters -b 1000 -c 200 -g 500) and dnadiff within the MUMmer package were used. Only orthologous chromosomes were aligned together because of the high similarity and known conservation of chromosome structure. The one-to-one output file was then visualized in R and only included alignments where both sequences were > 5000 bp. Inversion boundaries and sequence identity between haplotypes were further determined using Syri 72 .

### Genetic maps comparisons

Fourteen genetic maps were used: the seven H. annuus genetic maps used in the creation of the XRQv1 genome 9 ; three newly generated H. annuus maps obtained from wild X cultivar F2 populations (E.B.M.D., M.T., G.L.O., L.H.R., in preparation); two previously published H. petiolaris genetic maps obtained from F1 crosses 50 ; and two newly generated H. petiolaris maps (K.H., Rose L. Andrews, G.L.O., K.L.O., L.H.R., in preparation). Whenever necessary, marker positions relative to XRQv1 were re-mapped to the HA412-HOv2 assembly (see above). Six of the previously described H. annuus maps were obtained from crosses between cultivars (the seventh one was obtained from a wild X cultivar cross). In order to determine which haploblocks could be expected to segregate in the genetic maps, all of the H. annuus SAM population lines were genotyped for each H. annuus haploblock using diagnostic markers identified in wild H. annuus. Ann01.01 and ann05.01 were found to be highly polymorphic in the SAM population, while other haploblocks were fixed or nearly fixed for a single allele. For all fourteen maps, marker order was compared to physical positions in the HA412-HOv2 reference assembly, and evidence for suppressed recombination or structural variation was recorded (Extended Data Table 1).

Pairs of H. petiolaris and H. argophyllus populations that diverged for a large number of haploblocks were selected. Individuals from these populations were genotyped using haploblock diagnostic markers (see “Design of genetic markers for haploblock screening”) to identify, for each species, a pair of individuals with different genotypes at the largest possible number of haploblocks. Chromosome conformation capture sequencing 36, 73 (Hi-C) libraries were prepared by Dovetail Genomics (Scotts Valley, CA, USA) and sequenced on a single lane of HiSeq X with 150 bp paired-end reads. Reads were trimmed for enzyme cut site and base quality using the tool trim in the package HOMER 74 (v4.10) and aligned to the HA412-HOv2 reference genome using NextGenMap 58 (v0.5.4). Interactions were quantified using the calls ‘makeTagDirectory -tbp 1 -mapq 10’ and ‘analyzeHiC -res 1000000 -coverageNorm’ from HOMER. Hi-C data were used in two ways to identify structural changes. First, the difference between interaction matrices for samples of the same species was plotted for each haploblock region where the two samples had different genotypes. Second, the differences between the interaction matrix for H. annuus (using the Hi-C data that were generated to scaffold the HA412-HOv2 reference assembly 11 ) and those for each H. petiolaris and H. argophyllus sample were plotted.

### Haploblock phenotype and environment associations

Since haploblocks are large enough to affect genome-wide population structure, their associations with phenotypes or environmental variables may be masked when controlling for population structure. Therefore, a version of the variant file was created with all haploblock sites removed. GWA and GEA analyses were performed as before, but kinship, PCA and genetic covariance matrices were calculated using this haploblock-free variant file. Regions of high association co-localizing with haploblock regions were identified, and haploblocks were also directly tested by coding each haploblock as a single bi-allelic locus.

To examine the relative importance of haploblocks to trait evolution and environmental adaptation, association results were compared between haploblocks and SNPs. Using SNPs as a baseline allows us to control for the correlation between traits or environmental variables. To make values comparable, both SNPs and SVs with minor allele frequency ≤ 0.03 were removed. Each locus was classified as associated (p < 0.001 or BFis > 10 dB) or not with each trait. The number of traits or climate variables each locus was associated with was then counted. The proportion of loci with ≥ 1 associated trait or climate variable was then compared between SNPs and haploblocks using prop.test in R 75 (Extended Data Fig. 9b).

### Haploblocks phylogenies and dating

The phylogeny of each haploblock region was estimated by Bayesian inference using BEAST 76 1.10.4 for 100 genes within the region. The dataset was partitioned, assuming unlinked substitution and clock models for the genes, and analyzed under the HKY model with 4 Gamma categories for site heterogeneity, a strict clock, and a “Constant Size” tree prior with a Gamma distribution with shape parameter 10.0 and scale parameter 0.004 for the population size. Default priors were used for the other parameters. A custom Perl script was used to combine FASTA sequences and the model parameters into XML format for BEAST input. The Markov chain Monte Carlo (MCMC) process was run for 1 million iterations and sampled every 1000 states. The convergence of chains was inspected in Tracer 77 1.7.1. In order to estimate divergence times, the resulting trees were calibrated using a mutation rate estimate of 6.9 × 10⁻⁹ substitutions/site/year for sunflowers 78 , and visualized with the R package ggtree 79 and Figtree v1.4.4 80 . Divergence times were extracted from the trees and plotted showing the 95% highest posterior density (HPD) interval based on the BEAST posterior distribution. This was repeated for 100 non-haploblock genes to estimate the species divergence times.
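As a worked illustration of the calibration step, the sketch below converts a node height into years by dividing by the quoted mutation rate. It assumes that node heights are expressed in substitutions per site, which is one common way to run a strict-clock analysis; the function name and the example heights are hypothetical and do not come from the paper.

```python
MUTATION_RATE = 6.9e-9  # substitutions/site/year, the value quoted in the text


def height_to_years(height_subs_per_site, rate=MUTATION_RATE):
    """Convert a node height in substitutions/site into years (assumed units)."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return height_subs_per_site / rate


# Hypothetical node: median height and 95% HPD bounds in substitutions/site.
median, hpd_low, hpd_high = 0.0069, 0.0050, 0.0090
age_years = height_to_years(median)
hpd_years = (height_to_years(hpd_low), height_to_years(hpd_high))
```

The HPD bounds are rescaled with the same factor, so the interval keeps its relative width after calibration.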

For the 10 Mb region on chromosome 6 controlling flowering time in H. argophyllus, the early flowering haplotype grouped with H. annuus. To determine if it is the product of an ancient haplotype that has retained polymorphism only in H. annuus or if it is introgressed from H. annuus, the phylogeny of 10 representative H. argophyllus samples homozygous for each haploblock allele, as well as 200 H. annuus samples, was inferred using IQtree. SNPs from the 10 Mb region were concatenated and the maximum likelihood tree was constructed using the GTR model with ascertainment bias correction. Branch support was estimated using ultrafast bootstrap implemented in IQtree 62–64 with 1,000 bootstrap replicates. Phylogenies of haploblock arg03.01, arg03.02 and arg06.02 were inferred using the same approach. To explore intra-specific history of the H. petiolaris haploblocks, all samples homozygous for either allele for each haploblock were selected, and phylogenies were constructed using IQtree with the same settings.

## Method

### Electrophysiology

Cell lines stably expressing Kv11.1-1A or Kv11.1-3.1 channels were maintained as previously described (13). Patch clamp electrophysiology recordings were undertaken as previously described (13; a summary is provided in the data supplement that accompanies the online edition of this article).

Drug block was calculated as Idrug / Icontrol, and dose-response curves were fitted with a modified Hill equation, where Idrug is the current recorded in the presence of drug, Icontrol is the current recorded in control conditions, D is the drug concentration, h is the Hill coefficient, and IC50 is the half-maximal inhibitory concentration of D.
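To make the roles of D, h and IC50 concrete, here is a minimal sketch that uses the standard Hill form Idrug / Icontrol = 1 / (1 + (D / IC50)^h). This form is an assumption: the exact modified equation used by the authors is not shown in this text, and the function name is hypothetical.

```python
def hill_fraction(dose, ic50, hill):
    """Fraction of control current remaining (Idrug / Icontrol) at a given dose.

    Uses the standard Hill form, assumed here for illustration only.
    dose and ic50 must share the same concentration units.
    """
    if dose < 0 or ic50 <= 0:
        raise ValueError("dose must be >= 0 and ic50 must be > 0")
    return 1.0 / (1.0 + (dose / ic50) ** hill)
```

With this form the remaining current is exactly one half when the dose equals IC50, whatever the Hill coefficient, and it approaches 1 as the dose goes to zero.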

Data are presented as mean and standard error of the mean. Data were analyzed using one-tailed paired t tests and analysis of variance, followed by the Tukey t test for pairwise comparison. The significance threshold was set at 0.05.

### Clinical Cohort

The clinical cohort consisted of patients randomly assigned to one of five antipsychotic medications during phase 1/1A (first drug assigned) of the CATIE trial. The details of the overall design for the CATIE study, genotyping of the KCNH2 SNPs, and the participants’ demographic characteristics have been described previously (6, 9).

Because of the outpatient and parallel design of the original CATIE study, information about compliance based on drug clearance is an important factor determining symptom change during the CATIE trial. Thus, we only analyzed treatment response from subjects of European ancestry for whom we had drug clearance and genotype data (N=362). We previously showed that drug clearance data substantially improve prediction of treatment response (14). As an ancillary study to the CATIE trial, blood samples were drawn during study visits to measure antipsychotic drug concentrations. Data were collected on the amount of the last dose of medication, time the last dose was taken, and time the blood sample was drawn. This information was used with the drug concentration data to estimate drug clearance for each subject based on nonlinear mixed-effect modeling using NONMEM, version 5 (GloboMax, Ellicott City, Md.) (12). A one-compartment linear model with first-order absorption (NONMEM ADVAN5) using the first-order estimation method was used to estimate drug clearance (12).

We used estimated drug clearance instead of plasma concentrations because it is a dose-independent and time-independent measure, which allows for comparison of drug exposure across all subjects, as described in detail elsewhere (14, 15).

For this analysis, we focused on three SNPs in KCNH2—rs3800779 (SNP1), rs748693 (SNP2), and rs1036145 (SNP3)—which have been associated with increased expression of the novel Kv11.1-3.1 isoform in human postmortem brain samples (8) and overall response to treatment in the CATIE trial (9). Since the three SNPs were in moderate to strong linkage disequilibrium (see Table S1 in the online data supplement), in order to reduce multiple testing and to gain statistical power for detecting association, we constructed three SNP diplotypes to be used for testing diplotype-by-risperidone interaction on the treatment response. Haplotype construction was performed and phased diplotype was assigned using the Phase program (16). Details of genotyping and construction of diplotypes are provided in the data supplement. Diplotype was grouped into three categories according to the number of minor alleles that a diplotype contains at SNP1 and SNP3, coded “0” for no minor allele of either SNP1 or SNP3, “1” for one or two copies of minor alleles, and “2” for three or four copies of minor alleles. The distribution of diplotypes in individuals with drug clearance data was consistent with the total European ancestry sample in phase 1/1A of the CATIE trial, suggesting minimal selection bias (see Table S2 in the data supplement).
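The three-level diplotype coding described above can be written as a small function. This is a sketch of the stated rule only; the function and argument names are hypothetical, and phasing with the Phase program is not shown.

```python
def diplotype_category(snp1_minor, snp3_minor):
    """Code a diplotype by its minor-allele count at SNP1 and SNP3.

    Each argument is the number of minor alleles (0, 1 or 2) the diplotype
    carries at that SNP. Returns 0 for no minor allele, 1 for one or two
    copies, and 2 for three or four copies.
    """
    for count in (snp1_minor, snp3_minor):
        if count not in (0, 1, 2):
            raise ValueError("minor-allele counts must be 0, 1 or 2")
    total = snp1_minor + snp3_minor
    if total == 0:
        return 0
    if total <= 2:
        return 1
    return 2
```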

### Clinical Data Analysis

In the CATIE sample, because all patients were receiving treatment and because the time and number of Positive and Negative Syndrome Scale (PANSS) evaluations in the study varied considerably among subjects, we treated the baseline PANSS rating as “before treatment” and the last rating as “after treatment” to test for genetic variant-by-risperidone treatment interaction on the treatment response. Since each subject had two measures in the analysis, we used a general linear mixed model to incorporate the relatedness between two observations within a subject (9). We did not perform a separate analysis with only those subjects who completed the trial because that subset was too small (N=39).

We performed this analysis on all subjects for whom drug clearance data were available and for whom diplotypes were assigned with good confidence (N=362), and we controlled for potential covariates of sex, age, time on medication, and whether the patient completed the 18-month trial or discontinued medication before the end of 18 months and therefore switched to phase 2 of the trial.

Individuals who were on risperidone (N=88) had a mean estimated drug clearance rate of 20.62 L/hour (SD=10.73, range=3.61–40.05). Based on tertile distribution of the clearance data range, we classified individuals into three groups: slow (N=30), intermediate (N=29), and fast metabolism (N=29) groups.

The clinical response analysis consisted of a diplotype-by-treatment interaction, as previously described (9). To test our specific hypothesis of a differential effect of diplotype on the antipsychotic response to risperidone, which was based on the differential affinity of risperidone for Kv11.1 in contrast to all other drugs in the trial, we combined all other medications as one group (see the online data supplement for more detail). Because the mean and variance of estimated drug clearance varied with different drugs, we assigned an ordinal measure of 1 to 3 according to the tertile distribution of each drug clearance to make the estimated measurements comparable between drugs. Using an ordinal measure based on the tertiles of estimated drug clearance for each drug while adjusting for the type of drugs in the same model of analysis allowed us to capture the likely nonlinear relationship between estimated drug clearance and treatment response (9). For this analysis, however, since non-risperidone medications were all combined into one group, we considered the possibility that the effect of drug clearance using tertiles may be different between drugs and consequently may affect our assessment of the overall effect of drug clearance. Therefore, we performed a leave-one-out sensitivity analysis to assess for such a potential bias.
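The ordinal 1 to 3 clearance measure can be sketched as a rank-based tertile split applied within each drug. The rank-based rule and the function name below are assumptions; the text does not state how tertile boundaries or ties were handled.

```python
def tertile_codes(values):
    """Assign 1 (lowest third), 2 or 3 (highest third) to each value by rank.

    Ties are broken by input order, which is an assumption of this sketch.
    """
    n = len(values)
    if n == 0:
        return []
    order = sorted(range(n), key=lambda i: values[i])
    codes = [0] * n
    for rank, index in enumerate(order):
        codes[index] = rank * 3 // n + 1
    return codes
```

Applied to 88 distinct values, this split produces groups of 30, 29 and 29, the same sizes reported for the risperidone subjects, although that does not confirm the authors used this exact rule.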

Our primary aim is to learn haplotype-cluster models from large training sets and use them to phase samples efficiently and accurately. Here we introduce some modifications to BEAGLE so that the algorithm is better suited to this aim. Our new algorithm is called Underdog.

BEAGLE only represents haplotypes that actually appear in the training examples. However, since we would like to phase new genotype samples that do not necessarily appear in the training set, we set the transition probability for allele a at a given SNP to

(Eq. B1)

where na is the number of times allele a is observed in training data, and nā is the number of times the other allele is observed. This is compared with the BEAGLE formula shown in Algorithm 3. Here, γ is a positive number between 0 and 1. To illustrate the rationale for this choice of transition probability, consider the bottom state of level 2 in Figure 2.1. Instead of having only one transition (to the bottom state in level 3) with 100% probability, we add a second transition for the blue allele (also to the bottom state in level 3) that is visited with probability γ. We define all transition probabilities in the haplotype-cluster model in this way. These transition probabilities are only noticeably different from the transition probabilities in BEAGLE when one allele occurs very infrequently in the training set within a given cluster of haplotypes. With this modification, Underdog allows for genotype phase based on haplotypes that did not appear in the training set.

Although the BEAGLE haplotype-cluster models are intended to be parsimonious, building these models from hundreds of thousands of haplotypes can still yield very large models with millions of states, making it difficult to phase genotype samples in a reasonable amount of time. To address this problem, we first observe that although there is typically a large number of possible ways of phasing a sample, most of these possibilities are extremely unlikely conditioned on a specific haplotype-cluster model. In other words, most of the probability mass is typically concentrated on a small subset of paths through the HMM. To avoid considering all possible paths (which is computationally expensive), at a given level d we retain the smallest number of states such that the probability of being in one of those states is greater than 1 - ε. Even for small values of ε, this heuristic dramatically decreases the computational cost of sampling from the HMM, and computing the most likely phase using the Viterbi algorithm (Figure B1), while incurring very few additional phasing errors.

Figure B1: Relationship between choice of HMM parameter ε and average computation time for phasing a genotype sample (based on chromosome 1 only). If we set ε = 0, the average sample phasing time is 63 seconds, and the average phasing error rate is 0.93%. For choices of ε that are larger, but not too large, we achieve comparable phasing accuracy with a dramatic reduction in computational expense. Note that the computation time here does not include file input/output, nor the time taken to merge the phasing results from multiple windows.
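The state-pruning heuristic can be sketched as follows: at one level of the model, sort states by probability and keep adding them until the retained mass exceeds 1 - ε. This is an illustrative sketch, not Underdog's implementation; the function name and the dictionary representation of state probabilities are assumptions.

```python
def prune_states(state_probs, epsilon):
    """Keep the smallest set of states whose total probability exceeds 1 - epsilon.

    state_probs: dict mapping a state label to its probability at one level
    (assumed to sum to 1). Returns a dict with the retained states only.
    """
    kept = {}
    total = 0.0
    # Visit states from most to least probable so the retained set is minimal.
    ranked = sorted(state_probs.items(), key=lambda item: item[1], reverse=True)
    for state, prob in ranked:
        kept[state] = prob
        total += prob
        if total > 1.0 - epsilon:
            break
    return kept
```

With ε = 0 the threshold is never exceeded before the last state, so every state is retained, which matches the unpruned case described in the figure caption.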

The second modification we make to BEAGLE concerns the criterion for deciding whether two haplotype clusters (i.e., nodes of the haploid Markov model) should be merged during model learning (see Algorithm 4). Since the standard method is overly confident for frequencies that are close to 0 or 1, we regularize the estimates using a symmetric beta distribution as a prior. Specifically, haplotype clusters x and y are not merged unless the following condition is satisfied for some haplotype h:

(Eq. B2)

where nx and ny are the sizes of clusters x and y. The posterior allele frequency estimates in this formula are

(Eq. B3)

where nx(h) and ny(h) are the numbers of haplotypes that begin with haplotype h. We set the parameters of the Beta prior (the prior counts), α and β, to 0.5. Compare this criterion to the one used in Browning (2006), (also refer to Algorithm 3), which merges two clusters unless the following relation holds for some h:

(Eq. B4)

where px(h) is the proportion of haplotypes in cluster x that begin with haplotype h, and py(h) is the proportion of haplotypes in cluster y that begin with h. We evaluated the phasing accuracy of the algorithm using a few different values for the constant C and settled on C = 20.
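To show how the Beta prior regularizes frequencies near 0 or 1, the sketch below computes a posterior-mean estimate (n(h) + α) / (n + α + β). Treating the estimate as the posterior mean is an assumption, since the formula in Eq. B3 is not shown in this text; the function name is hypothetical.

```python
def posterior_frequency(count_h, cluster_size, alpha=0.5, beta=0.5):
    """Posterior-mean frequency of haplotypes beginning with h in one cluster.

    count_h: number of haplotypes in the cluster that begin with h.
    cluster_size: total number of haplotypes in the cluster.
    alpha, beta: prior counts of the Beta prior (0.5 each in the text).
    The posterior-mean form is an assumption of this sketch.
    """
    if not 0 <= count_h <= cluster_size:
        raise ValueError("count_h must lie between 0 and cluster_size")
    return (count_h + alpha) / (cluster_size + alpha + beta)
```

For a cluster of 9 haplotypes in which none begins with h, the raw proportion is 0 but this estimate is 0.05, so the comparison between clusters is less confident when counts are small.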

Algorithm 4 is the modified version of BEAGLE's procedure (Algorithm 3) that applies Eq. B2 to merging haplotypes during model building.

For computational efficiency, on each chromosome we estimate the genotype phase within 500-SNP windows separately. This can result in a loss of phasing accuracy at the beginning and end of each window because information outside the window is ignored, and therefore there is less information about the genotypes at the two extremities of the window. To address this problem, we learn haplotype-cluster models in overlapping windows; specifically, we use 500-SNP windows in which two adjacent windows on the same chromosome overlap by 100 SNPs. Since the final phasing estimates produced in the two windows may disagree in the overlapping portion, it is not immediately clear how to combine the phasing estimates from adjacent windows. We propose a simple solution to this problem. First, we select the SNP nearest the midpoint of the overlapping portion at which the genotype is heterozygous (that is, the two allele copies are not the same). We call this the "switch-point SNP." We then join the sequences from the overlapping windows that share the same allele at this switch-point SNP. For example, in Figure B2 we join the top sequence in the left-hand window with the bottom sequence in the right-hand window because they are both estimated to carry the blue allele at the selected switch-point SNP.

Figure B2: Underdog learns haplotype-cluster models in overlapping windows. This figure illustrates how we obtain the final genotype phase from these overlapping windows.
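The joining step can be sketched for two adjacent windows as below. This is a simplified illustration rather than Underdog's code: haplotypes are represented as strings of allele characters, the function name is hypothetical, and cutting each joined sequence immediately after the switch-point SNP is an assumption, since the text only specifies which sequences are paired.

```python
def join_windows(left, right, overlap):
    """Join phased haplotype pairs from two adjacent, overlapping windows.

    left, right: (hap_a, hap_b) tuples of equal-length allele strings.
    overlap: number of SNPs shared by the end of `left` and the start of `right`.
    """
    left_a, left_b = left
    right_a, right_b = right
    offset = len(left_a) - overlap  # index in `left` of the first shared SNP
    # Heterozygous SNPs within the overlap, as indices relative to its start.
    het = [i for i in range(overlap)
           if left_a[offset + i] != left_b[offset + i]
           and right_a[i] != right_b[i]]
    if not het:
        raise ValueError("no heterozygous SNP in the overlap")
    midpoint = (overlap - 1) / 2
    switch = min(het, key=lambda i: abs(i - midpoint))  # switch-point SNP
    # Pair the sequences that carry the same allele at the switch-point SNP.
    if left_a[offset + switch] != right_a[switch]:
        right_a, right_b = right_b, right_a
    cut = switch + 1  # assumed cut position: just after the switch point
    return (left_a[:offset + cut] + right_a[cut:],
            left_b[:offset + cut] + right_b[cut:])
```

For example, `join_windows(("0101", "1010"), ("1011", "0100"), 2)` swaps the right-hand pair, because the first sequences disagree at the switch point, and returns `("010100", "101011")`.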
