First Generation Wheat Hapmap

Overview

A detailed description of DNA sequence variation across the genome is a prerequisite for the systematic analysis of variants underlying trait variation in wheat, and is critical for understanding how evolutionary factors shape genome diversity. Newly developed sequencing technologies make it possible to obtain a complete catalog of single nucleotide polymorphisms (SNPs) and insertions/deletions (indels), even for the complex wheat genome. The recent release of the wheat genome assembly allowed us to describe the chromosomal distribution of variants and their potential effects on gene function. The resulting haplotype map will be a valuable tool for imputing genotypes and transferring sequence-level variation data across gene mapping projects, thereby increasing the power and precision of trait mapping in genome-wide association studies (GWAS) and improving our understanding of the basis of complex phenotypic traits.

The data were generated by re-sequencing 62 diverse wheat lines using whole exome capture (WEC) and genotyping-by-sequencing (GBS) approaches. The panel of wheat lines was selected to capture the genetic diversity of the major global wheat growing regions and included landraces and cultivars (see Fig. 1). Of these lines, 32 are founders of the spring wheat nested-association mapping (NAM) population, which comprises 2,400 recombinant inbred lines (RILs). We identified 1.57 million SNPs and 161,719 small indels distributed across all 21 chromosomes. In coding sequences we identified 83,622 non-synonymous and 76,361 synonymous SNPs. Based on high-confidence gene models in the chromosome survey sequence (CSS) contigs, we determined that only 1,600 and 1,583 SNPs are predicted to produce premature termination codons and splice-site disruptions, respectively.

Please refer to our paper for more details.

Wheat Lines Sequenced

| Wheat line | Origin | Improvement status | Growth habit | Region | Large region |
|---|---|---|---|---|---|
| RAC875 | Australia | cultivar | spring | Australia | Australia |
| Opata | Mexico | cultivar | spring | North and Central America | The Americas |
| W7984 | Mexico | synthetic | spring | North and Central America | The Americas |
| PBW343 | India | cultivar | spring | South-central Asia | Asia |
| Clear White | USA | cultivar | spring | North and Central America | The Americas |
| Vorobey | Mexico | cultivar | spring | North and Central America | The Americas |
| Klein Chamaco | Argentina | cultivar | spring | South America | The Americas |
| Pavon | Mexico | cultivar | spring | North and Central America | The Americas |
| acc2 | USA | breeding line | spring | North and Central America | The Americas |
| acc3 | USA | breeding line | spring | North and Central America | The Americas |
| acc4 | USA | breeding line | spring | North and Central America | The Americas |
| acc1 | USA | breeding line | spring | North and Central America | The Americas |
| acc5 | USA | breeding line | spring | North and Central America | The Americas |
| PI366716 | Afghanistan | landrace | spring | South-central Asia | Asia |
| PI406517 | Nepal | landrace | spring | South-central Asia | Asia |
| PI349512 | Switzerland | landrace | spring | Western and Northern Europe | Europe |
| PI481923 | Sudan | landrace | spring | Northern Africa | Africa |
| PI481718 | Bhutan | landrace | spring | South-central Asia | Asia |
| PI382150 | Japan | landrace | spring | Eastern Asia | Asia |
| PI366905 | Afghanistan | landrace | spring | South-central Asia | Asia |
| PI470817 | Algeria | landrace | spring | Northern Africa | Africa |
| PI445736 | Nepal | landrace | spring | South-central Asia | Asia |
| Hidhab | Algeria | cultivar | spring | Northern Africa | Africa |
| PI477870 | Peru | landrace | spring | South America | The Americas |
| Dharwar Dry | India | cultivar | spring | South-central Asia | Asia |
| Cham 6 | Syria/Lebanon | cultivar | spring | Western Asia | Asia |
| Chakwal 86 | Pakistan | cultivar | spring | South-central Asia | Asia |
| Berkut | Mexico | cultivar | spring | North and Central America | The Americas |
| PI262611 | Turkmenistan | landrace | spring | South-central Asia | Asia |
| PI222669 | Iran | landrace | spring | Western Asia | Asia |
| PI278297 | Greece | landrace | spring | Southern Europe | Europe |
| PI210945 | Cyprus | landrace | spring | Western Asia | Asia |
| PI192569 | Sweden | landrace | spring | Western and Northern Europe | Europe |
| PI192147 | Ethiopia | landrace | spring | South and East Africa | Africa |
| PI8813 | Iraq | landrace | spring | Western Asia | Asia |
| PI565213 | Bolivia | landrace | spring | South America | The Americas |
| PI82469 | North Korea | landrace | spring | Eastern Asia | Asia |
| PI185715 | Portugal | landrace | spring | Southern Europe | Europe |
| PI245368 | Guatemala | landrace | spring | North and Central America | The Americas |
| PI166333 | Turkey | landrace | spring | Western Asia | Asia |
| PI166180 | India | landrace | spring | South-central Asia | Asia |
| PI177943 | Turkey | landrace | spring | Western Asia | Asia |
| PI192001 | Angola | landrace | spring | South and East Africa | Africa |
| PI153785 | Brazil | landrace | spring | South America | The Americas |
| Marquis | Canada | cultivar | spring | North and Central America | The Americas |
| Neepawa | Canada | cultivar | spring | North and Central America | The Americas |
| AC Barrie | Canada | cultivar | spring | North and Central America | The Americas |
| Chinese Spring | China | cultivar | spring | Eastern Asia | Asia |
| Utmost | Canada | cultivar | spring | North and Central America | The Americas |
| Rialto | United Kingdom | cultivar | winter | Western and Northern Europe | Europe |
| Truman | USA | cultivar | winter | North and Central America | The Americas |
| 49-2914 H1096 | Argentina | breeding line | facultative | South America | The Americas |
| 102 | Chile | cultivar | facultative | South America | The Americas |
| 93 | Bulgaria | cultivar | facultative | Western and Northern Europe | Europe |
| Estacao | Portugal | cultivar | winter | Southern Europe | Europe |
| Taxi | United Kingdom | cultivar | winter | Western and Northern Europe | Europe |
| PR267 | USA | cultivar | winter | North and Central America | The Americas |
| Alabasskaja | Kazakhstan | cultivar | winter | | Asia |
| Roemer Winter | Germany | cultivar | winter | Western and Northern Europe | Europe |
| 407-IV/60 | Bosnia and Herzegovina | cultivar | facultative | Western and Northern Europe | Europe |
| 403 | Chile | cultivar | winter | South America | The Americas |
| Avalon | United Kingdom | cultivar | winter | Western and Northern Europe | Europe |


Figure 1. Right panel. Neighbor-joining tree of 62 diverse accessions color-coded by the place of origin. The tree was constructed using the R package ape v3.0.6. Left panel. Principal component analysis of the wheat diversity panel. Accessions selected for re-sequencing are shown by red circles; spring and winter wheat accessions are shown by triangles and circles.


Variant Calling Summary

Distribution of wheat exome capture (WEC) and genotyping-by-sequencing (GBS) variants among the genomic features.
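| Approach | Variant class | Total | A genome | B genome | D genome |
|---|---|---|---|---|---|
| WEC | All SNPs | 1,341,350 | 490,348 | 645,406 | 205,596 |
| WEC | Nonsyn., total | 77,920 | 27,645 | 37,194 | 13,081 |
| WEC | PTC | 1,508 | 548 | 710 | 250 |
| WEC | Read through | 39 | 14 | 19 | 6 |
| WEC | Start lost | 159 | 71 | 66 | 22 |
| WEC | Start gained | 2,481 | 890 | 1,190 | 401 |
| WEC | Syn. | 70,490 | 23,423 | 35,262 | 11,805 |
| WEC | Intronic | 166,741 | 58,791 | 82,545 | 25,405 |
| WEC | 3' UTR | 29,265 | 10,797 | 13,720 | 4,748 |
| WEC | 5' UTR | 11,440 | 3,956 | 5,556 | 1,928 |
| WEC | Upstream | 108,992 | 39,955 | 53,229 | 15,808 |
| WEC | Downstream | 120,873 | 46,005 | 56,602 | 18,266 |
| WEC | Intergenic | 755,629 | 279,776 | 361,298 | 114,555 |
| WEC | Indels, total | 147,064 | 53,986 | 68,104 | 24,974 |
| WEC | Indels, CDS | 10,112 | 3,569 | 4,561 | 1,982 |
| WEC | Indels, UTR | 15,013 | 5,417 | 7,103 | 2,493 |
| WEC | Indels, intronic | 32,122 | 11,388 | 15,362 | 5,372 |
| WEC | Indels, intergenic | 89,817 | 33,612 | 41,078 | 15,127 |
| GBS | All SNPs | 225,304 | 99,217 | 73,395 | 52,692 |
| GBS | Nonsyn., total | 5,702 | 1,990 | 2,158 | 1,554 |
| GBS | PTC | 92 | 37 | 36 | 19 |
| GBS | Read through | 2 | 1 | 1 | 0 |
| GBS | Start lost | 9 | 1 | 6 | 2 |
| GBS | Start gained | 202 | 79 | 90 | 33 |
| GBS | Syn. | 5,871 | 2,083 | 2,158 | 1,630 |
| GBS | Intronic | 13,955 | 5,121 | 5,138 | 3,696 |
| GBS | 3' UTR | 2,495 | 933 | 944 | 618 |
| GBS | 5' UTR | 1,158 | 387 | 523 | 248 |
| GBS | Upstream | 11,707 | 4,206 | 4,808 | 2,693 |
| GBS | Downstream | 11,800 | 4,640 | 4,744 | 2,416 |
| GBS | Intergenic | 172,616 | 79,857 | 52,922 | 39,837 |
| GBS | Indels, total | 14,655 | 5,971 | 5,066 | 3,618 |
| GBS | Indels, CDS | 422 | 147 | 163 | 112 |
| GBS | Indels, UTR | 678 | 243 | 290 | 145 |
| GBS | Indels, intronic | 1,481 | 521 | 553 | 407 |
| GBS | Indels, intergenic | 12,074 | 5,060 | 4,060 | 2,954 |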
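
As a quick consistency check, the per-approach totals in the table above can be added up and compared with the figures quoted in the Overview. The short Python sketch below does only that arithmetic; the dictionary keys are labels chosen for this example, and the simple addition assumes the Overview counts are plain sums of the WEC and GBS call sets.

```python
# Totals per approach, copied from the variant calling summary table.
WEC = {"SNPs": 1_341_350, "indels": 147_064, "nonsyn": 77_920, "syn": 70_490, "PTC": 1_508}
GBS = {"SNPs": 225_304, "indels": 14_655, "nonsyn": 5_702, "syn": 5_871, "PTC": 92}


def combine(a, b):
    """Add the counts of two call sets class by class."""
    return {key: a[key] + b[key] for key in a}


combined = combine(WEC, GBS)
for name, count in combined.items():
    print(f"{name}: {count:,}")
# SNPs: 1,566,654   (about 1.57 million)
# indels: 161,719
# nonsyn: 83,622
# syn: 76,361
# PTC: 1,600
```

The sums match the 1.57 million SNPs, 161,719 indels, 83,622 non-synonymous SNPs, 76,361 synonymous SNPs, and 1,600 premature termination codons reported in the Overview.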


Wheat Hapmap Applications

  • Selection scans
    • Pairwise Haplotype Sharing (PHS)
    • The PHS statistic was calculated as described by Toomajian et al. using a custom Perl script. This statistic detects genomic regions where haplotypes extend across relatively large portions of the genome, normalized by the overall length of haplotype blocks within the genome. Such regions are characteristic of partial selective sweeps. Thresholds for the PHS statistic were set at the 97.5th percentile of the overall distribution of PHS values: 0.72, 0.67, and 1.23 for the A, B, and D genomes, respectively.

    • Cross population composite likelihood ratio (XP-CLR)
    • We performed a genome scan for selected regions using an XP-CLR approach that is robust to assumptions regarding recombination rates and demography. This method compares two populations for allele frequency differentiation and the extent of linked variation in order to detect regions where the change in frequency occurred too quickly to be caused by random drift. The XP-CLR scores were estimated using code downloaded from here. Grid points were placed along the chromosome arms with a spacing of 500 kb, the window size was set to 0.1 cM, and the maximum number of SNPs in each window was fixed at 500. The critical values for putative selection targets were set at the 97.5th percentile of the test statistic distribution for each wheat genome: 4.9, 4.5, and 5.0 for the A, B, and D genomes, respectively.
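
Both scans above set their critical values at the 97.5th percentile of the per-genome score distribution. The Python sketch below illustrates that thresholding step only, assuming the scores have already been computed. It does not calculate PHS or XP-CLR, it is not the custom Perl script mentioned above, and the function names and input layout are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def percentile(values: Sequence[float], q: float) -> float:
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def flag_candidates(
    scores_by_genome: Dict[str, List[Tuple[int, float]]], q: float = 97.5
) -> Dict[str, Tuple[float, List[int]]]:
    """For each genome, return (cutoff, positions whose score exceeds it).

    Input records are hypothetical (position, score) pairs.
    """
    result = {}
    for genome, records in scores_by_genome.items():
        cutoff = percentile([score for _, score in records], q)
        hits = [pos for pos, score in records if score > cutoff]
        result[genome] = (cutoff, hits)
    return result


# Toy data: 41 grid points spaced 500 kb apart with made-up scores 0..40.
toy = {"A": [(i * 500_000, float(i)) for i in range(41)]}
cutoff, hits = flag_candidates(toy)["A"]
print(cutoff, hits)
```

Computing the cutoff separately for the A, B, and D genomes mirrors the per-genome thresholds reported above; the interpolation rule used here is one common percentile definition and may differ slightly from the one used in the original analysis.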

  • Genotype Imputation
    • Genotype imputation was performed using Beagle v.4 with the following parameters: window=5000 overlap=500 burnin-its=10 impute-its=10. To increase imputation accuracy, burnin-its and impute-its were raised from their default of 5 to 10, as recommended in the user's manual. Imputation accuracy for cultivars Avalon and Rialto did not differ significantly across window sizes of 1,000 to 5,000 markers, so window=5000 was selected for its computational efficiency.
      To test imputation accuracy, we selected each cultivar from the panel of 62 lines in turn and masked all genotyped sites except ~14,000 SNPs shared between the WEC dataset and the 90K SNP array. At these SNP sites, at least 75% of accessions in both datasets had genotype calls. The remaining 61 cultivars served as a reference panel for imputing 649,502 SNPs ordered along the wheat chromosomes. After imputation, genotypes were filtered over a range of thresholds on the genotype probability reported by Beagle. The filtered predicted genotypes in each cultivar were then compared with the actual genotype calls obtained by WEC sequencing to assess imputation accuracy.

      Figure 2. Relationship between the accuracy of genotype imputation and the percentage of missing data, estimated after removing genotypes over a range of genotype calling probability thresholds. Imputation in the Opata (solid lines) and Rialto (dashed lines) cultivars was performed using the reference panel of 60 lines (Opata and Rialto were excluded) genotyped using the 90K iSelect assay.
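
The scoring step of the masking test described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the pipeline used for the project: it does not run Beagle, the input dictionaries (site to true genotype, and site to imputed genotype with its probability) are hypothetical, and the function name is invented for this example.

```python
from typing import Dict, List, Sequence, Tuple


def accuracy_by_threshold(
    truth: Dict[str, str],
    imputed: Dict[str, Tuple[str, float]],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, accuracy, missing fraction) for each threshold.

    Genotypes with probability below the threshold are treated as missing;
    accuracy is computed on the genotypes that remain.
    """
    comparable = [site for site in imputed if site in truth]
    rows = []
    for threshold in thresholds:
        kept = [s for s in comparable if imputed[s][1] >= threshold]
        correct = sum(1 for s in kept if imputed[s][0] == truth[s])
        accuracy = correct / len(kept) if kept else float("nan")
        missing = 1 - len(kept) / len(comparable) if comparable else float("nan")
        rows.append((threshold, accuracy, missing))
    return rows


# Toy example for one masked cultivar (made-up calls and probabilities).
truth = {"s1": "AA", "s2": "AB", "s3": "BB", "s4": "AA"}
imputed = {"s1": ("AA", 0.99), "s2": ("AA", 0.60), "s3": ("BB", 0.85), "s4": ("AA", 0.95)}
rows = accuracy_by_threshold(truth, imputed, [0.5, 0.9])
for threshold, accuracy, missing in rows:
    print(threshold, accuracy, missing)
# 0.5 0.75 0.0
# 0.9 1.0 0.5
```

Raising the probability threshold trades completeness for accuracy, which is the relationship plotted in Figure 2.
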
  • GWAS with imputed data
    • In genome-wide association studies (GWAS), marker density affects the probability of finding variants in linkage disequilibrium (LD) with a causal variant. We showed that our reference panel of 62 accessions can be used effectively to impute DNA polymorphisms in a GWAS. A panel of 678 diverse wheat landraces phenotyped for resistance to three rust diseases was tested for marker-trait associations using genotyping data generated with the wheat 90K SNP array. Overall, comparison of marker-trait associations at non-imputed and imputed sites shows that imputed SNPs not only increase marker density but in most cases perform similarly to or better than the SNPs directly genotyped using the 90K assay. These results demonstrate the value for GWAS of a more complete ascertainment of DNA polymorphisms, achieved by combining the high-density SNP variation data developed from 62 lines with the public 90K iSelect genotyping array.

Dataset

  • VCF files
  • Watkins panel GWAS dataset, including 90K genotyping data, imputed SNPs and phenotyping scores
  • Wheat Exome Capture design file, 107Mb
  • Publication

    Imputation tools

    To help wheat researchers use the first generation wheat hapmap data, we have developed tools/scripts for data formatting (to or from hmp, vcf, tsv, etc.), as well as scripts for running imputation in parallel and cleaning/filtering the imputed results. Experienced Unix/Linux users may download all the tools/scripts, including example data, from this link. An online server for running imputation is also under construction.
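
To give a sense of what the hmp-to-vcf formatting step involves, the Python sketch below converts one HapMap-format row into the fields of a VCF data line. It is not one of the project scripts. It assumes the common hmp layout of 11 metadata columns followed by two-character genotype calls (for example AA, AG, NN), and it treats the first allele listed in the alleles column as REF, which may not be the reference genome allele. The function name is hypothetical.

```python
from typing import List

HMP_METADATA_COLUMNS = 11  # assumed: rs#, alleles, chrom, pos, strand, ... QCcode


def hmp_row_to_vcf(fields: List[str]) -> List[str]:
    """Convert one split hmp row into the fields of a VCF data line."""
    snp_id, alleles, chrom, pos = fields[0], fields[1], fields[2], fields[3]
    ref, alt = alleles.split("/")  # assumption: biallelic, first allele used as REF
    codes = {ref: "0", alt: "1"}
    genotypes = []
    for call in fields[HMP_METADATA_COLUMNS:]:
        if len(call) != 2 or any(base not in codes for base in call):
            genotypes.append("./.")  # missing or unrecognized call
        else:
            genotypes.append("/".join(sorted(codes[base] for base in call)))
    # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT samples...
    return [chrom, pos, snp_id, ref, alt, ".", ".", ".", "GT"] + genotypes


row = ["snp1", "A/G", "1A", "12345", "+", "NA", "NA", "NA", "NA", "NA", "NA",
       "AA", "AG", "GG", "NN"]
vcf_fields = hmp_row_to_vcf(row)
print("\t".join(vcf_fields))
```

A complete converter would also write the VCF header lines and handle multi-allelic or indel records, which this sketch leaves out.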

    Project participants

    Funding

    USDA National Institute of Food and Agriculture Grants 2011-68002-30029 (Triticeae-CAP) and 2012-67013-1940, Bill and Melinda Gates Foundation, Genome Prairie, Genome Canada, Saskatchewan Ministry of Agriculture, Western Grains Research Foundation, BBSRC, KSU Plant Biotechnology Center and Kansas Agricultural Experiment Station.