The next generation of next-generation sequencing is upon us. Third-generation sequencing aims to provide long stretches of sequence – ultimately to the chromosome level – at bargain basement prices. Progress is being made toward those goals with the emergence of long-read sequencing techniques and new methods for scaffolding, as well as rapidly developing software for assembling and interpreting the sequences (reviewed in Jiao and Schneeberger, 2017; Bolger et al., 2017). The application of these new technologies to plant genomes has been challenging, in part because of the large size and highly repetitive nature of many plant genomes. New work from Schmidt, Vogel, Denton et al. (2017) reports on the state of the art, using long-read technology to sequence the genome of a wild tomato (Solanum pennellii) accession.
The authors wanted to learn more about genomic differences between S. pennellii accession LYC1722 and the current reference accession LA716. Schmidt, Vogel, Denton et al. took the opportunity to see whether the latest developments in Oxford Nanopore sequencing meant that it could be useful, and cost-effective, for a plant genome. Oxford Nanopore sequencing involves feeding a fragment of DNA through a tiny (nano)pore in a flowcell and determining the sequence of that DNA based on how current through that pore is changed by the specific bases in the DNA. In addition to generating much longer reads than current short-read technology, the nanopore approach does not require a capital investment in expensive equipment, and uses relatively inexpensive reagents. It has been, however, more error prone than current short-read sequencing approaches.
The authors began with more traditional Illumina sequencing to get a sense of the genome size and complexity. They obtained 39 Gb of 300 bp paired-end reads and estimated the LYC1722 genome to be ~1–1.2 Gb based on k-mer analysis. Thus, its genome appears similar in size to the S. pennellii LA716 reference genome. There were 6.2 million predicted sequence variants compared to the S. pennellii reference genome, which is high compared to the variation among current tomato (S. lycopersicum) cultivars. Consistent with this diversity in genome sequence, metabolite contents also varied between the two S. pennellii accessions.
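For readers unfamiliar with k-mer-based size estimates, the basic idea can be sketched as follows: the total number of k-mers observed in the reads, divided by the depth at the main peak of the k-mer frequency histogram, approximates the genome size. This is a minimal illustration on made-up toy numbers, not the authors' pipeline; the function name, the `min_multiplicity` cutoff, and the histogram values are all assumptions introduced here.

```python
def estimate_genome_size(histogram, min_multiplicity=5):
    """Estimate genome size from a k-mer histogram.

    histogram maps multiplicity (how often a k-mer was seen) to the
    number of distinct k-mers with that multiplicity.
    min_multiplicity is a hypothetical cutoff that discards rare k-mers,
    which mostly come from sequencing errors.
    """
    solid = {m: n for m, n in histogram.items() if m >= min_multiplicity}
    if not solid:
        raise ValueError("no k-mers at or above the multiplicity cutoff")
    # The main peak approximates the average k-mer sequencing depth.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences among the retained k-mers.
    total_kmers = sum(m * n for m, n in solid.items())
    return total_kmers / peak_depth


# Toy histogram (invented values): an error spike at low multiplicity,
# a main peak at depth 30, and a few repeat k-mers at depth 60.
toy_histogram = {1: 1000, 2: 200, 28: 50, 29: 80, 30: 100, 31: 80, 32: 50, 60: 10}
toy_size = estimate_genome_size(toy_histogram)
print(f"Estimated genome size: {toy_size:.0f} bp")
```

Real k-mer histograms from heterozygous or highly repetitive genomes have several peaks, so dedicated tools fit a model rather than picking a single maximum as this sketch does.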
Armed with this overview of its genome, the authors applied Oxford Nanopore sequencing and de novo assembly to S. pennellii accession LYC1722. Using 31 flowcells, they were able to get ~100× coverage: they obtained 134.8 Gb of total sequence, which was winnowed to 110.96 Gb after filtering for quality control. The average read lengths were 6–14 kb depending on the preparation protocol, with the longest read reaching 153 kb. The authors estimated the quality (i.e., base accuracy) of these unassembled, unpolished data to be 80%, but contend that this value would likely improve if a basecaller were trained on plant data.
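The ~100× figure follows from simple arithmetic: sequencing depth is the number of sequenced bases divided by the genome size. The short sketch below reproduces that calculation from the numbers quoted above, using the 1–1.2 Gb k-mer estimate as the genome size range; the function and variable names are illustrative only.

```python
def coverage(total_bases_gb, genome_size_gb):
    """Average sequencing depth: sequenced bases divided by genome size."""
    return total_bases_gb / genome_size_gb


raw_gb = 134.8        # total nanopore sequence reported
filtered_gb = 110.96  # sequence retained after quality filtering
flowcells = 31

# Bracket the depth with the two ends of the 1-1.2 Gb genome size estimate.
depth_low = coverage(filtered_gb, 1.2)
depth_high = coverage(filtered_gb, 1.0)

retained_fraction = filtered_gb / raw_gb
raw_yield_per_flowcell = raw_gb / flowcells

print(f"Depth after filtering: {depth_low:.1f}x to {depth_high:.1f}x")
print(f"Fraction of sequence retained: {retained_fraction:.1%}")
print(f"Average raw yield per flowcell: {raw_yield_per_flowcell:.2f} Gb")
```

Both ends of the range sit close to 100×, which is consistent with the coverage reported.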
The authors tested various assembly software and were able to achieve assemblies for which the N50 contig length was 2.45 Mb (i.e., half of the assembly was in contigs of 2.45 Mb or longer) and the full genome sequence was assembled in just 899 contigs. Schmidt, Vogel, Denton et al. also provide evidence that longer initial reads (over 20 kb) would give even better assemblies. Based on discrepancies with the Illumina sequencing reads, the authors determined these unpolished assemblies to have error rates of ~1 to 9%, with deletions the most common type of error. Sequence polishing software, however, was able to bring error rates down to values similar to those of Illumina sequencing (~0.016–0.025%).
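The N50 statistic described above can be computed in a few lines: sort the contigs from longest to shortest and add up their lengths until the running total reaches half of the assembly; the length of the contig that crosses that threshold is the N50. The sketch below uses invented toy contig lengths, not data from the study.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    The N50 is the length L such that contigs of length L or longer
    together contain at least half of all assembled bases.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    half_total = sum(contig_lengths) / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half_total:
            return length


# Toy assembly (invented lengths, in kb): total is 30, so half is 15.
# The two longest contigs (10 + 8 = 18) cross that threshold.
toy_contigs = [2, 4, 6, 8, 10]
print(f"N50 of toy assembly: {n50(toy_contigs)} kb")
```

Because N50 depends only on the contig length distribution, it measures contiguity and says nothing about base-level accuracy, which is why the error rates are reported separately.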
Finally, the authors used their genome sequencing data to identify five genes present in their focal accession and in Arabidopsis thaliana, but absent from cultivated tomato, providing some interesting candidates for future work to explore the basis of the differences between wild and cultivated tomato. In all, the authors conclude that it is possible to obtain useful sequences of 1 Gb genomes for less than $25,000 in reagents, although labor is required for careful quality assessment and polishing. As these tools are rapidly evolving, there is every reason to expect that this long-read sequencing and assembly technology, together with powerful comparative analyses, will pave the way for rapid advancement in our understanding of the genomic dimensions of variation in biological form and function.
Bolger, M.E., Arsova, B., and Usadel, B. (2017). Plant genome and transcriptome annotations: from misconceptions to simple solutions. Briefings in Bioinformatics. doi: 10.1093/bib/bbw135.
Jiao, W.-B., and Schneeberger, K. (2017). The impact of third generation genomic technologies on plant genome assembly. Current Opinion in Plant Biology 36, 64-70.
Schmidt, M.H.W., Vogel, A., Denton, A.K., Istace, B., Wormit, A., van de Geest, H., Bolger, M.E., Alseekh, S., Maß, J., Pfaff, C., Schurr, U., Chetelat, R.T., Maumus, F., Aury, J.-M., Koren, S., Fernie, A.R., Zamir, D., Bolger, A., and Usadel, B. (2017). De novo Assembly of a New Solanum pennellii Accession Using Nanopore Sequencing. Plant Cell. doi: 10.1105/tpc.17.00521.