Trust but Verify: A Lesson in Technology Limitations and Error Propagation
By C. Robin Buell
Molecular biology techniques are routinely used in individual investigator labs. Genome sequencing, by contrast, has historically been performed at genome centers because of the high cost of infrastructure. Quality control and quality assessment are a mainstay at these centers, and it is the rare exception that major errors are not corrected prior to full data release. However, this is not absolute. Even for Arabidopsis thaliana, a consequence of the bacterial artificial chromosome clone-based sequencing method used in the project was that E. coli and vector sequences were still present in release 7 of the A. thaliana genome (Lamesch et al., 2012). Not only were these contaminants missed in the initial release of the sequence, but they were also propagated through multiple releases and corrections of the genome.
Indeed, most likely because a high-quality A. thaliana genome has been available to the community for nearly two decades, a significant number, potentially a majority, of Arabidopsis researchers have limited knowledge of the errors inherent in genome sequencing projects. The Letter by Sloan et al. (2018), “Correction of persistent errors in Arabidopsis reference mitochondrial genomes,” highlights this issue. These researchers describe the history of the A. thaliana Col-0 mitochondrion reference sequence, noting that the sequence provided in the TAIR10 release is that of the C24 ecotype mitochondrion, not the Col-0 ecotype. Thus, the official reference sequences for A. thaliana are a hybrid of Col-0 (nuclear genome, plastid) and C24 (mitochondrion), even though the mitochondrial sequence of the Col-0 ecotype was reported in 2011 (Davila et al., 2011). After noticing variants in their Col-0 mitochondrial sequence relative to the Col-0 reference mitochondrion sequence, Sloan et al. used multiple datasets generated on improved, diverse next-generation sequencing platforms to identify errors in the Col-0 mitochondrial reference genome sequence. Overall, they identified an average of one error every 2.4 kbp and two structural variants; none affected protein-coding potential. Of course, genome sequencing errors and their propagation are not restricted to Arabidopsis (e.g., Gallaher et al., 2018), and as sequencing and computational technologies mature, errors will become less frequent. Yet a key concern for the community will be whether funds and, more importantly, researchers are engaged at a sufficient level to identify, correct, and update older reference genome sequences.
Another concern for the community is that genomics is now “democratized”: ultra-high-throughput sequencing platforms can be run on a laptop or smartphone, applications of sequencing technologies such as RNA-sequencing (RNA-seq) are ubiquitous, and open-source software tools and applications are “push button.” While these are tremendous advances, the ease with which any scientist can generate sequence for any organism will certainly lead to a “de-skilling,” in which generators of sequence data are not cognizant of the limitations, biases, and errors inherent to the platform or software and, as a consequence, can unwittingly release datasets with errors. Thus, a trust-but-verify approach should be employed when using large-scale genome datasets. Scientists are encouraged to thoroughly investigate the provenance and underlying quality of genome datasets: errors do happen, and without a deep historical knowledge of a project, they can be propagated for decades, as shown by Sloan et al. (2018).
ACKNOWLEDGMENTS
Image credit: Copyright https://www.123rf.com/profile_ktsdesign
REFERENCES
Davila, J.I., Arrieta-Montiel, M.P., Wamboldt, Y., Cao, J., Hagmann, J., Shedge, V., Xu, Y.Z., Weigel, D., and Mackenzie, S.A. (2011). Double-strand break repair processes drive evolution of the mitochondrial genome in Arabidopsis. BMC Biol. 9: 64.
Gallaher, S.D., Fitz-Gibbon, S.T., Strenkert, D., Purvine, S.O., Pellegrini, M., and Merchant, S.S. (2018). High-throughput sequencing of the chloroplast and mitochondrion of Chlamydomonas reinhardtii to generate improved de novo assemblies, analyze expression patterns and transcript speciation, and evaluate diversity among laboratory strains and wild isolates. Plant J. 93: 545–565.
Lamesch, P., Berardini, T.Z., Li, D., Swarbreck, D., Wilks, C., Sasidharan, R., Muller, R., Dreher, K., Alexander, D.L., Garcia-Hernandez, M., Karthikeyan, A.S., Lee, C.H., Nelson, W.D., Ploetz, L., Singh, S., Wensel, A., and Huala, E. (2012). The Arabidopsis Information Resource (TAIR): improved gene annotation and new tools. Nucleic Acids Res. 40: D1202–D1210.
Sloan, D.B., Wu, Z., and Sharbrough, J. (2018). Correction of persistent errors in Arabidopsis reference mitochondrial genomes. Plant Cell, https://doi.org/10.1105/tpc.18.00024.