How To Start Your Coding Journey

While learning how to code may seem like a daunting task, the ability to code can help streamline your research in plant science. Knowing how to code can significantly improve your data analysis and visualization and help you automate tasks, not to mention that it looks good on your CV. Although many programming languages exist, the two best for plant scientists (or biologists in general) to learn are Python and R.

Python is widely used both in the sciences and in other fields and is often considered a good first language to learn. Packages such as NumPy for numerical computation (Harris et al., 2020) and pandas for data analysis (The pandas development team, 2020) can be easily incorporated into your code. Biopython (Cock et al., 2009) is a powerful set of libraries that makes molecular biology work easier, handling tasks such as transcribing DNA sequences to RNA and translating them into amino acids, retrieving sequences in FASTA format from databases like NCBI, and creating phylogenetic trees. Outside of scientific applications, Python is often used for web and application development and can help automate tasks such as renaming large numbers of files.
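To give a feel for the kind of automation mentioned above, here is a minimal sketch of bulk file renaming that uses only Python's standard library. The function name `add_prefix`, the `barley_` prefix, and the file names are hypothetical and chosen purely for illustration. The demonstration runs in a temporary folder, so none of your real files are touched.

```python
from pathlib import Path
import tempfile


def add_prefix(folder: Path, prefix: str, pattern: str = "*.csv") -> list[Path]:
    """Add `prefix` to the name of every file in `folder` that matches `pattern`."""
    renamed = []
    # sorted() takes a snapshot of the matches before any file is renamed
    for old_path in sorted(folder.glob(pattern)):
        if old_path.name.startswith(prefix):
            continue  # already has the prefix, so running this twice is safe
        new_path = old_path.with_name(prefix + old_path.name)
        old_path.rename(new_path)
        renamed.append(new_path)
    return renamed


# Demonstration in a throwaway directory (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ["plate1.csv", "plate2.csv", "notes.txt"]:
        (folder / name).touch()

    renamed = add_prefix(folder, "barley_")
    result_names = sorted(p.name for p in folder.iterdir())
    print(result_names)
    # → ['barley_plate1.csv', 'barley_plate2.csv', 'notes.txt']
```

Only the files matching the pattern are renamed, and the text file is left alone. To try this on your own data, you would point `folder` at a real directory, ideally after testing on a copy first.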

R, on the other hand, is a powerhouse when it comes to data analysis and visualization. Although a lot can be done with R itself, such as its main use case of statistical analysis, there is a plethora of packages available to make the language even more powerful. ggplot2 is a fantastic package for data visualization, and tidyr can be incredibly useful for organizing and restructuring data (Wickham et al., 2019). More specialized packages, like DESeq2 (Love et al., 2014) for RNA-seq analysis, can help with more specific tasks.

The next choice you will need to make is how you will learn the language(s) of your choice. If you are an undergraduate or graduate student, your university will likely offer courses to learn these languages. Take a look at your university’s course catalog: often these classes will be offered by biology or plant science departments, but the computer science, math, or statistics departments may also have courses that will be of interest. Learning in a structured course has advantages such as having a knowledgeable professor or TA, having access to office hours if you need assistance, and meeting classmates who can provide support.  

If you are not able to take a class in person, some universities like Harvard and MIT (among others) offer open-access classes that mimic the actual courses taught at those institutions, with the advantage of being available entirely for free. You can typically complete these at your own pace, which is valuable for those with busy schedules. However, you will not get the same feedback as in a live, in-person course, which some learners may want.

Another option for learning to code would be an online course from a resource like Codecademy, edX, or Coursera. These platforms host courses and tutorials for a plethora of languages and also offer you access to a community of people taking the same course as you. Some also offer lesson plans and feedback from a real instructor. Many online resources have both free and paid options, allowing you to learn for much less money than a university course.  

Learning to code is no different from learning any other skill, and you will encounter some bumps along the road. Online forums like Stack Overflow are full of programmers who have likely encountered the same errors or bugs as you. If you are affiliated with a university, you may be able to join groups that meet regularly to discuss coding, statistics, and other aspects of computing as they relate to biology research. There may even be people in your own lab who know how to code and would be more than happy to help you out should you need assistance.

Once you become comfortable with coding in R and/or Python, you may want to branch out to other languages. SQL may be of use if you work with or want to create databases, and Perl can be useful for reading through large text files. You might even find yourself writing small programs to help automate your tasks outside of science!

 

References

Harris, C.R., Millman, K.J., van der Walt, S.J. et al. Array programming with NumPy. Nature 585, 357–362 (2020), https://doi.org/10.1038/s41586-020-2649-2

The pandas development team, pandas-dev/pandas: Pandas, 2020, https://doi.org/10.5281/zenodo.3509134

Peter J. A. Cock, Tiago Antao, Jeffrey T. Chang, Brad A. Chapman, Cymon J. Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, Michiel J. L. de Hoon, Biopython: freely available Python tools for computational molecular biology and bioinformatics, Bioinformatics, Volume 25, Issue 11, June 2009, Pages 1422–1423, https://doi.org/10.1093/bioinformatics/btp163 

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T., Miller, E., Bache, S., Müller, K., Ooms, J., Robinson, D., Seidel, D., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686, https://doi.org/10.21105/joss.01686

Love MI, Huber W, Anders S (2014). “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15, 550. doi:10.1186/s13059-014-0550-8. 

 

______________________________________________

About the Authors

Villő Bernád

Villő Bernád is a final-year PhD student at UCD and a 2025 Plantae Fellow. She studies waterlogging stress in barley, and her research interests lie in the fields of bioinformatics and computational biology. You can find her on X: @BernadVillo.

Maya Sealander

Maya is a graduate student at the University of Missouri and a 2025 Plantae Fellow. She spends most of her days shining bright lights on plants to investigate the mechanisms behind ROS production. When she’s not in the lab, she enjoys doing art projects, playing Pokemon, and eating vegan sushi.