The discipline of bioinformatics integrates computer science with biology to acquire, store, analyze, and share data about biological systems. Most often this data is concerned with DNA and sequences of amino acids.
The rapid growth and development of computer technology, alongside significant advances in our understanding of human biology achieved by the Human Genome Project and the ventures it inspired, have driven the recent growth in bioinformatics, although the foundations of the discipline date back to the 1950s.
Here, we discuss the history of bioinformatics, how it has evolved, and what role it will play in the future of biology and medical science.
The starting point of protein analysis
Bioinformatics draws together computer technology and biology. This interdisciplinary field leverages computer science to collect, manage, and analyze biological data. Its origin can be traced back to 1953, when the double-helix structure of DNA was determined.
Scientists recognized the potential applications of being able to sequence DNA. However, it was not until around 25 years later that the first DNA sequencing methods emerged. Before working out how to sequence DNA, scientists focused on protein analysis as their starting point.
Crystallography, along with the first successful sequencing of a protein, helped to stimulate the development of more powerful protein sequencing methods. Out of this work came the Edman degradation method, which was used to sequence over 15 different protein families over the following decade.
However, Edman sequencing was not effective at revealing long protein sequences. In theory, the technique can sequence no more than around 50-60 amino acids in each reaction, so larger proteins had to be broken into shorter peptides that were sequenced separately and then pieced back together. In trying to overcome this problem, the first bioinformatics software was devised.
Developing the first bioinformatics software
American physical chemist Margaret Dayhoff is credited with establishing the use of computational methods in biochemistry. Between 1958 and 1962, she worked alongside Robert S. Ledley at the National Biomedical Research Foundation to develop COMPROTEIN, a computer program that was the first de novo sequence assembler. It used Edman peptide sequencing data to determine protein primary structure. Dayhoff later developed the one-letter amino acid code to simplify the handling of sequence data, and this code is still used today.
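To give a feel for what de novo assembly from short peptide reads involves, here is a minimal Python sketch. It is not COMPROTEIN's actual method, which is not described here; it is a simple greedy overlap merge, chosen for illustration. The function names, the `min_len` threshold, and the example peptides (written in the one-letter amino acid code) are all assumptions.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the longest such overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments that share the longest overlap.

    Illustrative greedy strategy only: it can be misled by repeated
    regions, which real assemblers have to handle more carefully.
    """
    # Remove duplicates and fragments wholly contained in another fragment.
    frags = [f for f in sorted(set(fragments))
             if not any(f != g and f in g for g in fragments)]
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            # No remaining pair overlaps enough to be merged.
            return frags
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)


# Hypothetical overlapping peptides, as might come from separate reactions.
peptides = ["IAKQRQISFV", "SFVKSHFSRQ", "MKTAYIAKQ"]
print(greedy_assemble(peptides))
```

If the fragments do not overlap by at least `min_len` residues, the function returns several separate pieces rather than guessing how they join.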
Also during this period, Emile Zuckerkandl and Linus Pauling moved away from the mechanistic modeling of enzymes to view biomolecular sequences as carriers of information. Amino acids form proteins just as letters combine to form different words with unique meanings. Zuckerkandl and Pauling hypothesized that understanding how these sequences change over time via divergence from a common ancestor could give biologists a view of the evolutionary history of proteins and allow them to reconstruct the DNA of a species’ ancestors.
It was discovered that the rate at which amino acid replacements occur in proteins is fairly consistent across species. This led biologists to theorize that a single rate of genetic evolution could be established, which Zuckerkandl and Pauling described as the ‘molecular clock’. They realized that such a clock could be used to establish the timescales of evolutionary divergences. To do this, several computational problems had to be resolved. Primarily, the scientists needed a way to reliably estimate the ‘evolutionary value’ of amino acid substitutions. Additionally, they had to address the lack of reproducible algorithms for protein sequence alignment.
In 1970, Needleman and Wunsch partly solved this issue with their dynamic programming algorithm for the pairwise alignment of protein sequences. It then took more than another decade for multiple sequence alignment (MSA) algorithms to mature, with the first practical MSA algorithm not being created until 1987.
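The arithmetic behind a molecular clock can be sketched in a few lines of Python. The example assumes two already aligned sequences, a Poisson correction for sites that changed more than once, and a single constant substitution rate on both lineages. The function names and the rate value are hypothetical and chosen purely for illustration.

```python
import math


def poisson_distance(seq_a: str, seq_b: str) -> float:
    """Estimated substitutions per site between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only columns where neither sequence has a gap character.
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(x != y for x, y in pairs) / len(pairs)  # observed fraction of differing sites
    if p >= 1.0:
        raise ValueError("sequences are too divergent for this correction")
    return -math.log(1.0 - p)  # Poisson correction for multiple changes at one site


def divergence_time(distance: float, rate_per_site_per_year: float) -> float:
    """Years since the common ancestor, assuming a constant rate.

    The factor of 2 reflects that both lineages accumulate changes
    independently after they split.
    """
    return distance / (2.0 * rate_per_site_per_year)


# Hypothetical aligned fragments and an assumed rate of 1e-9 per site per year.
d = poisson_distance("MKTAYIAKQR", "MKTAYLAKQR")
print(d, divergence_time(d, 1e-9))
```

The estimate is only as good as the assumed rate, which in practice has to be calibrated against independently dated divergences.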
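The dynamic programming idea can be illustrated with a short Python sketch. This is the commonly taught form of Needleman-Wunsch global alignment with a linear gap penalty, not a reproduction of the 1970 paper's exact formulation. The scoring values (`match`, `mismatch`, `gap`) are assumptions, and real protein alignments would normally use a substitution matrix instead of a flat match and mismatch score.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2):
    """Global alignment of `a` and `b`; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned entirely against gaps
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned entirely against gaps

    def sub(i: int, j: int) -> int:
        """Score for pairing a[i-1] with b[j-1]."""
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


total, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(total)
print(top)
print(bottom)
```

Filling the table takes time proportional to the product of the two sequence lengths, which is why extending the same idea to many sequences at once proved much harder.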
A shift from protein to DNA analysis
In 1976, the Maxam–Gilbert sequencing method became the first widely adopted DNA sequencing method. However, its reliance on radioactivity and hazardous chemicals limited its use. While manual methods of extracting information from DNA sequences existed, they relied on complex comparisons, calculations, and pattern matching, which computers perform much faster and more efficiently than humans.
In 1979, Roger Staden created the first available software for analyzing DNA data produced by Sanger sequencing. The Staden package is still used today.
Simultaneous advances in computer technology and biology
Between 1980 and 1990, molecular methods developed with the capacity to target and amplify specific genes. The establishment of techniques such as polymerase chain reaction (PCR) and Jackson, Symons, and Berg’s methodology for cutting and inserting DNA acted as milestones in DNA manipulation.
The emergence of DNA sequencing alongside these enhanced DNA manipulation techniques gave access to more and more sequence data. Additionally, computers became increasingly powerful and accessible during this period, which further advanced the field of bioinformatics.
Also at this time, the GNU Manifesto was published, which provided arguments for creating a free and shared operating system. The free software philosophy that the manifesto promoted is thought to have been instrumental in the later development of several key sequence databases and software, such as those of the European Molecular Biology Laboratory (EMBL), GenBank, and the DNA Data Bank of Japan.
Later, as use of the internet spread during the 1990s and the 2000s, bioinformatics tools proliferated rapidly, helped by the new ability to connect scientists worldwide. During this time, the Human Genome Project established the complete human genome sequence. The process took over a decade and cost billions of dollars to complete.
Now, with the significant advancements in technology that have occurred, the same process would cost roughly $1,000 and take a few days. This progress is at least partly thanks to the Human Genome Project itself, which required the development of specialized software that in turn inspired other pioneering bioinformatics software.
Today, the field of bioinformatics still faces several challenges. However, thanks to the continuing rapid development of technology, it is believed that bioinformatics will continue to undergo significant advancements and its applications will likely grow.