Like any other industry, the scientific industry, which spans basic science, clinical research, and the treatment of patients, generates big data. With the life sciences community under pressure to generate and publish data faster than ever, big data offers this industry new ways to make meaningful scientific discoveries quickly and efficiently.
What is big data?
Although accessing and storing large amounts of data for analytical purposes has been a challenge for many years, it was not until the early 2000s that the term “big data” was coined. During this period, industry analyst Doug Laney defined big data in terms of three V’s: volume, velocity, and variety.
Most organizations collect their data from several different sources, some of which can include industrial equipment, videos and pictures, social media, business transactions, smart devices, and much more. Previously, the storage of all of this data would have been a challenge; however, the evolution of big data has allowed for this vast volume of data to be stored on affordable and easily accessible platforms for these organizations.
In addition to volume, the velocity aspect of big data refers to the rapid rate at which data is collected and handled by these organizations. Much of this data streams in from electronic devices such as sensors, smart meters, and radio frequency identification (RFID) tags or transponders, so storage and analysis must keep pace with it.
Thirdly, the variety of big data refers to the many different formats of data that are collected and stored by different organizations. Data can be structured and numerical, or it can be unstructured, as in text documents, emails, videos, pictures, and audio files; it also includes stock ticker data and financial transactions.
Big data in scientific discovery
Today, life sciences researchers have a substantial amount of data available to them in many different forms, ranging from high-throughput screening and mass spectrometry data to metabolomic, transcriptomic, and phenotyping data. Taken together, this data is crucial to advancing scientific discovery, as it can expand the understanding of how diseases arise in the first place and so assist in the development of new and effective preventative and treatment options.
Although a vast amount of work and money has gone into producing this spectrum of data, researchers often struggle to interpret and analyze such a massive amount of information. Consider gene sequencing, which many clinical researchers use to identify genetic mutations that have been linked to diseases ranging from developmental disabilities to cancer.
Gene sequencing studies can produce terabytes of data, which can quickly become unmanageable to analyze, especially when this dataset is combined with proteomic and metabolomic data.
Challenges for big data in life sciences
This is where big data can revolutionize how life sciences studies are conducted. In this situation, a big data platform can combine the gene sequencing information with the applicable proteomic and metabolomic data in a single place. While this may seem like a straightforward solution to the problem, it is important to recognize that it would require integrating data from hundreds of different sources in a way that allows researchers to analyze and interpret the data effectively.
Unfortunately, few technological solutions have been able to cope with the immense scale and variety of this data. Furthermore, the big data solution required by the life sciences industry would need not only to manage the sheer volume of data that is already available, but also to keep up with the growing amount of data that is published each day.
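To make the integration step concrete, the following is a minimal sketch in Python of combining records from several sources into one record per sample. It assumes that every source identifies samples with a shared `sample_id` field; the function names, source names, and toy values are hypothetical and are not taken from any real platform or dataset.

```python
from collections import defaultdict


def integrate_by_sample(sources):
    """Merge records from several named sources into one record per sample.

    `sources` maps a source name to a list of dicts, each of which carries a
    "sample_id" key (an assumption made for this sketch).
    """
    merged = defaultdict(dict)
    for source_name, records in sources.items():
        for record in records:
            sample_id = record["sample_id"]
            for field, value in record.items():
                if field != "sample_id":
                    # Prefix each field with its source so names never collide.
                    merged[sample_id][f"{source_name}.{field}"] = value
    return dict(merged)


def complete_samples(merged, source_names):
    """Return the sorted IDs of samples that have data from every source."""
    required = set(source_names)
    complete = []
    for sample_id, fields in merged.items():
        present = {name.split(".", 1)[0] for name in fields}
        if required <= present:
            complete.append(sample_id)
    return sorted(complete)


# Made-up toy records standing in for three separate data sources.
sources = {
    "genomic": [
        {"sample_id": "S1", "variant": "variant_A"},
        {"sample_id": "S2", "variant": "variant_B"},
    ],
    "proteomic": [
        {"sample_id": "S1", "protein_level": 1.5},
    ],
    "metabolomic": [
        {"sample_id": "S1", "metabolite_level": 0.25},
        {"sample_id": "S2", "metabolite_level": 0.75},
    ],
}

merged = integrate_by_sample(sources)
print(merged["S1"])
print(complete_samples(merged, sources))
```

Even in this toy case, sample `S2` has no proteomic record, so it is excluded from the list of complete samples. With hundreds of sources, gaps and mismatched identifiers of this kind are a large part of what makes integration difficult.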
It is estimated that over 200,000 clinical trials are currently active, which include 21,000 drug components, 1,357 unique drugs, 22,000 genes, and several hundred thousand proteins. Each of these areas of study involves many different tests and experiments that produce a wide range of data. Moreover, over 24 million scientific and medical articles have been published to date, with an estimated 1.8 million new articles appearing each year.
No single researcher can adequately absorb this volume of information. Since the average researcher reads between 250 and 300 articles each year, scientists are missing many opportunities to access information that could potentially contribute to their own research endeavors.
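The scale of this gap can be worked out directly from the figures quoted above. The short Python sketch below uses only those numbers (1.8 million new articles per year and 250 to 300 articles read per year); the variable and function names are illustrative.

```python
NEW_ARTICLES_PER_YEAR = 1_800_000       # estimate quoted in the article
ARTICLES_READ_PER_YEAR = (250, 300)     # range quoted in the article


def share_read(read_per_year, published_per_year):
    """Fraction of one year's new articles that a single researcher reads."""
    return read_per_year / published_per_year


low, high = (share_read(n, NEW_ARTICLES_PER_YEAR) for n in ARTICLES_READ_PER_YEAR)
print(f"Share of new articles read per year: {low:.4%} to {high:.4%}")

# Years needed to read a single year's output at the faster reading rate.
years_needed = NEW_ARTICLES_PER_YEAR / max(ARTICLES_READ_PER_YEAR)
print(f"Years to read one year's output: {years_needed:,.0f}")
```

On these figures, a researcher reads less than 0.02% of each year's new articles, and reading a single year's output at 300 articles per year would take about 6,000 years.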
To overcome these challenges, several different bioinformatic workflow systems and Workflow Management Systems (WMSs) have been developed to analyze and process existing biological data. Publicly available software includes Galaxy, BioMOBY, Ergatis, Taverna, GenePattern, and OMICtools. Each of these WMSs provides a graphical user interface that supports the analysis of biological data.
As the technology behind machine learning and deep learning continues to advance, life sciences researchers are hopeful that these techniques will be able to meet the growing demand to process and analyze biological big data. Deep learning techniques that have been explored for this purpose include artificial neural networks (ANNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), and autoencoders.
As technology continues to allow for the development of even more efficient tools and software platforms, life sciences researchers will be better equipped to manage and analyze biological big data.