Bioinformatics comprises many subfields, which can make the field as a whole difficult to grasp. By and large, it is regarded as the intersection of computer science and biology.
Predominantly, software is applied to biological data to condense, extract, and clean it, and to separate what is useful from what is not. This is often accomplished through a “pipeline”: a series of software algorithms that process and compute raw sequencing data. Pipelines differ, and each is tailored to the vendor or customer, whether by independent parties or by an in-house lab.
The Process of Data Analysis Within Bioinformatics
The processing of raw data is accomplished with programming languages such as C++, Java, SQL, or Python, with Python being the most prominent. Software libraries such as pandas, Matplotlib, and NumPy are used alongside these languages to make processing more productive, faster, and easier to interpret. The work follows a sequential process that does not vary much from the outline below:
- Data extraction
- Data cleaning
- Data wrangling
Firstly, you must acquire the raw data you wish to assay, which can be delivered in formats such as CSV, XML, or JSON. This data can be generated through lab processes, purchased, or provided by external sources; distribution practices vary. Once obtained, the data is cleaned, a step that is itself data-dependent. Cleaning can include correcting wrong or invalid entries, imputing missing values and empty cells, removing duplicates, and more.
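As a minimal sketch of this cleaning step with pandas (the column names and values here are hypothetical, and a real workflow would load the data with pd.read_csv or similar):

```python
import pandas as pd

# Hypothetical raw data containing a duplicate row and missing values
raw = pd.DataFrame({
    "sample_id": ["S1", "S2", "S2", "S3"],
    "read_count": [1200, None, None, 850],
})

cleaned = (
    raw.drop_duplicates(subset="sample_id")  # remove duplicate records
       .fillna({"read_count": 0})            # fill empty cells with a default
)
```

How missing values are handled (filling with a default, imputing a mean, or dropping the row with dropna) depends entirely on the data and the downstream analysis.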
Data wrangling generally follows. Some data cleaning can fall within this process, but wrangling mostly refers to transforming the data into more readily usable formats. This can include arranging the data hierarchically, indexing it for quick access, and merging tables. What is performed depends on what type of data is being leveraged and what the researcher wishes to analyze.
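The merging and indexing described above can be sketched with pandas as follows (the tables and column names are illustrative assumptions, not a fixed schema):

```python
import pandas as pd

# Hypothetical metadata and measurement tables
samples = pd.DataFrame({"sample_id": ["S1", "S2"],
                        "tissue": ["liver", "brain"]})
counts = pd.DataFrame({"sample_id": ["S1", "S2"],
                       "gene": ["TP53", "TP53"],
                       "count": [40, 7]})

# Merge the two tables, then build a hierarchical index for quick access
merged = samples.merge(counts, on="sample_id").set_index(["tissue", "gene"])
```

With the hierarchical index in place, a single lookup such as `merged.loc[("liver", "TP53"), "count"]` retrieves a value directly instead of scanning the whole table.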
The final step, and the one most people associate with bioinformatics and data interpretation, is data exploration. This is where visual representations of critical data are produced, which can include creating tables and charts (often with software libraries) and building statistical models. This step is where the true value of bioinformatics comes into play.
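A minimal example of this visualization step with Matplotlib, using hypothetical gene-expression counts, might look like:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, suitable for scripted pipelines
import matplotlib.pyplot as plt

# Hypothetical per-gene read counts to visualize
genes = ["TP53", "BRCA1", "EGFR"]
expression = [40, 7, 23]

fig, ax = plt.subplots()
ax.bar(genes, expression)          # one bar per gene
ax.set_xlabel("Gene")
ax.set_ylabel("Read count")
fig.savefig("expression.png")      # export the chart for reports or review
```

In an interactive session, plt.show() would display the figure instead of (or in addition to) saving it to disk.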
The Challenges of Bioinformatics
Caution should be exercised when performing data processing and, indeed, bioinformatics in general. The experimental design should be peer-reviewed for validity, heterogeneous analysis strategies should be considered when the data calls for them, figures and data representations should be tailored to the right audience to aid interpretability, and files should be exported in formats that can be used across multiple software packages.
Software that accomplishes this kind of processing includes, but is not limited to:
- Hyrax Exatype
- DeepChek HIV
Processing of Genome Files (Omics)
The study of omics (panomics, integrative omics, multi-omics, proteomics, and more) refers to the assaying and analysis of cellular molecules such as RNA, DNA, and proteins, and how they differ at the individual and species level. Some common concepts should be highlighted when processing genome files, whether in AB1, FASTA, EMBL, or some other format.
First, it is helpful to view the sequence alongside its per-nucleobase quality values. Most software handles this raw-material processing, but it is still something to consider. Sequence/base quality is represented as a numerical value indicating whether the base readout is reliable.
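These per-base quality values are Phred scores, where Q = -10·log10(P) for an error probability P; in FASTQ files they are stored as ASCII characters, usually with an offset of 33. A small sketch of the decoding:

```python
def phred_to_prob(char, offset=33):
    """Convert an ASCII-encoded Phred quality character to the probability
    that the base call is wrong (Q = -10 * log10(P), so P = 10 ** (-Q/10))."""
    q = ord(char) - offset
    return 10 ** (-q / 10)

# 'I' encodes Q40: roughly a 1-in-10,000 chance the base call is wrong
# '!' encodes Q0: the base call carries essentially no information
```

Higher scores therefore mean more trustworthy bases, which is what quality-control filters act on later in the pipeline.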
Sequence alignment checks for mutations, alterations, and disparities between the target transcript and the reference. These alignments, usually stored as SAM, BAM, or AB1 files, help show whether you have obtained the desired transcript, provided the coverage is good enough.
Types of Processing for Next-Gen Sequencing
To begin, quality control is carried out, which consists of removing sequences with low quality values (as these most likely reflect missed nucleotides and unreliable alignments). An important part of this step is checking for external contamination (where a pathogen or other foreign sequence can interfere with the cDNA analyte) and cross-contamination (where DNA is carried over or altered from one sample to another during benchwork).
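The quality-filtering part of this step can be sketched as follows; the mean-quality criterion, the threshold of 20, and the read data are illustrative assumptions, since real QC tools apply more elaborate rules:

```python
def mean_quality(quals):
    """Average Phred quality score across a read."""
    return sum(quals) / len(quals)

def passes_qc(quals, threshold=20):
    """Keep a read only if its mean quality meets the threshold."""
    return mean_quality(quals) >= threshold

# Hypothetical reads with their per-base Phred scores
reads = {
    "read1": [38, 40, 39, 37],  # high quality: kept
    "read2": [12, 9, 15, 11],   # low quality: likely an unreliable alignment
}
kept = {name: quals for name, quals in reads.items() if passes_qc(quals)}
```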
Next, a consensus is obtained, displayed in an amino acid frequency table or a codon table, where the whole of the transcript is more easily interpreted. Once data storage has been sorted out, real-time algorithms can be applied to the data, a stage many consider the final and most valuable.
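Building a simple codon frequency table can be sketched as below (the sequence is a toy example, and this assumes the transcript is already in frame):

```python
from collections import Counter

def codon_table(seq):
    """Count codon frequencies across an in-frame nucleotide sequence,
    ignoring any trailing bases that do not complete a codon."""
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return Counter(codons)

# "ATGGCTGCTTAA" splits into ATG, GCT, GCT, TAA
table = codon_table("ATGGCTGCTTAA")
```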
One of the most widely used file types in omics is the variant call format (VCF) file. VCF files are tab-delimited text files used to store information on genomic variants. A VCF file generally has meta-information lines, a header of field definitions (describing how the data is organized), and a body of raw data. Meta-information lines are denoted by the ‘##’ prefix and the header line by a single ‘#’, a commenting convention familiar to Python users and other coders.
The information retained regarding the genetic code varies from lab to lab but will often include the chromosome number, the position of each variant, a variant identifier, the reference and alternate alleles, PHRED-scaled quality scores, and a description of each genomic variant.
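A minimal reader for these fields might look like the sketch below; it parses only the first six standard columns (CHROM, POS, ID, REF, ALT, QUAL) of a tiny inline example and skips the ‘##’ meta-information and ‘#CHROM’ header lines:

```python
def parse_vcf_lines(lines):
    """Parse minimal VCF records, skipping meta-information and header lines."""
    records = []
    for line in lines:
        if line.startswith("#"):
            continue  # '##' meta-information or the '#CHROM' column header
        chrom, pos, vid, ref, alt, qual = line.rstrip("\n").split("\t")[:6]
        records.append({"chrom": chrom, "pos": int(pos), "id": vid,
                        "ref": ref, "alt": alt, "qual": float(qual)})
    return records

# A toy two-header, one-record VCF
vcf = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t14370\trs6054257\tG\tA\t29\tPASS\t.\n",
]
records = parse_vcf_lines(vcf)
```

Real VCF files also carry FILTER, INFO, and per-sample genotype columns, which dedicated libraries handle far more robustly than this sketch.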