Recent developments in bioinformatics, and particularly in massively parallel sequencing (MPS) for DNA analysis, have allowed the extraction of invaluable information without incurring very high costs.
Certain areas of DNA that were previously deemed too time-consuming to analyze can now be studied, yielding extra information that would otherwise have been lost.
The current state of bioinformatics has made it easier to obtain information from different DNA markers, such as short tandem repeats, single nucleotide polymorphisms, and mitochondrial genome polymorphisms. Bioinformatics also makes use of open-access databases such as the Genome Aggregation Database (gnomAD) and the Exome Aggregation Consortium (ExAC) database.
Data preparation and analysis of DNA fragments
Before the amplified DNA fragments are sequenced, short sequences are appended to the DNA during library preparation, which makes the sequencing process more likely to succeed.
Polymerase chain reaction (PCR) primers may be used to target the sequence of interest, an index sequence can be used to identify the sample of origin, and adapter sequences are added at the ends so that each library can attach to a solid surface for sequencing. This preparation is the first step in an MPS analysis pipeline.
The next steps may include quality assessment and, in some cases, alignment. Quality control tools remove poor-quality bases using Phred quality scores (also known as Q scores). A Phred score is an integer that encodes the estimated probability P that a base call is incorrect, as Q = -10 log10(P), so Q20 corresponds to a 1% error probability and Q30 to 0.1%.
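The relationship between a Q score and its error probability can be sketched in a few lines of Python. This is an illustrative sketch only: the function names are hypothetical, and the quality-string decoding assumes the common Phred+33 ASCII encoding used in FASTQ files, which the text above does not specify.

```python
import math


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred quality score Q into an error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> int:
    """Convert an error probability P into the nearest integer Phred score."""
    return round(-10 * math.log10(p))


def decode_quality_string(quality_string: str, offset: int = 33) -> list[int]:
    """Decode an ASCII quality string into Phred scores.

    Assumption: Phred+33 encoding (each character's code point minus 33).
    """
    return [ord(ch) - offset for ch in quality_string]


if __name__ == "__main__":
    for q in (10, 20, 30, 40):
        print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")
    print(decode_quality_string("I5!"))
```

A higher Q score therefore means a more trustworthy base call, with every increase of 10 reducing the estimated error probability tenfold.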
Two kinds of tools serve slightly different purposes. Read trimming removes low-quality bases from the ends of reads, whilst read filtering removes entire poor-quality reads.
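The difference between trimming and filtering can be illustrated with a minimal Python sketch. The thresholds (Q20, a minimum length of 4) and the function names are assumptions chosen for illustration. Production trimming tools typically use more elaborate strategies, such as sliding windows, which are not reproduced here.

```python
def trim_read(seq: str, quals: list[int], min_q: int = 20) -> tuple[str, list[int]]:
    """Trim bases with quality below min_q from both ends of a read."""
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def passes_filter(quals: list[int], min_mean_q: float = 20.0, min_length: int = 4) -> bool:
    """Keep a read only if it is long enough and its mean quality is high enough."""
    if len(quals) < min_length:
        return False
    return sum(quals) / len(quals) >= min_mean_q


if __name__ == "__main__":
    # Hypothetical read with poor-quality bases at both ends.
    seq = "ACGTACGT"
    quals = [5, 10, 30, 35, 32, 31, 12, 3]
    trimmed_seq, trimmed_quals = trim_read(seq, quals)
    print(trimmed_seq, trimmed_quals)
    print(passes_filter(trimmed_quals))
```

Trimming keeps the usable middle of a read, whereas filtering makes a keep-or-discard decision about the read as a whole, so the two are often applied one after the other.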
Alignment to a reference genome and variant calling (i.e. comparison of a sample sequence to a reference sequence) are the basis of genetic variant examination. The alignment step maps the reads to a reference genome, producing an alignment file.
Variant calling is then performed on the alignment file to identify the genotype at each base position of interest.
DNA data can be visualized and explored with software such as the Integrative Genomics Viewer (IGV), Geneious, and NextGENe, all of which have been used in bioinformatics-related scientific publications.
The three most important markers for forensics
Short tandem repeats (STRs) are the standard marker used in forensic DNA profiling. STR alleles are conventionally typed by length using capillary electrophoresis (CE), and many STR PCR amplification kits are commercially available.
However, as MPS has become more widespread, STR typing by MPS has come to be considered superior to CE because it can detect sequence variation within STRs and multiplex a greater number of DNA markers together.
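The advantage of sequence-based typing can be shown with a small Python sketch. Two hypothetical alleles of the same length receive the same length-based designation, as they would with CE, yet their sequences differ. The repeat sequences, the tetranucleotide motif length, and the function names are invented for illustration, and the bracketed output is a simplified notation rather than an official nomenclature.

```python
from itertools import groupby


def length_based_allele(repeat_region: str, motif_length: int = 4) -> str:
    """Designate an allele by length alone: full repeat units, plus any extra bases."""
    full_units, extra_bases = divmod(len(repeat_region), motif_length)
    return str(full_units) if extra_bases == 0 else f"{full_units}.{extra_bases}"


def sequence_based_allele(repeat_region: str, motif_length: int = 4) -> str:
    """Describe an allele by its sequence, collapsing runs of identical units."""
    units = [repeat_region[i:i + motif_length]
             for i in range(0, len(repeat_region), motif_length)]
    return " ".join(f"[{unit}]{len(list(run))}" for unit, run in groupby(units))


if __name__ == "__main__":
    # Two hypothetical alleles, both 40 bases long.
    allele_a = "TCTA" * 10
    allele_b = "TCTA" * 4 + "TCTG" + "TCTA" * 5

    for allele in (allele_a, allele_b):
        print(length_based_allele(allele), "|", sequence_based_allele(allele))
```

Both alleles are designated 10 by length, but the sequence-based description separates them, which is the extra discriminating information that MPS provides.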
Single nucleotide polymorphisms (SNPs) are the most common genetic variation in humans. SNPs are markers that can provide information from forensic samples concerning the external characteristics and the ancestry of the donor.
Single nucleotide changes that carry information about phenotype or ancestry are mostly SNPs, which by convention are present in at least about 1% of the population. Single nucleotide changes encountered in flanking regions are often novel or rare and are therefore better described as single nucleotide variants (SNVs).
SNP reporting is straightforward in terms of nomenclature: one allele is reported for each chromosome and then passed to the analysis tool. However, strandedness can cause problems in interpretation, because the same genotype reads differently depending on which DNA strand it is reported from.
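The strandedness problem can be sketched as follows. A genotype reported from the reverse strand must be complemented before it can be compared with one reported from the forward strand, and A/T and C/G SNPs look identical on both strands, so the strand cannot be inferred from the alleles alone. The function names and the "+"/"-" strand labels are assumptions made for this illustration.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_forward_strand(genotype: tuple[str, str], strand: str) -> tuple[str, ...]:
    """Express a SNP genotype on the forward strand, with alleles sorted."""
    if strand == "+":
        return tuple(sorted(genotype))
    if strand == "-":
        return tuple(sorted(allele.translate(_COMPLEMENT) for allele in genotype))
    raise ValueError(f"Unknown strand: {strand!r}")


def is_strand_ambiguous(allele_1: str, allele_2: str) -> bool:
    """A/T and C/G SNPs complement to themselves, so strand cannot be inferred."""
    return {allele_1, allele_2} in ({"A", "T"}, {"C", "G"})


if __name__ == "__main__":
    # The same hypothetical genotype reported from opposite strands.
    print(to_forward_strand(("A", "G"), "+"))
    print(to_forward_strand(("T", "C"), "-"))
    print(is_strand_ambiguous("A", "T"), is_strand_ambiguous("A", "G"))
```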
Mitochondrial DNA (mtDNA) can be used when lineage information is required and when nuclear DNA is unavailable or highly degraded. In the past, Sanger sequencing was used for analysis, but more recently the focus has moved from specific regions of variation to the entire control region and the complete mtDNA genome.
This move reduces the errors traditionally caused by mixing independently amplified hypervariable segments from different individuals, as well as interpretation mistakes and clerical errors. Studies by J. L. King et al. and M. Bodner et al. used whole mtDNA sequencing with MPS to generate haplotypes from blood and single hair shaft samples, respectively.
Combining theoretical and practical tools
The whole mtDNA genome is analyzed after alignment, whilst STRs and SNPs can be analyzed with or without alignment. To speed up analysis further, pipelines and tools for the simultaneous processing of SNPs and STRs have begun to be tested.
Many tools have been introduced, each with its own advantages and drawbacks, but it is not clear which is the most efficient or successful because their outputs lack a common qualitative and quantitative metric.
Nevertheless, it must be borne in mind that the success of bioinformatic analysis is closely tied to the quantity of DNA recovered from the sample. A literature review by Interpol showed that meaningful DNA profiling data were much more likely to be obtained when the DNA concentration recovered from a sample was above a certain threshold.
References
- Yao-Yuan Liu, SallyAnn Harbison (2018). A review of bioinformatic methods for forensic DNA analyses. Forensic Science International: Genetics, 33, pp. 117-128.
- John M. Butler, Sheila Willis (2020). Interpol review of forensic biology and forensic DNA typing 2016-2019. Forensic Science International: Synergy. In press, corrected proof.
- Michael D. Coble, Jo-Anne Bright (2019). Probabilistic genotyping software: An overview. Forensic Science International: Genetics, 38, pp. 219-224.