A novel statistical technique makes it easier to find biologically significant changes in genomic data that span multiple conditions, like cell types or tissues.
Whole-genome studies generate massive volumes of data, ranging from millions of individual DNA sequences to information about how many of the thousands of genes are expressed and where, as well as the locations of functional elements throughout the genome. Because of the volume and complexity of the data, comparing different biological conditions or studies conducted by different labs can be statistically challenging.
“The difficulty when you have multiple conditions is how to analyze the data together in a way that can be both statistically powerful and computationally efficient. Existing methods are computationally expensive or produce results that are difficult to interpret biologically.”
Qunhua Li, Associate Professor, Statistics, The Pennsylvania State University
Qunhua Li adds, “We developed a method called CLIMB that improves on existing methods, is computationally efficient, and produces biologically interpretable results. We test the method on three types of genomic data collected from hematopoietic cells—related to blood stem cells—but the method could also be used in analyses of other ‘omic’ data.”
The CLIMB (Composite LIkelihood eMpirical Bayes) technique is described in the research published in the journal Nature Communications.
“In experiments where there is so much information but from relatively few individuals, it helps to be able to use information as efficiently as possible. There are statistical advantages to be able to look at everything together and even to use information from related experiments. CLIMB allows us to do just that.”
Hillary Koch, Senior Statistician, Moderna
Hillary Koch was a graduate student at Penn State at the time of the research.
To analyze data across multiple conditions, the CLIMB method combines ideas from two conventional approaches. The first relies on a series of pairwise comparisons between conditions, but its results become extremely difficult to interpret as more conditions are added.
A different method integrates each subject’s activity pattern across conditions into an “association vector,” such as a gene that is up-regulated, down-regulated, or unchanged in each of numerous cell types. The association vector represents the pattern of condition specificity directly and is simple to interpret.
However, because many distinct combinations are possible even when only a few conditions are present, the calculations are extremely computationally intensive. To address this issue, the second approach makes simplifying assumptions about the data that are not necessarily correct.
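To get a sense of the scale involved, here is a minimal sketch that assumes each subject takes one of three states in every condition. The numeric coding (-1, 0, 1) and the function name are illustrative assumptions, not part of CLIMB itself.

```python
from itertools import product

# Assumed coding: -1 = down-regulated, 0 = unchanged, 1 = up-regulated.
STATES = (-1, 0, 1)


def count_association_vectors(n_conditions: int) -> int:
    """Number of possible association vectors with three states per condition."""
    return len(STATES) ** n_conditions


# Enumerate explicitly for a small case to confirm the formula.
all_vectors_3 = list(product(STATES, repeat=3))
assert len(all_vectors_3) == count_association_vectors(3)

# The count grows exponentially with the number of conditions.
for d in (2, 5, 10, 17):
    print(d, count_association_vectors(d))
```

Under this assumption, three conditions already allow 27 patterns, and every added condition triples the count.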
“CLIMB uses aspects of both of these approaches. We ultimately analyze association vectors, but first we use pairwise analyses to identify the patterns that are likely to exist up front. Rather than making assumptions about the data, we use the pairwise information to eliminate combinations that the data don't strongly support. This dramatically reduces the space of possible patterns across conditions that would otherwise make the computations so intensive.”
Hillary Koch, Senior Statistician, Moderna
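The pruning idea Koch describes can be illustrated with a small sketch. It assumes the pairwise analyses have already produced, for each pair of conditions, a set of supported two-condition patterns, and it keeps only the full vectors whose every pairwise projection is supported. The data, the function name, and the brute-force enumeration are all hypothetical simplifications for illustration. They are not the published CLIMB procedure, and enumerating every vector would not scale to many conditions.

```python
from itertools import combinations, product

# Assumed coding: -1 = down-regulated, 0 = unchanged, 1 = up-regulated.
STATES = (-1, 0, 1)


def prune_candidates(n_conditions, supported_pairs):
    """Keep vectors whose projection onto every pair of conditions is supported.

    supported_pairs maps (i, j) with i < j to a set of (state_i, state_j) tuples.
    """
    kept = []
    for vec in product(STATES, repeat=n_conditions):
        if all(
            (vec[i], vec[j]) in supported_pairs[(i, j)]
            for i, j in combinations(range(n_conditions), 2)
        ):
            kept.append(vec)
    return kept


# Hypothetical pairwise results for three conditions (made up for illustration).
supported_pairs = {
    (0, 1): {(0, 0), (1, 1)},
    (0, 2): {(0, 0), (1, 1), (1, -1)},
    (1, 2): {(0, 0), (1, 1), (1, -1)},
}

candidates = prune_candidates(3, supported_pairs)
print(len(candidates), "of", len(STATES) ** 3, "patterns remain")
```

In this toy example only 3 of the 27 possible patterns survive, which is the kind of reduction that makes the later analysis of association vectors tractable.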
Once the reduced set of possible association vectors has been compiled, the approach groups together subjects that exhibit the same pattern across conditions. For instance, the findings could reveal sets of genes that are up-regulated in some cell types but down-regulated in others.
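As a rough sketch of that grouping step, the snippet below assumes each subject has already been assigned a single association vector and simply collects subjects that share one. CLIMB makes these assignments with a statistical model. The gene names, patterns, and function name here are hypothetical.

```python
from collections import defaultdict


def group_by_pattern(assignments):
    """Group subjects (e.g. genes) by their assigned association vector."""
    groups = defaultdict(list)
    for subject, pattern in assignments.items():
        groups[tuple(pattern)].append(subject)
    return dict(groups)


# Hypothetical assignments across three cell types:
# 1 = up-regulated, 0 = unchanged, -1 = down-regulated.
assignments = {
    "geneA": (1, 1, -1),
    "geneB": (0, 0, 0),
    "geneC": (1, 1, -1),
}

groups = group_by_pattern(assignments)
for pattern, genes in groups.items():
    print(pattern, genes)
```

Here geneA and geneC land in the same group because both are up-regulated in the first two cell types and down-regulated in the third.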
The researchers evaluated their method on data from RNA-seq experiments, a technique that quantifies the amount of RNA produced by all the genes expressed in a cell, to see whether particular genes help determine which types of cells a hematopoietic stem cell eventually becomes.
Li says, “Compared to the popular pair-wise method, our results are more specific. Our gene list is more succinct and biologically more relevant.”
While the classic pair-wise method yielded a list of six to seven thousand genes of interest, CLIMB yielded a significantly smaller list of two to three thousand genes, with at least a thousand of those genes being detected in both analyses.
“The different blood cell types have a variety of functions (some become red blood cells and others become immune cells), and we wanted to know which genes are more likely to be involved in determining each distinct cell type,” said Ross Hardison, T. Ming Chu Professor of Biochemistry and Molecular Biology at Penn State.
Ross Hardison adds, “The CLIMB approach pulled out some important genes; some of them we already knew about and others add to what we know. But the difference is these results were a lot more specific and a lot more interpretable than those from previous analyses.”
CLIMB was also applied to data generated by a different experimental approach, ChIP-seq, which can determine where particular proteins bind to DNA along the genome. The researchers investigated how the binding of CTCF, a transcription factor that helps establish the interactions required for gene regulation in the cell nucleus, varies or stays constant among 17 cell populations derived from the same hematopoietic stem cell. The CLIMB analysis revealed several categories of CTCF-bound sites, some of which suggest a role for this transcription factor in all blood cells and others in particular cell types.
Finally, the researchers compared the accessibility of chromatin, a complex of DNA and proteins, in 38 human cell types using data from yet another experimental technology called DNase-seq, which can detect regulatory regions.
Koch notes, “For all three tests, we wanted to see if our results had biological relevance, so we compared our results against independent data, such as studies of high-throughput sequencing of histone modifications and transcription factor footprinting.”
“In each case, our results correspond with these other methods. Next, we would like to improve the computational speed of our method and increase the number of conditions it can handle. For example, chromatin-accessibility data are available for many more cell types, so we’d love to increase the scale of CLIMB,” concluded Koch.
Koch, H., et al. (2022) CLIMB: High-dimensional association detection in large scale genomic data. Nature Communications. doi.org/10.1038/s41467-022-34360-z.