CARA Benchmark Set to Revolutionize Data-Driven Drug Discovery Methods

Modern drug discovery methods rely on predicting the binding affinities or biochemical activity of potential drugs against target proteins. With the rise of data-driven methods that predict compound activity using artificial intelligence (AI), a compound activity prediction benchmark that can evaluate these methods for real-world applications is needed.

In a recent study published in Communications Chemistry, researchers presented the Compound Activity benchmark for Real-world Applications (CARA), a compound activity prediction benchmark curated for practical use.

Study: Benchmarking compound activity prediction for real-world drug discovery applications. Image Credit: paulista/


Although predicting the binding affinities of potential drug compounds against a target protein is an essential step of the modern drug discovery process, the drug discovery pipeline consists of numerous steps to characterize and predict the activity of the compounds and optimize the drugs.

Modern, data-driven methods of predicting compound activity, such as AI, deep learning, and machine learning, are more efficient and accurate than traditional knowledge-based methods, such as computer-aided drug design.

The success of data-driven methods in predicting binding affinities and compound activity depends on learning compound activity patterns from high-quality, large-scale data.

However, compound activity is measured using various cell-based experiments and biochemical and biophysical methods, which makes obtaining large-scale data challenging.

Moreover, despite the availability of various large-scale benchmark datasets on compound activity, a benchmark designed to evaluate these data-driven methods from a real-world perspective is lacking.

About the Study

In the present study, the researchers curated a benchmark called CARA, based on the characteristics of real-world data, to evaluate compound activity prediction methods for practical applications.

To develop CARA, the researchers first analyzed the compound activity data from existing drug-discovery processes in the ChEMBL database.

Activity data in ChEMBL are grouped by assay: each assay catalogs measurements made under the same conditions against the same protein target for a set of different compounds.

The researchers first filtered the ChEMBL data to retain assays with single protein targets and small-molecule ligands with molecular weights below 1,000. They also removed samples that were poorly annotated or had missing values.

The samples were then arranged by measurement type, and replicate measurements were merged by taking their median as the final value. The compound activity data were then separated into virtual screening and lead optimization categories.
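The filtering and replicate-merging steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' pipeline: the record fields (`assay_id`, `compound_id`, `mol_weight`, `measure_type`, `value`) and the sample values are hypothetical, and only the 1,000 molecular-weight cutoff and the use of the median come from the description above.

```python
from collections import defaultdict
from statistics import median

MAX_MOL_WEIGHT = 1000  # cutoff mentioned in the article


def clean_and_merge(records):
    """Drop unusable records, then merge replicates by their median.

    `records` is a list of dicts with hypothetical field names; real
    ChEMBL exports use a different schema.
    """
    groups = defaultdict(list)
    for rec in records:
        # Skip records with a missing activity value or molecular weight.
        if rec.get("value") is None or rec.get("mol_weight") is None:
            continue
        # Keep only small molecules below the weight cutoff.
        if rec["mol_weight"] >= MAX_MOL_WEIGHT:
            continue
        key = (rec["assay_id"], rec["compound_id"], rec["measure_type"])
        groups[key].append(rec["value"])
    # One final value per (assay, compound, measurement type).
    return {key: median(values) for key, values in groups.items()}


# Invented toy records: three replicates, one oversized ligand, one missing value.
records = [
    {"assay_id": "A1", "compound_id": "C1", "mol_weight": 350.0, "measure_type": "IC50", "value": 10.0},
    {"assay_id": "A1", "compound_id": "C1", "mol_weight": 350.0, "measure_type": "IC50", "value": 30.0},
    {"assay_id": "A1", "compound_id": "C1", "mol_weight": 350.0, "measure_type": "IC50", "value": 14.0},
    {"assay_id": "A1", "compound_id": "C2", "mol_weight": 1250.0, "measure_type": "IC50", "value": 5.0},
    {"assay_id": "A1", "compound_id": "C3", "mol_weight": 410.0, "measure_type": "Ki", "value": None},
]
merged = clean_and_merge(records)
```

The median is less sensitive than the mean to a single outlying replicate, which is one plausible reason for preferring it when merging measurements.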

Virtual screening increases efficiency and success rates while lowering experimental screening costs. The lead optimization stage is needed to ensure that candidate compounds will be effective in clinical experiments.

The assays were divided into training and test sets, with assays covering varied protein targets used as test sets to evaluate the different compound activity prediction models. Training and test sets were defined for both virtual screening and lead optimization tasks.
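An assay-level split of this kind might look like the sketch below. The selection rule used here (at most one held-out assay per protein target) is an assumption made for illustration, as are the function name and the toy assay-to-target mapping. The article does not describe the exact procedure the researchers used.

```python
def split_assays(assay_targets, n_test):
    """Hold out whole assays, picking at most one per protein target.

    `assay_targets` maps assay id -> target id (hypothetical names).
    Picking one assay per target is an illustrative way to obtain
    varied targets in the test set, not the paper's exact procedure.
    """
    test, seen_targets = [], set()
    for assay_id in sorted(assay_targets):  # sorted for reproducibility
        target = assay_targets[assay_id]
        if len(test) < n_test and target not in seen_targets:
            test.append(assay_id)
            seen_targets.add(target)
    # Every assay not held out goes to training, so no assay is in both sets.
    train = [a for a in sorted(assay_targets) if a not in test]
    return train, test


# Invented toy mapping of assays to protein targets.
assay_targets = {"A1": "T1", "A2": "T1", "A3": "T2", "A4": "T3", "A5": "T2"}
train_assays, test_assays = split_assays(assay_targets, n_test=2)
```

Splitting by whole assays, instead of by individual measurements, keeps all compounds from a held-out assay out of the training data.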

The data splitting also covered two scenarios that reflect different application settings. In the zero-shot scenario, no task-related data was available; in the few-shot scenario, measurements for some samples were available.

A range of deep learning and machine learning methods and training strategies for predicting compound activity were evaluated using CARA. These included DeepCPI, which used singular value decomposition; DeepDTA, which is based on convolutional neural networks; and GraphDTA, which used graph neural networks.
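The difference between the two scenarios can be illustrated with a small sketch. The function name, the `n_support` parameter, and the toy samples are hypothetical. The only idea taken from the article is that the model sees either none or a few labeled measurements from the task it is tested on.

```python
import random


def make_task(assay_samples, n_support=0, seed=0):
    """Split one test assay into a support set and a query set.

    n_support == 0 gives the zero-shot setting (nothing from the assay
    is revealed to the model); n_support > 0 gives a few-shot setting.
    """
    rng = random.Random(seed)
    shuffled = list(assay_samples)
    rng.shuffle(shuffled)
    support = shuffled[:n_support]  # labeled samples the model may use
    query = shuffled[n_support:]    # samples the model is scored on
    return support, query


# Invented toy assay: (compound id, activity value) pairs.
samples = [("C%d" % i, 5.0 + 0.1 * i) for i in range(10)]
zs_support, zs_query = make_task(samples, n_support=0)  # zero-shot
fs_support, fs_query = make_task(samples, n_support=3)  # few-shot
```

In the zero-shot case every sample of the test assay is used only for scoring, while in the few-shot case a handful of labeled samples can be used to adapt or fine-tune a model before it is scored on the rest.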

Major Findings

The findings indicated that, by carefully distinguishing assay types and selecting evaluation metrics, CARA could account for the biased distribution of real-world compound activity data and prevent overestimation of model performance.

The assay-based evaluation metrics used in CARA provided more accurate and comprehensive results than bulk-evaluation metrics.
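A toy calculation shows why scoring each assay separately can give a different picture from pooling all samples. The numbers below are invented, and Pearson correlation is used only as a stand-in metric. The article does not specify which metrics CARA uses.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented data: assay id -> (true activities, predicted activities).
# Within each assay the predicted ranking is exactly reversed.
assays = {
    "A1": ([5.0, 5.2, 5.4], [5.4, 5.2, 5.0]),
    "A2": ([8.0, 8.2, 8.4], [8.4, 8.2, 8.0]),
}

# Assay-based evaluation: score each assay separately, then average.
per_assay = {a: pearson(true, pred) for a, (true, pred) in assays.items()}
assay_level = sum(per_assay.values()) / len(per_assay)

# Bulk evaluation: pool all samples and score once.
all_true = [v for true, _ in assays.values() for v in true]
all_pred = [v for _, pred in assays.values() for v in pred]
bulk = pearson(all_true, all_pred)
```

In this example the pooled correlation is strongly positive because the model captures the gap between the two assays, while the per-assay average is strongly negative because it ranks compounds within each assay in the wrong order. A bulk metric alone would overstate how useful such a model is for choosing compounds within a single assay.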

Tests in few-shot scenarios also revealed that training strategies exploiting cross-assay information were more effective for virtual screening, while those relying on single-task information were better suited to lead optimization.

The researchers also found that the performance of various deep-learning and machine-learning methods differed across assays, and that these methods had limitations in estimating uncertainty at the sample level and in predicting activity cliffs.


The study aimed to develop a drug discovery benchmark that could evaluate compound activity prediction from a real-world application perspective.

The findings indicated that CARA provided a high-quality, assay-based, large-scale dataset that could be used to evaluate and develop models for predicting compound activity. The researchers believe that CARA can pave the way to the development of more efficient data-driven drug discovery models.

Please use one of the following formats to cite this article in your essay, paper or report:

  • APA

    Sidharthan, Chinta. (2024, June 11). CARA Benchmark Set to Revolutionize Data-Driven Drug Discovery Methods. AZoLifeSciences. Retrieved on July 22, 2024 from

