GPT-4 Streamlines Cell Type Identification in Single-Cell RNA Sequencing

Mar 26 2024
Reviewed by Megan Craig, M.Sc.

A new study by Columbia University’s Mailman School of Public Health shows that GPT-4, a large language model, can accurately identify cell types in single-cell RNA sequencing data. This is a significant finding because analyzing this data manually is very time-consuming. The study's results are published in Nature Methods.

GPT-4 is a large language model designed for language generation and comprehension. Evaluated across multiple tissues and cell types, it generated cell type annotations that closely matched manual annotations by human experts and outperformed existing automatic algorithms.

This capability could greatly reduce the labor and expertise required to annotate cell types, a procedure that can take months. To make automated cell type annotation with GPT-4 easier, the researchers also created GPTCelltype, an R software package.

“The process of annotating cell types for single cells is often time-consuming, requiring human experts to compare genes across cell clusters. Although automated cell type annotation methods have been developed, manual methods to interpret scientific data remain widely used, and such a process can take weeks to months. We hypothesized that GPT-4 can accurately annotate cell types, transitioning the process from manual to a semi- or even fully automated procedure and be cost-efficient and seamless.”

Wenpin Hou Ph.D., Assistant Professor, Department of Biostatistics, Mailman School of Public Health, Columbia University

The researchers evaluated GPT-4 on ten datasets that included both normal and cancer samples, hundreds of tissue and cell types, and five species. For comparison, they also assessed other GPT versions and used manual methods as a reference. They queried GPT-4 using GPTCelltype, the software tool they developed.

The researchers first examined the variables that might affect GPT-4's annotation accuracy. They found that GPT-4 works best when given the top 10 genes and shows comparable accuracy across prompt strategies: a basic prompt, a chain-of-thought-inspired prompt that includes reasoning steps, and a repeated prompt.
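To make the "top 10 genes" setting concrete, here is a minimal Python sketch of how a basic prompt could be assembled from per-cluster marker genes. It is an illustration only: the function name, the prompt wording, and the example gene lists are assumptions made here, not the GPTCelltype implementation (which is an R package) and not the prompts used in the study.

```python
# Hypothetical helper: not part of GPTCelltype, prompt wording is assumed.
def build_annotation_prompt(markers_by_cluster, tissue, top_n=10):
    """Build one plain-text prompt listing the top_n marker genes per cluster.

    markers_by_cluster maps a cluster label to its marker genes,
    already ranked from strongest to weakest.
    """
    lines = [
        f"Identify cell types of {tissue} cells using the following markers. "
        "Give one cell type name per line, in the same order as the rows below."
    ]
    for cluster, genes in markers_by_cluster.items():
        # Keep only the top_n highest-ranked genes for this cluster.
        lines.append(f"{cluster}: {', '.join(genes[:top_n])}")
    return "\n".join(lines)


# Illustrative marker lists (example input, not data from the study).
markers = {
    "cluster_0": ["CD3D", "CD3E", "CD2", "IL7R", "TRAC"],
    "cluster_1": ["MS4A1", "CD79A", "CD79B", "CD19", "BANK1"],
}
prompt = build_annotation_prompt(markers, tissue="human blood", top_n=3)
print(prompt)
```

The resulting string would then be sent to the model through whatever client is in use; that step is left out because it depends on the API and credentials available.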

GPT-4 demonstrated its ability to generate expert-comparable cell type annotations, matching manual analyses in more than 75% of cell types across most studies and tissues.

Moreover, the limited concordance between GPT-4 and manual annotations in certain cell types does not inherently signify the inaccuracy of GPT-4's annotations. For instance, in the case of stromal or connective tissue cells, GPT-4 offers more precise cell type annotations. Additionally, GPT-4 exhibited notably faster performance.

Hou and her colleague further evaluated GPT-4's robustness in complex real data scenarios, revealing its ability to distinguish between pure and mixed cell types with 93% accuracy and to differentiate between known and unknown cell types with 99% accuracy.

Additionally, they examined the reproducibility of GPT-4's annotations, drawing on prior simulation studies. In 85% of cases, GPT-4 produced identical annotations for the same marker genes. “All of these results demonstrate GPT-4’s robustness in various scenarios,” observed Hou.
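One simple way to quantify this kind of reproducibility is to query the model several times with the same marker genes and measure how often the answers agree. The sketch below is an assumption about how such a score could be computed, not the metric defined in the paper: it reports the fraction of repeated answers that equal the most common answer, after trimming whitespace and ignoring case.

```python
from collections import Counter


# Hypothetical scoring function: the paper's exact definition may differ.
def reproducibility(repeated_annotations):
    """Fraction of repeated answers that equal the most common answer.

    Expects a non-empty list of annotation strings for one gene set.
    """
    # Normalize so that "T cell" and "t cell " count as the same answer.
    normalized = [a.strip().lower() for a in repeated_annotations]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


# Illustrative answers from five repeated queries (made-up example input).
runs = ["T cell", "T cell", "t cell ", "NK cell", "T cell"]
score = reproducibility(runs)
print(score)
```

Exact string matching is strict: synonyms such as "T lymphocyte" would count as disagreements, so a real evaluation would likely need a mapping to a shared cell type vocabulary.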

Hou notes that although GPT-4 outperforms current techniques, it has drawbacks: because its training procedures are not transparent, the quality and reliability of its annotations are difficult to verify.

“Since our study focuses on the standard version of GPT-4, fine-tuning GPT-4 could further improve cell type annotation performance.”

Wenpin Hou Ph.D., Assistant Professor, Department of Biostatistics, Columbia Mailman School

Source:

Columbia University’s Mailman School of Public Health

Journal reference:

Hou, W., & Ji, Z. (2024) Assessing GPT-4 for cell type annotation in single-cell RNA-seq analysis. Nature Methods. doi.org/10.1038/s41592-024-02235-4
