Machine Learning-Based Tool Allows Scientists to Quickly Model Families of Proteins

Reviewed by Lily Ramsey, LLM
Jun 2 2023

Microbes drive key processes of life on Earth. They affect global elemental cycles: the movement of carbon, nitrogen, and other elements. They also promote plant growth and affect the development of diseases. These roles are essential in every ecosystem.

Research constantly expands the database of microbial DNA sequences, but sequence data alone do not provide all the biological information about the proteins they encode. To engineer microbes for sustainable bioenergy and other bioproducts, scientists need a fuller understanding of the function of proteins and other molecules. Scientists infer the function of a protein by comparing it with reference databases of already characterized proteins.

However, these comparisons are difficult and not scalable to massive databases. To address this challenge, scientists have applied machine learning to models that predict protein function. The result is the program Snekmer, which allows scientists to quickly model families of proteins.

The Impact

Studying biological protein molecules in microbes will help scientists pursue new applications for engineered microbes. Snekmer is easy to deploy in high-performance computing environments. In addition, it is incorporated into the DOE KBase framework as a new application that will allow users to annotate their genome and metagenome sequences.

This will help scientists to better model the effects of engineering microbes. That includes these microbes' effect on the climate and their benefits for crop health and bioproduction. Snekmer will also help scientists study the evolution of microbes and patterns in microbiomes.

Summary

The inability of current methods to predict function for 30-50% of bacterial protein sequences is a significant barrier to a better understanding of complex systems such as soil microbiomes. Most protocols rely on pairwise alignments, which are becoming computationally intractable and more challenging to interpret as databases expand.

For alignment-based models of protein families, sensitivity and accuracy depend on the initial training sets, which risk obsolescence as additional sequence diversity is discovered. Many bacterial proteins either have no functional assignment or are assigned only a general function based solely on taxonomic understanding.

To address this need, researchers at Pacific Northwest National Laboratory, Baylor University, and Oregon Health & Science University developed Snekmer. The software tool leverages the redundancy of amino acid residue properties to reduce sequence space, then uses short protein subsequence (kmer) features for machine learning to generate protein family models.
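To make the recoding idea concrete, here is a minimal Python sketch. It is not Snekmer's own code: the six-group alphabet (`GROUPS`), the function names, and the example peptides are assumptions chosen for illustration, and the alphabets Snekmer actually uses may group residues differently.

```python
from collections import Counter

# Assumed reduced alphabet: each symbol stands for a group of amino acids
# with broadly similar properties. Illustrative only, not Snekmer's scheme.
GROUPS = {
    "h": "AVLIMC",  # hydrophobic
    "a": "FWY",     # aromatic
    "p": "STNQ",    # polar, uncharged
    "b": "KRH",     # basic
    "n": "DE",      # acidic
    "s": "GP",      # special cases
}
RECODE = {aa: symbol for symbol, members in GROUPS.items() for aa in members}


def recode(sequence):
    """Rewrite a protein sequence in the reduced alphabet.

    Residues outside the 20 standard amino acids (e.g. X) become 'x'.
    """
    return "".join(RECODE.get(aa, "x") for aa in sequence.upper())


def kmer_counts(sequence, k=3):
    """Count overlapping kmers of length k in the recoded sequence."""
    reduced = recode(sequence)
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


# Two different peptides collapse to the same reduced string,
# so they share every kmer feature.
print(recode("MKVLAD"))  # hbhhhn
print(recode("LRIMVE"))  # hbhhhn
print(kmer_counts("MKVLAD", k=3))
```

With 20 amino acids there are 20^3 = 8,000 possible 3-mers. With this assumed six-symbol alphabet there are 6^3 = 216, which is the sense in which recoding reduces sequence space.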

Snekmer users can recode protein sequences into reduced-alphabet kmer vectors and then either build supervised classification models trained on input protein families or classify proteins by function using Snekmer models.
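The workflow in the previous paragraph (kmer vectors in, family models out, then classification) can be sketched in a few lines of Python. This is a toy stand-in rather than Snekmer's method: the article does not say which classifier Snekmer uses, so the sketch swaps in a simple nearest-centroid rule with cosine similarity. The six-symbol alphabet, the family names, and the sequences are all invented for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt

# Assumed six-symbol reduced alphabet (illustrative, not Snekmer's):
# h = AVLIMC, a = FWY, p = STNQ, b = KRH, n = DE, s = GP
TABLE = str.maketrans("AVLIMCFWYSTNQKRHDEGP", "hhhhhhaaappppbbbnnss")
K = 2
VOCAB = ["".join(p) for p in product("hapbns", repeat=K)]  # 36 possible 2-mers


def kmer_vector(sequence):
    """Fixed-length vector of kmer counts over the reduced alphabet.

    Non-standard residues are left untranslated, so kmers containing
    them fall outside VOCAB and are ignored.
    """
    reduced = sequence.upper().translate(TABLE)
    counts = Counter(reduced[i:i + K] for i in range(len(reduced) - K + 1))
    return [counts[kmer] for kmer in VOCAB]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0


def train(families):
    """Build one centroid (mean kmer vector) per labeled family."""
    model = {}
    for name, sequences in families.items():
        vectors = [kmer_vector(s) for s in sequences]
        model[name] = [sum(col) / len(vectors) for col in zip(*vectors)]
    return model


def classify(model, sequence):
    """Assign the family whose centroid is most similar to the sequence."""
    vector = kmer_vector(sequence)
    return max(model, key=lambda name: cosine(model[name], vector))


# Invented toy training data: two artificial "families".
families = {
    "hydrophobic_rich": ["AVLIMAVLLI", "LLIVAMMVLA"],
    "charged_rich": ["KRDEKKDERE", "DEKRRDEEKK"],
}
model = train(families)
print(classify(model, "VVLAIMLLAV"))  # hydrophobic_rich
print(classify(model, "KKDEERKDRE"))  # charged_rich
```

Because every sequence maps to a vector of the same length, scoring a new protein costs one comparison per family model rather than one alignment per database sequence.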

Source:

Department of Energy, Office of Science

Journal reference:

Chang, C. H., et al. (2023) Snekmer: a scalable pipeline for protein sequence fingerprinting based on amino acid recoding. Bioinformatics Advances. https://doi.org/10.1093/bioadv/vbad005.
