Study Highlights the Vast Untapped Potential Within the Global Protein Universe

The number of known proteins is infinitely small in comparison to the universe of possible proteins which could in theory be realized. Yet these known proteins are the only major training ground for future protein design. Understanding how representative these proteins are of the overall potential diversity can therefore help inform strategies for a wide range of applications, including therapeutic, biocatalysis, or biomaterials development.

Published in PNAS, an international team from the Okinawa Institute of Science and Technology (OIST), the Institute of Science and Technology Austria (ISTA), the University of Vienna and the Centro de Astrobiología (CAB) investigated the relationship between protein evolution and sequence space, identifying the limiting factors behind protein diversification. Their findings reinforce theories of DNA recombination as a driving force of ancestral protein formation and highlight the limitations of many cutting-edge AI protein design methods.

Modern AI methods are thought to be revolutionizing protein design, with the 2024 Nobel Prize in Chemistry awarded to the team behind AlphaFold. Yet most of these AI design methods are typically trained on databases of known proteins. So without understanding how representative these known proteins are of sequence space, how confident can we be that such methods can generate truly diverse protein designs?"

Fyodor Kondrashov, Professor and Head, Evolutionary and Synthetic Biology Unit, Okinawa Institute of Science and Technology

Exploring the Protein Universe

Imagine you have 20 or so different block types, which you can connect in different orders and abundances into chains of tens, hundreds or even thousands of blocks in length. Mapping all possible resulting chains creates a sequence space.

For proteins, the shape and structure of their amino acid building blocks mean only a minute fraction of possible protein sequences can fold up into the correct 3D shape to power a biological function. They need the correct chemical groups in the correct places to create the interactions that will maintain 3D shape or bind to other molecules. Mapping the sequences that fulfil this requirement creates a smaller functional space.

Of these possible functional sequences, it's likely that relatively few have ever existed across evolutionary history. Therefore, the researchers set out to uncover how representative this subset of proteins is of functional space.

The researchers started by mathematically describing the sequence space taken up by known proteins. They then built a model of protein evolutions to understand the biological factors controlling the structural diversification of a wide range of naturally-occurring protein families. From their models, they then predicted how many functional sequences they would expect to exist for a given biological function.

By comparing the diversity of known proteins to these theoretical predictions of protein evolution, the researchers found that point-of-origin effects far outweighed the influence of other key evolutionary processes.

"That starting point is the main evolutionary limit is not necessarily surprising, but the scale of its importance is really quite remarkable," observes lead author Lada Isakova, PhD student within the unit. "As an evolutionary biologist, I was intrigued to see how little selection and epistasis seemed to matter in our results."

What Limits Protein Evolution?

When mutations arise in the genes encoding for a particular protein, these can result in changes to the sequence of amino acids produced, causing protein evolution. Natural selection limits which mutations persist over time based on whether they improve or harm the protein's function or stability. Epistasis - genetic interactions resulting in different outputs - also constrains evolution, as mutations may have limited effects alone, but large effects when present in combination with certain other mutations.

Both selection and epistasis are known to influence protein evolution, yet Isakova and colleagues found that by far, the limiting factor of protein diversity is the origins of our proteins, with relatively small divergence seen from the areas of sequence space of ancestral proteins.

This research provides new insights into the origins of life, reinforcing existing theories on initial protein formation. Isakova explains, "Our simulations suggest that, for the first proteins in the last universal common ancestor to arise, they couldn't just diverge from mutations of a single first sequence, given the time constraints we see. Instead, small pieces of DNA must have shuffled around and recombined to create new DNA molecules which could encode very different proteins."

The team also hopes that the research inspires experimental scientists to expand the known sequence space. Isakova comments, "Neural network approaches for functional protein prediction are limited by the data sets we provide. So based on existing data, most methods won't be able to generalize well beyond the current known sequence space. We can see there's huge swaths of sequence space left to be explored, but it'll take new experimental data to enable expansion into these unknown realms."

This global collaboration was supported by a Japan Science and Technology Agency (JST) Adopting Sustainable Partnerships for Innovative Research Ecosystem (ASPIRE) grant, which aims to build a network between top researchers in Japan and around the world, nurturing future scientific leaders.

Comments

The opinions expressed here are the views of the writer and do not necessarily reflect the views and opinions of AZoLifeSciences.
Post a new comment
Post

While we only use edited and approved content for Azthena answers, it may on occasions provide incorrect responses. Please confirm any data provided with the related suppliers or authors. We do not provide medical advice, if you search for medical information you must always consult a medical professional before acting on any information provided.

Your questions, but not your email details will be shared with OpenAI and retained for 30 days in accordance with their privacy principles.

Please do not ask questions that use sensitive or confidential information.

Read the full Terms & Conditions.

You might also like...
PAX3 Protein Plays Key Roles in Development and Cancer