AI Tool Predicts Protein Aggregation from Sequence and Allows Scientists to Understand Decision-Making Process

Hoca

Researchers headed by a team at the Centre for Genomic Regulation, Barcelona Institute of Science and Technology and at the Wellcome Sanger Institute have developed an AI tool that they say is a step forward in translating the language proteins use to dictate whether they form sticky clumps similar to those linked to Alzheimer’s disease and other human diseases characterized by protein aggregation. In a departure from typical “black-box” AI models, the new tool, CANYA (convolution attention network for amyloid aggregation), was designed to explain its decisions, revealing the specific chemical patterns that drive or prevent harmful protein folding.

The team’s development, reported in Science Advances, was made possible by what they suggest is the largest dataset on protein aggregation created to date. The team experimentally quantified the aggregation of more than 100,000 protein sequences, and used this dataset to train the new AI tool to predict aggregation from sequence. Their results offer new insights into the molecular mechanisms underpinning sticky proteins, which are linked to diseases affecting half a billion people worldwide.


ICREA research professor Ben Lehner, group leader at the Centre for Genomic Regulation (CRG) and the Wellcome Sanger Institute, said, “This project is a great example of how combining large-scale data generation with AI can accelerate research. It’s also a very cost-effective method to generate data.”

Lehner and Benedetta Bolognesi, PhD, group leader at the Institute for Bioengineering of Catalonia (IBEC), are co-corresponding authors of the team’s report, titled, “Massive experimental quantification allows interpretable deep learning of protein aggregation,” in which the investigators concluded, “More generally, our results provide a very large and well-calibrated dataset to train and evaluate models beyond CANYA, and they demonstrate the utility of massive experimental analysis of random protein sequence spaces.” The Bolognesi and Lehner labs collaborated on the project with researchers at Cold Spring Harbor Laboratory and the Wellcome Sanger Institute.

Protein clumping, or amyloid aggregation, is a health hazard that disrupts normal cell function. When certain patches in proteins stick to each other, proteins grow into dense fibrous masses that have pathological consequences. “Specific insoluble protein aggregates in the form of amyloid fibrils characterize more than 50 clinical conditions affecting more than half a billion people,” the authors wrote. “These include common neurodegenerative disorders and the most frequent forms of dementia.”


While the study has some implications for accelerating research into neurodegenerative diseases, its more immediate impact will be in biotechnology, the team suggested. Many drugs are themselves proteins and are often hampered by unwanted clumping. “Protein aggregation is also a major problem in biotechnology, for example, in the production of enzymes, antibodies, and other protein therapeutics,” the team continued.

“Protein aggregation is a major headache for pharmaceutical companies,” Bolognesi noted. “If a therapeutic protein starts aggregating, manufacturing batches can fail, costing time and money.”

Protein clumps are formed using a poorly understood language. Proteins are made of twenty different types of amino acids, different combinations of which form “words,” or “motifs.” Researchers have long sought to decipher which combinations of motifs cause clumping and which others enable proteins to fold without error. Artificial intelligence tools that treat amino acids like the alphabet of language could help identify the precise words or motifs responsible, but the quality and volume of data about protein aggregation needed to feed models have been historically scant or restricted to very small protein fragments.

“The importance of amyloids across biological functions and diseases has spurred massive research efforts, yet the determinants and mechanisms of their formation remain quite poorly understood,” the investigators stated. “Methods have been proposed to predict aggregation from sequence, but these have been trained and evaluated on small and biased experimental datasets.”

The newly reported study addressed this challenge by carrying out large-scale experiments. The authors created over 100,000 completely random protein fragments, each 20 amino acids long, from scratch. The ability of each synthetic fragment to clump was tested in living yeast cells. If a particular fragment triggered clump formation, the yeast cells would grow in a certain way that could be measured by the researchers to determine cause and effect.
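As a rough illustration of what a library of random 20-residue fragments looks like, here is a minimal Python sketch. It is not the authors' method: the study built its library by DNA synthesis, and the function name `random_peptides`, the uniform sampling over the 20 standard amino acids, and the fixed seed are all assumptions made for this example.

```python
import random

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def random_peptides(n, length=20, seed=0):
    """Return n peptides sampled uniformly at random (hypothetical helper).

    Uniform sampling is an assumption for illustration; a synthesized
    DNA library need not have a uniform amino acid composition.
    """
    rng = random.Random(seed)
    return [
        "".join(rng.choice(AMINO_ACIDS) for _ in range(length))
        for _ in range(n)
    ]

library = random_peptides(5)
for peptide in library:
    print(peptide)
```

Each printed line is one synthetic fragment of the kind that would then be tested for clumping.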

Amyloid aggregation inside cells marked using fluorescence techniques. [Benedetta Bolognesi/IBEC]

The team found that around one in every five protein fragments (21,936/100,000) caused clumping, while the rest did not. While previous studies might have tracked a handful of sequences, the new dataset captures a much bigger catalogue of the different protein variants that can cause amyloid aggregation.


“We created truly random protein fragments including many versions not found in nature,” explained first author Mike Thompson, PhD, a postdoctoral researcher at the Centre for Genomic Regulation (CRG). “Evolution has explored only a fraction of all possible protein sequences, while our approach helps us peer into a much bigger galaxy of possibilities, providing lots of data points to help understand more general laws of aggregation behavior.”

The vast amount of data generated from the experiments was used to train CANYA, which the researchers created using the principles of “explainable AI,” making its decision-making processes transparent and understandable to humans. This meant sacrificing a little bit of its predictive power, which is usually higher in “black-box” AIs. Despite this, CANYA proved to be around 15% more accurate than existing models. “Using random sequences allowed us to test the aggregation of sequences very different to the small number of known amyloids and to provide a principled evaluation of existing amyloid predictors over both our own and existing datasets, serving as a guideline for the community,” the investigators stated. “Evaluation on an additional independent 7,000 sequences confirmed the performance of CANYA on predicting aggregation from sequence.”

Specifically, CANYA is a convolution-attention model, a hybrid tool borrowing from two distinct corners of AI. Convolution models, like those used in image recognition, scan photos for features such as an ear or a nose to identify a face. In this case, CANYA scans along the protein chain to find meaningful features such as motifs, or “words.”

Attention AI models are used by language translation tools to identify key phrases in a sentence before deciding on the best translation. The researchers incorporated this technique to help CANYA figure out which motifs matter most in the grand scheme of the entire protein.

Together, these two approaches help CANYA see local motifs up close while also spotting their bigger-picture importance. The researchers could use this information not only to predict which motifs in the protein chain encourage clumping, block it, or fall somewhere in between, but also to understand why. “The performance of CANYA and its consistency across evaluation tasks suggest that CANYA does learn an accurate approximation of the sequence-aggregation landscape, despite only training on random, synthetic peptides,” they commented.
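The general idea of pairing a convolution step with an attention step can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not CANYA's architecture or weights: the single hand-set 3-residue filter, the hydrophobic residue set, the self-scored attention pooling, and the names `conv1d`, `attention_pool`, and `predict` are all hypothetical choices for illustration. A real model would learn many filters and use learned attention parameters.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: a simple set of water-repelling residues for the toy filter.
HYDROPHOBIC = set("AVILMFWY")

def one_hot(peptide):
    """Encode each residue as a 20-long vector of 0.0/1.0 values."""
    return [[1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS] for aa in peptide]

def conv1d(encoded, kernel):
    """Slide a (width x 20) kernel along the sequence; one score per window."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            for channel in range(len(AMINO_ACIDS)):
                total += kernel[offset][channel] * encoded[start + offset][channel]
        scores.append(total)
    return scores

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    norm = sum(exps)
    return [e / norm for e in exps]

def attention_pool(scores):
    """Weight each window by the softmax of its own score, then sum.

    A simplification: real attention layers use learned queries and keys.
    """
    weights = softmax(scores)
    pooled = sum(w * s for w, s in zip(weights, scores))
    return pooled, weights

# Hand-set toy filter: +1 for a hydrophobic residue, -1 otherwise, width 3.
kernel = [[1.0 if aa in HYDROPHOBIC else -1.0 for aa in AMINO_ACIDS]
          for _ in range(3)]

def predict(peptide, bias=0.0):
    """Return (toy aggregation probability, per-window attention weights)."""
    scores = conv1d(one_hot(peptide), kernel)
    pooled, weights = attention_pool(scores)
    probability = 1.0 / (1.0 + math.exp(-(pooled + bias)))
    return probability, weights

patchy, patchy_weights = predict("DEKDEKDEKVILVIDEKDEK")   # hydrophobic patch
charged, _ = predict("DEKDEKDEKDEKDEKDEKDE")               # no such patch
print(patchy > charged)
```

The attention weights are what make a model of this shape inspectable: in the toy, the largest weights fall on the windows covering the hydrophobic patch, which indicates where in the chain the score comes from.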

For example, CANYA showed that small pockets of water-repelling amino acids are more likely to spark clumping, while some motifs have a bigger impact on clumping if they’re near the start of a protein sequence rather than at the end. The observations align with previous findings researchers have seen under the microscope in known amyloid fibrils.

But CANYA also found new rules driving protein aggregation. For instance, certain charged amino acids are normally thought to prevent clumping. But it turns out that in the context of other specific building blocks, they can actually promote clumping.

In its current form, CANYA primarily explains protein aggregation in yes-or-no terms, i.e., it works as a so-called “classifier.” The researchers next want to refine the system so that it can predict and compare aggregation speeds rather than just aggregation likelihood. This could help predict which protein variants form clumps quickly and which do so more slowly, a vital factor in neurodegenerative diseases, where the timing of amyloid formation matters just as much as the fact that it happens at all.

“There are 1,024 quintillion ways of creating a protein fragment that is 20-amino-acids long,” Bolognesi stated. “So far, we’ve trained an AI with just 100,000 fragments. We want to improve it by making more and bigger fragments. This is just the first step, but our work shows it is possible to decipher the language of protein aggregation. This is incredibly important for our understanding of human disease, but also to guide synthetic biology efforts … CANYA can help guide efforts to engineer antibodies and enzymes that are less likely to stick together and reduce expensive setbacks in the process.”

Lehner added, “Using DNA synthesis and sequencing, we can perform hundreds of thousands of experiments in a single tube, generating the data we need to train AI models. This is an approach we are applying to many difficult problems in biology. The goal is to make biology predictable and programmable.”

The post AI Tool Predicts Protein Aggregation from Sequence and Allows Scientists to Understand Decision-Making Process appeared first on GEN - Genetic Engineering and Biotechnology News.
 