The quest to design proteins with novel functions has long been hampered by a fundamental challenge: the sheer scale of possibility. A protein just 100 amino acids long presents more potential variations than there are atoms in the observable universe. Traditional protein engineering methods, testing hundreds of variants at a time, can only scratch the surface. Now, a novel framework called MULTI-evolve, detailed today in the journal Science, is leveraging the power of machine learning to dramatically accelerate this process, offering a potential leap forward in fields ranging from medicine to materials science.
Researchers at the Arc Institute have developed MULTI-evolve to address a key bottleneck in protein engineering: efficiently choosing which variants to build and test in the lab. While recent advances in artificial intelligence have enabled broader computational searches for promising protein designs, these still require substantial experimental validation. MULTI-evolve aims to compress months of iterative lab work into weeks by intelligently prioritizing the most likely candidates for improvement. This represents the Arc Institute’s first “lab-in-the-loop” framework, integrating computational prediction with experimental design from the outset.
Focusing on Beneficial Interactions
A core insight behind MULTI-evolve is that not all mutations are created equal. Early attempts to predict protein function using neural networks trained on single-mutant data proved unreliable. These models struggled to account for the complex interactions between mutations – a phenomenon known as epistasis. Simply testing thousands of random variants, researchers found, largely taught the models what *doesn’t* work, rather than revealing synergistic combinations that enhance function.
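To make the idea of epistasis concrete, here is a minimal sketch of one common way to quantify it. It assumes a multiplicative null model (two independent mutations multiply their fold-changes) and uses made-up activity values; the function name `epistasis` is hypothetical and is not taken from the MULTI-evolve code.

```python
import math

def epistasis(f_wt, f_a, f_b, f_ab):
    """Log2 deviation of a double mutant from the multiplicative expectation.

    Assumes activities are positive and measured on the same scale.
    Positive values suggest synergy, negative values antagonism,
    and zero means the two mutations behave independently.
    """
    expected = (f_a / f_wt) * (f_b / f_wt)  # fold-change if effects simply multiply
    observed = f_ab / f_wt                  # measured fold-change of the double mutant
    return math.log2(observed / expected)

# Illustrative numbers only: singles improve activity 2x and 3x.
synergistic = epistasis(1.0, 2.0, 3.0, 12.0)  # double mutant is 12x, expected 6x
independent = epistasis(1.0, 2.0, 3.0, 6.0)   # double mutant matches expectation
print(synergistic, independent)
```

A model trained only on single mutants has no way to observe the deviation this function measures, which is why double-mutant data matters.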
The MULTI-evolve team shifted their focus to “quality over quantity.” The process begins by identifying a relatively small set – around 15 to 20 – of mutations that individually improve protein function, using a combination of protein language models and experimental screening. The framework then systematically tests all pairwise combinations of these beneficial mutations, generating a dataset of roughly 100 to 200 measurements. This focused approach provides the data needed to learn the rules governing how mutations interact, allowing the model to predict the performance of more complex, multi-mutant variants.
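The "roughly 100 to 200" figure follows from simple counting: 15 mutations give 105 unordered pairs and 20 give 190. The sketch below shows that arithmetic and one way a pairwise library might be enumerated. The mutation labels are hypothetical, and skipping pairs that hit the same residue position is an assumption added here, not a detail from the article.

```python
from itertools import combinations
from math import comb

def position(mutation):
    """Extract the residue position from a label such as 'K22R'."""
    return int(mutation[1:-1])

def pairwise_library(mutations):
    """All unordered pairs of mutations, excluding pairs at the same position
    (two substitutions cannot occupy one residue)."""
    return [(a, b) for a, b in combinations(mutations, 2)
            if position(a) != position(b)]

# Number of pairs for the set sizes mentioned in the article.
sizes = {n: comb(n, 2) for n in (15, 20)}
print(sizes)

# Hypothetical labels for illustration only.
example = ["K22R", "T50S", "G99D", "G99A"]
pairs = pairwise_library(example)
print(pairs)
```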
This strategy was validated using 12 existing protein datasets from previously published studies. The researchers found that neural networks trained on single and double mutants could accurately predict the activity of variants containing 3 to 12 mutations, even when using only 10% of the available training data. This demonstrates the power of focusing on informative interactions rather than exhaustively searching the vast sequence space.
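Why can single and double mutants say anything about a triple mutant? The sketch below uses a deliberately simpler stand-in for the neural networks described in the study: an additive model with pairwise interaction terms, fitted by subtraction. It assumes log-scale activities and no third-order or higher interactions, and all names and measurements are invented for illustration.

```python
from itertools import combinations

def fit_pairwise(measured):
    """Fit baseline, single-mutation effects, and pairwise interaction terms.

    measured maps a frozenset of mutation labels to a log-scale activity and
    must contain the wild type (empty set), every single, and the doubles.
    """
    wt = measured[frozenset()]
    singles = {}
    for variant, value in measured.items():
        if len(variant) == 1:
            (m,) = variant
            singles[m] = value - wt
    pairs = {}
    for variant, value in measured.items():
        if len(variant) == 2:
            a, b = sorted(variant)
            # Interaction = what the double shows beyond the two single effects.
            pairs[(a, b)] = value - wt - singles[a] - singles[b]
    return wt, singles, pairs

def predict(model, variant):
    """Predict a multi-mutant as baseline + single effects + pairwise terms."""
    wt, singles, pairs = model
    muts = sorted(variant)
    total = wt + sum(singles[m] for m in muts)
    total += sum(pairs.get(p, 0.0) for p in combinations(muts, 2))
    return total

# Hypothetical log-scale measurements (wild type = 0).
measured = {
    frozenset(): 0.0,
    frozenset("A"): 1.0, frozenset("B"): 0.5, frozenset("C"): 0.25,
    frozenset("AB"): 2.0, frozenset("AC"): 1.25, frozenset("BC"): 0.5,
}
model = fit_pairwise(measured)
triple_prediction = predict(model, {"A", "B", "C"})
print(triple_prediction)
```

A neural network can in principle capture interactions this simple model ignores, but the example shows why pairwise data already constrains predictions for higher-order variants.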
Three Key Innovations
MULTI-evolve isn’t a single algorithm, but rather an integrated framework built on three key innovations. First, it combines predictions from multiple protein language models – some analyzing protein sequence, others examining 3D structure – to identify a wider range of potentially beneficial mutations. This ensemble approach identified, on average, 20 promising mutations across 73 diverse protein datasets, compared to 11 identified by any single model. For example, when applied to the APEX enzyme, the framework pinpointed the A134P mutation, a substitution often missed by standard protein language models due to a bias against proline substitutions.
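The article does not say exactly how the models' predictions are combined. As one plausible illustration, the sketch below takes the union of each model's top-ranked mutations, which shows how an ensemble can nominate candidates that a single model ranks poorly. The model names, mutation labels, and scores are all hypothetical.

```python
def top_k(scores, k):
    """Return the k highest-scoring mutations from one model."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def ensemble_nominations(model_scores, k):
    """Union of every model's top-k nominations (an assumed combination rule)."""
    nominated = set()
    for scores in model_scores.values():
        nominated |= top_k(scores, k)
    return nominated

# Hypothetical scores: higher means the model favours the mutation more.
model_scores = {
    "sequence_model": {"K22R": 1.5, "T50S": 0.9, "G99D": 0.1, "L7P": -2.0},
    "structure_model": {"K22R": 0.2, "T50S": 1.0, "G99D": -0.5, "L7P": 1.8},
}
nominations = ensemble_nominations(model_scores, k=2)
print(sorted(nominations))
```

Here the proline substitution is ranked last by the sequence-based model but still enters the candidate set because the structure-based model ranks it first.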
Second, MULTI-evolve employs fully connected neural networks to predict which combinations of mutations will work best. These networks, trained on data from single and double mutants, can reliably predict the activity of multi-mutant variants. In testing across 12 protein datasets, the models correctly identified top performers more than half the time.
Finally, the framework addresses the practical challenge of building and testing these predicted variants with a new method called MULTI-assembly. This technique optimizes reaction conditions and oligonucleotide designs to efficiently construct complex multi-mutants, achieving 40-70% assembly efficiency for variants with up to nine mutations. The team also developed a computational tool to design the necessary DNA primers, streamlining the entire process.
Real-World Applications and Open Access
The researchers demonstrated the effectiveness of MULTI-evolve by applying it to three different proteins: APEX, a peroxidase enzyme used for proximity labeling; dCasRx, a catalytically inactive CRISPR protein that targets RNA; and an anti-CD122 antibody. They achieved significant improvements in each case, including a 256-fold increase in APEX activity, a 9.8-fold improvement in dCasRx trans-splicing efficiency, and a 2.7-fold increase in binding affinity for the anti-CD122 antibody. Notably, each protein required testing only 100-200 variants in a single round, far fewer than traditional approaches typically demand.
The MULTI-evolve framework is now available as an open-source tool, providing researchers with a systematic path from initial mutation discovery to optimized multi-mutant proteins. The team anticipates that the framework will continue to improve as protein language models become more sophisticated and integrate with other protein design tools.
Researchers interested in applying MULTI-evolve to their own work are encouraged to reach out to the Arc Institute. The next step for the team involves refining the framework and expanding its application to a wider range of protein engineering challenges. The potential impact of this technology extends to numerous fields, offering a powerful new approach to designing proteins with tailored functions.
