Simulating Evolution: How ESM3 Language Model Transforms Protein Development

November 5, 2024

ESM3, a new artificial intelligence (AI) model from EvolutionaryScale, a US company founded by former Meta employees, can design proteins with specified properties, a task that natural evolution would need hundreds of millions of years to accomplish. The company unveiled this generative masked language model, one of the largest biological AI models to date, in a recent preprint on bioRxiv. What sets ESM3 apart is its ability to generate a protein's amino acid sequence, three-dimensional structure and function together in response to a prompt, which opens the door to uses in materials research, drug development and proteins for carbon storage.

Proteins are microscopic biomachines vital for many bodily processes, including the formation of muscles, hair and nails as well as the production of hormones and antibodies, so their three-dimensional structure is of great biological and pharmacological importance. Knowing a protein's structure helps researchers understand its biological function, assess its suitability as a therapeutic target, and judge its effectiveness as a treatment. Several life-saving drugs are themselves proteins, including insulin and synthetic antibodies against cancer and serious respiratory infections such as RSV. Instead of laboriously searching for natural variants, medical research increasingly needs to create entirely new proteins with specific characteristics.

To design proteins, EvolutionaryScale's ESM3 uses a masked language model: parts of the input are hidden, and the model learns to fill in the gaps from the surrounding context. The model uses a separate alphabet for each category (sequence, 3D structure and function) and was trained on a large dataset including 2.8 billion amino acid sequences, 236 million protein structures and 539 million protein function annotations. To enable the model to understand context both within and across these categories, the team found a way to represent each 3D structure as a series of characters.
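The masking idea described above can be illustrated with a short Python sketch. This is a toy illustration under stated assumptions, not EvolutionaryScale's implementation: the function name `mask_tracks`, the `<mask>` placeholder, the structure tokens such as `s12` and the function labels are all invented for the example. It only shows how positions in three parallel tracks can be hidden so that a model would have to predict them from whatever remains visible in any track.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, chosen for this sketch


def mask_tracks(tracks, mask_rate, rng):
    """Hide a random subset of tokens in every track.

    Returns the masked tracks and a dict mapping (track name, position)
    to the original token that a model would be asked to predict.
    """
    masked = {}
    targets = {}
    for name, tokens in tracks.items():
        masked_tokens = []
        for position, token in enumerate(tokens):
            if rng.random() < mask_rate:
                masked_tokens.append(MASK)
                targets[(name, position)] = token
            else:
                masked_tokens.append(token)
        masked[name] = masked_tokens
    return masked, targets


# Toy protein with one token per residue in each track (all values invented).
tracks = {
    "sequence": list("MSKGEEL"),
    "structure": ["s12", "s7", "s7", "s40", "s3", "s3", "s19"],
    "function": ["none", "none", "fluor", "fluor", "fluor", "none", "none"],
}

masked, targets = mask_tracks(tracks, mask_rate=0.3, rng=random.Random(0))
for name, tokens in masked.items():
    print(name, tokens)
```

Because each track is masked independently, a position hidden in the sequence track may still be visible in the structure or function track, which is the kind of cross-category context the article describes.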

To demonstrate the potential of ESM3, the startup tasked the model with creating synthetic versions of green fluorescent protein (GFP), which is responsible for the natural glow of marine species like corals and jellyfish. GFP, whose discovery and development was recognized with the 2008 Nobel Prize in Chemistry, is an essential tool in molecular biology that allows scientists to identify and track components of living cells. The best synthetic variant produced by ESM3, "esmGFP," shares only 58% of its amino acid sequence with its closest natural counterpart, yet its brightness is comparable to that of natural GFP. According to the researchers, creating this new fluorescent protein is equivalent to simulating more than 500 million years of evolution.
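A similarity figure such as the 58% above is typically a percent identity: the share of aligned positions at which two proteins carry the same amino acid. The Python sketch below shows one common way to compute it, assuming the two sequences have already been aligned and use `-` for gaps. The function name `percent_identity` and the short example sequences are invented for illustration, conventions for the denominator vary between tools, and the source does not say exactly how the 58% figure was calculated.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free aligned positions where both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


# Two made-up ten-residue fragments that differ at three positions.
print(percent_identity("MSKGEELFTG", "MSRGEQLFSG"))  # 70.0
```

On this definition, two proteins with 58% identity differ at roughly four out of every ten aligned positions.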

EvolutionaryScale Chief Scientist Alex Rives was involved in previous iterations of the ESM model at Meta. The team decided to continue this work independently after Meta stopped working in this area last year. The company has now announced the fluorescent protein and has raised $142 million to commercialize these advances. A smaller, open-access version of ESM3 has also been made available for scientific research, although it does not offer the full model's capabilities. Martin Pacesa of the École Polytechnique Fédérale de Lausanne said in an interview that he was excited to test the model, but noted that reproducing the full version would take a lot of computing power.

Code Labs Academy © 2024 All rights reserved.