
Teaching AI To Think Like A Cell

GREmLN, a new kind of AI model, is trained in ‘molecular logic’ and could unlock new frontiers in cancer research.

July 10, 2025

Molecular model of immunoglobulin M (IgM, light blue) bound to an antigen (magenta), activating the C1 complex (violet) of the complement system. (Photo courtesy of Juan Gaertner via Getty Images)

Cancer often begins with a mistake: a few genes misguided by a mutation, or a set of instructions misread. When a few genes go awry, the consequences can cascade through dozens, even hundreds, of other genes, changing a cell’s identity and behavior. Researchers still cannot fully explain this domino effect, and they hope AI can help them understand how the transformation occurs.

For decades, biologists have tried to study these transformations using gene expression data, which are snapshots of which genes are active in individual cells. Tools like single-cell RNA sequencing now offer incredibly rich data, enabling scientists to compare the molecular activity of healthy cells to diseased ones, cell by cell. But while scientists can see which genes are turned “on” or “off,” they still struggle to target the underlying mutations effectively.

In recent years, scientists have turned to AI approaches like machine learning to try to close this gap, but most machine learning tools still aren’t built for the job. As they’re built today, AI tools are pattern matchers, not meaning makers. The tools recognize that certain gene activity patterns correlate with disease, but can’t describe how the disease emerged, or what to do about it. That’s because traditional AI models don’t think like cells. They think like computers.

GREmLN — short for Gene Regulatory Embedding-based Large Neural model — doesn’t try to reshape biology to fit AI. Instead, it reshapes AI to fit biology.

Defining a new network

What sets GREmLN apart is how it integrates biological knowledge into the core of the model. Rather than guessing which genes matter based on arbitrary statistical association alone, GREmLN starts with something more powerful: gene regulatory networks, or GRNs.

GRNs are maps of influence. They describe which genes regulate others, such as the genes that encode transcription factors, which are proteins that activate or suppress the expression of other genes. These networks vary by cell type and are shaped by the cell’s function, identity, and environment.

Using these networks, the GREmLN team, led by New York Biohub president Andrea Califano with Columbia University graduate student Mingxuan Zhang, re-engineered the transformer attention mechanism that powers modern AI models and gave it a biological makeover. Instead of letting the model consider every possible gene combination, they constrained its attention to focus on gene pairs that are biologically plausible. This way, the model can simulate how information flows in a real cell, rather than grinding through millions of unlikely gene interactions.
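To make the idea of constrained attention concrete, here is a minimal sketch in Python with NumPy. It is not the GREmLN implementation. It simply masks standard scaled dot-product attention so that a gene can only attend to genes it is linked to in a regulatory graph. The function name `graph_masked_attention`, the toy four-gene adjacency matrix, the edge direction convention (row attends to column), and the choice to always allow self-attention are all assumptions made for illustration.

```python
import numpy as np

def graph_masked_attention(Q, K, V, adjacency):
    """Scaled dot-product attention restricted to edges of a gene graph.

    Q, K, V: arrays of shape (n_genes, d), one row per gene.
    adjacency: (n_genes, n_genes) array; a nonzero entry [i, j] means
    gene i is allowed to attend to gene j (an illustrative convention).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Allow self-attention so every row keeps at least one valid entry.
    allowed = adjacency.astype(bool) | np.eye(len(adjacency), dtype=bool)
    # Disallowed gene pairs get -inf, which becomes weight 0 after softmax.
    scores = np.where(allowed, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights

# Toy regulatory graph over four hypothetical genes.
adjacency = np.array([
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
])
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, weights = graph_masked_attention(Q, K, V, adjacency)
```

In this sketch, gene pairs with no edge between them receive exactly zero attention weight, so the model never spends capacity on them.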

To do this efficiently, GREmLN uses Chebyshev polynomials, a classic tool from approximation theory that is widely used in signal processing, to approximate how influence spreads across GRNs. This allows the model to incorporate long-range dependencies among genes without requiring massive computational power.
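For readers who want to see how Chebyshev polynomials can spread a signal across a network, the following Python sketch shows the standard graph-filtering recurrence. It is an illustration under stated assumptions, not GREmLN’s code: the function name `chebyshev_filter` is hypothetical, the graph is treated as undirected, the largest Laplacian eigenvalue is approximated as 2, and the coefficients `thetas` are fixed by hand rather than learned.

```python
import numpy as np

def chebyshev_filter(adjacency, signal, thetas):
    """Apply sum_k thetas[k] * T_k(L_scaled) to a signal on a graph.

    T_k is the k-th Chebyshev polynomial, computed with the recurrence
    T_k = 2 * L_scaled * T_{k-1} - T_{k-2}, so only matrix-vector
    products are needed and no eigendecomposition is performed.
    """
    A = np.asarray(adjacency, dtype=float)
    deg = A.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5  # guard isolated nodes
    # Normalized Laplacian L = I - D^{-1/2} A D^{-1/2}; with the
    # assumption lambda_max = 2, the rescaled operator is L - I.
    L_scaled = -(inv_sqrt[:, None] * A * inv_sqrt[None, :])

    t_prev = np.asarray(signal, dtype=float)  # T_0 x = x
    out = thetas[0] * t_prev
    if len(thetas) == 1:
        return out
    t_curr = L_scaled @ t_prev                # T_1 x = L_scaled x
    out = out + thetas[1] * t_curr
    for theta in thetas[2:]:
        t_next = 2 * (L_scaled @ t_curr) - t_prev
        out = out + theta * t_next
        t_prev, t_curr = t_curr, t_next
    return out

# Path graph 0-1-2-3-4 with a unit signal on node 0.
path = np.array([
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
])
x = np.array([1, 0, 0, 0, 0])
y = chebyshev_filter(path, x, thetas=[1.0, 1.0, 1.0])
```

With three coefficients (polynomial order 2), the signal placed on node 0 reaches node 2 but not nodes 3 or 4. Raising the order extends the reach by one hop per term, which is how a low-order polynomial can capture multi-step influence at modest cost.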

GREmLN joins a family of biomodels, called virtual cell models, developed by the Chan Zuckerberg Initiative to capture cell biology across molecular, cellular, and systems levels. These models are highly complementary and will range from more universal models, like TranscriptFormer, which is built to address virtually any problem in biology, to more specialized models, like GREmLN. These models will enable scientists to predict how biological systems operate and how to alter their possible future trajectories, accelerating the science for curing, preventing, and managing all diseases.

GREmLN focuses on the “molecular logic” that defines how genes interact and influence each other. This animation shows how the model uniquely captures gene interaction and the influence of master regulators — giving scientists a way to track the critical changes that pinpoint the earliest signs of disease and the possible targets for new treatments.

Training on the language of the cell

GREmLN was initially trained on approximately 11 million single-cell RNA sequencing profiles from 162 different cell types, spanning tissues like the brain, lung, kidney, and blood. All these data came from the Chan Zuckerberg CELLxGENE platform, an open science tool used by thousands of scientists every week that was developed by CZI to enable researchers to explore the inner workings of individual cells.

Instead of treating each gene as a token in a string, GREmLN builds an “embedding” — a rich vector representation — for each gene that captures not just its activity level, but its role in the broader network. These embeddings can then be used for a variety of downstream tasks, such as identifying the cell type of an unknown sample; reconstructing gene expression from a subset of observed genes; and predicting regulatory interactions in new, unseen cell types.

In head-to-head comparisons with state-of-the-art models like Geneformer, scGPT, and scFoundation, GREmLN outperformed them across multiple benchmarks, including a particularly tough challenge: predicting gene relationships in cancer-infiltrating immune cells, which often behave very differently from their healthy counterparts. More importantly, GREmLN used only one-third to one-tenth of the training profiles and parameters used by the other foundation models, making it a more nimble and efficient model.

What it means for biomedical research

Understanding which genes are active in a cancer cell is only the beginning. What researchers really want to know is: What went wrong? What gene caused this cascade of dysfunction? And more practically: Which genes can we target to reverse the damage?

GREmLN was built to leverage the internal logic of the cell with these questions in mind. By capturing how genes regulate one another across diverse conditions and cell types, the model could help researchers trace disease back to its origins. For example, given a malignant cell state, GREmLN could help identify which gene perturbation likely initiated the transformation — and how it might be undone.

This is especially valuable in immunotherapy. Immune cells, such as T cells and macrophages, receive incredibly specific molecular instructions that determine whether they patrol the bloodstream, attack a tumor, or suppress inflammation. By mapping these instruction sets with GREmLN, scientists could start to reprogram immune cells — guiding them more effectively to fight cancer, autoimmune disease, or infection.

One of the biggest challenges facing scientists when developing new drugs is figuring out which genes to target. GREmLN is designed to help identify the master regulator genes that are calling the shots. Think of them as a football coach drawing up the next play. By targeting these master regulators, researchers could design drugs that are more precise and effective, going after the root causes of disease rather than just the symptoms. And because GREmLN learns from how real cells behave in many different conditions, it could even help predict how targets shift as diseases like cancer evolve, for instance after a tumor has become drug resistant. It’s like having a smart guide that helps scientists aim their treatments exactly where and when they’ll matter most.

The road ahead

GREmLN is a stepping stone toward CZI’s grand challenge to harness the immune system for early detection, prevention, and treatment of disease. Over the next several years, the team aims to integrate richer layers of biological context beyond interactions that affect gene expression. Future iterations could include protein interactions and the interactions that support cell-cell communication, both of which are critical to studying the immune system. Once these additional layers are fully integrated, GREmLN may also help in other areas, like brain disease, inflammation, and immune disorders. Scientists could use it to study early changes in brain cells, predict how immune cells respond to illness, and simulate how cells might react to new drugs, all before running expensive lab tests.

The promise is clear. With models like GREmLN, we’re moving from descriptive biology to predictive biology: from mapping what is, to simulating what could be.

Researchers can access GREmLN on the virtual cell platform, including a quick start tutorial; the codebase on GitHub; and the preprint on bioRxiv. Read the press release for more information.
