Building a Measurable Evidence Engine for Cell, Gene, and RNA Therapies

One of the questions we find ourselves returning to most often is a simple one: why do some patients respond dramatically to a therapy while others don't, even when they appear clinically similar? For cell, gene, and RNA therapies, where interventions work at the molecular level, the answer almost always lives somewhere in the data we already have. The challenge is finding it.

That challenge is fundamentally about integration. Disease is a molecular phenomenon, and the story of how it develops and progresses is encoded across multiple layers of biological data: the somatic mutations that initiate or drive it, the transcriptomic changes that reflect its active state, the proteomic landscape that mediates its effects, the clinical trajectory that captures its outcome. No single layer tells the complete story. But when you bring them together and can actually hold all those layers in the same analytical space, patterns emerge that you simply cannot see any other way.

The Cost Of Fragmentation

In practice, this kind of integration is harder than it sounds. Mutation data, RNA-seq outputs, proteomic profiles, clinical outcomes, and patient demographics typically live in separate systems, linked only by a shared patient identifier. Each layer takes different expertise and different tools to analyze. Cross-modal integration is labor-intensive, and the reconciliation work (aligning formats, resolving missing values, tracing data provenance) can consume more time than the actual science.
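To make the reconciliation problem concrete, here is a minimal sketch of joining separate layers on a shared patient identifier while recording which layers each patient lacks. The layer names, patient IDs, values, and the `integrate` helper are all hypothetical and chosen only for illustration; real systems would involve far larger tables and stricter identifier matching.

```python
# Hypothetical per-layer records, each keyed by a shared patient identifier.
mutations = {"P001": {"CDH1": 1}, "P002": {"CDH1": 0}}
expression = {"P001": {"KIF1A": 7.2}, "P003": {"KIF1A": 5.1}}
clinical = {
    "P001": {"subtype": "ILC"},
    "P002": {"subtype": "IDC"},
    "P003": {"subtype": "IDC"},
}


def integrate(layers):
    """Outer-join named layers on patient ID and note which layers are missing."""
    patient_ids = sorted(set().union(*(layer.keys() for layer in layers.values())))
    merged = {}
    for pid in patient_ids:
        row = {"missing_layers": []}
        for name, layer in layers.items():
            if pid in layer:
                row[name] = layer[pid]
            else:
                row["missing_layers"].append(name)
        merged[pid] = row
    return merged


merged = integrate(
    {"mutations": mutations, "expression": expression, "clinical": clinical}
)

# Patients with every layer present are the only ones usable for cross-modal analysis.
complete = [pid for pid, row in merged.items() if not row["missing_layers"]]
print(complete)
```

Even in this toy case only one of three patients has all three layers, which is the kind of attrition that makes one-layer-at-a-time analysis tempting.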

The consequence is that most teams end up working one layer at a time, which is a bit like trying to understand a symphony by listening to the string section in isolation. You can hear something interesting. You're just missing most of what's going on.

A Case In Point: The Breast Cancer Survival Paradox

Consider a study we conducted with an academic cancer center, blinded pending publication, that illustrates what becomes possible when you dissolve these silos.

The research question was this: why do patients with invasive lobular carcinoma (ILC) show a 15% overall survival advantage over patients with the more common invasive ductal carcinoma (IDC) at five and ten years post-diagnosis, even though ILC presents as a more aggressive subtype at diagnosis? Clinicians have observed this paradox for decades. Clinical data alone had never resolved it.

We began by using our Omics Data Agent to build the initial clinical-genomic cohort, translating natural-language queries into SQL and Python to pull the relevant patient population without requiring a bioinformatics specialist at every step. Before committing to deeper analysis, we used the Cohort Browser to interactively explore the population, separating signal from noise in the clinical variables and validating that our cohort was well-characterized. We also ran the datasets through our Data Quality Manager to surface missingness and data quality characteristics at the column level, a step that's easy to skip but critical for ensuring downstream analysis is built on solid ground.
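As a rough illustration of those two steps, the sketch below runs the kind of SQL a natural-language cohort request might translate into, then reports column-level missingness for the result. It uses Python's built-in sqlite3 module with an in-memory table; the schema, the rows, and the `missingness` helper are assumptions made for this example and do not reflect the actual output of the Omics Data Agent or the Data Quality Manager.

```python
import sqlite3

# Hypothetical clinical table; the schema and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients "
    "(patient_id TEXT, subtype TEXT, er_status TEXT, age REAL, grade INTEGER)"
)
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?, ?)",
    [
        ("P001", "ILC", "positive", 61, 2),
        ("P002", "IDC", "negative", 48, None),
        ("P003", "ILC", None, 70, 2),
        ("P004", "IDC", "positive", None, 3),
        ("P005", "DCIS", "positive", 55, 1),
    ],
)

# SQL that a request like "all invasive lobular or ductal cases" might become.
cohort_sql = "SELECT * FROM patients WHERE subtype IN ('ILC', 'IDC')"
cursor = conn.execute(cohort_sql)
columns = [description[0] for description in cursor.description]
cohort = [dict(zip(columns, row)) for row in cursor.fetchall()]


def missingness(rows, column_names):
    """Return the fraction of NULL values in each column."""
    n = len(rows)
    return {c: sum(r[c] is None for r in rows) / n for c in column_names}


report = missingness(cohort, columns)
for column, fraction in report.items():
    print(f"{column}: {fraction:.0%} missing")
```

Checking missingness on the cohort rather than on the full table matters, because the variables that are sparse in the selected population are the ones that will bias downstream models.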

From there, we moved into the mutation layer, characterizing the tumor mutation landscape within the cancer center cohort and cross-referencing against TCGA as a public benchmark. The results were strongly concordant with established biology. CDH1, a tumor suppressor critical for epithelial cohesion, emerged as the dominant differentiator in the lobular subtype, alongside TBX3, CBFB, and RUNX1. ER-positive status was significantly more prevalent in ILC. The concordance with TCGA validated that what we were seeing was generalizable, not an artifact of a single institution's patient population.
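One common way to test whether a mutation is enriched in one subtype is Fisher's exact test on a 2x2 table of mutated versus wild-type counts. The source does not say which test was used in the study, so treat the sketch below as a stand-in. The counts are invented for illustration, and `fisher_exact_two_sided` is a small hand-written helper rather than a library function.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more likely than the observed one.
    # The small tolerance guards against floating-point ties.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Hypothetical counts: rows are subtypes, columns are mutated / wild-type.
ilc_mutated, ilc_wildtype = 30, 20
idc_mutated, idc_wildtype = 5, 95

p_value = fisher_exact_two_sided(ilc_mutated, ilc_wildtype, idc_mutated, idc_wildtype)
print(f"p = {p_value:.3g}")
```

Running the same test on the institutional cohort and on a public benchmark such as TCGA, then comparing which genes pass in both, is one simple way to express the concordance check described above.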

Then we moved to the transcriptomic layer. Using JupyterLab integrated directly within the platform, our team took raw sequencing files through optimized pipelines and into downstream visualization without context-switching between environments. Differential gene expression analysis, adjusting for stage, grade, metastasis, age, and treatment history, surfaced several robust differentiators: GDF9, members of the KLK family (KLK11, KLK12), and KIF1A, among others. Correlation with TCGA showed high concordance here as well, with the same signals appearing in both datasets and strengthening confidence that these molecular signatures were biologically meaningful rather than noise.
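Concordance between two datasets can be summarized by comparing per-gene log fold changes. The sketch below computes a Pearson correlation and the fraction of genes that change in the same direction. The gene labels and values are placeholders, not results from the study, and the `concordance` helper is an assumption about how such a check might be written.

```python
from math import sqrt

# Placeholder log2 fold changes (ILC vs. IDC) from two datasets; values are made up.
cohort_lfc = {"GENE_A": 1.8, "GENE_B": 1.2, "GENE_C": 1.5, "GENE_D": -2.1,
              "GENE_E": 0.3, "GENE_F": 0.9}
benchmark_lfc = {"GENE_A": 1.5, "GENE_B": 0.9, "GENE_C": 1.7, "GENE_D": -1.6,
                 "GENE_E": -0.2}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def concordance(a, b):
    """Compare fold changes over the genes measured in both datasets."""
    shared = sorted(set(a) & set(b))
    xs = [a[g] for g in shared]
    ys = [b[g] for g in shared]
    same_direction = sum((x > 0) == (y > 0) for x, y in zip(xs, ys))
    return {
        "n_shared": len(shared),
        "pearson_r": pearson(xs, ys),
        "sign_agreement": same_direction / len(shared),
    }


summary = concordance(cohort_lfc, benchmark_lfc)
print(summary)
```

Sign agreement is worth reporting alongside the correlation, since a few large fold changes can dominate a Pearson coefficient while weaker genes disagree in direction.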

The most interesting finding came from connectivity mapping against DrugMatrix, a publicly available database cataloging the gene expression effects of common pharmacological agents. When we mapped the ILC differential expression signature against known chemotherapy profiles, the ILC transcriptome aligned closely with two specific agents: vinorelbine (a microtubule inhibitor) and doxorubicin (an anthracycline that inhibits topoisomerase II).
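The core idea of connectivity mapping is to score how closely a query expression signature resembles each reference drug profile, then rank the drugs. The sketch below uses cosine similarity as a deliberately simple stand-in; it is not the scoring method used in the study, and published connectivity-map approaches typically rely on rank-based enrichment statistics instead. The gene labels, drug labels, and values are placeholders.

```python
from math import sqrt


def cosine_connectivity(query, profile):
    """Cosine similarity between two signatures over their shared genes."""
    shared = set(query) & set(profile)
    dot = sum(query[g] * profile[g] for g in shared)
    norm_query = sqrt(sum(query[g] ** 2 for g in shared))
    norm_profile = sqrt(sum(profile[g] ** 2 for g in shared))
    if norm_query == 0 or norm_profile == 0:
        return 0.0
    return dot / (norm_query * norm_profile)


# Placeholder disease signature (log fold changes); values are made up.
query_signature = {"GENE_A": 2.0, "GENE_B": 1.0, "GENE_C": -1.5}

# Placeholder drug-induced expression profiles; values are made up.
drug_profiles = {
    "drug_X": {"GENE_A": 1.8, "GENE_B": 0.7, "GENE_C": -1.2},
    "drug_Y": {"GENE_A": -1.0, "GENE_B": -0.5, "GENE_C": 2.0},
    "drug_Z": {"GENE_A": 0.1, "GENE_B": -0.2, "GENE_C": 0.1},
}

scores = {
    drug: cosine_connectivity(query_signature, profile)
    for drug, profile in drug_profiles.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
for drug in ranked:
    print(f"{drug}: {scores[drug]:+.3f}")
```

A strongly positive score means the drug pushes expression in the same direction as the query signature, and a strongly negative score means it pushes the opposite way. Which of the two is interesting depends on the question being asked.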

Put plainly: the ILC transcriptome appears molecularly pre-aligned with the mechanisms of action of these chemotherapy classes. These patients may carry something like an endogenous "pre-treatment effect" at the RNA level, a molecular predisposition that makes them more responsive to standard chemotherapy regimens before treatment even begins. It's a hypothesis that still needs validation, but it's a strong candidate explanation for a survival advantage that has puzzled oncologists for decades. And it emerged only because we could interrogate mutation data, transcriptomics, and pharmacological connectivity in a single analysis.

From Observation To Mechanism: Causal AI

Observational analysis generates powerful hypotheses. But for cell, gene, and RNA therapy programs, where the goal is to identify and validate precise molecular targets, hypothesis generation isn't the endpoint. That's where causal AI becomes essential.

Our partner Aitia applies their REFS platform (Reverse Engineering and Forward Simulation) to this challenge, using probabilistic causal mathematics grounded in Judea Pearl's work. Correlations tell you that two things move together, but not which drives which. Rather than stopping at correlation, REFS reconstructs causal network structures from multiomic data at genome-wide scale, then simulates the effect of perturbing any node in the network. It's a kind of in silico loss-of-function experiment, run across thousands of potential targets simultaneously.
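The general idea of simulating a perturbation on a causal network can be shown with a toy linear model. The sketch below is not REFS and makes no claim about how that platform works internally: the graph is hand-written rather than learned from data, the weights are arbitrary, and the node names and the `simulate` and `knockdown_effects` helpers are hypothetical. It only illustrates an intervention in Pearl's sense, where a node is forced to a value and cut off from its parents before effects propagate downstream.

```python
# Hypothetical causal graph: each node maps to {parent: linear weight}.
graph = {
    "gene_A": {},
    "gene_B": {"gene_A": 0.8},
    "gene_C": {},
    "survival": {"gene_B": -0.5, "gene_C": 0.3},
}
topological_order = ["gene_A", "gene_B", "gene_C", "survival"]
baseline_inputs = {"gene_A": 1.0, "gene_C": 1.0}  # exogenous activity levels


def simulate(graph, order, exogenous, do=None):
    """Propagate values through the graph, optionally forcing some nodes."""
    do = do or {}
    values = {}
    for node in order:
        if node in do:
            # An intervention overrides the node and ignores its parents.
            values[node] = do[node]
        else:
            values[node] = exogenous.get(node, 0.0) + sum(
                weight * values[parent] for parent, weight in graph[node].items()
            )
    return values


def knockdown_effects(graph, order, exogenous, outcome):
    """Change in the outcome when each non-outcome node is forced to zero."""
    baseline = simulate(graph, order, exogenous)[outcome]
    return {
        node: simulate(graph, order, exogenous, do={node: 0.0})[outcome] - baseline
        for node in order
        if node != outcome
    }


effects = knockdown_effects(graph, topological_order, baseline_inputs, "survival")
for node, delta in effects.items():
    print(f"knock down {node}: survival changes by {delta:+.2f}")
```

In this toy graph, knocking down gene_A changes survival only through gene_B, so the two knockdowns have the same effect on the outcome. A purely correlational analysis would flag both genes without distinguishing the upstream driver from the mediator.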

In a multiple myeloma study using the MMRF CoMMpass cohort (N > 500 patients with full multiomic profiling), Aitia built a digital twin integrating demographics, cytometry, lab data, treatment history, genomic variation, and RNA-seq data across protein-coding genes, microRNAs, and long non-coding RNAs. Through simulated knockdowns across the entire network, the team identified 102 genes causally driving overall survival. Validation against DepMap CRISPR knockout data found that 73% of the protein-coding driver genes were dependencies in MM cell lines (genes the cells need in order to survive or proliferate), a statistically significant enrichment over chance.

One apparent exception is worth noting. PHF19, a gene known to drive MM cell proliferation, was identified as a survival driver in the digital twin but showed no dependency in DepMap cell lines. The contradiction resolved when the causal network revealed a modifier: 1q21 chromosomal amplification, present in a specific patient subpopulation, substantially alters PHF19's effect on survival. The DepMap cell lines simply weren't representative of that subgroup. This is precisely the kind of context-dependent, patient-population-specific insight that causal models can surface and that correlation-based approaches routinely flatten or miss.
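The logic of a modifier can be seen in a small stratified comparison. In the sketch below the records, the outcome values, and the `driver_effect` helper are all invented for illustration and are not data from the study. The driver is associated with the outcome only when the modifier is present, so a sample that lacks the modifier shows no effect at all.

```python
from statistics import mean

# Hypothetical records: (driver_high, modifier_present, outcome in months).
records = [
    (True, True, 20), (True, True, 24),
    (False, True, 40), (False, True, 44),
    (True, False, 50), (True, False, 54),
    (False, False, 50), (False, False, 54),
]


def driver_effect(rows):
    """Mean outcome with high driver activity minus mean outcome with low."""
    high = [outcome for driver_high, _, outcome in rows if driver_high]
    low = [outcome for driver_high, _, outcome in rows if not driver_high]
    return mean(high) - mean(low)


effect_pooled = driver_effect(records)
effect_with_modifier = driver_effect([r for r in records if r[1]])
effect_without_modifier = driver_effect([r for r in records if not r[1]])

print("pooled:", effect_pooled)
print("modifier present:", effect_with_modifier)
print("modifier absent:", effect_without_modifier)
```

A model system drawn entirely from the modifier-absent stratum would report no dependency, while the pooled estimate understates the effect in the subgroup where it matters.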

Collaboration Without Compromise

One recurring challenge in this work is how to collaborate with academic medical centers or clinical partners without creating unacceptable data governance risk. Cancer centers in particular are often unwilling, or legally unable, to transfer patient-level data externally.

The breast cancer study above was conducted entirely within a Trusted Research Environment (TRE) established within the partner cancer center's own infrastructure. Our team accessed the full analytical platform from outside without the data ever leaving the institution. This "bring the compute to the data" model enables deep, integrated analysis while satisfying even the most stringent data residency requirements, and it generates a complete audit trail, which matters increasingly as multimodal analyses find their way into regulatory submissions and IND packages.

What This Means For Advanced Therapy R&D

The pace at which cell, gene, and RNA therapies are advancing means the analytical bar is rising continuously. Identifying the right patient populations, understanding which molecular features predict response, building the biomarker evidence that supports regulatory submissions: all of this requires being able to hold multiple layers of data in the same place and ask integrated questions across them.

The good news is that the data already exists. The analytical methods, from differential expression to connectivity mapping to causal AI, are maturing rapidly. What's left is the infrastructure to unite them in a secure, reproducible, collaborative environment, and the willingness to treat integration as a core scientific capability rather than an afterthought.

The patterns are there. You just have to be able to look at everything at once.

To learn more about how DNAnexus supports biopharma translational research and advanced therapy development, request a demo or explore our Biopharma solutions page.