The Data Science team develops open-source tools, datasets, and resources to enable scientific discovery and measure human biology.


AIRRscape

Technological advances in next-generation sequencing have allowed for broad experimental sampling of immune repertoires, providing insight into how our immune system responds to infection, vaccination, autoimmunity, and cancer. However, the scale of these data can make it difficult to bioinformatically extract the key sequence features that are shared across multiple repertoires.

To address this, we built AIRRscape, an open-source R Shiny tool for interactively visualizing and analyzing antibody repertoires. AIRRscape lets users import and visualize their own AIRR-compliant repertoire data, or explore pre-loaded datasets, in any web browser. Get the code on GitHub.

Access AIRRscape

COVID Tissue Atlas

COVID-19 is the most devastating infectious disease in recent history. The pandemic has impacted all parts of the globe and resulted in over 6 million deaths. The systemic effects of severe COVID-19 are largely mediated through the immune response to SARS-CoV-2 infection and subsequent inflammatory response. A multi-organ approach is necessary to improve our understanding of the cellular and molecular mechanisms that drive severe COVID-19 and lead to damage to different organs and tissues.

We used single-cell transcriptomics to analyze six organs from 15 COVID-positive and five healthy autopsies. Remarkably, through a multi-organ analysis of differential expression and pathway enrichment, we found common transcriptional responses in endothelial cells and macrophages across multiple organs from COVID autopsies. We also identified potential ligand-receptor interactions between these two cell types as targets of signal transduction mechanisms in COVID-19. More generally, our computational efforts provide a basis for analyzing the responses of individual cells while considering the global context of the human body.

Access the COVID Tissue Atlas


Datahub

Datahub is dedicated to producing simple, scalable software that helps the Biohub pursue scientific insight and advancement. The work consists of three main projects: a centralized data portal that integrates and exposes multimodal datasets generated by researchers, modularized and shareable bioinformatics pipelines that improve the quality of analysis, and sample submission systems that track analyses and their associated metadata.

Access the Datahub




exCellxGene

Single-cell transcriptomics analysis requires iterating on pre-processing and cell-type annotation. Being able to quickly check the pre-processing and its impact on cell-type annotation, and on features such as differential gene expression, is key to streamlining this iterative process.

We built Exploratory CellxGene (or exCellxGene) as an extension of CZ CellxGene to assist researchers from the beginning of single-cell omics data analysis (pre-processing and filtering of data, then visualization and annotation) all the way to fast differential gene expression computation.
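To make the idea of checking differential gene expression after pre-processing concrete, here is a minimal sketch in plain Python. It is not exCellxGene's implementation: the function names, the toy counts, the gene names, and the choice of total-count normalization followed by a difference in mean log expression are all assumptions made for illustration.

```python
import math

def normalize_log1p(cell_counts, target_sum=10_000):
    """Scale one cell's gene counts to a fixed total, then apply log(1 + x)."""
    total = sum(cell_counts.values())
    if total == 0:
        return {gene: 0.0 for gene in cell_counts}
    return {gene: math.log1p(count * target_sum / total)
            for gene, count in cell_counts.items()}

def mean_expression(cells):
    """Average normalized expression per gene across a group of cells."""
    genes = cells[0].keys()
    return {gene: sum(cell[gene] for cell in cells) / len(cells) for gene in genes}

def rank_genes(group_a, group_b):
    """Rank genes by the difference in mean log-normalized expression (A minus B)."""
    mean_a = mean_expression([normalize_log1p(cell) for cell in group_a])
    mean_b = mean_expression([normalize_log1p(cell) for cell in group_b])
    diffs = {gene: mean_a[gene] - mean_b[gene] for gene in mean_a}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)

# Hypothetical toy data: two cells per group, three made-up genes.
group_a = [{"geneX": 30, "geneY": 5, "geneZ": 65},
           {"geneX": 40, "geneY": 10, "geneZ": 50}]
group_b = [{"geneX": 2, "geneY": 8, "geneZ": 90},
           {"geneX": 1, "geneY": 9, "geneZ": 90}]

ranking = rank_genes(group_a, group_b)
top_gene, top_score = ranking[0]
print(f"Most enriched in group A: {top_gene} ({top_score:.2f})")
```

Changing a pre-processing choice such as `target_sum` and re-running the ranking is the kind of quick check the iterative workflow described above depends on. A real analysis would also apply a statistical test rather than rank on mean differences alone.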

Access exCellxGene

Napari plug-ins

Napari is an easy-to-use open-source image visualization tool for complex scientific datasets.

We have built napari plug-ins that annotate malaria-infected cells, perform polygonal cropping on multi-scale images, visualize large multimodal MERFISH (Multiplexed Error-Robust Fluorescence In Situ Hybridization) datasets, and interactively perform annotation and comparative phenotyping of perturbed cells.

Access napari plug-ins


ortho_seqs

ortho_seqs is a Python software tool that quantifies higher-order sequence-phenotype interactions based on the previously published multivariate tensor-based orthogonal polynomial method applied to biological sequences.

The tool is a packaged command-line utility, installable via PyPI or from GitHub. It comes with an easy-to-use graphical user interface (GUI) and extensive documentation so that the community can use it to explore sequence-phenotype relationships.

Access ortho_seqs

Spatial Transcriptomics

Spatial transcriptomics extends single-cell RNA sequencing (scRNA-seq) by providing spatial context for cell type identification and analysis. Imaging-based spatial technologies such as Multiplexed Error-Robust Fluorescence In Situ Hybridization (MERFISH) can achieve unprecedented spatial resolution for applications such as directly mapping single-cell identities to positions within a tissue.

At CZ Biohub, we are utilizing the Vizgen MERSCOPE system to conduct single-molecule spatial transcriptomics for investigating fundamental cell biology, disease states, and embryonic development. By measuring RNA count statistics for hundreds of genes, we can accurately identify cell types within their native spatial context and conduct next-generation bioinformatic analysis.
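As a rough illustration of how per-gene RNA counts can be tied back to spatial position, the sketch below tallies detected transcripts per cell, computes each cell's centroid, and assigns a label from a marker panel. It is a toy example and not the MERSCOPE pipeline: the transcript tuples, gene names, marker panel, and scoring rule are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical detected transcripts: (gene, x, y, cell_id).
transcripts = [
    ("geneA", 1.0, 1.0, "cell1"),
    ("geneA", 1.2, 0.8, "cell1"),
    ("geneB", 0.9, 1.1, "cell1"),
    ("geneB", 5.0, 5.0, "cell2"),
    ("geneB", 5.2, 4.8, "cell2"),
    ("geneB", 4.8, 5.2, "cell2"),
]

# Hypothetical marker panel: cell type -> marker genes.
markers = {"typeA": ["geneA"], "typeB": ["geneB"]}

def summarize_cells(transcripts):
    """Build per-cell gene counts and per-cell centroids from transcript detections."""
    counts = defaultdict(Counter)
    positions = defaultdict(list)
    for gene, x, y, cell_id in transcripts:
        counts[cell_id][gene] += 1
        positions[cell_id].append((x, y))
    centroids = {
        cell_id: (sum(x for x, _ in pts) / len(pts),
                  sum(y for _, y in pts) / len(pts))
        for cell_id, pts in positions.items()
    }
    return counts, centroids

def assign_cell_type(gene_counts, markers):
    """Pick the cell type whose marker genes have the highest total count."""
    scores = {cell_type: sum(gene_counts[gene] for gene in genes)
              for cell_type, genes in markers.items()}
    return max(scores, key=scores.get)

counts, centroids = summarize_cells(transcripts)
labels = {cell_id: assign_cell_type(c, markers) for cell_id, c in counts.items()}

for cell_id, label in labels.items():
    x, y = centroids[cell_id]
    print(f"{cell_id}: {label} at ({x:.2f}, {y:.2f})")
```

Keeping the centroid alongside the label is what preserves the spatial context: each identified cell type can be placed back at its position in the tissue.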

Access the resource


Zebrahub

Zebrahub is a multimodal, single-cell RNA sequencing atlas of vertebrate development at single-embryo resolution, using zebrafish as a model organism. The first Zebrahub dataset, of approximately 120,000 cells, spans 10 developmental stages, from end-of-gastrulation embryos to 10-day larvae (bud stage; 5-, 10-, 15-, 20-, and 30-somite stages; and 2, 3, 5, and 10 days post-fertilization). Four embryos were sequenced per time point. We strive for the highest possible quality, and we expect the Zebrahub dataset to evolve to include more stages and higher data resolution. We collaborate with Biohub’s Royer Group and Genomics Platform on this work.

Access Zebrahub