Research Themes

Reproducible research

With the advent of large-scale and high-throughput data collection coupled with the creation and implementation of complex statistical algorithms for data analysis, the reproducibility of modern data analyses, meaning the ability of independent analysts to recreate the results claimed by the original authors using the original data and analysis techniques, has become an important topic of discussion. We contribute to developing reproducibility standards and best practices for the life sciences and in public health.

Single cell data science

Single-cell RNA-Sequencing (scRNA-seq) data has become the most widely used high-throughput method for transcription profiling of individual cells. This technology has created an unprecedented opportunity to investigate important biological questions that can only be answered at the single-cell level. However, this technology also brings new statistical and computational challenges.

To address these challenges, we have developed statistical and machine learning methods to:

Store, access, visualize, and analyze scRNA-seq data
Identify and to remove systematic errors and biases in scRNA-seq data
Identify features and perform dimensionality reduction for count-based data
Perform fast and memory-efficient unsupervised clustering scalable to millions of cells
Identify differences between biological groups or conditions using count-based models either at the gene and isoform level
Estimate the cell composition in heterogenous tissues using bulk RNA-sequencing

We have also written books to teach users how to make use of cutting-edge computational tools to process, analyze, visualize, and explore scRNA-seq data.

Spatial omics data science

Spatially-resolved transcriptomics (SRT), the Nature Methods 2020 Method of the Year and spatial proteomics, the Nature Methods 2024 Method of the Year, are poised to transform our understanding of how the spatial organization and communication of cells in complex tissues. The experimental technologies to generate these data are fast moving, but have generated lots of open questions and opportunities.

In this space, we have developed statistical and machine learning methods to:

Store, access, visualize, and analyze spatial transcriptomics data leveraging existing infrastructure borrowed from geomaps and geospatial file formats
Perform spatially-aware quality control to reduce the noise in downstream analyses
Identify spatially variable genes leveraging fast and scalable spatial statistics models
Remove the mean variance relationship in spatial transcriptomics data

Deep learning and AI

We are actively developing state-of-the-art deep learning algorithms using autoencoders, contrastive self-supervised learning, and transformers to improve downstream tasks in spatial omics.

Predict spatial domains for multi-modal spatial omics data
Integrate and align multiple tissue sections to create 3D spatial atlases
Fine tune sequence to function models for downstream spatial omics data analysis tasks

Open-source software

In addition to developing statistical methods, we care deeply about implementing our computational methods in open-source software. Bioconductor is an open-source, open-development software project in the R programming language for the analysis and comprehension of high-throughput genomics and molecular biology data. We contribute open-source software to CRAN and to Bioconductor, which helps make our computational methods broadly accessible to the research community.

Benchmarking and evaluting open-source software packages for their accuracy, usability and reproducibility is also important to us. We have led systematic benchmark comparsions of:

Computational methods that control the false discovery rate (FDR)
Computational methods for scRNA-seq imputation

Data science education

An increase in demand for statistics and data science education has led to changes in curriculum, specifically an increase in computing. While this has led to more applied courses, students still struggle with effectively deriving knowledge from data and solving real-world problems. In 1999, Deborah Nolan and Terry Speed argued the solution was to teach courses through in-depth case studies derived from interesting scientific questions with nontrivial solutions that leave room for different analyses of the data. This innovative framework teaches the student to make important connections between the scientific question, data and statistical concepts that only come from hands-on experience analyzing data.

Open Case Studies

To address this, we built the Open Case Studies (OCS) data science educational resource of case studies that educators can use in the classroom to teach students how to effectively derive knowledge from data. This project was selected as a High-Impact Project in 2019-2020 by the Bloomberg American Health Initiative and Bloomberg Philanthropies.

Analytic Design Theory

The data science revolution has led to an increased interest in the practice of data analysis. While much has been written about statistical thinking, a complementary form of thinking that appears in the practice of data analysis is design thinking – the problem-solving process to understand the people for whom a product is being designed.

Analytic Design Theory is the study of how data analyses are conducted in the real world taking into account both the perspectives of both the analyst and the stakeholder or consumer of the analysis. Aspects of this include characterizing variation between analyses, developing measures of quality, success, and usability, modeling the analytic process, and considering what makes data analyses more or less trustworthy. Analytic design theory is driven by the need to scale the training of data analysis and to continuously improve the quality of analyses.

Neuroscience

We part of PsychENCODE Project and have several active and ongoing collaborations in neuroscience.

We collaborate with the Peixoto lab and Risso lab to investigate how gene expression changes with sleep and cognitative impairements within Autism Spectrum Disorders.
We collaborate with the Zack lab and the Timp lab to generate and analyze long-read RNA-sequencing to investigate gene expression in retinal ganglion cells, the cells whose death in glaucoma leads to visual loss and potentially blindness.
We collaborate with researchers at the Lieber Institute for Brain Development, in particular the labs of Keri Martinowich, Kristen Maynard, and Stephanie Page, on a variety of projects leveraging scRNA-seq and spatial transcriptomics technologies. We use these technologies to profilie postmorteum human tissue at a single-cell and/or spatial resolution to better understand the molecular mechanisms of neuropsychiatric, neurodevelopmental, and neurodegenerative disorders in the human brain.

Cancer

We collaborate with Greene lab and Doherty lab studying biological basis of subtypes of high-grade serous ovarian cancers (HGSOC) using bulk RNA-seq, scRNA-seq, and spatial transcriptomics data. This is highly relevant to public health because HGSOC is a particularly deadly cancer that is often only identified at late stage and treatment options are limited. The long-term impact of this project will be a key step towards developing targeted treatments for HGSOCs.