7 Workshop

7.1 Overview

The goal of this workshop is to build an analysis workflow for an example single-cell RNA-seq dataset.

7.2 Data

The scRNAseq package provides convenient access to several publicly available datasets in the form of SingleCellExperiment objects. The package focuses on datasets that cannot easily be read into R with a one-liner such as read_csv(). Instead, the necessary data munging is already done, so users only need to call a single function to obtain a well-formed SingleCellExperiment.

library(scRNAseq)

To see the list of available datasets, use the listDatasets() function:

out <- listDatasets() 
out

DataFrame with 61 rows and 5 columns
                 Reference  Taxonomy               Part    Number
               <character> <integer>        <character> <integer>
1   @aztekin2019identifi..      8355               tail     13199
2   @bach2017differentia..     10090      mammary gland     25806
3           @bacher2020low      9606            T cells    104417
4     @baron2016singlecell      9606           pancreas      8569
5     @baron2016singlecell     10090           pancreas      1886
...                    ...       ...                ...       ...
57    @zeisel2018molecular     10090     nervous system    160796
58     @zhao2020singlecell      9606 liver immune cells     68100
59    @zhong2018singlecell      9606  prefrontal cortex      2394
60  @zilionis2019singlec..      9606               lung    173954
61  @zilionis2019singlec..     10090               lung     17549
                      Call
               <character>
1        AztekinTailData()
2        BachMammaryData()
3        BacherTCellData()
4   BaronPancreasData('h..
5   BaronPancreasData('m..
...                    ...
57     ZeiselNervousData()
58   ZhaoImmuneLiverData()
59   ZhongPrefrontalData()
60      ZilionisLungData()
61  ZilionisLungData('mo..
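Because the listing is an ordinary DataFrame, it can be filtered like any other table. The sketch below assumes the column layout printed above (in particular that Number holds the cell count) and keeps only the datasets with more than 5,000 cells; the variable name big is just illustrative.

```r
library(scRNAseq)

# Listing of available datasets, as shown above
out <- listDatasets()

# Keep datasets with more than 5,000 cells (assumes `Number` is the cell count)
big <- out[out$Number > 5000, ]

# Sort from smallest to largest so the lighter downloads come first
big <- big[order(big$Number), ]

# Show only the columns needed to choose a dataset and load it
big[, c("Part", "Number", "Call")]
```

The printed table may truncate long entries in the Call column; printing big$Call shows each call in full.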

You can load a dataset by calling the function listed in its Call column:

sce <- ZeiselBrainData()
sce

class: SingleCellExperiment 
dim: 20006 3005 
metadata(0):
assays(1): counts
rownames(20006): Tspan12 Tshz1 ... mt-Rnr1 mt-Nd4l
rowData names(1): featureType
colnames(3005): 1772071015_C02 1772071017_G12 ... 1772066098_A12
  1772058148_F03
colData names(10): tissue group # ... level1class level2class
reducedDimNames(0):
mainExpName: endogenous
altExpNames(2): ERCC repeat
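The fields in this printout can also be queried individually. The following is a minimal sketch using standard SingleCellExperiment accessors; the column names tissue and level1class are taken from the colData names shown above.

```r
library(scRNAseq)

sce <- ZeiselBrainData()

# Rows are genes and columns are cells
dim(sce)

# Names of the stored assays (only "counts" in the printout above)
assayNames(sce)

# Per-cell annotation: first few rows of two of the colData columns
head(colData(sce)[, c("tissue", "level1class")])

# Number of cells in each annotated cell class
table(sce$level1class)

# Alternative experiments (spike-ins and repeats) are stored separately
altExpNames(sce)
```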

7.3 Tasks

1. Pick a scRNA-seq dataset that has more than 5,000 cells and load the SingleCellExperiment (or sce) object.
2. Show the number of genes and the number of observations in the sce object.
3. Using the material we learned in the lecture, analyze the scRNA-seq data using the Bioconductor packages we learned about. This should include (but not be limited to):
   - Quality control (you must use at least two different QC metrics)
   - Normalization
   - Feature selection using highly variable genes
   - Dimensionality reduction using PCA
   - Data visualization using tSNE or UMAP
   - Unsupervised clustering (your choice of method!)
4. At the end of your analysis, show (i) the PCA plot and (ii) either the tSNE or UMAP plot, with points colored by the cluster labels predicted by the clustering algorithm.
5. For each component in Task 3, write 3-4 sentences that name the method you used, describe the idea behind it, and interpret its output.
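As a starting point, the components listed in the third task could be chained together roughly as follows. This is only one possible sketch, not a reference solution: it assumes the scuttle, scran, scater and uwot packages are installed, the helper name run_workflow_sketch and its parameters are hypothetical, and every step uses simple settings that you may well want to replace. It is demonstrated on simulated data from scuttle::mockSCE() so that it runs without downloading anything.

```r
library(scuttle)  # perCellQCMetrics(), isOutlier(), logNormCounts(), mockSCE()
library(scran)    # modelGeneVar(), getTopHVGs(), clusterCells()
library(scater)   # runPCA(), runUMAP(), plotReducedDim()

# Hypothetical helper: one simple choice for each step, applied to a
# SingleCellExperiment that has a "counts" assay.
run_workflow_sketch <- function(sce, n_hvgs = 2000, n_pcs = 25) {
    # Quality control with two metrics: drop cells whose library size or
    # number of detected genes is an outlier on the low side.
    qc <- perCellQCMetrics(sce)
    low_lib  <- isOutlier(qc$sum, log = TRUE, type = "lower")
    low_gene <- isOutlier(qc$detected, log = TRUE, type = "lower")
    sce <- sce[, !(low_lib | low_gene)]

    # Normalization: library-size factors followed by a log-transformation.
    sce <- logNormCounts(sce)

    # Feature selection: model the mean-variance trend and keep the genes
    # with the largest biological component of variance.
    dec  <- modelGeneVar(sce)
    hvgs <- getTopHVGs(dec, n = n_hvgs)

    # Dimensionality reduction: PCA on the selected genes only.
    sce <- runPCA(sce, subset_row = hvgs, ncomponents = n_pcs)

    # Visualization: UMAP computed from the PCA coordinates.
    sce <- runUMAP(sce, dimred = "PCA")

    # Clustering: graph-based clustering on the PCA coordinates
    # (the default method of clusterCells()).
    colLabels(sce) <- clusterCells(sce, use.dimred = "PCA")
    sce
}

set.seed(100)  # UMAP is stochastic, so fix the seed for reproducibility
demo <- run_workflow_sketch(mockSCE())

# Both plots colored by the predicted cluster labels
plotReducedDim(demo, dimred = "PCA", colour_by = "label")
plotReducedDim(demo, dimred = "UMAP", colour_by = "label")
```

For a real dataset, replace mockSCE() with the object you loaded. Whether these QC thresholds, the number of highly variable genes, or the number of principal components suit your data is something to check and justify in your write-up.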

# Add your solution here

7.3.1 Useful tips

If the original dataset was not provided with Ensembl annotation, we can map the gene identifiers to Ensembl IDs by setting ensembl=TRUE. Any genes without a corresponding Ensembl identifier are discarded from the dataset.

sce <- ZeiselBrainData(ensembl=TRUE)

Warning: Unable to map 1565 of 20006 requested IDs.

head(rownames(sce))

[1] "ENSMUSG00000029669" "ENSMUSG00000046982" "ENSMUSG00000039735"
[4] "ENSMUSG00000033453" "ENSMUSG00000046798" "ENSMUSG00000034009"
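To see how many genes the mapping discards, the two versions of the dataset can be compared directly. This is a minimal sketch; the object names sce_symbol and sce_ensembl are illustrative, and the conversion may require additional annotation packages to be installed.

```r
library(scRNAseq)

# Same dataset, once with the original gene names and once with Ensembl IDs
sce_symbol  <- ZeiselBrainData()
sce_ensembl <- suppressWarnings(ZeiselBrainData(ensembl = TRUE))

# Number of genes before and after the conversion
nrow(sce_symbol)
nrow(sce_ensembl)

# Genes lost because they had no usable Ensembl identifier
nrow(sce_symbol) - nrow(sce_ensembl)

# Check that every remaining row name looks like a mouse Ensembl gene ID
all(grepl("^ENSMUSG", rownames(sce_ensembl)))
```

Note that suppressWarnings() hides the "Unable to map" warning shown above; drop it if you want to see that message.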