This notebook will show you how to work with the C. elegans L2-stage sci-RNA-seq data from Cao et al., 2017.
It aims to cover the following use cases:
You'll need the R packages listed below.
This vignette was written using Monocle version 2.3.5, which is the version that was used for the paper. The source code of Monocle 2.3.5 is provided as a supplementary data file for the paper. You can install it using the command:
R CMD INSTALL monocle_2.3.5.tar.gz
Later versions of Monocle may produce different results for use case 4 (re-clustering with t-SNE) due to changes in how we preprocess the data before running t-SNE. We will update this vignette when we release the next version of Monocle, 2.6.0, to support the new version.
suppressPackageStartupMessages({
library(dplyr)
library(ggplot2)
library(monocle)
})
This RData file contains both the data and some utility functions to help navigate it.
download.file(
"http://jpacker-data.s3.amazonaws.com/public/Cao_et_al_2017_vignette.RData",
destfile = "Cao_et_al_2017_vignette.RData")
load("Cao_et_al_2017_vignette.RData")
The raw gene-by-cell UMI (unique molecular identifier) count matrix for a CellDataSet can be accessed using the exprs
function. It is stored as a sparse matrix object (".
" = 0)
Note that if you need "even rawer" data (FASTQ files), they are available at the Gene Expression Omnibus (www.ncbi.nlm.nih.gov/geo) under accession code GSE98561.
If you aren't familiar with working with single cell RNA-seq data, we highly recommend that you take a look at the examples and utility functions presented in the other sections of this document instead of trying to dive in to the raw data directly.
exprs(cds)[1:3, 1:3]
The fData
function is used to access gene annotations.
fData(cds)[1:3,]
The pData
function is used to access cell annotations.
pData(cds)[1:3,]
neuron.type in pData(cds.neurons) is the annotation used in Figure 4 of Cao et al.
pData(cds.neurons)[1:3,]
The show.expr.info
function returns statistics related to the expression of a given gene in tabular form. The first argument is the gene name. The second argument specifies whether to show statistics at the level of tissues, cell types, or neuron (sub)-types. See the examples below.
The function returns a data frame with five columns:
Note that low levels of expression of a given gene in cell types you would not expect may be due to leakage of RNA in the fixed cells. For this experiment, based on the expression patterns for a few marker genes, we estimate that the background level of expression you will observe randomly in cells that do not truly express a given gene is ~1-2% of the expression of the highest-expressing cell type.
show.expr.info("emb-9", "tissue")
show.expr.info("emb-9", "cell type") %>% head(10)
show.expr.info("R102.2", "neuron type") %>% head(10)
Note that you can see the source code for any of the utility functions used in this vignette. Just run the function with name no parentheses.
show.expr.info
The plot.expr
function can be used to highlight cells that express a given gene on the t-SNE map.
plot.expr(cds, "lin-12")
plot.expr(cds.neurons, "che-3")
We've defined a function two.set.differential.gene.test
that finds and reports statistics on differentially expressed genes (DEG) that distinguish between two defined sets of cells. This is a wrapper that just adds a bit of functionality around Monocle's differentialGeneTest
function.
two.set.differential.gene.test
takes four arguments:
Warning: if you run two.set.differential.gene.test
on large sets of cells (>10000-ish), it may take a bunch of memory.
The utility functions is.tissue
, is.cell.type
, and is.neuron.type
may be used to create the boolean vectors required for the set.1.filter
and set.2.filter
parameters. Each function takes a CellDataSet and a string and tests for each cell whether its tissue / cell type / neuron type is defined (not NA) and equal to the given value.
is.tissue
head(is.tissue(cds, "Gonad"))
sum(is.tissue(cds, "Gonad"))
sum(is.cell.type(cds, "Distal tip cells"))
sum(is.neuron.type(cds.neurons, "ASEL"))
tissues
cell.types
neuron.types
ASEL.vs.ASER.DEG = two.set.differential.gene.test(
cds.neurons, is.neuron.type(cds.neurons, "ASEL"), is.neuron.type(cds.neurons, "ASER"))
two.set.differential.gene.test
returns the following statistics:
ASEL.vs.ASER.DEG %>% head()
ASEL.vs.ASER.DEG %>% filter(higher.expr == "Set 1") %>% head()
Running two.set.differential.gene.test
with formal = T
will compute q-values (the false detection rate at which a gene can be considered to be differentially expressed between the two sets). This is very slow however, so we recommend not using it for exploratory analysis and only using it after you've found something interesting and want to verify that the finding is statistically robust.
system.time({
ASEL.vs.ASER.DEG = two.set.differential.gene.test(
cds.neurons, is.neuron.type(cds.neurons, "ASEL"), is.neuron.type(cds.neurons, "ASER"),
formal = T, cores = min(16, detectCores()))
})
ASEL.vs.ASER.DEG %>% head()
two.set.differential.gene.test
can be used to compare a specific cell type vs. all other cells.
touch.receptor.neuron.DEG = two.set.differential.gene.test(
cds,
is.cell.type(cds, "Touch receptor neurons"),
!is.cell.type(cds, "Touch receptor neurons") & is.tissue(cds, "Neurons"))
touch.receptor.neuron.DEG %>% filter(higher.expr == "Set 1") %>% head()
tni-3 is a troponin that is expressed in body wall muscle (BWM) in the head, but not in the posterior.
cwn-1 and egl-20 are Wnt ligands that are expressed in posterior BWM, but not anterior BWM.
Using these genes as markers, let's look for differentially expressed genes between anterior and posterior BWM.
BWM.anterior.vs.posterior.DEG = two.set.differential.gene.test(
cds,
is.cell.type(cds, "Body wall muscle") & expresses.gene(cds, "tni-3"),
is.cell.type(cds, "Body wall muscle") & (expresses.gene(cds, "cwn-1") | expresses.gene(cds, "egl-20")))
BWM.anterior.vs.posterior.DEG$higher.expr = ifelse(
BWM.anterior.vs.posterior.DEG$higher.expr == "Set 1",
"Anterior BWM", "Posterior BWM")
BWM.anterior.vs.posterior.DEG %>% head()
BWM.anterior.vs.posterior.DEG %>% filter(precision >= 0.95, higher.expr == "Anterior BWM") %>% head(10)
R102.2 is expressed in nine specific pairs of ciliated sensory neurons (http://www.wormbase.org/species/c_elegans/gene/WBGene00011289). We have identified some of these in this dataset, but others remain ambiguous.
show.expr.info("R102.2", "neuron type") %>% head(8)
plot.expr(cds.neurons, "R102.2")
two.set.differential.gene.test
can be used to find DEG between the mystery R102.2(+) clusters and other ciliated sensory neurons. If you can figure out what these clusters correspond to, let us know!
neuron.cluster.21.vs.other.CSN.DEG = two.set.differential.gene.test(
cds.neurons,
is.neuron.type(cds.neurons, "Cluster 21"),
!is.neuron.type(cds.neurons, "Cluster 21") & is.cell.type(cds.neurons, "Ciliated sensory neurons"))
neuron.cluster.21.vs.other.CSN.DEG$higher.expr = ifelse(
neuron.cluster.21.vs.other.CSN.DEG$higher.expr == "Set 1",
"Cluster 21", "Other CSN")
neuron.cluster.21.vs.other.CSN.DEG %>% filter(higher.expr == "Cluster 21", precision >= 0.8) %>% head(10)
neuron.cluster.16.vs.other.CSN.DEG = two.set.differential.gene.test(
cds.neurons,
is.neuron.type(cds.neurons, "Cluster 16"),
!is.neuron.type(cds.neurons, "Cluster 16") & is.cell.type(cds.neurons, "Ciliated sensory neurons"))
neuron.cluster.16.vs.other.CSN.DEG$higher.expr = ifelse(
neuron.cluster.16.vs.other.CSN.DEG$higher.expr == "Set 1",
"Cluster 16", "Other CSN")
neuron.cluster.16.vs.other.CSN.DEG %>% filter(higher.expr == "Cluster 16") %>% head(10)
In the Cao et al. paper, we found that several of the neuron t-SNE clusters expressed markers of cholinergic neurons such as unc-17, cho-1, and cha-1. Let's perform a sub-clustering of these cholinergic neurons to see if we can get better separation. First, we'll identify which clusters are enriched for cells that express cholinergic markers.
pData(cds.neurons)$any.cholinergic.marker =
(expresses.gene(cds.neurons, "unc-17") +
expresses.gene(cds.neurons, "cho-1") +
expresses.gene(cds.neurons, "cha-1")) > 0
pData(cds.neurons) %>%
group_by(Cluster) %>%
summarize(
n.total = n(), n.cholinergic = sum(any.cholinergic.marker),
prop.cholinergic = n.cholinergic / n.total) %>%
inner_join(
unique(pData(cds.neurons)[, c("Cluster", "neuron.type")]),
by = "Cluster") %>%
arrange(-prop.cholinergic) %>%
head(15)
cholinergic.clusters = c(29, 23, 3, 26, 35, 36, 15, 24, 11)
plot_cell_clusters(cds.neurons, color = "Cluster %in% cholinergic.clusters", cell_size = 0.2)
cds.cholinergic = cds.neurons[, pData(cds.neurons)$Cluster %in% cholinergic.clusters]
cat(ncol(cds.cholinergic), "cells in the cds subset", "\n")
cds.cholinergic = estimateSizeFactors(cds.cholinergic)
cds.cholinergic = estimateDispersions(cds.cholinergic) # would take a lot of memory for larger cell sets
cds.cholinergic = detectGenes(cds.cholinergic, 0.1)
The next step is to run a new t-SNE dimensionality reduction on this subset of cells.
system.time({
cds.cholinergic = reduceDimension(
cds.cholinergic, max_components = 2, norm_method = "log",
num_dim = 20, reduction_method = 'tSNE', verbose = T)
})
pData(cds.cholinergic)$tsne_1 = reducedDimA(cds.cholinergic)[1,]
pData(cds.cholinergic)$tsne_2 = reducedDimA(cds.cholinergic)[2,]
20 principal components looks like its enough for this data. If you don't see an elbow in the scree plot, that means you've used too few principal components.
plot_pc_variance_explained(cds.cholinergic)
ggplot(pData(cds.cholinergic), aes(x = tsne_1, y = tsne_2)) +
geom_point(size = 0.1) +
monocle:::monocle_theme_opts()
Now we'll cluster the cells in the t-SNE space using density peak clustering. In density peak clustering, rho
measures to the local density of cells and delta
measures the minimum distance to a region of higher local density. See http://science.sciencemag.org/content/344/6191/1492 for more details. You want to set thresholds on rho and delta that enclose the outlier points in the scatter plot. In my experience, it's usually better to over-cluster than to under-cluster.
cds.cholinergic = clusterCells_Density_Peak(cds.cholinergic)
plot_rho_delta(cds.cholinergic, rho_threshold = 10, delta_threshold = 6)
cds.cholinergic = clusterCells_Density_Peak(cds.cholinergic,
rho_threshold = 10, delta_threshold = 6, skip_rho_sigma = T)
plot_cell_clusters(cds.cholinergic, cell_size = 0.2)
Looks like there are many distinct clusters. Recall that we input only 9 clusters from the cds.neurons t-SNE into this re-clustering. The re-clustering appears to have revealed new potential distinct cell types.
Now would be a good time to save your progress.
save.image("my_analysis_Cao_et_al_data.RData")
Let's find marker genes for each cluster.
cholinergic.clusters = sort(as.integer(unique(pData(cds.cholinergic)$Cluster)))
cholinergic.clusters
DEG.results = lapply(cholinergic.clusters, function(this.cluster) {
message("Finding markers for cluster ", this.cluster)
cbind(
two.set.differential.gene.test(
cds.cholinergic,
pData(cds.cholinergic)$Cluster == this.cluster,
pData(cds.cholinergic)$Cluster != this.cluster),
data.frame(cluster = this.cluster))
})
cholinergic.cluster.markers = do.call(rbind, DEG.results) %>% filter(higher.expr == "Set 1") %>%
select(
cluster, gene,
cluster.n.umi = set.1.n.umi,
other.n.umi = set.2.n.umi,
cluster.tpm = set.1.tpm,
other.tpm = set.2.tpm,
log2.ratio, precision, recall, f.score) %>%
arrange(-f.score)
cholinergic.cluster.markers %>% filter(precision >= 0.5) %>% head(10)
In the previous examples for using plot.expr
, the function used pre-computed expression tables for cds
and cds.neurons
to show which cell types / neuron types were expressing a given gene. The following code will set up plot.expr
so it can show which density peak clusters from this re-clustering analysis express a given gene.
cholinergic.expr.info = get.expr.info.by.facet(cds.cholinergic, "Cluster")
plot.expr(cds.cholinergic, "lgc-39", expr.info = cholinergic.expr.info)
plot.expr(cds.cholinergic, "vglu-2", expr.info = cholinergic.expr.info)
plot.expr(cds.cholinergic, "unc-17", expr.info = cholinergic.expr.info)