Vignette for Cao et al., 2017

This notebook will show you how to work with the C. elegans L2-stage sci-RNA-seq data from Cao et al., 2017.
It aims to cover the following use cases:

  1. Accessing the raw data
  2. Exploring the expression pattern of a gene of interest
  3. Finding differentially expressed genes between subsets of cells
  4. Re-clustering subsets of the data using t-SNE

You'll need the R packages listed below.

This vignette was written using Monocle version 2.3.5, which is the version that was used for the paper. The source code of Monocle 2.3.5 is provided as a supplementary data file for the paper. You can install it using the command:

R CMD INSTALL monocle_2.3.5.tar.gz

Later versions of Monocle may produce different results for use case 4 (re-clustering with t-SNE) due to changes in how we preprocess the data before running t-SNE. We will update this vignette when we release the next version of Monocle, 2.6.0, to support the new version.

In [3]:
suppressPackageStartupMessages({
    library(dplyr)
    library(ggplot2)
    library(monocle)
})

This RData file contains both the data and some utility functions to help navigate it.

In [2]:
download.file(
    "http://jpacker-data.s3.amazonaws.com/public/Cao_et_al_2017_vignette.RData",
    destfile = "Cao_et_al_2017_vignette.RData")
In [5]:
load("Cao_et_al_2017_vignette.RData")
  • cds is a Monocle CellDataSet object containing the single cell RNA-seq data from the main L2-stage C. elegans experiment described in Cao et al., along with annotations.
  • cds.neurons is a re-clustered subset of the neuronal cells from cds.
  • cds.experiment.2 has data from the second C. elegans experiment described in Cao et al.. This includes intestine cells that were missed in the first experiment, but the data overall is lower quality than the first experiment. In the manuscript, we only included the intestine cells from this experiment and excluded the rest.

Use case 1: accessing the raw data

The raw gene-by-cell UMI (unique molecular identifier) count matrix for a CellDataSet can be accessed using the exprs function. It is stored as a sparse matrix object ("." = 0)

Note that if you need "even rawer" data (FASTQ files), they are available at the Gene Expression Omnibus (www.ncbi.nlm.nih.gov/geo) under accession code GSE98561.

If you aren't familiar with working with single cell RNA-seq data, we highly recommend that you take a look at the examples and utility functions presented in the other sections of this document instead of trying to dive in to the raw data directly.

In [6]:
exprs(cds)[1:3, 1:3]
3 x 3 sparse Matrix of class "dgCMatrix"
               cele-001-001.CATGACTCAA cele-001-001.AAGACGGCCA
WBGene00000001                       .                       .
WBGene00000002                       .                       .
WBGene00000003                       .                       .
               cele-001-001.GCCAACGCCA
WBGene00000001                       .
WBGene00000002                       .
WBGene00000003                       .

The fData function is used to access gene annotations.

In [7]:
fData(cds)[1:3,]
gene_idsymbolnum_cells_expressed
WBGene00000001WBGene00000001aap-1 1016
WBGene00000002WBGene00000002aat-1 354
WBGene00000003WBGene00000003aat-2 897

The pData function is used to access cell annotations.

  • n.umi is the number of unique molecular identifiers observed to be expressed by a given cell.
  • Size_Factor is n.umi divided by the media n.umi across all cells.
  • tsne_1 and tsne_2 are the coordinates for the cell in the t-SNE dimensionality reduction.
  • Cluster is the id of the t-SNE cluster the cell is part of (defined by density peak clustering).
  • cell.type and tissue are the annotations that we used throughout the analysis of Cao et al.
In [8]:
pData(cds)[1:3,]
celln.umiplateSize_Factornum_genes_expressedtsne_1tsne_2Clusterpeakshalodeltarhocell.typetissue
cele-001-001.CATGACTCAAcele-001-001.CATGACTCAA 144 001 0.2368328 89 5.4866377 14.67085 20 FALSE TRUE 0.02491657 893.9855 Unclassified neurons Neurons
cele-001-001.AAGACGGCCAcele-001-001.AAGACGGCCA 790 001 1.2992911 419 -3.8619751 -27.63448 6 FALSE TRUE 0.40961274 812.2076 Germline Gonad
cele-001-001.GCCAACGCCAcele-001-001.GCCAACGCCA 832 001 1.3683674 338 -0.5594413 41.98569 13 FALSE TRUE 0.04445184 240.2908 Intestinal/rectal muscleIntestinal/rectal muscle

neuron.type in pData(cds.neurons) is the annotation used in Figure 4 of Cao et al.

In [9]:
pData(cds.neurons)[1:3,]
celln.umiplateSize_Factornum_genes_expressedtsne_1tsne_2Clusterpeakshalodeltarhotissuecell.typeneuron.type
cele-001-001.CATGACTCAAcele-001-001.CATGACTCAA 144 001 0.2368328 89 0.9574604 0.8288424 11 FALSE TRUE 0.37046400 108.71265 Neurons Unclassified neurons Cholinergic (11)
cele-001-001.AACTACGGCTcele-001-001.AACTACGGCT 201 001 0.3305791 129 -3.0567593 -41.4083795 8 FALSE FALSE 0.25861943 70.88069 Neurons Ciliated sensory neuronsASI/ASJ
cele-001-001.GAGGCTTATTcele-001-001.GAGGCTTATT 117 001 0.1924267 76 -18.5689290 -33.9833909 39 FALSE TRUE 0.02962754 37.29414 Neurons Ciliated sensory neuronsAFD

Use case 2: exploring the expression pattern of a gene of interest

The show.expr.info function returns statistics related to the expression of a given gene in tabular form. The first argument is the gene name. The second argument specifies whether to show statistics at the level of tissues, cell types, or neuron (sub)-types. See the examples below.

The function returns a data frame with five columns:

  • facet -- the tissue / cell type / neuron type
  • tpm -- the expression of the given gene in the facet in TPM (transcripts per million).
  • prop.cells.expr -- the proportion of cells in the facet that express at least one UMI (unique molecular identifier) for the given gene. Note that cells in different facets can have very different average number of UMIs per cell, so TPM is the better measure of relative expression.
  • n.umi -- the number of UMIs (unique molecular identifiers) observed for the given gene in the facet. This is the "sample size" from which the TPM value is computed.
  • total.n.umi.for.facet -- the total number of UMIs observed across all genes for all cells in the facet.

Note that low levels of expression of a given gene in cell types you would not expect may be due to leakage of RNA in the fixed cells. For this experiment, based on the expression patterns for a few marker genes, we estimate that the background level of expression you will observe randomly in cells that do not truly express a given gene is ~1-2% of the expression of the highest-expressing cell type.

In [10]:
show.expr.info("emb-9", "tissue")
facettpmprop.cells.exprn.umitotal.n.umi.for.facet
Body wall muscle7972.27222 0.96451347 157028 19390434
Gonad 390.16652 0.03757116 3597 11166871
Intestine 97.88569 0.09775967 102 1230975
Neurons 89.83882 0.01406926 187 2203067
Glia 72.96189 0.02148228 39 787560
Pharynx 49.97025 0.01237113 48 1122443
Hypodermis 47.05950 0.01821904 158 5821384
In [11]:
show.expr.info("emb-9", "cell type") %>% head(10)
facettpmprop.cells.exprn.umitotal.n.umi.for.facet
Distal tip cells 16165.1598 0.97520661 3405 202581
Body wall muscle 7972.2722 0.96451347 157028 19390434
Intestinal/rectal muscle 5211.8861 0.84740260 2622 439170
Sex myoblasts 218.2425 0.14776632 75 377288
Pharyngeal neurons 165.1210 0.01592357 14 85381
Other interneurons 137.8986 0.02483070 17 172852
Socket cells 123.7517 0.02793296 26 184774
Coelomocytes 115.4683 0.02503682 71 544263
Non-seam hypodermis 102.4186 0.02006689 56 1059546
Somatic gonad precursors 102.1766 0.06376812 75 823856
In [12]:
show.expr.info("R102.2", "neuron type") %>% head(10)
facettpmprop.cells.exprn.umitotal.n.umi.for.facet
Cluster 21 9807.74014 0.91588785 446 49587
Cluster 16 8266.64832 0.66025641 363 47719
ASK 6720.61429 0.81944444 155 24157
ASI/ASJ 6482.22237 0.74358974 239 40396
ASG 2121.04289 0.39534884 48 26295
ASEL 1670.50501 0.37837838 17 11042
ASER 795.91129 0.28571429 16 15324
AWB/AWC 85.09161 0.02380952 4 23186
Cholinergic (15) 60.56935 0.01538462 1 14463
Pharyngeal (33) 58.35581 0.02857143 2 25101

Note that you can see the source code for any of the utility functions used in this vignette. Just run the function with name no parentheses.

In [13]:
show.expr.info
function (gene, expr.info) 
{
    if (class(expr.info) == "character") {
        expr.info = gsub("[.]", " ", tolower(expr.info))
        if (expr.info == "tissue") 
            expr.info = tissue.expr.info
        else if (expr.info == "cell type") 
            expr.info = cell.type.expr.info
        else if (expr.info == "neuron type") 
            expr.info = neuron.type.expr.info
    }
    gene.id = get.gene.id(gene, fData.df = expr.info$gene.annotations)
    data.frame(facet = names(expr.info$tpm[gene.id, ]), tpm = expr.info$tpm[gene.id, 
        ], prop.cells.expr = expr.info$prop.cells.expr[gene.id, 
        ], n.umi = expr.info$n.umi[gene.id, ], total.n.umi.for.facet = expr.info$total.n.umi.for.facet) %>% 
        arrange(-tpm)
}

The plot.expr function can be used to highlight cells that express a given gene on the t-SNE map.

  • Cells that do not express the gene will be colored grey. They are made semi-transparent so as to better highlight the cells that do express the gene.
  • Cells that express the gene will be colored according to their cell type. The top 4 highest-expressing cell types will be assigned distinct colors. Other cell types will be lumped together.
  • Cells in the "Failed QC" category are those that express the gene, but are excluded from the analysis due to either having an unusually low UMI count or being a likely doublet.
In [14]:
plot.expr(cds, "lin-12")
Warning message in grid.Call.graphics(L_points, x$x, x$y, x$pch, x$size):
“semi-transparency is not supported on this device: reported only once per page”
In [15]:
plot.expr(cds.neurons, "che-3")
Warning message in grid.Call.graphics(L_points, x$x, x$y, x$pch, x$size):
“semi-transparency is not supported on this device: reported only once per page”

Use case 3: testing for differential expression between subsets of cells

We've defined a function two.set.differential.gene.test that finds and reports statistics on differentially expressed genes (DEG) that distinguish between two defined sets of cells. This is a wrapper that just adds a bit of functionality around Monocle's differentialGeneTest function.

two.set.differential.gene.test takes four arguments:

  1. cds -- a CellDataSet object that includes both sets of cells you wish to compare
  2. set.1.filter -- a boolean vector of length ncol(cds) indicating which cells should be in Set 1
  3. set.2.filter -- a boolean vector of length ncol(cds) indicating which cells should be in Set 2
  4. formal -- if TRUE, p-values and q-values are computed for differential gene expression and genes are ranked by q-value. if FALSE, no formal statistical test is performed and genes are heuristically ranked. formal = F makes the function run much, much faster.

Warning: if you run two.set.differential.gene.test on large sets of cells (>10000-ish), it may take a bunch of memory.

The utility functions is.tissue, is.cell.type, and is.neuron.type may be used to create the boolean vectors required for the set.1.filter and set.2.filter parameters. Each function takes a CellDataSet and a string and tests for each cell whether its tissue / cell type / neuron type is defined (not NA) and equal to the given value.

In [17]:
is.tissue
function (cds, x) 
{
    with(pData(cds), !is.na(tissue) & tissue == x)
}
In [14]:
head(is.tissue(cds, "Gonad"))
  1. FALSE
  2. TRUE
  3. FALSE
  4. FALSE
  5. FALSE
  6. TRUE
In [15]:
sum(is.tissue(cds, "Gonad"))
5628
In [16]:
sum(is.cell.type(cds, "Distal tip cells"))
sum(is.neuron.type(cds.neurons, "ASEL"))
129
37
In [20]:
tissues
  1. 'Body wall muscle'
  2. 'Pharynx'
  3. 'Hypodermis'
  4. 'Neurons'
  5. 'Glia'
  6. 'Gonad'
  7. 'Intestine'
In [21]:
cell.types
  1. 'Am/PH sheath cells'
  2. 'Body wall muscle'
  3. 'Canal associated neurons'
  4. 'Cholinergic neurons'
  5. 'Ciliated sensory neurons'
  6. 'Coelomocytes'
  7. 'Distal tip cells'
  8. 'Dopaminergic neurons'
  9. 'Excretory cells'
  10. 'flp-1(+) interneurons'
  11. 'GABAergic neurons'
  12. 'Germline'
  13. 'Intestinal/rectal muscle'
  14. 'Intestine'
  15. 'Non-seam hypodermis'
  16. 'Other interneurons'
  17. 'Oxygen sensory neurons'
  18. 'Pharyngeal epithelia'
  19. 'Pharyngeal gland'
  20. 'Pharyngeal muscle'
  21. 'Pharyngeal neurons'
  22. 'Rectum'
  23. 'Seam cells'
  24. 'Sex myoblasts'
  25. 'Socket cells'
  26. 'Somatic gonad precursors'
  27. 'Touch receptor neurons'
  28. 'Vulval precursors'
In [22]:
neuron.types
  1. 'AFD'
  2. 'ASEL'
  3. 'ASER'
  4. 'ASG'
  5. 'ASI/ASJ'
  6. 'ASK'
  7. 'AWA'
  8. 'AWB/AWC'
  9. 'BAG'
  10. 'CAN'
  11. 'Cholinergic (11)'
  12. 'Cholinergic (15)'
  13. 'Cholinergic (23)'
  14. 'Cholinergic (24)'
  15. 'Cholinergic (26)'
  16. 'Cholinergic (29)'
  17. 'Cholinergic (3)'
  18. 'Cholinergic (35)'
  19. 'Cholinergic (36)'
  20. 'Cluster 10'
  21. 'Cluster 13'
  22. 'Cluster 16'
  23. 'Cluster 17'
  24. 'Cluster 21'
  25. 'Cluster 25'
  26. 'Cluster 27'
  27. 'Cluster 40'
  28. 'Cluster 5'
  29. 'Dopaminergic'
  30. 'DVA'
  31. 'flp-1(+)'
  32. 'GABAergic'
  33. 'Pharyngeal (33)'
  34. 'Pharyngeal (37)'
  35. 'PVC/PVD'
  36. 'RIA'
  37. 'RIC'
  38. 'SDQ/ALN/PLN'
  39. 'Touch receptor'
  40. 'URX/AQR/PQR'
In [18]:
ASEL.vs.ASER.DEG = two.set.differential.gene.test(
    cds.neurons, is.neuron.type(cds.neurons, "ASEL"), is.neuron.type(cds.neurons, "ASER"))
# of cells in set 1: 37
# of cells in set 2: 35

two.set.differential.gene.test returns the following statistics:

  • set.1.umi -- the number of UMI (unique molecular identifiers) observed for the gene in Set 1
  • set.2.umi -- the number of UMI (unique molecular identifiers) observed for the gene in Set 2
  • set.1.tpm -- gene expression in Set 1 in TPM (transcripts per million)
  • set.2.tpm -- gene expression in Set 2 in TPM (transcripts per million)
  • higher.expr -- which set has higher expression (TPM)
  • log2.ratio -- log2(TPM of higher expressing set / (TPM of lower expressing set + 1))
  • precision -- (# of cells expressing gene in higher expressing set) / (total # of cells expressing gene)
  • recall -- (# of cells expressing gene in higher expressing set) / (total # of cells in higher expressing set)
  • f.score -- geometric mean of precision and recall. genes are sorted by this metric
In [19]:
ASEL.vs.ASER.DEG %>% head()
geneset.1.n.umiset.2.n.umiset.1.tpmset.2.tpmhigher.exprlog2.ratioprecisionrecallf.score
tank-1 20 570 1794.188 36169.778Set 2 4.3325780.85365851.00000000.9210526
gcy-22 0 129 0.000 8862.360Set 2 13.1134751.00000000.80000000.8888889
gei-3 10 166 1095.928 12076.301Set 2 3.4606380.90909090.85714290.8823529
gcy-3 0 147 0.000 9593.506Set 2 13.2278421.00000000.74285710.8524590
gcy-6 66 0 6909.705 0.000Set 1 12.7544081.00000000.72972970.8437500
T27C4.1 30 164 3210.293 11338.218Set 2 1.8199680.68085110.91428570.7804878
In [20]:
ASEL.vs.ASER.DEG %>% filter(higher.expr == "Set 1") %>% head()
geneset.1.n.umiset.2.n.umiset.1.tpmset.2.tpmhigher.exprlog2.ratioprecisionrecallf.score
gcy-6 66 0 6909.705 0.000 Set 1 12.7544081.00000000.72972970.8437500
gcy-17 69 0 6553.874 0.000 Set 1 12.6781321.00000000.62162160.7666667
crh-1 58 12 5675.336 800.504 Set 1 2.8239240.83333330.67567570.7462687
gcy-20 53 0 5339.037 0.000 Set 1 12.3823641.00000000.59459460.7457627
gcy-7 39 0 3709.660 0.000 Set 1 11.8570711.00000000.56756760.7241379
unc-44 88 45 7647.438 2915.088 Set 1 1.3909420.60000000.81081080.6896552

Running two.set.differential.gene.test with formal = T will compute q-values (the false detection rate at which a gene can be considered to be differentially expressed between the two sets). This is very slow however, so we recommend not using it for exploratory analysis and only using it after you've found something interesting and want to verify that the finding is statistically robust.

In [26]:
system.time({

ASEL.vs.ASER.DEG = two.set.differential.gene.test(
    cds.neurons, is.neuron.type(cds.neurons, "ASEL"), is.neuron.type(cds.neurons, "ASER"),
    formal = T, cores = min(16, detectCores()))
    
})
# of cells in set 1: 37
# of cells in set 2: 35
756 genes expressed in at least 5 cells across both sets
Computing differential expression p-values
Warning message:
“closing unused connection 19 (<-localhost:11967)”Warning message:
“closing unused connection 18 (<-localhost:11967)”Warning message:
“closing unused connection 17 (<-localhost:11967)”Warning message:
“closing unused connection 16 (<-localhost:11967)”Warning message:
“closing unused connection 15 (<-localhost:11967)”Warning message:
“closing unused connection 14 (<-localhost:11967)”Warning message:
“closing unused connection 13 (<-localhost:11967)”Warning message:
“closing unused connection 12 (<-localhost:11967)”Warning message:
“closing unused connection 11 (<-localhost:11967)”Warning message:
“closing unused connection 10 (<-localhost:11967)”Warning message:
“closing unused connection 9 (<-localhost:11967)”Warning message:
“closing unused connection 8 (<-localhost:11967)”Warning message:
“closing unused connection 7 (<-localhost:11967)”Warning message:
“closing unused connection 6 (<-localhost:11967)”Warning message:
“closing unused connection 5 (<-localhost:11967)”Warning message:
“closing unused connection 4 (<-localhost:11967)”
   user  system elapsed 
 21.241   3.599 152.575 
In [21]:
ASEL.vs.ASER.DEG %>% head()
geneset.1.n.umiset.2.n.umiset.1.tpmset.2.tpmhigher.exprlog2.ratioprecisionrecallf.score
tank-1 20 570 1794.188 36169.778Set 2 4.3325780.85365851.00000000.9210526
gcy-22 0 129 0.000 8862.360Set 2 13.1134751.00000000.80000000.8888889
gei-3 10 166 1095.928 12076.301Set 2 3.4606380.90909090.85714290.8823529
gcy-3 0 147 0.000 9593.506Set 2 13.2278421.00000000.74285710.8524590
gcy-6 66 0 6909.705 0.000Set 1 12.7544081.00000000.72972970.8437500
T27C4.1 30 164 3210.293 11338.218Set 2 1.8199680.68085110.91428570.7804878

two.set.differential.gene.test can be used to compare a specific cell type vs. all other cells.

In [22]:
touch.receptor.neuron.DEG = two.set.differential.gene.test(
    cds,
    is.cell.type(cds, "Touch receptor neurons"),
    !is.cell.type(cds, "Touch receptor neurons") & is.tissue(cds, "Neurons"))
# of cells in set 1: 334
# of cells in set 2: 7058
In [23]:
touch.receptor.neuron.DEG %>% filter(higher.expr == "Set 1") %>% head()
geneset.1.n.umiset.2.n.umiset.1.tpmset.2.tpmhigher.exprlog2.ratioprecisionrecallf.score
mec-17 2720 118 24033.028 39.202902Set 1 9.223503 0.9009288 0.8712575 0.8858447
mec-18 796 49 7040.671 17.963804Set 1 8.536321 0.9094828 0.6317365 0.7455830
mtd-1 443 15 4025.528 5.513083Set 1 9.271622 0.9476440 0.5419162 0.6895238
mec-7 4418 743 35936.308 239.162028Set 1 7.225290 0.5563771 0.9011976 0.6880000
mec-1 1717 695 16304.188 328.822706Set 1 5.627408 0.4609610 0.9191617 0.6140000
mec-9 726 337 7088.606 182.571539Set 1 5.271088 0.5126263 0.6077844 0.5561644

tni-3 is a troponin that is expressed in body wall muscle (BWM) in the head, but not in the posterior.
cwn-1 and egl-20 are Wnt ligands that are expressed in posterior BWM, but not anterior BWM.
Using these genes as markers, let's look for differentially expressed genes between anterior and posterior BWM.

In [30]:
BWM.anterior.vs.posterior.DEG = two.set.differential.gene.test(
    cds,
    is.cell.type(cds, "Body wall muscle") & expresses.gene(cds, "tni-3"),
    is.cell.type(cds, "Body wall muscle") & (expresses.gene(cds, "cwn-1") | expresses.gene(cds, "egl-20")))
# of cells in set 1: 1254
# of cells in set 2: 1290
In [31]:
BWM.anterior.vs.posterior.DEG$higher.expr = ifelse(
    BWM.anterior.vs.posterior.DEG$higher.expr == "Set 1",
    "Anterior BWM", "Posterior BWM")
In [32]:
BWM.anterior.vs.posterior.DEG %>% head()
geneset.1.n.umiset.2.n.umiset.1.tpmset.2.tpmhigher.exprlog2.ratioprecisionrecallf.score
tni-3 6981 93 3804.24742 18.57887 Anterior BWM 7.602169 0.9819890 1.0000000 0.9909127
cwn-1 59 3113 13.99762 1311.17230 Posterior BWM6.449980 0.9821732 0.8968992 0.9376013
T21B6.3 13869 271 5966.03306 60.20010 Anterior BWM 6.607094 0.9553265 0.8867624 0.9197684
him-4 1366 12029 595.78605 3527.26462 Posterior BWM2.563264 0.8371408 0.8806202 0.8583302
lec-5 862 9838 230.36587 2677.35386 Posterior BWM3.532560 0.8626374 0.8519380 0.8572543
F41C3.5 2478 21353 592.75881 5575.97179 Posterior BWM3.231274 0.7865412 0.8426357 0.8136228
In [33]:
BWM.anterior.vs.posterior.DEG %>% filter(precision >= 0.95, higher.expr == "Anterior BWM") %>% head(10)
geneset.1.n.umiset.2.n.umiset.1.tpmset.2.tpmhigher.exprlog2.ratioprecisionrecallf.score
tni-3 6981 93 3804.24742 18.5788744 Anterior BWM7.602169 0.9819890 1.0000000 0.9909127
T21B6.3 13869 271 5966.03306 60.2000953 Anterior BWM6.607094 0.9553265 0.8867624 0.9197684
glc-4 684 37 310.53057 11.4323046 Anterior BWM4.642570 0.9506849 0.2767145 0.4286597
tre-3 555 5 214.79033 0.6369876 Anterior BWM7.035742 0.9852399 0.2129187 0.3501639
F48E3.8 402 1 144.96720 0.1162907 Anterior BWM7.020870 0.9947917 0.1523126 0.2641770
dpyd-1 289 6 91.93291 0.9050051 Anterior BWM5.592715 0.9740933 0.1499203 0.2598480
ceh-34 307 2 131.08432 0.2528051 Anterior BWM6.709189 0.9885057 0.1371611 0.2408964
seb-2 201 5 85.08025 0.6848107 Anterior BWM5.658166 0.9750000 0.1244019 0.2206506
sfrp-1 335 2 145.18286 0.2754131 Anterior BWM6.830763 0.9863946 0.1156300 0.2069950
F35C11.5 168 1 67.97453 0.5230727 Anterior BWM5.479937 0.9923664 0.1036683 0.1877256

R102.2 is expressed in nine specific pairs of ciliated sensory neurons (http://www.wormbase.org/species/c_elegans/gene/WBGene00011289). We have identified some of these in this dataset, but others remain ambiguous.

In [34]:
show.expr.info("R102.2", "neuron type") %>% head(8)
facettpmprop.cells.exprn.umitotal.n.umi.for.facet
Cluster 219807.740140.91588785446 49587
Cluster 168266.648320.66025641363 47719
ASK 6720.614290.81944444155 24157
ASI/ASJ 6482.222370.74358974239 40396
ASG 2121.042890.39534884 48 26295
ASEL 1670.505010.37837838 17 11042
ASER 795.911290.28571429 16 15324
AWB/AWC 85.091610.02380952 4 23186
In [35]:
plot.expr(cds.neurons, "R102.2")
Warning message in grid.Call.graphics(L_points, x$x, x$y, x$pch, x$size):
“semi-transparency is not supported on this device: reported only once per page”

two.set.differential.gene.test can be used to find DEG between the mystery R102.2(+) clusters and other ciliated sensory neurons. If you can figure out what these clusters correspond to, let us know!

In [46]:
neuron.cluster.21.vs.other.CSN.DEG = two.set.differential.gene.test(
    cds.neurons,
    is.neuron.type(cds.neurons, "Cluster 21"),
    !is.neuron.type(cds.neurons, "Cluster 21") & is.cell.type(cds.neurons, "Ciliated sensory neurons"))

neuron.cluster.21.vs.other.CSN.DEG$higher.expr = ifelse(
    neuron.cluster.21.vs.other.CSN.DEG$higher.expr == "Set 1",
    "Cluster 21", "Other CSN")
# of cells in set 1: 107
# of cells in set 2: 735
In [47]:
neuron.cluster.21.vs.other.CSN.DEG %>% filter(higher.expr == "Cluster 21", precision >= 0.8) %>% head(10)
geneset.1.n.umiset.2.n.umiset.1.tpmset.2.tpmhigher.exprlog2.ratioprecisionrecallf.score
C39D10.2 226 1 5237.910 1.667334 Cluster 2110.939377 0.9880952 0.7757009 0.8691099
T09B9.3 197 0 3836.121 0.000000 Cluster 2111.905433 1.0000000 0.6074766 0.7558140
F15A4.5 157 15 3677.812 67.963116 Cluster 21 5.736879 0.8923077 0.5420561 0.6744186
flp-25 100 7 1775.299 44.020086 Cluster 21 5.301349 0.9137931 0.4953271 0.6424242
C18H7.6 118 0 2385.147 0.000000 Cluster 2111.219863 1.0000000 0.4579439 0.6282051
cdh-3 79 11 1811.618 69.489017 Cluster 21 4.683736 0.8809524 0.3457944 0.4966443
K04D7.6 65 0 1110.520 0.000000 Cluster 2110.117019 1.0000000 0.2710280 0.4264706
C29F4.3 53 0 1067.368 0.000000 Cluster 2110.059842 1.0000000 0.2523364 0.4029851
K02E2.1 39 0 1084.714 0.000000 Cluster 2110.083099 1.0000000 0.2523364 0.4029851
dhs-9 67 15 987.323 53.024505 Cluster 21 4.191836 0.8484848 0.2616822 0.4000000
In [48]:
neuron.cluster.16.vs.other.CSN.DEG = two.set.differential.gene.test(
    cds.neurons,
    is.neuron.type(cds.neurons, "Cluster 16"),
    !is.neuron.type(cds.neurons, "Cluster 16") & is.cell.type(cds.neurons, "Ciliated sensory neurons"))

neuron.cluster.16.vs.other.CSN.DEG$higher.expr = ifelse(
    neuron.cluster.16.vs.other.CSN.DEG$higher.expr == "Set 1",
    "Cluster 16", "Other CSN")
# of cells in set 1: 156
# of cells in set 2: 686
In [49]:
neuron.cluster.16.vs.other.CSN.DEG %>% filter(higher.expr == "Cluster 16") %>% head(10)
geneset.1.n.umiset.2.n.umiset.1.tpmset.2.tpmhigher.exprlog2.ratioprecisionrecallf.score
F27C1.11 223 367 5302.959 1576.21303Cluster 161.7494202 0.3537118 0.5192308 0.4207792
W05F2.7 137 309 3007.794 1054.38784Cluster 161.5109326 0.3564356 0.4615385 0.4022346
M04B2.6 117 4 2144.256 14.78412Cluster 167.0858594 0.9285714 0.2500000 0.3939394
ocr-2 100 81 2408.536 294.50719Cluster 163.0268916 0.5465116 0.3012821 0.3884298
osm-10 66 32 1209.716 124.87432Cluster 163.2646129 0.7222222 0.2500000 0.3714286
R102.2 363 926 8266.648 3753.48748Cluster 161.1386865 0.2524510 0.6602564 0.3652482
T01D3.1 87 122 1849.955 382.64946Cluster 162.2696298 0.4112903 0.3269231 0.3642857
ida-1 337 1211 7843.130 5636.27821Cluster 160.4764307 0.2234848 0.7564103 0.3450292
lap-2 132 55 2344.214 231.93208Cluster 163.3311228 0.5263158 0.2564103 0.3448276
rps-11 83 230 2013.967 1036.83972Cluster 160.9564565 0.2863636 0.4038462 0.3351064

Use case 4: re-clustering subsets of the data using t-SNE

In the Cao et al. paper, we found that several of the neuron t-SNE clusters expressed markers of cholinergic neurons such as unc-17, cho-1, and cha-1. Let's perform a sub-clustering of these cholinergic neurons to see if we can get better separation. First, we'll identify which clusters are enriched for cells that express cholinergic markers.

In [16]:
pData(cds.neurons)$any.cholinergic.marker =
    (expresses.gene(cds.neurons, "unc-17") +
     expresses.gene(cds.neurons, "cho-1") +
     expresses.gene(cds.neurons, "cha-1")) > 0
In [17]:
pData(cds.neurons) %>%
    group_by(Cluster) %>%
    summarize(
        n.total = n(), n.cholinergic = sum(any.cholinergic.marker),
        prop.cholinergic = n.cholinergic / n.total) %>%
    inner_join(
        unique(pData(cds.neurons)[, c("Cluster", "neuron.type")]),
        by = "Cluster") %>%
    arrange(-prop.cholinergic) %>%
    head(15)
Clustern.totaln.cholinergicprop.cholinergicneuron.type
29 305 155 0.5081967 Cholinergic (29)
23 45 19 0.4222222 Cholinergic (23)
3 385 131 0.3402597 Cholinergic (3)
26 261 81 0.3103448 Cholinergic (26)
35 58 16 0.2758621 Cholinergic (35)
36 128 35 0.2734375 Cholinergic (36)
15 65 16 0.2461538 Cholinergic (15)
12 68 14 0.2058824 DVA
24 188 38 0.2021277 Cholinergic (24)
8 117 23 0.1965812 ASI/ASJ
11 1998 387 0.1936937 Cholinergic (11)
6 160 21 0.1312500 SDQ/ALN/PLN
25 363 43 0.1184573 Cluster 25
41 211 24 0.1137441 Doublets
16 156 17 0.1089744 Cluster 16
In [18]:
cholinergic.clusters = c(29, 23, 3, 26, 35, 36, 15, 24, 11)
plot_cell_clusters(cds.neurons, color = "Cluster %in% cholinergic.clusters", cell_size = 0.2)
In [19]:
cds.cholinergic = cds.neurons[, pData(cds.neurons)$Cluster %in% cholinergic.clusters]
cat(ncol(cds.cholinergic), "cells in the cds subset", "\n")

cds.cholinergic = estimateSizeFactors(cds.cholinergic)
cds.cholinergic = estimateDispersions(cds.cholinergic) # would take a lot of memory for larger cell sets
cds.cholinergic = detectGenes(cds.cholinergic, 0.1)
3433 cells in the cds subset 
Warning message:
“Deprecated, use tibble::rownames_to_column() instead.”Warning message in `[<-.data.frame`(`*tmp*`, res$mu == 0, value = structure(list(:
“provided 1 variable to replace 0 variables”Warning message in `[<-.data.frame`(`*tmp*`, res$mu == 0, value = structure(list(:
“provided 1 variable to replace 0 variables”Removing 324 outliers

The next step is to run a new t-SNE dimensionality reduction on this subset of cells.

In [21]:
system.time({
cds.cholinergic = reduceDimension(
    cds.cholinergic, max_components = 2, norm_method = "log",
    num_dim = 20, reduction_method = 'tSNE', verbose = T)
})

pData(cds.cholinergic)$tsne_1 = reducedDimA(cds.cholinergic)[1,]
pData(cds.cholinergic)$tsne_2 = reducedDimA(cds.cholinergic)[2,]
Remove noise by PCA ...
Reduce dimension by tSNE ...
   user  system elapsed 
164.398   2.898 168.483 

20 principal components looks like its enough for this data. If you don't see an elbow in the scree plot, that means you've used too few principal components.

In [22]:
plot_pc_variance_explained(cds.cholinergic)
Warning message in grid.Call.graphics(L_points, x$x, x$y, x$pch, x$size):
“semi-transparency is not supported on this device: reported only once per page”
In [23]:
ggplot(pData(cds.cholinergic), aes(x = tsne_1, y = tsne_2)) +
    geom_point(size = 0.1) +
    monocle:::monocle_theme_opts()

Now we'll cluster the cells in the t-SNE space using density peak clustering. In density peak clustering, rho measures to the local density of cells and delta measures the minimum distance to a region of higher local density. See http://science.sciencemag.org/content/344/6191/1492 for more details. You want to set thresholds on rho and delta that enclose the outlier points in the scatter plot. In my experience, it's usually better to over-cluster than to under-cluster.

In [30]:
cds.cholinergic = clusterCells_Density_Peak(cds.cholinergic)
Distance cutoff calculated to 2.694535 
the length of the distance: 5891028
In [31]:
plot_rho_delta(cds.cholinergic, rho_threshold = 10, delta_threshold = 6)
Warning message in grid.Call.graphics(L_points, x$x, x$y, x$pch, x$size):
“semi-transparency is not supported on this device: reported only once per page”
In [34]:
cds.cholinergic = clusterCells_Density_Peak(cds.cholinergic,
    rho_threshold = 10, delta_threshold = 6, skip_rho_sigma = T)
plot_cell_clusters(cds.cholinergic, cell_size = 0.2)

Looks like there are many distinct clusters. Recall that we input only 9 clusters from the cds.neurons t-SNE into this re-clustering. The re-clustering appears to have revealed new potential distinct cell types.

Now would be a good time to save your progress.

In [35]:
save.image("my_analysis_Cao_et_al_data.RData")

Let's find marker genes for each cluster.

In [36]:
cholinergic.clusters = sort(as.integer(unique(pData(cds.cholinergic)$Cluster)))
cholinergic.clusters
  1. 1
  2. 2
  3. 3
  4. 4
  5. 5
  6. 6
  7. 7
  8. 8
  9. 9
  10. 10
  11. 11
  12. 12
  13. 13
  14. 14
  15. 15
  16. 16
  17. 17
  18. 18
  19. 19
  20. 20
  21. 21
  22. 22
In [37]:
DEG.results = lapply(cholinergic.clusters, function(this.cluster) {
    message("Finding markers for cluster ", this.cluster)
    cbind(
        two.set.differential.gene.test(
            cds.cholinergic,
            pData(cds.cholinergic)$Cluster == this.cluster,
            pData(cds.cholinergic)$Cluster != this.cluster),
        data.frame(cluster = this.cluster))
})
Finding markers for cluster 1
# of cells in set 1: 71
# of cells in set 2: 3362
Finding markers for cluster 2
# of cells in set 1: 45
# of cells in set 2: 3388
Finding markers for cluster 3
# of cells in set 1: 64
# of cells in set 2: 3369
Finding markers for cluster 4
# of cells in set 1: 67
# of cells in set 2: 3366
Finding markers for cluster 5
# of cells in set 1: 400
# of cells in set 2: 3033
Finding markers for cluster 6
# of cells in set 1: 453
# of cells in set 2: 2980
Finding markers for cluster 7
# of cells in set 1: 40
# of cells in set 2: 3393
Finding markers for cluster 8
# of cells in set 1: 308
# of cells in set 2: 3125
Finding markers for cluster 9
# of cells in set 1: 34
# of cells in set 2: 3399
Finding markers for cluster 10
# of cells in set 1: 50
# of cells in set 2: 3383
Finding markers for cluster 11
# of cells in set 1: 66
# of cells in set 2: 3367
Finding markers for cluster 12
# of cells in set 1: 240
# of cells in set 2: 3193
Finding markers for cluster 13
# of cells in set 1: 58
# of cells in set 2: 3375
Finding markers for cluster 14
# of cells in set 1: 63
# of cells in set 2: 3370
Finding markers for cluster 15
# of cells in set 1: 197
# of cells in set 2: 3236
Finding markers for cluster 16
# of cells in set 1: 121
# of cells in set 2: 3312
Finding markers for cluster 17
# of cells in set 1: 52
# of cells in set 2: 3381
Finding markers for cluster 18
# of cells in set 1: 53
# of cells in set 2: 3380
Finding markers for cluster 19
# of cells in set 1: 301
# of cells in set 2: 3132
Finding markers for cluster 20
# of cells in set 1: 328
# of cells in set 2: 3105
Finding markers for cluster 21
# of cells in set 1: 229
# of cells in set 2: 3204
Finding markers for cluster 22
# of cells in set 1: 193
# of cells in set 2: 3240
In [38]:
cholinergic.cluster.markers = do.call(rbind, DEG.results) %>% filter(higher.expr == "Set 1") %>%
    select(
        cluster, gene,
        cluster.n.umi = set.1.n.umi,
        other.n.umi = set.2.n.umi,
        cluster.tpm = set.1.tpm,
        other.tpm = set.2.tpm,
        log2.ratio, precision, recall, f.score) %>%
    arrange(-f.score)
In [39]:
cholinergic.cluster.markers %>% filter(precision >= 0.5) %>% head(10)
clustergenecluster.n.umiother.n.umicluster.tpmother.tpmlog2.ratioprecisionrecallf.score
7 B0432.14 194 11 10886.411 5.833289310.637661 0.9047619 0.9500000 0.9268293
13 nlp-42 299 15 22725.799 17.188682710.287074 0.8833333 0.9137931 0.8983051
21 flp-12 1379 297 17963.487 366.8397815 5.609846 0.7282609 0.8777293 0.7960396
13 T04C12.3 173 25 12067.242 22.8874212 8.980629 0.7407407 0.6896552 0.7142857
22 lgc-39 289 93 5300.848 120.7416859 5.444328 0.6346154 0.6839378 0.6583541
1 sem-2 128 63 6240.354 52.4115889 6.868331 0.6615385 0.6056338 0.6323529
2 vglu-2 30 1 2356.216 0.697776310.438609 0.9523810 0.4444444 0.6060606
15 Y48C3A.5 339 145 6989.459 148.0684921 5.551134 0.5747664 0.6243655 0.5985401
10 glb-17 81 19 4790.632 16.9955396 8.056433 0.7352941 0.5000000 0.5952381
9 nlp-5 34 24 3552.050 21.1320088 7.326374 0.5500000 0.6470588 0.5945946

In the previous examples for using plot.expr, the function used pre-computed expression tables for cds and cds.neurons to show which cell types / neuron types were expressing a given gene. The following code will set up plot.expr so it can show which density peak clusters from this re-clustering analysis express a given gene.

In [40]:
cholinergic.expr.info = get.expr.info.by.facet(cds.cholinergic, "Cluster")
Computing normalized expression values
Computing gene expression statistics
In [41]:
plot.expr(cds.cholinergic, "lgc-39", expr.info = cholinergic.expr.info)
Warning message in grid.Call.graphics(L_points, x$x, x$y, x$pch, x$size):
“semi-transparency is not supported on this device: reported only once per page”
In [44]:
plot.expr(cds.cholinergic, "vglu-2", expr.info = cholinergic.expr.info)
Warning message in grid.Call.graphics(L_points, x$x, x$y, x$pch, x$size):
“semi-transparency is not supported on this device: reported only once per page”
In [42]:
plot.expr(cds.cholinergic, "unc-17", expr.info = cholinergic.expr.info)
Warning message in grid.Call.graphics(L_points, x$x, x$y, x$pch, x$size):
“semi-transparency is not supported on this device: reported only once per page”
In [ ]: