Integrating multi-dataset slices of data with Seurat

The Census contains data from multiple studies providing an opportunity to perform inter-dataset analysis. To this end integration of data has to be performed first to account for batch effects.

This notebook provides a demonstration for integrating two Census datasets using Seurat. The goal is not to provide an exhaustive guide on proper integration, but to showcase what information in the Census can inform data integration.

Contents

Finding and fetching data from mouse liver (10X Genomics and Smart-Seq2).
Gene-length normalization of Smart-Seq2 data.
Integration with Seurat.
1. Inspecting data prior to integration.
2. Integration across datasets using dataset_id.
3. Integration across datasets using dataset_id and controlling for batch using donor_id.
4. Integration across datasets using dataset_id and controlling for batch using donor_id + assay_ontology_term_id + suspension_type. .

Finding and fetching data from mouse liver (10X Genomics and Smart-Seq2)

Let’s load all packages needed for this notebook.

library("cellxgene.census")
library("Seurat")

Now we can open the Census.

census <- open_soma()

In this notebook we will use Tabula Muris Senis data from the liver as it contains cells from both 10X Genomics and Smart-Seq2 technologies.

Let’s query the datasets table of the Census by filtering on collection_name for “Tabula Muris Senis” and dataset_title for “liver”.

census_datasets <- census$get("census_info")$get("datasets")
census_datasets <- census_datasets$read(value_filter = "collection_name == 'Tabula Muris Senis'")
census_datasets <- as.data.frame(census_datasets$concat())

# Print rows with liver data
census_datasets[grep("Liver", census_datasets$dataset_title), ]
#>    soma_joinid
#> 7          125
#> 25         208
#>                                                                                                                                                                                                                                                                                                         citation
#> 7  Publication: https://doi.org/10.1038/s41586-020-2496-1 Dataset Version: https://datasets.cellxgene.cziscience.com/d32c289e-e881-4140-8db4-078ec04c942f.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/0b9d8a04-bb9d-44da-aa27-705bb65b54eb
#> 25 Publication: https://doi.org/10.1038/s41586-020-2496-1 Dataset Version: https://datasets.cellxgene.cziscience.com/e7218cb6-9df7-461e-8425-d8c3ddca9392.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/0b9d8a04-bb9d-44da-aa27-705bb65b54eb
#>                           collection_id    collection_name
#> 7  0b9d8a04-bb9d-44da-aa27-705bb65b54eb Tabula Muris Senis
#> 25 0b9d8a04-bb9d-44da-aa27-705bb65b54eb Tabula Muris Senis
#>               collection_doi                           dataset_id
#> 7  10.1038/s41586-020-2496-1 4546e757-34d0-4d17-be06-538318925fcd
#> 25 10.1038/s41586-020-2496-1 6202a243-b713-4e12-9ced-c387f8483dea
#>                      dataset_version_id
#> 7  d32c289e-e881-4140-8db4-078ec04c942f
#> 25 e7218cb6-9df7-461e-8425-d8c3ddca9392
#>                                                                                        dataset_title
#> 7  Liver - A single-cell transcriptomic atlas characterizes ageing tissues in the mouse - Smart-seq2
#> 25        Liver - A single-cell transcriptomic atlas characterizes ageing tissues in the mouse - 10x
#>                            dataset_h5ad_path dataset_total_cell_count
#> 7  4546e757-34d0-4d17-be06-538318925fcd.h5ad                     2859
#> 25 6202a243-b713-4e12-9ced-c387f8483dea.h5ad                     7294

Now we can use the values from dataset_id to query and load a Seurat object with all the cells from those datasets.

tabula_muris_liver_ids <- c("4546e757-34d0-4d17-be06-538318925fcd", "6202a243-b713-4e12-9ced-c387f8483dea")

seurat_obj <- get_seurat(
  census,
  organism = "Mus musculus",
  obs_value_filter = "dataset_id %in% tabula_muris_liver_ids"
)

We can check the cell counts for both 10X Genomics and Smart-Seq2 data by looking at assay in the metadata.

table(seurat_obj$assay)
#> 
#> 10x 3' transcription profiling                      10x 3' v1 
#>                              0                              0 
#>                      10x 3' v2                      10x 3' v3 
#>                           7294                              0 
#>                      10x 5' v1                      DroNc-seq 
#>                              0                              0 
#>                       Drop-seq                      Smart-seq 
#>                              0                              0 
#>                   Smart-seq v4                     Smart-seq2 
#>                              0                           2859 
#>                    sci-RNA-seq 
#>                              0

Gene-length normalization of Smart-Seq2 data.

Smart-seq2 read counts have to be normalized by gene length.

Lets first get the gene lengths from var.feature_length.

smart_seq_gene_lengths <- seurat_obj$RNA[[]]$feature_length

Now we can use those to normalize Smart-seq data. So let’s split the object by assay.

seurat_obj.list <- SplitObject(seurat_obj, split.by = "assay")

An normalize the Smart-seq slice using the gene lengths and merge them back into a single object.

seurat_obj.list[["Smart-seq2"]][["RNA"]]@counts <- seurat_obj.list[["Smart-seq2"]][["RNA"]]@counts / smart_seq_gene_lengths
seurat_obj <- merge(seurat_obj.list[[1]], seurat_obj.list[[2]])

Integration with Seurat

Here we will use the native integration capabilities of Seurat.

For comprehensive usage and best practices of Seurat intergation please refer to the doc site of Seurat.

Inspecting data prior to integration

Let’s take a look at the strength of batch effects in our data. For that we will perform embedding visualization via UMAP.

Let’s do basic data normalization and variable gene selection

seurat_obj <- SCTransform(seurat_obj)
seurat_obj <- FindVariableFeatures(seurat_obj, selection.method = "vst", nfeatures = 2000)

And now perform PCA and UMAP

seurat_obj <- RunPCA(seurat_obj, features = VariableFeatures(object = seurat_obj))
seurat_obj <- RunUMAP(seurat_obj, dims = 1:30)

# By assay
p1 <- DimPlot(seurat_obj, reduction = "umap", group.by = "assay")
p1

# By cell type
p2 <- DimPlot(seurat_obj, reduction = "umap", group.by = "cell_type")
p2

You can see that batch effects are strong as cells cluster primarily by assay and then by cell_type. Properly integrated embedding would in principle cluster primarily by cell_type, assay should at best randomly distributed.

Data integration with Seurat

Whenever you query and fetch Census data from multiple datasets then integration needs to be performed as evidenced by the batch effects we observed.

The paramaters for Seurat used in this notebook were selected to the model run quickly. For best practices on integration of single-cell data using Seurat please refer to their documentation page.

seurat_d reading the article An integrated cell atlas of the human lung in health and disease by Sikkema et al. whom perfomed integration of 43 datasets from Lung.

Here we focus on the metadata from the Census that can be as batch information for integration.

Integration across datasets using `dataset_id`

All cells in the Census are annotated with the dataset they come from in "dataset_id". This is a great place to start for integration.

So let’s run a Seurat integration pipeline. First we define our model with batch set as dataset_id.

Firs normalize and select variable genes seperated by our batch key dataset_id

# split the dataset into a list of two seurat objects for each dataset
seurat_obj.list <- SplitObject(seurat_obj, split.by = "dataset_id")

# normalize each dataset independently
seurat_obj.list <- lapply(X = seurat_obj.list, FUN = function(x) {
  x <- SCTransform(x)
})

# select features for integration
features <- SelectIntegrationFeatures(object.list = seurat_obj.list)

Now we perform integration.

seurat_obj.list <- PrepSCTIntegration(seurat_obj.list, anchor.features = features)
seurat_obj.anchors <- FindIntegrationAnchors(object.list = seurat_obj.list, anchor.features = features, normalization.method = "SCT")
seurat_obj.combined <- IntegrateData(anchorset = seurat_obj.anchors, normalization.method = "SCT")

Let’s inspect the results by doing normalization and then UMAP visulization.

DefaultAssay(seurat_obj.combined) <- "integrated"

# Run the standard workflow for visualization and clustering
seurat_obj.combined <- ScaleData(seurat_obj.combined, verbose = FALSE)
seurat_obj.combined <- RunPCA(seurat_obj.combined, npcs = 30, verbose = FALSE)
seurat_obj.combined <- RunUMAP(seurat_obj.combined, reduction = "pca", dims = 1:30)

And plot the UMAP.

# By assay
p1 <- DimPlot(seurat_obj.combined, reduction = "umap", group.by = "assay")
p1

# By cell type
p2 <- DimPlot(seurat_obj.combined, reduction = "umap", group.by = "cell_type")
p2

Great! You can see that the clustering is no longer mainly driven by assay, albeit still contributing to it.

Integration across datasets using `dataset_id` and controlling for batch using `donor_id`

Similar to dataset_id, all cells in Census are annotated with donor_id. The definition of donor_id depends on the dataset and it is left to the discretion of data curators. However it is still rich in information and can be used as a batch variable during integration.

Because donor_id is not guaranteed to be unique across all cells of the Census, we strongly recommend concatenating dataset_id and donor_id and use that as the batch separator for Seurat

# split the dataset into a list of two seurat objects for each dataset
seurat_obj.list <- SplitObject(seurat_obj, split.by = "dataset_id")

# normalize each dataset independently controlling for batch
seurat_obj.list <- lapply(X = seurat_obj.list, FUN = function(x) {
  x <- SCTransform(x, vars.to.regress = "donor_id")
})

# select features for integration
features <- SelectIntegrationFeatures(object.list = seurat_obj.list)

Now we perform integration.

seurat_obj.list <- PrepSCTIntegration(seurat_obj.list, anchor.features = features)
seurat_obj.anchors <- FindIntegrationAnchors(object.list = seurat_obj.list, anchor.features = features, normalization.method = "SCT")
#> Finding all pairwise anchors
#> Running CCA
#> Merging objects
#> Finding neighborhoods
#> Finding anchors
#>  Found 7209 anchors
#> Filtering anchors
#>  Retained 5036 anchors
seurat_obj.combined <- IntegrateData(anchorset = seurat_obj.anchors, normalization.method = "SCT")
#> [1] 1
#> Warning: Different cells and/or features from existing assay SCT
#> Warning: Layer counts isn't present in the assay object; returning NULL
#> [1] 2
#> Warning: Different cells and/or features from existing assay SCT
#> Layer counts isn't present in the assay object; returning NULL
#> Merging dataset 1 into 2
#> Extracting anchors for merged samples
#> Finding integration vectors
#> Finding integration vector weights
#> Integrating data
#> Warning: Layer counts isn't present in the assay object; returning NULL
#> Warning: Assay integrated changing from Assay to SCTAssay
#> Warning: Layer counts isn't present in the assay object; returning NULL
#> Warning: Different cells and/or features from existing assay SCT

And inspect the new results by UMAP.

DefaultAssay(seurat_obj.combined) <- "integrated"

# Run the standard workflow for visualization and clustering
seurat_obj.combined <- RunPCA(seurat_obj.combined, npcs = 30, verbose = FALSE)
seurat_obj.combined <- RunUMAP(seurat_obj.combined, reduction = "pca", dims = 1:30)
#> 02:53:52 UMAP embedding parameters a = 0.9922 b = 1.112
#> 02:53:52 Read 10153 rows and found 30 numeric columns
#> 02:53:52 Using Annoy for neighbor search, n_neighbors = 30
#> 02:53:52 Building Annoy index with metric = cosine, n_trees = 50
#> 0%   10   20   30   40   50   60   70   80   90   100%
#> [----|----|----|----|----|----|----|----|----|----|
#> **************************************************|
#> 02:53:54 Writing NN index file to temp file /tmp/RtmpDFNk2P/file1d2213890893e
#> 02:53:54 Searching Annoy index using 1 thread, search_k = 3000
#> 02:53:58 Annoy recall = 100%
#> 02:53:58 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
#> 02:53:59 Initializing from normalized Laplacian + noise (using RSpectra)
#> 02:54:00 Commencing optimization for 200 epochs, with 410512 positive edges
#> 02:54:04 Optimization finished

Plot the UMAP.

# By assay
p1 <- DimPlot(seurat_obj.combined, reduction = "umap", group.by = "assay")
p1

# By cell type
p2 <- DimPlot(seurat_obj.combined, reduction = "umap", group.by = "cell_type")
p2

As you can see using dataset_id and donor_id as batch the cells now mostly cluster by cell type.

Integration across datasets using `dataset_id` and controlling for batch using `donor_id` + `assay_ontology_term_id` + `suspension_type`.

In some cases one dataset may contain multiple assay types and/or multiple suspension types (cell vs nucleus), and for those it is important to consider these metadata as batches.

Therefore, the most comprehensive definition of batch in the Census can be accomplished by combining the cell metadata of dataset_id, donor_id, assay_ontology_term_id and suspension_type, the latter will encode the EFO ids for assay types.

In our example, the two datasets that we used only contain cells from one assay each, and one suspension type for all of them. Thus it would not make a difference to include these metadata as part of batch.

This how the implementation would look line

# EXAMPLE, DON'T RUN.

# split the dataset into a list of seurat objects for each dataset
seurat_obj.list <- SplitObject(seurat_obj, split.by = "dataset_id")

# normalize each dataset independently controlling for batch
seurat_obj.list <- lapply(X = seurat_obj.list, FUN = function(x) {
  x <- SCTransform(x, vars.to.regress = c("donor_id", "assay_ontology_term_id", "suspension_type"))
})

# select features for integration
features <- SelectIntegrationFeatures(object.list = seurat_obj.list)

# integrate
seurat_obj.list <- PrepSCTIntegration(seurat_obj.list, anchor.features = features)
seurat_obj.anchors <- FindIntegrationAnchors(object.list = seurat_obj.list, anchor.features = features, normalization.method = "SCT")
seurat_obj.combined <- IntegrateData(anchorset = seurat_obj.anchors, normalization.method = "SCT")