Tabula Muris

Introduction

To give you hands-on experience analyzing a single-cell RNASeq dataset from start to finish, we will be using data from Tabula Muris as an example. The Tabula Muris is a collaborative effort to profile every mouse tissue at a single-cell level. The full dataset includes both high throughput but low-coverage 10X data and lower throughput but high-coverage Smartseq2 data.

For this workshop, we will be using the Smartseq2 data from the mouse brain. This data consists of:

  1. an expression matrix where each column corresponds to a gene (or transcript) and each row corresponds to a single cell
  2. a table of metadata describing each cell

Downloading the data

The data is bundled in this course repository under scRNA-python-workshop/content/data.zip. You can also find the data files directly here. Unzip the folder so that the files are located in the directory scRNA-python-workshop/content/data/.

Reading the data

We can now read in the count matrix from the comma-separated file. Then inspect the resulting dataframe:

import pandas as pd ## load in the pandas library for working with dataframes
## tell pandas to make a new DataFrame with the contents of `brain_counts.csv`. This might take a minute.
count_dataframe = pd.read_csv('../data/brain_counts.csv', # where to find the data
                              index_col=0) # use the first column to label the rows (the 'index')

## print the first 2 rows of our dataframe
print(count_dataframe.head(2)) 
                       0610005C13Rik  0610007C21Rik  0610007L01Rik  \
A1.B003290.3_38_F.1.1              0            125             16   
A1.B003728.3_56_F.1.1              0              0              0   

                       0610007N19Rik  0610007P08Rik  0610007P14Rik  \
A1.B003290.3_38_F.1.1              0              0              0   
A1.B003728.3_56_F.1.1              0              0            324   

                       0610007P22Rik  0610008F07Rik  0610009B14Rik  \
A1.B003290.3_38_F.1.1              0              0              0   
A1.B003728.3_56_F.1.1              0              0              0   

                       0610009B22Rik  ...  Zxdb  Zxdc  Zyg11a  Zyg11b  Zyx  \
A1.B003290.3_38_F.1.1              0  ...     0     0       0       0    0   
A1.B003728.3_56_F.1.1              0  ...     0     0       0       0    0   

                       Zzef1  Zzz3  a  l7Rn6  zsGreen_transgene  
A1.B003290.3_38_F.1.1      0     0  0     54                  0  
A1.B003728.3_56_F.1.1      0     0  0      0                  0  

[2 rows x 23433 columns]

Exercise

What do the column names represent? What do the row names represent? How many cells and genes are in this dataset?

Hint: need some help? Try help(pd.DataFrame.shape).

Solution

The column names represent genes. The row names represent unique cell identifiers that were assigned by the authors of the dataset.

We can find out how many genes and cells are in the dataset by asking for its shape:
print(count_dataframe.shape)
gives:
(3401, 23433)

which represents (N rows, N columns).

help(pd.DataFrame.shape)

Reading the metadata

The authors have also provided metadata describing each cell. This metadata is stored in a separate file, brain_metadata.csv. We can load it into a dataframe and inspect it, just like we did for the count data.

Exercise

Load the metadata from the csv file into a pandas dataframe called metadata_dataframe. Does it have the same dimensions and index as the counts_dataframe?

Solution

First, we need to load in the metadata.
metadata_dataframe = pd.read_csv('../data/brain_metadata.csv', # where to find the data index_col=0) # use the first column as the index

Let's take a peak at the resulting dataframe to make sure it looks correct.
print(metadata_dataframe.shape) >>> (3401, 5)
We have 5 columns of information about 3,401 cells. Sounds reasonable.

Let's take a closer look and inspect the first few rows:

print(metadata_dataframe.head())
>>> cell_ontology_class subtissue mouse.sex mouse.id \ cell A1.B003290.3_38_F.1.1 astrocyte Striatum F 3_38_F A1.B003728.3_56_F.1.1 astrocyte Striatum F 3_56_F A1.MAA000560.3_10_M.1.1 oligodendrocyte Cortex M 3_10_M A1.MAA000564.3_10_M.1.1 endothelial cell Striatum M 3_10_M A1.MAA000923.3_9_M.1.1 astrocyte Hippocampus M 3_9_M plate.barcode cell A1.B003290.3_38_F.1.1 B003290 A1.B003728.3_56_F.1.1 B003728 A1.MAA000560.3_10_M.1.1 MAA000560 A1.MAA000564.3_10_M.1.1 MAA000564 A1.MAA000923.3_9_M.1.1 MAA000923

metadata_dataframe = pd.read_csv()

Exercise

To get a sense for what is in this dataset, let's look at the summary of each metadata column.

  1. How many cells from each subtissue label are in this dataset?
    Hint: try running help(pd.value_counts) to get started

  2. Using a for loop, repeat this counting procedure to summarize each of the metadata columns.
    Hint: you can access all the column names in the dataframe with columns = metadata_dataframe.columns.values

Solution

1. We can count the number of times each value appears in a column like this:
print(pd.value_counts(metadata_dataframe['subtissue'])) >>>> Cortex 1149 Hippocampus 976 Striatum 723 Cerebellum 553 Name: subtissue, dtype: int64

2. To repeat this for each column in the dataframe, we can use a for loop:
for column in metadata_dataframe.columns.values: print(pd.value_counts(metadata_dataframe[column]))

Building an AnnData object

We now have two dataframes, containing the counts and metadata from the Tabula Muris brain dataset. To keep these organized, we'll use a data structure called AnnData. AnnData stands for "annotated data," and is the standard format used by the analysis library, SCANPY.

AnnData uses some generalized vocabulary to describe cells and genes: they refer to cells as observations and genes as variables. This data structure has four areas where we can store information:

Drawing

AnnData.X stores the count matrix
AnnData.obs stores metadata about the observations (cells)
AnnData.var stores metadata about the variables (genes)
AnnData.uns stores any additional, unstructured information we decide to attach later

Here, we have a count matrix and metadata that describes each cell, so we will use the .X and .obs portions of the AnnData structure.

import scanpy as sc # import the scanpy library that tells Python how an AnnData data structure works
# help(sc.AnnData)
adata = sc.AnnData(X = count_dataframe, obs = metadata_dataframe)
print(adata)
AnnData object with n_obs × n_vars = 3401 × 23433 
    obs: 'cell_ontology_class', 'subtissue', 'mouse.sex', 'mouse.id', 'plate.barcode'

Labeling spike-ins

Because this is smartseq2 data, we may have spike-ins. These gene names start with ERCC. We can label them in adata.var as a gene annotation.

is_spike_in = {}
number_of_spike_ins = 0

for gene_name in adata.var_names:
    if 'ERCC' in gene_name:
        is_spike_in[gene_name] = True # record that we found a spike-in
        number_of_spike_ins += 1 # bump the counter
    else:
        is_spike_in[gene_name] = False # record that this was not a spike-in
        
adata.var['ERCC'] = pd.Series(is_spike_in) # because the index of adata.var and the keys of is_spike_in match, anndata will take care of matching them up
print('found this many spike ins: ', number_of_spike_ins)
found this many spike ins:  92

Now that we've finished building our AnnData object, we can save it in a file for later use like so:

adata.write('../data/brain_raw.h5ad') ## the h5ad extension is AnnData-specific
... storing 'cell_ontology_class' as categorical
... storing 'subtissue' as categorical
... storing 'mouse.sex' as categorical
... storing 'mouse.id' as categorical
... storing 'plate.barcode' as categorical