Tabula Muris
Introduction
To give you hands-on experience analyzing a single-cell RNA-seq dataset from start to finish, we will be using data from Tabula Muris as an example. Tabula Muris is a collaborative effort to profile every mouse tissue at single-cell resolution. The full dataset includes both high-throughput, low-coverage 10X data and lower-throughput, higher-coverage Smart-seq2 data.
For this workshop, we will be using the Smart-seq2 data from the mouse brain. This data consists of:
- an expression matrix where each column corresponds to a gene (or transcript) and each row corresponds to a single cell
- a table of metadata describing each cell
Downloading the data
The data is bundled in this course repository under scRNA-python-workshop/content/data.zip. You can also find the data files directly here. Unzip the archive so that the files are located in the directory scRNA-python-workshop/content/data/.
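If you prefer to unzip from Python rather than by hand, a minimal sketch using the standard-library zipfile module is below. The paths are assumptions about where you saved the archive, so adjust them to match your own checkout of the repository.
import zipfile ## standard library; nothing extra to install
## Assumption: you are running this from the scRNA-python-workshop/content/ directory
## and data.zip sits in that same directory -- adjust the paths if your layout differs.
with zipfile.ZipFile('data.zip') as archive:
    archive.extractall('.') ## should leave the files under scRNA-python-workshop/content/data/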
Reading the data
We can now read in the count matrix from the comma-separated file and inspect the resulting dataframe:
import pandas as pd ## load in the pandas library for working with dataframes
## tell pandas to make a new DataFrame with the contents of `brain_counts.csv`. This might take a minute.
count_dataframe = pd.read_csv('../data/brain_counts.csv', # where to find the data
index_col=0) # use the first column to label the rows (the 'index')
## print the first 2 rows of our dataframe
print(count_dataframe.head(2))
Exercise
What do the column names represent? What do the row names represent? How many cells and genes are in this dataset?
Hint: need some help? Try help(pd.DataFrame.shape).
Solution
The column names represent genes.
The row names represent unique cell identifiers that were assigned by the authors of the dataset.
We can find out how many genes and cells are in the dataset by asking for its shape:
print(count_dataframe.shape)
gives:
(3401, 23433)
which represents (N rows, N columns).
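As a quick illustration of how these labels are used (an extra peek, not part of the original exercise), we can use .iloc to look at the top-left corner of the matrix: the counts for the first three cells across the first five genes.
## Peek at the first 3 cells (rows) and first 5 genes (columns)
print(count_dataframe.iloc[:3, :5])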
Reading the metadata
The authors have also provided metadata describing each cell. This metadata is stored in a separate file, brain_metadata.csv. We can load it into a dataframe and inspect it, just like we did for the count data.
Exercise
Load the metadata from the csv file into a pandas dataframe called metadata_dataframe. Does it have the same dimensions and index as the count_dataframe?
Solution
First, we need to load in the metadata. Then we'll take a peek at the resulting dataframe to make sure it looks correct, and inspect the first few rows.
metadata_dataframe = pd.read_csv('../data/brain_metadata.csv', # where to find the data
index_col=0) # use the first column as the index
print(metadata_dataframe.shape)
>>> (3401, 5)
We have 5 columns of information about 3,401 cells. Sounds reasonable.
print(metadata_dataframe.head())
>>>                          cell_ontology_class    subtissue mouse.sex mouse.id plate.barcode
cell
A1.B003290.3_38_F.1.1                  astrocyte     Striatum         F   3_38_F       B003290
A1.B003728.3_56_F.1.1                  astrocyte     Striatum         F   3_56_F       B003728
A1.MAA000560.3_10_M.1.1          oligodendrocyte       Cortex         M   3_10_M     MAA000560
A1.MAA000564.3_10_M.1.1         endothelial cell     Striatum         M   3_10_M     MAA000564
A1.MAA000923.3_9_M.1.1                 astrocyte  Hippocampus         M    3_9_M     MAA000923
Exercise
To get a sense for what is in this dataset, let's look at a summary of each metadata column.
1. How many cells from each subtissue label are in this dataset?
Hint: try running help(pd.value_counts) to get started.
2. Using a for loop, repeat this counting procedure to summarize each of the metadata columns.
Hint: you can access all the column names in the dataframe with columns = metadata_dataframe.columns.values
Solution
1. We can count the number of times each value appears in a column like this:
print(pd.value_counts(metadata_dataframe['subtissue']))
>>>
Cortex 1149
Hippocampus 976
Striatum 723
Cerebellum 553
Name: subtissue, dtype: int64
2. To repeat this for each column in the dataframe, we can use a for loop:
for column in metadata_dataframe.columns.values:
    print(pd.value_counts(metadata_dataframe[column]))
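As an aside, newer versions of pandas favor the Series method value_counts() over the top-level pd.value_counts function; an equivalent sketch of the same loop (the output is identical):
for column in metadata_dataframe.columns.values:
    ## Series.value_counts() gives the same summary as pd.value_counts(series)
    print(metadata_dataframe[column].value_counts())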
We now have two dataframes, containing the counts and metadata from the Tabula Muris brain dataset. To keep these organized, we'll use a data structure called AnnData. AnnData stands for "annotated data," and is the standard format used by the analysis library, SCANPY.
AnnData uses some generalized vocabulary to describe cells and genes: it refers to cells as observations and genes as variables. This data structure has four areas where we can store information:
- AnnData.X stores the count matrix
- AnnData.obs stores metadata about the observations (cells)
- AnnData.var stores metadata about the variables (genes)
- AnnData.uns stores any additional, unstructured information we decide to attach later
Here, we have a count matrix and metadata that describes each cell, so we will use the .X and .obs portions of the AnnData structure.
import scanpy as sc # import the scanpy library that tells Python how an AnnData data structure works
# help(sc.AnnData)
adata = sc.AnnData(X = count_dataframe, obs = metadata_dataframe)
print(adata)
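As a quick sanity check (an illustrative addition, not part of the original workshop code), the slots described above should line up: the count matrix and the cell metadata describe the same 3,401 cells.
print(adata.X.shape)    # (3401, 23433): one row per cell, one column per gene
print(adata.obs.shape)  # (3401, 5): one metadata row per cell
print(adata.obs['subtissue'].head())  # the cell metadata now travels with the counts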
This dataset also includes ERCC spike-ins: synthetic RNAs added at known concentrations during library preparation. Their names all contain 'ERCC', so we can flag which variables (genes) are spike-ins and store that annotation in adata.var for use during quality control.
is_spike_in = {}
number_of_spike_ins = 0
for gene_name in adata.var_names:
    if 'ERCC' in gene_name:
        is_spike_in[gene_name] = True # record that we found a spike-in
        number_of_spike_ins += 1 # bump the counter
    else:
        is_spike_in[gene_name] = False # record that this was not a spike-in
adata.var['ERCC'] = pd.Series(is_spike_in) # because the index of adata.var and the keys of is_spike_in match, anndata will take care of matching them up
print('found this many spike ins: ', number_of_spike_ins)
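For reference, the same annotation can be written more compactly with pandas string methods on the gene names. This is just an alternative sketch of the loop above, not a replacement used in the workshop:
## adata.var_names is a pandas Index, so .str.contains works on it directly
adata.var['ERCC'] = adata.var_names.str.contains('ERCC')
print('found this many spike ins: ', adata.var['ERCC'].sum())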
Now that we've finished building our AnnData object, we can save it in a file for later use like so:
adata.write('../data/brain_raw.h5ad') ## the h5ad extension is AnnData-specific
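When we return to this dataset in later sessions, we can load the saved object back in with scanpy's h5ad reader:
adata = sc.read_h5ad('../data/brain_raw.h5ad') ## reload the AnnData object we just saved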