Once we have our expression matrix, it should be examined to remove poor quality cells which were not detected in the initial processing of the raw reads. Failure to remove low quality cells at this stage may add technical noise which has the potential to obscure the biological signals of interest in the downstream analysis.

Since there is currently no standard method for performing scRNAseq, the expected values for the various QC measures that will be presented here can vary substantially from experiment to experiment. Thus, to perform QC we will be looking for cells which are outliers with respect to the rest of the dataset rather than comparing to independent quality standards. Consequently, care should be taken when comparing quality metrics across datasets collected using different protocols.


We'll continue working with the Tabula Muris brain data that we prepared earlier:

import scanpy as sc # import scanpy to handle our AnnData 
import pandas as pd # import pandas to handle dataframes
import matplotlib.pyplot as plt # import matplotlib to visualize our qc metrics

# magic incantation to help matplotlib work with our jupyter notebook
%matplotlib inline 
adata ='../data/brain_raw.h5ad')

As an aside, you'll notice that both here and in the previous notebook we read data into python objects using some variation on*, where * indicates that there is some possible suffix. Developers of python data packages that we'll be using today, and many others in the ecosystem try very hard to group objects so that they are discoverable.

One method of discovering functionality in python is to read documentation. Another is to use tab-completion to find similar functions. For reading files, it's usually possible to see what possibilities there are by tab completing, package.load, or It's unusual for these functions to start with other words.

For writing files, usually the pattern is or object.write. In this case, try using the pattern by tab completing and see what happens. What about tab completion with the object.write pattern using the adata object we just loaded?

Knowing this pattern is very helpful. Unfortunately, data files containing expression information will often come in all shapes and sizes. Thankfully, developers of these packages are aware of the common types, and may have already designed readers and writers to make them easy to work with. With this knowledge in hand, you'll be well equipped to execute the following tasks on data stored in a broader variety of file formats.

Computing quality control metrics

We'll compute quality metrics and then filter cells and genes accordingly.

The calculate_qc_metrics function returns two dataframes: one containing quality control metrics about cells, and one containing metrics about genes. This function is housed in the 'preprocessing' portion of the SCANPY library, which you can read more about here.

qc = sc.pp.calculate_qc_metrics(adata, qc_vars = ['ERCC'])# this returns a tuple of (cell_qc_dataframe, gene_qc_dataframe)
                                 # ask for the percentage of reads from spike ins
cell_qc_dataframe = qc[0]
gene_qc_dataframe = qc[1]

print('This is the cell quality control dataframe:')

print('\n\n\n\nThis is the gene quality control dataframe:')
This is the cell quality control dataframe:
                       n_genes_by_counts  log1p_n_genes_by_counts  \
A1.B003290.3_38_F.1.1               3359                 8.119696   
A1.B003728.3_56_F.1.1               1718                 7.449498   

                       total_counts  log1p_total_counts  \
A1.B003290.3_38_F.1.1      390075.0           12.874097   
A1.B003728.3_56_F.1.1      776439.0           13.562474   

                       pct_counts_in_top_50_genes  \
A1.B003290.3_38_F.1.1                   25.884766   
A1.B003728.3_56_F.1.1                   43.051933   

                       pct_counts_in_top_100_genes  \
A1.B003290.3_38_F.1.1                    32.847017   
A1.B003728.3_56_F.1.1                    52.912721   

                       pct_counts_in_top_200_genes  \
A1.B003290.3_38_F.1.1                    42.219573   
A1.B003728.3_56_F.1.1                    65.313309   

                       pct_counts_in_top_500_genes  total_counts_ERCC  \
A1.B003290.3_38_F.1.1                    59.472666            10201.0   
A1.B003728.3_56_F.1.1                    87.315423            67351.0   

                       log1p_total_counts_ERCC  pct_counts_ERCC  
A1.B003290.3_38_F.1.1                 9.230339         2.615138  
A1.B003728.3_56_F.1.1                11.117688         8.674345  

This is the gene quality control dataframe:
               n_cells_by_counts  mean_counts  log1p_mean_counts  \
0610005C13Rik                 28     0.118201           0.111721   
0610007C21Rik               2399   206.211990           5.333742   

               pct_dropout_by_counts  total_counts  log1p_total_counts  
0610005C13Rik              99.176713         402.0            5.998937  
0610007C21Rik              29.461923      701327.0           13.460731  

Quality control for cells

Library size

First, we consider the total number of reads detected per cell. Cells with few reads are likely to have been broken or failed to capture a cell, and should thus be removed.

plt.hist(cell_qc_dataframe['total_counts'], bins=1000)
plt.xlabel('Total counts')
plt.ylabel('N cells')
plt.axvline(50000, color='red')
# plt.xlim(0,1e6) # Try plotting with and without scaling the x-axis. When is this helpful?
<matplotlib.lines.Line2D at 0x13a25c390>

Looks like the authors have already removed cells with fewer than 50,000 reads.

Detected genes

In addition to ensuring sufficient sequencing depth for each sample, we also want to make sure that the reads are distributed across the transcriptome. Thus, we count the total number of unique genes detected in each sample.

plt.hist(cell_qc_dataframe['n_genes_by_counts'], bins=100)
plt.xlabel('N genes')
plt.ylabel('N cells')
plt.axvline(1000, color='red')
<matplotlib.lines.Line2D at 0x13a25c828>

From the plot we conclude that most cells have between ~1,000-5,000 detected genes, which is typical for smartseq2 data. However, this varies by experimental protocol and sequencing depth.

The most notable feature in the above plot is the little peak on the left hand side of the distribution. If detection rates were equal across the cells then the distribution should be approximately normal. Thus, we will remove those cells in the tail of the distribution (fewer than ~1000 detected genes).


Another measure of cell quality is the ratio between ERCC spike-in RNAs and endogenous RNAs. This ratio can be used to estimate the total amount of RNA in the captured cells. Cells with a high level of spike-in RNAs had low starting amounts of RNA, likely due to the cell being dead or stressed which may result in the RNA being degraded.


Plot the distribution of pct_counts_ERCC in this dataset. What do you think is a reasonable threshold?


plt.hist(cell_qc_dataframe['pct_counts_ERCC'], bins=1000) plt.xlabel('Percent counts ERCC') plt.ylabel('N cells') plt.axvline(10, color='red')

Placing a threshold is always a judgement call. Here, the majority of cells have less than 10% ERCC counts, but there's a long tail of cells that have very high spike-in counts; these are likely dead cells and should be removed.


Cell filtering

Now we can define a cell filter based on our findings:

There isn't an automatic function for removing cells with a high percentage of ERCC reads, but we can use a mask to remove them like so:

low_ERCC_mask = (cell_qc_dataframe['pct_counts_ERCC'] < 10)
adata = adata[low_ERCC_mask]


Use the SCANPY function sc.pp.filter_cells() to filter cells with fewer 750 genes detected. How many cells does this remove?
_Hint: start with the Parameters list in help(sc.pp.filter_cells)


print('Started with: \n', adata) sc.pp.filter_cells(adata, min_genes = 750) print('Finished with: \n', adata)

Trying to set attribute `.obs` of view, making a copy.

Quality control for genes

It is typically a good idea to remove genes whose expression level is considered "undetectable". We define a gene as detectable if at least two cells contain more than 5 reads from the gene. However, the threshold strongly depends on the sequencing depth. It is important to keep in mind that genes must be filtered after cell filtering since some genes may only be detected in poor quality cells.

plt.hist(gene_qc_dataframe['n_cells_by_counts'], bins=1000)
plt.xlabel('N cells expressing > 0')
plt.ylabel('log(N genes)') # for visual clarity
plt.axvline(2, color='red')
plt.hist(gene_qc_dataframe['total_counts'], bins=1000)
plt.xlabel('Total counts')
plt.ylabel('log(N genes)') # for visual clarity
plt.axvline(10, color='red')
<matplotlib.lines.Line2D at 0x12fffcba8>


Use the SCANPY function sc.pp.filter_genes() to filter genes according to the criteria above. How many genes does this remove?
_Hint: start with the Parameters list in help(sc.pp.filter_genes)


print('Started with: \n', adata) sc.pp.filter_genes(adata, min_cells = 2) sc.pp.filter_genes(adata, min_counts = 10) print('Finished with: \n', adata)


Save the data

# print(adata) ## Final dimensions of the QC'd dataset