Memory-efficient implementations of commonly used single-cell methods

Published: September 18, 2023

The Census team is thrilled to officially announce memory-efficient implementations of some of the most widely used single-cell algorithms.

With just a few lines of code, using the Census Python API, users can now perform the following computational tasks across tens of millions of cells using a conventional laptop with 8GB of memory:

Calculating average and variance gene expression for cells or genes. See example below or the full tutorial here.
Obtaining batch-corrected highly variable genes. See example below or the full tutorial here.

These implementations are interwoven with the way users query slices of Census data, which means that these tasks can be seamlessly applied to any slice of the 33M+ cells available in Census.

Continue reading for more implementation details and usage examples.

Efficient calculation of average and variance gene expression across millions of cells

With cellxgene_census.experimental.pp.mean_variance users can now get gene expression average and variance for all genes or cells in a given Census query.

How it works

Calculations are done in an accumulative incremental fashion, meaning that only a small fraction of the total data is processed at any given time.

The Census data is downloaded in increments and average and variance accumulators are updated at each incremental step. The implementation also takes advantage of CPU-based multiprocessing to speed up the process.

Currently, the mean and variance are calculated using the full population of cells/genes, including those with a zero valued measurement. In the future, we will enable calculation of mean including only the population of non-zero cells/genes.

Example: KRAS and AQP4 average and variance expression in lung epithelial cells

The following calculates the average and variance values for the genes KRAS and AQP4 in all epithelial cells of the human lung.

Users can easily switch the calculation, and obtain average and variance for each cell across the genes in the query. This is controlled by the axis argument of mean_variance.

import cellxgene_census
import tiledbsoma as soma
from cellxgene_census.experimental.pp import mean_variance

# Open the Census
census = cellxgene_census.open_soma()
human_data = census["census_data"]["homo_sapiens"]

# Set filters
cell_filter = (
  "is_primary_data == True "
  "and tissue_general == 'lung' "
  "and cell_type == 'epithelial cell'"
 )
gene_filter = "feature_name in ['KRAS', 'AQP4']"

# Perform query
query = human_data.axis_query(
  measurement_name="RNA",
  obs_query=soma.AxisQuery(value_filter= cell_filter),
  var_query=soma.AxisQuery(value_filter= gene_filter)
)

# Calculate mean and average per gene
mean_variance_df = mean_variance(query, axis=0, calculate_mean=True, calculate_variance=True)

# Get gene metadata of query
gene_df = query.var().concat().to_pandas()

query.close()
census.close()

Which results in:

mean_variance_df
# soma_joinid      mean     variance
# 8624         3.071926  5741.242485
# 16437        8.233282   452.119153

gene_df
#    soma_joinid       feature_id feature_name  feature_length
# 0         8624  ENSG00000171885         AQP4            5943
# 1        16437  ENSG00000133703         KRAS            6845

Efficient calculation of highly variable genes across millions of cells

With cellxgene_census.experimental.pp.get_highly_variable_genes users can get the most highly variable genes of a Census query while accounting for batch effects.

This is usually the first pre-processing step necessary for other downstream tasks, for example data integration.

How it works

The Census algorithm is based on the scanpy method scanpy.pp.highly_variable_genes, and in particular the Seurat V3 method, which is designed for raw counts and can account for batch effects.

The Census implementation utilizes the same incremental paradigm used in cellxgene_census.experimental.pp.mean_variance (see above), calculating incremental-based mean and variance accumulators with some tweaks to comply to the Seurat V3 method.

Example: Finding highly variable genes for all cells of the human esophagus

The following example identifies the top 1000 highly variable genes for all human esophagus cells. As a general rule of thumb it is good to use dataset_id as the batch variable.

import cellxgene_census
from cellxgene_census.experimental.pp import get_highly_variable_genes

census = cellxgene_census.open_soma()

hvg = get_highly_variable_genes(
  census,
  organism="Homo sapiens",
  obs_value_filter="is_primary_data == True and tissue_general == 'esophagus'",
  n_top_genes = 1000,
  batch_key = "dataset_id"
)

census.close()

Which results in:

hvg
# soma_joinid    means  variances  ...  variances_norm  highly_variable
# 0            0.003692   0.004627  ...        0.748221            False
# 1            0.003084   0.003203  ...        0.898657            False
# 2            0.014962   0.037395  ...        0.513473            False
# 3            0.218865   1.547648  ...        4.786928             True
# 4            0.002142   0.002242  ...        0.894955            False
# ...               ...        ...  ...             ...              ...
# 60659        0.000000   0.000000  ...        0.000000            False
# 60660        0.000000   0.000000  ...        0.000000            False
# 60661        0.000000   0.000000  ...        0.000000            False
# 60662        0.000000   0.000000  ...        0.000000            False
# 60663        0.000000   0.000000  ...        0.000000            False