Experimental Highly Variable Genes API

This tutorial describes use of the cellxgene_census.experimental.pp API for finding highly variable genes (HVGs) in the Census. The HVG algorithm implements the ranked normalized variance method seurat_v3 described in scanpy.pp.highly_variable_genes.

There are two API available:

  • get_highly_variable_genes() - high level function which accepts arguments similar to cellxgene_census.get_anndata(), and returns annotations for each var feature in a Pandas DataFrame.

  • highly_variable_genes() - lower level function which accepts a tiledbsoma.ExperimentAxisQuery and returns the same result.

Both functions accept common arguments to control ranking, with argument semantics matching the Scanpy API:

  • n_top_genes - number of genes to rank.

  • batch_key - if specified, normalized ranking will be done in separate batches based upon the obs column value name specified, and then merged into the final result.

  • span - the fraction of the data (cells) used when estimating the variance in the loess model fit.

In addition:

  • max_lowess_jitter - maxmimum jitter (noise) to data if LOESS fails. Disable by setting to zero.

For more information, see the docstrings for both functions (e.g. help(function))

[1]:
# Import packages
import cellxgene_census
import pandas as pd
import tiledbsoma as soma
from cellxgene_census.experimental.pp import (
    get_highly_variable_genes,
    highly_variable_genes,
)

get_highly_variable_genes

This convenience function will meet most use cases, and is a wrapper around highly_variable_genes. This demonstration requests the top 500 genes from the Mouse census where tissue_general is heart, and joins with the var dataframe.

The HVGs returned by get_highly_variable_genes are indexed by their soma_joinid. Join with the var dataframe to have a merged view of var metadata.

[2]:
with cellxgene_census.open_soma(census_version="stable") as census:
    hvgs_df = get_highly_variable_genes(
        census,
        organism="mus_musculus",
        n_top_genes=500,
        obs_value_filter="""is_primary_data == True and tissue_general == 'heart'""",
    )

    # while the Census is open, also grab the var dataframe for the mouse
    var_df = census["census_data"]["mus_musculus"].ms["RNA"].var.read().concat().to_pandas()

hvgs_df
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
[2]:
means variances highly_variable_rank variances_norm highly_variable
soma_joinid
0 0.230445 116.044863 NaN 1.749637 False
1 0.000000 0.000000 NaN 0.000000 False
2 0.000000 0.000000 NaN 0.000000 False
3 0.287551 45.276809 NaN 0.461324 False
4 67.407450 363945.055626 280.0 2.958509 True
... ... ... ... ... ...
52387 0.000000 0.000000 NaN 0.000000 False
52388 0.000000 0.000000 NaN 0.000000 False
52389 0.000000 0.000000 NaN 0.000000 False
52390 0.000000 0.000000 NaN 0.000000 False
52391 0.000000 0.000000 NaN 0.000000 False

52392 rows × 5 columns

Concat the two dataframes for convenience:

[3]:
combined_df = pd.concat([var_df.set_index("soma_joinid"), hvgs_df], axis=1)
combined_df
[3]:
feature_id feature_name feature_length means variances highly_variable_rank variances_norm highly_variable
soma_joinid
0 ENSMUSG00000051951 Xkr4 6094 0.230445 116.044863 NaN 1.749637 False
1 ENSMUSG00000089699 Gm1992 250 0.000000 0.000000 NaN 0.000000 False
2 ENSMUSG00000102343 Gm37381 1364 0.000000 0.000000 NaN 0.000000 False
3 ENSMUSG00000025900 Rp1 12311 0.287551 45.276809 NaN 0.461324 False
4 ENSMUSG00000025902 Sox17 4772 67.407450 363945.055626 280.0 2.958509 True
... ... ... ... ... ... ... ... ...
52387 ENSMUSG00000081591 Btf3-ps9 496 0.000000 0.000000 NaN 0.000000 False
52388 ENSMUSG00000118710 mmu-mir-467a-3_ENSMUSG00000118710 83 0.000000 0.000000 NaN 0.000000 False
52389 ENSMUSG00000119584 Rn18s 1849 0.000000 0.000000 NaN 0.000000 False
52390 ENSMUSG00000118538 Gm18218 970 0.000000 0.000000 NaN 0.000000 False
52391 ENSMUSG00000084217 Setd9-ps 670 0.000000 0.000000 NaN 0.000000 False

52392 rows × 8 columns

Select only the highly_variable genes by using the highly_variable column value:

[4]:
combined_df[combined_df.highly_variable]
[4]:
feature_id feature_name feature_length means variances highly_variable_rank variances_norm highly_variable
soma_joinid
4 ENSMUSG00000025902 Sox17 4772 67.407450 363945.055626 280.0 2.958509 True
188 ENSMUSG00000026117 Zap70 2992 5.409091 14793.026717 350.0 2.775560 True
233 ENSMUSG00000026073 Il1r2 1908 4.764085 41918.471500 206.0 3.402176 True
500 ENSMUSG00000026185 Igfbp5 6006 43.234876 314355.591239 156.0 3.825651 True
512 ENSMUSG00000026180 Cxcr2 3048 2.379390 10491.033344 173.0 3.640129 True
... ... ... ... ... ... ... ... ...
30296 ENSMUSG00000024803 Ankrd1 2886 38.548572 274005.455137 107.0 4.741864 True
30313 ENSMUSG00000024987 Cyp26a1 1983 2.186686 12973.622003 454.0 2.580162 True
30379 ENSMUSG00000018822 Sfrp5 1900 2.927853 10943.645525 410.0 2.637004 True
32042 ENSMUSG00000031838 Ifi30 980 91.676950 995276.564962 128.0 4.205886 True
33314 ENSMUSG00000092572 Serpinb10 3490 0.264085 227.239812 487.0 2.535469 True

500 rows × 8 columns

highly_variable_genes

This API provides the same function as get_highly_variable_genes, but accepts any tiledbsoma.ExperimentAxisQuery. It is intended for more advanced users who wish to use create and manage their own queries.

[5]:
with cellxgene_census.open_soma(census_version="stable") as census:
    experiment = census["census_data"]["mus_musculus"]
    with experiment.axis_query(
        measurement_name="RNA",
        obs_query=soma.AxisQuery(value_filter="""is_primary_data == True and tissue_general == 'heart'"""),
    ) as query:
        hvgs_df = highly_variable_genes(query, n_top_genes=500)

hvgs_df[hvgs_df.highly_variable]
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
[5]:
means variances highly_variable_rank variances_norm highly_variable
soma_joinid
4 67.407450 363945.055626 280.0 2.958509 True
188 5.409091 14793.026717 350.0 2.775560 True
233 4.764085 41918.471500 206.0 3.402176 True
500 43.234876 314355.591239 156.0 3.825651 True
512 2.379390 10491.033344 173.0 3.640129 True
... ... ... ... ... ...
30296 38.548572 274005.455137 107.0 4.741864 True
30313 2.186686 12973.622003 454.0 2.580162 True
30379 2.927853 10943.645525 410.0 2.637004 True
32042 91.676950 995276.564962 128.0 4.205886 True
33314 0.264085 227.239812 487.0 2.535469 True

500 rows × 5 columns