Experimental Highly Variable Genes API

This tutorial describes use of the cellxgene_census.experimental.pp API for finding highly variable genes (HVGs) in the Census. The HVG algorithm implements the ranked normalized variance method seurat_v3 described in scanpy.pp.highly_variable_genes.

There are two API available:

get_highly_variable_genes() - high level function which accepts arguments similar to cellxgene_census.get_anndata(), and returns annotations for each var feature in a Pandas DataFrame.
highly_variable_genes() - lower level function which accepts a tiledbsoma.ExperimentAxisQuery and returns the same result.

Both functions accept common arguments to control ranking, with argument semantics matching the Scanpy API:

n_top_genes - number of genes to rank.
batch_key - if specified, normalized ranking will be done in separate batches based upon the obs column value name specified, and then merged into the final result.
span - the fraction of the data (cells) used when estimating the variance in the loess model fit.

In addition:

max_lowess_jitter - maxmimum jitter (noise) to data if LOESS fails. Disable by setting to zero.

For more information, see the docstrings for both functions (e.g. help(function))

[1]:

# Import packages
import cellxgene_census
import pandas as pd
import tiledbsoma as soma
from cellxgene_census.experimental.pp import (
    get_highly_variable_genes,
    highly_variable_genes,
)

get_highly_variable_genes

This convenience function will meet most use cases, and is a wrapper around highly_variable_genes. This demonstration requests the top 500 genes from the Mouse census where tissue_general is heart, and joins with the var dataframe.

The HVGs returned by get_highly_variable_genes are indexed by their soma_joinid. Join with the var dataframe to have a merged view of var metadata.

[2]:

with cellxgene_census.open_soma(census_version="stable") as census:
    hvgs_df = get_highly_variable_genes(
        census,
        organism="mus_musculus",
        n_top_genes=500,
        obs_value_filter="""is_primary_data == True and tissue_general == 'heart'""",
    )

    # while the Census is open, also grab the var dataframe for the mouse
    var_df = cellxgene_census.get_var(census, "mus_musculus")

hvgs_df

The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.

[2]:

	means	variances	highly_variable_rank	variances_norm	highly_variable
soma_joinid
0	0.230445	116.044863	NaN	1.749637	False
1	0.000000	0.000000	NaN	0.000000	False
2	0.000000	0.000000	NaN	0.000000	False
3	0.287551	45.276809	NaN	0.461324	False
4	67.407450	363945.055626	280.0	2.958509	True
...	...	...	...	...	...
52387	0.000000	0.000000	NaN	0.000000	False
52388	0.000000	0.000000	NaN	0.000000	False
52389	0.000000	0.000000	NaN	0.000000	False
52390	0.000000	0.000000	NaN	0.000000	False
52391	0.000000	0.000000	NaN	0.000000	False

52392 rows × 5 columns

Concat the two dataframes for convenience:

[3]:

combined_df = pd.concat([var_df.set_index("soma_joinid"), hvgs_df], axis=1)
combined_df

[3]:

	feature_id	feature_name	feature_length	means	variances	highly_variable_rank	variances_norm	highly_variable
soma_joinid
0	ENSMUSG00000051951	Xkr4	6094	0.230445	116.044863	NaN	1.749637	False
1	ENSMUSG00000089699	Gm1992	250	0.000000	0.000000	NaN	0.000000	False
2	ENSMUSG00000102343	Gm37381	1364	0.000000	0.000000	NaN	0.000000	False
3	ENSMUSG00000025900	Rp1	12311	0.287551	45.276809	NaN	0.461324	False
4	ENSMUSG00000025902	Sox17	4772	67.407450	363945.055626	280.0	2.958509	True
...	...	...	...	...	...	...	...	...
52387	ENSMUSG00000081591	Btf3-ps9	496	0.000000	0.000000	NaN	0.000000	False
52388	ENSMUSG00000118710	mmu-mir-467a-3_ENSMUSG00000118710	83	0.000000	0.000000	NaN	0.000000	False
52389	ENSMUSG00000119584	Rn18s	1849	0.000000	0.000000	NaN	0.000000	False
52390	ENSMUSG00000118538	Gm18218	970	0.000000	0.000000	NaN	0.000000	False
52391	ENSMUSG00000084217	Setd9-ps	670	0.000000	0.000000	NaN	0.000000	False

52392 rows × 8 columns

Select only the highly_variable genes by using the highly_variable column value:

[4]:

combined_df[combined_df.highly_variable]

[4]:

	feature_id	feature_name	feature_length	means	variances	highly_variable_rank	variances_norm	highly_variable
soma_joinid
4	ENSMUSG00000025902	Sox17	4772	67.407450	363945.055626	280.0	2.958509	True
188	ENSMUSG00000026117	Zap70	2992	5.409091	14793.026717	350.0	2.775560	True
233	ENSMUSG00000026073	Il1r2	1908	4.764085	41918.471500	206.0	3.402176	True
500	ENSMUSG00000026185	Igfbp5	6006	43.234876	314355.591239	156.0	3.825651	True
512	ENSMUSG00000026180	Cxcr2	3048	2.379390	10491.033344	173.0	3.640129	True
...	...	...	...	...	...	...	...	...
30296	ENSMUSG00000024803	Ankrd1	2886	38.548572	274005.455137	107.0	4.741864	True
30313	ENSMUSG00000024987	Cyp26a1	1983	2.186686	12973.622003	454.0	2.580162	True
30379	ENSMUSG00000018822	Sfrp5	1900	2.927853	10943.645525	410.0	2.637004	True
32042	ENSMUSG00000031838	Ifi30	980	91.676950	995276.564962	128.0	4.205886	True
33314	ENSMUSG00000092572	Serpinb10	3490	0.264085	227.239812	487.0	2.535469	True

500 rows × 8 columns

highly_variable_genes

This API provides the same function as get_highly_variable_genes, but accepts any tiledbsoma.ExperimentAxisQuery. It is intended for more advanced users who wish to use create and manage their own queries.

[5]:

with cellxgene_census.open_soma(census_version="stable") as census:
    experiment = census["census_data"]["mus_musculus"]
    with experiment.axis_query(
        measurement_name="RNA",
        obs_query=soma.AxisQuery(value_filter="""is_primary_data == True and tissue_general == 'heart'"""),
    ) as query:
        hvgs_df = highly_variable_genes(query, n_top_genes=500)

hvgs_df[hvgs_df.highly_variable]

The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.

[5]:

	means	variances	highly_variable_rank	variances_norm	highly_variable
soma_joinid
4	67.407450	363945.055626	280.0	2.958509	True
188	5.409091	14793.026717	350.0	2.775560	True
233	4.764085	41918.471500	206.0	3.402176	True
500	43.234876	314355.591239	156.0	3.825651	True
512	2.379390	10491.033344	173.0	3.640129	True
...	...	...	...	...	...
30296	38.548572	274005.455137	107.0	4.741864	True
30313	2.186686	12973.622003	454.0	2.580162	True
30379	2.927853	10943.645525	410.0	2.637004	True
32042	91.676950	995276.564962	128.0	4.205886	True
33314	0.264085	227.239812	487.0	2.535469	True

500 rows × 5 columns