Experimental Highly Variable Genes APIο
This tutorial describes use of the cellxgene_census.experimental.pp
API for finding highly variable genes (HVGs) in the Census. The HVG algorithm implements the ranked normalized variance method seurat_v3
described in scanpy.pp.highly_variable_genes.
There are two API available:
get_highly_variable_genes()
- high level function which accepts arguments similar tocellxgene_census.get_anndata()
, and returns annotations for eachvar
feature in a Pandas DataFrame.highly_variable_genes()
- lower level function which accepts atiledbsoma.ExperimentAxisQuery
and returns the same result.
Both functions accept common arguments to control ranking, with argument semantics matching the Scanpy API:
n_top_genes
- number of genes to rank.batch_key
- if specified, normalized ranking will be done in separate batches based upon the obs column value name specified, and then merged into the final result.span
- the fraction of the data (cells) used when estimating the variance in the loess model fit.
In addition:
max_lowess_jitter
- maxmimum jitter (noise) to data if LOESS fails. Disable by setting to zero.
For more information, see the docstrings for both functions (e.g. help(function)
)
[1]:
# Import packages
import cellxgene_census
import pandas as pd
import tiledbsoma as soma
from cellxgene_census.experimental.pp import (
get_highly_variable_genes,
highly_variable_genes,
)
get_highly_variable_genesο
This convenience function will meet most use cases, and is a wrapper around highly_variable_genes
. This demonstration requests the top 500 genes from the Mouse census where tissue_general
is heart
, and joins with the var
dataframe.
The HVGs returned by get_highly_variable_genes are indexed by their soma_joinid
. Join with the var
dataframe to have a merged view of var metadata.
[2]:
with cellxgene_census.open_soma(census_version="stable") as census:
hvgs_df = get_highly_variable_genes(
census,
organism="mus_musculus",
n_top_genes=500,
obs_value_filter="""is_primary_data == True and tissue_general == 'heart'""",
)
# while the Census is open, also grab the var dataframe for the mouse
var_df = cellxgene_census.get_var(census, "mus_musculus")
hvgs_df
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
[2]:
means | variances | highly_variable_rank | variances_norm | highly_variable | |
---|---|---|---|---|---|
soma_joinid | |||||
0 | 0.230445 | 116.044863 | NaN | 1.749637 | False |
1 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
2 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
3 | 0.287551 | 45.276809 | NaN | 0.461324 | False |
4 | 67.407450 | 363945.055626 | 280.0 | 2.958509 | True |
... | ... | ... | ... | ... | ... |
52387 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
52388 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
52389 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
52390 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
52391 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
52392 rows Γ 5 columns
Concat the two dataframes for convenience:
[3]:
combined_df = pd.concat([var_df.set_index("soma_joinid"), hvgs_df], axis=1)
combined_df
[3]:
feature_id | feature_name | feature_length | means | variances | highly_variable_rank | variances_norm | highly_variable | |
---|---|---|---|---|---|---|---|---|
soma_joinid | ||||||||
0 | ENSMUSG00000051951 | Xkr4 | 6094 | 0.230445 | 116.044863 | NaN | 1.749637 | False |
1 | ENSMUSG00000089699 | Gm1992 | 250 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
2 | ENSMUSG00000102343 | Gm37381 | 1364 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
3 | ENSMUSG00000025900 | Rp1 | 12311 | 0.287551 | 45.276809 | NaN | 0.461324 | False |
4 | ENSMUSG00000025902 | Sox17 | 4772 | 67.407450 | 363945.055626 | 280.0 | 2.958509 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... |
52387 | ENSMUSG00000081591 | Btf3-ps9 | 496 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
52388 | ENSMUSG00000118710 | mmu-mir-467a-3_ENSMUSG00000118710 | 83 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
52389 | ENSMUSG00000119584 | Rn18s | 1849 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
52390 | ENSMUSG00000118538 | Gm18218 | 970 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
52391 | ENSMUSG00000084217 | Setd9-ps | 670 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
52392 rows Γ 8 columns
Select only the highly_variable genes by using the highly_variable
column value:
[4]:
combined_df[combined_df.highly_variable]
[4]:
feature_id | feature_name | feature_length | means | variances | highly_variable_rank | variances_norm | highly_variable | |
---|---|---|---|---|---|---|---|---|
soma_joinid | ||||||||
4 | ENSMUSG00000025902 | Sox17 | 4772 | 67.407450 | 363945.055626 | 280.0 | 2.958509 | True |
188 | ENSMUSG00000026117 | Zap70 | 2992 | 5.409091 | 14793.026717 | 350.0 | 2.775560 | True |
233 | ENSMUSG00000026073 | Il1r2 | 1908 | 4.764085 | 41918.471500 | 206.0 | 3.402176 | True |
500 | ENSMUSG00000026185 | Igfbp5 | 6006 | 43.234876 | 314355.591239 | 156.0 | 3.825651 | True |
512 | ENSMUSG00000026180 | Cxcr2 | 3048 | 2.379390 | 10491.033344 | 173.0 | 3.640129 | True |
... | ... | ... | ... | ... | ... | ... | ... | ... |
30296 | ENSMUSG00000024803 | Ankrd1 | 2886 | 38.548572 | 274005.455137 | 107.0 | 4.741864 | True |
30313 | ENSMUSG00000024987 | Cyp26a1 | 1983 | 2.186686 | 12973.622003 | 454.0 | 2.580162 | True |
30379 | ENSMUSG00000018822 | Sfrp5 | 1900 | 2.927853 | 10943.645525 | 410.0 | 2.637004 | True |
32042 | ENSMUSG00000031838 | Ifi30 | 980 | 91.676950 | 995276.564962 | 128.0 | 4.205886 | True |
33314 | ENSMUSG00000092572 | Serpinb10 | 3490 | 0.264085 | 227.239812 | 487.0 | 2.535469 | True |
500 rows Γ 8 columns
highly_variable_genesο
This API provides the same function as get_highly_variable_genes
, but accepts any tiledbsoma.ExperimentAxisQuery
. It is intended for more advanced users who wish to use create and manage their own queries.
[5]:
with cellxgene_census.open_soma(census_version="stable") as census:
experiment = census["census_data"]["mus_musculus"]
with experiment.axis_query(
measurement_name="RNA",
obs_query=soma.AxisQuery(value_filter="""is_primary_data == True and tissue_general == 'heart'"""),
) as query:
hvgs_df = highly_variable_genes(query, n_top_genes=500)
hvgs_df[hvgs_df.highly_variable]
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
[5]:
means | variances | highly_variable_rank | variances_norm | highly_variable | |
---|---|---|---|---|---|
soma_joinid | |||||
4 | 67.407450 | 363945.055626 | 280.0 | 2.958509 | True |
188 | 5.409091 | 14793.026717 | 350.0 | 2.775560 | True |
233 | 4.764085 | 41918.471500 | 206.0 | 3.402176 | True |
500 | 43.234876 | 314355.591239 | 156.0 | 3.825651 | True |
512 | 2.379390 | 10491.033344 | 173.0 | 3.640129 | True |
... | ... | ... | ... | ... | ... |
30296 | 38.548572 | 274005.455137 | 107.0 | 4.741864 | True |
30313 | 2.186686 | 12973.622003 | 454.0 | 2.580162 | True |
30379 | 2.927853 | 10943.645525 | 410.0 | 2.637004 | True |
32042 | 91.676950 | 995276.564962 | 128.0 | 4.205886 | True |
33314 | 0.264085 | 227.239812 | 487.0 | 2.535469 | True |
500 rows Γ 5 columns