{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Experimental Highly Variable Genes API\n", "\n", "This tutorial describes how to use the `cellxgene_census.experimental.pp` API to find highly variable genes (HVGs) in the Census. The HVG algorithm implements the ranked normalized variance method `seurat_v3` described in [scanpy.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html#scanpy.pp.highly_variable_genes).\n", "\n", "Two functions are available:\n", "\n", "* `get_highly_variable_genes()` - high-level function that accepts arguments similar to `cellxgene_census.get_anndata()` and returns annotations for each `var` feature in a Pandas DataFrame.\n", "* `highly_variable_genes()` - lower-level function that accepts a `tiledbsoma.ExperimentAxisQuery` and returns the same result.\n", "\n", "Both functions accept common arguments to control ranking, with argument semantics matching the Scanpy API:\n", "\n", "* `n_top_genes` - number of genes to rank.\n", "* `batch_key` - if specified, normalized ranking is performed separately for each batch, as defined by the values of the named `obs` column, and the per-batch results are then merged into the final result.\n", "* `span` - the fraction of the data (cells) used when estimating the variance in the [loess model fit](https://has2k1.github.io/scikit-misc/stable/generated/skmisc.loess.loess_model.html#skmisc.loess.loess_model).\n", "\n", "In addition:\n", "\n", "* `max_lowess_jitter` - maximum jitter (noise) added to the data if the LOESS fit fails. Disable by setting to zero.\n", "\n", "For more information, see the docstrings for both functions (e.g. 
`help(function)`)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:32:12.060594Z", "iopub.status.busy": "2023-07-28T16:32:12.060106Z", "iopub.status.idle": "2023-07-28T16:32:17.031550Z", "shell.execute_reply": "2023-07-28T16:32:17.030945Z" } }, "outputs": [], "source": [ "# Import packages\n", "import cellxgene_census\n", "import pandas as pd\n", "import tiledbsoma as soma\n", "from cellxgene_census.experimental.pp import (\n", "    get_highly_variable_genes,\n", "    highly_variable_genes,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## get_highly_variable_genes\n", "\n", "This convenience function is a wrapper around `highly_variable_genes` and will meet most use cases. This demonstration requests the top 500 genes from the mouse Census where `tissue_general` is `heart`, and joins the result with the `var` dataframe.\n", "\n", "The HVGs returned by `get_highly_variable_genes` are indexed by their `soma_joinid`. Join with the `var` dataframe to get a merged view of the `var` metadata." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:32:17.034646Z", "iopub.status.busy": "2023-07-28T16:32:17.034205Z", "iopub.status.idle": "2023-07-28T16:32:31.117329Z", "shell.execute_reply": "2023-07-28T16:32:31.116757Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n" ] }, { "data": { "text/html": [ "
\n", "
| soma_joinid | means | variances | highly_variable_rank | variances_norm | highly_variable |
|---|---|---|---|---|---|
| 0 | 0.230445 | 116.044863 | NaN | 1.749637 | False |
| 1 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 2 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 3 | 0.287551 | 45.276809 | NaN | 0.461324 | False |
| 4 | 67.407450 | 363945.055626 | 280.0 | 2.958509 | True |
| ... | ... | ... | ... | ... | ... |
| 52387 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 52388 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 52389 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 52390 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 52391 | 0.000000 | 0.000000 | NaN | 0.000000 | False |

52392 rows × 5 columns
\n", "\n", "
| soma_joinid | feature_id | feature_name | feature_length | means | variances | highly_variable_rank | variances_norm | highly_variable |
|---|---|---|---|---|---|---|---|---|
| 0 | ENSMUSG00000051951 | Xkr4 | 6094 | 0.230445 | 116.044863 | NaN | 1.749637 | False |
| 1 | ENSMUSG00000089699 | Gm1992 | 250 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 2 | ENSMUSG00000102343 | Gm37381 | 1364 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 3 | ENSMUSG00000025900 | Rp1 | 12311 | 0.287551 | 45.276809 | NaN | 0.461324 | False |
| 4 | ENSMUSG00000025902 | Sox17 | 4772 | 67.407450 | 363945.055626 | 280.0 | 2.958509 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 52387 | ENSMUSG00000081591 | Btf3-ps9 | 496 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 52388 | ENSMUSG00000118710 | mmu-mir-467a-3_ENSMUSG00000118710 | 83 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 52389 | ENSMUSG00000119584 | Rn18s | 1849 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 52390 | ENSMUSG00000118538 | Gm18218 | 970 | 0.000000 | 0.000000 | NaN | 0.000000 | False |
| 52391 | ENSMUSG00000084217 | Setd9-ps | 670 | 0.000000 | 0.000000 | NaN | 0.000000 | False |

52392 rows × 8 columns
\n", "\n", "
| soma_joinid | feature_id | feature_name | feature_length | means | variances | highly_variable_rank | variances_norm | highly_variable |
|---|---|---|---|---|---|---|---|---|
| 4 | ENSMUSG00000025902 | Sox17 | 4772 | 67.407450 | 363945.055626 | 280.0 | 2.958509 | True |
| 188 | ENSMUSG00000026117 | Zap70 | 2992 | 5.409091 | 14793.026717 | 350.0 | 2.775560 | True |
| 233 | ENSMUSG00000026073 | Il1r2 | 1908 | 4.764085 | 41918.471500 | 206.0 | 3.402176 | True |
| 500 | ENSMUSG00000026185 | Igfbp5 | 6006 | 43.234876 | 314355.591239 | 156.0 | 3.825651 | True |
| 512 | ENSMUSG00000026180 | Cxcr2 | 3048 | 2.379390 | 10491.033344 | 173.0 | 3.640129 | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 30296 | ENSMUSG00000024803 | Ankrd1 | 2886 | 38.548572 | 274005.455137 | 107.0 | 4.741864 | True |
| 30313 | ENSMUSG00000024987 | Cyp26a1 | 1983 | 2.186686 | 12973.622003 | 454.0 | 2.580162 | True |
| 30379 | ENSMUSG00000018822 | Sfrp5 | 1900 | 2.927853 | 10943.645525 | 410.0 | 2.637004 | True |
| 32042 | ENSMUSG00000031838 | Ifi30 | 980 | 91.676950 | 995276.564962 | 128.0 | 4.205886 | True |
| 33314 | ENSMUSG00000092572 | Serpinb10 | 3490 | 0.264085 | 227.239812 | 487.0 | 2.535469 | True |

500 rows × 8 columns
\n", "\n", "
| soma_joinid | means | variances | highly_variable_rank | variances_norm | highly_variable |
|---|---|---|---|---|---|
| 4 | 67.407450 | 363945.055626 | 280.0 | 2.958509 | True |
| 188 | 5.409091 | 14793.026717 | 350.0 | 2.775560 | True |
| 233 | 4.764085 | 41918.471500 | 206.0 | 3.402176 | True |
| 500 | 43.234876 | 314355.591239 | 156.0 | 3.825651 | True |
| 512 | 2.379390 | 10491.033344 | 173.0 | 3.640129 | True |
| ... | ... | ... | ... | ... | ... |
| 30296 | 38.548572 | 274005.455137 | 107.0 | 4.741864 | True |
| 30313 | 2.186686 | 12973.622003 | 454.0 | 2.580162 | True |
| 30379 | 2.927853 | 10943.645525 | 410.0 | 2.637004 | True |
| 32042 | 91.676950 | 995276.564962 | 128.0 | 4.205886 | True |
| 33314 | 0.264085 | 227.239812 | 487.0 | 2.535469 | True |

500 rows × 5 columns
\n", "