{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Experimental Highly Variable Genes API\n", "\n", "This tutorial describes use of the `cellxgene_census.experimental.pp` API for finding highly variable genes (HVGs) in the Census. The HVG algorithm implements the ranked normalized variance method `seurat_v3` described in [scanpy.pp.highly_variable_genes](https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html#scanpy.pp.highly_variable_genes).\n", "\n", "There are two API available:\n", "\n", "* `get_highly_variable_genes()` - high level function which accepts arguments similar to `cellxgene_census.get_anndata()`, and returns annotations for each `var` feature in a Pandas DataFrame.\n", "* `highly_variable_genes()` - lower level function which accepts a `tiledbsoma.ExperimentAxisQuery` and returns the same result.\n", "\n", "Both functions accept common arguments to control ranking, with argument semantics matching the Scanpy API:\n", "\n", "* `n_top_genes` - number of genes to rank.\n", "* `batch_key` - if specified, normalized ranking will be done in separate batches based upon the obs column value name specified, and then merged into the final result.\n", "* `span` - the fraction of the data (cells) used when estimating the variance in the [loess model fit](https://has2k1.github.io/scikit-misc/stable/generated/skmisc.loess.loess_model.html#skmisc.loess.loess_model).\n", "\n", "In addition:\n", "\n", "* `max_lowess_jitter` - maxmimum jitter (noise) to data if LOESS fails. Disable by setting to zero.\n", "\n", "For more information, see the docstrings for both functions (e.g. `help(function)`)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:32:12.060594Z", "iopub.status.busy": "2023-07-28T16:32:12.060106Z", "iopub.status.idle": "2023-07-28T16:32:17.031550Z", "shell.execute_reply": "2023-07-28T16:32:17.030945Z" } }, "outputs": [], "source": [ "# Import packages\n", "import cellxgene_census\n", "import pandas as pd\n", "import tiledbsoma as soma\n", "from cellxgene_census.experimental.pp import (\n", " get_highly_variable_genes,\n", " highly_variable_genes,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## get_highly_variable_genes\n", "\n", "This convenience function will meet most use cases, and is a wrapper around `highly_variable_genes`. This demonstration requests the top 500 genes from the Mouse census where `tissue_general` is `heart`, and joins with the `var` dataframe.\n", "\n", "The HVGs returned by get_highly_variable_genes are indexed by their `soma_joinid`. Join with the `var` dataframe to have a merged view of var metadata." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:32:17.034646Z", "iopub.status.busy": "2023-07-28T16:32:17.034205Z", "iopub.status.idle": "2023-07-28T16:32:31.117329Z", "shell.execute_reply": "2023-07-28T16:32:31.116757Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meansvarianceshighly_variable_rankvariances_normhighly_variable
soma_joinid
00.230445116.044863NaN1.749637False
10.0000000.000000NaN0.000000False
20.0000000.000000NaN0.000000False
30.28755145.276809NaN0.461324False
467.407450363945.055626280.02.958509True
..................
523870.0000000.000000NaN0.000000False
523880.0000000.000000NaN0.000000False
523890.0000000.000000NaN0.000000False
523900.0000000.000000NaN0.000000False
523910.0000000.000000NaN0.000000False
\n", "

52392 rows × 5 columns

\n", "
" ], "text/plain": [ " means variances highly_variable_rank variances_norm \\\n", "soma_joinid \n", "0 0.230445 116.044863 NaN 1.749637 \n", "1 0.000000 0.000000 NaN 0.000000 \n", "2 0.000000 0.000000 NaN 0.000000 \n", "3 0.287551 45.276809 NaN 0.461324 \n", "4 67.407450 363945.055626 280.0 2.958509 \n", "... ... ... ... ... \n", "52387 0.000000 0.000000 NaN 0.000000 \n", "52388 0.000000 0.000000 NaN 0.000000 \n", "52389 0.000000 0.000000 NaN 0.000000 \n", "52390 0.000000 0.000000 NaN 0.000000 \n", "52391 0.000000 0.000000 NaN 0.000000 \n", "\n", " highly_variable \n", "soma_joinid \n", "0 False \n", "1 False \n", "2 False \n", "3 False \n", "4 True \n", "... ... \n", "52387 False \n", "52388 False \n", "52389 False \n", "52390 False \n", "52391 False \n", "\n", "[52392 rows x 5 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with cellxgene_census.open_soma(census_version=\"stable\") as census:\n", " hvgs_df = get_highly_variable_genes(\n", " census,\n", " organism=\"mus_musculus\",\n", " n_top_genes=500,\n", " obs_value_filter=\"\"\"is_primary_data == True and tissue_general == 'heart'\"\"\",\n", " )\n", "\n", " # while the Census is open, also grab the var dataframe for the mouse\n", " var_df = census[\"census_data\"][\"mus_musculus\"].ms[\"RNA\"].var.read().concat().to_pandas()\n", "\n", "hvgs_df" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Concat the two dataframes for convenience:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:32:31.120167Z", "iopub.status.busy": "2023-07-28T16:32:31.119704Z", "iopub.status.idle": "2023-07-28T16:32:31.135930Z", "shell.execute_reply": "2023-07-28T16:32:31.135426Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
feature_idfeature_namefeature_lengthmeansvarianceshighly_variable_rankvariances_normhighly_variable
soma_joinid
0ENSMUSG00000051951Xkr460940.230445116.044863NaN1.749637False
1ENSMUSG00000089699Gm19922500.0000000.000000NaN0.000000False
2ENSMUSG00000102343Gm3738113640.0000000.000000NaN0.000000False
3ENSMUSG00000025900Rp1123110.28755145.276809NaN0.461324False
4ENSMUSG00000025902Sox17477267.407450363945.055626280.02.958509True
...........................
52387ENSMUSG00000081591Btf3-ps94960.0000000.000000NaN0.000000False
52388ENSMUSG00000118710mmu-mir-467a-3_ENSMUSG00000118710830.0000000.000000NaN0.000000False
52389ENSMUSG00000119584Rn18s18490.0000000.000000NaN0.000000False
52390ENSMUSG00000118538Gm182189700.0000000.000000NaN0.000000False
52391ENSMUSG00000084217Setd9-ps6700.0000000.000000NaN0.000000False
\n", "

52392 rows × 8 columns

\n", "
" ], "text/plain": [ " feature_id feature_name \\\n", "soma_joinid \n", "0 ENSMUSG00000051951 Xkr4 \n", "1 ENSMUSG00000089699 Gm1992 \n", "2 ENSMUSG00000102343 Gm37381 \n", "3 ENSMUSG00000025900 Rp1 \n", "4 ENSMUSG00000025902 Sox17 \n", "... ... ... \n", "52387 ENSMUSG00000081591 Btf3-ps9 \n", "52388 ENSMUSG00000118710 mmu-mir-467a-3_ENSMUSG00000118710 \n", "52389 ENSMUSG00000119584 Rn18s \n", "52390 ENSMUSG00000118538 Gm18218 \n", "52391 ENSMUSG00000084217 Setd9-ps \n", "\n", " feature_length means variances highly_variable_rank \\\n", "soma_joinid \n", "0 6094 0.230445 116.044863 NaN \n", "1 250 0.000000 0.000000 NaN \n", "2 1364 0.000000 0.000000 NaN \n", "3 12311 0.287551 45.276809 NaN \n", "4 4772 67.407450 363945.055626 280.0 \n", "... ... ... ... ... \n", "52387 496 0.000000 0.000000 NaN \n", "52388 83 0.000000 0.000000 NaN \n", "52389 1849 0.000000 0.000000 NaN \n", "52390 970 0.000000 0.000000 NaN \n", "52391 670 0.000000 0.000000 NaN \n", "\n", " variances_norm highly_variable \n", "soma_joinid \n", "0 1.749637 False \n", "1 0.000000 False \n", "2 0.000000 False \n", "3 0.461324 False \n", "4 2.958509 True \n", "... ... ... \n", "52387 0.000000 False \n", "52388 0.000000 False \n", "52389 0.000000 False \n", "52390 0.000000 False \n", "52391 0.000000 False \n", "\n", "[52392 rows x 8 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined_df = pd.concat([var_df.set_index(\"soma_joinid\"), hvgs_df], axis=1)\n", "combined_df" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Select _only_ the highly_variable genes by using the `highly_variable` column value:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:32:31.138459Z", "iopub.status.busy": "2023-07-28T16:32:31.137972Z", "iopub.status.idle": "2023-07-28T16:32:31.148883Z", "shell.execute_reply": "2023-07-28T16:32:31.148411Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
feature_idfeature_namefeature_lengthmeansvarianceshighly_variable_rankvariances_normhighly_variable
soma_joinid
4ENSMUSG00000025902Sox17477267.407450363945.055626280.02.958509True
188ENSMUSG00000026117Zap7029925.40909114793.026717350.02.775560True
233ENSMUSG00000026073Il1r219084.76408541918.471500206.03.402176True
500ENSMUSG00000026185Igfbp5600643.234876314355.591239156.03.825651True
512ENSMUSG00000026180Cxcr230482.37939010491.033344173.03.640129True
...........................
30296ENSMUSG00000024803Ankrd1288638.548572274005.455137107.04.741864True
30313ENSMUSG00000024987Cyp26a119832.18668612973.622003454.02.580162True
30379ENSMUSG00000018822Sfrp519002.92785310943.645525410.02.637004True
32042ENSMUSG00000031838Ifi3098091.676950995276.564962128.04.205886True
33314ENSMUSG00000092572Serpinb1034900.264085227.239812487.02.535469True
\n", "

500 rows × 8 columns

\n", "
" ], "text/plain": [ " feature_id feature_name feature_length means \\\n", "soma_joinid \n", "4 ENSMUSG00000025902 Sox17 4772 67.407450 \n", "188 ENSMUSG00000026117 Zap70 2992 5.409091 \n", "233 ENSMUSG00000026073 Il1r2 1908 4.764085 \n", "500 ENSMUSG00000026185 Igfbp5 6006 43.234876 \n", "512 ENSMUSG00000026180 Cxcr2 3048 2.379390 \n", "... ... ... ... ... \n", "30296 ENSMUSG00000024803 Ankrd1 2886 38.548572 \n", "30313 ENSMUSG00000024987 Cyp26a1 1983 2.186686 \n", "30379 ENSMUSG00000018822 Sfrp5 1900 2.927853 \n", "32042 ENSMUSG00000031838 Ifi30 980 91.676950 \n", "33314 ENSMUSG00000092572 Serpinb10 3490 0.264085 \n", "\n", " variances highly_variable_rank variances_norm \\\n", "soma_joinid \n", "4 363945.055626 280.0 2.958509 \n", "188 14793.026717 350.0 2.775560 \n", "233 41918.471500 206.0 3.402176 \n", "500 314355.591239 156.0 3.825651 \n", "512 10491.033344 173.0 3.640129 \n", "... ... ... ... \n", "30296 274005.455137 107.0 4.741864 \n", "30313 12973.622003 454.0 2.580162 \n", "30379 10943.645525 410.0 2.637004 \n", "32042 995276.564962 128.0 4.205886 \n", "33314 227.239812 487.0 2.535469 \n", "\n", " highly_variable \n", "soma_joinid \n", "4 True \n", "188 True \n", "233 True \n", "500 True \n", "512 True \n", "... ... \n", "30296 True \n", "30313 True \n", "30379 True \n", "32042 True \n", "33314 True \n", "\n", "[500 rows x 8 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "combined_df[combined_df.highly_variable]" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## highly_variable_genes\n", "\n", "This API provides the same function as `get_highly_variable_genes`, but accepts any `tiledbsoma.ExperimentAxisQuery`. It is intended for more advanced users who wish to use create and manage their own queries." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:32:31.151131Z", "iopub.status.busy": "2023-07-28T16:32:31.150904Z", "iopub.status.idle": "2023-07-28T16:32:44.076355Z", "shell.execute_reply": "2023-07-28T16:32:44.075579Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
meansvarianceshighly_variable_rankvariances_normhighly_variable
soma_joinid
467.407450363945.055626280.02.958509True
1885.40909114793.026717350.02.775560True
2334.76408541918.471500206.03.402176True
50043.234876314355.591239156.03.825651True
5122.37939010491.033344173.03.640129True
..................
3029638.548572274005.455137107.04.741864True
303132.18668612973.622003454.02.580162True
303792.92785310943.645525410.02.637004True
3204291.676950995276.564962128.04.205886True
333140.264085227.239812487.02.535469True
\n", "

500 rows × 5 columns

\n", "
" ], "text/plain": [ " means variances highly_variable_rank variances_norm \\\n", "soma_joinid \n", "4 67.407450 363945.055626 280.0 2.958509 \n", "188 5.409091 14793.026717 350.0 2.775560 \n", "233 4.764085 41918.471500 206.0 3.402176 \n", "500 43.234876 314355.591239 156.0 3.825651 \n", "512 2.379390 10491.033344 173.0 3.640129 \n", "... ... ... ... ... \n", "30296 38.548572 274005.455137 107.0 4.741864 \n", "30313 2.186686 12973.622003 454.0 2.580162 \n", "30379 2.927853 10943.645525 410.0 2.637004 \n", "32042 91.676950 995276.564962 128.0 4.205886 \n", "33314 0.264085 227.239812 487.0 2.535469 \n", "\n", " highly_variable \n", "soma_joinid \n", "4 True \n", "188 True \n", "233 True \n", "500 True \n", "512 True \n", "... ... \n", "30296 True \n", "30313 True \n", "30379 True \n", "32042 True \n", "33314 True \n", "\n", "[500 rows x 5 columns]" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "with cellxgene_census.open_soma(census_version=\"stable\") as census:\n", " experiment = census[\"census_data\"][\"mus_musculus\"]\n", " with experiment.axis_query(\n", " measurement_name=\"RNA\",\n", " obs_query=soma.AxisQuery(value_filter=\"\"\"is_primary_data == True and tissue_general == 'heart'\"\"\"),\n", " ) as query:\n", " hvgs_df = highly_variable_genes(query, n_top_genes=500)\n", "\n", "hvgs_df[hvgs_df.highly_variable]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" } }, "nbformat": 4, "nbformat_minor": 2 }