{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Learning about the CZ CELLxGENE Census\n", "\n", "This notebook showcases the Census contents and how to obtain high-level information about it. It covers the organization of data within the Census, what cell and gene metadata are available, and provides simple demonstrations of how to summarize cell counts across cell metadata.\n", "\n", "**Contents**\n", "\n", "- Opening the Census\n", "- Census organization\n", "- Cell metadata\n", "- Gene metadata\n", "- Census summary content tables\n", "- Understanding Census contents beyond the summary tables\n", "\n", "⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable `is_primary_data` which is described in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#repeated-data).\n", "\n", "## Opening the Census\n", "\n", "The `cellxgene_census` Python package contains a convenient API to open the latest version of the Census. If you open the Census, you should close it. `open_soma()` returns a context manager, so you can open and close it in several ways, much like a Python file handle. 
The context manager is preferred, as it closes the Census automatically, even if an error is raised.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import cellxgene_census" ] }, { "cell_type": "raw", "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:06.960041Z", "iopub.status.busy": "2023-07-28T14:20:06.959467Z", "iopub.status.idle": "2023-07-28T14:20:10.170466Z", "shell.execute_reply": "2023-07-28T14:20:10.169835Z" } }, "source": [ "# Preferred: use a Python context manager\n", "with cellxgene_census.open_soma() as census:\n", "    ...\n", "\n", "# or\n", "census = cellxgene_census.open_soma()\n", "...\n", "census.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can learn more about the `cellxgene_census` methods by accessing their corresponding documentation via `help()`. For example, `help(cellxgene_census.open_soma)`.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "census = cellxgene_census.open_soma(census_version=\"2025-11-08\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Census organization\n", "\n", "The [Census schema](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_schema.html) defines the structure of the Census. In short, you can think of the Census as a structured collection of items that stores different pieces of information. All of these items and the parent collection are SOMA objects of various types and can all be accessed with the [TileDB-SOMA API](https://github.com/single-cell-data/TileDB-SOMA) ([documentation](https://tiledbsoma.readthedocs.io/en/latest/)).\n", "\n", "\n", "The `cellxgene_census` package contains some convenient wrappers of the `TileDB-SOMA` API. 
An example of this is the function we used to open the Census: `cellxgene_census.open_soma()`.\n", "\n", "### Main Census components\n", "\n", "With the command above you created `census`, which is a `SOMACollection`. It is analogous to a Python dictionary, and it has two items: `census_info` and `census_data`.\n", "\n", "#### Census summary info\n", "\n", "- `census[\"census_info\"]`: A collection of tables providing information about the Census as a whole.\n", "  - `census[\"census_info\"][\"summary\"]`: A data frame with high-level information about this Census, e.g. build date, total cell count, etc.\n", "  - `census[\"census_info\"][\"datasets\"]`: A data frame with all datasets from [CELLxGENE Discover](https://cellxgene.cziscience.com/) used to create the Census.\n", "  - `census[\"census_info\"][\"summary_cell_counts\"]`: A data frame with cell counts stratified by **relevant** cell metadata.\n", "\n", "#### Census data\n", "\n", "Data for each organism is stored in independent `SOMAExperiment` objects which are a specialized form of a `SOMACollection`. 
Each of these stores a data matrix (cells by genes), cell metadata, gene metadata, and some other useful components not covered in this notebook.\n", "\n", "This is how the data is organized for one organism, _Homo sapiens_:\n", "\n", "- `census[\"census_data\"][\"homo_sapiens\"].obs`: Cell metadata\n", "- `census[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].X`: Data matrices, currently only raw counts exist `X[\"raw\"]`\n", "- `census[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].var`: Gene metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cell metadata\n", "\n", "You can obtain all cell metadata variables by directly querying the columns of the corresponding `SOMADataFrame`.\n", "\n", "All of these variables can be used for querying the Census in case you want to work with specific cells.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "['soma_joinid',\n", " 'dataset_id',\n", " 'assay',\n", " 'assay_ontology_term_id',\n", " 'cell_type',\n", " 'cell_type_ontology_term_id',\n", " 'development_stage',\n", " 'development_stage_ontology_term_id',\n", " 'disease',\n", " 'disease_ontology_term_id',\n", " 'donor_id',\n", " 'is_primary_data',\n", " 'observation_joinid',\n", " 'self_reported_ethnicity',\n", " 'self_reported_ethnicity_ontology_term_id',\n", " 'sex',\n", " 'sex_ontology_term_id',\n", " 'suspension_type',\n", " 'tissue',\n", " 'tissue_ontology_term_id',\n", " 'tissue_type',\n", " 'tissue_general',\n", " 'tissue_general_ontology_term_id',\n", " 'raw_sum',\n", " 'nnz',\n", " 'raw_mean_nnz',\n", " 'raw_variance_nnz',\n", " 'n_measured_vars']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "keys = list(census[\"census_data\"][\"homo_sapiens\"].obs.keys())\n", "\n", "keys" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of these variables are defined in the [CELLxGENE dataset 
schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#obs-cell-metadata) except for the following:\n", "\n", "- `soma_joinid`: a SOMA-defined value used for join operations.\n", "- `dataset_id`: the dataset id as encoded in `census[\"census_info\"][\"datasets\"]`.\n", "- `tissue_general` and `tissue_general_ontology_term_id`: the high-level tissue mapping.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gene metadata\n", "\n", "Similarly, we can obtain all gene metadata variables by directly querying the columns of the corresponding `SOMADataFrame`.\n", "\n", "These are the variables you can use for querying the Census in case there are specific genes you are interested in.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['soma_joinid',\n", " 'feature_id',\n", " 'feature_name',\n", " 'feature_type',\n", " 'feature_length',\n", " 'nnz',\n", " 'n_measured_obs']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "keys = list(census[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].var.keys())\n", "\n", "keys" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of these variables are defined in the [CELLxGENE dataset schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#var-and-rawvar-gene-metadata) except for the following:\n", "\n", "- `soma_joinid`: a SOMA-defined value used for join operations.\n", "- `feature_length`: the length in base pairs of the gene.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>soma_joinid</th><th>label</th><th>value</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>census_schema_version</td><td>2.4.0</td></tr>\n",
"<tr><th>1</th><td>1</td><td>census_build_date</td><td>2025-11-08</td></tr>\n",
"<tr><th>2</th><td>2</td><td>dataset_schema_version</td><td>7.0.0</td></tr>\n",
"<tr><th>3</th><td>3</td><td>total_cell_count</td><td>217768036</td></tr>\n",
"<tr><th>4</th><td>4</td><td>unique_cell_count</td><td>125463259</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " soma_joinid label value\n", "0 0 census_schema_version 2.4.0\n", "1 1 census_build_date 2025-11-08\n", "2 2 dataset_schema_version 7.0.0\n", "3 3 total_cell_count 217768036\n", "4 4 unique_cell_count 125463259" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "census_info = census[\"census_info\"][\"summary\"].read().concat().to_pandas()\n", "\n", "census_info" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Census summary content tables\n", "\n", "You can take a quick look at the high-level Census information by looking at `census[\"census_info\"][\"summary\"]`.\n", "\n", "Of special interest are the `label`-`value` combinations for:\n", "\n", "- `total_cell_count` is the total number of cells in the Census.\n", "- `unique_cell_count` is the number of unique cells, as some cells may be present more than once due to meta-analysis or consortia-like data.\n", "\n", "### Cell counts by cell metadata\n", "\n", "By looking at `census[\"census_info\"][\"summary_cell_counts\"]` you can get a general idea of cell counts stratified by **some relevant** cell metadata. Not all cell metadata is included in this table; all available cell and gene metadata are listed in the sections \"Cell metadata\" and \"Gene metadata\" above.\n", "\n", "The line below retrieves this table and casts it into a `pandas.DataFrame`.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>soma_joinid</th><th>organism</th><th>category</th><th>label</th><th>ontology_term_id</th><th>total_cell_count</th><th>unique_cell_count</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>0</td><td>callithrix_jacchus</td><td>all</td><td>na</td><td>na</td><td>2275451</td><td>1712738</td></tr>\n",
"<tr><th>1</th><td>1</td><td>callithrix_jacchus</td><td>assay</td><td>10x 3' v3</td><td>EFO:0009922</td><td>2275451</td><td>1712738</td></tr>\n",
"<tr><th>2</th><td>2</td><td>callithrix_jacchus</td><td>cell_type</td><td>ependymal cell</td><td>CL:0000065</td><td>19113</td><td>19113</td></tr>\n",
"<tr><th>3</th><td>3</td><td>callithrix_jacchus</td><td>cell_type</td><td>T cell</td><td>CL:0000084</td><td>113</td><td>113</td></tr>\n",
"<tr><th>4</th><td>4</td><td>callithrix_jacchus</td><td>cell_type</td><td>endothelial cell</td><td>CL:0000115</td><td>42093</td><td>41320</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>2615</th><td>2615</td><td>pan_troglodytes</td><td>sex</td><td>female</td><td>PATO:0000383</td><td>78086</td><td>78086</td></tr>\n",
"<tr><th>2616</th><td>2616</td><td>pan_troglodytes</td><td>sex</td><td>male</td><td>PATO:0000384</td><td>80013</td><td>80013</td></tr>\n",
"<tr><th>2617</th><td>2617</td><td>pan_troglodytes</td><td>suspension_type</td><td>nucleus</td><td>na</td><td>158099</td><td>158099</td></tr>\n",
"<tr><th>2618</th><td>2618</td><td>pan_troglodytes</td><td>tissue</td><td>dorsolateral prefrontal cortex</td><td>UBERON:0009834</td><td>158099</td><td>158099</td></tr>\n",
"<tr><th>2619</th><td>2619</td><td>pan_troglodytes</td><td>tissue_general</td><td>brain</td><td>UBERON:0000955</td><td>158099</td><td>158099</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>2620 rows × 7 columns</p>\n",
"</div>
" ], "text/plain": [ " soma_joinid organism category \\\n", "0 0 callithrix_jacchus all \n", "1 1 callithrix_jacchus assay \n", "2 2 callithrix_jacchus cell_type \n", "3 3 callithrix_jacchus cell_type \n", "4 4 callithrix_jacchus cell_type \n", "... ... ... ... \n", "2615 2615 pan_troglodytes sex \n", "2616 2616 pan_troglodytes sex \n", "2617 2617 pan_troglodytes suspension_type \n", "2618 2618 pan_troglodytes tissue \n", "2619 2619 pan_troglodytes tissue_general \n", "\n", " label ontology_term_id total_cell_count \\\n", "0 na na 2275451 \n", "1 10x 3' v3 EFO:0009922 2275451 \n", "2 ependymal cell CL:0000065 19113 \n", "3 T cell CL:0000084 113 \n", "4 endothelial cell CL:0000115 42093 \n", "... ... ... ... \n", "2615 female PATO:0000383 78086 \n", "2616 male PATO:0000384 80013 \n", "2617 nucleus na 158099 \n", "2618 dorsolateral prefrontal cortex UBERON:0009834 158099 \n", "2619 brain UBERON:0000955 158099 \n", "\n", " unique_cell_count \n", "0 1712738 \n", "1 1712738 \n", "2 19113 \n", "3 113 \n", "4 41320 \n", "... ... \n", "2615 78086 \n", "2616 80013 \n", "2617 158099 \n", "2618 158099 \n", "2619 158099 \n", "\n", "[2620 rows x 7 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "census_counts = census[\"census_info\"][\"summary_cell_counts\"].read().concat().to_pandas()\n", "\n", "census_counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each combination of `organism` and values for each `category` of cell metadata you can take a look at `total_cell_count` and `unique_cell_count` for the cell counts of that combination.\n", "\n", "The values for each `category` are specified in `ontology_term_id` and `label`, which are the value's IDs and labels, respectively.\n", "\n", "#### Example: cell metadata included in the summary counts table\n", "\n", "To get all the available cell metadata in the summary counts table you can do the following. 
Remember this is not all the cell metadata available, as some variables were omitted in the creation of this table.\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "organism category \n", "callithrix_jacchus all 1\n", " assay 1\n", " cell_type 40\n", " disease 1\n", " self_reported_ethnicity 1\n", " sex 2\n", " suspension_type 1\n", " tissue 33\n", " tissue_general 1\n", "homo_sapiens all 1\n", " assay 39\n", " cell_type 903\n", " disease 261\n", " self_reported_ethnicity 37\n", " sex 3\n", " suspension_type 1\n", " tissue 423\n", " tissue_general 71\n", "macaca_mulatta all 1\n", " assay 2\n", " cell_type 54\n", " disease 1\n", " self_reported_ethnicity 1\n", " sex 3\n", " suspension_type 1\n", " tissue 29\n", " tissue_general 2\n", "mus_musculus all 1\n", " assay 18\n", " cell_type 492\n", " disease 18\n", " self_reported_ethnicity 1\n", " sex 3\n", " suspension_type 1\n", " tissue 102\n", " tissue_general 36\n", "pan_troglodytes all 1\n", " assay 1\n", " cell_type 25\n", " disease 1\n", " self_reported_ethnicity 1\n", " sex 2\n", " suspension_type 1\n", " tissue 1\n", " tissue_general 1\n", "Name: count, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "census_counts[[\"organism\", \"category\"]].value_counts(sort=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example: cell counts for each sequencing assay in human data\n", "\n", "To get the cell counts for each sequencing assay type in human data, you can perform the following `pandas.DataFrame` operations:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The `organism` column stores values such as 'homo_sapiens', not 'Homo sapiens'.\n", "census_human_assays = census_counts.query(\"organism == 'homo_sapiens' and category == 'assay'\")\n", "census_human_assays.sort_values(\"total_cell_count\", ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example: number of microglial cells in the Census\n", "\n", "If you have a specific term from any of the categories shown above you can directly find out the number of cells for that term.\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>soma_joinid</th><th>organism</th><th>category</th><th>label</th><th>ontology_term_id</th><th>total_cell_count</th><th>unique_cell_count</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>7</th><td>7</td><td>callithrix_jacchus</td><td>cell_type</td><td>microglial cell</td><td>CL:0000129</td><td>65313</td><td>57904</td></tr>\n",
"<tr><th>182</th><td>182</td><td>homo_sapiens</td><td>cell_type</td><td>microglial cell</td><td>CL:0000129</td><td>1183509</td><td>910878</td></tr>\n",
"<tr><th>1830</th><td>1830</td><td>macaca_mulatta</td><td>cell_type</td><td>microglial cell</td><td>CL:0000129</td><td>129589</td><td>55858</td></tr>\n",
"<tr><th>1976</th><td>1976</td><td>mus_musculus</td><td>cell_type</td><td>microglial cell</td><td>CL:0000129</td><td>144763</td><td>100961</td></tr>\n",
"<tr><th>2592</th><td>2592</td><td>pan_troglodytes</td><td>cell_type</td><td>microglial cell</td><td>CL:0000129</td><td>5748</td><td>5748</td></tr>\n",
"</tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " soma_joinid organism category label \\\n", "7 7 callithrix_jacchus cell_type microglial cell \n", "182 182 homo_sapiens cell_type microglial cell \n", "1830 1830 macaca_mulatta cell_type microglial cell \n", "1976 1976 mus_musculus cell_type microglial cell \n", "2592 2592 pan_troglodytes cell_type microglial cell \n", "\n", " ontology_term_id total_cell_count unique_cell_count \n", "7 CL:0000129 65313 57904 \n", "182 CL:0000129 1183509 910878 \n", "1830 CL:0000129 129589 55858 \n", "1976 CL:0000129 144763 100961 \n", "2592 CL:0000129 5748 5748 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "census_counts.query(\"label == 'microglial cell'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Understanding Census contents beyond the summary tables\n", "\n", "While using the pre-computed tables in `census[\"census_info\"]` is an easy and quick way to understand the contents of the Census, it falls short if you want to learn more about certain slices of the Census.\n", "\n", "For example, you may want to learn more about:\n", "\n", "- What are the cell types available for human liver?\n", "- What is the total number of cells in all lung datasets stratified by sequencing technology?\n", "- What is the sex distribution of all cells from brain in mouse?\n", "- What are the diseases available for T cells?\n", "\n", "All of these questions can be answered by directly querying the cell metadata as shown in the examples below.\n", "\n", "### Example: all cell types available in human\n", "\n", "To exemplify the process of accessing and slicing cell metadata for summary stats, let's start with a trivial example and take a look at all human cell types available in the Census:\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th></th><th>cell_type</th><th>is_primary_data</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>endothelial cell</td><td>False</td></tr>\n",
"<tr><th>1</th><td>malignant cell</td><td>False</td></tr>\n",
"<tr><th>2</th><td>fibroblast</td><td>False</td></tr>\n",
"<tr><th>3</th><td>fibroblast</td><td>False</td></tr>\n",
"<tr><th>4</th><td>macrophage</td><td>False</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td></tr>\n",
"<tr><th>158982714</th><td>pvalb GABAergic cortical interneuron</td><td>True</td></tr>\n",
"<tr><th>158982715</th><td>VIP GABAergic cortical interneuron</td><td>True</td></tr>\n",
"<tr><th>158982716</th><td>L2/3-6 intratelencephalic projecting glutamate...</td><td>True</td></tr>\n",
"<tr><th>158982717</th><td>astrocyte of the cerebral cortex</td><td>True</td></tr>\n",
"<tr><th>158982718</th><td>sst GABAergic cortical interneuron</td><td>True</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>158982719 rows × 2 columns</p>\n",
"</div>
" ], "text/plain": [ " cell_type is_primary_data\n", "0 endothelial cell False\n", "1 malignant cell False\n", "2 fibroblast False\n", "3 fibroblast False\n", "4 macrophage False\n", "... ... ...\n", "158982714 pvalb GABAergic cortical interneuron True\n", "158982715 VIP GABAergic cortical interneuron True\n", "158982716 L2/3-6 intratelencephalic projecting glutamate... True\n", "158982717 astrocyte of the cerebral cortex True\n", "158982718 sst GABAergic cortical interneuron True\n", "\n", "[158982719 rows x 2 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "human_cell_types = (\n", " census[\"census_data\"][\"homo_sapiens\"].obs.read(column_names=[\"cell_type\", \"is_primary_data\"]).concat().to_pandas()\n", ")\n", "human_cell_types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of rows is the total number of cells for humans. Now, if you wish to get the cell counts per cell type we can perform some `pandas` operations on this object.\n", "\n", "In addition, we will only focus on cells that are marked with `is_primary_data=True` as this ensures we de-duplicate cells that appear more than once in CELLxGENE Discover.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(96591226, 1)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "human_cell_types = (\n", " census[\"census_data\"][\"homo_sapiens\"]\n", " .obs.read(column_names=[\"cell_type\"], value_filter=\"is_primary_data == True\")\n", " .concat()\n", " .to_pandas()\n", ")\n", "\n", "human_cell_types = human_cell_types[[\"cell_type\"]]\n", "human_cell_types.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the number of unique cells. 
Now let's look at the counts per cell type:\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cell_type \n", "oligodendrocyte 5705502\n", "neuron 3858369\n", "naive thymus-derived CD4-positive, alpha-beta T cell 3847813\n", "fibroblast 2663513\n", "glutamatergic neuron 2539819\n", " ... \n", "effector T cell 0\n", "A2 amacrine cell 0\n", "OFF retinal ganglion cell 0\n", "type II NK T cell 0\n", "CD38-negative naive B cell 0\n", "Name: count, Length: 898, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "human_cell_type_counts = human_cell_types.value_counts()\n", "human_cell_type_counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows you that the most abundant cell types are \"oligodendrocyte\", \"neuron\", and \"naive thymus-derived CD4-positive, alpha-beta T cell\".\n", "\n", "Now let's take a look at the number of unique cell types:\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(898,)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "human_cell_type_counts.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is the number of distinct cell type labels for human. Note that it also includes the labels listed above with a count of 0, which pandas keeps when counting a categorical column.\n", "\n", "All the information in this example can be quickly obtained from the summary table at `census[\"census_info\"][\"summary_cell_counts\"]`.\n", "\n", "The examples below are more complex and can only be achieved by accessing the cell metadata.\n", "\n", "### Example: cell types available in human liver\n", "\n", "Similar to the example above, we can learn what cell types are available for a specific tissue, e.g. liver.\n", "\n", "To achieve this goal we just need to limit our cell metadata to that tissue. We will use the information in the cell metadata variable `tissue_general`. 
This variable contains the high-level tissue label for all cells in the Census:\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "cell_type\n", "malignant cell 196802\n", "T cell 160708\n", "hepatocyte 112485\n", "macrophage 109647\n", "periportal region hepatocyte 90251\n", " ... \n", "epithelial cell of pancreas 0\n", "epithelial cell of prostate 0\n", "epithelial cell of proximal tubule 0\n", "epithelial cell of proximal tubule segment 1 0\n", "ependymal cell 0\n", "Name: count, Length: 898, dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "human_liver_cell_types = (\n", " census[\"census_data\"][\"homo_sapiens\"]\n", " .obs.read(column_names=[\"cell_type\"], value_filter=\"is_primary_data == True and tissue_general == 'liver'\")\n", " .concat()\n", " .to_pandas()\n", ")\n", "\n", "human_liver_cell_types[\"cell_type\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the cell types and their cell counts in the human liver.\n", "\n", "### Example: diseased T cells in human tissues\n", "\n", "In this example we are going to get the counts for all diseased cells annotated as T cells. For the sake of the example we will focus on \"CD8-positive, alpha-beta T cell\" and \"CD4-positive, alpha-beta T cell\":\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "disease tissue_general \n", "B-cell non-Hodgkin lymphoma lymph node 232979\n", "COVID-19 blood 834850\n", " digestive system 626\n", " lung 71204\n", " nose 13\n", " ... 
\n", "rheumatoid arthritis blood 242\n", "squamous cell lung carcinoma lung 49279\n", " lymph node 100\n", "systemic lupus erythematosus blood 355471\n", "triple-negative breast carcinoma exocrine gland 2003\n", "Name: count, Length: 63, dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t_cells_list = [\"CD8-positive, alpha-beta T cell\", \"CD4-positive, alpha-beta T cell\"]\n", "\n", "t_cells_diseased = (\n", " census[\"census_data\"][\"homo_sapiens\"]\n", " .obs.read(\n", " column_names=[\"disease\", \"tissue_general\"],\n", " value_filter=f\"is_primary_data == True and cell_type in {t_cells_list} and disease != 'normal'\",\n", " )\n", " .concat()\n", " .to_pandas()\n", ")\n", "\n", "t_cells_diseased = t_cells_diseased[[\"disease\", \"tissue_general\"]].value_counts(sort=False)\n", "t_cells_diseased" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the cell counts annotated with the indicated disease across human tissues for \"CD8-positive, alpha-beta T cell\" or \"CD4-positive, alpha-beta T cell\".\n", "\n", "NOTE: In Census 2025-11-08 and later (CELLxGENE schema 7.0.0 and above), a subset of datasets encode multiple values in the `disease` field delimited by `' || '`. 
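One possible way to handle such values with pandas is to split each entry on the delimiter and give every disease its own row before counting. The sketch below runs on a toy frame with invented values (`toy_obs` is hypothetical and is not Census data) and assumes the delimiter is exactly `' || '`.

```python
import pandas as pd

# Toy stand-in for an obs query result; the values are invented for illustration.
toy_obs = pd.DataFrame(
    {
        'disease': ['COVID-19', 'COVID-19 || pneumonia', 'pneumonia'],
        'tissue_general': ['lung', 'lung', 'lung'],
    }
)

# Cast to str first, since the real column may be categorical, then split
# on the literal delimiter (regex=False keeps '|' from acting as regex alternation).
disease_lists = toy_obs['disease'].astype(str).str.split(' || ', regex=False)

# Emit one row per (cell, disease) pair.
split_obs = toy_obs.assign(disease=disease_lists).explode('disease')

disease_counts = split_obs[['disease', 'tissue_general']].value_counts(sort=False)
print(disease_counts)
```

With this approach a cell annotated with two diseases contributes one count to each of them, so the per-disease counts can add up to more than the number of cells.
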
If our query touched such datasets, then we'd want to handle the `disease` field appropriately.\n", "\n", "And, don't forget to close the census!\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "census.close()\n", "del census" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.9" }, "vscode": { "interpreter": { "hash": "3da8ec1c162cd849e59e6ea2824b2e353dce799884e910aae99411be5277f953" } } }, "nbformat": 4, "nbformat_minor": 4 }