{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Learning about the CZ CELLxGENE Census\n", "\n", "This notebook showcases the Census contents and how to obtain high-level information about it. It covers the organization of data within the Census, what cell and gene metadata are available, and provides simple demonstrations of how to summarize cell counts across cell metadata.\n", "\n", "**Contents**\n", "\n", "- Opening the Census\n", "- Census organization\n", "- Cell metadata\n", "- Gene metadata\n", "- Census summary content tables\n", "- Understanding Census contents beyond the summary tables\n", "\n", "⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable `is_primary_data`, which is described in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#repeated-data).\n", "\n", "## Opening the Census\n", "\n", "The `cellxgene_census` Python package contains a convenient API to open the latest version of the Census. If you open the Census, you should close it. `open_soma()` returns an object that can be used as a context manager, so you can open and close it in several ways, like a Python file handle. The `with` form is preferred because it closes the Census automatically, even if an error is raised.\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:06.960041Z", "iopub.status.busy": "2023-07-28T14:20:06.959467Z", "iopub.status.idle": "2023-07-28T14:20:10.170466Z", "shell.execute_reply": "2023-07-28T14:20:10.169835Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n", "The \"stable\" release is currently 2023-07-25. 
Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n" ] } ], "source": [ "import cellxgene_census\n", "\n", "# Preferred: use a Python context manager\n", "with cellxgene_census.open_soma() as census:\n", " ...\n", "\n", "# or\n", "census = cellxgene_census.open_soma()\n", "...\n", "census.close()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can learn more about the `cellxgene_census` methods by accessing their corresponding documentation via `help()`. For example `help(cellxgene_census.open_soma)`.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:10.173670Z", "iopub.status.busy": "2023-07-28T14:20:10.173047Z", "iopub.status.idle": "2023-07-28T14:20:10.494368Z", "shell.execute_reply": "2023-07-28T14:20:10.493750Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n" ] } ], "source": [ "census = cellxgene_census.open_soma()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Census organization\n", "\n", "The [Census schema](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_schema.html) defines the structure of the Census. In short, you can think of the Census as a structured collection of items that stores different pieces of information. All of these items and the parent collection are SOMA objects of various types and can all be accessed with the [TileDB-SOMA API](https://github.com/single-cell-data/TileDB-SOMA) ([documentation](https://tiledbsoma.readthedocs.io/en/latest/)).\n", "\n", "\n", "The `cellxgene_census` package contains some convenient wrappers of the `TileDB-SOMA` API. 
An example of this is the function we used to open the Census: `cellxgene_census.open_soma()`.\n", "\n", "### Main Census components\n", "\n", "With the command above you created `census`, which is a `SOMACollection`. It is analogous to a Python dictionary, and it has two items: `census_info` and `census_data`.\n", "\n", "#### Census summary info\n", "\n", "- `census[\"census_info\"]`: A collection of tables providing information about the Census as a whole.\n", "  - `census[\"census_info\"][\"summary\"]`: A data frame with high-level information about this Census, e.g. build date, total cell count, etc.\n", "  - `census[\"census_info\"][\"datasets\"]`: A data frame with all datasets from [CELLxGENE Discover](https://cellxgene.cziscience.com/) used to create the Census.\n", "  - `census[\"census_info\"][\"summary_cell_counts\"]`: A data frame with cell counts stratified by **relevant** cell metadata.\n", "\n", "#### Census data\n", "\n", "Data for each organism is stored in independent `SOMAExperiment` objects, which are a specialized form of a `SOMACollection`. 
Each of these stores a data matrix (cells by genes), cell metadata, gene metadata, and some other useful components not covered in this notebook.\n", "\n", "This is how the data is organized for one organism -- _Homo sapiens_:\n", "\n", "- `census[\"census_data\"][\"homo_sapiens\"].obs`: Cell metadata\n", "- `census[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].X`: Data matrices; currently only raw counts exist, in `X[\"raw\"]`\n", "- `census[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].var`: Gene metadata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Cell metadata\n", "\n", "You can obtain all cell metadata variables by directly querying the columns of the corresponding `SOMADataFrame`.\n", "\n", "All of these variables can be used for querying the Census in case you want to work with specific cells.\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:10.497463Z", "iopub.status.busy": "2023-07-28T14:20:10.496989Z", "iopub.status.idle": "2023-07-28T14:20:10.941903Z", "shell.execute_reply": "2023-07-28T14:20:10.941358Z" }, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "['soma_joinid',\n", " 'dataset_id',\n", " 'assay',\n", " 'assay_ontology_term_id',\n", " 'cell_type',\n", " 'cell_type_ontology_term_id',\n", " 'development_stage',\n", " 'development_stage_ontology_term_id',\n", " 'disease',\n", " 'disease_ontology_term_id',\n", " 'donor_id',\n", " 'is_primary_data',\n", " 'self_reported_ethnicity',\n", " 'self_reported_ethnicity_ontology_term_id',\n", " 'sex',\n", " 'sex_ontology_term_id',\n", " 'suspension_type',\n", " 'tissue',\n", " 'tissue_ontology_term_id',\n", " 'tissue_general',\n", " 'tissue_general_ontology_term_id']" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "keys = list(census[\"census_data\"][\"homo_sapiens\"].obs.keys())\n", "\n", "keys" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "All of these variables are defined in the [CELLxGENE dataset schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#obs-cell-metadata) except for the following:\n", "\n", "- `soma_joinid`: a SOMA-defined value used for join operations.\n", "- `dataset_id`: the dataset ID as encoded in `census[\"census_info\"][\"datasets\"]`.\n", "- `tissue_general` and `tissue_general_ontology_term_id`: the high-level tissue mapping.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Gene metadata\n", "\n", "Similarly, we can obtain all gene metadata variables by directly querying the columns of the corresponding `SOMADataFrame`.\n", "\n", "These are the variables you can use for querying the Census in case there are specific genes you are interested in.\n" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:10.944483Z", "iopub.status.busy": "2023-07-28T14:20:10.944219Z", "iopub.status.idle": "2023-07-28T14:20:11.225599Z", "shell.execute_reply": "2023-07-28T14:20:11.225072Z" } }, "outputs": [ { "data": { "text/plain": [ "['soma_joinid', 'feature_id', 'feature_name', 'feature_length']" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "keys = list(census[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].var.keys())\n", "\n", "keys" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All of these variables are defined in the [CELLxGENE dataset schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#var-and-rawvar-gene-metadata) except for the following:\n", "\n", "- `soma_joinid`: a SOMA-defined value used for join operations.\n", "- `feature_length`: the length in base pairs of the gene.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:11.228296Z", "iopub.status.busy": 
"2023-07-28T14:20:11.227881Z", "iopub.status.idle": "2023-07-28T14:20:11.719446Z", "shell.execute_reply": "2023-07-28T14:20:11.718694Z" }, "scrolled": false }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>soma_joinid</th><th>label</th><th>value</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>census_schema_version</td><td>1.0.0</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>census_build_date</td><td>2023-07-25</td></tr>\n",
"    <tr><th>2</th><td>2</td><td>dataset_schema_version</td><td>3.0.0</td></tr>\n",
"    <tr><th>3</th><td>3</td><td>total_cell_count</td><td>61656118</td></tr>\n",
"    <tr><th>4</th><td>4</td><td>unique_cell_count</td><td>37447773</td></tr>\n",
"    <tr><th>5</th><td>5</td><td>number_donors_homo_sapiens</td><td>13035</td></tr>\n",
"    <tr><th>6</th><td>6</td><td>number_donors_mus_musculus</td><td>1417</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " soma_joinid label value\n", "0 0 census_schema_version 1.0.0\n", "1 1 census_build_date 2023-07-25\n", "2 2 dataset_schema_version 3.0.0\n", "3 3 total_cell_count 61656118\n", "4 4 unique_cell_count 37447773\n", "5 5 number_donors_homo_sapiens 13035\n", "6 6 number_donors_mus_musculus 1417" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "census_info = census[\"census_info\"][\"summary\"].read().concat().to_pandas()\n", "\n", "census_info" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Census summary content tables\n", "\n", "You can take a quick look at the high-level Census information by looking at `census[\"census_info\"][\"summary\"]`.\n", "\n", "Of special interest are the `label`-`value` combinations for:\n", "\n", "- `total_cell_count` is the total number of cells in the Census.\n", "- `unique_cell_count` is the number of unique cells, as some cells may be present more than once due to meta-analysis or consortia-like data.\n", "- `number_donors_homo_sapiens` and `number_donors_mus_musculus` are the number of individuals for human and mouse. These are not guaranteed to be unique, as one individual ID may be present or identical in different datasets.\n", "\n", "### Cell counts by cell metadata\n", "\n", "By looking at `census[\"census_info\"][\"summary_cell_counts\"]` you can get a general idea of cell counts stratified by **some relevant** cell metadata. 
Not all cell metadata is included in this table; all available cell and gene metadata are listed in the \"Cell metadata\" and \"Gene metadata\" sections above.\n", "\n", "The line below retrieves this table and converts it into a `pandas.DataFrame`.\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:11.722195Z", "iopub.status.busy": "2023-07-28T14:20:11.721723Z", "iopub.status.idle": "2023-07-28T14:20:12.262344Z", "shell.execute_reply": "2023-07-28T14:20:12.261575Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>soma_joinid</th><th>organism</th><th>category</th><th>ontology_term_id</th><th>unique_cell_count</th><th>total_cell_count</th><th>label</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>Homo sapiens</td><td>all</td><td>na</td><td>33364242</td><td>56400873</td><td>na</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008722</td><td>264166</td><td>279635</td><td>Drop-seq</td></tr>\n",
"    <tr><th>2</th><td>2</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008780</td><td>25652</td><td>51304</td><td>inDrop</td></tr>\n",
"    <tr><th>3</th><td>3</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008919</td><td>89477</td><td>206754</td><td>Seq-Well</td></tr>\n",
"    <tr><th>4</th><td>4</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008931</td><td>78750</td><td>188248</td><td>Smart-seq2</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>1357</th><td>1357</td><td>Mus musculus</td><td>tissue_general</td><td>UBERON:0002113</td><td>179684</td><td>208324</td><td>kidney</td></tr>\n",
"    <tr><th>1358</th><td>1358</td><td>Mus musculus</td><td>tissue_general</td><td>UBERON:0002365</td><td>15577</td><td>31154</td><td>exocrine gland</td></tr>\n",
"    <tr><th>1359</th><td>1359</td><td>Mus musculus</td><td>tissue_general</td><td>UBERON:0002367</td><td>37715</td><td>130135</td><td>prostate gland</td></tr>\n",
"    <tr><th>1360</th><td>1360</td><td>Mus musculus</td><td>tissue_general</td><td>UBERON:0002368</td><td>13322</td><td>26644</td><td>endocrine gland</td></tr>\n",
"    <tr><th>1361</th><td>1361</td><td>Mus musculus</td><td>tissue_general</td><td>UBERON:0002371</td><td>90225</td><td>144962</td><td>bone marrow</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>1362 rows × 7 columns</p>\n",
"</div>
" ], "text/plain": [ " soma_joinid organism category ontology_term_id \\\n", "0 0 Homo sapiens all na \n", "1 1 Homo sapiens assay EFO:0008722 \n", "2 2 Homo sapiens assay EFO:0008780 \n", "3 3 Homo sapiens assay EFO:0008919 \n", "4 4 Homo sapiens assay EFO:0008931 \n", "... ... ... ... ... \n", "1357 1357 Mus musculus tissue_general UBERON:0002113 \n", "1358 1358 Mus musculus tissue_general UBERON:0002365 \n", "1359 1359 Mus musculus tissue_general UBERON:0002367 \n", "1360 1360 Mus musculus tissue_general UBERON:0002368 \n", "1361 1361 Mus musculus tissue_general UBERON:0002371 \n", "\n", " unique_cell_count total_cell_count label \n", "0 33364242 56400873 na \n", "1 264166 279635 Drop-seq \n", "2 25652 51304 inDrop \n", "3 89477 206754 Seq-Well \n", "4 78750 188248 Smart-seq2 \n", "... ... ... ... \n", "1357 179684 208324 kidney \n", "1358 15577 31154 exocrine gland \n", "1359 37715 130135 prostate gland \n", "1360 13322 26644 endocrine gland \n", "1361 90225 144962 bone marrow \n", "\n", "[1362 rows x 7 columns]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "census_counts = census[\"census_info\"][\"summary_cell_counts\"].read().concat().to_pandas()\n", "\n", "census_counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For each combination of `organism` and values for each `category` of cell metadata you can take a look at `total_cell_count` and `unique_cell_count` for the cell counts of that combination.\n", "\n", "The values for each `category` are specified in `ontology_term_id` and `label`, which are the value's IDs and labels, respectively.\n", "\n", "#### Example: cell metadata included in the summary counts table\n", "\n", "To get all the available cell metadata in the summary counts table you can do the following. 
Remember this is not all the cell metadata available, as some variables were omitted in the creation of this table.\n" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:12.264918Z", "iopub.status.busy": "2023-07-28T14:20:12.264659Z", "iopub.status.idle": "2023-07-28T14:20:12.271618Z", "shell.execute_reply": "2023-07-28T14:20:12.271084Z" } }, "outputs": [ { "data": { "text/plain": [ "organism category \n", "Homo sapiens all 1\n", " assay 19\n", " cell_type 613\n", " disease 64\n", " self_reported_ethnicity 26\n", " sex 3\n", " suspension_type 1\n", " tissue 220\n", " tissue_general 54\n", "Mus musculus all 1\n", " assay 9\n", " cell_type 248\n", " disease 5\n", " self_reported_ethnicity 1\n", " sex 3\n", " suspension_type 1\n", " tissue 66\n", " tissue_general 27\n", "Name: count, dtype: int64" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "census_counts[[\"organism\", \"category\"]].value_counts(sort=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example: cell counts for each sequencing assay in human data\n", "\n", "To get the cell counts for each sequencing assay type in human data, you can perform the following `pandas.DataFrame` operations:\n" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:12.273932Z", "iopub.status.busy": "2023-07-28T14:20:12.273685Z", "iopub.status.idle": "2023-07-28T14:20:12.284771Z", "shell.execute_reply": "2023-07-28T14:20:12.284296Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>soma_joinid</th><th>organism</th><th>category</th><th>ontology_term_id</th><th>unique_cell_count</th><th>total_cell_count</th><th>label</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>10</th><td>10</td><td>Homo sapiens</td><td>assay</td><td>EFO:0009922</td><td>11845077</td><td>25597563</td><td>10x 3' v3</td></tr>\n",
"    <tr><th>7</th><td>7</td><td>Homo sapiens</td><td>assay</td><td>EFO:0009899</td><td>7559102</td><td>12638794</td><td>10x 3' v2</td></tr>\n",
"    <tr><th>14</th><td>14</td><td>Homo sapiens</td><td>assay</td><td>EFO:0011025</td><td>3872375</td><td>6139786</td><td>10x 5' v1</td></tr>\n",
"    <tr><th>13</th><td>13</td><td>Homo sapiens</td><td>assay</td><td>EFO:0010550</td><td>4062980</td><td>5064268</td><td>sci-RNA-seq</td></tr>\n",
"    <tr><th>8</th><td>8</td><td>Homo sapiens</td><td>assay</td><td>EFO:0009900</td><td>2930054</td><td>3139770</td><td>10x 5' v2</td></tr>\n",
"    <tr><th>17</th><td>17</td><td>Homo sapiens</td><td>assay</td><td>EFO:0030004</td><td>915037</td><td>1084235</td><td>10x 5' transcription profiling</td></tr>\n",
"    <tr><th>16</th><td>16</td><td>Homo sapiens</td><td>assay</td><td>EFO:0030003</td><td>744798</td><td>811422</td><td>10x 3' transcription profiling</td></tr>\n",
"    <tr><th>15</th><td>15</td><td>Homo sapiens</td><td>assay</td><td>EFO:0030002</td><td>625175</td><td>642559</td><td>microwell-seq</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008722</td><td>264166</td><td>279635</td><td>Drop-seq</td></tr>\n",
"    <tr><th>3</th><td>3</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008919</td><td>89477</td><td>206754</td><td>Seq-Well</td></tr>\n",
"    <tr><th>4</th><td>4</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008931</td><td>78750</td><td>188248</td><td>Smart-seq2</td></tr>\n",
"    <tr><th>18</th><td>18</td><td>Homo sapiens</td><td>assay</td><td>EFO:0700003</td><td>146278</td><td>177276</td><td>BD Rhapsody Whole Transcriptome Analysis</td></tr>\n",
"    <tr><th>9</th><td>9</td><td>Homo sapiens</td><td>assay</td><td>EFO:0009901</td><td>42397</td><td>121394</td><td>10x 3' v1</td></tr>\n",
"    <tr><th>12</th><td>12</td><td>Homo sapiens</td><td>assay</td><td>EFO:0010183</td><td>58981</td><td>117962</td><td>single cell library construction</td></tr>\n",
"    <tr><th>19</th><td>19</td><td>Homo sapiens</td><td>assay</td><td>EFO:0700004</td><td>96145</td><td>96145</td><td>BD Rhapsody Targeted mRNA</td></tr>\n",
"    <tr><th>2</th><td>2</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008780</td><td>25652</td><td>51304</td><td>inDrop</td></tr>\n",
"    <tr><th>6</th><td>6</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008995</td><td>0</td><td>29128</td><td>10x technology</td></tr>\n",
"    <tr><th>5</th><td>5</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008953</td><td>4693</td><td>9386</td><td>STRT-seq</td></tr>\n",
"    <tr><th>11</th><td>11</td><td>Homo sapiens</td><td>assay</td><td>EFO:0010010</td><td>3105</td><td>5244</td><td>CEL-seq2</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " soma_joinid organism category ontology_term_id unique_cell_count \\\n", "10 10 Homo sapiens assay EFO:0009922 11845077 \n", "7 7 Homo sapiens assay EFO:0009899 7559102 \n", "14 14 Homo sapiens assay EFO:0011025 3872375 \n", "13 13 Homo sapiens assay EFO:0010550 4062980 \n", "8 8 Homo sapiens assay EFO:0009900 2930054 \n", "17 17 Homo sapiens assay EFO:0030004 915037 \n", "16 16 Homo sapiens assay EFO:0030003 744798 \n", "15 15 Homo sapiens assay EFO:0030002 625175 \n", "1 1 Homo sapiens assay EFO:0008722 264166 \n", "3 3 Homo sapiens assay EFO:0008919 89477 \n", "4 4 Homo sapiens assay EFO:0008931 78750 \n", "18 18 Homo sapiens assay EFO:0700003 146278 \n", "9 9 Homo sapiens assay EFO:0009901 42397 \n", "12 12 Homo sapiens assay EFO:0010183 58981 \n", "19 19 Homo sapiens assay EFO:0700004 96145 \n", "2 2 Homo sapiens assay EFO:0008780 25652 \n", "6 6 Homo sapiens assay EFO:0008995 0 \n", "5 5 Homo sapiens assay EFO:0008953 4693 \n", "11 11 Homo sapiens assay EFO:0010010 3105 \n", "\n", " total_cell_count label \n", "10 25597563 10x 3' v3 \n", "7 12638794 10x 3' v2 \n", "14 6139786 10x 5' v1 \n", "13 5064268 sci-RNA-seq \n", "8 3139770 10x 5' v2 \n", "17 1084235 10x 5' transcription profiling \n", "16 811422 10x 3' transcription profiling \n", "15 642559 microwell-seq \n", "1 279635 Drop-seq \n", "3 206754 Seq-Well \n", "4 188248 Smart-seq2 \n", "18 177276 BD Rhapsody Whole Transcriptome Analysis \n", "9 121394 10x 3' v1 \n", "12 117962 single cell library construction \n", "19 96145 BD Rhapsody Targeted mRNA \n", "2 51304 inDrop \n", "6 29128 10x technology \n", "5 9386 STRT-seq \n", "11 5244 CEL-seq2 " ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "census_human_assays = census_counts.query(\"organism == 'Homo sapiens' & category == 'assay'\")\n", "census_human_assays.sort_values(\"total_cell_count\", ascending=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Example: 
number of microglial cells in the Census\n", "\n", "If you have a specific term from any of the categories shown above you can directly find out the number of cells for that term.\n" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:12.287182Z", "iopub.status.busy": "2023-07-28T14:20:12.286720Z", "iopub.status.idle": "2023-07-28T14:20:12.294371Z", "shell.execute_reply": "2023-07-28T14:20:12.293903Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>soma_joinid</th><th>organism</th><th>category</th><th>ontology_term_id</th><th>unique_cell_count</th><th>total_cell_count</th><th>label</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>69</th><td>69</td><td>Homo sapiens</td><td>cell_type</td><td>CL:0000129</td><td>268114</td><td>370771</td><td>microglial cell</td></tr>\n",
"    <tr><th>1038</th><td>1038</td><td>Mus musculus</td><td>cell_type</td><td>CL:0000129</td><td>48998</td><td>62617</td><td>microglial cell</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " soma_joinid organism category ontology_term_id \\\n", "69 69 Homo sapiens cell_type CL:0000129 \n", "1038 1038 Mus musculus cell_type CL:0000129 \n", "\n", " unique_cell_count total_cell_count label \n", "69 268114 370771 microglial cell \n", "1038 48998 62617 microglial cell " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "census_counts.query(\"label == 'microglial cell'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Understanding Census contents beyond the summary tables\n", "\n", "While using the pre-computed tables in `census[\"census_info\"]` is an easy and quick way to understand the contents of the Census, it falls short if you want to learn more about certain slices of the Census.\n", "\n", "For example, you may want to learn more about:\n", "\n", "- What are the cell types available for human liver?\n", "- What are the total number of cells in all lung datasets stratified by sequencing technology?\n", "- What is the sex distribution of all cells from brain in mouse?\n", "- What are the diseases available for T cells?\n", "\n", "All of these questions can be answered by directly querying the cell metadata as shown in the examples below.\n", "\n", "### Example: all cell types available in human\n", "\n", "To exemplify the process of accessing and slicing cell metadata for summary stats, let's start with a trivial example and take a look at all human cell types available in the Census:\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:12.296812Z", "iopub.status.busy": "2023-07-28T14:20:12.296405Z", "iopub.status.idle": "2023-07-28T14:20:15.844398Z", "shell.execute_reply": "2023-07-28T14:20:15.843860Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>cell_type</th><th>is_primary_data</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>syncytiotrophoblast cell</td><td>False</td></tr>\n",
"    <tr><th>1</th><td>placental villous trophoblast</td><td>False</td></tr>\n",
"    <tr><th>2</th><td>syncytiotrophoblast cell</td><td>False</td></tr>\n",
"    <tr><th>3</th><td>syncytiotrophoblast cell</td><td>False</td></tr>\n",
"    <tr><th>4</th><td>extravillous trophoblast</td><td>False</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td></tr>\n",
"    <tr><th>56400868</th><td>pericyte</td><td>True</td></tr>\n",
"    <tr><th>56400869</th><td>pericyte</td><td>True</td></tr>\n",
"    <tr><th>56400870</th><td>pericyte</td><td>True</td></tr>\n",
"    <tr><th>56400871</th><td>pericyte</td><td>True</td></tr>\n",
"    <tr><th>56400872</th><td>pericyte</td><td>True</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>56400873 rows × 2 columns</p>\n",
"</div>
" ], "text/plain": [ " cell_type is_primary_data\n", "0 syncytiotrophoblast cell False\n", "1 placental villous trophoblast False\n", "2 syncytiotrophoblast cell False\n", "3 syncytiotrophoblast cell False\n", "4 extravillous trophoblast False\n", "... ... ...\n", "56400868 pericyte True\n", "56400869 pericyte True\n", "56400870 pericyte True\n", "56400871 pericyte True\n", "56400872 pericyte True\n", "\n", "[56400873 rows x 2 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "human_cell_types = (\n", " census[\"census_data\"][\"homo_sapiens\"].obs.read(column_names=[\"cell_type\", \"is_primary_data\"]).concat().to_pandas()\n", ")\n", "human_cell_types" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The number of rows is the total number of cells for humans. Now, if you wish to get the cell counts per cell type we can perform some `pandas` operations on this object.\n", "\n", "In addition, we will only focus on cells that are marked with `is_primary_data=True` as this ensures we de-duplicate cells that appear more than once in CELLxGENE Discover.\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:15.846897Z", "iopub.status.busy": "2023-07-28T14:20:15.846613Z", "iopub.status.idle": "2023-07-28T14:20:18.453082Z", "shell.execute_reply": "2023-07-28T14:20:18.452533Z" } }, "outputs": [ { "data": { "text/plain": [ "(33364242, 1)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "human_cell_types = (\n", " census[\"census_data\"][\"homo_sapiens\"]\n", " .obs.read(column_names=[\"cell_type\"], value_filter=\"is_primary_data == True\")\n", " .concat()\n", " .to_pandas()\n", ")\n", "\n", "human_cell_types = human_cell_types[[\"cell_type\"]]\n", "human_cell_types.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is the number of unique cells. 
Now let's look at the counts per cell type:\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:18.455764Z", "iopub.status.busy": "2023-07-28T14:20:18.455499Z", "iopub.status.idle": "2023-07-28T14:20:20.602756Z", "shell.execute_reply": "2023-07-28T14:20:20.602220Z" } }, "outputs": [ { "data": { "text/plain": [ "cell_type \n", "neuron 2673669\n", "glutamatergic neuron 1541605\n", "CD4-positive, alpha-beta T cell 1258976\n", "CD8-positive, alpha-beta T cell 1235987\n", "classical monocyte 1030996\n", " ... \n", "microfold cell of epithelium of small intestine 19\n", "mature conventional dendritic cell 17\n", "serous cell of epithelium of bronchus 15\n", "sperm 11\n", "type N enteroendocrine cell 10\n", "Name: count, Length: 599, dtype: int64" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "human_cell_type_counts = human_cell_types.value_counts()\n", "human_cell_type_counts" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows you that the most abundant cell types are \"neuron\", \"glutamatergic neuron\", \"CD4-positive, alpha-beta T cell\", and \"CD8-positive, alpha-beta T cell\".\n", "\n", "Now let's take a look at the number of unique cell types:\n" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:20.605245Z", "iopub.status.busy": "2023-07-28T14:20:20.604980Z", "iopub.status.idle": "2023-07-28T14:20:20.608595Z", "shell.execute_reply": "2023-07-28T14:20:20.608117Z" } }, "outputs": [ { "data": { "text/plain": [ "(599,)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "human_cell_type_counts.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That is the total number of different cell types for human.\n", "\n", "All the information in this example can be quickly obtained from the summary table at 
`census[\"census_info\"][\"summary_cell_counts\"]`.\n", "\n", "The examples below are more complex and can only be achieved by accessing the cell metadata.\n", "\n", "### Example: cell types available in human liver\n", "\n", "Similar to the example above, we can learn what cell types are available for a specific tissue, e.g. liver.\n", "\n", "To achieve this goal we just need to limit our cell metadata to that tissue. We will use the information in the cell metadata variable `tissue_general`. This variable contains the high-level tissue label for all cells in the Census:\n" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:20.610968Z", "iopub.status.busy": "2023-07-28T14:20:20.610624Z", "iopub.status.idle": "2023-07-28T14:20:21.566043Z", "shell.execute_reply": "2023-07-28T14:20:21.565314Z" } }, "outputs": [ { "data": { "text/plain": [ "cell_type\n", "T cell 85739\n", "hepatoblast 58447\n", "neoplastic cell 52431\n", "erythroblast 45605\n", "monocyte 31388\n", " ... \n", "pulmonary artery endothelial cell 1\n", "germinal center B cell 1\n", "enteroendocrine cell 1\n", "type I pneumocyte 1\n", "group 2 innate lymphoid cell 1\n", "Name: count, Length: 126, dtype: int64" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "human_liver_cell_types = (\n", " census[\"census_data\"][\"homo_sapiens\"]\n", " .obs.read(column_names=[\"cell_type\"], value_filter=\"is_primary_data == True and tissue_general == 'liver'\")\n", " .concat()\n", " .to_pandas()\n", ")\n", "\n", "human_liver_cell_types[\"cell_type\"].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the cell types and their cell counts in the human liver.\n", "\n", "### Example: diseased T cells in human tissues\n", "\n", "In this example we are going to get the counts for all diseased cells annotated as T cells. 
For the sake of the example we will focus on \"CD8-positive, alpha-beta T cell\" and \"CD4-positive, alpha-beta T cell\":\n" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:21.568806Z", "iopub.status.busy": "2023-07-28T14:20:21.568542Z", "iopub.status.idle": "2023-07-28T14:20:23.436424Z", "shell.execute_reply": "2023-07-28T14:20:23.435878Z" } }, "outputs": [ { "data": { "text/plain": [ "disease tissue_general \n", "B-cell non-Hodgkin lymphoma blood 62499\n", "COVID-19 blood 819428\n", " lung 30578\n", " nose 13\n", " respiratory system 4\n", " saliva 41\n", "Crohn disease colon 17490\n", " small intestine 52029\n", "Down syndrome bone marrow 181\n", "breast cancer breast 1850\n", "chronic obstructive pulmonary disease lung 9382\n", "chronic rhinitis nose 909\n", "clear cell renal carcinoma blood 6548\n", " kidney 20540\n", " lymph node 36\n", "cystic fibrosis lung 7\n", "follicular lymphoma lymph node 1089\n", "influenza blood 8871\n", "interstitial lung disease lung 1803\n", "kidney benign neoplasm blood 20\n", " kidney 10\n", "kidney oncocytoma blood 16\n", " kidney 2408\n", "lung adenocarcinoma adrenal gland 205\n", " brain 3274\n", " liver 507\n", " lung 215013\n", " lymph node 24969\n", " pleural fluid 11558\n", "lung large cell carcinoma lung 5922\n", "lymphangioleiomyomatosis lung 513\n", "non-small cell lung carcinoma lung 36573\n", "nonpapillary renal cell carcinoma adipose tissue 243\n", " adrenal gland 4828\n", " blood 288\n", " blood clot 1717\n", " kidney 69136\n", "pleomorphic carcinoma lung 1715\n", "pneumonia lung 856\n", "pulmonary fibrosis lung 1671\n", "respiratory system disorder blood 34301\n", "squamous cell lung carcinoma lung 52053\n", " lymph node 100\n", "systemic lupus erythematosus blood 355471\n", "Name: count, dtype: int64" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "t_cells_list = [\"CD8-positive, alpha-beta T 
cell\", \"CD4-positive, alpha-beta T cell\"]\n", "\n", "t_cells_diseased = (\n", " census[\"census_data\"][\"homo_sapiens\"]\n", " .obs.read(\n", " column_names=[\"disease\", \"tissue_general\"],\n", " value_filter=f\"is_primary_data == True and cell_type in {t_cells_list} and disease != 'normal'\",\n", " )\n", " .concat()\n", " .to_pandas()\n", ")\n", "\n", "t_cells_diseased = t_cells_diseased[[\"disease\", \"tissue_general\"]].value_counts(sort=False)\n", "t_cells_diseased" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are the cell counts annotated with the indicated disease across human tissues for \"CD8-positive, alpha-beta T cell\" or \"CD4-positive, alpha-beta T cell\".\n", "\n", "And, don't forget to close the census!\n" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:23.438957Z", "iopub.status.busy": "2023-07-28T14:20:23.438667Z", "iopub.status.idle": "2023-07-28T14:20:23.441777Z", "shell.execute_reply": "2023-07-28T14:20:23.441276Z" } }, "outputs": [], "source": [ "census.close()\n", "del census" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" }, "vscode": { "interpreter": { "hash": "3da8ec1c162cd849e59e6ea2824b2e353dce799884e910aae99411be5277f953" } } }, "nbformat": 4, "nbformat_minor": 2 }