{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Learning about the CZ CELLxGENE Census\n",
    "\n",
    "This notebook showcases the contents of the Census and how to obtain high-level information about them. It covers how data are organized within the Census, which cell and gene metadata are available, and simple demonstrations that summarize cell counts across cell metadata.\n",
    "\n",
    "**Contents**\n",
    "\n",
    "- Opening the Census\n",
    "- Census organization\n",
    "- Cell metadata\n",
    "- Gene metadata\n",
    "- Census summary content tables\n",
    "- Understanding Census contents beyond the summary tables\n",
    "\n",
    "⚠️ Note that the Census RNA data includes duplicate cells that are present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable `is_primary_data`, which is described in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#repeated-data).\n",
    "\n",
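    "As a toy illustration of what this flag means, the sketch below counts all cells versus those marked `is_primary_data` in a handful of made-up records. The records and their values are assumptions chosen only for this example; they are not Census data.\n",
    "\n",
    "```python\n",
    "# Made-up records: the same cell can appear in more than one dataset,\n",
    "# but only one copy is flagged as primary.\n",
    "cells = [\n",
    "    {'dataset_id': 'A', 'cell': 'c1', 'is_primary_data': True},\n",
    "    {'dataset_id': 'B', 'cell': 'c1', 'is_primary_data': False},\n",
    "    {'dataset_id': 'A', 'cell': 'c2', 'is_primary_data': True},\n",
    "    {'dataset_id': 'B', 'cell': 'c3', 'is_primary_data': True},\n",
    "]\n",
    "\n",
    "# Every record counts toward the total; only primary records count as unique.\n",
    "total_cell_count = len(cells)\n",
    "unique_cell_count = sum(1 for c in cells if c['is_primary_data'])\n",
    "\n",
    "print(total_cell_count, unique_cell_count)\n",
    "```\n",
    "\n",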
    "## Opening the Census\n",
    "\n",
    "The `cellxgene_census` Python package provides a convenient API to open the latest version of the Census. If you open the Census, you should also close it. `open_soma()` returns a context manager, so you can open and close it in several ways, much like a Python file handle. Using it in a `with` statement is preferred, because the Census is then closed automatically even if an error is raised.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:06.960041Z",
     "iopub.status.busy": "2023-07-28T14:20:06.959467Z",
     "iopub.status.idle": "2023-07-28T14:20:10.170466Z",
     "shell.execute_reply": "2023-07-28T14:20:10.169835Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n",
      "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    }
   ],
   "source": [
    "import cellxgene_census\n",
    "\n",
    "# Preferred: use a Python context manager\n",
    "with cellxgene_census.open_soma() as census:\n",
    "    ...\n",
    "\n",
    "# or\n",
    "census = cellxgene_census.open_soma()\n",
    "...\n",
    "census.close()"
   ]
  },
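  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To illustrate why the `with` form is the safer of the two patterns above, here is a minimal sketch using a hypothetical stand-in class, `FakeCensus`. The class is an assumption made for illustration and is not part of `cellxgene_census`; it only mimics a handle that can be closed.\n",
    "\n",
    "```python\n",
    "class FakeCensus:\n",
    "    # Hypothetical stand-in for an open Census handle (illustration only).\n",
    "    def __init__(self):\n",
    "        self.closed = False\n",
    "\n",
    "    def close(self):\n",
    "        self.closed = True\n",
    "\n",
    "    def __enter__(self):\n",
    "        return self\n",
    "\n",
    "    def __exit__(self, exc_type, exc_value, traceback):\n",
    "        # Runs on normal exit and also when the body raises.\n",
    "        self.close()\n",
    "        return False  # do not swallow the exception\n",
    "\n",
    "\n",
    "handle = FakeCensus()\n",
    "try:\n",
    "    with handle as census:\n",
    "        raise RuntimeError('simulated failure while reading')\n",
    "except RuntimeError:\n",
    "    pass\n",
    "\n",
    "print(handle.closed)  # True: the handle was closed despite the error\n",
    "```\n",
    "\n",
    "With the explicit open and `close()` pattern, the same error would skip `close()` unless the work is wrapped in `try`/`finally`.\n"
   ]
  },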
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can learn more about the `cellxgene_census` methods by reading their documentation via `help()`. For example, `help(cellxgene_census.open_soma)`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:10.173670Z",
     "iopub.status.busy": "2023-07-28T14:20:10.173047Z",
     "iopub.status.idle": "2023-07-28T14:20:10.494368Z",
     "shell.execute_reply": "2023-07-28T14:20:10.493750Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    }
   ],
   "source": [
    "census = cellxgene_census.open_soma()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Census organization\n",
    "\n",
    "The [Census schema](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_schema.html) defines the structure of the Census. In short, you can think of the Census as a structured collection of items that stores different pieces of information. All of these items and the parent collection are SOMA objects of various types and can all be accessed with the [TileDB-SOMA API](https://github.com/single-cell-data/TileDB-SOMA) ([documentation](https://tiledbsoma.readthedocs.io/en/latest/)).\n",
    "\n",
    "\n",
    "The `cellxgene_census` package contains some convenient wrappers of the `TileDB-SOMA` API. An example of this is the function we used to open the Census: `cellxgene_census.open_soma()`.\n",
    "\n",
    "### Main Census components\n",
    "\n",
    "With the command above you created `census`, which is a `SOMACollection`. It is analogous to a Python dictionary, and it has two items: `census_info` and `census_data`.\n",
    "\n",
    "#### Census summary info\n",
    "\n",
    "- `census[\"census_info\"]`: A collection of tables providing information about the Census as a whole.\n",
    "  - `census[\"census_info\"][\"summary\"]`: A data frame with high-level information about this Census, e.g. build date, total cell count, etc.\n",
    "  - `census[\"census_info\"][\"datasets\"]`: A data frame with all datasets from [CELLxGENE Discover](https://cellxgene.cziscience.com/) used to create the Census.\n",
    "  - `census[\"census_info\"][\"summary_cell_counts\"]`: A data frame with cell counts stratified by **relevant** cell metadata.\n",
    "\n",
    "#### Census data\n",
    "\n",
    "Data for each organism is stored in independent `SOMAExperiment` objects, which are a specialized form of `SOMACollection`. Each of these stores a data matrix (cells by genes), cell metadata, gene metadata, and some other useful components not covered in this notebook.\n",
    "\n",
    "This is how the data is organized for one organism -- _Homo sapiens_:\n",
    "\n",
    "- `census[\"census_data\"][\"homo_sapiens\"].obs`: Cell metadata\n",
    "- `census[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].X`: Data matrices; currently only raw counts exist, in `X[\"raw\"]`\n",
    "- `census[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].var`: Gene metadata"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cell metadata\n",
    "\n",
    "You can obtain all cell metadata variables by directly querying the columns of the corresponding `SOMADataFrame`.\n",
    "\n",
    "All of these variables can be used for querying the Census in case you want to work with specific cells.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:10.497463Z",
     "iopub.status.busy": "2023-07-28T14:20:10.496989Z",
     "iopub.status.idle": "2023-07-28T14:20:10.941903Z",
     "shell.execute_reply": "2023-07-28T14:20:10.941358Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['soma_joinid',\n",
       " 'dataset_id',\n",
       " 'assay',\n",
       " 'assay_ontology_term_id',\n",
       " 'cell_type',\n",
       " 'cell_type_ontology_term_id',\n",
       " 'development_stage',\n",
       " 'development_stage_ontology_term_id',\n",
       " 'disease',\n",
       " 'disease_ontology_term_id',\n",
       " 'donor_id',\n",
       " 'is_primary_data',\n",
       " 'self_reported_ethnicity',\n",
       " 'self_reported_ethnicity_ontology_term_id',\n",
       " 'sex',\n",
       " 'sex_ontology_term_id',\n",
       " 'suspension_type',\n",
       " 'tissue',\n",
       " 'tissue_ontology_term_id',\n",
       " 'tissue_general',\n",
       " 'tissue_general_ontology_term_id']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "keys = list(census[\"census_data\"][\"homo_sapiens\"].obs.keys())\n",
    "\n",
    "keys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All of these variables are defined in the [CELLxGENE dataset schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#obs-cell-metadata) except for the following:\n",
    "\n",
    "- `soma_joinid`: a SOMA-defined value used for join operations.\n",
    "- `dataset_id`: the dataset id as encoded in `census[\"census_info\"][\"datasets\"]`.\n",
    "- `tissue_general` and `tissue_general_ontology_term_id`: the high-level tissue mapping.\n"
   ]
  },
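  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because any of these columns can appear in a query, a filter expression can be assembled from a plain dictionary. The helper below, `build_value_filter`, is a hypothetical convenience written for this sketch and is not part of `cellxgene_census`. The commented usage line assumes the `value_filter` argument of the TileDB-SOMA `read()` method.\n",
    "\n",
    "```python\n",
    "def build_value_filter(criteria):\n",
    "    # Hypothetical helper: joins `column == value` terms with `and`.\n",
    "    # repr() quotes strings and leaves booleans and numbers bare;\n",
    "    # values that themselves contain quotes are not handled here.\n",
    "    return ' and '.join(\n",
    "        '{} == {!r}'.format(column, value) for column, value in criteria.items()\n",
    "    )\n",
    "\n",
    "\n",
    "obs_filter = build_value_filter(\n",
    "    {'cell_type': 'microglial cell', 'is_primary_data': True}\n",
    ")\n",
    "print(obs_filter)\n",
    "\n",
    "# Assumed usage against an open Census (requires network access):\n",
    "# census['census_data']['homo_sapiens'].obs.read(value_filter=obs_filter)\n",
    "```\n",
    "\n",
    "Keeping the criteria in a dictionary makes it easy to add or drop a condition without editing the filter string by hand.\n"
   ]
  },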
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Gene metadata\n",
    "\n",
    "Similarly, we can obtain all gene metadata variables by directly querying the columns of the corresponding `SOMADataFrame`.\n",
    "\n",
    "These are the variables you can use for querying the Census in case there are specific genes you are interested in.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:10.944483Z",
     "iopub.status.busy": "2023-07-28T14:20:10.944219Z",
     "iopub.status.idle": "2023-07-28T14:20:11.225599Z",
     "shell.execute_reply": "2023-07-28T14:20:11.225072Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['soma_joinid', 'feature_id', 'feature_name', 'feature_length']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "keys = list(census[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].var.keys())\n",
    "\n",
    "keys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All of these variables are defined in the [CELLxGENE dataset schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#var-and-rawvar-gene-metadata) except for the following:\n",
    "\n",
    "- `soma_joinid`: a SOMA-defined value used for join operations.\n",
    "- `feature_length`: the length in base pairs of the gene.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:11.228296Z",
     "iopub.status.busy": "2023-07-28T14:20:11.227881Z",
     "iopub.status.idle": "2023-07-28T14:20:11.719446Z",
     "shell.execute_reply": "2023-07-28T14:20:11.718694Z"
    },
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr><th></th><th>soma_joinid</th><th>label</th><th>value</th></tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr><th>0</th><td>0</td><td>census_schema_version</td><td>1.0.0</td></tr>\n",
       "    <tr><th>1</th><td>1</td><td>census_build_date</td><td>2023-07-25</td></tr>\n",
       "    <tr><th>2</th><td>2</td><td>dataset_schema_version</td><td>3.0.0</td></tr>\n",
       "    <tr><th>3</th><td>3</td><td>total_cell_count</td><td>61656118</td></tr>\n",
       "    <tr><th>4</th><td>4</td><td>unique_cell_count</td><td>37447773</td></tr>\n",
       "    <tr><th>5</th><td>5</td><td>number_donors_homo_sapiens</td><td>13035</td></tr>\n",
       "    <tr><th>6</th><td>6</td><td>number_donors_mus_musculus</td><td>1417</td></tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr><th></th><th>soma_joinid</th><th>organism</th><th>category</th><th>ontology_term_id</th><th>unique_cell_count</th><th>total_cell_count</th><th>label</th></tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr><th>0</th><td>0</td><td>Homo sapiens</td><td>all</td><td>na</td><td>33364242</td><td>56400873</td><td>na</td></tr>\n",
       "    <tr><th>1</th><td>1</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008722</td><td>264166</td><td>279635</td><td>Drop-seq</td></tr>\n",
       "    <tr><th>2</th><td>2</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008780</td><td>25652</td><td>51304</td><td>inDrop</td></tr>\n",
       "    <tr><th>3</th><td>3</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008919</td><td>89477</td><td>206754</td><td>Seq-Well</td></tr>\n",
       "    <tr><th>4</th><td>4</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008931</td><td>78750</td><td>188248</td><td>Smart-seq2</td></tr>\n",
       "    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
       "    <tr><th>1357</th><td>1357</td><td>Mus musculus</td><td>tissue_general</td><td>UBERON:0002113</td><td>179684</td><td>208324</td><td>kidney</td></tr>\n",
       "    <tr><th>1358</th><td>1358</td><td>Mus musculus</td><td>tissue_general</td><td>UBERON:0002365</td><td>15577</td><td>31154</td><td>exocrine gland</td></tr>\n",
       "    <tr><th>1359</th><td>1359</td><td>Mus musculus</td><td>tissue_general</td><td>UBERON:0002367</td><td>37715</td><td>130135</td><td>prostate gland</td></tr>\n",
       "    <tr><th>1360</th><td>1360</td><td>Mus musculus</td><td>tissue_general</td><td>UBERON:0002368</td><td>13322</td><td>26644</td><td>endocrine gland</td></tr>\n",
       "    <tr><th>1361</th><td>1361</td><td>Mus musculus</td><td>tissue_general</td><td>UBERON:0002371</td><td>90225</td><td>144962</td><td>bone marrow</td></tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1362 rows × 7 columns</p>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr><th></th><th>soma_joinid</th><th>organism</th><th>category</th><th>ontology_term_id</th><th>unique_cell_count</th><th>total_cell_count</th><th>label</th></tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr><th>10</th><td>10</td><td>Homo sapiens</td><td>assay</td><td>EFO:0009922</td><td>11845077</td><td>25597563</td><td>10x 3' v3</td></tr>\n",
       "    <tr><th>7</th><td>7</td><td>Homo sapiens</td><td>assay</td><td>EFO:0009899</td><td>7559102</td><td>12638794</td><td>10x 3' v2</td></tr>\n",
       "    <tr><th>14</th><td>14</td><td>Homo sapiens</td><td>assay</td><td>EFO:0011025</td><td>3872375</td><td>6139786</td><td>10x 5' v1</td></tr>\n",
       "    <tr><th>13</th><td>13</td><td>Homo sapiens</td><td>assay</td><td>EFO:0010550</td><td>4062980</td><td>5064268</td><td>sci-RNA-seq</td></tr>\n",
       "    <tr><th>8</th><td>8</td><td>Homo sapiens</td><td>assay</td><td>EFO:0009900</td><td>2930054</td><td>3139770</td><td>10x 5' v2</td></tr>\n",
       "    <tr><th>17</th><td>17</td><td>Homo sapiens</td><td>assay</td><td>EFO:0030004</td><td>915037</td><td>1084235</td><td>10x 5' transcription profiling</td></tr>\n",
       "    <tr><th>16</th><td>16</td><td>Homo sapiens</td><td>assay</td><td>EFO:0030003</td><td>744798</td><td>811422</td><td>10x 3' transcription profiling</td></tr>\n",
       "    <tr><th>15</th><td>15</td><td>Homo sapiens</td><td>assay</td><td>EFO:0030002</td><td>625175</td><td>642559</td><td>microwell-seq</td></tr>\n",
       "    <tr><th>1</th><td>1</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008722</td><td>264166</td><td>279635</td><td>Drop-seq</td></tr>\n",
       "    <tr><th>3</th><td>3</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008919</td><td>89477</td><td>206754</td><td>Seq-Well</td></tr>\n",
       "    <tr><th>4</th><td>4</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008931</td><td>78750</td><td>188248</td><td>Smart-seq2</td></tr>\n",
       "    <tr><th>18</th><td>18</td><td>Homo sapiens</td><td>assay</td><td>EFO:0700003</td><td>146278</td><td>177276</td><td>BD Rhapsody Whole Transcriptome Analysis</td></tr>\n",
       "    <tr><th>9</th><td>9</td><td>Homo sapiens</td><td>assay</td><td>EFO:0009901</td><td>42397</td><td>121394</td><td>10x 3' v1</td></tr>\n",
       "    <tr><th>12</th><td>12</td><td>Homo sapiens</td><td>assay</td><td>EFO:0010183</td><td>58981</td><td>117962</td><td>single cell library construction</td></tr>\n",
       "    <tr><th>19</th><td>19</td><td>Homo sapiens</td><td>assay</td><td>EFO:0700004</td><td>96145</td><td>96145</td><td>BD Rhapsody Targeted mRNA</td></tr>\n",
       "    <tr><th>2</th><td>2</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008780</td><td>25652</td><td>51304</td><td>inDrop</td></tr>\n",
       "    <tr><th>6</th><td>6</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008995</td><td>0</td><td>29128</td><td>10x technology</td></tr>\n",
       "    <tr><th>5</th><td>5</td><td>Homo sapiens</td><td>assay</td><td>EFO:0008953</td><td>4693</td><td>9386</td><td>STRT-seq</td></tr>\n",
       "    <tr><th>11</th><td>11</td><td>Homo sapiens</td><td>assay</td><td>EFO:0010010</td><td>3105</td><td>5244</td><td>CEL-seq2</td></tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr><th></th><th>soma_joinid</th><th>organism</th><th>category</th><th>ontology_term_id</th><th>unique_cell_count</th><th>total_cell_count</th><th>label</th></tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr><th>69</th><td>69</td><td>Homo sapiens</td><td>cell_type</td><td>CL:0000129</td><td>268114</td><td>370771</td><td>microglial cell</td></tr>\n",
       "    <tr><th>1038</th><td>1038</td><td>Mus musculus</td><td>cell_type</td><td>CL:0000129</td><td>48998</td><td>62617</td><td>microglial cell</td></tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "
\n",
       "  \n",
       "    \n",
       "      | \n",
       " | cell_type\n",
       " | is_primary_data\n",
       " | 
\n",
       "  \n",
       "  \n",
       "    \n",
       "      | 0\n",
       " | syncytiotrophoblast cell\n",
       " | False\n",
       " | 
\n",
       "    \n",
       "      | 1\n",
       " | placental villous trophoblast\n",
       " | False\n",
       " | 
\n",
       "    \n",
       "      | 2\n",
       " | syncytiotrophoblast cell\n",
       " | False\n",
       " | 
\n",
       "    \n",
       "      | 3\n",
       " | syncytiotrophoblast cell\n",
       " | False\n",
       " | 
\n",
       "    \n",
       "      | 4\n",
       " | extravillous trophoblast\n",
       " | False\n",
       " | 
\n",
       "    \n",
       "      | ...\n",
       " | ...\n",
       " | ...\n",
       " | 
\n",
       "    \n",
       "      | 56400868\n",
       " | pericyte\n",
       " | True\n",
       " | 
\n",
       "    \n",
       "      | 56400869\n",
       " | pericyte\n",
       " | True\n",
       " | 
\n",
       "    \n",
       "      | 56400870\n",
       " | pericyte\n",
       " | True\n",
       " | 
\n",
       "    \n",
       "      | 56400871\n",
       " | pericyte\n",
       " | True\n",
       " | 
\n",
       "    \n",
       "      | 56400872\n",
       " | pericyte\n",
       " | True\n",
       " | 
\n",
       "  \n",
       "
\n",
       "
56400873 rows × 2 columns
\n",
       "