{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Learning about the CZ CELLxGENE Census\n",
    "\n",
    "This notebook showcases the Census contents and how to obtain high-level information about it. It covers the organization of data within the Census, what cell and gene metadata are available, and it provides simple demonstrations to summarize cell counts across cell metadata. \n",
    "\n",
    "**Contents**\n",
    "\n",
    "- Opening the census\n",
    "- Census organization\n",
    "- Cell metadata\n",
    "- Gene metadata\n",
    "- Census summary content tables\n",
    "- Understanding Census contents beyond the summary tables\n",
    "\n",
    "⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable `is_primary_data` which is described in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#repeated-data).\n",
    "\n",
    "## Opening the Census\n",
    "\n",
    "The `cellxgene_census` python package contains a convenient API to open the latest version of the Census. If you open the census, you should close it. `open_soma()` returns a context, so you can open/close it in several ways, like a Python file handle. The context manager is preferred, as it will automatically close upon an error raise.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:06.960041Z",
     "iopub.status.busy": "2023-07-28T14:20:06.959467Z",
     "iopub.status.idle": "2023-07-28T14:20:10.170466Z",
     "shell.execute_reply": "2023-07-28T14:20:10.169835Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n",
      "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    }
   ],
   "source": [
    "import cellxgene_census\n",
    "\n",
    "# Preferred: use a Python context manager\n",
    "with cellxgene_census.open_soma() as census:\n",
    "    ...\n",
    "\n",
    "# or\n",
    "census = cellxgene_census.open_soma()\n",
    "...\n",
    "census.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can learn more about the `cellxgene_census` methods by accessing their corresponding documentation via `help()`. For example `help(cellxgene_census.open_soma)`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:10.173670Z",
     "iopub.status.busy": "2023-07-28T14:20:10.173047Z",
     "iopub.status.idle": "2023-07-28T14:20:10.494368Z",
     "shell.execute_reply": "2023-07-28T14:20:10.493750Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    }
   ],
   "source": [
    "census = cellxgene_census.open_soma()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Census organization\n",
    "\n",
    "The [Census schema](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_schema.html) defines the structure of the Census. In short, you can think of the Census as a structured collection of items that stores different pieces of information. All of these items and the parent collection are SOMA objects of various types and can all be accessed with the [TileDB-SOMA API](https://github.com/single-cell-data/TileDB-SOMA) ([documentation](https://tiledbsoma.readthedocs.io/en/latest/)).\n",
    "\n",
    "\n",
    "The `cellxgene_census` package contains some convenient wrappers of the `TileDB-SOMA` API. An example of this is the function we used to open the Census: `cellxgene_census.open_soma()`\n",
    "\n",
    "### Main Census components\n",
    "\n",
    "With the command above you created `census`, which is a `SOMACollection`. It is analogous to a Python dictionary, and it has two items: `census_info` and `census_data`.\n",
    "\n",
    "#### Census summary info\n",
    "\n",
    "- `census[\"census_info\"]` A collection of tables providing information of the census as a whole.\n",
    "  - `census[\"census_info\"][\"summary\"]`: A data frame with high-level information of this Census, e.g. build date, total cell count, etc.\n",
    "  - `census[\"census_info\"][\"datasets\"]`: A data frame with all datasets from [CELLxGENE Discover](https://cellxgene.cziscience.com/) used to create the Census.\n",
    "  - `census[\"census_info\"][\"summary_cell_counts\"]`: A data frame with cell counts stratified by **relevant** cell metadata\n",
    "\n",
    "#### Census data\n",
    "\n",
    "Data for each organism is stored in independent `SOMAExperiment` objects which are a specialized form of a `SOMACollection`. Each of these store a data matrix (cell by genes), cell metadata, gene metadata, and some other useful components not covered in this notebook.\n",
    "\n",
    "This is how the data is organized for one organism -- _Homo sapiens_:\n",
    "\n",
    "- `census_obj[\"census_data\"][\"homo_sapiens\"].obs`: Cell metadata\n",
    "- `census_obj[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].X:` Data matrices, currently only raw counts exist `X[\"raw\"]`\n",
    "- `census_obj[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].var:` Gene Metadata"
   ]
  },
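  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a rough mental model of the organization described above, the sketch below mirrors the hierarchy with plain nested Python dictionaries. It is an illustration only: `toy_census` and `list_paths` are hypothetical names, no SOMA objects are involved, and in the real Census `obs`, `ms`, `X` and `var` are reached exactly as written in the list above rather than as uniform dictionary keys.\n",
    "\n",
    "```python\n",
    "# Toy stand-in for the Census layout: plain dicts, NOT real SOMA objects.\n",
    "# Leaf strings are short descriptions taken from the text above.\n",
    "toy_census = {\n",
    "    \"census_info\": {\n",
    "        \"summary\": \"high-level information (build date, total cell count, ...)\",\n",
    "        \"datasets\": \"datasets from CELLxGENE Discover used to create the Census\",\n",
    "        \"summary_cell_counts\": \"cell counts stratified by cell metadata\",\n",
    "    },\n",
    "    \"census_data\": {\n",
    "        \"homo_sapiens\": {\n",
    "            \"obs\": \"cell metadata\",\n",
    "            \"ms\": {\"RNA\": {\"X\": {\"raw\": \"raw counts\"}, \"var\": \"gene metadata\"}},\n",
    "        },\n",
    "    },\n",
    "}\n",
    "\n",
    "\n",
    "def list_paths(node, prefix=()):\n",
    "    # Hypothetical helper: return every key path in a nested dict, depth first.\n",
    "    paths = []\n",
    "    for key, child in node.items():\n",
    "        path = prefix + (key,)\n",
    "        if isinstance(child, dict):\n",
    "            paths.extend(list_paths(child, path))\n",
    "        else:\n",
    "            paths.append(path)\n",
    "    return paths\n",
    "\n",
    "\n",
    "for path in list_paths(toy_census):\n",
    "    print(\"/\".join(path))\n",
    "```\n",
    "\n",
    "Each printed line is one path from the top-level collection down to a table or matrix, which can help when deciding where to look for a given piece of information.\n"
   ]
  },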
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Cell metadata\n",
    "\n",
    "You can obtain all cell metadata variables by directly querying the columns of the corresponding `SOMADataFrame`.\n",
    "\n",
    "All of these variables can be used for querying the Census in case you want to work with specific cells.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:10.497463Z",
     "iopub.status.busy": "2023-07-28T14:20:10.496989Z",
     "iopub.status.idle": "2023-07-28T14:20:10.941903Z",
     "shell.execute_reply": "2023-07-28T14:20:10.941358Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['soma_joinid',\n",
       " 'dataset_id',\n",
       " 'assay',\n",
       " 'assay_ontology_term_id',\n",
       " 'cell_type',\n",
       " 'cell_type_ontology_term_id',\n",
       " 'development_stage',\n",
       " 'development_stage_ontology_term_id',\n",
       " 'disease',\n",
       " 'disease_ontology_term_id',\n",
       " 'donor_id',\n",
       " 'is_primary_data',\n",
       " 'self_reported_ethnicity',\n",
       " 'self_reported_ethnicity_ontology_term_id',\n",
       " 'sex',\n",
       " 'sex_ontology_term_id',\n",
       " 'suspension_type',\n",
       " 'tissue',\n",
       " 'tissue_ontology_term_id',\n",
       " 'tissue_general',\n",
       " 'tissue_general_ontology_term_id']"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "keys = list(census[\"census_data\"][\"homo_sapiens\"].obs.keys())\n",
    "\n",
    "keys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All of these variables are defined in the [CELLxGENE dataset schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#obs-cell-metadata) except for the following:\n",
    "\n",
    "- `soma_joinid`: a SOMA-defined value use for join operations.\n",
    "- `dataset_id`: the dataset id as encoded in `census[\"census-info\"][\"datasets\"]`.\n",
    "- `tissue_general` and `tissue_general_ontology_term_id`: the high-level tissue mapping.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Gene metadata\n",
    "\n",
    "Similarly, we can obtain all gene metadata variables by directly querying the columns of the corresponding `SOMADataFrame`.\n",
    "\n",
    "These are the variables you can use for querying the Census in case there are specific genes you are interested in.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:10.944483Z",
     "iopub.status.busy": "2023-07-28T14:20:10.944219Z",
     "iopub.status.idle": "2023-07-28T14:20:11.225599Z",
     "shell.execute_reply": "2023-07-28T14:20:11.225072Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['soma_joinid', 'feature_id', 'feature_name', 'feature_length']"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "keys = list(census[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].var.keys())\n",
    "\n",
    "keys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "All of these variables are defined in the [CELLxGENE dataset schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#var-and-rawvar-gene-metadata) except for the following:\n",
    "\n",
    "- `soma_joinid`: a SOMA-defined value use for join operations.\n",
    "- `feature_length`: the length in base pairs of the gene.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:11.228296Z",
     "iopub.status.busy": "2023-07-28T14:20:11.227881Z",
     "iopub.status.idle": "2023-07-28T14:20:11.719446Z",
     "shell.execute_reply": "2023-07-28T14:20:11.718694Z"
    },
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>label</th>\n",
       "      <th>value</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>census_schema_version</td>\n",
       "      <td>1.0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>census_build_date</td>\n",
       "      <td>2023-07-25</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>dataset_schema_version</td>\n",
       "      <td>3.0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>total_cell_count</td>\n",
       "      <td>61656118</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>unique_cell_count</td>\n",
       "      <td>37447773</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>number_donors_homo_sapiens</td>\n",
       "      <td>13035</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>number_donors_mus_musculus</td>\n",
       "      <td>1417</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   soma_joinid                       label       value\n",
       "0            0       census_schema_version       1.0.0\n",
       "1            1           census_build_date  2023-07-25\n",
       "2            2      dataset_schema_version       3.0.0\n",
       "3            3            total_cell_count    61656118\n",
       "4            4           unique_cell_count    37447773\n",
       "5            5  number_donors_homo_sapiens       13035\n",
       "6            6  number_donors_mus_musculus        1417"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "census_info = census[\"census_info\"][\"summary\"].read().concat().to_pandas()\n",
    "\n",
    "census_info"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Census summary content tables\n",
    "\n",
    "You can take a quick look at the high-level Census information by looking at `census[\"census_info\"][\"summary\"]`\n",
    "\n",
    "Of special interest are the `label`-`value` combinations for :\n",
    "\n",
    "- `total_cell_count` is the total number of cells in the Census.\n",
    "- `unique_cell_count` is the number of unique cells, as some cells may be present twice due to meta-analysis or consortia-like data.\n",
    "- `number_donors_homo_sapiens` and `number_donors_mus_musculus` are the number of individuals for human and mouse. These are not guaranteed to be unique as one individual ID may be present or identical in different datasets.\n",
    "\n",
    "### Cell counts by cell metadata\n",
    "\n",
    "By looking at `census[\"summary_cell_counts\"]` you can get a general idea of cell counts stratified by **some relevant** cell metadata. Not all cell metadata is included in this table, you can take a look at all cell and gene metadata available in the sections below \"Cell metadata\" and \"Gene metadata\".\n",
    "\n",
    "The line below retrieves this table and casts it into a `pandas.DataFrame`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:11.722195Z",
     "iopub.status.busy": "2023-07-28T14:20:11.721723Z",
     "iopub.status.idle": "2023-07-28T14:20:12.262344Z",
     "shell.execute_reply": "2023-07-28T14:20:12.261575Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>organism</th>\n",
       "      <th>category</th>\n",
       "      <th>ontology_term_id</th>\n",
       "      <th>unique_cell_count</th>\n",
       "      <th>total_cell_count</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>all</td>\n",
       "      <td>na</td>\n",
       "      <td>33364242</td>\n",
       "      <td>56400873</td>\n",
       "      <td>na</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0008722</td>\n",
       "      <td>264166</td>\n",
       "      <td>279635</td>\n",
       "      <td>Drop-seq</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0008780</td>\n",
       "      <td>25652</td>\n",
       "      <td>51304</td>\n",
       "      <td>inDrop</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0008919</td>\n",
       "      <td>89477</td>\n",
       "      <td>206754</td>\n",
       "      <td>Seq-Well</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0008931</td>\n",
       "      <td>78750</td>\n",
       "      <td>188248</td>\n",
       "      <td>Smart-seq2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1357</th>\n",
       "      <td>1357</td>\n",
       "      <td>Mus musculus</td>\n",
       "      <td>tissue_general</td>\n",
       "      <td>UBERON:0002113</td>\n",
       "      <td>179684</td>\n",
       "      <td>208324</td>\n",
       "      <td>kidney</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1358</th>\n",
       "      <td>1358</td>\n",
       "      <td>Mus musculus</td>\n",
       "      <td>tissue_general</td>\n",
       "      <td>UBERON:0002365</td>\n",
       "      <td>15577</td>\n",
       "      <td>31154</td>\n",
       "      <td>exocrine gland</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1359</th>\n",
       "      <td>1359</td>\n",
       "      <td>Mus musculus</td>\n",
       "      <td>tissue_general</td>\n",
       "      <td>UBERON:0002367</td>\n",
       "      <td>37715</td>\n",
       "      <td>130135</td>\n",
       "      <td>prostate gland</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1360</th>\n",
       "      <td>1360</td>\n",
       "      <td>Mus musculus</td>\n",
       "      <td>tissue_general</td>\n",
       "      <td>UBERON:0002368</td>\n",
       "      <td>13322</td>\n",
       "      <td>26644</td>\n",
       "      <td>endocrine gland</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1361</th>\n",
       "      <td>1361</td>\n",
       "      <td>Mus musculus</td>\n",
       "      <td>tissue_general</td>\n",
       "      <td>UBERON:0002371</td>\n",
       "      <td>90225</td>\n",
       "      <td>144962</td>\n",
       "      <td>bone marrow</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1362 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      soma_joinid      organism        category ontology_term_id  \\\n",
       "0               0  Homo sapiens             all               na   \n",
       "1               1  Homo sapiens           assay      EFO:0008722   \n",
       "2               2  Homo sapiens           assay      EFO:0008780   \n",
       "3               3  Homo sapiens           assay      EFO:0008919   \n",
       "4               4  Homo sapiens           assay      EFO:0008931   \n",
       "...           ...           ...             ...              ...   \n",
       "1357         1357  Mus musculus  tissue_general   UBERON:0002113   \n",
       "1358         1358  Mus musculus  tissue_general   UBERON:0002365   \n",
       "1359         1359  Mus musculus  tissue_general   UBERON:0002367   \n",
       "1360         1360  Mus musculus  tissue_general   UBERON:0002368   \n",
       "1361         1361  Mus musculus  tissue_general   UBERON:0002371   \n",
       "\n",
       "      unique_cell_count  total_cell_count            label  \n",
       "0              33364242          56400873               na  \n",
       "1                264166            279635         Drop-seq  \n",
       "2                 25652             51304           inDrop  \n",
       "3                 89477            206754         Seq-Well  \n",
       "4                 78750            188248       Smart-seq2  \n",
       "...                 ...               ...              ...  \n",
       "1357             179684            208324           kidney  \n",
       "1358              15577             31154   exocrine gland  \n",
       "1359              37715            130135   prostate gland  \n",
       "1360              13322             26644  endocrine gland  \n",
       "1361              90225            144962      bone marrow  \n",
       "\n",
       "[1362 rows x 7 columns]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "census_counts = census[\"census_info\"][\"summary_cell_counts\"].read().concat().to_pandas()\n",
    "\n",
    "census_counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For each combination of `organism` and values for each `category` of cell metadata you can take a look at `total_cell_count` and `unique_cell_count` for the cell counts of that combination.\n",
    "\n",
    "The values for each `category` are specified in `ontology_term_id` and `label`, which are the value's IDs and labels, respectively.\n",
    "\n",
    "#### Example: cell metadata included in the summary counts table\n",
    "\n",
    "To get all the available cell metadata in the summary counts table you can do the following. Remember this is not all the cell metadata available, as some variables were omitted in the creation of this table.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:12.264918Z",
     "iopub.status.busy": "2023-07-28T14:20:12.264659Z",
     "iopub.status.idle": "2023-07-28T14:20:12.271618Z",
     "shell.execute_reply": "2023-07-28T14:20:12.271084Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "organism      category               \n",
       "Homo sapiens  all                          1\n",
       "              assay                       19\n",
       "              cell_type                  613\n",
       "              disease                     64\n",
       "              self_reported_ethnicity     26\n",
       "              sex                          3\n",
       "              suspension_type              1\n",
       "              tissue                     220\n",
       "              tissue_general              54\n",
       "Mus musculus  all                          1\n",
       "              assay                        9\n",
       "              cell_type                  248\n",
       "              disease                      5\n",
       "              self_reported_ethnicity      1\n",
       "              sex                          3\n",
       "              suspension_type              1\n",
       "              tissue                      66\n",
       "              tissue_general              27\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "census_counts[[\"organism\", \"category\"]].value_counts(sort=False)"
   ]
  },
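  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To relate `total_cell_count` and `unique_cell_count`, one option is to derive, per row, how many entries are duplicates and what share of the cells is unique. The sketch below is an illustration only: it hard-codes four human assay rows copied from the `census_counts` output shown earlier instead of querying the Census, and the helper name `duplicate_stats` is hypothetical.\n",
    "\n",
    "```python\n",
    "# Four rows copied from the census_counts output (Homo sapiens, category \"assay\").\n",
    "rows = [\n",
    "    {\"label\": \"Drop-seq\", \"unique_cell_count\": 264166, \"total_cell_count\": 279635},\n",
    "    {\"label\": \"inDrop\", \"unique_cell_count\": 25652, \"total_cell_count\": 51304},\n",
    "    {\"label\": \"Seq-Well\", \"unique_cell_count\": 89477, \"total_cell_count\": 206754},\n",
    "    {\"label\": \"Smart-seq2\", \"unique_cell_count\": 78750, \"total_cell_count\": 188248},\n",
    "]\n",
    "\n",
    "\n",
    "def duplicate_stats(row):\n",
    "    # Hypothetical helper: number of non-unique entries and the unique share.\n",
    "    duplicates = row[\"total_cell_count\"] - row[\"unique_cell_count\"]\n",
    "    unique_fraction = row[\"unique_cell_count\"] / row[\"total_cell_count\"]\n",
    "    return duplicates, unique_fraction\n",
    "\n",
    "\n",
    "for row in rows:\n",
    "    duplicates, unique_fraction = duplicate_stats(row)\n",
    "    print(f\"{row['label']}: {duplicates} duplicates, {unique_fraction:.1%} unique\")\n",
    "```\n",
    "\n",
    "With the real table, the same arithmetic could be applied to the `pandas.DataFrame` columns directly, for example `census_counts[\"total_cell_count\"] - census_counts[\"unique_cell_count\"]`.\n"
   ]
  },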
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example: cell counts for each sequencing assay in human data\n",
    "\n",
    "To get the cell counts for each sequencing assay type in human data, you can perform the following `pandas.DataFrame` operations:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:12.273932Z",
     "iopub.status.busy": "2023-07-28T14:20:12.273685Z",
     "iopub.status.idle": "2023-07-28T14:20:12.284771Z",
     "shell.execute_reply": "2023-07-28T14:20:12.284296Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>organism</th>\n",
       "      <th>category</th>\n",
       "      <th>ontology_term_id</th>\n",
       "      <th>unique_cell_count</th>\n",
       "      <th>total_cell_count</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>10</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0009922</td>\n",
       "      <td>11845077</td>\n",
       "      <td>25597563</td>\n",
       "      <td>10x 3' v3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>7</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0009899</td>\n",
       "      <td>7559102</td>\n",
       "      <td>12638794</td>\n",
       "      <td>10x 3' v2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>14</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0011025</td>\n",
       "      <td>3872375</td>\n",
       "      <td>6139786</td>\n",
       "      <td>10x 5' v1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>13</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0010550</td>\n",
       "      <td>4062980</td>\n",
       "      <td>5064268</td>\n",
       "      <td>sci-RNA-seq</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>8</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0009900</td>\n",
       "      <td>2930054</td>\n",
       "      <td>3139770</td>\n",
       "      <td>10x 5' v2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>17</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0030004</td>\n",
       "      <td>915037</td>\n",
       "      <td>1084235</td>\n",
       "      <td>10x 5' transcription profiling</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>16</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0030003</td>\n",
       "      <td>744798</td>\n",
       "      <td>811422</td>\n",
       "      <td>10x 3' transcription profiling</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>15</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0030002</td>\n",
       "      <td>625175</td>\n",
       "      <td>642559</td>\n",
       "      <td>microwell-seq</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0008722</td>\n",
       "      <td>264166</td>\n",
       "      <td>279635</td>\n",
       "      <td>Drop-seq</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0008919</td>\n",
       "      <td>89477</td>\n",
       "      <td>206754</td>\n",
       "      <td>Seq-Well</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0008931</td>\n",
       "      <td>78750</td>\n",
       "      <td>188248</td>\n",
       "      <td>Smart-seq2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>18</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0700003</td>\n",
       "      <td>146278</td>\n",
       "      <td>177276</td>\n",
       "      <td>BD Rhapsody Whole Transcriptome Analysis</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>9</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0009901</td>\n",
       "      <td>42397</td>\n",
       "      <td>121394</td>\n",
       "      <td>10x 3' v1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>12</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0010183</td>\n",
       "      <td>58981</td>\n",
       "      <td>117962</td>\n",
       "      <td>single cell library construction</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>19</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0700004</td>\n",
       "      <td>96145</td>\n",
       "      <td>96145</td>\n",
       "      <td>BD Rhapsody Targeted mRNA</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0008780</td>\n",
       "      <td>25652</td>\n",
       "      <td>51304</td>\n",
       "      <td>inDrop</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>6</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0008995</td>\n",
       "      <td>0</td>\n",
       "      <td>29128</td>\n",
       "      <td>10x technology</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>5</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0008953</td>\n",
       "      <td>4693</td>\n",
       "      <td>9386</td>\n",
       "      <td>STRT-seq</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>11</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0010010</td>\n",
       "      <td>3105</td>\n",
       "      <td>5244</td>\n",
       "      <td>CEL-seq2</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    soma_joinid      organism category ontology_term_id  unique_cell_count  \\\n",
       "10           10  Homo sapiens    assay      EFO:0009922           11845077   \n",
       "7             7  Homo sapiens    assay      EFO:0009899            7559102   \n",
       "14           14  Homo sapiens    assay      EFO:0011025            3872375   \n",
       "13           13  Homo sapiens    assay      EFO:0010550            4062980   \n",
       "8             8  Homo sapiens    assay      EFO:0009900            2930054   \n",
       "17           17  Homo sapiens    assay      EFO:0030004             915037   \n",
       "16           16  Homo sapiens    assay      EFO:0030003             744798   \n",
       "15           15  Homo sapiens    assay      EFO:0030002             625175   \n",
       "1             1  Homo sapiens    assay      EFO:0008722             264166   \n",
       "3             3  Homo sapiens    assay      EFO:0008919              89477   \n",
       "4             4  Homo sapiens    assay      EFO:0008931              78750   \n",
       "18           18  Homo sapiens    assay      EFO:0700003             146278   \n",
       "9             9  Homo sapiens    assay      EFO:0009901              42397   \n",
       "12           12  Homo sapiens    assay      EFO:0010183              58981   \n",
       "19           19  Homo sapiens    assay      EFO:0700004              96145   \n",
       "2             2  Homo sapiens    assay      EFO:0008780              25652   \n",
       "6             6  Homo sapiens    assay      EFO:0008995                  0   \n",
       "5             5  Homo sapiens    assay      EFO:0008953               4693   \n",
       "11           11  Homo sapiens    assay      EFO:0010010               3105   \n",
       "\n",
       "    total_cell_count                                     label  \n",
       "10          25597563                                 10x 3' v3  \n",
       "7           12638794                                 10x 3' v2  \n",
       "14           6139786                                 10x 5' v1  \n",
       "13           5064268                               sci-RNA-seq  \n",
       "8            3139770                                 10x 5' v2  \n",
       "17           1084235            10x 5' transcription profiling  \n",
       "16            811422            10x 3' transcription profiling  \n",
       "15            642559                             microwell-seq  \n",
       "1             279635                                  Drop-seq  \n",
       "3             206754                                  Seq-Well  \n",
       "4             188248                                Smart-seq2  \n",
       "18            177276  BD Rhapsody Whole Transcriptome Analysis  \n",
       "9             121394                                 10x 3' v1  \n",
       "12            117962          single cell library construction  \n",
       "19             96145                 BD Rhapsody Targeted mRNA  \n",
       "2              51304                                    inDrop  \n",
       "6              29128                            10x technology  \n",
       "5               9386                                  STRT-seq  \n",
       "11              5244                                  CEL-seq2  "
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "census_human_assays = census_counts.query(\"organism == 'Homo sapiens' & category == 'assay'\")\n",
    "census_human_assays.sort_values(\"total_cell_count\", ascending=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Example: number of microglial cells in the Census\n",
    "\n",
    "If you have a specific term from any of the categories shown above you can directly find out the number of cells for that term.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:12.287182Z",
     "iopub.status.busy": "2023-07-28T14:20:12.286720Z",
     "iopub.status.idle": "2023-07-28T14:20:12.294371Z",
     "shell.execute_reply": "2023-07-28T14:20:12.293903Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>organism</th>\n",
       "      <th>category</th>\n",
       "      <th>ontology_term_id</th>\n",
       "      <th>unique_cell_count</th>\n",
       "      <th>total_cell_count</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>69</th>\n",
       "      <td>69</td>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>cell_type</td>\n",
       "      <td>CL:0000129</td>\n",
       "      <td>268114</td>\n",
       "      <td>370771</td>\n",
       "      <td>microglial cell</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1038</th>\n",
       "      <td>1038</td>\n",
       "      <td>Mus musculus</td>\n",
       "      <td>cell_type</td>\n",
       "      <td>CL:0000129</td>\n",
       "      <td>48998</td>\n",
       "      <td>62617</td>\n",
       "      <td>microglial cell</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      soma_joinid      organism   category ontology_term_id  \\\n",
       "69             69  Homo sapiens  cell_type       CL:0000129   \n",
       "1038         1038  Mus musculus  cell_type       CL:0000129   \n",
       "\n",
       "      unique_cell_count  total_cell_count            label  \n",
       "69               268114            370771  microglial cell  \n",
       "1038              48998             62617  microglial cell  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "census_counts.query(\"label == 'microglial cell'\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Understanding Census contents beyond the summary tables\n",
    "\n",
    "While using the pre-computed tables in `census[\"census_info\"]` is an easy and quick way to understand the contents of the Census, it falls short if you want to learn more about certain slices of the Census.\n",
    "\n",
    "For example, you may want to learn more about:\n",
    "\n",
    "- What are the cell types available for human liver?\n",
    "- What are the total number of cells in all lung datasets stratified by sequencing technology?\n",
    "- What is the sex distribution of all cells from brain in mouse?\n",
    "- What are the diseases available for T cells?\n",
    "\n",
    "All of these questions can be answered by directly querying the cell metadata as shown in the examples below.\n",
    "\n",
    "### Example: all cell types available in human\n",
    "\n",
    "To exemplify the process of accessing and slicing cell metadata for summary stats, let's start with a trivial example and take a look at all human cell types available in the Census:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:12.296812Z",
     "iopub.status.busy": "2023-07-28T14:20:12.296405Z",
     "iopub.status.idle": "2023-07-28T14:20:15.844398Z",
     "shell.execute_reply": "2023-07-28T14:20:15.843860Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cell_type</th>\n",
       "      <th>is_primary_data</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>syncytiotrophoblast cell</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>placental villous trophoblast</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>syncytiotrophoblast cell</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>syncytiotrophoblast cell</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>extravillous trophoblast</td>\n",
       "      <td>False</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56400868</th>\n",
       "      <td>pericyte</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56400869</th>\n",
       "      <td>pericyte</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56400870</th>\n",
       "      <td>pericyte</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56400871</th>\n",
       "      <td>pericyte</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>56400872</th>\n",
       "      <td>pericyte</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>56400873 rows × 2 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                              cell_type  is_primary_data\n",
       "0              syncytiotrophoblast cell            False\n",
       "1         placental villous trophoblast            False\n",
       "2              syncytiotrophoblast cell            False\n",
       "3              syncytiotrophoblast cell            False\n",
       "4              extravillous trophoblast            False\n",
       "...                                 ...              ...\n",
       "56400868                       pericyte             True\n",
       "56400869                       pericyte             True\n",
       "56400870                       pericyte             True\n",
       "56400871                       pericyte             True\n",
       "56400872                       pericyte             True\n",
       "\n",
       "[56400873 rows x 2 columns]"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "human_cell_types = (\n",
    "    census[\"census_data\"][\"homo_sapiens\"].obs.read(column_names=[\"cell_type\", \"is_primary_data\"]).concat().to_pandas()\n",
    ")\n",
    "human_cell_types"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The number of rows is the total number of cells for humans. Now, if you wish to get the cell counts per cell type we can perform some `pandas` operations on this object.\n",
    "\n",
    "In addition, we will only focus on cells that are marked with `is_primary_data=True` as this ensures we de-duplicate cells that appear more than once in CELLxGENE Discover.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:15.846897Z",
     "iopub.status.busy": "2023-07-28T14:20:15.846613Z",
     "iopub.status.idle": "2023-07-28T14:20:18.453082Z",
     "shell.execute_reply": "2023-07-28T14:20:18.452533Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(33364242, 1)"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "human_cell_types = (\n",
    "    census[\"census_data\"][\"homo_sapiens\"]\n",
    "    .obs.read(column_names=[\"cell_type\"], value_filter=\"is_primary_data == True\")\n",
    "    .concat()\n",
    "    .to_pandas()\n",
    ")\n",
    "\n",
    "human_cell_types = human_cell_types[[\"cell_type\"]]\n",
    "human_cell_types.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This is the number of unique cells. Now let's look at the counts per cell type:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:18.455764Z",
     "iopub.status.busy": "2023-07-28T14:20:18.455499Z",
     "iopub.status.idle": "2023-07-28T14:20:20.602756Z",
     "shell.execute_reply": "2023-07-28T14:20:20.602220Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "cell_type                                      \n",
       "neuron                                             2673669\n",
       "glutamatergic neuron                               1541605\n",
       "CD4-positive, alpha-beta T cell                    1258976\n",
       "CD8-positive, alpha-beta T cell                    1235987\n",
       "classical monocyte                                 1030996\n",
       "                                                    ...   \n",
       "microfold cell of epithelium of small intestine         19\n",
       "mature conventional dendritic cell                      17\n",
       "serous cell of epithelium of bronchus                   15\n",
       "sperm                                                   11\n",
       "type N enteroendocrine cell                             10\n",
       "Name: count, Length: 599, dtype: int64"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "human_cell_type_counts = human_cell_types.value_counts()\n",
    "human_cell_type_counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This shows you that the most abundant cell types are \"glutamatergic neuron\", \"CD8-positive, alpha-beta T cell\", and \"CD4-positive, alpha-beta T cell\".\n",
    "\n",
    "Now let's take a look at the number of unique cell types:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:20.605245Z",
     "iopub.status.busy": "2023-07-28T14:20:20.604980Z",
     "iopub.status.idle": "2023-07-28T14:20:20.608595Z",
     "shell.execute_reply": "2023-07-28T14:20:20.608117Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(599,)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "human_cell_type_counts.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That is the total number of different cell types for human.\n",
    "\n",
    "All the information in this example can be quickly obtained from the summary table at `census[\"census-info\"][\"summary_cell_counts\"]`.\n",
    "\n",
    "The examples below are more complex and can only be achieved by accessing the cell metadata.\n",
    "\n",
    "### Example: cell types available in human liver\n",
    "\n",
    "Similar to the example above, we can learn what cell types are available for a specific tissue, e.g. liver.\n",
    "\n",
    "To achieve this goal we just need to limit our cell metadata to that tissue. We will use the information in the cell metadata variable `tissue_general`. This variable contains the high-level tissue label for all cells in the Census:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:20.610968Z",
     "iopub.status.busy": "2023-07-28T14:20:20.610624Z",
     "iopub.status.idle": "2023-07-28T14:20:21.566043Z",
     "shell.execute_reply": "2023-07-28T14:20:21.565314Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "cell_type\n",
       "T cell                               85739\n",
       "hepatoblast                          58447\n",
       "neoplastic cell                      52431\n",
       "erythroblast                         45605\n",
       "monocyte                             31388\n",
       "                                     ...  \n",
       "pulmonary artery endothelial cell        1\n",
       "germinal center B cell                   1\n",
       "enteroendocrine cell                     1\n",
       "type I pneumocyte                        1\n",
       "group 2 innate lymphoid cell             1\n",
       "Name: count, Length: 126, dtype: int64"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "human_liver_cell_types = (\n",
    "    census[\"census_data\"][\"homo_sapiens\"]\n",
    "    .obs.read(column_names=[\"cell_type\"], value_filter=\"is_primary_data == True and tissue_general == 'liver'\")\n",
    "    .concat()\n",
    "    .to_pandas()\n",
    ")\n",
    "\n",
    "human_liver_cell_types[\"cell_type\"].value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These are the cell types and their cell counts in the human liver.\n",
    "\n",
    "### Example: diseased T cells in human tissues\n",
    "\n",
    "In this example we are going to get the counts for all diseased cells annotated as T cells. For the sake of the example we will focus on \"CD8-positive, alpha-beta T cell\" and \"CD4-positive, alpha-beta T cell\":\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:21.568806Z",
     "iopub.status.busy": "2023-07-28T14:20:21.568542Z",
     "iopub.status.idle": "2023-07-28T14:20:23.436424Z",
     "shell.execute_reply": "2023-07-28T14:20:23.435878Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "disease                                tissue_general    \n",
       "B-cell non-Hodgkin lymphoma            blood                  62499\n",
       "COVID-19                               blood                 819428\n",
       "                                       lung                   30578\n",
       "                                       nose                      13\n",
       "                                       respiratory system         4\n",
       "                                       saliva                    41\n",
       "Crohn disease                          colon                  17490\n",
       "                                       small intestine        52029\n",
       "Down syndrome                          bone marrow              181\n",
       "breast cancer                          breast                  1850\n",
       "chronic obstructive pulmonary disease  lung                    9382\n",
       "chronic rhinitis                       nose                     909\n",
       "clear cell renal carcinoma             blood                   6548\n",
       "                                       kidney                 20540\n",
       "                                       lymph node                36\n",
       "cystic fibrosis                        lung                       7\n",
       "follicular lymphoma                    lymph node              1089\n",
       "influenza                              blood                   8871\n",
       "interstitial lung disease              lung                    1803\n",
       "kidney benign neoplasm                 blood                     20\n",
       "                                       kidney                    10\n",
       "kidney oncocytoma                      blood                     16\n",
       "                                       kidney                  2408\n",
       "lung adenocarcinoma                    adrenal gland            205\n",
       "                                       brain                   3274\n",
       "                                       liver                    507\n",
       "                                       lung                  215013\n",
       "                                       lymph node             24969\n",
       "                                       pleural fluid          11558\n",
       "lung large cell carcinoma              lung                    5922\n",
       "lymphangioleiomyomatosis               lung                     513\n",
       "non-small cell lung carcinoma          lung                   36573\n",
       "nonpapillary renal cell carcinoma      adipose tissue           243\n",
       "                                       adrenal gland           4828\n",
       "                                       blood                    288\n",
       "                                       blood clot              1717\n",
       "                                       kidney                 69136\n",
       "pleomorphic carcinoma                  lung                    1715\n",
       "pneumonia                              lung                     856\n",
       "pulmonary fibrosis                     lung                    1671\n",
       "respiratory system disorder            blood                  34301\n",
       "squamous cell lung carcinoma           lung                   52053\n",
       "                                       lymph node               100\n",
       "systemic lupus erythematosus           blood                 355471\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "t_cells_list = [\"CD8-positive, alpha-beta T cell\", \"CD4-positive, alpha-beta T cell\"]\n",
    "\n",
    "t_cells_diseased = (\n",
    "    census[\"census_data\"][\"homo_sapiens\"]\n",
    "    .obs.read(\n",
    "        column_names=[\"disease\", \"tissue_general\"],\n",
    "        value_filter=f\"is_primary_data == True and cell_type in {t_cells_list} and disease != 'normal'\",\n",
    "    )\n",
    "    .concat()\n",
    "    .to_pandas()\n",
    ")\n",
    "\n",
    "t_cells_diseased = t_cells_diseased[[\"disease\", \"tissue_general\"]].value_counts(sort=False)\n",
    "t_cells_diseased"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These are the cell counts annotated with the indicated disease across human tissues for \"CD8-positive, alpha-beta T cell\" or \"CD4-positive, alpha-beta T cell\".\n",
    "\n",
    "And, don't forget to close the census!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:20:23.438957Z",
     "iopub.status.busy": "2023-07-28T14:20:23.438667Z",
     "iopub.status.idle": "2023-07-28T14:20:23.441777Z",
     "shell.execute_reply": "2023-07-28T14:20:23.441276Z"
    }
   },
   "outputs": [],
   "source": [
    "census.close()\n",
    "del census"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  },
  "vscode": {
   "interpreter": {
    "hash": "3da8ec1c162cd849e59e6ea2824b2e353dce799884e910aae99411be5277f953"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}