{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploring pre-calculated summary cell counts\n",
    "\n",
    "This tutorial describes how to access pre-calculated summary cell counts. Each Census contains a top-level dataframe summarizing counts of various cell labels, this is the `census_summary_cell_counts` dataframe . You can read this into a Pandas DataFrame\n",
    "\n",
    "**Contents**\n",
    "\n",
    "1. Fetching the `census_summary_cell_counts` dataframe.\n",
    "2. Creating summary counts beyond pre-calculated values.\n",
    "\n",
    "⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable `is_primary_data` which is described in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#repeated-data).\n",
    "\n",
    "## Fetching the `census_summary_cell_counts` dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:17:28.143432Z",
     "iopub.status.busy": "2023-07-28T16:17:28.143007Z",
     "iopub.status.idle": "2023-07-28T16:17:31.207795Z",
     "shell.execute_reply": "2023-07-28T16:17:31.207159Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>organism</th>\n",
       "      <th>category</th>\n",
       "      <th>ontology_term_id</th>\n",
       "      <th>unique_cell_count</th>\n",
       "      <th>total_cell_count</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>all</td>\n",
       "      <td>na</td>\n",
       "      <td>33364242</td>\n",
       "      <td>56400873</td>\n",
       "      <td>na</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0008722</td>\n",
       "      <td>264166</td>\n",
       "      <td>279635</td>\n",
       "      <td>Drop-seq</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0008780</td>\n",
       "      <td>25652</td>\n",
       "      <td>51304</td>\n",
       "      <td>inDrop</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0008919</td>\n",
       "      <td>89477</td>\n",
       "      <td>206754</td>\n",
       "      <td>Seq-Well</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Homo sapiens</td>\n",
       "      <td>assay</td>\n",
       "      <td>EFO:0008931</td>\n",
       "      <td>78750</td>\n",
       "      <td>188248</td>\n",
       "      <td>Smart-seq2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1357</th>\n",
       "      <td>Mus musculus</td>\n",
       "      <td>tissue_general</td>\n",
       "      <td>UBERON:0002113</td>\n",
       "      <td>179684</td>\n",
       "      <td>208324</td>\n",
       "      <td>kidney</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1358</th>\n",
       "      <td>Mus musculus</td>\n",
       "      <td>tissue_general</td>\n",
       "      <td>UBERON:0002365</td>\n",
       "      <td>15577</td>\n",
       "      <td>31154</td>\n",
       "      <td>exocrine gland</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1359</th>\n",
       "      <td>Mus musculus</td>\n",
       "      <td>tissue_general</td>\n",
       "      <td>UBERON:0002367</td>\n",
       "      <td>37715</td>\n",
       "      <td>130135</td>\n",
       "      <td>prostate gland</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1360</th>\n",
       "      <td>Mus musculus</td>\n",
       "      <td>tissue_general</td>\n",
       "      <td>UBERON:0002368</td>\n",
       "      <td>13322</td>\n",
       "      <td>26644</td>\n",
       "      <td>endocrine gland</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1361</th>\n",
       "      <td>Mus musculus</td>\n",
       "      <td>tissue_general</td>\n",
       "      <td>UBERON:0002371</td>\n",
       "      <td>90225</td>\n",
       "      <td>144962</td>\n",
       "      <td>bone marrow</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>1362 rows × 6 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "          organism        category ontology_term_id  unique_cell_count  \\\n",
       "0     Homo sapiens             all               na           33364242   \n",
       "1     Homo sapiens           assay      EFO:0008722             264166   \n",
       "2     Homo sapiens           assay      EFO:0008780              25652   \n",
       "3     Homo sapiens           assay      EFO:0008919              89477   \n",
       "4     Homo sapiens           assay      EFO:0008931              78750   \n",
       "...            ...             ...              ...                ...   \n",
       "1357  Mus musculus  tissue_general   UBERON:0002113             179684   \n",
       "1358  Mus musculus  tissue_general   UBERON:0002365              15577   \n",
       "1359  Mus musculus  tissue_general   UBERON:0002367              37715   \n",
       "1360  Mus musculus  tissue_general   UBERON:0002368              13322   \n",
       "1361  Mus musculus  tissue_general   UBERON:0002371              90225   \n",
       "\n",
       "      total_cell_count            label  \n",
       "0             56400873               na  \n",
       "1               279635         Drop-seq  \n",
       "2                51304           inDrop  \n",
       "3               206754         Seq-Well  \n",
       "4               188248       Smart-seq2  \n",
       "...                ...              ...  \n",
       "1357            208324           kidney  \n",
       "1358             31154   exocrine gland  \n",
       "1359            130135   prostate gland  \n",
       "1360             26644  endocrine gland  \n",
       "1361            144962      bone marrow  \n",
       "\n",
       "[1362 rows x 6 columns]"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import cellxgene_census\n",
    "\n",
    "census = cellxgene_census.open_soma()\n",
    "census_summary_cell_counts = census[\"census_info\"][\"summary_cell_counts\"].read().concat().to_pandas()\n",
    "\n",
    "# Dropping the soma_joinid column as it isn't useful in this demo\n",
    "census_summary_cell_counts = census_summary_cell_counts.drop(columns=[\"soma_joinid\"])\n",
    "\n",
    "census_summary_cell_counts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Creating summary counts beyond pre-calculated values.\n",
    "\n",
    "The dataframe above is precomputed from the experiments in the Census, providing a quick overview of the Census contents.\n",
    "\n",
    "You can do similar group statistics using Pandas `groupby` functions. \n",
    "\n",
    "The code below reproduces the above counts using full `obs` dataframe in the `Homo_sapiens` experiment.\n",
    "\n",
    "Keep in mind that the Census is very large, and any queries will return significant amount of data. You can manage that by narrowing the query request using `column_names` and `value_filter` in your query."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:17:31.210438Z",
     "iopub.status.busy": "2023-07-28T16:17:31.210021Z",
     "iopub.status.idle": "2023-07-28T16:17:43.764065Z",
     "shell.execute_reply": "2023-07-28T16:17:43.763547Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cell_type_ontology_term_id</th>\n",
       "      <th>cell_type</th>\n",
       "      <th>size</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>CL:0000001</td>\n",
       "      <td>primary cultured cell</td>\n",
       "      <td>80</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>CL:0000003</td>\n",
       "      <td>native cell</td>\n",
       "      <td>1308000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>CL:0000006</td>\n",
       "      <td>neuronal receptor cell</td>\n",
       "      <td>2502</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>CL:0000015</td>\n",
       "      <td>male germ cell</td>\n",
       "      <td>621</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>CL:0000019</td>\n",
       "      <td>sperm</td>\n",
       "      <td>22</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>608</th>\n",
       "      <td>CL:4028006</td>\n",
       "      <td>alveolar type 2 fibroblast cell</td>\n",
       "      <td>38250</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>609</th>\n",
       "      <td>CL:4030009</td>\n",
       "      <td>epithelial cell of proximal tubule segment 1</td>\n",
       "      <td>777</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>610</th>\n",
       "      <td>CL:4030011</td>\n",
       "      <td>epithelial cell of proximal tubule segment 3</td>\n",
       "      <td>989</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>611</th>\n",
       "      <td>CL:4030018</td>\n",
       "      <td>kidney connecting tubule principal cell</td>\n",
       "      <td>107</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>612</th>\n",
       "      <td>CL:4030023</td>\n",
       "      <td>respiratory hillock cell</td>\n",
       "      <td>10170</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>613 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "    cell_type_ontology_term_id                                     cell_type  \\\n",
       "0                   CL:0000001                         primary cultured cell   \n",
       "1                   CL:0000003                                   native cell   \n",
       "2                   CL:0000006                        neuronal receptor cell   \n",
       "3                   CL:0000015                                male germ cell   \n",
       "4                   CL:0000019                                         sperm   \n",
       "..                         ...                                           ...   \n",
       "608                 CL:4028006               alveolar type 2 fibroblast cell   \n",
       "609                 CL:4030009  epithelial cell of proximal tubule segment 1   \n",
       "610                 CL:4030011  epithelial cell of proximal tubule segment 3   \n",
       "611                 CL:4030018       kidney connecting tubule principal cell   \n",
       "612                 CL:4030023                      respiratory hillock cell   \n",
       "\n",
       "        size  \n",
       "0         80  \n",
       "1    1308000  \n",
       "2       2502  \n",
       "3        621  \n",
       "4         22  \n",
       "..       ...  \n",
       "608    38250  \n",
       "609      777  \n",
       "610      989  \n",
       "611      107  \n",
       "612    10170  \n",
       "\n",
       "[613 rows x 3 columns]"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "human = census[\"census_data\"][\"homo_sapiens\"]\n",
    "obs_df = human.obs.read(column_names=[\"cell_type_ontology_term_id\", \"cell_type\"]).concat().to_pandas()\n",
    "obs_df.groupby(by=[\"cell_type_ontology_term_id\", \"cell_type\"], as_index=False, observed=True).size()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Close the census when complete. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:17:43.766821Z",
     "iopub.status.busy": "2023-07-28T16:17:43.766325Z",
     "iopub.status.idle": "2023-07-28T16:17:43.769229Z",
     "shell.execute_reply": "2023-07-28T16:17:43.768748Z"
    }
   },
   "outputs": [],
   "source": [
    "census.close()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  },
  "vscode": {
   "interpreter": {
    "hash": "3da8ec1c162cd849e59e6ea2824b2e353dce799884e910aae99411be5277f953"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}