{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Summarizing cell and gene metadata\n",
    "\n",
    "This notebook provides examples for basic axis metadata handling using Pandas. The Census stores `obs` (cell) and `var` (gene) metadata in `SOMADataFrame` objects via the [TileDB-SOMA API](https://github.com/single-cell-data/TileDB-SOMA) ([documentation](https://tiledbsoma.readthedocs.io/en/latest/)), which can be queried and read as a Pandas `DataFrame` using `TileDB-SOMA`. \n",
    "\n",
    "Note that Pandas `DataFrame` is an in-memory object, therefore queries should be small enough for results to fit in memory.\n",
    "\n",
    "**Contents**\n",
    "\n",
    "1. Opening the Census\n",
    "1. Summarizing cell metadata\n",
    "   1. Example: Summarize all cell types\n",
    "   1. Example: Summarize a subset of cell types, selected with a `value_filter`\n",
    "1. Full Census metadata stats\n",
    "\n",
    "⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable `is_primary_data` which is described in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#repeated-data).\n",
    "\n",
    "## Opening the Census\n",
    "\n",
    "The `cellxgene_census` python package contains a convenient API to open the latest version of the Census. If you open the Census, you should close it. `open_soma()` returns a context, so you can open/close it in several ways, like a Python file handle. The context manager is preferred, as it will automatically close upon an error raise.\n",
    "\n",
    "You can learn more about the `cellxgene_census` methods by accessing their corresponding documentation via `help()`. For example `help(cellxgene_census.open_soma)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:30:26.404893Z",
     "iopub.status.busy": "2023-07-28T14:30:26.404633Z",
     "iopub.status.idle": "2023-07-28T14:30:29.216830Z",
     "shell.execute_reply": "2023-07-28T14:30:29.216225Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n",
      "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    }
   ],
   "source": [
    "import cellxgene_census\n",
    "\n",
    "# Preferred: use a Python context manager\n",
    "with cellxgene_census.open_soma() as census:\n",
    "    ...\n",
    "\n",
    "# or, directly open the census (don't forget to close it!)\n",
    "census = cellxgene_census.open_soma()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summarizing cell metadata\n",
    "\n",
    "Once the Census is open you can use its `TileDB-SOMA` methods as it is itself a `SOMACollection`. You can thus access the metadata `SOMADataFrame` objects encoding cell and gene metadata.\n",
    "\n",
    "Tips:\n",
    "\n",
    "- You can read an _entire_ `SOMADataFrame` into a Pandas `DataFrame` using `soma_df.read().concat().to_pandas()`, allowing the use of the standard Pandas API.\n",
    "- Queries will be much faster if you request only the DataFrame columns required for your analysis (e.g., `column_names=[\"cell_type_ontology_term_id\"]`).\n",
    "- You can also further refine query results by using a `value_filter`, which will filter the census for matching records.\n",
    "\n",
    "### Example: Summarize all cell types\n",
    "\n",
    "This example reads the cell metadata (`obs`) into a Pandas DataFrame, and summarizes in a variety of ways using Pandas API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:30:29.219969Z",
     "iopub.status.busy": "2023-07-28T14:30:29.219515Z",
     "iopub.status.idle": "2023-07-28T14:30:36.711046Z",
     "shell.execute_reply": "2023-07-28T14:30:36.710444Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 613 cell types in the Census! The first 10 are: ['CL:0000525', 'CL:2000060', 'CL:0008036', 'CL:0002488', 'CL:0002343', 'CL:0000084', 'CL:0001078', 'CL:0000815', 'CL:0000235', 'CL:3000001']\n",
      "\n",
      "The top 10 cell types and their counts are:\n",
      "cell_type_ontology_term_id\n",
      "CL:0000540    7665340\n",
      "CL:0000679    1894047\n",
      "CL:0000128    1881077\n",
      "CL:0000624    1508920\n",
      "CL:0000625    1477453\n",
      "CL:0000235    1419507\n",
      "CL:0000057    1397813\n",
      "CL:0000860    1369142\n",
      "CL:0000003    1308000\n",
      "CL:4023040    1229658\n",
      "Name: count, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "# Read entire _obs_ into a pandas dataframe.\n",
    "obs_df = cellxgene_census.get_obs(census, \"homo_sapiens\", column_names=[\"cell_type_ontology_term_id\"])\n",
    "\n",
    "# Use Pandas API to find all unique values in the `cell_type_ontology_term_id` column.\n",
    "unique_cell_type_ontology_term_id = obs_df.cell_type_ontology_term_id.unique()\n",
    "\n",
    "# Display only the first 10, as there are a LOT!\n",
    "print(\n",
    "    f\"There are {len(unique_cell_type_ontology_term_id)} cell types in the Census! The first 10 are:\",\n",
    "    unique_cell_type_ontology_term_id[0:10].tolist(),\n",
    ")\n",
    "\n",
    "# Using Pandas API, count the instances of each cell type term and return the top 10.\n",
    "top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]\n",
    "print(\"\\nThe top 10 cell types and their counts are:\")\n",
    "print(top_10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example: Summarize a subset of cell types, selected with a `value_filter`\n",
    "\n",
    "This example utilizes a SOMA \"value filter\" to read the subset of cells with `tissue_ontology_term_id` equal to `UBERON:0002048` (lung tissue), and summarizes the query result using Pandas."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:30:36.713974Z",
     "iopub.status.busy": "2023-07-28T14:30:36.713681Z",
     "iopub.status.idle": "2023-07-28T14:30:38.595585Z",
     "shell.execute_reply": "2023-07-28T14:30:38.595010Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "There are 185 cell types in the Census where tissue_ontology_term_id == UBERON:0002048! The first 10 are: ['CL:0002063', 'CL:0000775', 'CL:0001044', 'CL:0001050', 'CL:0000814', 'CL:0000071', 'CL:0000192', 'CL:0002503', 'CL:0000235', 'CL:0002370']\n",
      "\n",
      "Top 10 cell types where tissue_ontology_term_id == UBERON:0002048\n",
      "cell_type_ontology_term_id\n",
      "CL:0000003    562038\n",
      "CL:0000583    526859\n",
      "CL:0000625    323985\n",
      "CL:0000624    323610\n",
      "CL:0000235    266333\n",
      "CL:0002063    255425\n",
      "CL:0000860    205013\n",
      "CL:0000623    164944\n",
      "CL:0001064    149067\n",
      "CL:0002632    132243\n",
      "Name: count, dtype: int64\n"
     ]
    }
   ],
   "source": [
    "# Count cell_type occurrences for cells with tissue == 'lung'\n",
    "\n",
    "# Read cell_type terms for cells which have a specific tissue term\n",
    "LUNG_TISSUE = \"UBERON:0002048\"\n",
    "\n",
    "obs_df = cellxgene_census.get_obs(\n",
    "    census,\n",
    "    \"homo_sapiens\",\n",
    "    column_names=[\"cell_type_ontology_term_id\"],\n",
    "    value_filter=f\"tissue_ontology_term_id == '{LUNG_TISSUE}'\",\n",
    ")\n",
    "\n",
    "# Use Pandas API to find all unique values in the `cell_type_ontology_term_id` column.\n",
    "unique_cell_type_ontology_term_id = obs_df.cell_type_ontology_term_id.unique()\n",
    "\n",
    "print(\n",
    "    f\"There are {len(unique_cell_type_ontology_term_id)} cell types in the Census where tissue_ontology_term_id == {LUNG_TISSUE}! The first 10 are:\",\n",
    "    unique_cell_type_ontology_term_id[0:10].tolist(),\n",
    ")\n",
    "\n",
    "# Use Pandas API to count, and grab 10 most common\n",
    "top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]\n",
    "print(f\"\\nTop 10 cell types where tissue_ontology_term_id == {LUNG_TISSUE}\")\n",
    "print(top_10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also define much more complex value filters. For example:\n",
    "\n",
    "* combine terms with `and` and `or`\n",
    "* use the `in` operator to query on multiple values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:30:38.598036Z",
     "iopub.status.busy": "2023-07-28T14:30:38.597767Z",
     "iopub.status.idle": "2023-07-28T14:30:39.797374Z",
     "shell.execute_reply": "2023-07-28T14:30:39.796849Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "cell_type_ontology_term_id\n",
       "CL:0000746    49929\n",
       "CL:0008034    33361\n",
       "CL:0002548    33180\n",
       "CL:0002131    30915\n",
       "CL:0000115    30054\n",
       "CL:0000003    18391\n",
       "CL:0000763    14408\n",
       "CL:0000669    13552\n",
       "CL:0000057     9690\n",
       "CL:0002144     9025\n",
       "Name: count, dtype: int64"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "# You can also do more complex queries, such as testing for inclusion in a list of values and \"and\" operations\n",
    "VENTRICLES = [\"UBERON:0002082\", \"UBERON:OOO2084\", \"UBERON:0002080\"]\n",
    "\n",
    "obs_df = cellxgene_census.get_obs(\n",
    "    census,\n",
    "    \"homo_sapiens\",\n",
    "    column_names=[\"cell_type_ontology_term_id\"],\n",
    "    value_filter=f\"tissue_ontology_term_id in {VENTRICLES} and is_primary_data == True\",\n",
    ")\n",
    "\n",
    "# Use Pandas API to summarize\n",
    "top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]\n",
    "display(top_10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Full Census metadata stats\n",
    "\n",
    "This example queries all organisms in the Census, and summarizes the diversity of various metadata lables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:30:39.799869Z",
     "iopub.status.busy": "2023-07-28T14:30:39.799605Z",
     "iopub.status.idle": "2023-07-28T14:30:54.607269Z",
     "shell.execute_reply": "2023-07-28T14:30:54.606651Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Complete census contains 61656118 cells.\n",
      "mus_musculus\n",
      "\tUnique cell_type_ontology_term_id values: 248\n",
      "\tUnique assay_ontology_term_id values: 9\n",
      "\tUnique tissue_ontology_term_id values: 66\n",
      "homo_sapiens\n",
      "\tUnique cell_type_ontology_term_id values: 613\n",
      "\tUnique assay_ontology_term_id values: 19\n",
      "\tUnique tissue_ontology_term_id values: 220\n"
     ]
    }
   ],
   "source": [
    "COLS_TO_QUERY = [\n",
    "    \"cell_type_ontology_term_id\",\n",
    "    \"assay_ontology_term_id\",\n",
    "    \"tissue_ontology_term_id\",\n",
    "]\n",
    "\n",
    "obs_df = {\n",
    "    name: cellxgene_census.get_obs(census, name, column_names=COLS_TO_QUERY) for name in census[\"census_data\"].keys()\n",
    "}\n",
    "\n",
    "# Use Pandas API to summarize each organism\n",
    "print(f\"Complete census contains {sum(len(df) for df in obs_df.values())} cells.\")\n",
    "for organism, df in obs_df.items():\n",
    "    print(organism)\n",
    "    for col in COLS_TO_QUERY:\n",
    "        print(f\"\\tUnique {col} values: {len(df[col].unique())}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Close the census"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:30:54.609746Z",
     "iopub.status.busy": "2023-07-28T14:30:54.609482Z",
     "iopub.status.idle": "2023-07-28T14:30:54.612526Z",
     "shell.execute_reply": "2023-07-28T14:30:54.612022Z"
    }
   },
   "outputs": [],
   "source": [
    "census.close()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  },
  "vscode": {
   "interpreter": {
    "hash": "3da8ec1c162cd849e59e6ea2824b2e353dce799884e910aae99411be5277f953"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}