{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Understanding and filtering out duplicate cells\n",
    "\n",
    "This tutorial provides an explanation for the existence of duplicate cells in the Census, and it showcases different ways to handle these cells when performing queries on the Census using the `is_primary_data` cell metadata variable. \n",
    "\n",
    "**Contents**\n",
    "\n",
    "1. Why are there duplicate cells in the Census?\n",
    "2. An example: duplicate cells in the Tabula Muris Senis data.\n",
    "3. Filtering out duplicates cells.\n",
    "   1. Filtering out duplicate cells when reading the `obs` data frame.\n",
    "   2. Filtering out duplicate cells when creating an AnnData.\n",
    "   3. Filtering out duplicate cells for out-of-core operations.\n",
    "   \n",
    "## Why are there duplicate cells in the Census?\n",
    "\n",
    "Duplicate cells are labeled on the `is_primary_data` cell metadata variable as `False`. To learn more about this please take a look at the corresponding [section of the dataset schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#is_primary_data). \n",
    "\n",
    "The Census data is a concatenation of most RNA data from CZ CELLxGENE Discover and these data are ingested one dataset at a time. You can take a look at what data is included in the Census [here](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_schema.html).\n",
    "\n",
    "In some cases data from the same cell exists in different datasets, therefore cells can be duplicated throughout CELLxGENE Discover and by extension the Census. \n",
    "\n",
    "The following are a few examples where cells are duplicated in CELLxGENE Discover:\n",
    "\n",
    "* There are datasets that combine data from other, pre-existing datasets.\n",
    "\n",
    "> *For example [Tabula Sapiens](https://cellxgene.cziscience.com/collections/e5f58829-1a66-40b5-a624-9046778e74f5) has one dataset with all of its cells and separate datasets with cells divided by high-level lineage (i.e. immune, epithelial, stromal, endothelial)*\n",
    "\n",
    "* A dataset may provide a meta-analysis of pre-existing datasets.\n",
    "\n",
    "> *For example [Jin et al.](https://cellxgene.cziscience.com/collections/b9fc3d70-5a72-4479-a046-c2cc1ab19efc) performed a meta-analysis of COVID-19 data, and they included both the individual datasets as well as one concatenated dataset*\n",
    "\n",
    "The Census has all of these data to allow for the execution of dataset-based queries, which would be otherwise be limited if only non-duplicate cells were included.\n",
    "\n",
    "## An example: duplicate cells in the Tabula Muris Senis data\n",
    "\n",
    "Let's take a look at an example from the Census using the Tabula Muris Senis data. Some of its datasets contain duplicated cells.\n",
    "\n",
    "We can obtain cell metadata for the **main** Tabula Muris Senis dataset: \"All - A single-cell transcriptomic atlas characterizes ageing tissues in the mouse - 10x\", which contains the original (non-duplicated) cells.\n",
    "\n",
    "And remember we must include the `is_primary_data` column."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-05-17T15:37:06.246546Z",
     "iopub.status.busy": "2023-05-17T15:37:06.246069Z",
     "iopub.status.idle": "2023-05-17T15:37:08.867857Z",
     "shell.execute_reply": "2023-05-17T15:37:08.867253Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-05-15. Specify 'census_version=\"2023-05-15\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    }
   ],
   "source": [
    "import cellxgene_census\n",
    "\n",
    "tabula_muris_dataset_id = \"48b37086-25f7-4ecd-be66-f5bb378e3aea\"\n",
    "\n",
    "with cellxgene_census.open_soma() as census:\n",
    "    tabula_muris_obs = census[\"census_data\"][\"mus_musculus\"].obs.read(\n",
    "        value_filter=f\"dataset_id == '{tabula_muris_dataset_id}'\", column_names=[\"tissue\", \"is_primary_data\"]\n",
    "    )\n",
    "\n",
    "    tabula_muris_obs = tabula_muris_obs.concat().to_pandas()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's take a look at counts for the unique combinations of values.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-05-17T15:37:08.871706Z",
     "iopub.status.busy": "2023-05-17T15:37:08.870353Z",
     "iopub.status.idle": "2023-05-17T15:37:08.911114Z",
     "shell.execute_reply": "2023-05-17T15:37:08.910586Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tissue           is_primary_data  dataset_id                          \n",
       "bone marrow      True             48b37086-25f7-4ecd-be66-f5bb378e3aea    40220\n",
       "spleen           True             48b37086-25f7-4ecd-be66-f5bb378e3aea    35718\n",
       "limb muscle      True             48b37086-25f7-4ecd-be66-f5bb378e3aea    28867\n",
       "lung             True             48b37086-25f7-4ecd-be66-f5bb378e3aea    24540\n",
       "kidney           True             48b37086-25f7-4ecd-be66-f5bb378e3aea    21647\n",
       "tongue           True             48b37086-25f7-4ecd-be66-f5bb378e3aea    20680\n",
       "mammary gland    True             48b37086-25f7-4ecd-be66-f5bb378e3aea    12295\n",
       "thymus           True             48b37086-25f7-4ecd-be66-f5bb378e3aea     9275\n",
       "bladder lumen    True             48b37086-25f7-4ecd-be66-f5bb378e3aea     8945\n",
       "heart            True             48b37086-25f7-4ecd-be66-f5bb378e3aea     8613\n",
       "trachea          True             48b37086-25f7-4ecd-be66-f5bb378e3aea     7976\n",
       "liver            True             48b37086-25f7-4ecd-be66-f5bb378e3aea     7294\n",
       "adipose tissue   True             48b37086-25f7-4ecd-be66-f5bb378e3aea     6777\n",
       "pancreas         True             48b37086-25f7-4ecd-be66-f5bb378e3aea     6201\n",
       "skin of body     True             48b37086-25f7-4ecd-be66-f5bb378e3aea     4454\n",
       "large intestine  True             48b37086-25f7-4ecd-be66-f5bb378e3aea     1887\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tabula_muris_obs.value_counts()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see all cells across the tissues are labelled as `True` for `is_primary_data`.\n",
    "\n",
    "But what if we select cells from the dataset that only contains cells from the liver: \"Liver - A single-cell transcriptomic atlas characterizes ageing tissues in the mouse - 10x\".\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-05-17T15:37:08.913202Z",
     "iopub.status.busy": "2023-05-17T15:37:08.913060Z",
     "iopub.status.idle": "2023-05-17T15:37:09.968086Z",
     "shell.execute_reply": "2023-05-17T15:37:09.967626Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-05-15. Specify 'census_version=\"2023-05-15\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    }
   ],
   "source": [
    "tabula_muris_liver_dataset_id = \"6202a243-b713-4e12-9ced-c387f8483dea\"\n",
    "\n",
    "with cellxgene_census.open_soma() as census:\n",
    "    tabula_muris_liver_obs = census[\"census_data\"][\"mus_musculus\"].obs.read(\n",
    "        value_filter=f\"dataset_id == '{tabula_muris_liver_dataset_id}'\", column_names=[\"tissue\", \"is_primary_data\"]\n",
    "    )\n",
    "\n",
    "    tabula_muris_liver_obs = tabula_muris_liver_obs.concat().to_pandas()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And we take a look at counts for the unique combinations of values."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-05-17T15:37:09.970720Z",
     "iopub.status.busy": "2023-05-17T15:37:09.970563Z",
     "iopub.status.idle": "2023-05-17T15:37:09.976304Z",
     "shell.execute_reply": "2023-05-17T15:37:09.975953Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "tissue  is_primary_data  dataset_id                          \n",
       "liver   False            6202a243-b713-4e12-9ced-c387f8483dea    7294\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tabula_muris_liver_obs.value_counts()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see that:\n",
    "\n",
    "1. This dataset only contains cells from liver.\n",
    "2. All cells are labelled as `False` for `is_primary_data`. **This is because the cells are marked as duplicate cells of the main Tabula Muris Senis dataset.**\n",
    "\n",
    "##  Filtering out duplicate cells\n",
    "\n",
    "In some cases you may be interested in getting all cells for a specific biological context, for example *\"all natural killer cells from blood of female cells with COVID-19\"* but you need to be aware that there is a chance you end up with some duplicate cells.\n",
    "\n",
    "We therefore recommend that you always look at `is_primary_data` and use that information based on your needs.\n",
    "\n",
    "If you know *a priori* that you don't want duplicated cells this section shows you how to efficiently exclude them from your queries. \n",
    "\n",
    "### Filtering out duplicate cells when reading the `obs` data frame.\n",
    "\n",
    "Let's say you are interested in looking at the cell metadata of *\"all natural killer cells from blood of female cells with COVID-19\"* but you want to exclude duplicate cells, then you can use `value_filter` when reading the data frame to only include cells with `is_primary_data` as `True`.\n",
    "\n",
    "Let's first read the cell metadata including **all** cells:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-05-17T15:37:09.978023Z",
     "iopub.status.busy": "2023-05-17T15:37:09.977879Z",
     "iopub.status.idle": "2023-05-17T15:37:16.049592Z",
     "shell.execute_reply": "2023-05-17T15:37:16.048490Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-05-15. Specify 'census_version=\"2023-05-15\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    }
   ],
   "source": [
    "with cellxgene_census.open_soma() as census:\n",
    "    nk_cells = census[\"census_data\"][\"homo_sapiens\"].obs.read(\n",
    "        value_filter=\"cell_type == 'natural killer cell' \"\n",
    "        \"and disease == 'COVID-19' \"\n",
    "        \"and sex == 'female'\"\n",
    "        \"and tissue_general == 'blood'\"\n",
    "    )\n",
    "\n",
    "    nk_cells = nk_cells.concat().to_pandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-05-17T15:37:16.052588Z",
     "iopub.status.busy": "2023-05-17T15:37:16.052364Z",
     "iopub.status.idle": "2023-05-17T15:37:16.055971Z",
     "shell.execute_reply": "2023-05-17T15:37:16.055607Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(80935, 21)"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nk_cells.shape"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we repeat the query only using cells marked as `True` for `is_primary_data`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-05-17T15:37:16.057773Z",
     "iopub.status.busy": "2023-05-17T15:37:16.057628Z",
     "iopub.status.idle": "2023-05-17T15:37:22.371662Z",
     "shell.execute_reply": "2023-05-17T15:37:22.370999Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-05-15. Specify 'census_version=\"2023-05-15\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    }
   ],
   "source": [
    "with cellxgene_census.open_soma() as census:\n",
    "    nk_cells_primary = census[\"census_data\"][\"homo_sapiens\"].obs.read(\n",
    "        value_filter=\"cell_type == 'natural killer cell' \"\n",
    "        \"and disease == 'COVID-19' \"\n",
    "        \"and tissue_general == 'blood'\"\n",
    "        \"and sex == 'female'\"\n",
    "        \"and is_primary_data == True\"\n",
    "    )\n",
    "\n",
    "    nk_cells_primary = nk_cells_primary.concat().to_pandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-05-17T15:37:22.374378Z",
     "iopub.status.busy": "2023-05-17T15:37:22.374168Z",
     "iopub.status.idle": "2023-05-17T15:37:22.377211Z",
     "shell.execute_reply": "2023-05-17T15:37:22.376841Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(59109, 21)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nk_cells_primary.shape"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can see a clear reduction in the number of cells.\n",
    "\n",
    "### Filtering out duplicate cells when creating an AnnData\n",
    "\n",
    "You can also utilize `is_primary_data` on the `obs_value_filter` of `get_anndata`.\n",
    "\n",
    "Let's repeat the process above. First querying by including **all** cells. To reduce the bandwidth and memory usage, let's just fetch data for one gene. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-05-17T15:37:22.379352Z",
     "iopub.status.busy": "2023-05-17T15:37:22.379138Z",
     "iopub.status.idle": "2023-05-17T15:37:35.927633Z",
     "shell.execute_reply": "2023-05-17T15:37:35.926700Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-05-15. Specify 'census_version=\"2023-05-15\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    }
   ],
   "source": [
    "with cellxgene_census.open_soma() as census:\n",
    "    adata = cellxgene_census.get_anndata(\n",
    "        census,\n",
    "        organism=\"Homo sapiens\",\n",
    "        var_value_filter=\"feature_name == 'AQP5'\",\n",
    "        obs_value_filter=\"cell_type == 'natural killer cell' \"\n",
    "        \"and disease == 'COVID-19' \"\n",
    "        \"and sex == 'female'\"\n",
    "        \"and tissue_general == 'blood'\",\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-05-17T15:37:35.930595Z",
     "iopub.status.busy": "2023-05-17T15:37:35.930444Z",
     "iopub.status.idle": "2023-05-17T15:37:35.934725Z",
     "shell.execute_reply": "2023-05-17T15:37:35.933957Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "80935"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(adata.obs)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we repeat the query only using cells marked as `True` for `is_primary_data`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-05-17T15:37:35.936993Z",
     "iopub.status.busy": "2023-05-17T15:37:35.936846Z",
     "iopub.status.idle": "2023-05-17T15:37:46.880757Z",
     "shell.execute_reply": "2023-05-17T15:37:46.879659Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-05-15. Specify 'census_version=\"2023-05-15\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    }
   ],
   "source": [
    "with cellxgene_census.open_soma() as census:\n",
    "    adata_primary = cellxgene_census.get_anndata(\n",
    "        census,\n",
    "        organism=\"Homo sapiens\",\n",
    "        var_value_filter=\"feature_name == 'AQP5'\",\n",
    "        obs_value_filter=\"cell_type == 'natural killer cell' \"\n",
    "        \"and disease == 'COVID-19' \"\n",
    "        \"and sex == 'female' \"\n",
    "        \"and tissue_general == 'blood'\"\n",
    "        \"and is_primary_data == True\",\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-05-17T15:37:46.883574Z",
     "iopub.status.busy": "2023-05-17T15:37:46.883432Z",
     "iopub.status.idle": "2023-05-17T15:37:46.888006Z",
     "shell.execute_reply": "2023-05-17T15:37:46.887189Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "59109"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(adata_primary.obs)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case you can also observe a clear reduction in the number of cells.\n",
    "\n",
    "#### Filtering out duplicate cells for out-of-core operations.\n",
    "\n",
    "Finally we can utilize `is_primary_data` on the `value_filter` of `obs` of an \"Axis Query\" to perform out-of-core operations.\n",
    "\n",
    "In this example we only include the version with duplicated cells removed."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-05-17T15:37:46.890416Z",
     "iopub.status.busy": "2023-05-17T15:37:46.890270Z",
     "iopub.status.idle": "2023-05-17T15:38:11.311838Z",
     "shell.execute_reply": "2023-05-17T15:38:11.310915Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-05-15. Specify 'census_version=\"2023-05-15\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "pyarrow.Table\n",
      "soma_dim_0: int64\n",
      "soma_dim_1: int64\n",
      "soma_data: float\n",
      "----\n",
      "soma_dim_0: [[8448858,8448858,8448858,8448858,8448858,...,52812487,52812553,52812556,52812556,52812566]]\n",
      "soma_dim_1: [[59,60,62,113,170,...,37033,37052,36904,36919,37033]]\n",
      "soma_data: [[1,1,1,1,1,...,1,1,1,1,2]]\n"
     ]
    }
   ],
   "source": [
    "import tiledbsoma\n",
    "\n",
    "with cellxgene_census.open_soma() as census:\n",
    "    human = census[\"census_data\"][\"homo_sapiens\"]\n",
    "\n",
    "    # initialize lazy query\n",
    "    query = human.axis_query(\n",
    "        measurement_name=\"RNA\",\n",
    "        obs_query=tiledbsoma.AxisQuery(\n",
    "            value_filter=\"cell_type == 'natural killer cell' \"\n",
    "            \"and disease == 'COVID-19' \"\n",
    "            \"and tissue_general == 'blood' \"\n",
    "            \"and sex == 'female' \"\n",
    "            \"and is_primary_data == True\"\n",
    "        ),\n",
    "    )\n",
    "\n",
    "    # get iterator for X\n",
    "    iterator = query.X(\"raw\").tables()\n",
    "\n",
    "    # iterate in chunks\n",
    "    for chunk in iterator:\n",
    "        print(chunk)\n",
    "\n",
    "        # since this is a demo we stop right away\n",
    "        break"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.11"
  },
  "vscode": {
   "interpreter": {
    "hash": "3da8ec1c162cd849e59e6ea2824b2e353dce799884e910aae99411be5277f953"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}