{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Understanding and filtering out duplicate cells\n", "\n", "This tutorial provides an explanation for the existence of duplicate cells in the Census, and it showcases different ways to handle these cells when performing queries on the Census using the `is_primary_data` cell metadata variable. \n", "\n", "**Contents**\n", "\n", "1. Why are there duplicate cells in the Census?\n", "2. An example: duplicate cells in the Tabula Muris Senis data.\n", "3. Filtering out duplicates cells.\n", " 1. Filtering out duplicate cells when reading the `obs` data frame.\n", " 2. Filtering out duplicate cells when creating an AnnData.\n", " 3. Filtering out duplicate cells for out-of-core operations.\n", " \n", "## Why are there duplicate cells in the Census?\n", "\n", "Duplicate cells are labeled on the `is_primary_data` cell metadata variable as `False`. To learn more about this please take a look at the corresponding [section of the dataset schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md#is_primary_data). \n", "\n", "The Census data is a concatenation of most RNA data from CZ CELLxGENE Discover and these data are ingested one dataset at a time. You can take a look at what data is included in the Census [here](https://chanzuckerberg.github.io/cellxgene-census/cellxgene_census_docsite_schema.html).\n", "\n", "In some cases data from the same cell exists in different datasets, therefore cells can be duplicated throughout CELLxGENE Discover and by extension the Census. \n", "\n", "The following are a few examples where cells are duplicated in CELLxGENE Discover:\n", "\n", "* There are datasets that combine data from other, pre-existing datasets.\n", "\n", "> *For example [Tabula Sapiens](https://cellxgene.cziscience.com/collections/e5f58829-1a66-40b5-a624-9046778e74f5) has one dataset with all of its cells and separate datasets with cells divided by high-level lineage (i.e. immune, epithelial, stromal, endothelial)*\n", "\n", "* A dataset may provide a meta-analysis of pre-existing datasets.\n", "\n", "> *For example [Jin et al.](https://cellxgene.cziscience.com/collections/b9fc3d70-5a72-4479-a046-c2cc1ab19efc) performed a meta-analysis of COVID-19 data, and they included both the individual datasets as well as one concatenated dataset*\n", "\n", "The Census has all of these data to allow for the execution of dataset-based queries, which would be otherwise be limited if only non-duplicate cells were included.\n", "\n", "## An example: duplicate cells in the Tabula Muris Senis data\n", "\n", "Let's take a look at an example from the Census using the Tabula Muris Senis data. Some of its datasets contain duplicated cells.\n", "\n", "We can obtain cell metadata for the **main** Tabula Muris Senis dataset: \"All - A single-cell transcriptomic atlas characterizes ageing tissues in the mouse - 10x\", which contains the original (non-duplicated) cells.\n", "\n", "And remember we must include the `is_primary_data` column." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2023-05-17T15:37:06.246546Z", "iopub.status.busy": "2023-05-17T15:37:06.246069Z", "iopub.status.idle": "2023-05-17T15:37:08.867857Z", "shell.execute_reply": "2023-05-17T15:37:08.867253Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-05-15. Specify 'census_version=\"2023-05-15\"' in future calls to open_soma() to ensure data consistency.\n" ] } ], "source": [ "import cellxgene_census\n", "\n", "tabula_muris_dataset_id = \"48b37086-25f7-4ecd-be66-f5bb378e3aea\"\n", "\n", "with cellxgene_census.open_soma() as census:\n", " tabula_muris_obs = census[\"census_data\"][\"mus_musculus\"].obs.read(\n", " value_filter=f\"dataset_id == '{tabula_muris_dataset_id}'\", column_names=[\"tissue\", \"is_primary_data\"]\n", " )\n", "\n", " tabula_muris_obs = tabula_muris_obs.concat().to_pandas()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now let's take a look at counts for the unique combinations of values.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-05-17T15:37:08.871706Z", "iopub.status.busy": "2023-05-17T15:37:08.870353Z", "iopub.status.idle": "2023-05-17T15:37:08.911114Z", "shell.execute_reply": "2023-05-17T15:37:08.910586Z" } }, "outputs": [ { "data": { "text/plain": [ "tissue is_primary_data dataset_id \n", "bone marrow True 48b37086-25f7-4ecd-be66-f5bb378e3aea 40220\n", "spleen True 48b37086-25f7-4ecd-be66-f5bb378e3aea 35718\n", "limb muscle True 48b37086-25f7-4ecd-be66-f5bb378e3aea 28867\n", "lung True 48b37086-25f7-4ecd-be66-f5bb378e3aea 24540\n", "kidney True 48b37086-25f7-4ecd-be66-f5bb378e3aea 21647\n", "tongue True 48b37086-25f7-4ecd-be66-f5bb378e3aea 20680\n", "mammary gland True 48b37086-25f7-4ecd-be66-f5bb378e3aea 12295\n", "thymus True 48b37086-25f7-4ecd-be66-f5bb378e3aea 9275\n", "bladder lumen True 48b37086-25f7-4ecd-be66-f5bb378e3aea 8945\n", "heart True 48b37086-25f7-4ecd-be66-f5bb378e3aea 8613\n", "trachea True 48b37086-25f7-4ecd-be66-f5bb378e3aea 7976\n", "liver True 48b37086-25f7-4ecd-be66-f5bb378e3aea 7294\n", "adipose tissue True 48b37086-25f7-4ecd-be66-f5bb378e3aea 6777\n", "pancreas True 48b37086-25f7-4ecd-be66-f5bb378e3aea 6201\n", "skin of body True 48b37086-25f7-4ecd-be66-f5bb378e3aea 4454\n", "large intestine True 48b37086-25f7-4ecd-be66-f5bb378e3aea 1887\n", "Name: count, dtype: int64" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tabula_muris_obs.value_counts()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You can see all cells across the tissues are labelled as `True` for `is_primary_data`.\n", "\n", "But what if we select cells from the dataset that only contains cells from the liver: \"Liver - A single-cell transcriptomic atlas characterizes ageing tissues in the mouse - 10x\".\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-05-17T15:37:08.913202Z", "iopub.status.busy": "2023-05-17T15:37:08.913060Z", "iopub.status.idle": "2023-05-17T15:37:09.968086Z", "shell.execute_reply": "2023-05-17T15:37:09.967626Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-05-15. Specify 'census_version=\"2023-05-15\"' in future calls to open_soma() to ensure data consistency.\n" ] } ], "source": [ "tabula_muris_liver_dataset_id = \"6202a243-b713-4e12-9ced-c387f8483dea\"\n", "\n", "with cellxgene_census.open_soma() as census:\n", " tabula_muris_liver_obs = census[\"census_data\"][\"mus_musculus\"].obs.read(\n", " value_filter=f\"dataset_id == '{tabula_muris_liver_dataset_id}'\", column_names=[\"tissue\", \"is_primary_data\"]\n", " )\n", "\n", " tabula_muris_liver_obs = tabula_muris_liver_obs.concat().to_pandas()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "And we take a look at counts for the unique combinations of values." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-05-17T15:37:09.970720Z", "iopub.status.busy": "2023-05-17T15:37:09.970563Z", "iopub.status.idle": "2023-05-17T15:37:09.976304Z", "shell.execute_reply": "2023-05-17T15:37:09.975953Z" } }, "outputs": [ { "data": { "text/plain": [ "tissue is_primary_data dataset_id \n", "liver False 6202a243-b713-4e12-9ced-c387f8483dea 7294\n", "Name: count, dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tabula_muris_liver_obs.value_counts()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You can see that:\n", "\n", "1. This dataset only contains cells from liver.\n", "2. All cells are labelled as `False` for `is_primary_data`. **This is because the cells are marked as duplicate cells of the main Tabula Muris Senis dataset.**\n", "\n", "## Filtering out duplicate cells\n", "\n", "In some cases you may be interested in getting all cells for a specific biological context, for example *\"all natural killer cells from blood of female cells with COVID-19\"* but you need to be aware that there is a chance you end up with some duplicate cells.\n", "\n", "We therefore recommend that you always look at `is_primary_data` and use that information based on your needs.\n", "\n", "If you know *a priori* that you don't want duplicated cells this section shows you how to efficiently exclude them from your queries. \n", "\n", "### Filtering out duplicate cells when reading the `obs` data frame.\n", "\n", "Let's say you are interested in looking at the cell metadata of *\"all natural killer cells from blood of female cells with COVID-19\"* but you want to exclude duplicate cells, then you can use `value_filter` when reading the data frame to only include cells with `is_primary_data` as `True`.\n", "\n", "Let's first read the cell metadata including **all** cells:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-05-17T15:37:09.978023Z", "iopub.status.busy": "2023-05-17T15:37:09.977879Z", "iopub.status.idle": "2023-05-17T15:37:16.049592Z", "shell.execute_reply": "2023-05-17T15:37:16.048490Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-05-15. Specify 'census_version=\"2023-05-15\"' in future calls to open_soma() to ensure data consistency.\n" ] } ], "source": [ "with cellxgene_census.open_soma() as census:\n", " nk_cells = census[\"census_data\"][\"homo_sapiens\"].obs.read(\n", " value_filter=\"cell_type == 'natural killer cell' \"\n", " \"and disease == 'COVID-19' \"\n", " \"and sex == 'female'\"\n", " \"and tissue_general == 'blood'\"\n", " )\n", "\n", " nk_cells = nk_cells.concat().to_pandas()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2023-05-17T15:37:16.052588Z", "iopub.status.busy": "2023-05-17T15:37:16.052364Z", "iopub.status.idle": "2023-05-17T15:37:16.055971Z", "shell.execute_reply": "2023-05-17T15:37:16.055607Z" } }, "outputs": [ { "data": { "text/plain": [ "(80935, 21)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nk_cells.shape" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "And now we repeat the query only using cells marked as `True` for `is_primary_data`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2023-05-17T15:37:16.057773Z", "iopub.status.busy": "2023-05-17T15:37:16.057628Z", "iopub.status.idle": "2023-05-17T15:37:22.371662Z", "shell.execute_reply": "2023-05-17T15:37:22.370999Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-05-15. Specify 'census_version=\"2023-05-15\"' in future calls to open_soma() to ensure data consistency.\n" ] } ], "source": [ "with cellxgene_census.open_soma() as census:\n", " nk_cells_primary = census[\"census_data\"][\"homo_sapiens\"].obs.read(\n", " value_filter=\"cell_type == 'natural killer cell' \"\n", " \"and disease == 'COVID-19' \"\n", " \"and tissue_general == 'blood'\"\n", " \"and sex == 'female'\"\n", " \"and is_primary_data == True\"\n", " )\n", "\n", " nk_cells_primary = nk_cells_primary.concat().to_pandas()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2023-05-17T15:37:22.374378Z", "iopub.status.busy": "2023-05-17T15:37:22.374168Z", "iopub.status.idle": "2023-05-17T15:37:22.377211Z", "shell.execute_reply": "2023-05-17T15:37:22.376841Z" } }, "outputs": [ { "data": { "text/plain": [ "(59109, 21)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nk_cells_primary.shape" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "You can see a clear reduction in the number of cells.\n", "\n", "### Filtering out duplicate cells when creating an AnnData\n", "\n", "You can also utilize `is_primary_data` on the `obs_value_filter` of `get_anndata`.\n", "\n", "Let's repeat the process above. First querying by including **all** cells. To reduce the bandwidth and memory usage, let's just fetch data for one gene. " ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2023-05-17T15:37:22.379352Z", "iopub.status.busy": "2023-05-17T15:37:22.379138Z", "iopub.status.idle": "2023-05-17T15:37:35.927633Z", "shell.execute_reply": "2023-05-17T15:37:35.926700Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-05-15. Specify 'census_version=\"2023-05-15\"' in future calls to open_soma() to ensure data consistency.\n" ] } ], "source": [ "with cellxgene_census.open_soma() as census:\n", " adata = cellxgene_census.get_anndata(\n", " census,\n", " organism=\"Homo sapiens\",\n", " var_value_filter=\"feature_name == 'AQP5'\",\n", " obs_value_filter=\"cell_type == 'natural killer cell' \"\n", " \"and disease == 'COVID-19' \"\n", " \"and sex == 'female'\"\n", " \"and tissue_general == 'blood'\",\n", " )" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2023-05-17T15:37:35.930595Z", "iopub.status.busy": "2023-05-17T15:37:35.930444Z", "iopub.status.idle": "2023-05-17T15:37:35.934725Z", "shell.execute_reply": "2023-05-17T15:37:35.933957Z" } }, "outputs": [ { "data": { "text/plain": [ "80935" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(adata.obs)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "And now we repeat the query only using cells marked as `True` for `is_primary_data`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2023-05-17T15:37:35.936993Z", "iopub.status.busy": "2023-05-17T15:37:35.936846Z", "iopub.status.idle": "2023-05-17T15:37:46.880757Z", "shell.execute_reply": "2023-05-17T15:37:46.879659Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-05-15. Specify 'census_version=\"2023-05-15\"' in future calls to open_soma() to ensure data consistency.\n" ] } ], "source": [ "with cellxgene_census.open_soma() as census:\n", " adata_primary = cellxgene_census.get_anndata(\n", " census,\n", " organism=\"Homo sapiens\",\n", " var_value_filter=\"feature_name == 'AQP5'\",\n", " obs_value_filter=\"cell_type == 'natural killer cell' \"\n", " \"and disease == 'COVID-19' \"\n", " \"and sex == 'female' \"\n", " \"and tissue_general == 'blood'\"\n", " \"and is_primary_data == True\",\n", " )" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2023-05-17T15:37:46.883574Z", "iopub.status.busy": "2023-05-17T15:37:46.883432Z", "iopub.status.idle": "2023-05-17T15:37:46.888006Z", "shell.execute_reply": "2023-05-17T15:37:46.887189Z" } }, "outputs": [ { "data": { "text/plain": [ "59109" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(adata_primary.obs)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "In this case you can also observe a clear reduction in the number of cells.\n", "\n", "#### Filtering out duplicate cells for out-of-core operations.\n", "\n", "Finally we can utilize `is_primary_data` on the `value_filter` of `obs` of an \"Axis Query\" to perform out-of-core operations.\n", "\n", "In this example we only include the version with duplicated cells removed." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2023-05-17T15:37:46.890416Z", "iopub.status.busy": "2023-05-17T15:37:46.890270Z", "iopub.status.idle": "2023-05-17T15:38:11.311838Z", "shell.execute_reply": "2023-05-17T15:38:11.310915Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-05-15. Specify 'census_version=\"2023-05-15\"' in future calls to open_soma() to ensure data consistency.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "pyarrow.Table\n", "soma_dim_0: int64\n", "soma_dim_1: int64\n", "soma_data: float\n", "----\n", "soma_dim_0: [[8448858,8448858,8448858,8448858,8448858,...,52812487,52812553,52812556,52812556,52812566]]\n", "soma_dim_1: [[59,60,62,113,170,...,37033,37052,36904,36919,37033]]\n", "soma_data: [[1,1,1,1,1,...,1,1,1,1,2]]\n" ] } ], "source": [ "import tiledbsoma\n", "\n", "with cellxgene_census.open_soma() as census:\n", " human = census[\"census_data\"][\"homo_sapiens\"]\n", "\n", " # initialize lazy query\n", " query = human.axis_query(\n", " measurement_name=\"RNA\",\n", " obs_query=tiledbsoma.AxisQuery(\n", " value_filter=\"cell_type == 'natural killer cell' \"\n", " \"and disease == 'COVID-19' \"\n", " \"and tissue_general == 'blood' \"\n", " \"and sex == 'female' \"\n", " \"and is_primary_data == True\"\n", " ),\n", " )\n", "\n", " # get iterator for X\n", " iterator = query.X(\"raw\").tables()\n", "\n", " # iterate in chunks\n", " for chunk in iterator:\n", " print(chunk)\n", "\n", " # since this is a demo we stop right away\n", " break" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11" }, "vscode": { "interpreter": { "hash": "3da8ec1c162cd849e59e6ea2824b2e353dce799884e910aae99411be5277f953" } } }, "nbformat": 4, "nbformat_minor": 2 }