{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Summarizing cell and gene metadata\n", "\n", "This notebook provides examples for basic axis metadata handling using Pandas. The Census stores `obs` (cell) and `var` (gene) metadata in `SOMADataFrame` objects via the [TileDB-SOMA API](https://github.com/single-cell-data/TileDB-SOMA) ([documentation](https://tiledbsoma.readthedocs.io/en/latest/)), which can be queried and read as a Pandas `DataFrame` using `TileDB-SOMA`. \n", "\n", "Note that Pandas `DataFrame` is an in-memory object, therefore queries should be small enough for results to fit in memory.\n", "\n", "**Contents**\n", "\n", "1. Opening the Census\n", "1. Summarizing cell metadata\n", " 1. Example: Summarize all cell types\n", " 1. Example: Summarize a subset of cell types, selected with a `value_filter`\n", "1. Full Census metadata stats\n", "\n", "⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable `is_primary_data` which is described in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#repeated-data).\n", "\n", "## Opening the Census\n", "\n", "The `cellxgene_census` python package contains a convenient API to open the latest version of the Census. If you open the Census, you should close it. `open_soma()` returns a context, so you can open/close it in several ways, like a Python file handle. The context manager is preferred, as it will automatically close upon an error raise.\n", "\n", "You can learn more about the `cellxgene_census` methods by accessing their corresponding documentation via `help()`. For example `help(cellxgene_census.open_soma)`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:30:26.404893Z", "iopub.status.busy": "2023-07-28T14:30:26.404633Z", "iopub.status.idle": "2023-07-28T14:30:29.216830Z", "shell.execute_reply": "2023-07-28T14:30:29.216225Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n", "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n" ] } ], "source": [ "import cellxgene_census\n", "\n", "# Preferred: use a Python context manager\n", "with cellxgene_census.open_soma() as census:\n", " ...\n", "\n", "# or, directly open the census (don't forget to close it!)\n", "census = cellxgene_census.open_soma()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Summarizing cell metadata\n", "\n", "Once the Census is open you can use its `TileDB-SOMA` methods as it is itself a `SOMACollection`. You can thus access the metadata `SOMADataFrame` objects encoding cell and gene metadata.\n", "\n", "Tips:\n", "\n", "- You can read an _entire_ `SOMADataFrame` into a Pandas `DataFrame` using `soma_df.read().concat().to_pandas()`, allowing the use of the standard Pandas API.\n", "- Queries will be much faster if you request only the DataFrame columns required for your analysis (e.g., `column_names=[\"cell_type_ontology_term_id\"]`).\n", "- You can also further refine query results by using a `value_filter`, which will filter the census for matching records.\n", "\n", "### Example: Summarize all cell types\n", "\n", "This example reads the cell metadata (`obs`) into a Pandas DataFrame, and summarizes in a variety of ways using Pandas API." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:30:29.219969Z", "iopub.status.busy": "2023-07-28T14:30:29.219515Z", "iopub.status.idle": "2023-07-28T14:30:36.711046Z", "shell.execute_reply": "2023-07-28T14:30:36.710444Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 613 cell types in the Census! The first 10 are: ['CL:0000525', 'CL:2000060', 'CL:0008036', 'CL:0002488', 'CL:0002343', 'CL:0000084', 'CL:0001078', 'CL:0000815', 'CL:0000235', 'CL:3000001']\n", "\n", "The top 10 cell types and their counts are:\n", "cell_type_ontology_term_id\n", "CL:0000540 7665340\n", "CL:0000679 1894047\n", "CL:0000128 1881077\n", "CL:0000624 1508920\n", "CL:0000625 1477453\n", "CL:0000235 1419507\n", "CL:0000057 1397813\n", "CL:0000860 1369142\n", "CL:0000003 1308000\n", "CL:4023040 1229658\n", "Name: count, dtype: int64\n" ] } ], "source": [ "human = census[\"census_data\"][\"homo_sapiens\"]\n", "\n", "# Read entire _obs_ into a pandas dataframe.\n", "obs_df = human.obs.read(column_names=[\"cell_type_ontology_term_id\"]).concat().to_pandas()\n", "\n", "# Use Pandas API to find all unique values in the `cell_type_ontology_term_id` column.\n", "unique_cell_type_ontology_term_id = obs_df.cell_type_ontology_term_id.unique()\n", "\n", "# Display only the first 10, as there are a LOT!\n", "print(\n", " f\"There are {len(unique_cell_type_ontology_term_id)} cell types in the Census! The first 10 are:\",\n", " unique_cell_type_ontology_term_id[0:10].tolist(),\n", ")\n", "\n", "# Using Pandas API, count the instances of each cell type term and return the top 10.\n", "top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]\n", "print(\"\\nThe top 10 cell types and their counts are:\")\n", "print(top_10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Example: Summarize a subset of cell types, selected with a `value_filter`\n", "\n", "This example utilizes a SOMA \"value filter\" to read the subset of cells with `tissue_ontology_term_id` equal to `UBERON:0002048` (lung tissue), and summarizes the query result using Pandas." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:30:36.713974Z", "iopub.status.busy": "2023-07-28T14:30:36.713681Z", "iopub.status.idle": "2023-07-28T14:30:38.595585Z", "shell.execute_reply": "2023-07-28T14:30:38.595010Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 185 cell types in the Census where tissue_ontology_term_id == UBERON:0002048! The first 10 are: ['CL:0002063', 'CL:0000775', 'CL:0001044', 'CL:0001050', 'CL:0000814', 'CL:0000071', 'CL:0000192', 'CL:0002503', 'CL:0000235', 'CL:0002370']\n", "\n", "Top 10 cell types where tissue_ontology_term_id == UBERON:0002048\n", "cell_type_ontology_term_id\n", "CL:0000003 562038\n", "CL:0000583 526859\n", "CL:0000625 323985\n", "CL:0000624 323610\n", "CL:0000235 266333\n", "CL:0002063 255425\n", "CL:0000860 205013\n", "CL:0000623 164944\n", "CL:0001064 149067\n", "CL:0002632 132243\n", "Name: count, dtype: int64\n" ] } ], "source": [ "# Count cell_type occurrences for cells with tissue == 'lung'\n", "human = census[\"census_data\"][\"homo_sapiens\"]\n", "\n", "# Read cell_type terms for cells which have a specific tissue term\n", "LUNG_TISSUE = \"UBERON:0002048\"\n", "\n", "obs_df = (\n", " human.obs.read(\n", " column_names=[\"cell_type_ontology_term_id\"],\n", " value_filter=f\"tissue_ontology_term_id == '{LUNG_TISSUE}'\",\n", " )\n", " .concat()\n", " .to_pandas()\n", ")\n", "\n", "# Use Pandas API to find all unique values in the `cell_type_ontology_term_id` column.\n", "unique_cell_type_ontology_term_id = obs_df.cell_type_ontology_term_id.unique()\n", "\n", "print(\n", " f\"There are {len(unique_cell_type_ontology_term_id)} cell types in the Census where tissue_ontology_term_id == {LUNG_TISSUE}! The first 10 are:\",\n", " unique_cell_type_ontology_term_id[0:10].tolist(),\n", ")\n", "\n", "# Use Pandas API to count, and grab 10 most common\n", "top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]\n", "print(f\"\\nTop 10 cell types where tissue_ontology_term_id == {LUNG_TISSUE}\")\n", "print(top_10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also define much more complex value filters. For example:\n", "\n", "* combine terms with `and` and `or`\n", "* use the `in` operator to query on multiple values" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:30:38.598036Z", "iopub.status.busy": "2023-07-28T14:30:38.597767Z", "iopub.status.idle": "2023-07-28T14:30:39.797374Z", "shell.execute_reply": "2023-07-28T14:30:39.796849Z" } }, "outputs": [ { "data": { "text/plain": [ "cell_type_ontology_term_id\n", "CL:0000746 49929\n", "CL:0008034 33361\n", "CL:0002548 33180\n", "CL:0002131 30915\n", "CL:0000115 30054\n", "CL:0000003 18391\n", "CL:0000763 14408\n", "CL:0000669 13552\n", "CL:0000057 9690\n", "CL:0002144 9025\n", "Name: count, dtype: int64" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# You can also do more complex queries, such as testing for inclusion in a list of values and \"and\" operations\n", "human = census[\"census_data\"][\"homo_sapiens\"]\n", "\n", "VENTRICLES = [\"UBERON:0002082\", \"UBERON:OOO2084\", \"UBERON:0002080\"]\n", "\n", "obs_df = (\n", " human.obs.read(\n", " column_names=[\"cell_type_ontology_term_id\"],\n", " value_filter=f\"tissue_ontology_term_id in {VENTRICLES} and is_primary_data == True\",\n", " )\n", " .concat()\n", " .to_pandas()\n", ")\n", "\n", "# Use Pandas API to summarize\n", "top_10 = obs_df.cell_type_ontology_term_id.value_counts()[0:10]\n", "display(top_10)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Full Census metadata stats\n", "\n", "This example queries all organisms in the Census, and summarizes the diversity of various metadata lables." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:30:39.799869Z", "iopub.status.busy": "2023-07-28T14:30:39.799605Z", "iopub.status.idle": "2023-07-28T14:30:54.607269Z", "shell.execute_reply": "2023-07-28T14:30:54.606651Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Complete census contains 61656118 cells.\n", "mus_musculus\n", "\tUnique cell_type_ontology_term_id values: 248\n", "\tUnique assay_ontology_term_id values: 9\n", "\tUnique tissue_ontology_term_id values: 66\n", "homo_sapiens\n", "\tUnique cell_type_ontology_term_id values: 613\n", "\tUnique assay_ontology_term_id values: 19\n", "\tUnique tissue_ontology_term_id values: 220\n" ] } ], "source": [ "COLS_TO_QUERY = [\n", " \"cell_type_ontology_term_id\",\n", " \"assay_ontology_term_id\",\n", " \"tissue_ontology_term_id\",\n", "]\n", "\n", "obs_df = {\n", " name: experiment.obs.read(column_names=COLS_TO_QUERY).concat().to_pandas()\n", " for name, experiment in census[\"census_data\"].items()\n", "}\n", "\n", "# Use Pandas API to summarize each organism\n", "print(f\"Complete census contains {sum(len(df) for df in obs_df.values())} cells.\")\n", "for organism, df in obs_df.items():\n", " print(organism)\n", " for col in COLS_TO_QUERY:\n", " print(f\"\\tUnique {col} values: {len(df[col].unique())}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Close the census" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:30:54.609746Z", "iopub.status.busy": "2023-07-28T14:30:54.609482Z", "iopub.status.idle": "2023-07-28T14:30:54.612526Z", "shell.execute_reply": "2023-07-28T14:30:54.612022Z" } }, "outputs": [], "source": [ "census.close()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" }, "vscode": { "interpreter": { "hash": "3da8ec1c162cd849e59e6ea2824b2e353dce799884e910aae99411be5277f953" } } }, "nbformat": 4, "nbformat_minor": 2 }