{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploring the Census Datasets table\n", "\n", "This tutorial demonstrates basic use of the `census_datasets` dataframe that contains metadata of the Census source datasets. This metadata can be joined to the cell metadata dataframe (`obs`) via the column `dataset_id`, \n", "\n", "**Contents**\n", "\n", "1. Fetching the datasets table.\n", "2. Fetching the expression data from a single dataset.\n", "3. Downloading the original source H5AD file of a dataset.\n", "\n", "⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable `is_primary_data` which is described in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#repeated-data).\n", "\n", "## Fetching the datasets table\n", "\n", "\n", "Each Census contains a top-level dataframe itemizing the datasets contained therein. You can read this into a `pandas.DataFrame`." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T13:51:38.383599Z", "iopub.status.busy": "2023-07-28T13:51:38.383335Z", "iopub.status.idle": "2023-07-28T13:51:41.392248Z", "shell.execute_reply": "2023-07-28T13:51:41.391544Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n" ] }, { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
collection_idcollection_namecollection_doidataset_iddataset_titledataset_h5ad_pathdataset_total_cell_count
soma_joinid
0e2c257e7-6f79-487c-b81c-39451cd4ab3cSpatial multiomics map of trophoblast developm...10.1038/s41586-023-05869-0f171db61-e57e-4535-a06a-35d8b6ef8f2bdonor_p13_trophoblastsf171db61-e57e-4535-a06a-35d8b6ef8f2b.h5ad31497
1e2c257e7-6f79-487c-b81c-39451cd4ab3cSpatial multiomics map of trophoblast developm...10.1038/s41586-023-05869-0ecf2e08e-2032-4a9e-b466-b65b395f4a02All donors trophoblastsecf2e08e-2032-4a9e-b466-b65b395f4a02.h5ad67070
2e2c257e7-6f79-487c-b81c-39451cd4ab3cSpatial multiomics map of trophoblast developm...10.1038/s41586-023-05869-074cff64f-9da9-4b2a-9b3b-8a04a1598040All donors all cell states (in vivo)74cff64f-9da9-4b2a-9b3b-8a04a1598040.h5ad286326
3f7cecffa-00b4-4560-a29a-8ad626b8ee08Mapping single-cell transcriptomes in the intr...10.1016/j.ccell.2022.11.0015af90777-6760-4003-9dba-8f945fec6fdfSingle-cell transcriptomic datasets of Renal c...5af90777-6760-4003-9dba-8f945fec6fdf.h5ad270855
43f50314f-bdc9-40c6-8e4a-b0901ebfbe4cSingle-cell sequencing links multiregional imm...10.1016/j.ccell.2021.03.007bd65a70f-b274-4133-b9dd-0d1431b6af34Single-cell sequencing links multiregional imm...bd65a70f-b274-4133-b9dd-0d1431b6af34.h5ad167283
........................
588180bff9c-c8a5-4539-b13b-ddbc00d643e6Molecular characterization of selectively vuln...10.1038/s41593-020-00764-7f9ad5649-f372-43e1-a3a8-423383e5a8a2Molecular characterization of selectively vuln...f9ad5649-f372-43e1-a3a8-423383e5a8a2.h5ad8168
589a72afd53-ab92-4511-88da-252fb0e26b9aSingle-cell atlas of peripheral immune respons...10.1038/s41591-020-0944-y456e8b9b-f872-488b-871d-94534090a865Single-cell atlas of peripheral immune respons...456e8b9b-f872-488b-871d-94534090a865.h5ad44721
59038833785-fac5-48fd-944a-0f62a4c23ed1Construction of a human cell landscape at sing...10.1038/s41586-020-2157-42adb1f8a-a6b1-4909-8ee8-484814e2d4bfConstruction of a human cell landscape at sing...2adb1f8a-a6b1-4909-8ee8-484814e2d4bf.h5ad598266
5915d445965-6f1a-4b68-ba3a-b8f765155d3aA molecular cell atlas of the human lung from ...10.1038/s41586-020-2922-4e04daea4-4412-45b5-989e-76a9be070a89Krasnow Lab Human Lung Cell Atlas, Smart-seq2e04daea4-4412-45b5-989e-76a9be070a89.h5ad9409
5925d445965-6f1a-4b68-ba3a-b8f765155d3aA molecular cell atlas of the human lung from ...10.1038/s41586-020-2922-48c42cfd0-0b0a-46d5-910c-fc833d83c45eKrasnow Lab Human Lung Cell Atlas, 10X8c42cfd0-0b0a-46d5-910c-fc833d83c45e.h5ad65662
\n", "

593 rows × 7 columns

\n", "
" ], "text/plain": [ " collection_id \\\n", "soma_joinid \n", "0 e2c257e7-6f79-487c-b81c-39451cd4ab3c \n", "1 e2c257e7-6f79-487c-b81c-39451cd4ab3c \n", "2 e2c257e7-6f79-487c-b81c-39451cd4ab3c \n", "3 f7cecffa-00b4-4560-a29a-8ad626b8ee08 \n", "4 3f50314f-bdc9-40c6-8e4a-b0901ebfbe4c \n", "... ... \n", "588 180bff9c-c8a5-4539-b13b-ddbc00d643e6 \n", "589 a72afd53-ab92-4511-88da-252fb0e26b9a \n", "590 38833785-fac5-48fd-944a-0f62a4c23ed1 \n", "591 5d445965-6f1a-4b68-ba3a-b8f765155d3a \n", "592 5d445965-6f1a-4b68-ba3a-b8f765155d3a \n", "\n", " collection_name \\\n", "soma_joinid \n", "0 Spatial multiomics map of trophoblast developm... \n", "1 Spatial multiomics map of trophoblast developm... \n", "2 Spatial multiomics map of trophoblast developm... \n", "3 Mapping single-cell transcriptomes in the intr... \n", "4 Single-cell sequencing links multiregional imm... \n", "... ... \n", "588 Molecular characterization of selectively vuln... \n", "589 Single-cell atlas of peripheral immune respons... \n", "590 Construction of a human cell landscape at sing... \n", "591 A molecular cell atlas of the human lung from ... \n", "592 A molecular cell atlas of the human lung from ... \n", "\n", " collection_doi \\\n", "soma_joinid \n", "0 10.1038/s41586-023-05869-0 \n", "1 10.1038/s41586-023-05869-0 \n", "2 10.1038/s41586-023-05869-0 \n", "3 10.1016/j.ccell.2022.11.001 \n", "4 10.1016/j.ccell.2021.03.007 \n", "... ... \n", "588 10.1038/s41593-020-00764-7 \n", "589 10.1038/s41591-020-0944-y \n", "590 10.1038/s41586-020-2157-4 \n", "591 10.1038/s41586-020-2922-4 \n", "592 10.1038/s41586-020-2922-4 \n", "\n", " dataset_id \\\n", "soma_joinid \n", "0 f171db61-e57e-4535-a06a-35d8b6ef8f2b \n", "1 ecf2e08e-2032-4a9e-b466-b65b395f4a02 \n", "2 74cff64f-9da9-4b2a-9b3b-8a04a1598040 \n", "3 5af90777-6760-4003-9dba-8f945fec6fdf \n", "4 bd65a70f-b274-4133-b9dd-0d1431b6af34 \n", "... ... \n", "588 f9ad5649-f372-43e1-a3a8-423383e5a8a2 \n", "589 456e8b9b-f872-488b-871d-94534090a865 \n", "590 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf \n", "591 e04daea4-4412-45b5-989e-76a9be070a89 \n", "592 8c42cfd0-0b0a-46d5-910c-fc833d83c45e \n", "\n", " dataset_title \\\n", "soma_joinid \n", "0 donor_p13_trophoblasts \n", "1 All donors trophoblasts \n", "2 All donors all cell states (in vivo) \n", "3 Single-cell transcriptomic datasets of Renal c... \n", "4 Single-cell sequencing links multiregional imm... \n", "... ... \n", "588 Molecular characterization of selectively vuln... \n", "589 Single-cell atlas of peripheral immune respons... \n", "590 Construction of a human cell landscape at sing... \n", "591 Krasnow Lab Human Lung Cell Atlas, Smart-seq2 \n", "592 Krasnow Lab Human Lung Cell Atlas, 10X \n", "\n", " dataset_h5ad_path \\\n", "soma_joinid \n", "0 f171db61-e57e-4535-a06a-35d8b6ef8f2b.h5ad \n", "1 ecf2e08e-2032-4a9e-b466-b65b395f4a02.h5ad \n", "2 74cff64f-9da9-4b2a-9b3b-8a04a1598040.h5ad \n", "3 5af90777-6760-4003-9dba-8f945fec6fdf.h5ad \n", "4 bd65a70f-b274-4133-b9dd-0d1431b6af34.h5ad \n", "... ... \n", "588 f9ad5649-f372-43e1-a3a8-423383e5a8a2.h5ad \n", "589 456e8b9b-f872-488b-871d-94534090a865.h5ad \n", "590 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf.h5ad \n", "591 e04daea4-4412-45b5-989e-76a9be070a89.h5ad \n", "592 8c42cfd0-0b0a-46d5-910c-fc833d83c45e.h5ad \n", "\n", " dataset_total_cell_count \n", "soma_joinid \n", "0 31497 \n", "1 67070 \n", "2 286326 \n", "3 270855 \n", "4 167283 \n", "... ... \n", "588 8168 \n", "589 44721 \n", "590 598266 \n", "591 9409 \n", "592 65662 \n", "\n", "[593 rows x 7 columns]" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import cellxgene_census\n", "\n", "census = cellxgene_census.open_soma()\n", "census_datasets = census[\"census_info\"][\"datasets\"].read().concat().to_pandas()\n", "\n", "# for convenience, indexing on the soma_joinid which links this to other census data.\n", "census_datasets = census_datasets.set_index(\"soma_joinid\")\n", "\n", "census_datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The sum cells across all datasets should match the number of cells across all SOMA experiments (human, mouse)." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T13:51:41.395414Z", "iopub.status.busy": "2023-07-28T13:51:41.394697Z", "iopub.status.idle": "2023-07-28T13:51:42.997103Z", "shell.execute_reply": "2023-07-28T13:51:42.996409Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Count by experiment:\n", "\t5255245 cells in mus_musculus\n", "\t56400873 cells in homo_sapiens\n", "\n", "Found 61656118 cells in all experiments.\n", "Found 61656118 cells in all datasets.\n" ] } ], "source": [ "# Count cells across all experiments\n", "all_experiments = (\n", " (organism_name, organism_experiment) for organism_name, organism_experiment in census[\"census_data\"].items()\n", ")\n", "experiments_total_cells = 0\n", "print(\"Count by experiment:\")\n", "for organism_name, organism_experiment in all_experiments:\n", " num_cells = len(organism_experiment.obs.read(column_names=[\"soma_joinid\"]).concat().to_pandas())\n", " print(f\"\\t{num_cells} cells in {organism_name}\")\n", " experiments_total_cells += num_cells\n", "\n", "print(f\"\\nFound {experiments_total_cells} cells in all experiments.\")\n", "\n", "# Count cells across all datasets\n", "print(f\"Found {census_datasets.dataset_total_cell_count.sum()} cells in all datasets.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fetching the expression data from a single dataset\n", "\n", "Lets pick one dataset to slice out of the census, and turn into an [AnnData](https://anndata.readthedocs.io/en/latest/) in-memory object. This can be used with the [ScanPy](https://scanpy.readthedocs.io/en/stable/) toolchain. You can also save this AnnData locally using the AnnData [write](https://anndata.readthedocs.io/en/latest/api.html#writing) API." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T13:51:42.999721Z", "iopub.status.busy": "2023-07-28T13:51:42.999445Z", "iopub.status.idle": "2023-07-28T13:51:43.008115Z", "shell.execute_reply": "2023-07-28T13:51:43.007588Z" } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
collection_idcollection_namecollection_doidataset_iddataset_titledataset_h5ad_pathdataset_total_cell_count
soma_joinid
5220b9d8a04-bb9d-44da-aa27-705bb65b54ebTabula Muris Senis10.1038/s41586-020-2496-10bd1a1de-3aee-40e0-b2ec-86c7a30c7149Bone marrow - A single-cell transcriptomic atl...0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad40220
\n", "
" ], "text/plain": [ " collection_id collection_name \\\n", "soma_joinid \n", "522 0b9d8a04-bb9d-44da-aa27-705bb65b54eb Tabula Muris Senis \n", "\n", " collection_doi dataset_id \\\n", "soma_joinid \n", "522 10.1038/s41586-020-2496-1 0bd1a1de-3aee-40e0-b2ec-86c7a30c7149 \n", "\n", " dataset_title \\\n", "soma_joinid \n", "522 Bone marrow - A single-cell transcriptomic atl... \n", "\n", " dataset_h5ad_path \\\n", "soma_joinid \n", "522 0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad \n", "\n", " dataset_total_cell_count \n", "soma_joinid \n", "522 40220 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "census_datasets[census_datasets.dataset_id == \"0bd1a1de-3aee-40e0-b2ec-86c7a30c7149\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a query on the mouse experiment, \"RNA\" measurement, for the dataset_id." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T13:51:43.010351Z", "iopub.status.busy": "2023-07-28T13:51:43.010115Z", "iopub.status.idle": "2023-07-28T13:51:48.579486Z", "shell.execute_reply": "2023-07-28T13:51:48.578789Z" } }, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 40220 × 52392\n", " obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id'\n", " var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata = cellxgene_census.get_anndata(\n", " census, organism=\"Mus musculus\", obs_value_filter=\"dataset_id == '0bd1a1de-3aee-40e0-b2ec-86c7a30c7149'\"\n", ")\n", "\n", "adata" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Downloading the original source H5AD file of a dataset.\n", "\n", "You can download the original H5AD file for any given dataset. This is the same H5AD you can download from the [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/), and may contain additional data-submitter provided information which was not included in the Census.\n", "\n", "To do this you can fetch the location in the cloud or directly download to your system using the `cellxgene-census`" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T13:51:48.582025Z", "iopub.status.busy": "2023-07-28T13:51:48.581729Z", "iopub.status.idle": "2023-07-28T13:52:06.733774Z", "shell.execute_reply": "2023-07-28T13:52:06.733158Z" } }, "outputs": [], "source": [ "# Option 1: Direct download\n", "cellxgene_census.download_source_h5ad(\n", " \"0bd1a1de-3aee-40e0-b2ec-86c7a30c7149\", to_path=\"Tabula_Muris_Senis-bone_marrow.h5ad\"\n", ")" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T13:52:06.736609Z", "iopub.status.busy": "2023-07-28T13:52:06.736348Z", "iopub.status.idle": "2023-07-28T13:52:07.447332Z", "shell.execute_reply": "2023-07-28T13:52:07.446692Z" } }, "outputs": [ { "data": { "text/plain": [ "{'uri': 's3://cellxgene-data-public/cell-census/2023-07-25/h5ads/0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad',\n", " 's3_region': 'us-west-2'}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Option 2: Get location and download via preferred method\n", "uri = cellxgene_census.get_source_h5ad_uri(\"0bd1a1de-3aee-40e0-b2ec-86c7a30c7149\")\n", "uri\n", "\n", "# you can now download the H5AD in shell via AWS CLI e.g. `aws s3 cp uri ./`" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Close the census" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T13:52:07.449802Z", "iopub.status.busy": "2023-07-28T13:52:07.449525Z", "iopub.status.idle": "2023-07-28T13:52:07.452843Z", "shell.execute_reply": "2023-07-28T13:52:07.452279Z" } }, "outputs": [], "source": [ "census.close()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" }, "vscode": { "interpreter": { "hash": "3da8ec1c162cd849e59e6ea2824b2e353dce799884e910aae99411be5277f953" } } }, "nbformat": 4, "nbformat_minor": 2 }