{ "cells": [ { "cell_type": "markdown", "id": "5b269fef", "metadata": {}, "source": [ "# Integrating multi-dataset slices of data\n", "\n", "The Census contains data from multiple studies, providing an opportunity to perform inter-dataset analysis. To do so, the data must first be integrated to account for batch effects.\n", "\n", "This notebook demonstrates how to integrate two Census datasets using [scvi-tools](https://docs.scvi-tools.org/en/stable/index.html). **The goal is not to provide an exhaustive guide on proper integration, but to showcase what information in the Census can inform data integration.**\n", "\n", "**Contents**\n", "\n", "1. Finding and fetching data from mouse liver (10X Genomics and Smart-Seq2).\n", "1. Gene-length normalization of Smart-Seq2 data.\n", "1. Integration with `scvi-tools`.\n", " 1. Inspecting data prior to integration.\n", " 1. Integration with batch defined as `dataset_id`.\n", " 1. Integration with batch defined as `dataset_id` + `donor_id`.\n", " 1. Integration with batch defined as `dataset_id` + `donor_id` + `assay_ontology_term_id` + `suspension_type`.\n", " \n", "⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable `is_primary_data`, which is described in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#repeated-data).\n", " This notebook focuses on individual datasets, so we can ignore this variable.\n", "\n", "## Finding and fetching data from mouse liver (10X Genomics and Smart-Seq2)\n", "\n", "Let's load all modules needed for this notebook." 
] }, { "cell_type": "code", "execution_count": 1, "id": "512a9dce", "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:25.755153Z", "iopub.status.busy": "2023-07-28T14:20:25.754602Z", "iopub.status.idle": "2023-07-28T14:20:31.287277Z", "shell.execute_reply": "2023-07-28T14:20:31.286640Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/ssm-user/cellxgene-census/venv/lib/python3.10/site-packages/scvi/_settings.py:63: UserWarning: Since v1.0.0, scvi-tools no longer uses a random seed by default. Run `scvi.settings.seed = 0` to reproduce results from previous versions.\n", " self.seed = seed\n", "/home/ssm-user/cellxgene-census/venv/lib/python3.10/site-packages/scvi/_settings.py:70: UserWarning: Setting `dl_pin_memory_gpu_training` is deprecated in v1.0 and will be removed in v1.1. Please pass in `pin_memory` to the data loaders instead.\n", " self.dl_pin_memory_gpu_training = (\n", "/home/ssm-user/cellxgene-census/venv/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", " from .autonotebook import tqdm as notebook_tqdm\n" ] } ], "source": [ "import cellxgene_census\n", "import numpy as np\n", "import scanpy as sc\n", "import scvi\n", "from scipy.sparse import csr_matrix" ] }, { "cell_type": "markdown", "id": "e13d1bdf", "metadata": {}, "source": [ "Now we can open the Census. If you are not familiar with the basics of the Census API, take a look at the notebook [Learning about the CELLxGENE Census](https://cellxgene-census.readthedocs.io/en/latest/notebooks/analysis_demo/comp_bio_census_info.html)." 
] }, { "cell_type": "code", "execution_count": 2, "id": "73d8c1bb", "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:31.290380Z", "iopub.status.busy": "2023-07-28T14:20:31.289801Z", "iopub.status.idle": "2023-07-28T14:20:31.679562Z", "shell.execute_reply": "2023-07-28T14:20:31.678931Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"latest\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n" ] } ], "source": [ "census = cellxgene_census.open_soma(census_version=\"latest\")" ] }, { "cell_type": "markdown", "id": "af907e87", "metadata": {}, "source": [ "In this notebook we will use Tabula Muris Senis data from the liver, as it contains cells from both 10X Genomics and Smart-Seq2 technologies.\n", "\n", "Let's query the `datasets` table of the Census by filtering on `collection_name` for \"Tabula Muris Senis\" and `dataset_title` for \"liver\"." ] }, { "cell_type": "code", "execution_count": 3, "id": "a5ea2757", "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:20:31.682499Z", "iopub.status.busy": "2023-07-28T14:20:31.682213Z", "iopub.status.idle": "2023-07-28T14:20:32.142204Z", "shell.execute_reply": "2023-07-28T14:20:32.141608Z" } }, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"<thead>\n",
"<tr><th></th><th>soma_joinid</th><th>collection_id</th><th>collection_name</th><th>collection_doi</th><th>dataset_id</th><th>dataset_title</th><th>dataset_h5ad_path</th><th>dataset_total_cell_count</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>13</th><td>525</td><td>0b9d8a04-bb9d-44da-aa27-705bb65b54eb</td><td>Tabula Muris Senis</td><td>10.1038/s41586-020-2496-1</td><td>4546e757-34d0-4d17-be06-538318925fcd</td><td>Liver - A single-cell transcriptomic atlas cha...</td><td>4546e757-34d0-4d17-be06-538318925fcd.h5ad</td><td>2859</td></tr>\n",
"<tr><th>34</th><td>547</td><td>0b9d8a04-bb9d-44da-aa27-705bb65b54eb</td><td>Tabula Muris Senis</td><td>10.1038/s41586-020-2496-1</td><td>6202a243-b713-4e12-9ced-c387f8483dea</td><td>Liver - A single-cell transcriptomic atlas cha...</td><td>6202a243-b713-4e12-9ced-c387f8483dea.h5ad</td><td>7294</td></tr>\n",
"</tbody>\n",
"</table>\n", "