{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Querying and fetching the single-cell data and cell/gene metadata.\n", "\n", "This tutorial showcases the easiest ways to query the expression data and cell/gene metadata from the Census, and load them into common in-memory Python objects, including `pandas.DataFrame` and `anndata.AnnData`.\n", "\n", "**Contents**\n", "\n", "1. Opening the census.\n", "2. Querying expression data.\n", "3. Querying cell metadata (obs).\n", "4. Querying gene metadata (var).\n", "\n", "⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable `is_primary_data` which is described in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#repeated-data).\n", "\n", "## Opening the census\n", "\n", "The `cellxgene_census` python package contains a convenient API to open the latest version of the Census." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:16:47.858717Z", "iopub.status.busy": "2023-07-28T16:16:47.858416Z", "iopub.status.idle": "2023-07-28T16:16:50.417890Z", "shell.execute_reply": "2023-07-28T16:16:50.417280Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n" ] } ], "source": [ "import cellxgene_census\n", "\n", "census = cellxgene_census.open_soma()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can learn more about the `cellxgene_census` methods by accessing their corresponding documentation via `help()`. For example `help(cellxgene_census.open_soma)`." 
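,
"\n",
"\n",
"The log message printed by `open_soma()` above recommends pinning a release. As a minimal sketch, assuming the `2023-07-25` release named in that message is still available, the Census can be opened with an explicit `census_version` inside a `with` block so that it is closed automatically. The name `pinned_census` is only illustrative; it is kept separate so that the `census` handle used in the rest of this tutorial stays open.\n",
"\n",
"```python\n",
"import cellxgene_census\n",
"\n",
"# Pin the release named in the log message so repeated runs read the same data.\n",
"with cellxgene_census.open_soma(census_version=\"2023-07-25\") as pinned_census:\n",
"    # List the top-level collections of the Census.\n",
"    print(list(pinned_census.keys()))\n",
"```"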
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Querying expression data\n", "\n", "A convenient way to query and fetch expression data is the `get_anndata` method of the `cellxgene_census` API. It combines the column selection and value filtering described in the metadata sections below to obtain slices of the expression data based on metadata queries.\n", "\n", "The method returns an `anndata.AnnData` object. It takes as input a census object and the name of an organism as a string. For both cell and gene metadata, filters and column selection can be specified with the following arguments:\n", "\n", "- `column_names`: a dictionary with two keys, `obs` and `var`, whose values are lists of strings indicating the columns to select for cell and gene metadata, respectively.\n", "- `obs_value_filter`: a Python expression with selection conditions to fetch **cells** meeting the given criteria. For full details see [tiledb.QueryCondition](https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#query-condition).\n", "- `var_value_filter`: a Python expression with selection conditions to fetch **genes** meeting the given criteria. For full details see [tiledb.QueryCondition](https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#query-condition).\n", "\n", "\n", "For example, suppose we want to fetch the expression data for:\n", "\n", "- Genes `\"ENSG00000161798\"` and `\"ENSG00000188229\"`.\n", "- All cells of type `\"B cell\"` from `\"lung\"` with `\"COVID-19\"`, restricted to non-duplicated cells.\n", "- All gene metadata, plus the `sex` cell metadata column."
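,
"\n",
"\n",
"Filter strings for a query like this one can get long. As a minimal sketch, a small helper can assemble them from a dictionary of conditions. `build_value_filter` is a hypothetical name used only for illustration; it is not part of `cellxgene_census`, and it only covers equality and membership tests joined with `and`.\n",
"\n",
"```python\n",
"def build_value_filter(conditions):\n",
"    \"\"\"Join column/value pairs into one filter string with `and`.\"\"\"\n",
"    parts = []\n",
"    for column, value in conditions.items():\n",
"        if isinstance(value, (list, tuple)):\n",
"            # Membership test, for example: feature_id in ['A', 'B']\n",
"            parts.append(f\"{column} in {list(value)!r}\")\n",
"        else:\n",
"            # repr() quotes strings and leaves booleans and numbers bare\n",
"            parts.append(f\"{column} == {value!r}\")\n",
"    return \" and \".join(parts)\n",
"\n",
"\n",
"obs_filter = build_value_filter(\n",
"    {\"cell_type\": \"B cell\", \"tissue_general\": \"lung\", \"disease\": \"COVID-19\", \"is_primary_data\": True}\n",
")\n",
"var_filter = build_value_filter({\"feature_id\": [\"ENSG00000161798\", \"ENSG00000188229\"]})\n",
"print(obs_filter)\n",
"print(var_filter)\n",
"```\n",
"\n",
"The query in the next cell writes the same two filters out by hand."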
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:17:02.614453Z", "iopub.status.busy": "2023-07-28T16:17:02.614195Z", "iopub.status.idle": "2023-07-28T16:17:25.266817Z", "shell.execute_reply": "2023-07-28T16:17:25.266197Z" } }, "outputs": [], "source": [ "adata = cellxgene_census.get_anndata(\n", " census=census,\n", " organism=\"Homo sapiens\",\n", " var_value_filter=\"feature_id in ['ENSG00000161798', 'ENSG00000188229']\",\n", " obs_value_filter=\"cell_type == 'B cell' and tissue_general == 'lung' and disease == 'COVID-19' and is_primary_data == True\",\n", " column_names={\"obs\": [\"sex\"]},\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And now we can take a look at the results." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:17:25.269991Z", "iopub.status.busy": "2023-07-28T16:17:25.269539Z", "iopub.status.idle": "2023-07-28T16:17:25.273729Z", "shell.execute_reply": "2023-07-28T16:17:25.273120Z" } }, "outputs": [ { "data": { "text/plain": [ "AnnData object with n_obs × n_vars = 2313 × 2\n", " obs: 'sex', 'cell_type', 'tissue_general', 'disease', 'is_primary_data'\n", " var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:17:25.276258Z", "iopub.status.busy": "2023-07-28T16:17:25.275933Z", "iopub.status.idle": "2023-07-28T16:17:25.283846Z", "shell.execute_reply": "2023-07-28T16:17:25.283232Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>sex</th><th>cell_type</th><th>tissue_general</th><th>disease</th><th>is_primary_data</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>male</td><td>B cell</td><td>lung</td><td>COVID-19</td><td>True</td></tr>\n",
"    <tr><th>1</th><td>male</td><td>B cell</td><td>lung</td><td>COVID-19</td><td>True</td></tr>\n",
"    <tr><th>2</th><td>unknown</td><td>B cell</td><td>lung</td><td>COVID-19</td><td>True</td></tr>\n",
"    <tr><th>3</th><td>male</td><td>B cell</td><td>lung</td><td>COVID-19</td><td>True</td></tr>\n",
"    <tr><th>4</th><td>unknown</td><td>B cell</td><td>lung</td><td>COVID-19</td><td>True</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>2308</th><td>male</td><td>B cell</td><td>lung</td><td>COVID-19</td><td>True</td></tr>\n",
"    <tr><th>2309</th><td>male</td><td>B cell</td><td>lung</td><td>COVID-19</td><td>True</td></tr>\n",
"    <tr><th>2310</th><td>male</td><td>B cell</td><td>lung</td><td>COVID-19</td><td>True</td></tr>\n",
"    <tr><th>2311</th><td>male</td><td>B cell</td><td>lung</td><td>COVID-19</td><td>True</td></tr>\n",
"    <tr><th>2312</th><td>male</td><td>B cell</td><td>lung</td><td>COVID-19</td><td>True</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>2313 rows × 5 columns</p>\n",
"</div>
" ], "text/plain": [ " sex cell_type tissue_general disease is_primary_data\n", "0 male B cell lung COVID-19 True\n", "1 male B cell lung COVID-19 True\n", "2 unknown B cell lung COVID-19 True\n", "3 male B cell lung COVID-19 True\n", "4 unknown B cell lung COVID-19 True\n", "... ... ... ... ... ...\n", "2308 male B cell lung COVID-19 True\n", "2309 male B cell lung COVID-19 True\n", "2310 male B cell lung COVID-19 True\n", "2311 male B cell lung COVID-19 True\n", "2312 male B cell lung COVID-19 True\n", "\n", "[2313 rows x 5 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.obs" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:17:25.286287Z", "iopub.status.busy": "2023-07-28T16:17:25.285970Z", "iopub.status.idle": "2023-07-28T16:17:25.291899Z", "shell.execute_reply": "2023-07-28T16:17:25.291285Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>soma_joinid</th><th>feature_id</th><th>feature_name</th><th>feature_length</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>8626</td><td>ENSG00000161798</td><td>AQP5</td><td>1884</td></tr>\n",
"    <tr><th>1</th><td>27047</td><td>ENSG00000188229</td><td>TUBB4B</td><td>2037</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " soma_joinid feature_id feature_name feature_length\n", "0 8626 ENSG00000161798 AQP5 1884\n", "1 27047 ENSG00000188229 TUBB4B 2037" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adata.var" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For a full description of `get_anndata()` refer to `help(cellxgene_census.get_anndata)`.\n", "\n", "Don't forget to close the census with `census.close()` once you have finished querying it!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Querying cell metadata (obs)\n", "\n", "The human cell metadata of the Census, for RNA assays, is located at `census[\"census_data\"][\"homo_sapiens\"].obs`. This is a `SOMADataFrame`, and as such it can be materialized as a `pandas.DataFrame` via the methods `read().concat().to_pandas()`.\n", "\n", "The mouse cell metadata is at `census[\"census_data\"][\"mus_musculus\"].obs`.\n", "\n", "For slicing the cell metadata there are two relevant arguments that can be passed to `read()`:\n", "\n", "- `column_names`: a list of strings indicating which metadata columns to fetch.\n", "- `value_filter`: a Python expression with selection conditions to fetch rows. It is similar to [pandas.DataFrame.query()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html); for full details see [tiledb.QueryCondition](https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#query-condition). In short:\n", "    - Expressions are one or more comparisons.\n", "    - Comparisons are one of `<column> <op> <value>` or `<column> <op> <column>`.\n", "    - Expressions can combine comparisons using `and`, `or`, `&` or `|`.\n", "    - `op` is one of `<`, `>`, `<=`, `>=`, `==`, `!=` or `in`.\n", "\n", "To learn which metadata columns are available for fetching and filtering, we can look directly at the keys of the cell metadata."
] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:16:50.420972Z", "iopub.status.busy": "2023-07-28T16:16:50.420548Z", "iopub.status.idle": "2023-07-28T16:16:50.757450Z", "shell.execute_reply": "2023-07-28T16:16:50.756926Z" }, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "['soma_joinid',\n", " 'dataset_id',\n", " 'assay',\n", " 'assay_ontology_term_id',\n", " 'cell_type',\n", " 'cell_type_ontology_term_id',\n", " 'development_stage',\n", " 'development_stage_ontology_term_id',\n", " 'disease',\n", " 'disease_ontology_term_id',\n", " 'donor_id',\n", " 'is_primary_data',\n", " 'self_reported_ethnicity',\n", " 'self_reported_ethnicity_ontology_term_id',\n", " 'sex',\n", " 'sex_ontology_term_id',\n", " 'suspension_type',\n", " 'tissue',\n", " 'tissue_ontology_term_id',\n", " 'tissue_general',\n", " 'tissue_general_ontology_term_id']" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "keys = list(census[\"census_data\"][\"homo_sapiens\"].obs.keys())\n", "\n", "keys" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`soma_joinid` is a special `SOMADataFrame` column that is used for join operations. The definition for all other columns can be found at the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cell_census_schema.md#cell-metadata--census_objcensus_dataorganismobs--somadataframe).\n", "\n", "All of these can be used to fetch specific columns or specific rows matching a condition. For the latter we need to know the values we are looking for _a priori_.\n", "\n", "For example, let's see which values are available for `sex`. To do this, we can read the cell metadata while fetching only the column `sex`.
" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:16:50.760164Z", "iopub.status.busy": "2023-07-28T16:16:50.759744Z", "iopub.status.idle": "2023-07-28T16:16:53.840821Z", "shell.execute_reply": "2023-07-28T16:16:53.840248Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>sex</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>unknown</td></tr>\n",
"    <tr><th>669</th><td>female</td></tr>\n",
"    <tr><th>385437</th><td>male</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " sex\n", "0 unknown\n", "669 female\n", "385437 male" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sex_cell_metadata = census[\"census_data\"][\"homo_sapiens\"].obs.read(column_names=[\"sex\"]).concat().to_pandas()\n", "\n", "sex_cell_metadata.drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, there are only three different values for `sex`: `\"male\"`, `\"female\"` and `\"unknown\"`.\n", "\n", "With this information we can fetch all cell metadata for a specific `sex` value, for example `\"unknown\"`." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:16:53.843663Z", "iopub.status.busy": "2023-07-28T16:16:53.843138Z", "iopub.status.idle": "2023-07-28T16:17:00.444626Z", "shell.execute_reply": "2023-07-28T16:17:00.444078Z" } }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>soma_joinid</th><th>dataset_id</th><th>assay</th><th>assay_ontology_term_id</th><th>cell_type</th><th>cell_type_ontology_term_id</th><th>development_stage</th><th>development_stage_ontology_term_id</th><th>disease</th><th>disease_ontology_term_id</th><th>...</th><th>is_primary_data</th><th>self_reported_ethnicity</th><th>self_reported_ethnicity_ontology_term_id</th><th>sex</th><th>sex_ontology_term_id</th><th>suspension_type</th><th>tissue</th><th>tissue_ontology_term_id</th><th>tissue_general</th><th>tissue_general_ontology_term_id</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>0</td><td>f171db61-e57e-4535-a06a-35d8b6ef8f2b</td><td>10x 3' v3</td><td>EFO:0009922</td><td>syncytiotrophoblast cell</td><td>CL:0000525</td><td>9th week post-fertilization human stage</td><td>HsapDv:0000046</td><td>normal</td><td>PATO:0000461</td><td>...</td><td>False</td><td>unknown</td><td>unknown</td><td>unknown</td><td>unknown</td><td>nucleus</td><td>decidua basalis</td><td>UBERON:0000453</td><td>placenta</td><td>UBERON:0001987</td></tr>\n",
"    <tr><th>1</th><td>1</td><td>f171db61-e57e-4535-a06a-35d8b6ef8f2b</td><td>10x 3' v3</td><td>EFO:0009922</td><td>placental villous trophoblast</td><td>CL:2000060</td><td>9th week post-fertilization human stage</td><td>HsapDv:0000046</td><td>normal</td><td>PATO:0000461</td><td>...</td><td>False</td><td>unknown</td><td>unknown</td><td>unknown</td><td>unknown</td><td>nucleus</td><td>decidua basalis</td><td>UBERON:0000453</td><td>placenta</td><td>UBERON:0001987</td></tr>\n",
"    <tr><th>2</th><td>2</td><td>f171db61-e57e-4535-a06a-35d8b6ef8f2b</td><td>10x 3' v3</td><td>EFO:0009922</td><td>syncytiotrophoblast cell</td><td>CL:0000525</td><td>9th week post-fertilization human stage</td><td>HsapDv:0000046</td><td>normal</td><td>PATO:0000461</td><td>...</td><td>False</td><td>unknown</td><td>unknown</td><td>unknown</td><td>unknown</td><td>nucleus</td><td>decidua basalis</td><td>UBERON:0000453</td><td>placenta</td><td>UBERON:0001987</td></tr>\n",
"    <tr><th>3</th><td>3</td><td>f171db61-e57e-4535-a06a-35d8b6ef8f2b</td><td>10x 3' v3</td><td>EFO:0009922</td><td>syncytiotrophoblast cell</td><td>CL:0000525</td><td>9th week post-fertilization human stage</td><td>HsapDv:0000046</td><td>normal</td><td>PATO:0000461</td><td>...</td><td>False</td><td>unknown</td><td>unknown</td><td>unknown</td><td>unknown</td><td>nucleus</td><td>decidua basalis</td><td>UBERON:0000453</td><td>placenta</td><td>UBERON:0001987</td></tr>\n",
"    <tr><th>4</th><td>4</td><td>f171db61-e57e-4535-a06a-35d8b6ef8f2b</td><td>10x 3' v3</td><td>EFO:0009922</td><td>extravillous trophoblast</td><td>CL:0008036</td><td>9th week post-fertilization human stage</td><td>HsapDv:0000046</td><td>normal</td><td>PATO:0000461</td><td>...</td><td>False</td><td>unknown</td><td>unknown</td><td>unknown</td><td>unknown</td><td>nucleus</td><td>decidua basalis</td><td>UBERON:0000453</td><td>placenta</td><td>UBERON:0001987</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>3251329</th><td>56274573</td><td>2adb1f8a-a6b1-4909-8ee8-484814e2d4bf</td><td>microwell-seq</td><td>EFO:0030002</td><td>cord blood hematopoietic stem cell</td><td>CL:2000095</td><td>newborn human stage</td><td>HsapDv:0000082</td><td>normal</td><td>PATO:0000461</td><td>...</td><td>True</td><td>Han Chinese</td><td>HANCESTRO:0027</td><td>unknown</td><td>unknown</td><td>cell</td><td>umbilical cord blood</td><td>UBERON:0012168</td><td>blood</td><td>UBERON:0000178</td></tr>\n",
"    <tr><th>3251330</th><td>56274574</td><td>2adb1f8a-a6b1-4909-8ee8-484814e2d4bf</td><td>microwell-seq</td><td>EFO:0030002</td><td>cord blood hematopoietic stem cell</td><td>CL:2000095</td><td>newborn human stage</td><td>HsapDv:0000082</td><td>normal</td><td>PATO:0000461</td><td>...</td><td>True</td><td>Han Chinese</td><td>HANCESTRO:0027</td><td>unknown</td><td>unknown</td><td>cell</td><td>umbilical cord blood</td><td>UBERON:0012168</td><td>blood</td><td>UBERON:0000178</td></tr>\n",
"    <tr><th>3251331</th><td>56274575</td><td>2adb1f8a-a6b1-4909-8ee8-484814e2d4bf</td><td>microwell-seq</td><td>EFO:0030002</td><td>cord blood hematopoietic stem cell</td><td>CL:2000095</td><td>newborn human stage</td><td>HsapDv:0000082</td><td>normal</td><td>PATO:0000461</td><td>...</td><td>True</td><td>Han Chinese</td><td>HANCESTRO:0027</td><td>unknown</td><td>unknown</td><td>cell</td><td>umbilical cord blood</td><td>UBERON:0012168</td><td>blood</td><td>UBERON:0000178</td></tr>\n",
"    <tr><th>3251332</th><td>56274576</td><td>2adb1f8a-a6b1-4909-8ee8-484814e2d4bf</td><td>microwell-seq</td><td>EFO:0030002</td><td>cord blood hematopoietic stem cell</td><td>CL:2000095</td><td>newborn human stage</td><td>HsapDv:0000082</td><td>normal</td><td>PATO:0000461</td><td>...</td><td>True</td><td>Han Chinese</td><td>HANCESTRO:0027</td><td>unknown</td><td>unknown</td><td>cell</td><td>umbilical cord blood</td><td>UBERON:0012168</td><td>blood</td><td>UBERON:0000178</td></tr>\n",
"    <tr><th>3251333</th><td>56274577</td><td>2adb1f8a-a6b1-4909-8ee8-484814e2d4bf</td><td>microwell-seq</td><td>EFO:0030002</td><td>cord blood hematopoietic stem cell</td><td>CL:2000095</td><td>newborn human stage</td><td>HsapDv:0000082</td><td>normal</td><td>PATO:0000461</td><td>...</td><td>True</td><td>Han Chinese</td><td>HANCESTRO:0027</td><td>unknown</td><td>unknown</td><td>cell</td><td>umbilical cord blood</td><td>UBERON:0012168</td><td>blood</td><td>UBERON:0000178</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>3251334 rows × 21 columns</p>\n",
"</div>
" ], "text/plain": [ " soma_joinid dataset_id assay \\\n", "0 0 f171db61-e57e-4535-a06a-35d8b6ef8f2b 10x 3' v3 \n", "1 1 f171db61-e57e-4535-a06a-35d8b6ef8f2b 10x 3' v3 \n", "2 2 f171db61-e57e-4535-a06a-35d8b6ef8f2b 10x 3' v3 \n", "3 3 f171db61-e57e-4535-a06a-35d8b6ef8f2b 10x 3' v3 \n", "4 4 f171db61-e57e-4535-a06a-35d8b6ef8f2b 10x 3' v3 \n", "... ... ... ... \n", "3251329 56274573 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf microwell-seq \n", "3251330 56274574 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf microwell-seq \n", "3251331 56274575 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf microwell-seq \n", "3251332 56274576 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf microwell-seq \n", "3251333 56274577 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf microwell-seq \n", "\n", " assay_ontology_term_id cell_type \\\n", "0 EFO:0009922 syncytiotrophoblast cell \n", "1 EFO:0009922 placental villous trophoblast \n", "2 EFO:0009922 syncytiotrophoblast cell \n", "3 EFO:0009922 syncytiotrophoblast cell \n", "4 EFO:0009922 extravillous trophoblast \n", "... ... ... \n", "3251329 EFO:0030002 cord blood hematopoietic stem cell \n", "3251330 EFO:0030002 cord blood hematopoietic stem cell \n", "3251331 EFO:0030002 cord blood hematopoietic stem cell \n", "3251332 EFO:0030002 cord blood hematopoietic stem cell \n", "3251333 EFO:0030002 cord blood hematopoietic stem cell \n", "\n", " cell_type_ontology_term_id development_stage \\\n", "0 CL:0000525 9th week post-fertilization human stage \n", "1 CL:2000060 9th week post-fertilization human stage \n", "2 CL:0000525 9th week post-fertilization human stage \n", "3 CL:0000525 9th week post-fertilization human stage \n", "4 CL:0008036 9th week post-fertilization human stage \n", "... ... ... 
\n", "3251329 CL:2000095 newborn human stage \n", "3251330 CL:2000095 newborn human stage \n", "3251331 CL:2000095 newborn human stage \n", "3251332 CL:2000095 newborn human stage \n", "3251333 CL:2000095 newborn human stage \n", "\n", " development_stage_ontology_term_id disease disease_ontology_term_id \\\n", "0 HsapDv:0000046 normal PATO:0000461 \n", "1 HsapDv:0000046 normal PATO:0000461 \n", "2 HsapDv:0000046 normal PATO:0000461 \n", "3 HsapDv:0000046 normal PATO:0000461 \n", "4 HsapDv:0000046 normal PATO:0000461 \n", "... ... ... ... \n", "3251329 HsapDv:0000082 normal PATO:0000461 \n", "3251330 HsapDv:0000082 normal PATO:0000461 \n", "3251331 HsapDv:0000082 normal PATO:0000461 \n", "3251332 HsapDv:0000082 normal PATO:0000461 \n", "3251333 HsapDv:0000082 normal PATO:0000461 \n", "\n", " ... is_primary_data self_reported_ethnicity \\\n", "0 ... False unknown \n", "1 ... False unknown \n", "2 ... False unknown \n", "3 ... False unknown \n", "4 ... False unknown \n", "... ... ... ... \n", "3251329 ... True Han Chinese \n", "3251330 ... True Han Chinese \n", "3251331 ... True Han Chinese \n", "3251332 ... True Han Chinese \n", "3251333 ... True Han Chinese \n", "\n", " self_reported_ethnicity_ontology_term_id sex \\\n", "0 unknown unknown \n", "1 unknown unknown \n", "2 unknown unknown \n", "3 unknown unknown \n", "4 unknown unknown \n", "... ... ... \n", "3251329 HANCESTRO:0027 unknown \n", "3251330 HANCESTRO:0027 unknown \n", "3251331 HANCESTRO:0027 unknown \n", "3251332 HANCESTRO:0027 unknown \n", "3251333 HANCESTRO:0027 unknown \n", "\n", " sex_ontology_term_id suspension_type tissue \\\n", "0 unknown nucleus decidua basalis \n", "1 unknown nucleus decidua basalis \n", "2 unknown nucleus decidua basalis \n", "3 unknown nucleus decidua basalis \n", "4 unknown nucleus decidua basalis \n", "... ... ... ... 
\n", "3251329 unknown cell umbilical cord blood \n", "3251330 unknown cell umbilical cord blood \n", "3251331 unknown cell umbilical cord blood \n", "3251332 unknown cell umbilical cord blood \n", "3251333 unknown cell umbilical cord blood \n", "\n", " tissue_ontology_term_id tissue_general tissue_general_ontology_term_id \n", "0 UBERON:0000453 placenta UBERON:0001987 \n", "1 UBERON:0000453 placenta UBERON:0001987 \n", "2 UBERON:0000453 placenta UBERON:0001987 \n", "3 UBERON:0000453 placenta UBERON:0001987 \n", "4 UBERON:0000453 placenta UBERON:0001987 \n", "... ... ... ... \n", "3251329 UBERON:0012168 blood UBERON:0000178 \n", "3251330 UBERON:0012168 blood UBERON:0000178 \n", "3251331 UBERON:0012168 blood UBERON:0000178 \n", "3251332 UBERON:0012168 blood UBERON:0000178 \n", "3251333 UBERON:0012168 blood UBERON:0000178 \n", "\n", "[3251334 rows x 21 columns]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cell_metadata_all_unknown_sex = (\n", " census[\"census_data\"][\"homo_sapiens\"].obs.read(value_filter=\"sex == 'unknown'\").concat().to_pandas()\n", ")\n", "\n", "cell_metadata_all_unknown_sex" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can use both `column_names` and `value_filter` to perform specific queries. For example, let's fetch the `disease` column for cells with `cell_type` `\"B cell\"` and `tissue_general` `\"lung\"`, restricted to non-duplicated cells. Note that the columns used in `value_filter` are returned as well, as the output below shows.
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:17:00.447314Z", "iopub.status.busy": "2023-07-28T16:17:00.447043Z", "iopub.status.idle": "2023-07-28T16:17:01.929686Z", "shell.execute_reply": "2023-07-28T16:17:01.929083Z" } }, "outputs": [ { "data": { "text/plain": [ "disease cell_type tissue_general is_primary_data\n", "lung adenocarcinoma B cell lung True 42720\n", "squamous cell lung carcinoma B cell lung True 10631\n", "non-small cell lung carcinoma B cell lung True 8742\n", "normal B cell lung True 8187\n", "COVID-19 B cell lung True 2313\n", "chronic obstructive pulmonary disease B cell lung True 2083\n", "lung large cell carcinoma B cell lung True 1534\n", "pulmonary emphysema B cell lung True 1512\n", "pulmonary fibrosis B cell lung True 1474\n", "pleomorphic carcinoma B cell lung True 1210\n", "interstitial lung disease B cell lung True 332\n", "small cell lung carcinoma B cell lung True 204\n", "lymphangioleiomyomatosis B cell lung True 133\n", "pneumonia B cell lung True 50\n", "Name: count, dtype: int64" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cell_metadata_b_cell = (\n", " census[\"census_data\"][\"homo_sapiens\"]\n", " .obs.read(\n", " value_filter=\"cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data==True\",\n", " column_names=[\"disease\"],\n", " )\n", " .concat()\n", " .to_pandas()\n", ")\n", "\n", "cell_metadata_b_cell.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Querying gene metadata (var)\n", "\n", "The human gene metadata of the Census is located at `census[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].var`. 
Similarly to the cell metadata, it is a `SOMADataFrame`, so we can also use its `read()` method.\n", "\n", "The mouse gene metadata is at `census[\"census_data\"][\"mus_musculus\"].ms[\"RNA\"].var`.\n", "\n", "Let's take a look at the metadata available for column selection and row filtering." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:17:01.932237Z", "iopub.status.busy": "2023-07-28T16:17:01.931961Z", "iopub.status.idle": "2023-07-28T16:17:02.338545Z", "shell.execute_reply": "2023-07-28T16:17:02.337996Z" } }, "outputs": [ { "data": { "text/plain": [ "['soma_joinid', 'feature_id', 'feature_name', 'feature_length']" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "keys = list(census[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].var.keys())\n", "\n", "keys" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "With the exception of `soma_joinid`, these columns are defined in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cell_census_schema_0.1.0.md). As with the cell metadata, the same operations can be used to explore and fetch gene metadata.\n", "\n", "For example, to get the `feature_name` and `feature_length` of the genes `\"ENSG00000161798\"` and `\"ENSG00000188229\"` we can do the following." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:17:02.341116Z", "iopub.status.busy": "2023-07-28T16:17:02.340856Z", "iopub.status.idle": "2023-07-28T16:17:02.611956Z", "shell.execute_reply": "2023-07-28T16:17:02.611388Z" } }, "outputs": [ { "data": { "text/html": [ "
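<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>feature_name</th><th>feature_length</th><th>feature_id</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>AQP5</td><td>1884</td><td>ENSG00000161798</td></tr>\n",
"    <tr><th>1</th><td>TUBB4B</td><td>2037</td><td>ENSG00000188229</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>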
" ], "text/plain": [ " feature_name feature_length feature_id\n", "0 AQP5 1884 ENSG00000161798\n", "1 TUBB4B 2037 ENSG00000188229" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "gene_metadata = (\n", " census[\"census_data\"][\"homo_sapiens\"]\n", " .ms[\"RNA\"]\n", " .var.read(\n", " value_filter=\"feature_id in ['ENSG00000161798', 'ENSG00000188229']\",\n", " column_names=[\"feature_name\", \"feature_length\"],\n", " )\n", " .concat()\n", " .to_pandas()\n", ")\n", "\n", "gene_metadata" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T16:17:25.294413Z", "iopub.status.busy": "2023-07-28T16:17:25.294099Z", "iopub.status.idle": "2023-07-28T16:17:25.296921Z", "shell.execute_reply": "2023-07-28T16:17:25.296417Z" } }, "outputs": [], "source": [ "census.close()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.10" }, "vscode": { "interpreter": { "hash": "3da8ec1c162cd849e59e6ea2824b2e353dce799884e910aae99411be5277f953" } } }, "nbformat": 4, "nbformat_minor": 2 }