{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Querying and fetching the single-cell data and cell/gene metadata.\n",
    "\n",
    "This tutorial showcases the easiest ways to query the expression data and cell/gene metadata from the Census, and load them into common in-memory Python objects, including `pandas.DataFrame` and `anndata.AnnData`.\n",
    "\n",
    "**Contents**\n",
    "\n",
    "1. Opening the census.\n",
    "2. Querying expression data.\n",
    "3. Querying cell metadata (obs).\n",
    "4. Querying gene metadata (var).\n",
    "\n",
    "⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable `is_primary_data` which is described in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#repeated-data).\n",
    "\n",
    "## Opening the census\n",
    "\n",
    "The `cellxgene_census` python package contains a convenient API to open the latest version of the Census."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:16:47.858717Z",
     "iopub.status.busy": "2023-07-28T16:16:47.858416Z",
     "iopub.status.idle": "2023-07-28T16:16:50.417890Z",
     "shell.execute_reply": "2023-07-28T16:16:50.417280Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    }
   ],
   "source": [
    "import cellxgene_census\n",
    "\n",
    "census = cellxgene_census.open_soma()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can learn more about the `cellxgene_census` methods by accessing their corresponding documentation via `help()`. For example `help(cellxgene_census.open_soma)`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Querying expression data\n",
    "\n",
    "A convenient way to query and fetch expression data is to use the `get_anndata` method of the `cellxgene_census` API. This is a method that combines the column selection and value filtering we described above to obtain slices of the expression data based on metadata queries.\n",
    "\n",
    "The method will return an `anndata.AnnData` object, it takes as an input a census object, the string for an organism, and for both cell and gene metadata we can specify filters and column selection as described above but with the following arguments:\n",
    "\n",
    "- `obs_column_names` and `var_column_names` — a pair of arguments whose values are lists of strings indicating the columns to select for cell (`obs`) and gene (`var`) metadata respectively.\n",
    "- `obs_value_filter` —  python expression with selection conditions to fetch **cells** meeting a criteria. For full details see [tiledb.QueryCondition](https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#query-condition).\n",
    "- `var_value_filter` —  python expression with selection conditions to fetch **genes** meeting a criteria. Details as above.  For full details see [tiledb.QueryCondition](https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#query-condition).\n",
    "\n",
    "\n",
    "For example if we want to fetch the expression data for:\n",
    "\n",
    "- Genes `\"ENSG00000161798\"` and `\"ENSG00000188229\"`.\n",
    "- All `\"B cells\"` of `\"lung\"` with `\"COVID-19\"` from non-duplicated cells.\n",
    "- With all gene metadata and adding `sex` cell metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:17:02.614453Z",
     "iopub.status.busy": "2023-07-28T16:17:02.614195Z",
     "iopub.status.idle": "2023-07-28T16:17:25.266817Z",
     "shell.execute_reply": "2023-07-28T16:17:25.266197Z"
    }
   },
   "outputs": [],
   "source": [
    "adata = cellxgene_census.get_anndata(\n",
    "    census=census,\n",
    "    organism=\"Homo sapiens\",\n",
    "    var_value_filter=\"feature_id in ['ENSG00000161798', 'ENSG00000188229']\",\n",
    "    obs_value_filter=\"cell_type == 'B cell' and tissue_general == 'lung' and disease == 'COVID-19' and is_primary_data == True\",\n",
    "    obs_column_names=[\"sex\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "And now we can take a look at the results."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:17:25.269991Z",
     "iopub.status.busy": "2023-07-28T16:17:25.269539Z",
     "iopub.status.idle": "2023-07-28T16:17:25.273729Z",
     "shell.execute_reply": "2023-07-28T16:17:25.273120Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "AnnData object with n_obs × n_vars = 2313 × 2\n",
       "    obs: 'sex', 'cell_type', 'tissue_general', 'disease', 'is_primary_data'\n",
       "    var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:17:25.276258Z",
     "iopub.status.busy": "2023-07-28T16:17:25.275933Z",
     "iopub.status.idle": "2023-07-28T16:17:25.283846Z",
     "shell.execute_reply": "2023-07-28T16:17:25.283232Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sex</th>\n",
       "      <th>cell_type</th>\n",
       "      <th>tissue_general</th>\n",
       "      <th>disease</th>\n",
       "      <th>is_primary_data</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>male</td>\n",
       "      <td>B cell</td>\n",
       "      <td>lung</td>\n",
       "      <td>COVID-19</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>male</td>\n",
       "      <td>B cell</td>\n",
       "      <td>lung</td>\n",
       "      <td>COVID-19</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>unknown</td>\n",
       "      <td>B cell</td>\n",
       "      <td>lung</td>\n",
       "      <td>COVID-19</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>male</td>\n",
       "      <td>B cell</td>\n",
       "      <td>lung</td>\n",
       "      <td>COVID-19</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>unknown</td>\n",
       "      <td>B cell</td>\n",
       "      <td>lung</td>\n",
       "      <td>COVID-19</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2308</th>\n",
       "      <td>male</td>\n",
       "      <td>B cell</td>\n",
       "      <td>lung</td>\n",
       "      <td>COVID-19</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2309</th>\n",
       "      <td>male</td>\n",
       "      <td>B cell</td>\n",
       "      <td>lung</td>\n",
       "      <td>COVID-19</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2310</th>\n",
       "      <td>male</td>\n",
       "      <td>B cell</td>\n",
       "      <td>lung</td>\n",
       "      <td>COVID-19</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2311</th>\n",
       "      <td>male</td>\n",
       "      <td>B cell</td>\n",
       "      <td>lung</td>\n",
       "      <td>COVID-19</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2312</th>\n",
       "      <td>male</td>\n",
       "      <td>B cell</td>\n",
       "      <td>lung</td>\n",
       "      <td>COVID-19</td>\n",
       "      <td>True</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>2313 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "          sex cell_type tissue_general   disease  is_primary_data\n",
       "0        male    B cell           lung  COVID-19             True\n",
       "1        male    B cell           lung  COVID-19             True\n",
       "2     unknown    B cell           lung  COVID-19             True\n",
       "3        male    B cell           lung  COVID-19             True\n",
       "4     unknown    B cell           lung  COVID-19             True\n",
       "...       ...       ...            ...       ...              ...\n",
       "2308     male    B cell           lung  COVID-19             True\n",
       "2309     male    B cell           lung  COVID-19             True\n",
       "2310     male    B cell           lung  COVID-19             True\n",
       "2311     male    B cell           lung  COVID-19             True\n",
       "2312     male    B cell           lung  COVID-19             True\n",
       "\n",
       "[2313 rows x 5 columns]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata.obs"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:17:25.286287Z",
     "iopub.status.busy": "2023-07-28T16:17:25.285970Z",
     "iopub.status.idle": "2023-07-28T16:17:25.291899Z",
     "shell.execute_reply": "2023-07-28T16:17:25.291285Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>feature_id</th>\n",
       "      <th>feature_name</th>\n",
       "      <th>feature_length</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>8626</td>\n",
       "      <td>ENSG00000161798</td>\n",
       "      <td>AQP5</td>\n",
       "      <td>1884</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>27047</td>\n",
       "      <td>ENSG00000188229</td>\n",
       "      <td>TUBB4B</td>\n",
       "      <td>2037</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   soma_joinid       feature_id feature_name  feature_length\n",
       "0         8626  ENSG00000161798         AQP5            1884\n",
       "1        27047  ENSG00000188229       TUBB4B            2037"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "adata.var"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a full description of `get_anndata()` refer to `help(cellxgene_census.get_anndata)`\n",
    "\n",
    "Don't forget to close the census!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Querying cell metadata (obs)\n",
    "\n",
    "The human gene metadata of the Census, for RNA assays, is located at `census[\"census_data\"][\"homo_sapiens\"].obs`. This is a `SOMADataFrame` and as such it can be materialized as a `pandas.DataFrame` via the methods `read().concat().to_pandas()`. See also, the helper function `cellxgene_census.get_obs` which removes some boiler plate.\n",
    "\n",
    "The mouse cell metadata is at `census[\"census_data\"][\"mus_musculus\"].obs`.\n",
    "\n",
    "For slicing the cell metadata there are two relevant arguments that can be passed through `read()`:\n",
    "\n",
    "- `column_names` — list of strings indicating what metadata columns to fetch. \n",
    "- `value_filter` — Python expression with selection conditions to fetch rows, it is similar to [pandas.DataFrame.query()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html), for full details see [tiledb.QueryCondition](https://tiledb-inc-tiledb.readthedocs-hosted.com/projects/tiledb-py/en/stable/python-api.html#query-condition) shortly:\n",
    "   - Expressions are one or more comparisons\n",
    "   - Comparisons are one of `<column> <op> <value>` or `<column> <op> <column>`\n",
    "   - Expressions can combine comparisons using and, or, & or |\n",
    "   - op is one of < | > | <= | >= | == | != or in\n",
    "\n",
    "To learn what metadata columns are available for fetching and filtering we can directly look at the keys of the cell metadata."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:16:50.420972Z",
     "iopub.status.busy": "2023-07-28T16:16:50.420548Z",
     "iopub.status.idle": "2023-07-28T16:16:50.757450Z",
     "shell.execute_reply": "2023-07-28T16:16:50.756926Z"
    },
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['soma_joinid',\n",
       " 'dataset_id',\n",
       " 'assay',\n",
       " 'assay_ontology_term_id',\n",
       " 'cell_type',\n",
       " 'cell_type_ontology_term_id',\n",
       " 'development_stage',\n",
       " 'development_stage_ontology_term_id',\n",
       " 'disease',\n",
       " 'disease_ontology_term_id',\n",
       " 'donor_id',\n",
       " 'is_primary_data',\n",
       " 'self_reported_ethnicity',\n",
       " 'self_reported_ethnicity_ontology_term_id',\n",
       " 'sex',\n",
       " 'sex_ontology_term_id',\n",
       " 'suspension_type',\n",
       " 'tissue',\n",
       " 'tissue_ontology_term_id',\n",
       " 'tissue_general',\n",
       " 'tissue_general_ontology_term_id']"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "keys = list(census[\"census_data\"][\"homo_sapiens\"].obs.keys())\n",
    "\n",
    "keys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`soma_joinid` is a special `SOMADataFrame` column that is used for join operations. The definition for all other columns can be found at the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cell_census_schema.md#cell-metadata--census_objcensus_dataorganismobs--somadataframe).\n",
    "\n",
    "All of these can be used to fetch specific columns or specific rows matching a condition. For the latter we need to know the values we are looking for _a priori_.\n",
    "\n",
    "For example let's see what are the possible values available for `sex`. To this we can load all cell metadata but fetching only for the column `sex`. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:16:50.760164Z",
     "iopub.status.busy": "2023-07-28T16:16:50.759744Z",
     "iopub.status.idle": "2023-07-28T16:16:53.840821Z",
     "shell.execute_reply": "2023-07-28T16:16:53.840248Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>sex</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>unknown</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>669</th>\n",
       "      <td>female</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>385437</th>\n",
       "      <td>male</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "            sex\n",
       "0       unknown\n",
       "669      female\n",
       "385437     male"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sex_cell_metadata = cellxgene_census.get_obs(census, \"homo_sapiens\", column_names=[\"sex\"])\n",
    "\n",
    "sex_cell_metadata.drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see there are only three different values for `sex`, that is `\"male\"`, `\"female\"` and `\"unknown\"`. \n",
    "\n",
    "With this information we can fetch all cell metatadata for a specific `sex` value, for example `\"unknown\"`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:16:53.843663Z",
     "iopub.status.busy": "2023-07-28T16:16:53.843138Z",
     "iopub.status.idle": "2023-07-28T16:17:00.444626Z",
     "shell.execute_reply": "2023-07-28T16:17:00.444078Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>soma_joinid</th>\n",
       "      <th>dataset_id</th>\n",
       "      <th>assay</th>\n",
       "      <th>assay_ontology_term_id</th>\n",
       "      <th>cell_type</th>\n",
       "      <th>cell_type_ontology_term_id</th>\n",
       "      <th>development_stage</th>\n",
       "      <th>development_stage_ontology_term_id</th>\n",
       "      <th>disease</th>\n",
       "      <th>disease_ontology_term_id</th>\n",
       "      <th>...</th>\n",
       "      <th>is_primary_data</th>\n",
       "      <th>self_reported_ethnicity</th>\n",
       "      <th>self_reported_ethnicity_ontology_term_id</th>\n",
       "      <th>sex</th>\n",
       "      <th>sex_ontology_term_id</th>\n",
       "      <th>suspension_type</th>\n",
       "      <th>tissue</th>\n",
       "      <th>tissue_ontology_term_id</th>\n",
       "      <th>tissue_general</th>\n",
       "      <th>tissue_general_ontology_term_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>f171db61-e57e-4535-a06a-35d8b6ef8f2b</td>\n",
       "      <td>10x 3' v3</td>\n",
       "      <td>EFO:0009922</td>\n",
       "      <td>syncytiotrophoblast cell</td>\n",
       "      <td>CL:0000525</td>\n",
       "      <td>9th week post-fertilization human stage</td>\n",
       "      <td>HsapDv:0000046</td>\n",
       "      <td>normal</td>\n",
       "      <td>PATO:0000461</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>nucleus</td>\n",
       "      <td>decidua basalis</td>\n",
       "      <td>UBERON:0000453</td>\n",
       "      <td>placenta</td>\n",
       "      <td>UBERON:0001987</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1</td>\n",
       "      <td>f171db61-e57e-4535-a06a-35d8b6ef8f2b</td>\n",
       "      <td>10x 3' v3</td>\n",
       "      <td>EFO:0009922</td>\n",
       "      <td>placental villous trophoblast</td>\n",
       "      <td>CL:2000060</td>\n",
       "      <td>9th week post-fertilization human stage</td>\n",
       "      <td>HsapDv:0000046</td>\n",
       "      <td>normal</td>\n",
       "      <td>PATO:0000461</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>nucleus</td>\n",
       "      <td>decidua basalis</td>\n",
       "      <td>UBERON:0000453</td>\n",
       "      <td>placenta</td>\n",
       "      <td>UBERON:0001987</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2</td>\n",
       "      <td>f171db61-e57e-4535-a06a-35d8b6ef8f2b</td>\n",
       "      <td>10x 3' v3</td>\n",
       "      <td>EFO:0009922</td>\n",
       "      <td>syncytiotrophoblast cell</td>\n",
       "      <td>CL:0000525</td>\n",
       "      <td>9th week post-fertilization human stage</td>\n",
       "      <td>HsapDv:0000046</td>\n",
       "      <td>normal</td>\n",
       "      <td>PATO:0000461</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>nucleus</td>\n",
       "      <td>decidua basalis</td>\n",
       "      <td>UBERON:0000453</td>\n",
       "      <td>placenta</td>\n",
       "      <td>UBERON:0001987</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>3</td>\n",
       "      <td>f171db61-e57e-4535-a06a-35d8b6ef8f2b</td>\n",
       "      <td>10x 3' v3</td>\n",
       "      <td>EFO:0009922</td>\n",
       "      <td>syncytiotrophoblast cell</td>\n",
       "      <td>CL:0000525</td>\n",
       "      <td>9th week post-fertilization human stage</td>\n",
       "      <td>HsapDv:0000046</td>\n",
       "      <td>normal</td>\n",
       "      <td>PATO:0000461</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>nucleus</td>\n",
       "      <td>decidua basalis</td>\n",
       "      <td>UBERON:0000453</td>\n",
       "      <td>placenta</td>\n",
       "      <td>UBERON:0001987</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>4</td>\n",
       "      <td>f171db61-e57e-4535-a06a-35d8b6ef8f2b</td>\n",
       "      <td>10x 3' v3</td>\n",
       "      <td>EFO:0009922</td>\n",
       "      <td>extravillous trophoblast</td>\n",
       "      <td>CL:0008036</td>\n",
       "      <td>9th week post-fertilization human stage</td>\n",
       "      <td>HsapDv:0000046</td>\n",
       "      <td>normal</td>\n",
       "      <td>PATO:0000461</td>\n",
       "      <td>...</td>\n",
       "      <td>False</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>nucleus</td>\n",
       "      <td>decidua basalis</td>\n",
       "      <td>UBERON:0000453</td>\n",
       "      <td>placenta</td>\n",
       "      <td>UBERON:0001987</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3251329</th>\n",
       "      <td>56274573</td>\n",
       "      <td>2adb1f8a-a6b1-4909-8ee8-484814e2d4bf</td>\n",
       "      <td>microwell-seq</td>\n",
       "      <td>EFO:0030002</td>\n",
       "      <td>cord blood hematopoietic stem cell</td>\n",
       "      <td>CL:2000095</td>\n",
       "      <td>newborn human stage</td>\n",
       "      <td>HsapDv:0000082</td>\n",
       "      <td>normal</td>\n",
       "      <td>PATO:0000461</td>\n",
       "      <td>...</td>\n",
       "      <td>True</td>\n",
       "      <td>Han Chinese</td>\n",
       "      <td>HANCESTRO:0027</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>cell</td>\n",
       "      <td>umbilical cord blood</td>\n",
       "      <td>UBERON:0012168</td>\n",
       "      <td>blood</td>\n",
       "      <td>UBERON:0000178</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3251330</th>\n",
       "      <td>56274574</td>\n",
       "      <td>2adb1f8a-a6b1-4909-8ee8-484814e2d4bf</td>\n",
       "      <td>microwell-seq</td>\n",
       "      <td>EFO:0030002</td>\n",
       "      <td>cord blood hematopoietic stem cell</td>\n",
       "      <td>CL:2000095</td>\n",
       "      <td>newborn human stage</td>\n",
       "      <td>HsapDv:0000082</td>\n",
       "      <td>normal</td>\n",
       "      <td>PATO:0000461</td>\n",
       "      <td>...</td>\n",
       "      <td>True</td>\n",
       "      <td>Han Chinese</td>\n",
       "      <td>HANCESTRO:0027</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>cell</td>\n",
       "      <td>umbilical cord blood</td>\n",
       "      <td>UBERON:0012168</td>\n",
       "      <td>blood</td>\n",
       "      <td>UBERON:0000178</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3251331</th>\n",
       "      <td>56274575</td>\n",
       "      <td>2adb1f8a-a6b1-4909-8ee8-484814e2d4bf</td>\n",
       "      <td>microwell-seq</td>\n",
       "      <td>EFO:0030002</td>\n",
       "      <td>cord blood hematopoietic stem cell</td>\n",
       "      <td>CL:2000095</td>\n",
       "      <td>newborn human stage</td>\n",
       "      <td>HsapDv:0000082</td>\n",
       "      <td>normal</td>\n",
       "      <td>PATO:0000461</td>\n",
       "      <td>...</td>\n",
       "      <td>True</td>\n",
       "      <td>Han Chinese</td>\n",
       "      <td>HANCESTRO:0027</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>cell</td>\n",
       "      <td>umbilical cord blood</td>\n",
       "      <td>UBERON:0012168</td>\n",
       "      <td>blood</td>\n",
       "      <td>UBERON:0000178</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3251332</th>\n",
       "      <td>56274576</td>\n",
       "      <td>2adb1f8a-a6b1-4909-8ee8-484814e2d4bf</td>\n",
       "      <td>microwell-seq</td>\n",
       "      <td>EFO:0030002</td>\n",
       "      <td>cord blood hematopoietic stem cell</td>\n",
       "      <td>CL:2000095</td>\n",
       "      <td>newborn human stage</td>\n",
       "      <td>HsapDv:0000082</td>\n",
       "      <td>normal</td>\n",
       "      <td>PATO:0000461</td>\n",
       "      <td>...</td>\n",
       "      <td>True</td>\n",
       "      <td>Han Chinese</td>\n",
       "      <td>HANCESTRO:0027</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>cell</td>\n",
       "      <td>umbilical cord blood</td>\n",
       "      <td>UBERON:0012168</td>\n",
       "      <td>blood</td>\n",
       "      <td>UBERON:0000178</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3251333</th>\n",
       "      <td>56274577</td>\n",
       "      <td>2adb1f8a-a6b1-4909-8ee8-484814e2d4bf</td>\n",
       "      <td>microwell-seq</td>\n",
       "      <td>EFO:0030002</td>\n",
       "      <td>cord blood hematopoietic stem cell</td>\n",
       "      <td>CL:2000095</td>\n",
       "      <td>newborn human stage</td>\n",
       "      <td>HsapDv:0000082</td>\n",
       "      <td>normal</td>\n",
       "      <td>PATO:0000461</td>\n",
       "      <td>...</td>\n",
       "      <td>True</td>\n",
       "      <td>Han Chinese</td>\n",
       "      <td>HANCESTRO:0027</td>\n",
       "      <td>unknown</td>\n",
       "      <td>unknown</td>\n",
       "      <td>cell</td>\n",
       "      <td>umbilical cord blood</td>\n",
       "      <td>UBERON:0012168</td>\n",
       "      <td>blood</td>\n",
       "      <td>UBERON:0000178</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3251334 rows × 21 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "         soma_joinid                            dataset_id          assay  \\\n",
       "0                  0  f171db61-e57e-4535-a06a-35d8b6ef8f2b      10x 3' v3   \n",
       "1                  1  f171db61-e57e-4535-a06a-35d8b6ef8f2b      10x 3' v3   \n",
       "2                  2  f171db61-e57e-4535-a06a-35d8b6ef8f2b      10x 3' v3   \n",
       "3                  3  f171db61-e57e-4535-a06a-35d8b6ef8f2b      10x 3' v3   \n",
       "4                  4  f171db61-e57e-4535-a06a-35d8b6ef8f2b      10x 3' v3   \n",
       "...              ...                                   ...            ...   \n",
       "3251329     56274573  2adb1f8a-a6b1-4909-8ee8-484814e2d4bf  microwell-seq   \n",
       "3251330     56274574  2adb1f8a-a6b1-4909-8ee8-484814e2d4bf  microwell-seq   \n",
       "3251331     56274575  2adb1f8a-a6b1-4909-8ee8-484814e2d4bf  microwell-seq   \n",
       "3251332     56274576  2adb1f8a-a6b1-4909-8ee8-484814e2d4bf  microwell-seq   \n",
       "3251333     56274577  2adb1f8a-a6b1-4909-8ee8-484814e2d4bf  microwell-seq   \n",
       "\n",
       "        assay_ontology_term_id                           cell_type  \\\n",
       "0                  EFO:0009922            syncytiotrophoblast cell   \n",
       "1                  EFO:0009922       placental villous trophoblast   \n",
       "2                  EFO:0009922            syncytiotrophoblast cell   \n",
       "3                  EFO:0009922            syncytiotrophoblast cell   \n",
       "4                  EFO:0009922            extravillous trophoblast   \n",
       "...                        ...                                 ...   \n",
       "3251329            EFO:0030002  cord blood hematopoietic stem cell   \n",
       "3251330            EFO:0030002  cord blood hematopoietic stem cell   \n",
       "3251331            EFO:0030002  cord blood hematopoietic stem cell   \n",
       "3251332            EFO:0030002  cord blood hematopoietic stem cell   \n",
       "3251333            EFO:0030002  cord blood hematopoietic stem cell   \n",
       "\n",
       "        cell_type_ontology_term_id                        development_stage  \\\n",
       "0                       CL:0000525  9th week post-fertilization human stage   \n",
       "1                       CL:2000060  9th week post-fertilization human stage   \n",
       "2                       CL:0000525  9th week post-fertilization human stage   \n",
       "3                       CL:0000525  9th week post-fertilization human stage   \n",
       "4                       CL:0008036  9th week post-fertilization human stage   \n",
       "...                            ...                                      ...   \n",
       "3251329                 CL:2000095                      newborn human stage   \n",
       "3251330                 CL:2000095                      newborn human stage   \n",
       "3251331                 CL:2000095                      newborn human stage   \n",
       "3251332                 CL:2000095                      newborn human stage   \n",
       "3251333                 CL:2000095                      newborn human stage   \n",
       "\n",
       "        development_stage_ontology_term_id disease disease_ontology_term_id  \\\n",
       "0                           HsapDv:0000046  normal             PATO:0000461   \n",
       "1                           HsapDv:0000046  normal             PATO:0000461   \n",
       "2                           HsapDv:0000046  normal             PATO:0000461   \n",
       "3                           HsapDv:0000046  normal             PATO:0000461   \n",
       "4                           HsapDv:0000046  normal             PATO:0000461   \n",
       "...                                    ...     ...                      ...   \n",
       "3251329                     HsapDv:0000082  normal             PATO:0000461   \n",
       "3251330                     HsapDv:0000082  normal             PATO:0000461   \n",
       "3251331                     HsapDv:0000082  normal             PATO:0000461   \n",
       "3251332                     HsapDv:0000082  normal             PATO:0000461   \n",
       "3251333                     HsapDv:0000082  normal             PATO:0000461   \n",
       "\n",
       "         ... is_primary_data  self_reported_ethnicity  \\\n",
       "0        ...           False                  unknown   \n",
       "1        ...           False                  unknown   \n",
       "2        ...           False                  unknown   \n",
       "3        ...           False                  unknown   \n",
       "4        ...           False                  unknown   \n",
       "...      ...             ...                      ...   \n",
       "3251329  ...            True              Han Chinese   \n",
       "3251330  ...            True              Han Chinese   \n",
       "3251331  ...            True              Han Chinese   \n",
       "3251332  ...            True              Han Chinese   \n",
       "3251333  ...            True              Han Chinese   \n",
       "\n",
       "        self_reported_ethnicity_ontology_term_id      sex  \\\n",
       "0                                        unknown  unknown   \n",
       "1                                        unknown  unknown   \n",
       "2                                        unknown  unknown   \n",
       "3                                        unknown  unknown   \n",
       "4                                        unknown  unknown   \n",
       "...                                          ...      ...   \n",
       "3251329                           HANCESTRO:0027  unknown   \n",
       "3251330                           HANCESTRO:0027  unknown   \n",
       "3251331                           HANCESTRO:0027  unknown   \n",
       "3251332                           HANCESTRO:0027  unknown   \n",
       "3251333                           HANCESTRO:0027  unknown   \n",
       "\n",
       "        sex_ontology_term_id suspension_type                tissue  \\\n",
       "0                    unknown         nucleus       decidua basalis   \n",
       "1                    unknown         nucleus       decidua basalis   \n",
       "2                    unknown         nucleus       decidua basalis   \n",
       "3                    unknown         nucleus       decidua basalis   \n",
       "4                    unknown         nucleus       decidua basalis   \n",
       "...                      ...             ...                   ...   \n",
       "3251329              unknown            cell  umbilical cord blood   \n",
       "3251330              unknown            cell  umbilical cord blood   \n",
       "3251331              unknown            cell  umbilical cord blood   \n",
       "3251332              unknown            cell  umbilical cord blood   \n",
       "3251333              unknown            cell  umbilical cord blood   \n",
       "\n",
       "        tissue_ontology_term_id tissue_general tissue_general_ontology_term_id  \n",
       "0                UBERON:0000453       placenta                  UBERON:0001987  \n",
       "1                UBERON:0000453       placenta                  UBERON:0001987  \n",
       "2                UBERON:0000453       placenta                  UBERON:0001987  \n",
       "3                UBERON:0000453       placenta                  UBERON:0001987  \n",
       "4                UBERON:0000453       placenta                  UBERON:0001987  \n",
       "...                         ...            ...                             ...  \n",
       "3251329          UBERON:0012168          blood                  UBERON:0000178  \n",
       "3251330          UBERON:0012168          blood                  UBERON:0000178  \n",
       "3251331          UBERON:0012168          blood                  UBERON:0000178  \n",
       "3251332          UBERON:0012168          blood                  UBERON:0000178  \n",
       "3251333          UBERON:0012168          blood                  UBERON:0000178  \n",
       "\n",
       "[3251334 rows x 21 columns]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cell_metadata_all_unknown_sex = cellxgene_census.get_obs(census, \"homo_sapiens\", value_filter=\"sex == 'unknown'\")\n",
    "\n",
    "cell_metadata_all_unknown_sex"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
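    "You can combine `column_names` and `value_filter` to perform specific queries. For example, let's fetch the `disease` column for cells whose `cell_type` is `\"B cell\"` and whose `tissue_general` is `\"lung\"`, keeping only non-duplicated cells (`is_primary_data == True`). Note that, as the output shows, the columns referenced in `value_filter` are returned alongside those listed in `column_names`."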
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:17:00.447314Z",
     "iopub.status.busy": "2023-07-28T16:17:00.447043Z",
     "iopub.status.idle": "2023-07-28T16:17:01.929686Z",
     "shell.execute_reply": "2023-07-28T16:17:01.929083Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "disease                                cell_type  tissue_general  is_primary_data\n",
       "lung adenocarcinoma                    B cell     lung            True               42720\n",
       "squamous cell lung carcinoma           B cell     lung            True               10631\n",
       "non-small cell lung carcinoma          B cell     lung            True                8742\n",
       "normal                                 B cell     lung            True                8187\n",
       "COVID-19                               B cell     lung            True                2313\n",
       "chronic obstructive pulmonary disease  B cell     lung            True                2083\n",
       "lung large cell carcinoma              B cell     lung            True                1534\n",
       "pulmonary emphysema                    B cell     lung            True                1512\n",
       "pulmonary fibrosis                     B cell     lung            True                1474\n",
       "pleomorphic carcinoma                  B cell     lung            True                1210\n",
       "interstitial lung disease              B cell     lung            True                 332\n",
       "small cell lung carcinoma              B cell     lung            True                 204\n",
       "lymphangioleiomyomatosis               B cell     lung            True                 133\n",
       "pneumonia                              B cell     lung            True                  50\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "cell_metadata_b_cell = cellxgene_census.get_obs(\n",
    "    census,\n",
    "    \"homo_sapiens\",\n",
    "    value_filter=\"cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True\",\n",
    "    column_names=[\"disease\"],\n",
    ")\n",
    "\n",
    "cell_metadata_b_cell.value_counts()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Querying gene metadata (var)\n",
    "\n",
    "The human gene metadata of the Census is located at `census[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].var`. Like the cell metadata, it is a `SOMADataFrame`, so we can also use its `read()` method.\n",
    "\n",
    "The mouse gene metadata is at `census[\"census_data\"][\"mus_musculus\"].ms[\"RNA\"].var`.\n",
    "\n",
    "Let's take a look at the metadata columns available for column selection and row filtering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:17:01.932237Z",
     "iopub.status.busy": "2023-07-28T16:17:01.931961Z",
     "iopub.status.idle": "2023-07-28T16:17:02.338545Z",
     "shell.execute_reply": "2023-07-28T16:17:02.337996Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['soma_joinid', 'feature_id', 'feature_name', 'feature_length']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "keys = list(census[\"census_data\"][\"homo_sapiens\"].ms[\"RNA\"].var.keys())\n",
    "\n",
    "keys"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With the exception of `soma_joinid`, these columns are defined in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md). As with the cell metadata, we can use the same operations to explore and fetch gene metadata.\n",
    "\n",
    "For example, to get the `feature_name` and `feature_length` of the genes `\"ENSG00000161798\"` and `\"ENSG00000188229\"`, we can do the following."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:17:02.341116Z",
     "iopub.status.busy": "2023-07-28T16:17:02.340856Z",
     "iopub.status.idle": "2023-07-28T16:17:02.611956Z",
     "shell.execute_reply": "2023-07-28T16:17:02.611388Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feature_name</th>\n",
       "      <th>feature_length</th>\n",
       "      <th>feature_id</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AQP5</td>\n",
       "      <td>1884</td>\n",
       "      <td>ENSG00000161798</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>TUBB4B</td>\n",
       "      <td>2037</td>\n",
       "      <td>ENSG00000188229</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "  feature_name  feature_length       feature_id\n",
       "0         AQP5            1884  ENSG00000161798\n",
       "1       TUBB4B            2037  ENSG00000188229"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "gene_metadata = cellxgene_census.get_var(\n",
    "    census,\n",
    "    \"homo_sapiens\",\n",
    "    value_filter=\"feature_id in ['ENSG00000161798', 'ENSG00000188229']\",\n",
    "    column_names=[\"feature_name\", \"feature_length\"],\n",
    ")\n",
    "\n",
    "gene_metadata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T16:17:25.294413Z",
     "iopub.status.busy": "2023-07-28T16:17:25.294099Z",
     "iopub.status.idle": "2023-07-28T16:17:25.296921Z",
     "shell.execute_reply": "2023-07-28T16:17:25.296417Z"
    }
   },
   "outputs": [],
   "source": [
    "census.close()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  },
  "vscode": {
   "interpreter": {
    "hash": "3da8ec1c162cd849e59e6ea2824b2e353dce799884e910aae99411be5277f953"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}