{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Computing on X using online (incremental) algorithms\n",
    "\n",
    "This tutorial showcases computing a variety of per-gene and per-cell statistics for a user-defined query using out-of-core operations.\n",
    "\n",
    "*NOTE*: when query results are small enough to fit in memory, it may be easier to use the `SOMAExperiment` Query class to extract an AnnData, and then just compute over that. This tutorial shows means of incrementally processing larger-than-core (RAM) data, where incremental (online) algorithms are used.\n",
    "\n",
    "**Contents**\n",
    "\n",
    "1. Incremental count and mean calculation.\n",
    "2. Incremental variance calculation.\n",
    "3. Counting cells per gene, grouped by `dataset_id`.\n",
    "\n",
    "⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable `is_primary_data` which is described in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#repeated-data)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:30:56.921713Z",
     "iopub.status.busy": "2023-07-28T14:30:56.921303Z",
     "iopub.status.idle": "2023-07-28T14:30:59.013417Z",
     "shell.execute_reply": "2023-07-28T14:30:59.012839Z"
    }
   },
   "outputs": [],
   "source": [
    "import cellxgene_census\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import tiledbsoma as soma\n",
    "from tiledbsoma.experiment_query import X_as_series"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Incremental count and mean calculation.\n",
    "\n",
    "Many statistics, such as `mean`, are easy to calculate incrementally.  This cell demonstrates a query on the `X['raw']` sparse nD array, which will return results in batches. Accumulate the sum and count incrementally, into `raw_sum` and `raw_n`, and then compute mean.\n",
    "\n",
    "First define a query - in this case a slice over the obs axis for cells with a specific tissue & sex value, and all genes on the var axis.  The `query.X()` method returns an iterator of results, each as a PyArrow Table.  Each table will contain the sparse X data and obs/var coordinates, using standard SOMA names:\n",
    "\n",
    "* `soma_data` - the X value (float32)\n",
    "* `soma_dim_0` - the obs coordinate (int64)\n",
    "* `soma_dim_1` - the var coordinate (int64)\n",
    "\n",
    "**Important**: the X matrices are joined to var/obs axis DataFrames by an integer join \"id\" (aka `soma_joinid`). They are *NOT* positionally indexed, and any given cell or gene may have a `soma_joinid` of any value (e.g., a large integer). In other words, for any given `X` value, the `soma_dim_0` corresponds to the `soma_joinid` in the `obs` dataframe, and the `soma_dim_1` coordinate corresponds to the `soma_joinid` in the `var` dataframe.\n",
    "\n",
    "For convenience, the query package contains a utility function to simplify operations on query slices.  `query.indexer` returns an indexer that can be used to wrap the output of `query.X()`, converting from `soma_joinids` to positional indexing. Positions are `[0, N)`, where `N` are the number of results on the query for any given axis (equivalent to the Pandas `.iloc` of the axis dataframe).\n",
    "\n",
    "Key points:\n",
    "\n",
    "* it is expensive to query and read the results - so rather than make multiple passes over the data, read it once and perform multiple computations.\n",
    "* by default, data in the census is indexed by `soma_joinid` and not positionally. Use `query.indexer` if you want positions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:30:59.016475Z",
     "iopub.status.busy": "2023-07-28T14:30:59.016041Z",
     "iopub.status.idle": "2023-07-28T14:31:21.558944Z",
     "shell.execute_reply": "2023-07-28T14:31:21.558251Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feature_id</th>\n",
       "      <th>feature_name</th>\n",
       "      <th>feature_length</th>\n",
       "      <th>raw_n</th>\n",
       "      <th>raw_mean</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>soma_joinid</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ENSMUSG00000051951</td>\n",
       "      <td>Xkr4</td>\n",
       "      <td>6094</td>\n",
       "      <td>202</td>\n",
       "      <td>1.032743</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ENSMUSG00000089699</td>\n",
       "      <td>Gm1992</td>\n",
       "      <td>250</td>\n",
       "      <td>0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ENSMUSG00000102343</td>\n",
       "      <td>Gm37381</td>\n",
       "      <td>1364</td>\n",
       "      <td>0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ENSMUSG00000025900</td>\n",
       "      <td>Rp1</td>\n",
       "      <td>12311</td>\n",
       "      <td>106</td>\n",
       "      <td>0.236265</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ENSMUSG00000025902</td>\n",
       "      <td>Sox17</td>\n",
       "      <td>4772</td>\n",
       "      <td>3259</td>\n",
       "      <td>48.991975</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52387</th>\n",
       "      <td>ENSMUSG00000081591</td>\n",
       "      <td>Btf3-ps9</td>\n",
       "      <td>496</td>\n",
       "      <td>0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52388</th>\n",
       "      <td>ENSMUSG00000118710</td>\n",
       "      <td>mmu-mir-467a-3_ENSMUSG00000118710</td>\n",
       "      <td>83</td>\n",
       "      <td>0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52389</th>\n",
       "      <td>ENSMUSG00000119584</td>\n",
       "      <td>Rn18s</td>\n",
       "      <td>1849</td>\n",
       "      <td>0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52390</th>\n",
       "      <td>ENSMUSG00000118538</td>\n",
       "      <td>Gm18218</td>\n",
       "      <td>970</td>\n",
       "      <td>0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52391</th>\n",
       "      <td>ENSMUSG00000084217</td>\n",
       "      <td>Setd9-ps</td>\n",
       "      <td>670</td>\n",
       "      <td>0</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>52392 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                     feature_id                       feature_name   \n",
       "soma_joinid                                                          \n",
       "0            ENSMUSG00000051951                               Xkr4  \\\n",
       "1            ENSMUSG00000089699                             Gm1992   \n",
       "2            ENSMUSG00000102343                            Gm37381   \n",
       "3            ENSMUSG00000025900                                Rp1   \n",
       "4            ENSMUSG00000025902                              Sox17   \n",
       "...                         ...                                ...   \n",
       "52387        ENSMUSG00000081591                           Btf3-ps9   \n",
       "52388        ENSMUSG00000118710  mmu-mir-467a-3_ENSMUSG00000118710   \n",
       "52389        ENSMUSG00000119584                              Rn18s   \n",
       "52390        ENSMUSG00000118538                            Gm18218   \n",
       "52391        ENSMUSG00000084217                           Setd9-ps   \n",
       "\n",
       "             feature_length  raw_n   raw_mean  \n",
       "soma_joinid                                    \n",
       "0                      6094    202   1.032743  \n",
       "1                       250      0   0.000000  \n",
       "2                      1364      0   0.000000  \n",
       "3                     12311    106   0.236265  \n",
       "4                      4772   3259  48.991975  \n",
       "...                     ...    ...        ...  \n",
       "52387                   496      0   0.000000  \n",
       "52388                    83      0   0.000000  \n",
       "52389                  1849      0   0.000000  \n",
       "52390                   970      0   0.000000  \n",
       "52391                   670      0   0.000000  \n",
       "\n",
       "[52392 rows x 5 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "with cellxgene_census.open_soma() as census:\n",
    "    mouse = census[\"census_data\"][\"mus_musculus\"]\n",
    "    with mouse.axis_query(\n",
    "        measurement_name=\"RNA\",\n",
    "        obs_query=soma.AxisQuery(value_filter=\"tissue=='brain' and sex=='male' and is_primary_data==True\"),\n",
    "    ) as query:\n",
    "        var_df = query.var().concat().to_pandas().set_index(\"soma_joinid\")\n",
    "        n_vars = len(var_df)\n",
    "\n",
    "        raw_n = np.zeros((n_vars,), dtype=np.int64)  # accumulate number of non-zero X values\n",
    "        raw_sum = np.zeros((n_vars,), dtype=np.float64)  # accumulate the sum of expression\n",
    "\n",
    "        # query.X() returns an iterator of pyarrow.Table, with X data in COO format.\n",
    "        # You can request an indexer from the query that will map it to positional indices\n",
    "        indexer = query.indexer\n",
    "        for arrow_tbl in query.X(\"raw\").tables():\n",
    "            var_dim = indexer.by_var(arrow_tbl[\"soma_dim_1\"])\n",
    "            data = arrow_tbl[\"soma_data\"]\n",
    "            np.add.at(raw_n, var_dim, 1)\n",
    "            np.add.at(raw_sum, var_dim, data)\n",
    "\n",
    "    with np.errstate(divide=\"ignore\", invalid=\"ignore\"):\n",
    "        raw_mean = raw_sum / query.n_obs\n",
    "    raw_mean[np.isnan(raw_mean)] = 0\n",
    "\n",
    "    var_df = var_df.assign(raw_n=pd.Series(data=raw_n, index=var_df.index))\n",
    "    var_df = var_df.assign(raw_mean=pd.Series(data=raw_mean, index=var_df.index))\n",
    "\n",
    "    display(var_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Incremental variance calculation\n",
    "\n",
    "Other statistics are not as simple when implemented as an online algorithm. This cell demonstrates an implementation of an online computation of `variance`, using [Welford's online calculation of mean and variance](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:31:21.561533Z",
     "iopub.status.busy": "2023-07-28T14:31:21.561273Z",
     "iopub.status.idle": "2023-07-28T14:31:31.340478Z",
     "shell.execute_reply": "2023-07-28T14:31:31.339829Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>feature_id</th>\n",
       "      <th>feature_name</th>\n",
       "      <th>feature_length</th>\n",
       "      <th>raw_mean</th>\n",
       "      <th>raw_variance</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>soma_joinid</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>ENSMUSG00000051951</td>\n",
       "      <td>Xkr4</td>\n",
       "      <td>6094</td>\n",
       "      <td>1.032743</td>\n",
       "      <td>848.312801</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>ENSMUSG00000089699</td>\n",
       "      <td>Gm1992</td>\n",
       "      <td>250</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>ENSMUSG00000102343</td>\n",
       "      <td>Gm37381</td>\n",
       "      <td>1364</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>ENSMUSG00000025900</td>\n",
       "      <td>Rp1</td>\n",
       "      <td>12311</td>\n",
       "      <td>0.236265</td>\n",
       "      <td>169.182975</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>ENSMUSG00000025902</td>\n",
       "      <td>Sox17</td>\n",
       "      <td>4772</td>\n",
       "      <td>48.991975</td>\n",
       "      <td>279575.656207</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52387</th>\n",
       "      <td>ENSMUSG00000081591</td>\n",
       "      <td>Btf3-ps9</td>\n",
       "      <td>496</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52388</th>\n",
       "      <td>ENSMUSG00000118710</td>\n",
       "      <td>mmu-mir-467a-3_ENSMUSG00000118710</td>\n",
       "      <td>83</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52389</th>\n",
       "      <td>ENSMUSG00000119584</td>\n",
       "      <td>Rn18s</td>\n",
       "      <td>1849</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52390</th>\n",
       "      <td>ENSMUSG00000118538</td>\n",
       "      <td>Gm18218</td>\n",
       "      <td>970</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>52391</th>\n",
       "      <td>ENSMUSG00000084217</td>\n",
       "      <td>Setd9-ps</td>\n",
       "      <td>670</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>52392 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                     feature_id                       feature_name   \n",
       "soma_joinid                                                          \n",
       "0            ENSMUSG00000051951                               Xkr4  \\\n",
       "1            ENSMUSG00000089699                             Gm1992   \n",
       "2            ENSMUSG00000102343                            Gm37381   \n",
       "3            ENSMUSG00000025900                                Rp1   \n",
       "4            ENSMUSG00000025902                              Sox17   \n",
       "...                         ...                                ...   \n",
       "52387        ENSMUSG00000081591                           Btf3-ps9   \n",
       "52388        ENSMUSG00000118710  mmu-mir-467a-3_ENSMUSG00000118710   \n",
       "52389        ENSMUSG00000119584                              Rn18s   \n",
       "52390        ENSMUSG00000118538                            Gm18218   \n",
       "52391        ENSMUSG00000084217                           Setd9-ps   \n",
       "\n",
       "             feature_length   raw_mean   raw_variance  \n",
       "soma_joinid                                            \n",
       "0                      6094   1.032743     848.312801  \n",
       "1                       250   0.000000       0.000000  \n",
       "2                      1364   0.000000       0.000000  \n",
       "3                     12311   0.236265     169.182975  \n",
       "4                      4772  48.991975  279575.656207  \n",
       "...                     ...        ...            ...  \n",
       "52387                   496   0.000000       0.000000  \n",
       "52388                    83   0.000000       0.000000  \n",
       "52389                  1849   0.000000       0.000000  \n",
       "52390                   970   0.000000       0.000000  \n",
       "52391                   670   0.000000       0.000000  \n",
       "\n",
       "[52392 rows x 5 columns]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "import numba\n",
    "import numpy.typing as npt\n",
    "\n",
    "\n",
    "class OnlineMatrixMeanVariance:\n",
    "    n_samples: int\n",
    "    n_variables: int\n",
    "\n",
    "    def __init__(self, n_samples: int, n_variables: int):\n",
    "        \"\"\"Compute mean and variance for n_variables over n_samples, encoded\n",
    "        in a COO format. Equivalent to:\n",
    "            numpy.mean(data, axis=0)\n",
    "            numpy.var(data, axix=0)\n",
    "        where the input `data` is of shape (n_samples, n_variables)\n",
    "        \"\"\"\n",
    "        self.n_samples = n_samples\n",
    "        self.n_variables = n_variables\n",
    "\n",
    "        self.n_a = np.zeros((n_variables,), dtype=np.int32)\n",
    "        self.u_a = np.zeros((n_variables,), dtype=np.float64)\n",
    "        self.M2_a = np.zeros((n_variables,), dtype=np.float64)\n",
    "\n",
    "    def update(self, coord_vec: npt.NDArray[np.int64], value_vec: npt.NDArray[np.float32]) -> None:\n",
    "        _mean_variance_update(coord_vec, value_vec, self.n_a, self.u_a, self.M2_a)\n",
    "\n",
    "    def finalize(self) -> tuple[npt.NDArray[np.float64], npt.NDArray[np.float64]]:\n",
    "        \"\"\"Returns tuple containing mean and variance\"\"\"\n",
    "        u, M2 = _mean_variance_finalize(self.n_samples, self.n_a, self.u_a, self.M2_a)\n",
    "\n",
    "        # compute sample variance\n",
    "        var = M2 / max(1, (self.n_samples - 1))\n",
    "\n",
    "        return u, var\n",
    "\n",
    "\n",
    "@numba.jit(nopython=True)\n",
    "def _mean_variance_update(\n",
    "    col_arr: npt.NDArray[np.int64],\n",
    "    val_arr: npt.NDArray[np.float32],\n",
    "    n: npt.NDArray[np.int32],\n",
    "    u: npt.NDArray[np.float64],\n",
    "    M2: npt.NDArray[np.float64],\n",
    "):\n",
    "    \"\"\"Incrementally accumulate mean and sum of square of distance from mean using\n",
    "    Welford's online method.\n",
    "    \"\"\"\n",
    "    for col, val in zip(col_arr, val_arr):\n",
    "        u_prev = u[col]\n",
    "        M2_prev = M2[col]\n",
    "        n[col] += 1\n",
    "        u[col] = u_prev + (val - u_prev) / n[col]\n",
    "        M2[col] = M2_prev + (val - u_prev) * (val - u[col])\n",
    "\n",
    "\n",
    "@numba.jit(nopython=True)\n",
    "def _mean_variance_finalize(\n",
    "    n_samples: int,\n",
    "    n_a: npt.NDArray[np.int32],\n",
    "    u_a: npt.NDArray[np.float64],\n",
    "    M2_a: npt.NDArray[np.float64],\n",
    "):\n",
    "    \"\"\"Finalize incremental values, acconting for missing elements (due to sparse input).\n",
    "    Non-sparse and sparse combined using Chan's parallel adaptation of Welford's.\n",
    "    The code assumes the sparse elements are all zero and ignores those terms.\n",
    "    \"\"\"\n",
    "    n_b = n_samples - n_a\n",
    "    delta = -u_a  # assumes u_b == 0\n",
    "    u = (n_a * u_a) / n_samples\n",
    "    M2 = M2_a + delta**2 * n_a * n_b / n_samples  # assumes M2_b == 0\n",
    "    return u, M2\n",
    "\n",
    "\n",
    "with cellxgene_census.open_soma() as census:\n",
    "    mouse = census[\"census_data\"][\"mus_musculus\"]\n",
    "    with mouse.axis_query(\n",
    "        measurement_name=\"RNA\",\n",
    "        obs_query=soma.AxisQuery(value_filter=\"tissue=='brain' and sex=='male' and is_primary_data==True\"),\n",
    "    ) as query:\n",
    "        var_df = query.var().concat().to_pandas().set_index(\"soma_joinid\")\n",
    "        n_vars = len(var_df)\n",
    "\n",
    "        indexer = query.indexer\n",
    "        mvn = OnlineMatrixMeanVariance(query.n_obs, n_vars)\n",
    "        for arrow_tbl in query.X(\"raw\").tables():\n",
    "            var_dim = indexer.by_var(arrow_tbl[\"soma_dim_1\"])\n",
    "            data = arrow_tbl[\"soma_data\"].to_numpy()\n",
    "            mvn.update(var_dim, data)\n",
    "\n",
    "        u, v = mvn.finalize()\n",
    "\n",
    "    var_df = var_df.assign(raw_mean=pd.Series(data=u, index=var_df.index))\n",
    "    var_df = var_df.assign(raw_variance=pd.Series(data=v, index=var_df.index))\n",
    "\n",
    "    display(var_df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Counting cells per gene, grouped by `dataset_id`\n",
    "\n",
    "This example demonstrates a more complex example where the goal is to count the number of cells per gene, grouped by cell dataset_id.  The result is a Pandas DataFrame indexed by `obs.dataset_id` and `var.feature_id`, containing the number of cells per pair.\n",
    "\n",
    "This example does not use positional indexing, but rather demonstrates the use of Pandas DataFrame `join` to join on the `soma_joinid`. For the sake of this example we will query only 4 genes, but this can be expanded to all genes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-07-28T14:31:31.343083Z",
     "iopub.status.busy": "2023-07-28T14:31:31.342809Z",
     "iopub.status.idle": "2023-07-28T14:31:40.130486Z",
     "shell.execute_reply": "2023-07-28T14:31:40.129921Z"
    }
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n"
     ]
    },
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>n_cells</th>\n",
       "      <th>feature_name</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>dataset_id</th>\n",
       "      <th>feature_id</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3bbb6cf9-72b9-41be-b568-656de6eb18b5</th>\n",
       "      <th>ENSMUSG00000028399</th>\n",
       "      <td>79578</td>\n",
       "      <td>Ptprd</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58b01044-c5e5-4b0f-8a2d-6ebf951e01ff</th>\n",
       "      <th>ENSMUSG00000028399</th>\n",
       "      <td>474</td>\n",
       "      <td>Ptprd</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3bbb6cf9-72b9-41be-b568-656de6eb18b5</th>\n",
       "      <th>ENSMUSG00000052572</th>\n",
       "      <td>79513</td>\n",
       "      <td>Dlg2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58b01044-c5e5-4b0f-8a2d-6ebf951e01ff</th>\n",
       "      <th>ENSMUSG00000052572</th>\n",
       "      <td>81</td>\n",
       "      <td>Dlg2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98e5ea9f-16d6-47ec-a529-686e76515e39</th>\n",
       "      <th>ENSMUSG00000052572</th>\n",
       "      <td>908</td>\n",
       "      <td>Dlg2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>66ff82b4-9380-469c-bc4b-cfa08eacd325</th>\n",
       "      <th>ENSMUSG00000052572</th>\n",
       "      <td>856</td>\n",
       "      <td>Dlg2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c08f8441-4a10-4748-872a-e70c0bcccdba</th>\n",
       "      <th>ENSMUSG00000052572</th>\n",
       "      <td>52</td>\n",
       "      <td>Dlg2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3bbb6cf9-72b9-41be-b568-656de6eb18b5</th>\n",
       "      <th>ENSMUSG00000055421</th>\n",
       "      <td>79476</td>\n",
       "      <td>Pcdh9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58b01044-c5e5-4b0f-8a2d-6ebf951e01ff</th>\n",
       "      <th>ENSMUSG00000055421</th>\n",
       "      <td>125</td>\n",
       "      <td>Pcdh9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98e5ea9f-16d6-47ec-a529-686e76515e39</th>\n",
       "      <th>ENSMUSG00000055421</th>\n",
       "      <td>3027</td>\n",
       "      <td>Pcdh9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>66ff82b4-9380-469c-bc4b-cfa08eacd325</th>\n",
       "      <th>ENSMUSG00000055421</th>\n",
       "      <td>2910</td>\n",
       "      <td>Pcdh9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c08f8441-4a10-4748-872a-e70c0bcccdba</th>\n",
       "      <th>ENSMUSG00000055421</th>\n",
       "      <td>117</td>\n",
       "      <td>Pcdh9</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3bbb6cf9-72b9-41be-b568-656de6eb18b5</th>\n",
       "      <th>ENSMUSG00000092341</th>\n",
       "      <td>79667</td>\n",
       "      <td>Malat1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>58b01044-c5e5-4b0f-8a2d-6ebf951e01ff</th>\n",
       "      <th>ENSMUSG00000092341</th>\n",
       "      <td>12622</td>\n",
       "      <td>Malat1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>98e5ea9f-16d6-47ec-a529-686e76515e39</th>\n",
       "      <th>ENSMUSG00000092341</th>\n",
       "      <td>20094</td>\n",
       "      <td>Malat1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>66ff82b4-9380-469c-bc4b-cfa08eacd325</th>\n",
       "      <th>ENSMUSG00000092341</th>\n",
       "      <td>7102</td>\n",
       "      <td>Malat1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>c08f8441-4a10-4748-872a-e70c0bcccdba</th>\n",
       "      <th>ENSMUSG00000092341</th>\n",
       "      <td>12992</td>\n",
       "      <td>Malat1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                                                         n_cells feature_name\n",
       "dataset_id                           feature_id                              \n",
       "3bbb6cf9-72b9-41be-b568-656de6eb18b5 ENSMUSG00000028399    79578        Ptprd\n",
       "58b01044-c5e5-4b0f-8a2d-6ebf951e01ff ENSMUSG00000028399      474        Ptprd\n",
       "3bbb6cf9-72b9-41be-b568-656de6eb18b5 ENSMUSG00000052572    79513         Dlg2\n",
       "58b01044-c5e5-4b0f-8a2d-6ebf951e01ff ENSMUSG00000052572       81         Dlg2\n",
       "98e5ea9f-16d6-47ec-a529-686e76515e39 ENSMUSG00000052572      908         Dlg2\n",
       "66ff82b4-9380-469c-bc4b-cfa08eacd325 ENSMUSG00000052572      856         Dlg2\n",
       "c08f8441-4a10-4748-872a-e70c0bcccdba ENSMUSG00000052572       52         Dlg2\n",
       "3bbb6cf9-72b9-41be-b568-656de6eb18b5 ENSMUSG00000055421    79476        Pcdh9\n",
       "58b01044-c5e5-4b0f-8a2d-6ebf951e01ff ENSMUSG00000055421      125        Pcdh9\n",
       "98e5ea9f-16d6-47ec-a529-686e76515e39 ENSMUSG00000055421     3027        Pcdh9\n",
       "66ff82b4-9380-469c-bc4b-cfa08eacd325 ENSMUSG00000055421     2910        Pcdh9\n",
       "c08f8441-4a10-4748-872a-e70c0bcccdba ENSMUSG00000055421      117        Pcdh9\n",
       "3bbb6cf9-72b9-41be-b568-656de6eb18b5 ENSMUSG00000092341    79667       Malat1\n",
       "58b01044-c5e5-4b0f-8a2d-6ebf951e01ff ENSMUSG00000092341    12622       Malat1\n",
       "98e5ea9f-16d6-47ec-a529-686e76515e39 ENSMUSG00000092341    20094       Malat1\n",
       "66ff82b4-9380-469c-bc4b-cfa08eacd325 ENSMUSG00000092341     7102       Malat1\n",
       "c08f8441-4a10-4748-872a-e70c0bcccdba ENSMUSG00000092341    12992       Malat1"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "with cellxgene_census.open_soma() as census:\n",
    "    mouse = census[\"census_data\"][\"mus_musculus\"]\n",
    "\n",
    "    with mouse.axis_query(\n",
    "        measurement_name=\"RNA\",\n",
    "        obs_query=soma.AxisQuery(value_filter=\"tissue=='brain'\"),\n",
    "        var_query=soma.AxisQuery(value_filter=\"feature_name in ['Malat1', 'Ptprd', 'Dlg2', 'Pcdh9']\"),\n",
    "    ) as query:\n",
    "        obs_df = query.obs(column_names=[\"soma_joinid\", \"dataset_id\"]).concat().to_pandas().set_index(\"soma_joinid\")\n",
    "        var_df = query.var().concat().to_pandas().set_index(\"soma_joinid\")\n",
    "        n_cells_by_dataset = pd.Series(\n",
    "            0,\n",
    "            index=pd.MultiIndex.from_product(\n",
    "                (var_df.index, obs_df.dataset_id.unique()),\n",
    "                names=[\"soma_joinid\", \"dataset_id\"],\n",
    "            ),\n",
    "            dtype=np.int64,\n",
    "            name=\"n_cells\",\n",
    "        )\n",
    "\n",
    "        for X_tbl in query.X(\"raw\").tables():\n",
    "            # Group by dataset_id and count unique (genes, dataset_id)\n",
    "            value_counts = (\n",
    "                X_as_series(X_tbl)\n",
    "                .to_frame()\n",
    "                .join(obs_df[[\"dataset_id\"]], on=\"soma_dim_0\")\n",
    "                .reset_index(level=1)\n",
    "                .drop(columns=[\"soma_data\"])\n",
    "                .value_counts()\n",
    "            )\n",
    "            np.add.at(\n",
    "                n_cells_by_dataset,\n",
    "                n_cells_by_dataset.index.get_indexer(value_counts.index),\n",
    "                value_counts.to_numpy(),\n",
    "            )\n",
    "\n",
    "    # drop any combinations that are not observed\n",
    "    n_cells_by_dataset = n_cells_by_dataset[n_cells_by_dataset > 0]\n",
    "\n",
    "    # and join with var_df to pick up feature_id and feature_name\n",
    "    n_cells_by_dataset = (\n",
    "        n_cells_by_dataset.to_frame()\n",
    "        .reset_index(level=1)\n",
    "        .join(var_df[[\"feature_id\", \"feature_name\"]])\n",
    "        .set_index([\"dataset_id\", \"feature_id\"])\n",
    "    )\n",
    "\n",
    "    display(n_cells_by_dataset)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  },
  "vscode": {
   "interpreter": {
    "hash": "3da8ec1c162cd849e59e6ea2824b2e353dce799884e910aae99411be5277f953"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}