{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Computing on X using online (incremental) algorithms\n", "\n", "This tutorial shows how to compute a variety of per-gene and per-cell statistics for a user-defined query using out-of-core operations.\n", "\n", "*NOTE*: when query results are small enough to fit in memory, it may be easier to use the `SOMAExperiment` Query class to extract an AnnData, and then just compute over that. This tutorial shows how to process larger-than-core (RAM) data using incremental (online) algorithms.\n", "\n", "**Contents**\n", "\n", "1. Incremental count and mean calculation.\n", "2. Incremental variance calculation.\n", "3. Counting cells per gene, grouped by `dataset_id`.\n", "\n", "⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable `is_primary_data`, which is described in the [Census schema](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#repeated-data)." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:30:56.921713Z", "iopub.status.busy": "2023-07-28T14:30:56.921303Z", "iopub.status.idle": "2023-07-28T14:30:59.013417Z", "shell.execute_reply": "2023-07-28T14:30:59.012839Z" } }, "outputs": [], "source": [ "import cellxgene_census\n", "import numpy as np\n", "import pandas as pd\n", "import tiledbsoma as soma\n", "from tiledbsoma.experiment_query import X_as_series" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Incremental count and mean calculation.\n", "\n", "Many statistics, such as `mean`, are easy to calculate incrementally. This cell demonstrates a query on the `X['raw']` sparse nD array, which returns results in batches. 
Accumulate the sum and count incrementally, into `raw_sum` and `raw_n`, and then compute the mean.\n", "\n", "First define a query - in this case a slice over the obs axis for cells with a specific tissue and sex value, and all genes on the var axis. The `query.X()` method returns an iterator of results, each as a PyArrow Table. Each table will contain the sparse X data and obs/var coordinates, using standard SOMA names:\n", "\n", "* `soma_data` - the X value (float32)\n", "* `soma_dim_0` - the obs coordinate (int64)\n", "* `soma_dim_1` - the var coordinate (int64)\n", "\n", "**Important**: the X matrices are joined to var/obs axis DataFrames by an integer join \"id\" (aka `soma_joinid`). They are *NOT* positionally indexed, and any given cell or gene may have a `soma_joinid` of any value (e.g., a large integer). In other words, for any given `X` value, the `soma_dim_0` coordinate corresponds to the `soma_joinid` in the `obs` dataframe, and the `soma_dim_1` coordinate corresponds to the `soma_joinid` in the `var` dataframe.\n", "\n", "For convenience, the query package contains a utility function to simplify operations on query slices. `query.indexer` returns an indexer that can be used to wrap the output of `query.X()`, converting from `soma_joinid` values to positional indexing. Positions are `[0, N)`, where `N` is the number of results on the query for any given axis (equivalent to the Pandas `.iloc` of the axis dataframe).\n", "\n", "Key points:\n", "\n", "* it is expensive to query and read the results - so rather than make multiple passes over the data, read it once and perform multiple computations.\n", "* by default, data in the census is indexed by `soma_joinid` and not positionally. Use `query.indexer` if you want positions." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-07-28T14:30:59.016475Z", "iopub.status.busy": "2023-07-28T14:30:59.016041Z", "iopub.status.idle": "2023-07-28T14:31:21.558944Z", "shell.execute_reply": "2023-07-28T14:31:21.558251Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "The \"stable\" release is currently 2023-07-25. Specify 'census_version=\"2023-07-25\"' in future calls to open_soma() to ensure data consistency.\n" ] }, { "data": { "text/html": [ "
| soma_joinid | feature_id | feature_name | feature_length | raw_n | raw_mean |
|---|---|---|---|---|---|
| 0 | ENSMUSG00000051951 | Xkr4 | 6094 | 202 | 1.032743 |
| 1 | ENSMUSG00000089699 | Gm1992 | 250 | 0 | 0.000000 |
| 2 | ENSMUSG00000102343 | Gm37381 | 1364 | 0 | 0.000000 |
| 3 | ENSMUSG00000025900 | Rp1 | 12311 | 106 | 0.236265 |
| 4 | ENSMUSG00000025902 | Sox17 | 4772 | 3259 | 48.991975 |
| ... | ... | ... | ... | ... | ... |
| 52387 | ENSMUSG00000081591 | Btf3-ps9 | 496 | 0 | 0.000000 |
| 52388 | ENSMUSG00000118710 | mmu-mir-467a-3_ENSMUSG00000118710 | 83 | 0 | 0.000000 |
| 52389 | ENSMUSG00000119584 | Rn18s | 1849 | 0 | 0.000000 |
| 52390 | ENSMUSG00000118538 | Gm18218 | 970 | 0 | 0.000000 |
| 52391 | ENSMUSG00000084217 | Setd9-ps | 670 | 0 | 0.000000 |

52392 rows × 5 columns
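
The accumulation into `raw_sum` and `raw_n` described earlier can be sketched on synthetic data. This is a minimal illustration, not the notebook's own cell: the `batches` list, `var_joinids`, `n_obs`, and the `to_positions` helper are all hypothetical stand-ins (the helper plays the role that `query.indexer` plays for real query results), and the mean is assumed to be taken over all cells in the query, counting implicit zeros.

```python
import numpy as np

# Hypothetical stand-ins: soma_joinid of each gene in the query, and the cell count.
var_joinids = np.array([10, 42, 7, 99])
n_obs = 5

# Hypothetical batches, mimicking the soma_dim_1 / soma_data columns of X results.
batches = [
    {"soma_dim_1": np.array([10, 42, 10]),
     "soma_data": np.array([1.0, 4.0, 3.0], dtype=np.float32)},
    {"soma_dim_1": np.array([99, 10, 42]),
     "soma_data": np.array([5.0, 2.0, 6.0], dtype=np.float32)},
]

# Sorted lookup table used to map join ids to positions in var_joinids.
order = np.argsort(var_joinids)
sorted_ids = var_joinids[order]


def to_positions(joinids):
    """Map soma_joinid values to positions [0, N) in var_joinids."""
    return order[np.searchsorted(sorted_ids, joinids)]


raw_sum = np.zeros(len(var_joinids), dtype=np.float64)
raw_n = np.zeros(len(var_joinids), dtype=np.int64)

for batch in batches:
    pos = to_positions(batch["soma_dim_1"])
    # np.add.at handles repeated positions within a batch correctly.
    np.add.at(raw_sum, pos, batch["soma_data"])
    np.add.at(raw_n, pos, 1)

# Assumption: the mean is over all cells, so divide by n_obs rather than raw_n.
raw_mean = raw_sum / n_obs
```

Each batch is read once and updates both accumulators, which matches the advice to avoid multiple passes over the data.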
| soma_joinid | feature_id | feature_name | feature_length | raw_mean | raw_variance |
|---|---|---|---|---|---|
| 0 | ENSMUSG00000051951 | Xkr4 | 6094 | 1.032743 | 848.312801 |
| 1 | ENSMUSG00000089699 | Gm1992 | 250 | 0.000000 | 0.000000 |
| 2 | ENSMUSG00000102343 | Gm37381 | 1364 | 0.000000 | 0.000000 |
| 3 | ENSMUSG00000025900 | Rp1 | 12311 | 0.236265 | 169.182975 |
| 4 | ENSMUSG00000025902 | Sox17 | 4772 | 48.991975 | 279575.656207 |
| ... | ... | ... | ... | ... | ... |
| 52387 | ENSMUSG00000081591 | Btf3-ps9 | 496 | 0.000000 | 0.000000 |
| 52388 | ENSMUSG00000118710 | mmu-mir-467a-3_ENSMUSG00000118710 | 83 | 0.000000 | 0.000000 |
| 52389 | ENSMUSG00000119584 | Rn18s | 1849 | 0.000000 | 0.000000 |
| 52390 | ENSMUSG00000118538 | Gm18218 | 970 | 0.000000 | 0.000000 |
| 52391 | ENSMUSG00000084217 | Setd9-ps | 670 | 0.000000 | 0.000000 |

52392 rows × 5 columns
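
One way to obtain a per-gene variance in a single pass is to merge per-batch statistics with the pairwise update of Chan et al. The sketch below is an assumption about how such a computation could look, not the notebook's own implementation: the batches are synthetic, gene coordinates are assumed to be already positional, and the result is the sample variance (`ddof=1`) over all cells, with cells absent from the sparse data treated as zeros.

```python
import numpy as np

n_genes, n_obs = 3, 6  # hypothetical query size

# Hypothetical sparse batches: (positional gene index, nonzero value).
batches = [
    (np.array([0, 0, 1]), np.array([2.0, 4.0, 1.0])),
    (np.array([0, 1, 1]), np.array([6.0, 3.0, 5.0])),
]


def merge(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Combine two sets of per-gene (count, mean, M2) statistics."""
    n_ab = n_a + n_b
    safe = np.where(n_ab > 0, n_ab, 1)  # avoid division by zero for unseen genes
    delta = mean_b - mean_a
    mean_ab = mean_a + delta * n_b / safe
    m2_ab = m2_a + m2_b + delta**2 * n_a * n_b / safe
    return n_ab, mean_ab, m2_ab


n = np.zeros(n_genes)
mean = np.zeros(n_genes)
m2 = np.zeros(n_genes)

for pos, data in batches:
    # Per-gene statistics of the nonzero values in this batch.
    n_b = np.bincount(pos, minlength=n_genes).astype(float)
    sum_b = np.bincount(pos, weights=data, minlength=n_genes)
    mean_b = sum_b / np.where(n_b > 0, n_b, 1)
    m2_b = np.bincount(pos, weights=(data - mean_b[pos]) ** 2, minlength=n_genes)
    n, mean, m2 = merge(n, mean, m2, n_b, mean_b, m2_b)

# Fold in the implicit zeros: a group of (n_obs - n) values with mean 0 and M2 0.
n, mean, m2 = merge(n, mean, m2, n_obs - n, np.zeros(n_genes), np.zeros(n_genes))

raw_mean = mean
raw_variance = m2 / (n_obs - 1)
```

Keeping only a count, a mean, and a sum of squared deviations per gene means memory use depends on the number of genes, not on the number of cells read.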
| dataset_id | feature_id | n_cells | feature_name |
|---|---|---|---|
| 3bbb6cf9-72b9-41be-b568-656de6eb18b5 | ENSMUSG00000028399 | 79578 | Ptprd |
| 58b01044-c5e5-4b0f-8a2d-6ebf951e01ff | ENSMUSG00000028399 | 474 | Ptprd |
| 3bbb6cf9-72b9-41be-b568-656de6eb18b5 | ENSMUSG00000052572 | 79513 | Dlg2 |
| 58b01044-c5e5-4b0f-8a2d-6ebf951e01ff | ENSMUSG00000052572 | 81 | Dlg2 |
| 98e5ea9f-16d6-47ec-a529-686e76515e39 | ENSMUSG00000052572 | 908 | Dlg2 |
| 66ff82b4-9380-469c-bc4b-cfa08eacd325 | ENSMUSG00000052572 | 856 | Dlg2 |
| c08f8441-4a10-4748-872a-e70c0bcccdba | ENSMUSG00000052572 | 52 | Dlg2 |
| 3bbb6cf9-72b9-41be-b568-656de6eb18b5 | ENSMUSG00000055421 | 79476 | Pcdh9 |
| 58b01044-c5e5-4b0f-8a2d-6ebf951e01ff | ENSMUSG00000055421 | 125 | Pcdh9 |
| 98e5ea9f-16d6-47ec-a529-686e76515e39 | ENSMUSG00000055421 | 3027 | Pcdh9 |
| 66ff82b4-9380-469c-bc4b-cfa08eacd325 | ENSMUSG00000055421 | 2910 | Pcdh9 |
| c08f8441-4a10-4748-872a-e70c0bcccdba | ENSMUSG00000055421 | 117 | Pcdh9 |
| 3bbb6cf9-72b9-41be-b568-656de6eb18b5 | ENSMUSG00000092341 | 79667 | Malat1 |
| 58b01044-c5e5-4b0f-8a2d-6ebf951e01ff | ENSMUSG00000092341 | 12622 | Malat1 |
| 98e5ea9f-16d6-47ec-a529-686e76515e39 | ENSMUSG00000092341 | 20094 | Malat1 |
| 66ff82b4-9380-469c-bc4b-cfa08eacd325 | ENSMUSG00000092341 | 7102 | Malat1 |
| c08f8441-4a10-4748-872a-e70c0bcccdba | ENSMUSG00000092341 | 12992 | Malat1 |
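
A grouped count such as `n_cells` per (`dataset_id`, gene) can also be accumulated batch by batch. The following is a minimal sketch on synthetic data, assuming each (cell, gene) coordinate appears at most once in X so that counting nonzero entries counts cells. The `obs_df` frame and `batches` list are hypothetical stand-ins for the obs axis dataframe and the X query results.

```python
import pandas as pd

# Hypothetical obs metadata, indexed by soma_joinid.
obs_df = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "dataset_id": ["A", "A", "B", "B"]}
).set_index("soma_joinid")

# Hypothetical batches of nonzero X coordinates (cell join id, gene join id).
batches = [
    pd.DataFrame({"soma_dim_0": [0, 1, 2], "soma_dim_1": [5, 5, 5]}),
    pd.DataFrame({"soma_dim_0": [3, 0, 2], "soma_dim_1": [5, 9, 9]}),
]

n_cells = None
for batch in batches:
    # Attach each cell's dataset_id by joining on its soma_joinid.
    batch = batch.join(obs_df["dataset_id"], on="soma_dim_0")
    counts = batch.groupby(["dataset_id", "soma_dim_1"]).size()
    # Add this batch's counts; groups missing on either side count as zero.
    n_cells = counts if n_cells is None else n_cells.add(counts, fill_value=0)

n_cells = n_cells.astype("int64").rename("n_cells")
```

Only the running group counts are kept between batches, so the full X slice never has to be held in memory at once.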