Genes measured in each cell (dataset presence matrix)

The Census is a compilation of cells from multiple datasets that may differ in the sets of genes they measure. This notebook shows how to identify the genes measured in each dataset.

The presence matrix is a sparse boolean array, indicating which features (var) were present in each dataset. The array has dimensions [n_datasets, n_var], and is stored in the SOMA Measurement varp collection. The first dimension is indexed by the soma_joinid in the census_datasets dataframe. The second is indexed by the soma_joinid in the var dataframe of the measurement.

As a reminder, the obs dataframe has a dataset_id column that can be used to link any cell in the Census to its dataset's row in the presence matrix.
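The link from a cell to its presence matrix row can be sketched with toy data. The tables below are hypothetical stand-ins, not real Census content; only the column names dataset_id and soma_joinid come from the description above.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-ins for the Census tables (toy values, not real data).
toy_datasets = pd.DataFrame({"soma_joinid": [0, 1], "dataset_id": ["ds-a", "ds-b"]})
toy_obs = pd.DataFrame({"soma_joinid": [0, 1, 2], "dataset_id": ["ds-b", "ds-a", "ds-b"]})

# Toy presence matrix: rows are dataset joinids, columns are feature joinids.
toy_presence = sparse.csr_matrix(np.array([[1, 0, 1], [0, 1, 1]], dtype=np.uint8))

# Map each cell's dataset_id to the dataset's soma_joinid (the presence matrix row).
joinid_by_dataset = toy_datasets.set_index("dataset_id")["soma_joinid"]
cell_rows = toy_obs["dataset_id"].map(joinid_by_dataset).to_numpy()

# One row per cell: which features were measured in that cell's dataset.
per_cell_presence = toy_presence[cell_rows, :].toarray().astype(bool)
```

Densifying per cell is only reasonable for small selections; for many cells it is likely cheaper to work per dataset and broadcast afterwards.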

Contents

  1. Opening the Census.

  2. Fetching the IDs of the Census datasets.

  3. Fetching the dataset presence matrix.

  4. Identifying genes measured in a specific dataset.

  5. Identifying datasets that measured specific genes.

  6. Identifying all genes measured in a dataset.

⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable is_primary_data which is described in the Census schema.
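As a minimal sketch of that filter, assuming the cell metadata has already been read into a pandas DataFrame, duplicates can be dropped with a boolean mask. The toy table below is hypothetical; only the is_primary_data column name comes from the note above.

```python
import pandas as pd

# Toy obs table (hypothetical values, not real Census data).
toy_obs = pd.DataFrame(
    {
        "soma_joinid": [0, 1, 2],
        "dataset_id": ["ds-a", "ds-b", "ds-b"],
        "is_primary_data": [True, False, True],
    }
)

# Keep only primary cells, which drops the duplicated copies.
primary_obs = toy_obs[toy_obs.is_primary_data]
```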

Opening the Census

The cellxgene_census Python package contains a convenient API to open the latest version of the Census.

[1]:
import cellxgene_census

census = cellxgene_census.open_soma()
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.

Fetching the IDs of the Census datasets

Let’s grab a table of all the datasets included in the Census and use this table in combination with the presence matrix below.

[2]:
# Grab the experiment containing human data, and the measurement therein with RNA
human = census["census_data"]["homo_sapiens"]
human_rna = human.ms["RNA"]

# The census-wide datasets
datasets_df = census["census_info"]["datasets"].read().concat().to_pandas()

datasets_df
[2]:
soma_joinid collection_id collection_name collection_doi dataset_id dataset_title dataset_h5ad_path dataset_total_cell_count
0 0 e2c257e7-6f79-487c-b81c-39451cd4ab3c Spatial multiomics map of trophoblast developm... 10.1038/s41586-023-05869-0 f171db61-e57e-4535-a06a-35d8b6ef8f2b donor_p13_trophoblasts f171db61-e57e-4535-a06a-35d8b6ef8f2b.h5ad 31497
1 1 e2c257e7-6f79-487c-b81c-39451cd4ab3c Spatial multiomics map of trophoblast developm... 10.1038/s41586-023-05869-0 ecf2e08e-2032-4a9e-b466-b65b395f4a02 All donors trophoblasts ecf2e08e-2032-4a9e-b466-b65b395f4a02.h5ad 67070
2 2 e2c257e7-6f79-487c-b81c-39451cd4ab3c Spatial multiomics map of trophoblast developm... 10.1038/s41586-023-05869-0 74cff64f-9da9-4b2a-9b3b-8a04a1598040 All donors all cell states (in vivo) 74cff64f-9da9-4b2a-9b3b-8a04a1598040.h5ad 286326
3 3 f7cecffa-00b4-4560-a29a-8ad626b8ee08 Mapping single-cell transcriptomes in the intr... 10.1016/j.ccell.2022.11.001 5af90777-6760-4003-9dba-8f945fec6fdf Single-cell transcriptomic datasets of Renal c... 5af90777-6760-4003-9dba-8f945fec6fdf.h5ad 270855
4 4 3f50314f-bdc9-40c6-8e4a-b0901ebfbe4c Single-cell sequencing links multiregional imm... 10.1016/j.ccell.2021.03.007 bd65a70f-b274-4133-b9dd-0d1431b6af34 Single-cell sequencing links multiregional imm... bd65a70f-b274-4133-b9dd-0d1431b6af34.h5ad 167283
... ... ... ... ... ... ... ... ...
588 588 180bff9c-c8a5-4539-b13b-ddbc00d643e6 Molecular characterization of selectively vuln... 10.1038/s41593-020-00764-7 f9ad5649-f372-43e1-a3a8-423383e5a8a2 Molecular characterization of selectively vuln... f9ad5649-f372-43e1-a3a8-423383e5a8a2.h5ad 8168
589 589 a72afd53-ab92-4511-88da-252fb0e26b9a Single-cell atlas of peripheral immune respons... 10.1038/s41591-020-0944-y 456e8b9b-f872-488b-871d-94534090a865 Single-cell atlas of peripheral immune respons... 456e8b9b-f872-488b-871d-94534090a865.h5ad 44721
590 590 38833785-fac5-48fd-944a-0f62a4c23ed1 Construction of a human cell landscape at sing... 10.1038/s41586-020-2157-4 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf Construction of a human cell landscape at sing... 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf.h5ad 598266
591 591 5d445965-6f1a-4b68-ba3a-b8f765155d3a A molecular cell atlas of the human lung from ... 10.1038/s41586-020-2922-4 e04daea4-4412-45b5-989e-76a9be070a89 Krasnow Lab Human Lung Cell Atlas, Smart-seq2 e04daea4-4412-45b5-989e-76a9be070a89.h5ad 9409
592 592 5d445965-6f1a-4b68-ba3a-b8f765155d3a A molecular cell atlas of the human lung from ... 10.1038/s41586-020-2922-4 8c42cfd0-0b0a-46d5-910c-fc833d83c45e Krasnow Lab Human Lung Cell Atlas, 10X 8c42cfd0-0b0a-46d5-910c-fc833d83c45e.h5ad 65662

593 rows × 8 columns

Fetching the dataset presence matrix

Now let’s fetch the dataset presence matrix.

For convenience, read the entire presence matrix (for Homo sapiens) into a SciPy sparse matrix. The convenience API used below returns it in Compressed Sparse Row format, as the cell output shows.

[3]:
presence_matrix = cellxgene_census.get_presence_matrix(census, organism="Homo sapiens", measurement_name="RNA")

presence_matrix
[3]:
<593x60664 sparse matrix of type '<class 'numpy.uint8'>'
        with 16133717 stored elements in Compressed Sparse Row format>
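Because each row is a dataset and each column is a feature, row and column sums give quick summaries. The sketch below uses a small hypothetical matrix in place of the real presence matrix; the same two lines should apply to presence_matrix unchanged.

```python
import numpy as np
from scipy import sparse

# Toy stand-in for the presence matrix: 3 datasets x 4 features (not real data).
toy_presence = sparse.csr_matrix(
    np.array([[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1]], dtype=np.uint8)
)

# Number of features measured per dataset (one count per row).
# Cast first so the sums are not accumulated in a narrow integer type.
genes_per_dataset = np.asarray(toy_presence.astype(np.int64).sum(axis=1)).ravel()

# Number of datasets that measured each feature (one count per column).
datasets_per_gene = np.asarray(toy_presence.astype(np.int64).sum(axis=0)).ravel()
```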

We also need the var dataframe, which is read into a Pandas DataFrame for convenient manipulation:

[4]:
var_df = human_rna.var.read().concat().to_pandas()

var_df
[4]:
soma_joinid feature_id feature_name feature_length
0 0 ENSG00000121410 A1BG 3999
1 1 ENSG00000268895 A1BG-AS1 3374
2 2 ENSG00000148584 A1CF 9603
3 3 ENSG00000175899 A2M 6318
4 4 ENSG00000245105 A2M-AS1 2948
... ... ... ... ...
60659 60659 ENSG00000288719 RP4-669P10.21 4252
60660 60660 ENSG00000288720 RP11-852E15.3 7007
60661 60661 ENSG00000288721 RP5-973N23.5 7765
60662 60662 ENSG00000288723 RP11-553N16.6 1015
60663 60663 ENSG00000288724 RP13-546I2.2 625

60664 rows × 4 columns

Identifying genes measured in a specific dataset

Now that we have the dataset table, the gene metadata table, and the dataset presence matrix, we can check whether a gene or set of genes was measured in a specific dataset.

Important: the presence matrix is indexed by soma_joinid, and is NOT positionally indexed. In other words:

  • the first dimension of the presence matrix is the dataset’s soma_joinid, as stored in the census_datasets dataframe.

  • the second dimension of the presence matrix is the feature’s soma_joinid, as stored in the var dataframe.
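To see why this distinction matters, consider a dataframe that has been filtered: its row positions no longer match the soma_joinid values. The toy table below is hypothetical and only illustrates the difference.

```python
import pandas as pd

# Toy var table (hypothetical values, not real Census data).
toy_var = pd.DataFrame(
    {"soma_joinid": [0, 1, 2, 3], "feature_id": ["G0", "G1", "G2", "G3"]}
)

# After filtering and re-indexing, positions and joinids diverge.
subset = toy_var[toy_var.feature_id.isin(["G1", "G3"])].reset_index(drop=True)

# Row position of G3 in the filtered table.
position = int(subset.index[subset.feature_id == "G3"][0])

# soma_joinid of G3: this is the value to use as the presence matrix column.
joinid = int(subset.loc[subset.feature_id == "G3", "soma_joinid"].iloc[0])
```

Here position is 1 while joinid is 3, so indexing the presence matrix by position would silently select the wrong feature.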

Let’s find out if the gene "ENSG00000286096" was measured in the dataset with id "97a17473-e2b1-4f31-a544-44a60773e2dd".

[5]:
# Look up the soma_joinid of the feature and of the dataset
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid
dataset_joinid = datasets_df.loc[datasets_df.dataset_id == "97a17473-e2b1-4f31-a544-44a60773e2dd"].soma_joinid

# Index the presence matrix by joinid: [dataset, feature]
is_present = presence_matrix[dataset_joinid, var_joinid][0, 0]
print(f'Feature is {"present" if is_present else "not present"}.')
Feature is present.
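The same lookup extends to a set of genes. The following is a minimal sketch on toy data; the gene and dataset IDs are hypothetical, and the toy tables stand in for var_df, datasets_df, and presence_matrix.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins (hypothetical IDs and values, not real Census data).
toy_var = pd.DataFrame({"soma_joinid": [0, 1, 2], "feature_id": ["G0", "G1", "G2"]})
toy_datasets = pd.DataFrame({"soma_joinid": [0, 1], "dataset_id": ["ds-a", "ds-b"]})
toy_presence = sparse.csr_matrix(np.array([[1, 0, 1], [0, 1, 1]], dtype=np.uint8))

genes = ["G0", "G1"]
selected = toy_var[toy_var.feature_id.isin(genes)]
dataset_joinid = int(
    toy_datasets.loc[toy_datasets.dataset_id == "ds-a", "soma_joinid"].iloc[0]
)

# Slice the single dataset row, then pick the columns for the genes of interest.
row = toy_presence[[dataset_joinid], :].toarray().ravel()
flags = row[selected.soma_joinid.to_numpy()].astype(bool)

# Map each requested feature_id to True/False.
presence_by_gene = dict(zip(selected.feature_id, flags.tolist()))
```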

Identifying datasets that measured specific genes

Similarly, we can determine the datasets that measured a specific gene or set of genes.

[6]:
# Grab the feature's soma_joinid from the var dataframe
var_joinid = var_df.loc[var_df.feature_id == "ENSG00000286096"].soma_joinid

# The presence matrix is indexed by the joinids of the dataset and var dataframes,
# so slice out the feature of interest by its joinid.
dataset_joinids = presence_matrix[:, var_joinid].tocoo().row

# From the datasets dataframe, slice out the datasets which have a joinid in the list
datasets_df.loc[datasets_df.soma_joinid.isin(dataset_joinids)]
[6]:
soma_joinid collection_id collection_name collection_doi dataset_id dataset_title dataset_h5ad_path dataset_total_cell_count
5 5 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 ff45e623-7f5f-46e3-b47d-56be0341f66b Tabula Sapiens - Pancreas ff45e623-7f5f-46e3-b47d-56be0341f66b.h5ad 13497
6 6 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 f01bdd17-4902-40f5-86e3-240d66dd2587 Tabula Sapiens - Salivary_Gland f01bdd17-4902-40f5-86e3-240d66dd2587.h5ad 27199
7 7 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 e6a11140-2545-46bc-929e-da243eed2cae Tabula Sapiens - Heart e6a11140-2545-46bc-929e-da243eed2cae.h5ad 11505
8 8 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 e5c63d94-593c-4338-a489-e1048599e751 Tabula Sapiens - Bladder e5c63d94-593c-4338-a489-e1048599e751.h5ad 24583
9 9 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 d8732da6-8d1d-42d9-b625-f2416c30054b Tabula Sapiens - Trachea d8732da6-8d1d-42d9-b625-f2416c30054b.h5ad 9522
11 11 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 cee11228-9f0b-4e57-afe2-cfe15ee56312 Tabula Sapiens - Spleen cee11228-9f0b-4e57-afe2-cfe15ee56312.h5ad 34004
12 12 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 a357414d-2042-4eb5-95f0-c58604a18bdd Tabula Sapiens - Small_Intestine a357414d-2042-4eb5-95f0-c58604a18bdd.h5ad 12467
14 14 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 a0754256-f44b-4c4a-962c-a552e47d3fdc Tabula Sapiens - Eye a0754256-f44b-4c4a-962c-a552e47d3fdc.h5ad 10650
15 15 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 983d5ec9-40e8-4512-9e65-a572a9c486cb Tabula Sapiens - Blood 983d5ec9-40e8-4512-9e65-a572a9c486cb.h5ad 50115
19 19 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 5e5e7a2f-8f1c-42ac-90dc-b4f80f38e84c Tabula Sapiens - Fat 5e5e7a2f-8f1c-42ac-90dc-b4f80f38e84c.h5ad 20263
20 20 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 55cf0ea3-9d2b-4294-871e-bb4b49a79fc7 Tabula Sapiens - Tongue 55cf0ea3-9d2b-4294-871e-bb4b49a79fc7.h5ad 15020
21 21 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 4f1555bc-4664-46c3-a606-78d34dd10d92 Tabula Sapiens - Bone_Marrow 4f1555bc-4664-46c3-a606-78d34dd10d92.h5ad 12297
23 23 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 2423ce2c-3149-4cca-a2ff-cf682ea29b5f Tabula Sapiens - Kidney 2423ce2c-3149-4cca-a2ff-cf682ea29b5f.h5ad 9641
24 24 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 1c9eb291-6d31-47e1-96b2-129b5e1ae64f Tabula Sapiens - Muscle 1c9eb291-6d31-47e1-96b2-129b5e1ae64f.h5ad 30746
25 25 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 18eb630b-a754-4111-8cd4-c24ec80aa5ec Tabula Sapiens - Lymph_Node 18eb630b-a754-4111-8cd4-c24ec80aa5ec.h5ad 53275
26 26 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 0d2ee4ac-05ee-40b2-afb6-ebb584caa867 Tabula Sapiens - Lung 0d2ee4ac-05ee-40b2-afb6-ebb584caa867.h5ad 35682
27 27 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 0ced5e76-6040-47ff-8a72-93847965afc0 Tabula Sapiens - Thymus 0ced5e76-6040-47ff-8a72-93847965afc0.h5ad 33664
43 43 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 8e10f1c4-8e98-41e5-b65f-8cd89a887122 All neurons 8e10f1c4-8e98-41e5-b65f-8cd89a887122.h5ad 2480956
139 139 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 fe1a73ab-a203-45fd-84e9-0f7fd19efcbd Dissection: Amygdaloid complex (AMY) - basolat... fe1a73ab-a203-45fd-84e9-0f7fd19efcbd.h5ad 35285
143 143 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 f8dda921-5fb4-4c94-a654-c6fc346bfd6d Dissection: Cerebral cortex (Cx) - Occipitotem... f8dda921-5fb4-4c94-a654-c6fc346bfd6d.h5ad 31899
160 160 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 dd03ce70-3243-4c96-9561-330cc461e4d7 Dissection: Cerebral cortex (Cx) - Perirhinal ... dd03ce70-3243-4c96-9561-330cc461e4d7.h5ad 23732
165 165 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 d2b5efc1-14c6-4b5f-bd98-40f9084872d7 Dissection: Tail of Hippocampus (HiT) - Caudal... d2b5efc1-14c6-4b5f-bd98-40f9084872d7.h5ad 36886
175 175 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 c4b03352-af8d-492a-8d6b-40f304e0a122 Supercluster: Medium spiny neuron c4b03352-af8d-492a-8d6b-40f304e0a122.h5ad 152189
176 176 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771 Dissection: Cerebral cortex (Cx) - Temporal po... c2aad8fc-b63b-4f9b-9cfd-baf7bc9c1771.h5ad 37642
177 177 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 c202b243-1aa1-4b16-bc9a-b36241f3b1e3 Supercluster: Amygdala excitatory c202b243-1aa1-4b16-bc9a-b36241f3b1e3.h5ad 109452
178 178 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 bdb26abd-f4ba-4ea3-8862-c2340e7a4f55 Supercluster: CGE interneuron bdb26abd-f4ba-4ea3-8862-c2340e7a4f55.h5ad 227671
183 183 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 acae7679-d077-461c-b857-ee6ccfeb267f Dissection: Head of hippocampus (HiH) - CA1 acae7679-d077-461c-b857-ee6ccfeb267f.h5ad 39147
196 196 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 9372df2d-13d6-4fac-980b-919a5b7eb483 Dissection: Midbrain (M) - Periaqueductal gray... 9372df2d-13d6-4fac-980b-919a5b7eb483.h5ad 33794
197 197 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 93131426-0124-4ab4-a013-9dfbcd99d467 Dissection: Epithalamus - ETH 93131426-0124-4ab4-a013-9dfbcd99d467.h5ad 24327
206 206 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 7c1c3d47-3166-43e5-9a95-65ceb2d45f78 Dissection: Pons (Pn) - Pontine reticular form... 7c1c3d47-3166-43e5-9a95-65ceb2d45f78.h5ad 49512
208 208 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 7a0a8891-9a22-4549-a55b-c2aca23c3a2a Supercluster: Hippocampal CA1-3 7a0a8891-9a22-4549-a55b-c2aca23c3a2a.h5ad 74979
220 220 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 5e5ab909-f73f-4b57-98a0-6d2c5662f6a4 Dissection: Midbrain (M) - Inferior colliculus... 5e5ab909-f73f-4b57-98a0-6d2c5662f6a4.h5ad 32306
243 243 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 3f56901c-dd4a-47d6-b60b-7b0c0111cfb2 Dissection: Head of hippocampus (HiH) - CA1-3 3f56901c-dd4a-47d6-b60b-7b0c0111cfb2.h5ad 37911
245 245 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 3a7f3ab4-a280-4b3b-b2c0-6dd05614a78c Supercluster: Splatter 3a7f3ab4-a280-4b3b-b2c0-6dd05614a78c.h5ad 291833
249 249 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 35c8a04c-8639-4d15-8228-765d8d93fc96 Dissection: Hypothalamus (HTH) - supraoptic re... 35c8a04c-8639-4d15-8228-765d8d93fc96.h5ad 16753
270 270 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 07b1d7c8-5c2e-42f7-9246-26f746cd6013 Dissection: Myelencephalon (medulla oblongata)... 07b1d7c8-5c2e-42f7-9246-26f746cd6013.h5ad 27210
273 273 283d65eb-dd53-496d-adb7-7570c7caa443 Transcriptomic diversity of cell types across ... 10.1101/2022.10.12.511898 0325478a-9b52-45b5-b40a-2e2ab0d72eb1 Supercluster: Upper-layer intratelencephalic 0325478a-9b52-45b5-b40a-2e2ab0d72eb1.h5ad 455006
475 475 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 53d208b0-2cfd-4366-9866-c3c6114081bc Tabula Sapiens - All Cells 53d208b0-2cfd-4366-9866-c3c6114081bc.h5ad 483152
476 476 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 a68b64d8-aee3-4947-81b7-36b8fe5a44d2 Tabula Sapiens - Stromal a68b64d8-aee3-4947-81b7-36b8fe5a44d2.h5ad 82478
477 477 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 c5d88abe-f23a-45fa-a534-788985e93dad Tabula Sapiens - Immune c5d88abe-f23a-45fa-a534-788985e93dad.h5ad 264824
478 478 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 5a11f879-d1ef-458a-910c-9b0bdfca5ebf Tabula Sapiens - Endothelial 5a11f879-d1ef-458a-910c-9b0bdfca5ebf.h5ad 31691
479 479 e5f58829-1a66-40b5-a624-9046778e74f5 Tabula Sapiens 10.1126/science.abl4896 97a17473-e2b1-4f31-a544-44a60773e2dd Tabula Sapiens - Epithelial 97a17473-e2b1-4f31-a544-44a60773e2dd.h5ad 104148
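When several genes are selected at once, the row indices of the sliced matrix identify datasets that measured any of them. If we instead want datasets that measured all of them, one possible approach is to densify the sliced block and reduce across columns. The sketch below uses hypothetical toy data in place of datasets_df and presence_matrix.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins (hypothetical values, not real Census data).
toy_datasets = pd.DataFrame(
    {"soma_joinid": [0, 1, 2], "dataset_id": ["ds-a", "ds-b", "ds-c"]}
)
toy_presence = sparse.csr_matrix(
    np.array([[1, 0, 1], [1, 1, 1], [0, 1, 1]], dtype=np.uint8)
)

# Feature joinids of interest (hypothetical).
var_joinids = np.array([0, 2])

# Dense boolean block of shape (n_datasets, n_selected_features).
block = toy_presence[:, var_joinids].toarray().astype(bool)

# Datasets that measured every selected feature.
all_rows = np.flatnonzero(block.all(axis=1))
datasets_with_all = toy_datasets[toy_datasets.soma_joinid.isin(all_rows)]
```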

Identifying all genes measured in a dataset

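Finally, we can find the set of genes that were measured in the cells of a given dataset. The example below selects every dataset in one collection, so the result is the set of genes measured in at least one of those datasets.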

[7]:
# Slice the dataset(s) of interest, and get the joinid(s)
dataset_joinids = datasets_df.loc[datasets_df.collection_id == "17481d16-ee44-49e5-bcf0-28c0780d8c4a"].soma_joinid

# Slice the presence matrix by the first dimension, i.e., by dataset
var_joinids = presence_matrix[dataset_joinids, :].tocoo().col

# From the feature (var) dataframe, slice out features which have a joinid in the list.
var_df.loc[var_df.soma_joinid.isin(var_joinids)]
[7]:
soma_joinid feature_id feature_name feature_length
0 0 ENSG00000121410 A1BG 3999
1 1 ENSG00000268895 A1BG-AS1 3374
2 2 ENSG00000148584 A1CF 9603
3 3 ENSG00000175899 A2M 6318
4 4 ENSG00000245105 A2M-AS1 2948
... ... ... ... ...
58109 58109 ENSG00000277745 H2AB3 591
58354 58354 ENSG00000233522 FAM224A 2031
58411 58411 ENSG00000183146 PRORY 878
58523 58523 ENSG00000279274 RP11-533E23.2 75
58632 58632 ENSG00000277836 ENSG00000277836.1 288

27211 rows × 4 columns
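When more than one dataset is selected, it can be useful to separate genes measured in at least one dataset from genes measured in every dataset. The following sketch shows both on a hypothetical toy matrix; the dataset joinids are made up for illustration.

```python
import numpy as np
from scipy import sparse

# Toy presence matrix (hypothetical values): 3 datasets x 4 features.
toy_presence = sparse.csr_matrix(
    np.array([[1, 0, 1, 1], [1, 1, 1, 0], [0, 0, 1, 0]], dtype=np.uint8)
)

# Joinids of the datasets in a hypothetical collection.
dataset_joinids = np.array([0, 1])
block = toy_presence[dataset_joinids, :].toarray().astype(bool)

# Features measured in at least one selected dataset.
union_joinids = np.flatnonzero(block.any(axis=0))

# Features measured in every selected dataset.
shared_joinids = np.flatnonzero(block.all(axis=0))
```

The shared set is the safer choice when comparing expression across datasets, since a missing gene is otherwise indistinguishable from a gene with zero counts.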