Genes measured in each cell (dataset presence matrix)
Source:vignettes/census_dataset_presence.Rmd
census_dataset_presence.Rmd
The Census is a compilation of cells from multiple datasets that may differ by the sets of genes they measure. This notebook describes the way to identify the genes measured per dataset.
The presence matrix is a sparse boolean array, indicating which features (var) were present in each dataset. The array has dimensions [n_datasets, n_var], and is stored in the SOMA Measurement varp
collection. The first dimension is indexed by the soma_joinid
in the census_datasets
dataframe. The second is indexed by the soma_joinid
in the var
dataframe of the measurement.
As a reminder the obs
data frame has a column dataset_id
that can be used to link any cell in the Census to the presence matrix.
Contents
- Opening the Census.
- Fetching the IDs of the Census datasets.
- Fetching the dataset presence matrix.
- Identifying genes measured in a specific dataset.
- Identifying datasets that measured specific genes.
- Identifying all genes measured in a dataset.
Opening the Census
The cellxgene.census
R package contains a convenient API to open any version of the Census (by default, the newest stable version).
library("cellxgene.census")
census <- open_soma()
Fetching the IDs of the Census datasets
Let’s grab a table of all the datasets included in the Census and use this table in combination with the presence matrix below.
# Grab the experiment containing human data, and the measurement therein with RNA
human <- census$get("census_data")$get("homo_sapiens")
human_rna <- human$ms$get("RNA")
# The census-wide datasets
datasets_df <- as.data.frame(census$get("census_info")$get("datasets")$read()$concat())
print(datasets_df)
#> soma_joinid collection_id
#> 1 0 4dca242c-d302-4dba-a68f-4c61e7bad553
#> 2 1 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 3 2 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 4 3 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 5 4 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 6 5 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 7 6 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 8 7 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 9 8 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 10 9 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 11 10 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> collection_name
#> 1 Comparative transcriptomics reveals human-specific cortical features
#> 2 Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 3 Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 4 Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 5 Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 6 Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 7 Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 8 Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 9 Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 10 Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 11 Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> collection_doi dataset_id
#> 1 10.1126/science.ade9516 2bdd3a2c-2ff4-4314-adf3-8a06b797a33a
#> 2 10.1126/science.adf6812 f5b0810c-1664-4a62-ad06-be1d9964aa8b
#> 3 10.1126/science.adf6812 e4ddac12-f48f-4455-8e8d-c2a48a683437
#> 4 10.1126/science.adf6812 e2808a6e-e2ea-41b9-b38c-4a08f1677f02
#> 5 10.1126/science.adf6812 d01c9dff-abd1-4825-bf30-2eb2ba74597e
#> 6 10.1126/science.adf6812 c3aa4f95-7a18-4a7d-8dd8-ca324d714363
#> 7 10.1126/science.adf6812 be401db3-d732-408a-b0c4-71af0458b8ab
#> 8 10.1126/science.adf6812 a5d5c529-8a1f-40b5-bda3-35208970070d
#> 9 10.1126/science.adf6812 9c63201d-bfd9-41a8-bbbc-18d947556f3d
#> 10 10.1126/science.adf6812 93cb76aa-a84b-4a92-8e6c-66a914e26d4c
#> 11 10.1126/science.adf6812 8d1dd010-5cbc-43fb-83f8-e0de8e8517da
#> dataset_version_id
#> 1 7eb7f2fd-fd74-4c99-863c-97836415652e
#> 2 d4427196-7876-4bdd-a929-ae4d177ec776
#> 3 3280113b-7148-4a3e-98d4-015f443aab8a
#> 4 dc092185-3b8e-4fcb-ae21-1dc106d683ac
#> 5 c4959ded-83dc-4442-aac7-9a59bdb47801
#> 6 0476ef54-aefe-4754-b0e9-d9fcd75adff4
#> 7 ee027704-72aa-4195-a467-0754db1ed65d
#> 8 d47c0742-cea2-46c1-9e72-4d479214041c
#> 9 8b09695a-1426-4867-961e-c40a1fbcc2da
#> 10 98ad7381-f464-4f49-b850-5321b4f98be6
#> 11 c56683d2-452a-45dc-b402-35397e27e325
#> dataset_title
#> 1 Human: Great apes study
#> 2 Dissection: Angular gyrus (AnG)
#> 3 Supercluster: CGE-derived interneurons
#> 4 Dissection: Primary auditory cortex(A1)
#> 5 Supercluster: Deep layer (non-IT) excitatory neurons
#> 6 Supercluster: IT-projecting excitatory neurons
#> 7 Dissection: Anterior cingulate cortex (ACC)
#> 8 Human Multiple Cortical Areas SMART-seq
#> 9 Supercluster: MGE-derived interneurons
#> 10 Dissection: Primary somatosensory cortex (S1)
#> 11 Dissection: Primary visual cortex(V1)
#> dataset_h5ad_path dataset_total_cell_count
#> 1 2bdd3a2c-2ff4-4314-adf3-8a06b797a33a.h5ad 156285
#> 2 f5b0810c-1664-4a62-ad06-be1d9964aa8b.h5ad 110752
#> 3 e4ddac12-f48f-4455-8e8d-c2a48a683437.h5ad 129495
#> 4 e2808a6e-e2ea-41b9-b38c-4a08f1677f02.h5ad 139054
#> 5 d01c9dff-abd1-4825-bf30-2eb2ba74597e.h5ad 92969
#> 6 c3aa4f95-7a18-4a7d-8dd8-ca324d714363.h5ad 638941
#> 7 be401db3-d732-408a-b0c4-71af0458b8ab.h5ad 135462
#> 8 a5d5c529-8a1f-40b5-bda3-35208970070d.h5ad 49417
#> 9 9c63201d-bfd9-41a8-bbbc-18d947556f3d.h5ad 185477
#> 10 93cb76aa-a84b-4a92-8e6c-66a914e26d4c.h5ad 153159
#> 11 8d1dd010-5cbc-43fb-83f8-e0de8e8517da.h5ad 241077
#> [ reached 'max' / getOption("max.print") -- omitted 640 rows ]
Fetching the dataset presence matrix
Now let’s fetch the dataset presence matrix.
For convenience, read the entire presence matrix (for Homo sapiens) into a sparse matrix. There is a convenience function providing this capability:
presence_matrix <- get_presence_matrix(census, "Homo sapiens", "RNA")
print(dim(presence_matrix))
#> NULL
We also need the var
dataframe, which is read into an R data frame for convenient manipulation:
var_df <- as.data.frame(human_rna$var$read()$concat())
print(var_df)
#> soma_joinid feature_id feature_name feature_length nnz n_measured_obs
#> 1 0 ENSG00000233576 HTR3C2P 1057 69370 19581263
#> 2 1 ENSG00000121410 A1BG 3999 5640476 62641311
#> 3 2 ENSG00000268895 A1BG-AS1 3374 3071864 61946057
#> 4 3 ENSG00000148584 A1CF 9603 734347 58195911
#> 5 4 ENSG00000175899 A2M 6318 7894261 62704378
#> 6 5 ENSG00000245105 A2M-AS1 2948 1637794 62086816
#> 7 6 ENSG00000166535 A2ML1 7156 2156616 60911688
#> 8 7 ENSG00000256069 A2MP1 4657 835384 23554778
#> 9 8 ENSG00000184389 A3GALT2 1023 439067 53780311
#> 10 9 ENSG00000128274 A4GALT 3358 2432348 62706770
#> 11 10 ENSG00000118017 A4GNT 1779 52430 56117399
#> 12 11 ENSG00000265544 AA06 632 220755 22545140
#> 13 12 ENSG00000081760 AACS 16039 11280800 62842909
#> 14 13 ENSG00000250420 AACSP1 3380 211588 22831831
#> 15 14 ENSG00000114771 AADAC 1632 552258 54941618
#> 16 15 ENSG00000188984 AADACL3 4055 24626 43074608
#> [ reached 'max' / getOption("max.print") -- omitted 60648 rows ]
Identifying genes measured in a specific dataset
Now that we have the dataset table, the genes metadata table, and the dataset presence matrix, we can check if a gene or set of genes were measured in a specific dataset.
Important: the presence matrix is indexed by soma_joinid
, and is NOT positionally indexed. In other words:
- the first dimension of the presence matrix is the dataset’s
soma_joinid
, as stored in thecensus_datasets
dataframe. - the second dimension of the presence matrix is the feature’s
soma_joinid
, as stored in thevar
dataframe.
The presence matrix has a method $take()
that lets you slice it by soma_joinid
s from census_datasets
and var
. And the full presence matrix, or slices of it, can then be exported to a regular matrix with the method $get_one_based_matrix()
Let’s find out if the the gene "ENSG00000286096"
was measured in the dataset with id "97a17473-e2b1-4f31-a544-44a60773e2dd"
.
# Get soma_joinid for datasets and genes of interest
var_joinid <- var_df$soma_joinid[var_df$feature_id == "ENSG00000286096"]
dataset_joinid <- datasets_df$soma_joinid[datasets_df$dataset_id == "97a17473-e2b1-4f31-a544-44a60773e2dd"]
# Slice presence matrix with datasets and genes of interest
presence_matrix_slice <- presence_matrix$take(i = dataset_joinid, j = var_joinid)
# Convert presence matrix to regular matrix
presence_matrix_slice <- presence_matrix_slice$get_one_based_matrix()
# Find how if the gene is present in this dataset
is_present <- presence_matrix_slice[, , drop = TRUE]
cat(paste("Feature is", if (is_present) "present." else "not present."))
#> Feature is present.
Identifying datasets that measured specific genes
Similarly, we can determine the datasets that measured a specific gene or set of genes.
# Grab the feature's soma_joinid from the var dataframe
var_joinid <- var_df$soma_joinid[var_df$feature_id == "ENSG00000286096"]
# The presence matrix is indexed by the joinids of the dataset and var dataframes,
# so slice out the feature of interest by its joinid.
presence_matrix_slice <- presence_matrix$take(j = var_joinid)$get_one_based_matrix()
measured_datasets <- presence_matrix_slice[, , drop = TRUE] != 0
dataset_joinids <- datasets_df$soma_joinid[measured_datasets]
# From the datasets dataframe, slice out the datasets which have a joinid in the list
print(datasets_df[dataset_joinids, ])
#> soma_joinid collection_id
#> 63 62 3f50314f-bdc9-40c6-8e4a-b0901ebfbe4c
#> 64 63 e5f58829-1a66-40b5-a624-9046778e74f5
#> 65 64 e5f58829-1a66-40b5-a624-9046778e74f5
#> 66 65 e5f58829-1a66-40b5-a624-9046778e74f5
#> 67 66 e5f58829-1a66-40b5-a624-9046778e74f5
#> 69 68 e5f58829-1a66-40b5-a624-9046778e74f5
#> 70 69 e5f58829-1a66-40b5-a624-9046778e74f5
#> 72 71 e5f58829-1a66-40b5-a624-9046778e74f5
#> 73 72 e5f58829-1a66-40b5-a624-9046778e74f5
#> 77 76 e5f58829-1a66-40b5-a624-9046778e74f5
#> 78 77 e5f58829-1a66-40b5-a624-9046778e74f5
#> collection_name
#> 63 Single-cell sequencing links multiregional immune landscapes and tissue-resident T cells in ccRCC to tumor topology and therapy efficacy
#> 64 Tabula Sapiens
#> 65 Tabula Sapiens
#> 66 Tabula Sapiens
#> 67 Tabula Sapiens
#> 69 Tabula Sapiens
#> 70 Tabula Sapiens
#> 72 Tabula Sapiens
#> 73 Tabula Sapiens
#> 77 Tabula Sapiens
#> 78 Tabula Sapiens
#> collection_doi dataset_id
#> 63 10.1016/j.ccell.2021.03.007 bd65a70f-b274-4133-b9dd-0d1431b6af34
#> 64 10.1126/science.abl4896 ff45e623-7f5f-46e3-b47d-56be0341f66b
#> 65 10.1126/science.abl4896 f01bdd17-4902-40f5-86e3-240d66dd2587
#> 66 10.1126/science.abl4896 e6a11140-2545-46bc-929e-da243eed2cae
#> 67 10.1126/science.abl4896 e5c63d94-593c-4338-a489-e1048599e751
#> 69 10.1126/science.abl4896 d77ec7d6-ef2e-49d6-9e79-05b7f8881484
#> 70 10.1126/science.abl4896 cee11228-9f0b-4e57-afe2-cfe15ee56312
#> 72 10.1126/science.abl4896 a2d4d33e-4c62-4361-b80a-9be53d2e50e8
#> 73 10.1126/science.abl4896 a0754256-f44b-4c4a-962c-a552e47d3fdc
#> 77 10.1126/science.abl4896 6d41668c-168c-4500-b06a-4674ccf3e19d
#> 78 10.1126/science.abl4896 5e5e7a2f-8f1c-42ac-90dc-b4f80f38e84c
#> dataset_version_id
#> 63 71815674-a8cf-4add-95dd-c5d5d1631597
#> 64 0b29f4ce-5e72-4356-b74b-b54714979234
#> 65 bd13c169-af97-4d8f-ba45-7588808c2e48
#> 66 47615a3d-0a9f-4a78-88ef-5cce2a84637d
#> 67 ac7714f0-dce2-40ba-9912-324de6c9a77f
#> 69 c7679ec2-652d-437a-bded-3ec2344829e4
#> 70 f89fa18f-c32b-4bae-9511-1a4d18f200e1
#> 72 37ada0d2-9970-4ff2-8bcd-41e80ab6e081
#> 73 1cda78aa-f0d9-4d50-96bf-8bc309318802
#> 77 5297a910-453f-4e3f-af16-e18fd5a79090
#> 78 b783b036-c837-4290-a07d-f6b79a301f59
#> dataset_title
#> 63 Single-cell sequencing links multiregional immune landscapes and tissue-resident T cells in ccRCC to tumor topology and therapy efficacy
#> 64 Tabula Sapiens - Pancreas
#> 65 Tabula Sapiens - Salivary_Gland
#> 66 Tabula Sapiens - Heart
#> 67 Tabula Sapiens - Bladder
#> 69 Tabula Sapiens - Prostate
#> 70 Tabula Sapiens - Spleen
#> 72 Tabula Sapiens - Vasculature
#> 73 Tabula Sapiens - Eye
#> 77 Tabula Sapiens - Liver
#> 78 Tabula Sapiens - Fat
#> dataset_h5ad_path dataset_total_cell_count
#> 63 bd65a70f-b274-4133-b9dd-0d1431b6af34.h5ad 167283
#> 64 ff45e623-7f5f-46e3-b47d-56be0341f66b.h5ad 13497
#> 65 f01bdd17-4902-40f5-86e3-240d66dd2587.h5ad 27199
#> 66 e6a11140-2545-46bc-929e-da243eed2cae.h5ad 11505
#> 67 e5c63d94-593c-4338-a489-e1048599e751.h5ad 24583
#> 69 d77ec7d6-ef2e-49d6-9e79-05b7f8881484.h5ad 16375
#> 70 cee11228-9f0b-4e57-afe2-cfe15ee56312.h5ad 34004
#> 72 a2d4d33e-4c62-4361-b80a-9be53d2e50e8.h5ad 16037
#> 73 a0754256-f44b-4c4a-962c-a552e47d3fdc.h5ad 10650
#> 77 6d41668c-168c-4500-b06a-4674ccf3e19d.h5ad 5007
#> 78 5e5e7a2f-8f1c-42ac-90dc-b4f80f38e84c.h5ad 20263
#> [ reached 'max' / getOption("max.print") -- omitted 31 rows ]
Identifying all genes measured in a dataset
Finally, we can find the set of genes that were measured in the cells of a given dataset.
# Slice the dataset(s) of interest, and get the joinid(s)
dataset_joinids <- datasets_df$soma_joinid[datasets_df$collection_id == "17481d16-ee44-49e5-bcf0-28c0780d8c4a"]
# Slice the presence matrix by the first dimension, i.e., by dataset
presence_matrix_slice <- presence_matrix$take(i = dataset_joinids)$get_one_based_matrix()
genes_measured <- Matrix::colSums(presence_matrix_slice) > 0
var_joinids <- var_df$soma_joinid[genes_measured]
print(var_df[var_joinids, ])
#> soma_joinid feature_id feature_name feature_length nnz n_measured_obs
#> 1 0 ENSG00000233576 HTR3C2P 1057 69370 19581263
#> 2 1 ENSG00000121410 A1BG 3999 5640476 62641311
#> 3 2 ENSG00000268895 A1BG-AS1 3374 3071864 61946057
#> 4 3 ENSG00000148584 A1CF 9603 734347 58195911
#> 5 4 ENSG00000175899 A2M 6318 7894261 62704378
#> 6 5 ENSG00000245105 A2M-AS1 2948 1637794 62086816
#> 9 8 ENSG00000184389 A3GALT2 1023 439067 53780311
#> 10 9 ENSG00000128274 A4GALT 3358 2432348 62706770
#> 12 11 ENSG00000265544 AA06 632 220755 22545140
#> 14 13 ENSG00000250420 AACSP1 3380 211588 22831831
#> 16 15 ENSG00000188984 AADACL3 4055 24626 43074608
#> 18 17 ENSG00000240602 AADACP1 2012 29491 23133490
#> 19 18 ENSG00000109576 AADAT 2970 4524608 61559099
#> 20 19 ENSG00000158122 PRXL2C 3098 5424472 55618144
#> 21 20 ENSG00000103591 AAGAB 4138 12427442 62843055
#> 22 21 ENSG00000115977 AAK1 24843 29280566 62664775
#> [ reached 'max' / getOption("max.print") -- omitted 27195 rows ]