Skip to contents

The Census is a compilation of cells from multiple datasets that may differ by the sets of genes they measure. This notebook describes the way to identify the genes measured per dataset.

The presence matrix is a sparse boolean array, indicating which features (var) were present in each dataset. The array has dimensions [n_datasets, n_var], and is stored in the SOMA Measurement varp collection. The first dimension is indexed by the soma_joinid in the census_datasets dataframe. The second is indexed by the soma_joinid in the var dataframe of the measurement.

As a reminder the obs data frame has a column dataset_id that can be used to link any cell in the Census to the presence matrix.

Contents

  1. Opening the Census.
  2. Fetching the IDs of the Census datasets.
  3. Fetching the dataset presence matrix.
  4. Identifying genes measured in a specific dataset.
  5. Identifying datasets that measured specific genes.
  6. Identifying all genes measured in a dataset.

Opening the Census

The cellxgene.census R package contains a convenient API to open any version of the Census (by default, the newest stable version).

Fetching the IDs of the Census datasets

Let’s grab a table of all the datasets included in the Census and use this table in combination with the presence matrix below.

# Grab the experiment containing human data, and the measurement therein with RNA
human <- census$get("census_data")$get("homo_sapiens")
human_rna <- human$ms$get("RNA")

# The census-wide datasets
datasets_df <- as.data.frame(census$get("census_info")$get("datasets")$read()$concat())
print(datasets_df)
#>    soma_joinid                        collection_id
#> 1            0 4dca242c-d302-4dba-a68f-4c61e7bad553
#> 2            1 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 3            2 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 4            3 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 5            4 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 6            5 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 7            6 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 8            7 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 9            8 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 10           9 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#> 11          10 d17249d2-0e6e-4500-abb8-e6c93fa1ac6f
#>                                                                       collection_name
#> 1                Comparative transcriptomics reveals human-specific cortical features
#> 2  Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 3  Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 4  Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 5  Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 6  Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 7  Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 8  Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 9  Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 10 Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#> 11 Transcriptomic cytoarchitecture reveals principles of human neocortex organization
#>             collection_doi                           dataset_id
#> 1  10.1126/science.ade9516 2bdd3a2c-2ff4-4314-adf3-8a06b797a33a
#> 2  10.1126/science.adf6812 f5b0810c-1664-4a62-ad06-be1d9964aa8b
#> 3  10.1126/science.adf6812 e4ddac12-f48f-4455-8e8d-c2a48a683437
#> 4  10.1126/science.adf6812 e2808a6e-e2ea-41b9-b38c-4a08f1677f02
#> 5  10.1126/science.adf6812 d01c9dff-abd1-4825-bf30-2eb2ba74597e
#> 6  10.1126/science.adf6812 c3aa4f95-7a18-4a7d-8dd8-ca324d714363
#> 7  10.1126/science.adf6812 be401db3-d732-408a-b0c4-71af0458b8ab
#> 8  10.1126/science.adf6812 a5d5c529-8a1f-40b5-bda3-35208970070d
#> 9  10.1126/science.adf6812 9c63201d-bfd9-41a8-bbbc-18d947556f3d
#> 10 10.1126/science.adf6812 93cb76aa-a84b-4a92-8e6c-66a914e26d4c
#> 11 10.1126/science.adf6812 8d1dd010-5cbc-43fb-83f8-e0de8e8517da
#>                      dataset_version_id
#> 1  7eb7f2fd-fd74-4c99-863c-97836415652e
#> 2  d4427196-7876-4bdd-a929-ae4d177ec776
#> 3  3280113b-7148-4a3e-98d4-015f443aab8a
#> 4  dc092185-3b8e-4fcb-ae21-1dc106d683ac
#> 5  c4959ded-83dc-4442-aac7-9a59bdb47801
#> 6  0476ef54-aefe-4754-b0e9-d9fcd75adff4
#> 7  ee027704-72aa-4195-a467-0754db1ed65d
#> 8  d47c0742-cea2-46c1-9e72-4d479214041c
#> 9  8b09695a-1426-4867-961e-c40a1fbcc2da
#> 10 98ad7381-f464-4f49-b850-5321b4f98be6
#> 11 c56683d2-452a-45dc-b402-35397e27e325
#>                                           dataset_title
#> 1                               Human: Great apes study
#> 2                       Dissection: Angular gyrus (AnG)
#> 3                Supercluster: CGE-derived interneurons
#> 4               Dissection: Primary auditory cortex(A1)
#> 5  Supercluster: Deep layer (non-IT) excitatory neurons
#> 6        Supercluster: IT-projecting excitatory neurons
#> 7           Dissection: Anterior cingulate cortex (ACC)
#> 8               Human Multiple Cortical Areas SMART-seq
#> 9                Supercluster: MGE-derived interneurons
#> 10        Dissection: Primary somatosensory cortex (S1)
#> 11                Dissection: Primary visual cortex(V1)
#>                            dataset_h5ad_path dataset_total_cell_count
#> 1  2bdd3a2c-2ff4-4314-adf3-8a06b797a33a.h5ad                   156285
#> 2  f5b0810c-1664-4a62-ad06-be1d9964aa8b.h5ad                   110752
#> 3  e4ddac12-f48f-4455-8e8d-c2a48a683437.h5ad                   129495
#> 4  e2808a6e-e2ea-41b9-b38c-4a08f1677f02.h5ad                   139054
#> 5  d01c9dff-abd1-4825-bf30-2eb2ba74597e.h5ad                    92969
#> 6  c3aa4f95-7a18-4a7d-8dd8-ca324d714363.h5ad                   638941
#> 7  be401db3-d732-408a-b0c4-71af0458b8ab.h5ad                   135462
#> 8  a5d5c529-8a1f-40b5-bda3-35208970070d.h5ad                    49417
#> 9  9c63201d-bfd9-41a8-bbbc-18d947556f3d.h5ad                   185477
#> 10 93cb76aa-a84b-4a92-8e6c-66a914e26d4c.h5ad                   153159
#> 11 8d1dd010-5cbc-43fb-83f8-e0de8e8517da.h5ad                   241077
#>  [ reached 'max' / getOption("max.print") -- omitted 640 rows ]

Fetching the dataset presence matrix

Now let’s fetch the dataset presence matrix.

For convenience, read the entire presence matrix (for Homo sapiens) into a sparse matrix. There is a convenience function providing this capability:

presence_matrix <- get_presence_matrix(census, "Homo sapiens", "RNA")
print(dim(presence_matrix))
#> NULL

We also need the var dataframe, which is read into an R data frame for convenient manipulation:

var_df <- as.data.frame(human_rna$var$read()$concat())
print(var_df)
#>    soma_joinid      feature_id feature_name feature_length      nnz n_measured_obs
#> 1            0 ENSG00000233576      HTR3C2P           1057    69370       19581263
#> 2            1 ENSG00000121410         A1BG           3999  5640476       62641311
#> 3            2 ENSG00000268895     A1BG-AS1           3374  3071864       61946057
#> 4            3 ENSG00000148584         A1CF           9603   734347       58195911
#> 5            4 ENSG00000175899          A2M           6318  7894261       62704378
#> 6            5 ENSG00000245105      A2M-AS1           2948  1637794       62086816
#> 7            6 ENSG00000166535        A2ML1           7156  2156616       60911688
#> 8            7 ENSG00000256069        A2MP1           4657   835384       23554778
#> 9            8 ENSG00000184389      A3GALT2           1023   439067       53780311
#> 10           9 ENSG00000128274       A4GALT           3358  2432348       62706770
#> 11          10 ENSG00000118017        A4GNT           1779    52430       56117399
#> 12          11 ENSG00000265544         AA06            632   220755       22545140
#> 13          12 ENSG00000081760         AACS          16039 11280800       62842909
#> 14          13 ENSG00000250420       AACSP1           3380   211588       22831831
#> 15          14 ENSG00000114771        AADAC           1632   552258       54941618
#> 16          15 ENSG00000188984      AADACL3           4055    24626       43074608
#>  [ reached 'max' / getOption("max.print") -- omitted 60648 rows ]

Identifying genes measured in a specific dataset

Now that we have the dataset table, the genes metadata table, and the dataset presence matrix, we can check if a gene or set of genes were measured in a specific dataset.

Important: the presence matrix is indexed by soma_joinid, and is NOT positionally indexed. In other words:

  • the first dimension of the presence matrix is the dataset’s soma_joinid, as stored in the census_datasets dataframe.
  • the second dimension of the presence matrix is the feature’s soma_joinid, as stored in the var dataframe.

The presence matrix has a method $take() that lets you slice it by soma_joinids from census_datasets and var. And the full presence matrix, or slices of it, can then be exported to a regular matrix with the method $get_one_based_matrix()

Let’s find out if the the gene "ENSG00000286096" was measured in the dataset with id "97a17473-e2b1-4f31-a544-44a60773e2dd".

# Get soma_joinid for datasets and genes of interest
var_joinid <- var_df$soma_joinid[var_df$feature_id == "ENSG00000286096"]
dataset_joinid <- datasets_df$soma_joinid[datasets_df$dataset_id == "97a17473-e2b1-4f31-a544-44a60773e2dd"]

# Slice presence matrix with datasets and genes of interest
presence_matrix_slice <- presence_matrix$take(i = dataset_joinid, j = var_joinid)

# Convert presence matrix to regular matrix
presence_matrix_slice <- presence_matrix_slice$get_one_based_matrix()

# Find how if the gene is present in this dataset
is_present <- presence_matrix_slice[, , drop = TRUE]
cat(paste("Feature is", if (is_present) "present." else "not present."))
#> Feature is present.

Identifying datasets that measured specific genes

Similarly, we can determine the datasets that measured a specific gene or set of genes.

# Grab the feature's soma_joinid from the var dataframe
var_joinid <- var_df$soma_joinid[var_df$feature_id == "ENSG00000286096"]

# The presence matrix is indexed by the joinids of the dataset and var dataframes,
# so slice out the feature of interest by its joinid.
presence_matrix_slice  <- presence_matrix$take(j = var_joinid)$get_one_based_matrix()
measured_datasets <- presence_matrix_slice[, , drop = TRUE] != 0
dataset_joinids <- datasets_df$soma_joinid[measured_datasets]

# From the datasets dataframe, slice out the datasets which have a joinid in the list
print(datasets_df[dataset_joinids, ])
#>    soma_joinid                        collection_id
#> 63          62 3f50314f-bdc9-40c6-8e4a-b0901ebfbe4c
#> 64          63 e5f58829-1a66-40b5-a624-9046778e74f5
#> 65          64 e5f58829-1a66-40b5-a624-9046778e74f5
#> 66          65 e5f58829-1a66-40b5-a624-9046778e74f5
#> 67          66 e5f58829-1a66-40b5-a624-9046778e74f5
#> 69          68 e5f58829-1a66-40b5-a624-9046778e74f5
#> 70          69 e5f58829-1a66-40b5-a624-9046778e74f5
#> 72          71 e5f58829-1a66-40b5-a624-9046778e74f5
#> 73          72 e5f58829-1a66-40b5-a624-9046778e74f5
#> 77          76 e5f58829-1a66-40b5-a624-9046778e74f5
#> 78          77 e5f58829-1a66-40b5-a624-9046778e74f5
#>                                                                                                                             collection_name
#> 63 Single-cell sequencing links multiregional immune landscapes and tissue-resident T cells in ccRCC to tumor topology and therapy efficacy
#> 64                                                                                                                           Tabula Sapiens
#> 65                                                                                                                           Tabula Sapiens
#> 66                                                                                                                           Tabula Sapiens
#> 67                                                                                                                           Tabula Sapiens
#> 69                                                                                                                           Tabula Sapiens
#> 70                                                                                                                           Tabula Sapiens
#> 72                                                                                                                           Tabula Sapiens
#> 73                                                                                                                           Tabula Sapiens
#> 77                                                                                                                           Tabula Sapiens
#> 78                                                                                                                           Tabula Sapiens
#>                 collection_doi                           dataset_id
#> 63 10.1016/j.ccell.2021.03.007 bd65a70f-b274-4133-b9dd-0d1431b6af34
#> 64     10.1126/science.abl4896 ff45e623-7f5f-46e3-b47d-56be0341f66b
#> 65     10.1126/science.abl4896 f01bdd17-4902-40f5-86e3-240d66dd2587
#> 66     10.1126/science.abl4896 e6a11140-2545-46bc-929e-da243eed2cae
#> 67     10.1126/science.abl4896 e5c63d94-593c-4338-a489-e1048599e751
#> 69     10.1126/science.abl4896 d77ec7d6-ef2e-49d6-9e79-05b7f8881484
#> 70     10.1126/science.abl4896 cee11228-9f0b-4e57-afe2-cfe15ee56312
#> 72     10.1126/science.abl4896 a2d4d33e-4c62-4361-b80a-9be53d2e50e8
#> 73     10.1126/science.abl4896 a0754256-f44b-4c4a-962c-a552e47d3fdc
#> 77     10.1126/science.abl4896 6d41668c-168c-4500-b06a-4674ccf3e19d
#> 78     10.1126/science.abl4896 5e5e7a2f-8f1c-42ac-90dc-b4f80f38e84c
#>                      dataset_version_id
#> 63 71815674-a8cf-4add-95dd-c5d5d1631597
#> 64 0b29f4ce-5e72-4356-b74b-b54714979234
#> 65 bd13c169-af97-4d8f-ba45-7588808c2e48
#> 66 47615a3d-0a9f-4a78-88ef-5cce2a84637d
#> 67 ac7714f0-dce2-40ba-9912-324de6c9a77f
#> 69 c7679ec2-652d-437a-bded-3ec2344829e4
#> 70 f89fa18f-c32b-4bae-9511-1a4d18f200e1
#> 72 37ada0d2-9970-4ff2-8bcd-41e80ab6e081
#> 73 1cda78aa-f0d9-4d50-96bf-8bc309318802
#> 77 5297a910-453f-4e3f-af16-e18fd5a79090
#> 78 b783b036-c837-4290-a07d-f6b79a301f59
#>                                                                                                                               dataset_title
#> 63 Single-cell sequencing links multiregional immune landscapes and tissue-resident T cells in ccRCC to tumor topology and therapy efficacy
#> 64                                                                                                                Tabula Sapiens - Pancreas
#> 65                                                                                                          Tabula Sapiens - Salivary_Gland
#> 66                                                                                                                   Tabula Sapiens - Heart
#> 67                                                                                                                 Tabula Sapiens - Bladder
#> 69                                                                                                                Tabula Sapiens - Prostate
#> 70                                                                                                                  Tabula Sapiens - Spleen
#> 72                                                                                                             Tabula Sapiens - Vasculature
#> 73                                                                                                                     Tabula Sapiens - Eye
#> 77                                                                                                                   Tabula Sapiens - Liver
#> 78                                                                                                                     Tabula Sapiens - Fat
#>                            dataset_h5ad_path dataset_total_cell_count
#> 63 bd65a70f-b274-4133-b9dd-0d1431b6af34.h5ad                   167283
#> 64 ff45e623-7f5f-46e3-b47d-56be0341f66b.h5ad                    13497
#> 65 f01bdd17-4902-40f5-86e3-240d66dd2587.h5ad                    27199
#> 66 e6a11140-2545-46bc-929e-da243eed2cae.h5ad                    11505
#> 67 e5c63d94-593c-4338-a489-e1048599e751.h5ad                    24583
#> 69 d77ec7d6-ef2e-49d6-9e79-05b7f8881484.h5ad                    16375
#> 70 cee11228-9f0b-4e57-afe2-cfe15ee56312.h5ad                    34004
#> 72 a2d4d33e-4c62-4361-b80a-9be53d2e50e8.h5ad                    16037
#> 73 a0754256-f44b-4c4a-962c-a552e47d3fdc.h5ad                    10650
#> 77 6d41668c-168c-4500-b06a-4674ccf3e19d.h5ad                     5007
#> 78 5e5e7a2f-8f1c-42ac-90dc-b4f80f38e84c.h5ad                    20263
#>  [ reached 'max' / getOption("max.print") -- omitted 31 rows ]

Identifying all genes measured in a dataset

Finally, we can find the set of genes that were measured in the cells of a given dataset.

# Slice the dataset(s) of interest, and get the joinid(s)
dataset_joinids <- datasets_df$soma_joinid[datasets_df$collection_id == "17481d16-ee44-49e5-bcf0-28c0780d8c4a"]

# Slice the presence matrix by the first dimension, i.e., by dataset
presence_matrix_slice <- presence_matrix$take(i = dataset_joinids)$get_one_based_matrix()
genes_measured <- Matrix::colSums(presence_matrix_slice) > 0
var_joinids <- var_df$soma_joinid[genes_measured]

print(var_df[var_joinids, ])
#>    soma_joinid      feature_id feature_name feature_length      nnz n_measured_obs
#> 1            0 ENSG00000233576      HTR3C2P           1057    69370       19581263
#> 2            1 ENSG00000121410         A1BG           3999  5640476       62641311
#> 3            2 ENSG00000268895     A1BG-AS1           3374  3071864       61946057
#> 4            3 ENSG00000148584         A1CF           9603   734347       58195911
#> 5            4 ENSG00000175899          A2M           6318  7894261       62704378
#> 6            5 ENSG00000245105      A2M-AS1           2948  1637794       62086816
#> 9            8 ENSG00000184389      A3GALT2           1023   439067       53780311
#> 10           9 ENSG00000128274       A4GALT           3358  2432348       62706770
#> 12          11 ENSG00000265544         AA06            632   220755       22545140
#> 14          13 ENSG00000250420       AACSP1           3380   211588       22831831
#> 16          15 ENSG00000188984      AADACL3           4055    24626       43074608
#> 18          17 ENSG00000240602      AADACP1           2012    29491       23133490
#> 19          18 ENSG00000109576        AADAT           2970  4524608       61559099
#> 20          19 ENSG00000158122       PRXL2C           3098  5424472       55618144
#> 21          20 ENSG00000103591        AAGAB           4138 12427442       62843055
#> 22          21 ENSG00000115977         AAK1          24843 29280566       62664775
#>  [ reached 'max' / getOption("max.print") -- omitted 27195 rows ]

Close the census

After use, the census object should be closed to release memory and other resources.

census$close()

This also closes all SOMA objects accessed via the top-level census. Closing can be automated using on.exit(census$close(), add = TRUE) immediately after census <- open_soma().