
This notebook showcases the Census contents and how to obtain high-level information about it. It covers the organization of data within the Census, what cell and gene metadata are available, and it provides simple demonstrations to summarize cell counts across cell metadata.

Contents

  • Opening the Census
  • Census organization
  • Cell metadata
  • Gene metadata
  • Census summary content tables
  • Understanding Census contents beyond the summary tables

Opening the Census

The cellxgene.census R package contains a convenient open_soma() API to open any version of the Census (stable by default).

You can learn more about the cellxgene.census methods by accessing their corresponding documentation, for example ?cellxgene.census::open_soma.
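A minimal sketch of opening the Census is shown here. It assumes the cellxgene.census package is installed and that network access is available; the object name census is the one the rest of this notebook refers to.

```r
library("cellxgene.census")

# Open the default ("stable") Census release. This needs network access.
census <- open_soma()

# Equivalent call with the default release named explicitly:
# census <- open_soma(census_version = "stable")
```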

Census organization

The Census schema defines the structure of the Census. In short, you can think of the Census as a structured collection of items that stores different pieces of information. All of these items and the parent collection are SOMA objects of various types and can all be accessed with the TileDB-SOMA API (documentation).

The cellxgene.census package contains some convenient wrappers of the TileDB-SOMA API. An example of this is the function we used to open the Census: cellxgene.census::open_soma().

Main Census components

Assigning the result of open_soma() to census gives you a SOMACollection, an R6 class providing a key-value associative map. Its get() method can access the two top-level collection members, census_info and census_data, each themselves instances of SOMACollection.

Census summary info

  • census$get("census_info"): A collection of data frames providing information about the Census as a whole.
    • census$get("census_info")$get("summary"): A data frame with high-level information about this Census, e.g. build date, total cell count, etc.
    • census$get("census_info")$get("datasets"): A data frame listing all datasets from CELLxGENE Discover used to create the Census.
    • census$get("census_info")$get("summary_cell_counts"): A data frame with cell counts stratified by relevant cell metadata.

Census data

Data for each organism is stored in independent SOMAExperiment objects, which are a specialized form of a SOMACollection. Each of these stores a data matrix (cells by genes), cell metadata, gene metadata, and some other useful components not covered in this notebook.

This is how the data is organized for one organism – Homo sapiens:

  • census$get("census_data")$get("homo_sapiens")$obs: Cell metadata
  • census$get("census_data")$get("homo_sapiens")$ms$get("RNA"): Data matrices, currently only…
  • census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$X$get("raw"): a matrix of raw counts as a SOMASparseNDArray
  • census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var: Gene Metadata
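The paths listed above can be shortened by keeping intermediate handles. The following is a sketch under the assumption that the Census can be opened over the network; the variable names human, obs, rna, var and raw are illustrative and not part of the package.

```r
library("cellxgene.census")
census <- open_soma()

# Names of the top-level members of the Census collection.
census$names()

# Keep a handle on the human SOMAExperiment to avoid repeating the full path.
human <- census$get("census_data")$get("homo_sapiens")

obs <- human$obs            # cell metadata
rna <- human$ms$get("RNA")  # RNA measurement
var <- rna$var              # gene metadata
raw <- rna$X$get("raw")     # raw counts

# Inspect the SOMA types behind two of the handles.
class(obs)
class(raw)
```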

Cell metadata

You can obtain all cell metadata variables by directly querying the columns of the corresponding SOMADataFrame.

All of these variables can be used for querying the Census in case you want to work with specific cells.

census$get("census_data")$get("homo_sapiens")$obs$colnames()
#>  [1] "soma_joinid"                             
#>  [2] "dataset_id"                              
#>  [3] "assay"                                   
#>  [4] "assay_ontology_term_id"                  
#>  [5] "cell_type"                               
#>  [6] "cell_type_ontology_term_id"              
#>  [7] "development_stage"                       
#>  [8] "development_stage_ontology_term_id"      
#>  [9] "disease"                                 
#> [10] "disease_ontology_term_id"                
#> [11] "donor_id"                                
#> [12] "is_primary_data"                         
#> [13] "self_reported_ethnicity"                 
#> [14] "self_reported_ethnicity_ontology_term_id"
#> [15] "sex"                                     
#> [16] "sex_ontology_term_id"                    
#> [17] "suspension_type"                         
#> [18] "tissue"                                  
#> [19] "tissue_ontology_term_id"                 
#> [20] "tissue_general"                          
#> [21] "tissue_general_ontology_term_id"         
#> [22] "raw_sum"                                 
#> [23] "nnz"                                     
#> [24] "raw_mean_nnz"                            
#> [25] "raw_variance_nnz"                        
#> [26] "n_measured_vars"

All of these variables are defined in the CELLxGENE dataset schema except for the following:

  • soma_joinid: a SOMA-defined value used for join operations.
  • dataset_id: the dataset id as encoded in census$get("census_info")$get("datasets").
  • tissue_general and tissue_general_ontology_term_id: the high-level tissue mapping.

Gene metadata

Similarly, we can obtain all gene metadata variables by directly querying the columns of the corresponding SOMADataFrame.

These are the variables you can use for querying the Census in case there are specific genes you are interested in.

census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var$colnames()
#> [1] "soma_joinid"    "feature_id"     "feature_name"   "feature_length" "nnz"           
#> [6] "n_measured_obs"

All of these variables are defined in the CELLxGENE dataset schema except for the following:

  • soma_joinid: a SOMA-defined value used for join operations.
  • feature_length: the length in base pairs of the gene.
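As an illustration of querying by gene metadata, the sketch below reads the rows for two genes by feature_name. It follows the same read() pattern this notebook uses for cell metadata; the two gene names are arbitrary choices for the example, and network access is assumed.

```r
library("cellxgene.census")
census <- open_soma()

var <- census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var

# Read only the rows for two example genes (the gene names are arbitrary).
var_df <- var$read(
  column_names = c("soma_joinid", "feature_id", "feature_name", "feature_length"),
  value_filter = "feature_name %in% c('ACE2', 'ENPP1')"
)
var_df <- as.data.frame(var_df$concat())
var_df
```

The soma_joinid values returned here are what you would use to select the matching columns of the data matrix.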

Census summary content tables

You can take a quick look at the high-level Census information by looking at census$get("census_info")$get("summary"):

as.data.frame(census$get("census_info")$get("summary")$read()$concat())
#>   soma_joinid                      label      value
#> 1           0      census_schema_version      1.2.0
#> 2           1          census_build_date 2023-10-23
#> 3           2     dataset_schema_version      3.1.0
#> 4           3           total_cell_count   68683222
#> 5           4          unique_cell_count   40356133
#> 6           5 number_donors_homo_sapiens      15588
#> 7           6 number_donors_mus_musculus       1990

Of special interest are the label-value combinations for:

  • total_cell_count is the total number of cells in the Census.
  • unique_cell_count is the number of unique cells, as some cells may be present twice due to meta-analysis or consortia-like data.
  • number_donors_homo_sapiens and number_donors_mus_musculus are the number of individuals for human and mouse. These are not guaranteed to be unique, as the same donor ID may appear in different datasets.
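To make the relationship between the two cell counts concrete, here is a small offline sketch that re-types the two values from the summary printed above. The value column mixes version strings, dates and counts, so it is treated as text here and converted before doing arithmetic; the names summary_df, counts and duplicated_cells are illustrative.

```r
# Values copied from the summary table printed above; `value` is kept as text.
summary_df <- data.frame(
  label = c("total_cell_count", "unique_cell_count"),
  value = c("68683222", "40356133"),
  stringsAsFactors = FALSE
)

# Turn the label/value pairs into a named numeric vector for easy lookup.
counts <- setNames(as.numeric(summary_df$value), summary_df$label)

# Cells present in the Census in addition to their unique copy.
duplicated_cells <- counts[["total_cell_count"]] - counts[["unique_cell_count"]]
duplicated_cells
#> [1] 28327089
```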

Cell counts by cell metadata

By looking at census$get("census_info")$get("summary_cell_counts") you can get a general idea of cell counts stratified by some relevant cell metadata. Not all cell metadata is included in this table; the full list of cell and gene metadata is given in the “Cell metadata” and “Gene metadata” sections above.

The line below retrieves this table and casts it into an R data frame:

census_counts <- as.data.frame(census$get("census_info")$get("summary_cell_counts")$read()$concat())
head(census_counts)
#>   soma_joinid     organism category ontology_term_id unique_cell_count total_cell_count
#> 1           0 Homo sapiens      all               na          36227903         62998417
#> 2           1 Homo sapiens    assay      EFO:0008722            264166           279635
#> 3           2 Homo sapiens    assay      EFO:0008780             25652            51304
#> 4           3 Homo sapiens    assay      EFO:0008796             54753            54753
#> 5           4 Homo sapiens    assay      EFO:0008919             89477           206754
#> 6           5 Homo sapiens    assay      EFO:0008931             78750           188248
#>        label
#> 1         na
#> 2   Drop-seq
#> 3     inDrop
#> 4   MARS-seq
#> 5   Seq-Well
#> 6 Smart-seq2

For each combination of organism and cell metadata value, total_cell_count and unique_cell_count give the cell counts of that combination.

The values for each category are specified in ontology_term_id and label, which are the value’s IDs and labels, respectively.
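One way to read the two count columns together is as a ratio. The offline sketch below re-types the assay rows from the output above and computes the fraction of cells counted as unique for each assay; the data frame name census_counts_head and the unique_fraction column are illustrative additions, not part of the Census.

```r
# Assay rows of the table printed above, re-typed so the sketch runs offline.
census_counts_head <- data.frame(
  label = c("Drop-seq", "inDrop", "MARS-seq", "Seq-Well", "Smart-seq2"),
  unique_cell_count = c(264166, 25652, 54753, 89477, 78750),
  total_cell_count = c(279635, 51304, 54753, 206754, 188248)
)

# Fraction of each assay's cells that are counted as unique.
census_counts_head$unique_fraction <-
  census_counts_head$unique_cell_count / census_counts_head$total_cell_count

# Assays with the most duplication come first.
census_counts_head[order(census_counts_head$unique_fraction), ]
```

A fraction of 1 means every cell of that assay is counted once, while lower values indicate that many cells also appear in another dataset.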

Example: cell metadata included in the summary counts table

To get all the available cell metadata in the summary counts table you can do the following. Remember this is not all the cell metadata available, as some variables were omitted in the creation of this table.

t(table(census_counts$organism, census_counts$category))
#>                          
#>                           Homo sapiens Mus musculus
#>   all                                1            1
#>   assay                             20           10
#>   cell_type                        631          248
#>   disease                           72            5
#>   self_reported_ethnicity           30            1
#>   sex                                3            3
#>   suspension_type                    1            1
#>   tissue                           230           74
#>   tissue_general                    53           27

Example: cell counts for each sequencing assay in human data

To get the cell counts for each sequencing assay type in human data, you can perform the following operations:

human_assay_counts <- census_counts[census_counts$organism == "Homo sapiens" & census_counts$category == "assay", ]
human_assay_counts <- human_assay_counts[order(human_assay_counts$total_cell_count, decreasing = TRUE), ]

Example: number of microglial cells in the Census

If you have a specific term from any of the categories shown above you can directly find out the number of cells for that term.

census_counts[census_counts$label == "microglial cell", ]
#>      soma_joinid     organism  category ontology_term_id unique_cell_count
#> 72            71 Homo sapiens cell_type       CL:0000129            359243
#> 1080        1079 Mus musculus cell_type       CL:0000129             48998
#>      total_cell_count           label
#> 72             544977 microglial cell
#> 1080            75885 microglial cell

Understanding Census contents beyond the summary tables

While using the pre-computed tables in census$get("census_info") is an easy and quick way to understand the contents of the Census, it falls short if you want to learn more about certain slices of the Census.

For example, you may want to learn more about:

  • What are the cell types available for human liver?
  • What is the total number of cells in all lung datasets stratified by sequencing technology?
  • What is the sex distribution of all cells from brain in mouse?
  • What are the diseases available for T cells?

All of these questions can be answered by directly querying the cell metadata as shown in the examples below.

Example: all cell types available in human

To exemplify the process of accessing and slicing cell metadata for summary stats, let’s start with a trivial example and take a look at all human cell types available in the Census:

obs_df <- census$get("census_data")$get("homo_sapiens")$obs$read(column_names = c("cell_type", "is_primary_data"))
as.data.frame(obs_df$concat())
#>                            cell_type is_primary_data
#> 1                    oligodendrocyte           FALSE
#> 2     oligodendrocyte precursor cell           FALSE
#> 3   astrocyte of the cerebral cortex           FALSE
#> 4   astrocyte of the cerebral cortex           FALSE
#> 5   astrocyte of the cerebral cortex           FALSE
#> 6     oligodendrocyte precursor cell           FALSE
#> 7   astrocyte of the cerebral cortex           FALSE
#> 8                    microglial cell           FALSE
#> 9   astrocyte of the cerebral cortex           FALSE
#> 10  astrocyte of the cerebral cortex           FALSE
#> 11  astrocyte of the cerebral cortex           FALSE
#> 12  astrocyte of the cerebral cortex           FALSE
#> 13  astrocyte of the cerebral cortex           FALSE
#> 14  astrocyte of the cerebral cortex           FALSE
#> 15  astrocyte of the cerebral cortex           FALSE
#> 16    oligodendrocyte precursor cell           FALSE
#> 17                   oligodendrocyte           FALSE
#> 18  astrocyte of the cerebral cortex           FALSE
#> 19  astrocyte of the cerebral cortex           FALSE
#> 20  astrocyte of the cerebral cortex           FALSE
#> 21  astrocyte of the cerebral cortex           FALSE
#> 22  astrocyte of the cerebral cortex           FALSE
#> 23    oligodendrocyte precursor cell           FALSE
#> 24  astrocyte of the cerebral cortex           FALSE
#> 25  astrocyte of the cerebral cortex           FALSE
#> 26    oligodendrocyte precursor cell           FALSE
#> 27                   microglial cell           FALSE
#> 28                   oligodendrocyte           FALSE
#> 29  astrocyte of the cerebral cortex           FALSE
#> 30  cerebral cortex endothelial cell           FALSE
#> 31                   microglial cell           FALSE
#> 32                   microglial cell           FALSE
#> 33                   microglial cell           FALSE
#> 34                   oligodendrocyte           FALSE
#> 35                   oligodendrocyte           FALSE
#> 36                   microglial cell           FALSE
#> 37                   oligodendrocyte           FALSE
#> 38                   oligodendrocyte           FALSE
#> 39  astrocyte of the cerebral cortex           FALSE
#> 40                   oligodendrocyte           FALSE
#> 41  astrocyte of the cerebral cortex           FALSE
#> 42                   oligodendrocyte           FALSE
#> 43    oligodendrocyte precursor cell           FALSE
#> 44                   oligodendrocyte           FALSE
#> 45  astrocyte of the cerebral cortex           FALSE
#> 46    oligodendrocyte precursor cell           FALSE
#> 47                   oligodendrocyte           FALSE
#> 48    oligodendrocyte precursor cell           FALSE
#> 49  astrocyte of the cerebral cortex           FALSE
#> 50  astrocyte of the cerebral cortex           FALSE
#> 51  astrocyte of the cerebral cortex           FALSE
#> 52                   oligodendrocyte           FALSE
#> 53                   oligodendrocyte           FALSE
#> 54                   oligodendrocyte           FALSE
#> 55  astrocyte of the cerebral cortex           FALSE
#> 56  cerebral cortex endothelial cell           FALSE
#> 57                   oligodendrocyte           FALSE
#> 58                   oligodendrocyte           FALSE
#> 59                   oligodendrocyte           FALSE
#> 60                   microglial cell           FALSE
#> 61                   microglial cell           FALSE
#> 62    oligodendrocyte precursor cell           FALSE
#> 63    oligodendrocyte precursor cell           FALSE
#> 64                   oligodendrocyte           FALSE
#> 65    oligodendrocyte precursor cell           FALSE
#> 66                   oligodendrocyte           FALSE
#> 67  astrocyte of the cerebral cortex           FALSE
#> 68                   oligodendrocyte           FALSE
#> 69    oligodendrocyte precursor cell           FALSE
#> 70                   oligodendrocyte           FALSE
#> 71  astrocyte of the cerebral cortex           FALSE
#> 72  astrocyte of the cerebral cortex           FALSE
#> 73  astrocyte of the cerebral cortex           FALSE
#> 74    oligodendrocyte precursor cell           FALSE
#> 75  astrocyte of the cerebral cortex           FALSE
#> 76    oligodendrocyte precursor cell           FALSE
#> 77                   microglial cell           FALSE
#> 78                   microglial cell           FALSE
#> 79    oligodendrocyte precursor cell           FALSE
#> 80                   oligodendrocyte           FALSE
#> 81                   oligodendrocyte           FALSE
#> 82  astrocyte of the cerebral cortex           FALSE
#> 83                   oligodendrocyte           FALSE
#> 84  astrocyte of the cerebral cortex           FALSE
#> 85  astrocyte of the cerebral cortex           FALSE
#> 86                   oligodendrocyte           FALSE
#> 87  astrocyte of the cerebral cortex           FALSE
#> 88                   oligodendrocyte           FALSE
#> 89    oligodendrocyte precursor cell           FALSE
#> 90    oligodendrocyte precursor cell           FALSE
#> 91  astrocyte of the cerebral cortex           FALSE
#> 92  astrocyte of the cerebral cortex           FALSE
#> 93  astrocyte of the cerebral cortex           FALSE
#> 94                   oligodendrocyte           FALSE
#> 95  astrocyte of the cerebral cortex           FALSE
#> 96  astrocyte of the cerebral cortex           FALSE
#> 97                   oligodendrocyte           FALSE
#> 98                   oligodendrocyte           FALSE
#> 99    oligodendrocyte precursor cell           FALSE
#> 100                  oligodendrocyte           FALSE
#> 101                  oligodendrocyte           FALSE
#> 102                  oligodendrocyte           FALSE
#> 103 astrocyte of the cerebral cortex           FALSE
#> 104   oligodendrocyte precursor cell           FALSE
#> 105                  oligodendrocyte           FALSE
#> 106   oligodendrocyte precursor cell           FALSE
#> 107                  oligodendrocyte           FALSE
#> 108                  oligodendrocyte           FALSE
#> 109                  oligodendrocyte           FALSE
#> 110                  oligodendrocyte           FALSE
#> 111   oligodendrocyte precursor cell           FALSE
#> 112                  oligodendrocyte           FALSE
#> 113                  oligodendrocyte           FALSE
#> 114 astrocyte of the cerebral cortex           FALSE
#> 115                  oligodendrocyte           FALSE
#> 116 astrocyte of the cerebral cortex           FALSE
#> 117                  oligodendrocyte           FALSE
#> 118                  oligodendrocyte           FALSE
#> 119                  oligodendrocyte           FALSE
#> 120 astrocyte of the cerebral cortex           FALSE
#> 121 astrocyte of the cerebral cortex           FALSE
#> 122   oligodendrocyte precursor cell           FALSE
#> 123                  microglial cell           FALSE
#> 124 astrocyte of the cerebral cortex           FALSE
#> 125 astrocyte of the cerebral cortex           FALSE
#> 126                  microglial cell           FALSE
#> 127 cerebral cortex endothelial cell           FALSE
#> 128   oligodendrocyte precursor cell           FALSE
#>  [ reached 'max' / getOption("max.print") -- omitted 62998289 rows ]

The number of rows is the total number of human cells. To get the cell counts per cell type, you can work with this data frame.

In addition, we will only focus on cells that are marked with is_primary_data=TRUE as this ensures we de-duplicate cells that appear more than once in CELLxGENE Discover.

obs_df <- census$get("census_data")$get("homo_sapiens")$obs$read(
  column_names = "cell_type",
  value_filter = "is_primary_data == TRUE"
)

obs_df <- as.data.frame(obs_df$concat())
nrow(obs_df)
#> [1] 36227903

This is the number of unique cells. Now let’s look at the counts per cell type:

human_cell_type_counts <- table(obs_df$cell_type)
sort(human_cell_type_counts, decreasing = TRUE)[1:10]
#> 
#>                                                             neuron 
#>                                                            2815336 
#>                                               glutamatergic neuron 
#>                                                            1563446 
#>                                    CD4-positive, alpha-beta T cell 
#>                                                            1243885 
#>                                    CD8-positive, alpha-beta T cell 
#>                                                            1197715 
#> L2/3-6 intratelencephalic projecting glutamatergic cortical neuron 
#>                                                            1123360 
#>                                                    oligodendrocyte 
#>                                                            1063874 
#>                                                 classical monocyte 
#>                                                            1030996 
#>                                                        native cell 
#>                                                            1011949 
#>                                                             B cell 
#>                                                             934060 
#>                                                natural killer cell 
#>                                                             770637

This shows you that the most abundant cell types are “neuron”, “glutamatergic neuron”, and “CD4-positive, alpha-beta T cell”.

Now let’s take a look at the number of unique cell types:

length(human_cell_type_counts)
#> [1] 610

That is the total number of different cell types for human.

All the information in this example can be quickly obtained from the summary table at census$get("census_info")$get("summary_cell_counts").
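A sketch of that shortcut follows. It assumes network access and reuses the column names shown in the summary counts table earlier; the names is_human_cell_type and human_cell_types are illustrative.

```r
library("cellxgene.census")
census <- open_soma()

census_counts <- as.data.frame(
  census$get("census_info")$get("summary_cell_counts")$read()$concat()
)

# Keep the human cell_type rows and rank them by number of unique cells.
is_human_cell_type <- census_counts$organism == "Homo sapiens" &
  census_counts$category == "cell_type"
human_cell_types <- census_counts[is_human_cell_type, ]
human_cell_types <- human_cell_types[
  order(human_cell_types$unique_cell_count, decreasing = TRUE),
]

head(human_cell_types[, c("label", "unique_cell_count")], 10)
```

Note that the summary counts table earlier in this notebook lists 631 human cell_type rows, while the primary-data query above found 610 cell types, so the two approaches need not agree on the number of rows.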

The examples below are more complex and can only be achieved by accessing the cell metadata.

Example: cell types available in human liver

Similar to the example above, we can learn what cell types are available for a specific tissue, e.g. liver.

To achieve this goal we just need to limit our cell metadata to that tissue. We will use the information in the cell metadata variable tissue_general. This variable contains the high-level tissue label for all cells in the Census:

obs_liver_df <- census$get("census_data")$get("homo_sapiens")$obs$read(
  column_names = "cell_type",
  value_filter = "is_primary_data == TRUE & tissue_general == 'liver'"
)

obs_liver_df <- as.data.frame(obs_liver_df$concat())

sort(table(obs_liver_df$cell_type), decreasing = TRUE)[1:10]
#> 
#>                          T cell                     hepatoblast 
#>                           85739                           58447 
#>                 neoplastic cell                    erythroblast 
#>                           52431                           45605 
#>                        monocyte                      hepatocyte 
#>                           31388                           28309 
#>             natural killer cell    periportal region hepatocyte 
#>                           26871                           23509 
#>                      macrophage centrilobular region hepatocyte 
#>                           16707                           15819

These are the cell types and their cell counts in the human liver.

Example: diseased T cells in human tissues

In this example we are going to get the counts for all diseased cells annotated as T cells. For the sake of the example we will focus on “CD8-positive, alpha-beta T cell” and “CD4-positive, alpha-beta T cell”:

obs_t_cells_df <- census$get("census_data")$get("homo_sapiens")$obs$read(
  column_names = c("disease", "tissue_general"),
  value_filter = "is_primary_data == TRUE & disease != 'normal' & cell_type %in% c('CD8-positive, alpha-beta T cell', 'CD4-positive, alpha-beta T cell')"
)

obs_t_cells_df <- as.data.frame(obs_t_cells_df$concat())

print(table(obs_t_cells_df))
#>                                        tissue_general
#> disease                                 adrenal gland  blood bone marrow  brain breast
#>   COVID-19                                          0 819428           0      0      0
#>   Crohn disease                                     0      0           0      0      0
#>   Down syndrome                                     0      0         181      0      0
#>   breast cancer                                     0      0           0      0   1850
#>   chronic obstructive pulmonary disease             0      0           0      0      0
#>   chronic rhinitis                                  0      0           0      0      0
#>   clear cell renal carcinoma                        0   6548           0      0      0
#>   cystic fibrosis                                   0      0           0      0      0
#>   follicular lymphoma                               0      0           0      0      0
#>   influenza                                         0   8871           0      0      0
#>   interstitial lung disease                         0      0           0      0      0
#>   kidney benign neoplasm                            0      0           0      0      0
#>   kidney oncocytoma                                 0      0           0      0      0
#>   lung adenocarcinoma                             205      0           0   3274      0
#>   lung large cell carcinoma                         0      0           0      0      0
#>   lymphangioleiomyomatosis                          0      0           0      0      0
#>                                        tissue_general
#> disease                                  colon kidney  liver   lung lymph node   nose
#>   COVID-19                                   0      0      0  30578          0     13
#>   Crohn disease                          17490      0      0      0          0      0
#>   Down syndrome                              0      0      0      0          0      0
#>   breast cancer                              0      0      0      0          0      0
#>   chronic obstructive pulmonary disease      0      0      0   9382          0      0
#>   chronic rhinitis                           0      0      0      0          0    909
#>   clear cell renal carcinoma                 0  20540      0      0         36      0
#>   cystic fibrosis                            0      0      0      7          0      0
#>   follicular lymphoma                        0      0      0      0       1089      0
#>   influenza                                  0      0      0      0          0      0
#>   interstitial lung disease                  0      0      0   1803          0      0
#>   kidney benign neoplasm                     0     10      0      0          0      0
#>   kidney oncocytoma                          0   2303      0      0          0      0
#>   lung adenocarcinoma                        0      0    507 215013      24969      0
#>   lung large cell carcinoma                  0      0      0   5922          0      0
#>   lymphangioleiomyomatosis                   0      0      0    513          0      0
#>                                        tissue_general
#> disease                                 pleural fluid respiratory system saliva
#>   COVID-19                                          0                  4     41
#>   Crohn disease                                     0                  0      0
#>   Down syndrome                                     0                  0      0
#>   breast cancer                                     0                  0      0
#>   chronic obstructive pulmonary disease             0                  0      0
#>   chronic rhinitis                                  0                  0      0
#>   clear cell renal carcinoma                        0                  0      0
#>   cystic fibrosis                                   0                  0      0
#>   follicular lymphoma                               0                  0      0
#>   influenza                                         0                  0      0
#>   interstitial lung disease                         0                  0      0
#>   kidney benign neoplasm                            0                  0      0
#>   kidney oncocytoma                                 0                  0      0
#>   lung adenocarcinoma                           11558                  0      0
#>   lung large cell carcinoma                         0                  0      0
#>   lymphangioleiomyomatosis                          0                  0      0
#>                                        tissue_general
#> disease                                 small intestine vasculature
#>   COVID-19                                            0           0
#>   Crohn disease                                   52029           0
#>   Down syndrome                                       0           0
#>   breast cancer                                       0           0
#>   chronic obstructive pulmonary disease               0           0
#>   chronic rhinitis                                    0           0
#>   clear cell renal carcinoma                          0           0
#>   cystic fibrosis                                     0           0
#>   follicular lymphoma                                 0           0
#>   influenza                                           0           0
#>   interstitial lung disease                           0           0
#>   kidney benign neoplasm                              0           0
#>   kidney oncocytoma                                   0           0
#>   lung adenocarcinoma                                 0           0
#>   lung large cell carcinoma                           0           0
#>   lymphangioleiomyomatosis                            0           0
#>  [ reached getOption("max.print") -- omitted 8 rows ]

These are the cell counts annotated with the indicated disease across human tissues for “CD8-positive, alpha-beta T cell” or “CD4-positive, alpha-beta T cell”.
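The same pattern extends to the other questions listed at the start of this section. As a hedged sketch, the sex distribution of mouse brain cells could be obtained as follows; it assumes network access, that the mouse data is stored under the key "mus_musculus", and that 'brain' is a tissue_general value for mouse.

```r
library("cellxgene.census")
census <- open_soma()

# Unique mouse cells whose high-level tissue is brain; only the sex column is read.
obs_brain_df <- census$get("census_data")$get("mus_musculus")$obs$read(
  column_names = "sex",
  value_filter = "is_primary_data == TRUE & tissue_general == 'brain'"
)
obs_brain_df <- as.data.frame(obs_brain_df$concat())

table(obs_brain_df$sex)

# Release the handle once you are done querying.
census$close()
```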