Skip to contents

This tutorial showcases the easiest ways to query the expression data and cell/gene metadata from the Census, and load them into R data frames, Seurat assays, and SingleCellExperiment objects.

Contents

  1. Opening the census.
  2. Querying cell metadata (obs).
  3. Querying gene metadata (var).
  4. Querying expression data as Seurat.
  5. Querying expression data as SingleCellExperiment.

Opening the census

The cellxgene.census R package contains a convenient API to open any version of the Census (by default, the newest stable version).

You can learn more about the cellxgene.census methods by accessing their corresponding documentation, for example ?cellxgene.census::open_soma.

Querying cell metadata (obs)

The human gene metadata of the Census, for RNA assays, is located at census$get("census_data")$get("homo_sapiens")$obs. This is a SOMADataFrame and as such it can be materialized as an R data frame (tibble) using as.data.frame(obs$read()$concat()).

The mouse cell metadata is at census$get("census_data")$get("mus_musculus").obs.

For slicing the cell metadata there are two relevant arguments that can be passed through read():

  • column_names — character vector indicating what metadata columns to fetch.
  • value_filter — R expression with selection conditions to fetch rows.
    • Expressions are one or more comparisons
    • Comparisons are one of <column> <op> <value> or <column> <op> <column>
    • Expressions can combine comparisons using && or ||
    • op is one of < | > | <= | >= | == | != or %in%

To learn what metadata columns are available for fetching and filtering we can directly look at the keys of the cell metadata.

census$get("census_data")$get("homo_sapiens")$obs$colnames()
#>  [1] "soma_joinid"                             
#>  [2] "dataset_id"                              
#>  [3] "assay"                                   
#>  [4] "assay_ontology_term_id"                  
#>  [5] "cell_type"                               
#>  [6] "cell_type_ontology_term_id"              
#>  [7] "development_stage"                       
#>  [8] "development_stage_ontology_term_id"      
#>  [9] "disease"                                 
#> [10] "disease_ontology_term_id"                
#> [11] "donor_id"                                
#> [12] "is_primary_data"                         
#> [13] "self_reported_ethnicity"                 
#> [14] "self_reported_ethnicity_ontology_term_id"
#> [15] "sex"                                     
#> [16] "sex_ontology_term_id"                    
#> [17] "suspension_type"                         
#> [18] "tissue"                                  
#> [19] "tissue_ontology_term_id"                 
#> [20] "tissue_general"                          
#> [21] "tissue_general_ontology_term_id"         
#> [22] "raw_sum"                                 
#> [23] "nnz"                                     
#> [24] "raw_mean_nnz"                            
#> [25] "raw_variance_nnz"                        
#> [26] "n_measured_vars"

soma_joinid is a special SOMADataFrame column that is used for join operations. The definition for all other columns can be found at the Census schema.

All of these can be used to fetch specific columns or specific rows matching a condition. For the latter we need to know the values we are looking for a priori.

For example let’s see what are the possible values available for sex. To this we can load all cell metadata but fetching only for the column sex.

unique(as.data.frame(census$get("census_data")$get("homo_sapiens")$obs$read(column_names = "sex")$concat()))
#>             sex
#> 1          male
#> 224      female
#> 3747640 unknown

As you can see there are only three different values for sex, that is "male", "female" and "unknown".

With this information we can fetch all cell metatadata for a specific sex value, for example "unknown".

as.data.frame(census$get("census_data")$get("homo_sapiens")$obs$read(value_filter = "sex == 'unknown'")$concat())
#>   soma_joinid                           dataset_id     assay assay_ontology_term_id
#> 1     3747639 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2            EFO:0009899
#> 2     3747640 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2            EFO:0009899
#> 3     3747641 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2            EFO:0009899
#> 4     3747642 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2            EFO:0009899
#> 5     3747643 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2            EFO:0009899
#> 6     3747644 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2            EFO:0009899
#> 7     3747645 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2            EFO:0009899
#> 8     3747646 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2            EFO:0009899
#> 9     3747647 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2            EFO:0009899
#>    cell_type cell_type_ontology_term_id development_stage
#> 1 fibroblast                 CL:0000057 human adult stage
#> 2 fibroblast                 CL:0000057 human adult stage
#> 3 fibroblast                 CL:0000057 human adult stage
#> 4 fibroblast                 CL:0000057 human adult stage
#> 5 fibroblast                 CL:0000057 human adult stage
#> 6 fibroblast                 CL:0000057 human adult stage
#> 7 fibroblast                 CL:0000057 human adult stage
#> 8 fibroblast                 CL:0000057 human adult stage
#> 9 fibroblast                 CL:0000057 human adult stage
#>   development_stage_ontology_term_id disease disease_ontology_term_id
#> 1                     HsapDv:0000087  normal             PATO:0000461
#> 2                     HsapDv:0000087  normal             PATO:0000461
#> 3                     HsapDv:0000087  normal             PATO:0000461
#> 4                     HsapDv:0000087  normal             PATO:0000461
#> 5                     HsapDv:0000087  normal             PATO:0000461
#> 6                     HsapDv:0000087  normal             PATO:0000461
#> 7                     HsapDv:0000087  normal             PATO:0000461
#> 8                     HsapDv:0000087  normal             PATO:0000461
#> 9                     HsapDv:0000087  normal             PATO:0000461
#>                       donor_id is_primary_data self_reported_ethnicity
#> 1 Pagella_GSE161267_GSM4904134            TRUE                 unknown
#> 2 Pagella_GSE161267_GSM4904134            TRUE                 unknown
#> 3 Pagella_GSE161267_GSM4904134            TRUE                 unknown
#> 4 Pagella_GSE161267_GSM4904134            TRUE                 unknown
#> 5 Pagella_GSE161267_GSM4904134            TRUE                 unknown
#> 6 Pagella_GSE161267_GSM4904134            TRUE                 unknown
#> 7 Pagella_GSE161267_GSM4904134            TRUE                 unknown
#> 8 Pagella_GSE161267_GSM4904134            TRUE                 unknown
#> 9 Pagella_GSE161267_GSM4904134            TRUE                 unknown
#>   self_reported_ethnicity_ontology_term_id     sex sex_ontology_term_id suspension_type
#> 1                                  unknown unknown              unknown            cell
#> 2                                  unknown unknown              unknown            cell
#> 3                                  unknown unknown              unknown            cell
#> 4                                  unknown unknown              unknown            cell
#> 5                                  unknown unknown              unknown            cell
#> 6                                  unknown unknown              unknown            cell
#> 7                                  unknown unknown              unknown            cell
#> 8                                  unknown unknown              unknown            cell
#> 9                                  unknown unknown              unknown            cell
#>    tissue tissue_ontology_term_id tissue_general tissue_general_ontology_term_id
#> 1 gingiva          UBERON:0001828         mucosa                  UBERON:0000344
#> 2 gingiva          UBERON:0001828         mucosa                  UBERON:0000344
#> 3 gingiva          UBERON:0001828         mucosa                  UBERON:0000344
#> 4 gingiva          UBERON:0001828         mucosa                  UBERON:0000344
#> 5 gingiva          UBERON:0001828         mucosa                  UBERON:0000344
#> 6 gingiva          UBERON:0001828         mucosa                  UBERON:0000344
#> 7 gingiva          UBERON:0001828         mucosa                  UBERON:0000344
#> 8 gingiva          UBERON:0001828         mucosa                  UBERON:0000344
#> 9 gingiva          UBERON:0001828         mucosa                  UBERON:0000344
#>   raw_sum  nnz raw_mean_nnz raw_variance_nnz n_measured_vars
#> 1     547  329     1.662614        14.559604           31602
#> 2     982  563     1.744227         5.315247           31602
#> 3   12467 3809     3.273038       109.305683           31602
#> 4    1053  566     1.860424         7.430042           31602
#> 5     548  363     1.509642         2.410818           31602
#> 6     678  429     1.580420        11.379616           31602
#> 7     848  524     1.618321         9.437216           31602
#> 8     935  608     1.537829         4.868418           31602
#> 9     735  485     1.515464         6.213087           31602
#>  [ reached 'max' / getOption("max.print") -- omitted 3301779 rows ]

You can use both column_names and value_filter to perform specific queries. For example let’s fetch the disease column for the cell_type "B cell" in the tissue_general "lung".

cell_metadata_b_cell <- census$get("census_data")$get("homo_sapiens")$obs$read(
  value_filter = "cell_type == 'B cell' & tissue_general == 'lung'",
  column_names = "disease"
)

cell_metadata_b_cell <- as.data.frame(cell_metadata_b_cell$concat())

table(cell_metadata_b_cell)
#> disease
#>                              COVID-19 chronic obstructive pulmonary disease 
#>                                  2729                                  6369 
#>          hypersensitivity pneumonitis             interstitial lung disease 
#>                                    52                                   376 
#>                   lung adenocarcinoma             lung large cell carcinoma 
#>                                 62351                                  1534 
#>              lymphangioleiomyomatosis         non-small cell lung carcinoma 
#>                                   133                                 17484 
#>   non-specific interstitial pneumonia                                normal 
#>                                   231                                 25461 
#>                 pleomorphic carcinoma                             pneumonia 
#>                                  1210                                    50 
#>                   pulmonary emphysema                    pulmonary fibrosis 
#>                                  1512                                  6798 
#>                 pulmonary sarcoidosis             small cell lung carcinoma 
#>                                     6                                   583 
#>          squamous cell lung carcinoma 
#>                                 11920

Querying gene metadata (var)

The human gene metadata of the Census is located at census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var. Similarly to the cell metadata, it is a SOMADataFrame and thus we can also use its method read().

The mouse gene metadata is at census$get("census_data")$get("mus_musculus")$ms$get("RNA")$var.

Let’s take a look at the metadata available for column selection and row filtering.

census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var$colnames()
#> [1] "soma_joinid"    "feature_id"     "feature_name"   "feature_length" "nnz"           
#> [6] "n_measured_obs"

With the exception of soma_joinid these columns are defined in the Census schema. Similarly to the cell metadata, we can use the same operations to learn and fetch gene metadata.

For example, to get the feature_name and feature_length of the genes "ENSG00000161798" and "ENSG00000188229" we can do the following.

var_df <- census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var$read(
  value_filter = "feature_id %in% c('ENSG00000161798', 'ENSG00000188229')",
  column_names = c("feature_name", "feature_length")
)

as.data.frame(var_df$concat())
#>   feature_name feature_length
#> 1         AQP5           1884
#> 2       TUBB4B           2037

Querying expression data as Seurat

A convenient way to query and fetch expression data is to use the get_seurat method of the cellxgene.census API. This is a method that combines the column selection and value filtering we described above to obtain slices of the expression data based on metadata queries.

The method will return a Seurat object, it takes as an input a census object, the string for an organism, and for both cell and gene metadata we can specify filters and column selection as described above but with the following arguments:

  • obs_column_names — character vector indicating the columns to select for cell metadata.
  • obs_value_filter — expression with selection conditions to fetch cells meeting a criteria.
  • var_column_names — character vector indicating the columns to select for gene metadata.
  • var_value_filter — expression with selection conditions to fetch genes meeting a criteria.

For example if we want to fetch the expression data for:

  • Genes "ENSG00000161798" and "ENSG00000188229".
  • All "B cells" of "lung" with "COVID-19".
  • With all gene metadata and adding sex cell metadata.
library("Seurat")

seurat_obj <- get_seurat(
  census, "Homo sapiens",
  obs_column_names = c("cell_type", "tissue_general", "disease", "sex"),
  var_value_filter = "feature_id %in% c('ENSG00000161798', 'ENSG00000188229')",
  obs_value_filter = "cell_type == 'B cell' & tissue_general == 'lung' & disease == 'COVID-19'"
)
seurat_obj
#> An object of class Seurat 
#> 2 features across 2729 samples within 1 assay 
#> Active assay: RNA (2 features, 0 variable features)
#>  2 layers present: counts, data
head(seurat_obj[[]])
#>                 orig.ident nCount_RNA nFeature_RNA cell_type tissue_general  disease
#> cell13391229 SeuratProject          0            0    B cell           lung COVID-19
#> cell13393737 SeuratProject          1            1    B cell           lung COVID-19
#> cell13394391 SeuratProject          0            0    B cell           lung COVID-19
#> cell13394897 SeuratProject          0            0    B cell           lung COVID-19
#> cell13395941 SeuratProject          0            0    B cell           lung COVID-19
#> cell13397408 SeuratProject          0            0    B cell           lung COVID-19
#>                  sex
#> cell13391229    male
#> cell13393737 unknown
#> cell13394391    male
#> cell13394897 unknown
#> cell13395941    male
#> cell13397408 unknown
head(seurat_obj$RNA[[]])
#>                 feature_name feature_length      nnz n_measured_obs
#> ENSG00000161798         AQP5           1884  1029069       58250439
#> ENSG00000188229       TUBB4B           2037 21416107       62655002

For a full description refer to ?cellxgene.census::get_seurat.

Querying expression data as SingleCellExperiment

Similarly to the previous section, there is a get_single_cell_experiment method in the cellxgene.census API. It behaves exactly the same as get_seurat but it returns a SingleCellExperiment object.

For example, to repeat the same query we can simply do the following.

library("SingleCellExperiment")

sce_obj <- get_single_cell_experiment(
  census, "Homo sapiens",
  obs_column_names = c("cell_type", "tissue_general", "disease", "sex"),
  var_value_filter = "feature_id %in% c('ENSG00000161798', 'ENSG00000188229')",
  obs_value_filter = "cell_type == 'B cell' & tissue_general == 'lung' & disease == 'COVID-19'"
)
sce_obj
#> class: SingleCellExperiment 
#> dim: 2 2729 
#> metadata(0):
#> assays(1): counts
#> rownames(2): ENSG00000161798 ENSG00000188229
#> rowData names(4): feature_name feature_length nnz n_measured_obs
#> colnames(2729): obs13391229 obs13393737 ... obs54635684 obs54635708
#> colData names(4): cell_type tissue_general disease sex
#> reducedDimNames(0):
#> mainExpName: RNA
#> altExpNames(0):
head(colData(sce_obj))
#> DataFrame with 6 rows and 4 columns
#>               cell_type tissue_general     disease         sex
#>             <character>    <character> <character> <character>
#> obs13391229      B cell           lung    COVID-19        male
#> obs13393737      B cell           lung    COVID-19     unknown
#> obs13394391      B cell           lung    COVID-19        male
#> obs13394897      B cell           lung    COVID-19     unknown
#> obs13395941      B cell           lung    COVID-19        male
#> obs13397408      B cell           lung    COVID-19     unknown
head(rowData(sce_obj))
#> DataFrame with 2 rows and 4 columns
#>                 feature_name feature_length       nnz n_measured_obs
#>                  <character>      <integer> <integer>      <integer>
#> ENSG00000161798         AQP5           1884   1029069       58250439
#> ENSG00000188229       TUBB4B           2037  21416107       62655002

For a full description refer to ?cellxgene.census::get_single_cell_experiment.

Close the census

After use, the census object should be closed to release memory and other resources.

census$close()

This also closes all SOMA objects accessed via the top-level census. Closing can be automated using on.exit(census$close(), add = TRUE) immediately after census <- open_soma().