Querying and fetching the single-cell data and cell/gene metadata

This tutorial shows the easiest ways to query the expression data and cell/gene metadata from the Census, and to load them into R data frames, Seurat objects, and SingleCellExperiment objects.

Contents

Opening the census.
Querying cell metadata (obs).
Querying gene metadata (var).
Querying expression data as Seurat.
Querying expression data as SingleCellExperiment.
Closing the census.

Opening the census

The cellxgene.census R package contains a convenient API to open any version of the Census (by default, the newest stable version).

library("cellxgene.census")
census <- open_soma()

You can learn more about the cellxgene.census methods by accessing their corresponding documentation, for example ?cellxgene.census::open_soma.
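As a minimal sketch of opening something other than the default, the snippet below lists the available releases and then passes a version explicitly. It assumes the package's get_census_version_directory() helper and the census_version argument of open_soma(); the object name census_stable is only illustrative.

```r
library("cellxgene.census")

# List the Census releases known to the package.
versions <- get_census_version_directory()
head(versions)

# Open a release by name instead of relying on the default.
# "stable" is the default alias; a release name from the directory can be used instead.
census_stable <- open_soma(census_version = "stable")

# Close this extra handle once it is no longer needed.
census_stable$close()
```

Pinning a specific release name rather than an alias keeps results reproducible when a newer Census is published.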

Querying cell metadata (obs)

The human cell metadata of the Census, for RNA assays, is located at census$get("census_data")$get("homo_sapiens")$obs. This is a SOMADataFrame and as such it can be materialized as an R data frame (tibble) using as.data.frame(obs$read()$concat()), where obs refers to that SOMADataFrame.

The mouse cell metadata is at census$get("census_data")$get("mus_musculus")$obs.

To slice the cell metadata, two arguments can be passed to read():

column_names — character vector indicating what metadata columns to fetch.
value_filter — R expression with selection conditions to fetch rows.
- Expressions are one or more comparisons
- Comparisons are one of <column> <op> <value> or <column> <op> <column>
- Expressions can combine comparisons using && or || (the examples in this tutorial use the single-character form &)
- op is one of < | > | <= | >= | == | != or %in%
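The filter grammar above can be sketched with a query that combines ==, %in% and != in one expression. This assumes the census object opened earlier; the column names and values (cell_type, tissue_general, sex) are the ones that appear in the outputs later in this tutorial, and the object names obs and b_cells are only illustrative.

```r
# Assumes `census` was opened with open_soma() as shown earlier.
obs <- census$get("census_data")$get("homo_sapiens")$obs

# Three comparisons joined with `&`, the form used by the other examples in this tutorial.
b_cells <- obs$read(
  value_filter = "cell_type == 'B cell' & tissue_general %in% c('lung', 'stomach') & sex != 'unknown'",
  column_names = c("cell_type", "tissue_general", "sex")
)

# Materialize the result and look at the first rows only.
b_cells <- as.data.frame(b_cells$concat())
head(b_cells)
```

Note the quoting: the whole expression is a double-quoted R string, and the values inside it are single-quoted.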

To learn which metadata columns are available for fetching and filtering, we can look directly at the column names of the cell metadata.

census$get("census_data")$get("homo_sapiens")$obs$colnames()
#>  [1] "soma_joinid"                             
#>  [2] "dataset_id"                              
#>  [3] "assay"                                   
#>  [4] "assay_ontology_term_id"                  
#>  [5] "cell_type"                               
#>  [6] "cell_type_ontology_term_id"              
#>  [7] "development_stage"                       
#>  [8] "development_stage_ontology_term_id"      
#>  [9] "disease"                                 
#> [10] "disease_ontology_term_id"                
#> [11] "donor_id"                                
#> [12] "is_primary_data"                         
#> [13] "observation_joinid"                      
#> [14] "self_reported_ethnicity"                 
#> [15] "self_reported_ethnicity_ontology_term_id"
#> [16] "sex"                                     
#> [17] "sex_ontology_term_id"                    
#> [18] "suspension_type"                         
#> [19] "tissue"                                  
#> [20] "tissue_ontology_term_id"                 
#> [21] "tissue_type"                             
#> [22] "tissue_general"                          
#> [23] "tissue_general_ontology_term_id"         
#> [24] "raw_sum"                                 
#> [25] "nnz"                                     
#> [26] "raw_mean_nnz"                            
#> [27] "raw_variance_nnz"                        
#> [28] "n_measured_vars"

soma_joinid is a special SOMADataFrame column that is used for join operations. The definition for all other columns can be found at the Census schema.

All of these can be used to fetch specific columns or specific rows matching a condition. For the latter we need to know the values we are looking for a priori.

For example, let’s see which values are available for sex. To do this, we can read the cell metadata for all cells but fetch only the sex column.

unique(as.data.frame(census$get("census_data")$get("homo_sapiens")$obs$read(column_names = "sex")$concat()))
#>           sex
#> 1      female
#> 86       male
#> 63810 unknown

As you can see, sex takes only three distinct values: "female", "male" and "unknown".
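If the number of cells per value is also of interest, the same single-column read can be tabulated instead of deduplicated. This is a small sketch assuming the census object opened earlier; sex_df is an illustrative name.

```r
# Assumes `census` was opened with open_soma() as shown earlier.
sex_df <- as.data.frame(
  census$get("census_data")$get("homo_sapiens")$obs$read(column_names = "sex")$concat()
)

# Count how many cells carry each value of sex.
table(sex_df$sex)
```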

With this information we can fetch all cell metadata for a specific sex value, for example "unknown".

as.data.frame(census$get("census_data")$get("homo_sapiens")$obs$read(value_filter = "sex == 'unknown'")$concat())
#>   soma_joinid                           dataset_id     assay assay_ontology_term_id
#> 1       63809 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2            EFO:0009899
#> 2       63825 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2            EFO:0009899
#> 3       63829 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2            EFO:0009899
#> 4       63842 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2            EFO:0009899
#> 5       63845 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2            EFO:0009899
#> 6       63848 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2            EFO:0009899
#> 7       63850 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2            EFO:0009899
#> 8       63859 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2            EFO:0009899
#> 9       63877 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2            EFO:0009899
#>        cell_type cell_type_ontology_term_id development_stage
#> 1 dendritic cell                 CL:0000451           unknown
#> 2       monocyte                 CL:0000576           unknown
#> 3       monocyte                 CL:0000576           unknown
#> 4      mast cell                 CL:0000097           unknown
#> 5       monocyte                 CL:0000576           unknown
#> 6       monocyte                 CL:0000576           unknown
#> 7       monocyte                 CL:0000576           unknown
#> 8       monocyte                 CL:0000576           unknown
#> 9       monocyte                 CL:0000576           unknown
#>   development_stage_ontology_term_id disease disease_ontology_term_id donor_id
#> 1                            unknown  normal             PATO:0000461    P6709
#> 2                            unknown  normal             PATO:0000461    P6207
#> 3                            unknown  normal             PATO:0000461    P6709
#> 4                            unknown  normal             PATO:0000461    P5846
#> 5                            unknown  normal             PATO:0000461    P6709
#> 6                            unknown  normal             PATO:0000461    P6709
#> 7                            unknown  normal             PATO:0000461    P6709
#> 8                            unknown  normal             PATO:0000461    P6709
#> 9                            unknown  normal             PATO:0000461    P5846
#>   is_primary_data observation_joinid self_reported_ethnicity
#> 1           FALSE         C?ICmL<&>Z                 unknown
#> 2           FALSE         85i%LjfFIv                 unknown
#> 3           FALSE         Ayapye-s;W                 unknown
#> 4           FALSE         y?m!gJTm_}                 unknown
#> 5           FALSE         ;%L|gB$19h                 unknown
#> 6           FALSE         W6Gv)*|fO&                 unknown
#> 7           FALSE         C_4WKgszOh                 unknown
#> 8           FALSE         z`}uhAE2vd                 unknown
#> 9           FALSE         vt{WkZp)ha                 unknown
#>   self_reported_ethnicity_ontology_term_id     sex sex_ontology_term_id suspension_type
#> 1                                  unknown unknown              unknown            cell
#> 2                                  unknown unknown              unknown            cell
#> 3                                  unknown unknown              unknown            cell
#> 4                                  unknown unknown              unknown            cell
#> 5                                  unknown unknown              unknown            cell
#> 6                                  unknown unknown              unknown            cell
#> 7                                  unknown unknown              unknown            cell
#> 8                                  unknown unknown              unknown            cell
#> 9                                  unknown unknown              unknown            cell
#>            tissue tissue_ontology_term_id tissue_type tissue_general
#> 1 body of stomach          UBERON:0001161      tissue        stomach
#> 2 body of stomach          UBERON:0001161      tissue        stomach
#> 3 body of stomach          UBERON:0001161      tissue        stomach
#> 4 body of stomach          UBERON:0001161      tissue        stomach
#> 5 body of stomach          UBERON:0001161      tissue        stomach
#> 6 body of stomach          UBERON:0001161      tissue        stomach
#> 7 body of stomach          UBERON:0001161      tissue        stomach
#> 8 body of stomach          UBERON:0001161      tissue        stomach
#> 9 body of stomach          UBERON:0001161      tissue        stomach
#>   tissue_general_ontology_term_id raw_sum  nnz raw_mean_nnz raw_variance_nnz
#> 1                  UBERON:0000945     695  368     1.888587         12.14287
#> 2                  UBERON:0000945    6095 1427     4.271198        124.79807
#> 3                  UBERON:0000945    1045  492     2.123984         23.31861
#> 4                  UBERON:0000945    1546  640     2.415625         27.82386
#> 5                  UBERON:0000945    1308  530     2.467925         59.81466
#> 6                  UBERON:0000945     891  434     2.052995         16.80319
#> 7                  UBERON:0000945     847  399     2.122807         23.51503
#> 8                  UBERON:0000945     445  216     2.060185         55.25683
#> 9                  UBERON:0000945    1672  668     2.502994         15.53672
#>   n_measured_vars
#> 1           19550
#> 2           19550
#> 3           19550
#> 4           19550
#> 5           19550
#> 6           19550
#> 7           19550
#> 8           19550
#> 9           19550
#>  [ reached 'max' / getOption("max.print") -- omitted 3756271 rows ]
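Printing every column for millions of rows floods the console, as the truncated output above shows. A lighter variant, sketched here under the assumption that only a few columns are needed, restricts column_names and prints just the first rows; unknown_sex is an illustrative name.

```r
# Assumes `census` was opened with open_soma() as shown earlier.
unknown_sex <- as.data.frame(
  census$get("census_data")$get("homo_sapiens")$obs$read(
    value_filter = "sex == 'unknown'",
    column_names = c("soma_joinid", "dataset_id", "cell_type", "tissue_general")
  )$concat()
)

# Number of matching cells, then a short preview.
nrow(unknown_sex)
head(unknown_sex)
```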

You can use both column_names and value_filter to perform specific queries. For example let’s fetch the disease column for the cell_type "B cell" in the tissue_general "lung".

cell_metadata_b_cell <- census$get("census_data")$get("homo_sapiens")$obs$read(
  value_filter = "cell_type == 'B cell' & tissue_general == 'lung'",
  column_names = "disease"
)

cell_metadata_b_cell <- as.data.frame(cell_metadata_b_cell$concat())

table(cell_metadata_b_cell)
#> disease
#>                                                        Alzheimer disease 
#>                                                                        0 
#>                                      B-cell acute lymphoblastic leukemia 
#>                                                                        0 
#>                                              B-cell non-Hodgkin lymphoma 
#>                                                                        0 
#>                                                        Barrett esophagus 
#>                                                                        0 
#>                                                                 COVID-19 
#>                                                                     2729 
#>                                                            Crohn disease 
#>                                                                        0 
#>                                                            Crohn ileitis 
#>                                                                        0 
#>                                                            Down syndrome 
#>                                                                        0 
#>                                                       Lewy body dementia 
#>                                                                        0 
#>                                                        Parkinson disease 
#>                                                                        0 
#>                                              Plasmodium malariae malaria 
#>                                                                        0 
#>                                                              Wilms tumor 
#>                                                                        0 
#>                                                     acute kidney failure 
#>                                                                        0 
#>                                                   acute myeloid leukemia 
#>                                                                        0 
#>                                              acute myocardial infarction 
#>                                                                        0 
#>                                             acute promyelocytic leukemia 
#>                                                                        0 
#>                                                           adenocarcinoma 
#>                                                                        0 
#>                                       age related macular degeneration 7 
#>                                                                        0 
#>                                            amyotrophic lateral sclerosis 
#>                                                                        0 
#> amyotrophic lateral sclerosis 26 with or without frontotemporal dementia 
#>                                                                        0 
#>                                                              anencephaly 
#>                                                                       79 
#>                          arrhythmogenic right ventricular cardiomyopathy 
#>                                                                        0 
#>                                                     aspiration pneumonia 
#>                                                                        0 
#>                                                     basal cell carcinoma 
#>                                                                        0 
#>                                                     basal laminar drusen 
#>                                                                        0 
#>                                             benign prostatic hyperplasia 
#>                                                                        0 
#>                                                                 blastoma 
#>                                                                        0 
#>                                                            breast cancer 
#>                                                                        0 
#>                                                         breast carcinoma 
#>                                                                        0 
#>                                                           cardiomyopathy 
#>                                                                        0 
#>                                                                 cataract 
#>                                                                        0 
#>                                         chromophobe renal cell carcinoma 
#>                                                                        0 
#>                                                   chronic kidney disease 
#>                                                                        0 
#>                                    chronic obstructive pulmonary disease 
#>                                                                     6369 
#>                                                         chronic rhinitis 
#>                                                                        0 
#>                                               clear cell renal carcinoma 
#>                                                                        0 
#>                                     colon sessile serrated adenoma/polyp 
#>                                                                        0 
#>                                                        colorectal cancer 
#>                                                                        0 
#>                                                      colorectal neoplasm 
#>                                                                        0 
#>                                         common variable immunodeficiency 
#>                                                                        0 
#>                                                 congenital heart disease 
#>                                                                        0 
#>                                                          cystic fibrosis 
#>                                                                        0 
#>                                                                 dementia 
#>                                                                        0 
#>                                                digestive system disorder 
#>                                                                        0 
#>                                                   dilated cardiomyopathy 
#>                                                                        0 
#>                                                                 epilepsy 
#>                                                                        0 
#>                                                      follicular lymphoma 
#>                                                                        0 
#>                                                  frontotemporal dementia 
#>                                                                        0 
#>                                                           gastric cancer 
#>                                                                        0 
#>                                            gastric intestinal metaplasia 
#>                                                                        0 
#>                                                                gastritis 
#>                                                                        0 
#>                                                               gingivitis 
#>                                                                        0 
#>                                                             glioblastoma 
#>                                                                        0 
#>                                                           heart disorder 
#>                                                                        0 
#>                                                            heart failure 
#>                                                                        0 
#>                                                             hydrosalpinx 
#>                                                                        0 
#>                                                       hyperplastic polyp 
#>                                                                        0 
#>                                             hypersensitivity pneumonitis 
#>                                                                       52 
#>                                                                influenza 
#>                                                                        0 
#>                                                                   injury 
#>                                                                        0 
#>                                                interstitial lung disease 
#>                                                                      376 
#>                                                                   keloid 
#>                                                                        0 
#>                                                   kidney benign neoplasm 
#>                                                                        0 
#>                                                        kidney oncocytoma 
#>                                                                        0 
#>                                                              listeriosis 
#>                                                                        0 
#>                                                    localized scleroderma 
#>                                                                        0 
#>                                                            long COVID-19 
#>                                                                        0 
#>                                               luminal A breast carcinoma 
#>                                                                        0 
#>                                               luminal B breast carcinoma 
#>                                                                        0 
#>                                                      lung adenocarcinoma 
#>                                                                    62351 
#>                                                lung large cell carcinoma 
#>                                                                     1534 
#>                                                 lymphangioleiomyomatosis 
#>                                                                      133 
#>                                                     macular degeneration 
#>                                                                        0 
#>                                            malignant pancreatic neoplasm 
#>                                                                        0 
#>                                                       multiple sclerosis 
#>                                                                        0 
#>                                                    myocardial infarction 
#>                                                                        0 
#>                                                 neuroendocrine carcinoma 
#>                                                                        0 
#>                                            non-compaction cardiomyopathy 
#>                                                                        0 
#>                                            non-small cell lung carcinoma 
#>                                                                    17484 
#>                                      non-specific interstitial pneumonia 
#>                                                                      231 
#>                                        nonpapillary renal cell carcinoma 
#>                                                                        0 
#>                                                                   normal 
#>                                                                    25461 
#>                                                        opiate dependence 
#>                                                                        0 
#>                                                            periodontitis 
#>                                                                        0 
#>                                                    pilocytic astrocytoma 
#>                                                                        0 
#>                                                      plasma cell myeloma 
#>                                                                        0 
#>                                                    pleomorphic carcinoma 
#>                                                                     1210 
#>                                                                pneumonia 
#>                                                                       50 
#>                                                   post-COVID-19 disorder 
#>                                                                        0 
#>                                premalignant hematological system disease 
#>                                                                        0 
#>                                              primary biliary cholangitis 
#>                                                                        0 
#>                                           primary sclerosing cholangitis 
#>                                                                        0 
#>                                                      pulmonary emphysema 
#>                                                                     1512 
#>                                                       pulmonary fibrosis 
#>                                                                     6798 
#>                                                    pulmonary sarcoidosis 
#>                                                                        6 
#>                                                      respiratory failure 
#>                                                                        0 
#>                                              respiratory system disorder 
#>                                                                        0 
#>                                                small cell lung carcinoma 
#>                                                                      583 
#>                                             squamous cell lung carcinoma 
#>                                                                    11920 
#>                                             systemic lupus erythematosus 
#>                                                                        0 
#>                                                   temporal lobe epilepsy 
#>                                                                        0 
#>                                                            tongue cancer 
#>                                                                        0 
#>                                                            toxoplasmosis 
#>                                                                        0 
#>                                         triple-negative breast carcinoma 
#>                                                                        0 
#>                                                               trisomy 18 
#>                                                                        0 
#>                                                          tubular adenoma 
#>                                                                        0 
#>                                                    tubulovillous adenoma 
#>                                                                        0 
#>                                                 type 1 diabetes mellitus 
#>                                                                        0 
#>                                                 type 2 diabetes mellitus 
#>                                                                        0
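Most entries in the table above are zero, presumably because the disease column keeps every possible level even when the filter leaves no matching cells. A short sketch that keeps only the diseases actually present, using the cell_metadata_b_cell data frame created above (disease_counts is an illustrative name):

```r
# Tabulate the disease column of the data frame built above.
disease_counts <- table(cell_metadata_b_cell$disease)

# Keep only diseases with at least one cell, most frequent first.
sort(disease_counts[disease_counts > 0], decreasing = TRUE)
```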

Querying gene metadata (var)

The human gene metadata of the Census is located at census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var. Similarly to the cell metadata, it is a SOMADataFrame and thus we can also use its method read().

The mouse gene metadata is at census$get("census_data")$get("mus_musculus")$ms$get("RNA")$var.

Let’s take a look at the metadata available for column selection and row filtering.

census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var$colnames()
#> [1] "soma_joinid"    "feature_id"     "feature_name"   "feature_length" "nnz"           
#> [6] "n_measured_obs"

With the exception of soma_joinid, these columns are defined in the Census schema. As with the cell metadata, the same read() arguments can be used to explore and fetch gene metadata.

For example, to get the feature_name and feature_length of the genes "ENSG00000161798" and "ENSG00000188229" we can do the following.

var_df <- census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var$read(
  value_filter = "feature_id %in% c('ENSG00000161798', 'ENSG00000188229')",
  column_names = c("feature_name", "feature_length")
)

as.data.frame(var_df$concat())
#>   feature_name feature_length
#> 1         AQP5           1884
#> 2       TUBB4B           2037
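The lookup also works in the other direction. As a sketch, the query below filters on feature_name to recover the Ensembl IDs of the two genes shown above; it assumes the census object opened earlier, and var_soma and gene_ids are illustrative names.

```r
# Assumes `census` was opened with open_soma() as shown earlier.
var_soma <- census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var

# Filter by gene symbol and return the matching Ensembl IDs.
gene_ids <- as.data.frame(
  var_soma$read(
    value_filter = "feature_name %in% c('AQP5', 'TUBB4B')",
    column_names = c("feature_id", "feature_name")
  )$concat()
)
gene_ids
```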

Querying expression data as Seurat

A convenient way to query and fetch expression data is to use the get_seurat method of the cellxgene.census API. This is a method that combines the column selection and value filtering we described above to obtain slices of the expression data based on metadata queries.

The method returns a Seurat object. It takes a census object and a string naming the organism as input. For both cell and gene metadata, filters and column selection work as described above, but are specified with the following arguments:

obs_column_names: character vector indicating the columns to select for cell metadata.
obs_value_filter: expression with selection conditions to fetch the cells that match them.
var_column_names: character vector indicating the columns to select for gene metadata.
var_value_filter: expression with selection conditions to fetch the genes that match them.

For example, the call below fetches the expression data for:

Genes "ENSG00000161798" and "ENSG00000188229".
All "B cell" cells from "lung" with "COVID-19".
All gene metadata, plus the cell metadata columns cell_type, tissue_general, disease and sex.

library("Seurat")

seurat_obj <- get_seurat(
  census, "Homo sapiens",
  obs_column_names = c("cell_type", "tissue_general", "disease", "sex"),
  var_value_filter = "feature_id %in% c('ENSG00000161798', 'ENSG00000188229')",
  obs_value_filter = "cell_type == 'B cell' & tissue_general == 'lung' & disease == 'COVID-19'"
)
seurat_obj
#> An object of class Seurat 
#> 2 features across 2729 samples within 1 assay 
#> Active assay: RNA (2 features, 0 variable features)
#>  2 layers present: counts, data

head(seurat_obj[[]])
#>                orig.ident cell_type tissue_general  disease  sex
#> cell8964464 SeuratProject    B cell           lung COVID-19 male
#> cell8964864 SeuratProject    B cell           lung COVID-19 male
#> cell8965181 SeuratProject    B cell           lung COVID-19 male
#> cell8965207 SeuratProject    B cell           lung COVID-19 male
#> cell8965360 SeuratProject    B cell           lung COVID-19 male
#> cell8965378 SeuratProject    B cell           lung COVID-19 male

head(seurat_obj$RNA[[]])
#>                 feature_name feature_length      nnz n_measured_obs
#> ENSG00000161798         AQP5           1884  1226640       68915280
#> ENSG00000188229       TUBB4B           2037 26463689       73806975

For a full description refer to ?cellxgene.census::get_seurat.
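Once the object is in memory, it can be inspected with the usual Seurat accessors. The sketch below assumes the seurat_obj created above and Seurat v5, where GetAssayData() takes a layer argument (older versions use slot instead); seurat_counts is an illustrative name.

```r
library("Seurat")

# Cell metadata columns are available through `$`.
table(seurat_obj$sex)

# Extract the raw counts as a genes-by-cells matrix (Seurat v5 `layer` argument).
seurat_counts <- GetAssayData(seurat_obj, layer = "counts")
dim(seurat_counts)

# Total counts per gene across the selected cells.
Matrix::rowSums(seurat_counts)
```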

Querying expression data as SingleCellExperiment

Similarly to the previous section, there is a get_single_cell_experiment method in the cellxgene.census API. It behaves exactly the same as get_seurat but it returns a SingleCellExperiment object.

For example, to repeat the same query we can simply do the following.

library("SingleCellExperiment")

sce_obj <- get_single_cell_experiment(
  census, "Homo sapiens",
  obs_column_names = c("cell_type", "tissue_general", "disease", "sex"),
  var_value_filter = "feature_id %in% c('ENSG00000161798', 'ENSG00000188229')",
  obs_value_filter = "cell_type == 'B cell' & tissue_general == 'lung' & disease == 'COVID-19'"
)
sce_obj
#> class: SingleCellExperiment 
#> dim: 2 2729 
#> metadata(0):
#> assays(1): counts
#> rownames(2): ENSG00000161798 ENSG00000188229
#> rowData names(4): feature_name feature_length nnz n_measured_obs
#> colnames(2729): obs8964464 obs8964864 ... obs69303276 obs69304862
#> colData names(4): cell_type tissue_general disease sex
#> reducedDimNames(0):
#> mainExpName: RNA
#> altExpNames(0):

head(colData(sce_obj))
#> DataFrame with 6 rows and 4 columns
#>            cell_type tissue_general  disease      sex
#>             <factor>       <factor> <factor> <factor>
#> obs8964464    B cell           lung COVID-19     male
#> obs8964864    B cell           lung COVID-19     male
#> obs8965181    B cell           lung COVID-19     male
#> obs8965207    B cell           lung COVID-19     male
#> obs8965360    B cell           lung COVID-19     male
#> obs8965378    B cell           lung COVID-19     male

head(rowData(sce_obj))
#> DataFrame with 2 rows and 4 columns
#>                 feature_name feature_length       nnz n_measured_obs
#>                  <character>      <integer> <integer>      <integer>
#> ENSG00000161798         AQP5           1884   1226640       68915280
#> ENSG00000188229       TUBB4B           2037  26463689       73806975

For a full description refer to ?cellxgene.census::get_single_cell_experiment.
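The same kind of inspection works on the SingleCellExperiment object. This sketch assumes the sce_obj created above and uses the counts() accessor, which matches the single "counts" assay listed in the printed summary; sce_counts is an illustrative name.

```r
library("SingleCellExperiment")

# Extract the counts assay as a genes-by-cells matrix.
sce_counts <- counts(sce_obj)
dim(sce_counts)

# Total counts per gene across the selected cells.
Matrix::rowSums(sce_counts)

# Cell metadata lives in colData().
table(colData(sce_obj)$sex)
```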

Closing the census

After use, the census object should be closed to release memory and other resources.

census$close()

This also closes all SOMA objects accessed via the top-level census. Inside a function, closing can be automated by calling on.exit(census$close(), add = TRUE) immediately after census <- open_soma().
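The on.exit() pattern can be sketched as a small function that opens the Census, runs one metadata query and always closes the handle. The helper name count_matching_cells is hypothetical and not part of the cellxgene.census API.

```r
library("cellxgene.census")

# Hypothetical helper: counts the human cells that match a value_filter expression.
count_matching_cells <- function(value_filter) {
  census <- open_soma()
  # Registered right after opening, so the handle is closed even if a later step fails.
  on.exit(census$close(), add = TRUE)

  obs <- census$get("census_data")$get("homo_sapiens")$obs
  # Fetch a single lightweight column; only the number of rows is needed.
  matches <- obs$read(value_filter = value_filter, column_names = "soma_joinid")$concat()
  nrow(as.data.frame(matches))
}

count_matching_cells("cell_type == 'B cell' & tissue_general == 'lung' & disease == 'COVID-19'")
```

Because the census is opened and closed inside the function, callers do not have to manage the handle themselves.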