Querying and fetching the single-cell data and cell/gene metadata
Source:vignettes/census_query_extract.Rmd
census_query_extract.Rmd
This tutorial showcases the easiest ways to query the expression data and cell/gene metadata from the Census, and load them into R data frames, Seurat assays, and SingleCellExperiment objects.
Contents
- Opening the census.
- Querying cell metadata (obs).
- Querying gene metadata (var).
- Querying expression data as
Seurat
. - Querying expression data as
SingleCellExperiment
.
Opening the census
The cellxgene.census
R package contains a convenient API to open any version of the Census (by default, the newest stable version).
library("cellxgene.census")
census <- open_soma()
You can learn more about the cellxgene.census
methods by accessing their corresponding documentation, for example ?cellxgene.census::open_soma
.
Querying cell metadata (obs)
The human gene metadata of the Census, for RNA assays, is located at census$get("census_data")$get("homo_sapiens")$obs
. This is a SOMADataFrame
and as such it can be materialized as an R data frame (tibble) using as.data.frame(obs$read()$concat())
.
The mouse cell metadata is at census$get("census_data")$get("mus_musculus").obs
.
For slicing the cell metadata there are two relevant arguments that can be passed through read():
-
column_names
— character vector indicating what metadata columns to fetch. -
value_filter
— R expression with selection conditions to fetch rows.- Expressions are one or more comparisons
- Comparisons are one of
<column> <op> <value>
or<column> <op> <column>
- Expressions can combine comparisons using && or ||
- op is one of < | > | <= | >= | == | != or %in%
To learn what metadata columns are available for fetching and filtering we can directly look at the keys of the cell metadata.
census$get("census_data")$get("homo_sapiens")$obs$colnames()
#> [1] "soma_joinid"
#> [2] "dataset_id"
#> [3] "assay"
#> [4] "assay_ontology_term_id"
#> [5] "cell_type"
#> [6] "cell_type_ontology_term_id"
#> [7] "development_stage"
#> [8] "development_stage_ontology_term_id"
#> [9] "disease"
#> [10] "disease_ontology_term_id"
#> [11] "donor_id"
#> [12] "is_primary_data"
#> [13] "self_reported_ethnicity"
#> [14] "self_reported_ethnicity_ontology_term_id"
#> [15] "sex"
#> [16] "sex_ontology_term_id"
#> [17] "suspension_type"
#> [18] "tissue"
#> [19] "tissue_ontology_term_id"
#> [20] "tissue_general"
#> [21] "tissue_general_ontology_term_id"
#> [22] "raw_sum"
#> [23] "nnz"
#> [24] "raw_mean_nnz"
#> [25] "raw_variance_nnz"
#> [26] "n_measured_vars"
soma_joinid
is a special SOMADataFrame
column that is used for join operations. The definition for all other columns can be found at the Census schema.
All of these can be used to fetch specific columns or specific rows matching a condition. For the latter we need to know the values we are looking for a priori.
For example let’s see what are the possible values available for sex
. To this we can load all cell metadata but fetching only for the column sex
.
unique(as.data.frame(census$get("census_data")$get("homo_sapiens")$obs$read(column_names = "sex")$concat()))
#> sex
#> 1 male
#> 224 female
#> 3747640 unknown
As you can see there are only three different values for sex, that is "male"
, "female"
and "unknown"
.
With this information we can fetch all cell metatadata for a specific sex value, for example "unknown"
.
as.data.frame(census$get("census_data")$get("homo_sapiens")$obs$read(value_filter = "sex == 'unknown'")$concat())
#> soma_joinid dataset_id assay assay_ontology_term_id
#> 1 3747639 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2 EFO:0009899
#> 2 3747640 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2 EFO:0009899
#> 3 3747641 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2 EFO:0009899
#> 4 3747642 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2 EFO:0009899
#> 5 3747643 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2 EFO:0009899
#> 6 3747644 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2 EFO:0009899
#> 7 3747645 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2 EFO:0009899
#> 8 3747646 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2 EFO:0009899
#> 9 3747647 9fcb0b73-c734-40a5-be9c-ace7eea401c9 10x 3' v2 EFO:0009899
#> cell_type cell_type_ontology_term_id development_stage
#> 1 fibroblast CL:0000057 human adult stage
#> 2 fibroblast CL:0000057 human adult stage
#> 3 fibroblast CL:0000057 human adult stage
#> 4 fibroblast CL:0000057 human adult stage
#> 5 fibroblast CL:0000057 human adult stage
#> 6 fibroblast CL:0000057 human adult stage
#> 7 fibroblast CL:0000057 human adult stage
#> 8 fibroblast CL:0000057 human adult stage
#> 9 fibroblast CL:0000057 human adult stage
#> development_stage_ontology_term_id disease disease_ontology_term_id
#> 1 HsapDv:0000087 normal PATO:0000461
#> 2 HsapDv:0000087 normal PATO:0000461
#> 3 HsapDv:0000087 normal PATO:0000461
#> 4 HsapDv:0000087 normal PATO:0000461
#> 5 HsapDv:0000087 normal PATO:0000461
#> 6 HsapDv:0000087 normal PATO:0000461
#> 7 HsapDv:0000087 normal PATO:0000461
#> 8 HsapDv:0000087 normal PATO:0000461
#> 9 HsapDv:0000087 normal PATO:0000461
#> donor_id is_primary_data self_reported_ethnicity
#> 1 Pagella_GSE161267_GSM4904134 TRUE unknown
#> 2 Pagella_GSE161267_GSM4904134 TRUE unknown
#> 3 Pagella_GSE161267_GSM4904134 TRUE unknown
#> 4 Pagella_GSE161267_GSM4904134 TRUE unknown
#> 5 Pagella_GSE161267_GSM4904134 TRUE unknown
#> 6 Pagella_GSE161267_GSM4904134 TRUE unknown
#> 7 Pagella_GSE161267_GSM4904134 TRUE unknown
#> 8 Pagella_GSE161267_GSM4904134 TRUE unknown
#> 9 Pagella_GSE161267_GSM4904134 TRUE unknown
#> self_reported_ethnicity_ontology_term_id sex sex_ontology_term_id suspension_type
#> 1 unknown unknown unknown cell
#> 2 unknown unknown unknown cell
#> 3 unknown unknown unknown cell
#> 4 unknown unknown unknown cell
#> 5 unknown unknown unknown cell
#> 6 unknown unknown unknown cell
#> 7 unknown unknown unknown cell
#> 8 unknown unknown unknown cell
#> 9 unknown unknown unknown cell
#> tissue tissue_ontology_term_id tissue_general tissue_general_ontology_term_id
#> 1 gingiva UBERON:0001828 mucosa UBERON:0000344
#> 2 gingiva UBERON:0001828 mucosa UBERON:0000344
#> 3 gingiva UBERON:0001828 mucosa UBERON:0000344
#> 4 gingiva UBERON:0001828 mucosa UBERON:0000344
#> 5 gingiva UBERON:0001828 mucosa UBERON:0000344
#> 6 gingiva UBERON:0001828 mucosa UBERON:0000344
#> 7 gingiva UBERON:0001828 mucosa UBERON:0000344
#> 8 gingiva UBERON:0001828 mucosa UBERON:0000344
#> 9 gingiva UBERON:0001828 mucosa UBERON:0000344
#> raw_sum nnz raw_mean_nnz raw_variance_nnz n_measured_vars
#> 1 547 329 1.662614 14.559604 31602
#> 2 982 563 1.744227 5.315247 31602
#> 3 12467 3809 3.273038 109.305683 31602
#> 4 1053 566 1.860424 7.430042 31602
#> 5 548 363 1.509642 2.410818 31602
#> 6 678 429 1.580420 11.379616 31602
#> 7 848 524 1.618321 9.437216 31602
#> 8 935 608 1.537829 4.868418 31602
#> 9 735 485 1.515464 6.213087 31602
#> [ reached 'max' / getOption("max.print") -- omitted 3301779 rows ]
You can use both column_names
and value_filter
to perform specific queries. For example let’s fetch the disease
column for the cell_type
"B cell"
in the tissue_general
"lung"
.
cell_metadata_b_cell <- census$get("census_data")$get("homo_sapiens")$obs$read(
value_filter = "cell_type == 'B cell' & tissue_general == 'lung'",
column_names = "disease"
)
cell_metadata_b_cell <- as.data.frame(cell_metadata_b_cell$concat())
table(cell_metadata_b_cell)
#> disease
#> COVID-19 chronic obstructive pulmonary disease
#> 2729 6369
#> hypersensitivity pneumonitis interstitial lung disease
#> 52 376
#> lung adenocarcinoma lung large cell carcinoma
#> 62351 1534
#> lymphangioleiomyomatosis non-small cell lung carcinoma
#> 133 17484
#> non-specific interstitial pneumonia normal
#> 231 25461
#> pleomorphic carcinoma pneumonia
#> 1210 50
#> pulmonary emphysema pulmonary fibrosis
#> 1512 6798
#> pulmonary sarcoidosis small cell lung carcinoma
#> 6 583
#> squamous cell lung carcinoma
#> 11920
Querying gene metadata (var)
The human gene metadata of the Census is located at census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var
. Similarly to the cell metadata, it is a SOMADataFrame
and thus we can also use its method read()
.
The mouse gene metadata is at census$get("census_data")$get("mus_musculus")$ms$get("RNA")$var
.
Let’s take a look at the metadata available for column selection and row filtering.
census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var$colnames()
#> [1] "soma_joinid" "feature_id" "feature_name" "feature_length" "nnz"
#> [6] "n_measured_obs"
With the exception of soma_joinid these columns are defined in the Census schema. Similarly to the cell metadata, we can use the same operations to learn and fetch gene metadata.
For example, to get the feature_name
and feature_length
of the genes "ENSG00000161798"
and "ENSG00000188229"
we can do the following.
var_df <- census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var$read(
value_filter = "feature_id %in% c('ENSG00000161798', 'ENSG00000188229')",
column_names = c("feature_name", "feature_length")
)
as.data.frame(var_df$concat())
#> feature_name feature_length
#> 1 AQP5 1884
#> 2 TUBB4B 2037
Querying expression data as Seurat
A convenient way to query and fetch expression data is to use the get_seurat
method of the cellxgene.census
API. This is a method that combines the column selection and value filtering we described above to obtain slices of the expression data based on metadata queries.
The method will return a Seurat
object, it takes as an input a census object, the string for an organism, and for both cell and gene metadata we can specify filters and column selection as described above but with the following arguments:
-
obs_column_names
— character vector indicating the columns to select for cell metadata. -
obs_value_filter
— expression with selection conditions to fetch cells meeting a criteria. -
var_column_names
— character vector indicating the columns to select for gene metadata. -
var_value_filter
— expression with selection conditions to fetch genes meeting a criteria.
For example if we want to fetch the expression data for:
- Genes
"ENSG00000161798"
and"ENSG00000188229"
. - All
"B cells"
of"lung"
with"COVID-19"
. - With all gene metadata and adding
sex
cell metadata.
library("Seurat")
seurat_obj <- get_seurat(
census, "Homo sapiens",
obs_column_names = c("cell_type", "tissue_general", "disease", "sex"),
var_value_filter = "feature_id %in% c('ENSG00000161798', 'ENSG00000188229')",
obs_value_filter = "cell_type == 'B cell' & tissue_general == 'lung' & disease == 'COVID-19'"
)
seurat_obj
#> An object of class Seurat
#> 2 features across 2729 samples within 1 assay
#> Active assay: RNA (2 features, 0 variable features)
#> 2 layers present: counts, data
head(seurat_obj[[]])
#> orig.ident nCount_RNA nFeature_RNA cell_type tissue_general disease
#> cell13391229 SeuratProject 0 0 B cell lung COVID-19
#> cell13393737 SeuratProject 1 1 B cell lung COVID-19
#> cell13394391 SeuratProject 0 0 B cell lung COVID-19
#> cell13394897 SeuratProject 0 0 B cell lung COVID-19
#> cell13395941 SeuratProject 0 0 B cell lung COVID-19
#> cell13397408 SeuratProject 0 0 B cell lung COVID-19
#> sex
#> cell13391229 male
#> cell13393737 unknown
#> cell13394391 male
#> cell13394897 unknown
#> cell13395941 male
#> cell13397408 unknown
head(seurat_obj$RNA[[]])
#> feature_name feature_length nnz n_measured_obs
#> ENSG00000161798 AQP5 1884 1029069 58250439
#> ENSG00000188229 TUBB4B 2037 21416107 62655002
For a full description refer to ?cellxgene.census::get_seurat
.
Querying expression data as SingleCellExperiment
Similarly to the previous section, there is a get_single_cell_experiment
method in the cellxgene.census
API. It behaves exactly the same as get_seurat
but it returns a SingleCellExperiment
object.
For example, to repeat the same query we can simply do the following.
library("SingleCellExperiment")
sce_obj <- get_single_cell_experiment(
census, "Homo sapiens",
obs_column_names = c("cell_type", "tissue_general", "disease", "sex"),
var_value_filter = "feature_id %in% c('ENSG00000161798', 'ENSG00000188229')",
obs_value_filter = "cell_type == 'B cell' & tissue_general == 'lung' & disease == 'COVID-19'"
)
sce_obj
#> class: SingleCellExperiment
#> dim: 2 2729
#> metadata(0):
#> assays(1): counts
#> rownames(2): ENSG00000161798 ENSG00000188229
#> rowData names(4): feature_name feature_length nnz n_measured_obs
#> colnames(2729): obs13391229 obs13393737 ... obs54635684 obs54635708
#> colData names(4): cell_type tissue_general disease sex
#> reducedDimNames(0):
#> mainExpName: RNA
#> altExpNames(0):
head(colData(sce_obj))
#> DataFrame with 6 rows and 4 columns
#> cell_type tissue_general disease sex
#> <character> <character> <character> <character>
#> obs13391229 B cell lung COVID-19 male
#> obs13393737 B cell lung COVID-19 unknown
#> obs13394391 B cell lung COVID-19 male
#> obs13394897 B cell lung COVID-19 unknown
#> obs13395941 B cell lung COVID-19 male
#> obs13397408 B cell lung COVID-19 unknown
head(rowData(sce_obj))
#> DataFrame with 2 rows and 4 columns
#> feature_name feature_length nnz n_measured_obs
#> <character> <integer> <integer> <integer>
#> ENSG00000161798 AQP5 1884 1029069 58250439
#> ENSG00000188229 TUBB4B 2037 21416107 62655002
For a full description refer to ?cellxgene.census::get_single_cell_experiment
.