Querying and fetching the single-cell data and cell/gene metadata
Source:vignettes/census_query_extract.Rmd
This tutorial showcases the easiest ways to query the expression data and cell/gene metadata from the Census, and load them into R data frames, Seurat assays, and SingleCellExperiment objects.
Contents
- Opening the census.
- Querying cell metadata (obs).
- Querying gene metadata (var).
- Querying expression data as Seurat.
- Querying expression data as SingleCellExperiment.
Opening the census
The cellxgene.census R package contains a convenient API to open any version of the Census (by default, the newest stable version).
library("cellxgene.census")
census <- open_soma()
You can learn more about the cellxgene.census methods by accessing their corresponding documentation, for example ?cellxgene.census::open_soma.
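As a minimal sketch of the open-and-release pattern, the helper below opens the Census, runs a function against it, and closes the handle afterwards. The helper name with_census is hypothetical, and both the census_version argument of open_soma and the close() method on the returned handle are assumptions to check against ?cellxgene.census::open_soma.

```r
# Hypothetical helper (not part of cellxgene.census): run `fun` against an
# open Census handle and release the handle even if `fun` fails.
with_census <- function(fun, census_version = "stable") {
  # Assumption: open_soma() accepts a `census_version` argument.
  census <- cellxgene.census::open_soma(census_version = census_version)
  # Assumption: the returned handle exposes a close() method.
  on.exit(census$close(), add = TRUE)
  fun(census)
}

# Usage (needs the package installed and network access):
# obs_columns <- with_census(function(census) {
#   census$get("census_data")$get("homo_sapiens")$obs$colnames()
# })
```

Wrapping the work in a function keeps the handle's lifetime explicit, which may be convenient in scripts that open the Census more than once.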
Querying cell metadata (obs)
The human cell metadata of the Census, for RNA assays, is located at census$get("census_data")$get("homo_sapiens")$obs. This is a SOMADataFrame and as such it can be materialized as an R data frame (tibble) using as.data.frame(obs$read()$concat()), where obs refers to that SOMADataFrame.
The mouse cell metadata is at census$get("census_data")$get("mus_musculus")$obs.
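Because the human and mouse paths differ only in the organism key, the access pattern can be wrapped in small helpers. This is a sketch only: census_obs and read_obs are hypothetical names, not functions of the package, and they assume the collection layout shown above.

```r
# Hypothetical helper: return the obs SOMADataFrame for one organism.
# `organism` is the collection key used in this tutorial,
# e.g. "homo_sapiens" or "mus_musculus".
census_obs <- function(census, organism = "homo_sapiens") {
  census$get("census_data")$get(organism)$obs
}

# Hypothetical helper: materialize obs (optionally only some columns)
# as an R data frame. column_names = NULL requests all columns.
read_obs <- function(census, organism = "homo_sapiens", column_names = NULL) {
  obs <- census_obs(census, organism)
  as.data.frame(obs$read(column_names = column_names)$concat())
}

# Usage with an open census:
# mouse_sex <- read_obs(census, "mus_musculus", column_names = "sex")
```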
For slicing the cell metadata there are two relevant arguments that can be passed to read():
- column_names: character vector indicating what metadata columns to fetch.
- value_filter: string holding an R expression with selection conditions to fetch rows.
  - Expressions are one or more comparisons.
  - Comparisons are one of <column> <op> <value> or <column> <op> <column>.
  - Expressions can combine comparisons using && or || (the examples in this tutorial use the single-character & form).
  - op is one of < | > | <= | >= | == | != or %in%
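The grammar above lends itself to building filter strings programmatically rather than by hand. The following base R sketch shows one way to do that; filter_in, filter_eq and filter_and are hypothetical helper names, and the quoting is deliberately naive (values that themselves contain a single quote, such as the assay "10x 3' v2", would need escaping that is not handled here).

```r
# Hypothetical helpers for composing value_filter strings.
# Values are wrapped in single quotes, as in the examples in this tutorial.

# <column> %in% c('<v1>', '<v2>', ...)
filter_in <- function(column, values) {
  quoted <- paste0("'", values, "'", collapse = ", ")
  sprintf("%s %%in%% c(%s)", column, quoted)
}

# <column> == '<value>'
filter_eq <- function(column, value) {
  sprintf("%s == '%s'", column, value)
}

# Join any number of comparisons with a logical AND.
filter_and <- function(...) {
  paste(c(...), collapse = " && ")
}

lung_b_cells <- filter_and(
  filter_eq("cell_type", "B cell"),
  filter_eq("tissue_general", "lung")
)
print(lung_b_cells)
```

The resulting string can be passed as value_filter to read().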
To learn what metadata columns are available for fetching and filtering we can directly look at the keys of the cell metadata.
census$get("census_data")$get("homo_sapiens")$obs$colnames()
#> [1] "soma_joinid"
#> [2] "dataset_id"
#> [3] "assay"
#> [4] "assay_ontology_term_id"
#> [5] "cell_type"
#> [6] "cell_type_ontology_term_id"
#> [7] "development_stage"
#> [8] "development_stage_ontology_term_id"
#> [9] "disease"
#> [10] "disease_ontology_term_id"
#> [11] "donor_id"
#> [12] "is_primary_data"
#> [13] "observation_joinid"
#> [14] "self_reported_ethnicity"
#> [15] "self_reported_ethnicity_ontology_term_id"
#> [16] "sex"
#> [17] "sex_ontology_term_id"
#> [18] "suspension_type"
#> [19] "tissue"
#> [20] "tissue_ontology_term_id"
#> [21] "tissue_type"
#> [22] "tissue_general"
#> [23] "tissue_general_ontology_term_id"
#> [24] "raw_sum"
#> [25] "nnz"
#> [26] "raw_mean_nnz"
#> [27] "raw_variance_nnz"
#> [28] "n_measured_vars"
soma_joinid is a special SOMADataFrame column that is used for join operations. The definition for all other columns can be found at the Census schema.
All of these can be used to fetch specific columns or specific rows matching a condition. For the latter we need to know the values we are looking for a priori.
For example, let’s see which values are available for sex. To do this we can read the cell metadata for all cells, fetching only the column sex.
unique(as.data.frame(census$get("census_data")$get("homo_sapiens")$obs$read(column_names = "sex")$concat()))
#> sex
#> 1 female
#> 86 male
#> 63810 unknown
As you can see there are only three different values for sex: "male", "female" and "unknown".
With this information we can fetch all cell metadata for a specific sex value, for example "unknown".
as.data.frame(census$get("census_data")$get("homo_sapiens")$obs$read(value_filter = "sex == 'unknown'")$concat())
#> soma_joinid dataset_id assay assay_ontology_term_id
#> 1 63809 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2 EFO:0009899
#> 2 63825 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2 EFO:0009899
#> 3 63829 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2 EFO:0009899
#> 4 63842 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2 EFO:0009899
#> 5 63845 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2 EFO:0009899
#> 6 63848 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2 EFO:0009899
#> 7 63850 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2 EFO:0009899
#> 8 63859 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2 EFO:0009899
#> 9 63877 94423ec1-21f8-40e8-b5c9-c3ea82350ca4 10x 3' v2 EFO:0009899
#> cell_type cell_type_ontology_term_id development_stage
#> 1 dendritic cell CL:0000451 unknown
#> 2 monocyte CL:0000576 unknown
#> 3 monocyte CL:0000576 unknown
#> 4 mast cell CL:0000097 unknown
#> 5 monocyte CL:0000576 unknown
#> 6 monocyte CL:0000576 unknown
#> 7 monocyte CL:0000576 unknown
#> 8 monocyte CL:0000576 unknown
#> 9 monocyte CL:0000576 unknown
#> development_stage_ontology_term_id disease disease_ontology_term_id donor_id
#> 1 unknown normal PATO:0000461 P6709
#> 2 unknown normal PATO:0000461 P6207
#> 3 unknown normal PATO:0000461 P6709
#> 4 unknown normal PATO:0000461 P5846
#> 5 unknown normal PATO:0000461 P6709
#> 6 unknown normal PATO:0000461 P6709
#> 7 unknown normal PATO:0000461 P6709
#> 8 unknown normal PATO:0000461 P6709
#> 9 unknown normal PATO:0000461 P5846
#> is_primary_data observation_joinid self_reported_ethnicity
#> 1 FALSE C?ICmL<&>Z unknown
#> 2 FALSE 85i%LjfFIv unknown
#> 3 FALSE Ayapye-s;W unknown
#> 4 FALSE y?m!gJTm_} unknown
#> 5 FALSE ;%L|gB$19h unknown
#> 6 FALSE W6Gv)*|fO& unknown
#> 7 FALSE C_4WKgszOh unknown
#> 8 FALSE z`}uhAE2vd unknown
#> 9 FALSE vt{WkZp)ha unknown
#> self_reported_ethnicity_ontology_term_id sex sex_ontology_term_id suspension_type
#> 1 unknown unknown unknown cell
#> 2 unknown unknown unknown cell
#> 3 unknown unknown unknown cell
#> 4 unknown unknown unknown cell
#> 5 unknown unknown unknown cell
#> 6 unknown unknown unknown cell
#> 7 unknown unknown unknown cell
#> 8 unknown unknown unknown cell
#> 9 unknown unknown unknown cell
#> tissue tissue_ontology_term_id tissue_type tissue_general
#> 1 body of stomach UBERON:0001161 tissue stomach
#> 2 body of stomach UBERON:0001161 tissue stomach
#> 3 body of stomach UBERON:0001161 tissue stomach
#> 4 body of stomach UBERON:0001161 tissue stomach
#> 5 body of stomach UBERON:0001161 tissue stomach
#> 6 body of stomach UBERON:0001161 tissue stomach
#> 7 body of stomach UBERON:0001161 tissue stomach
#> 8 body of stomach UBERON:0001161 tissue stomach
#> 9 body of stomach UBERON:0001161 tissue stomach
#> tissue_general_ontology_term_id raw_sum nnz raw_mean_nnz raw_variance_nnz
#> 1 UBERON:0000945 695 368 1.888587 12.14287
#> 2 UBERON:0000945 6095 1427 4.271198 124.79807
#> 3 UBERON:0000945 1045 492 2.123984 23.31861
#> 4 UBERON:0000945 1546 640 2.415625 27.82386
#> 5 UBERON:0000945 1308 530 2.467925 59.81466
#> 6 UBERON:0000945 891 434 2.052995 16.80319
#> 7 UBERON:0000945 847 399 2.122807 23.51503
#> 8 UBERON:0000945 445 216 2.060185 55.25683
#> 9 UBERON:0000945 1672 668 2.502994 15.53672
#> n_measured_vars
#> 1 19550
#> 2 19550
#> 3 19550
#> 4 19550
#> 5 19550
#> 6 19550
#> 7 19550
#> 8 19550
#> 9 19550
#> [ reached 'max' / getOption("max.print") -- omitted 3756271 rows ]
You can use both column_names and value_filter to perform specific queries. For example, let’s fetch the disease column for the cell_type "B cell" in the tissue_general "lung".
cell_metadata_b_cell <- census$get("census_data")$get("homo_sapiens")$obs$read(
value_filter = "cell_type == 'B cell' & tissue_general == 'lung'",
column_names = "disease"
)
cell_metadata_b_cell <- as.data.frame(cell_metadata_b_cell$concat())
table(cell_metadata_b_cell)
#> disease
#> Alzheimer disease
#> 0
#> B-cell acute lymphoblastic leukemia
#> 0
#> B-cell non-Hodgkin lymphoma
#> 0
#> Barrett esophagus
#> 0
#> COVID-19
#> 2729
#> Crohn disease
#> 0
#> Crohn ileitis
#> 0
#> Down syndrome
#> 0
#> Lewy body dementia
#> 0
#> Parkinson disease
#> 0
#> Plasmodium malariae malaria
#> 0
#> Wilms tumor
#> 0
#> acute kidney failure
#> 0
#> acute myeloid leukemia
#> 0
#> acute myocardial infarction
#> 0
#> acute promyelocytic leukemia
#> 0
#> adenocarcinoma
#> 0
#> age related macular degeneration 7
#> 0
#> amyotrophic lateral sclerosis
#> 0
#> amyotrophic lateral sclerosis 26 with or without frontotemporal dementia
#> 0
#> anencephaly
#> 79
#> arrhythmogenic right ventricular cardiomyopathy
#> 0
#> aspiration pneumonia
#> 0
#> basal cell carcinoma
#> 0
#> basal laminar drusen
#> 0
#> benign prostatic hyperplasia
#> 0
#> blastoma
#> 0
#> breast cancer
#> 0
#> breast carcinoma
#> 0
#> cardiomyopathy
#> 0
#> cataract
#> 0
#> chromophobe renal cell carcinoma
#> 0
#> chronic kidney disease
#> 0
#> chronic obstructive pulmonary disease
#> 6369
#> chronic rhinitis
#> 0
#> clear cell renal carcinoma
#> 0
#> colon sessile serrated adenoma/polyp
#> 0
#> colorectal cancer
#> 0
#> colorectal neoplasm
#> 0
#> common variable immunodeficiency
#> 0
#> congenital heart disease
#> 0
#> cystic fibrosis
#> 0
#> dementia
#> 0
#> digestive system disorder
#> 0
#> dilated cardiomyopathy
#> 0
#> epilepsy
#> 0
#> follicular lymphoma
#> 0
#> frontotemporal dementia
#> 0
#> gastric cancer
#> 0
#> gastric intestinal metaplasia
#> 0
#> gastritis
#> 0
#> gingivitis
#> 0
#> glioblastoma
#> 0
#> heart disorder
#> 0
#> heart failure
#> 0
#> hydrosalpinx
#> 0
#> hyperplastic polyp
#> 0
#> hypersensitivity pneumonitis
#> 52
#> influenza
#> 0
#> injury
#> 0
#> interstitial lung disease
#> 376
#> keloid
#> 0
#> kidney benign neoplasm
#> 0
#> kidney oncocytoma
#> 0
#> listeriosis
#> 0
#> localized scleroderma
#> 0
#> long COVID-19
#> 0
#> luminal A breast carcinoma
#> 0
#> luminal B breast carcinoma
#> 0
#> lung adenocarcinoma
#> 62351
#> lung large cell carcinoma
#> 1534
#> lymphangioleiomyomatosis
#> 133
#> macular degeneration
#> 0
#> malignant pancreatic neoplasm
#> 0
#> multiple sclerosis
#> 0
#> myocardial infarction
#> 0
#> neuroendocrine carcinoma
#> 0
#> non-compaction cardiomyopathy
#> 0
#> non-small cell lung carcinoma
#> 17484
#> non-specific interstitial pneumonia
#> 231
#> nonpapillary renal cell carcinoma
#> 0
#> normal
#> 25461
#> opiate dependence
#> 0
#> periodontitis
#> 0
#> pilocytic astrocytoma
#> 0
#> plasma cell myeloma
#> 0
#> pleomorphic carcinoma
#> 1210
#> pneumonia
#> 50
#> post-COVID-19 disorder
#> 0
#> premalignant hematological system disease
#> 0
#> primary biliary cholangitis
#> 0
#> primary sclerosing cholangitis
#> 0
#> pulmonary emphysema
#> 1512
#> pulmonary fibrosis
#> 6798
#> pulmonary sarcoidosis
#> 6
#> respiratory failure
#> 0
#> respiratory system disorder
#> 0
#> small cell lung carcinoma
#> 583
#> squamous cell lung carcinoma
#> 11920
#> systemic lupus erythematosus
#> 0
#> temporal lobe epilepsy
#> 0
#> tongue cancer
#> 0
#> toxoplasmosis
#> 0
#> triple-negative breast carcinoma
#> 0
#> trisomy 18
#> 0
#> tubular adenoma
#> 0
#> tubulovillous adenoma
#> 0
#> type 1 diabetes mellitus
#> 0
#> type 2 diabetes mellitus
#> 0
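Most rows in the table above are zero, presumably because the disease column comes back as a factor that carries every disease level in the Census, not only those matching the filter. The sketch below uses a small toy factor (made-up values, not Census data) to show how to keep only the observed levels and sort them by frequency.

```r
# Toy stand-in for the disease column: a factor with one unused level.
disease <- factor(
  c("COVID-19", "normal", "normal"),
  levels = c("COVID-19", "influenza", "normal")
)

# table() reports every level, including the unused "influenza".
counts <- table(disease)

# Keep only levels observed at least once, most frequent first.
observed <- sort(counts[counts > 0], decreasing = TRUE)
print(observed)
```

With the data frame from the query above, the same idea would apply to table(cell_metadata_b_cell$disease).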
Querying gene metadata (var)
The human gene metadata of the Census is located at census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var. Similarly to the cell metadata, it is a SOMADataFrame and thus we can also use its method read().
The mouse gene metadata is at census$get("census_data")$get("mus_musculus")$ms$get("RNA")$var.
Let’s take a look at the metadata available for column selection and row filtering.
census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var$colnames()
#> [1] "soma_joinid" "feature_id" "feature_name" "feature_length" "nnz"
#> [6] "n_measured_obs"
With the exception of soma_joinid, these columns are defined in the Census schema. As with the cell metadata, we can use the same operations to inspect and fetch gene metadata.
For example, to get the feature_name and feature_length of the genes "ENSG00000161798" and "ENSG00000188229" we can do the following.
var_df <- census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var$read(
value_filter = "feature_id %in% c('ENSG00000161798', 'ENSG00000188229')",
column_names = c("feature_name", "feature_length")
)
as.data.frame(var_df$concat())
#> feature_name feature_length
#> 1 AQP5 1884
#> 2 TUBB4B 2037
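The same read() call also works in the other direction, filtering on feature_name to recover Ensembl IDs from gene symbols. The helper below is a hypothetical sketch (genes_by_name is not part of the package); it assumes var is the gene metadata SOMADataFrame described above and that symbols contain no single quotes.

```r
# Hypothetical helper: look up genes by symbol instead of Ensembl ID.
# `var` is the gene metadata SOMADataFrame described above.
genes_by_name <- function(var, gene_names,
                          column_names = c("feature_id", "feature_name")) {
  # Build "feature_name %in% c('A', 'B', ...)" from the requested symbols.
  quoted <- paste0("'", gene_names, "'", collapse = ", ")
  value_filter <- sprintf("feature_name %%in%% c(%s)", quoted)
  as.data.frame(
    var$read(value_filter = value_filter, column_names = column_names)$concat()
  )
}

# Usage with an open census:
# var <- census$get("census_data")$get("homo_sapiens")$ms$get("RNA")$var
# genes_by_name(var, c("AQP5", "TUBB4B"))
```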
Querying expression data as Seurat
A convenient way to query and fetch expression data is to use the get_seurat method of the cellxgene.census API. This method combines the column selection and value filtering we described above to obtain slices of the expression data based on metadata queries.
The method returns a Seurat object. It takes as input a census object and the string for an organism; for both cell and gene metadata we can specify filters and column selection as described above, using the following arguments:
- obs_column_names: character vector indicating the columns to select for cell metadata.
- obs_value_filter: expression with selection conditions to fetch cells that meet the given criteria.
- var_column_names: character vector indicating the columns to select for gene metadata.
- var_value_filter: expression with selection conditions to fetch genes that meet the given criteria.
For example, suppose we want to fetch the expression data for:
- Genes "ENSG00000161798" and "ENSG00000188229".
- All "B cells" of "lung" with "COVID-19".
- With all gene metadata and adding sex cell metadata.
library("Seurat")
seurat_obj <- get_seurat(
census, "Homo sapiens",
obs_column_names = c("cell_type", "tissue_general", "disease", "sex"),
var_value_filter = "feature_id %in% c('ENSG00000161798', 'ENSG00000188229')",
obs_value_filter = "cell_type == 'B cell' & tissue_general == 'lung' & disease == 'COVID-19'"
)
seurat_obj
#> An object of class Seurat
#> 2 features across 2729 samples within 1 assay
#> Active assay: RNA (2 features, 0 variable features)
#> 2 layers present: counts, data
head(seurat_obj[[]])
#> orig.ident cell_type tissue_general disease sex
#> cell8964464 SeuratProject B cell lung COVID-19 male
#> cell8964864 SeuratProject B cell lung COVID-19 male
#> cell8965181 SeuratProject B cell lung COVID-19 male
#> cell8965207 SeuratProject B cell lung COVID-19 male
#> cell8965360 SeuratProject B cell lung COVID-19 male
#> cell8965378 SeuratProject B cell lung COVID-19 male
head(seurat_obj$RNA[[]])
#> feature_name feature_length nnz n_measured_obs
#> ENSG00000161798 AQP5 1884 1226640 68915280
#> ENSG00000188229 TUBB4B 2037 26463689 73806975
For a full description refer to ?cellxgene.census::get_seurat.
Querying expression data as SingleCellExperiment
Similarly to the previous section, there is a get_single_cell_experiment method in the cellxgene.census API. It behaves exactly the same as get_seurat but it returns a SingleCellExperiment object.
For example, to repeat the same query we can simply do the following.
library("SingleCellExperiment")
sce_obj <- get_single_cell_experiment(
census, "Homo sapiens",
obs_column_names = c("cell_type", "tissue_general", "disease", "sex"),
var_value_filter = "feature_id %in% c('ENSG00000161798', 'ENSG00000188229')",
obs_value_filter = "cell_type == 'B cell' & tissue_general == 'lung' & disease == 'COVID-19'"
)
sce_obj
#> class: SingleCellExperiment
#> dim: 2 2729
#> metadata(0):
#> assays(1): counts
#> rownames(2): ENSG00000161798 ENSG00000188229
#> rowData names(4): feature_name feature_length nnz n_measured_obs
#> colnames(2729): obs8964464 obs8964864 ... obs69303276 obs69304862
#> colData names(4): cell_type tissue_general disease sex
#> reducedDimNames(0):
#> mainExpName: RNA
#> altExpNames(0):
head(colData(sce_obj))
#> DataFrame with 6 rows and 4 columns
#> cell_type tissue_general disease sex
#> <factor> <factor> <factor> <factor>
#> obs8964464 B cell lung COVID-19 male
#> obs8964864 B cell lung COVID-19 male
#> obs8965181 B cell lung COVID-19 male
#> obs8965207 B cell lung COVID-19 male
#> obs8965360 B cell lung COVID-19 male
#> obs8965378 B cell lung COVID-19 male
head(rowData(sce_obj))
#> DataFrame with 2 rows and 4 columns
#> feature_name feature_length nnz n_measured_obs
#> <character> <integer> <integer> <integer>
#> ENSG00000161798 AQP5 1884 1226640 68915280
#> ENSG00000188229 TUBB4B 2037 26463689 73806975
For a full description refer to ?cellxgene.census::get_single_cell_experiment.
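Since both getters take the same query arguments, one option is to define the query once and reuse it. The sketch below stores the arguments from the examples above in a list and applies them with do.call; run_query is a hypothetical helper, and the organism string is passed positionally exactly as in the calls shown earlier.

```r
# Query arguments shared by both getters, copied from the examples above.
# The first element is unnamed so that it is passed positionally,
# right after the census object.
query_args <- list(
  "Homo sapiens",
  obs_column_names = c("cell_type", "tissue_general", "disease", "sex"),
  var_value_filter = "feature_id %in% c('ENSG00000161798', 'ENSG00000188229')",
  obs_value_filter = "cell_type == 'B cell' & tissue_general == 'lung' & disease == 'COVID-19'"
)

# Hypothetical helper: call a getter with the census plus the shared arguments.
run_query <- function(getter, census, args = query_args) {
  do.call(getter, c(list(census), args))
}

# Usage with an open census:
# seurat_obj <- run_query(cellxgene.census::get_seurat, census)
# sce_obj <- run_query(cellxgene.census::get_single_cell_experiment, census)
```

Keeping the query in one place avoids the two calls drifting apart when a filter is edited.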