Summarizing cell and gene metadata
Source:vignettes/comp_bio_summarize_axis_query.Rmd
This vignette provides examples of basic axis metadata handling. The CZ CELLxGENE Census stores obs (cell) and var (gene) metadata in SOMA DataFrame objects via the TileDB-SOMA API (see its documentation), which can be queried and read as R data frames. The cellxgene.census convenience package simplifies opening the Census.
R data frames are in-memory objects. Take care that queries are small enough for results to fit in memory.
Contents
- Opening the Census
- Summarizing cell metadata
- Example: Summarize all cell types
- Example: Summarize a subset of cell types, selected with a value_filter
- Full Census metadata stats
Opening the Census
The cellxgene.census R package contains a convenient API to open any version of the Census (by default, the newest stable version).
library("cellxgene.census")
census <- open_soma()
If you open the Census, you should close it with census$close(). Inside a function, this can be automated by calling on.exit(census$close(), add = TRUE) immediately after census <- open_soma().
You can learn more about the cellxgene.census methods by reading their documentation, for example ?cellxgene.census::open_soma.
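The open-then-close pattern described above can be wrapped in a small helper. This is only a sketch: with_census is a hypothetical name that is not part of cellxgene.census, and it assumes the package is installed when the helper is actually called.

```r
# Hypothetical helper (not part of cellxgene.census): open the Census,
# run `fn(census)`, and close the Census even if `fn` raises an error.
with_census <- function(fn, ...) {
  # Extra arguments are passed through to open_soma() unchanged.
  census <- cellxgene.census::open_soma(...)
  # on.exit() runs when this function returns or fails.
  on.exit(census$close(), add = TRUE)
  fn(census)
}

# Usage (requires network access and the cellxgene.census package):
# organisms <- with_census(function(census) {
#   census$get("census_data")$names()
# })
```

Registering the close inside the helper means callers cannot forget it, at the cost of having to return plain R objects rather than open SOMA handles.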
Summarizing cell metadata
Once the Census is open, you can use TileDB-SOMA methods on it, because it is itself a SOMACollection. You can thus access the SOMADataFrame objects that encode cell and gene metadata.
Tips:
- You can read an entire SOMADataFrame into R using as.data.frame(soma_df$read()$concat()).
- Queries will be much faster if you request only the DataFrame columns required for your analysis (e.g. column_names = c("soma_joinid", "cell_type_ontology_term_id")).
- You can further refine query results with a value_filter, which restricts the result to matching records.
Example: Summarize all cell types
This example reads the cell metadata (obs) into an R data frame to summarize in a variety of ways.
human <- census$get("census_data")$get("homo_sapiens")
# Read obs into an R data frame.
obs_df <- human$obs$read(column_names = c("cell_type"))
obs_df <- as.data.frame(obs_df$concat())
# Find all unique values in the cell_type column.
unique_cell_type <- unique(obs_df$cell_type)
cat(
"There are",
length(unique_cell_type),
"cell types in the Census! The first few are: ",
paste(head(unique_cell_type), collapse = ", ")
)
#> There are 698 cell types in the Census! The first few are: plasma cell, mature B cell, macrophage, follicular dendritic cell, plasmacytoid dendritic cell, conventional dendritic cell
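The example above concatenates every batch into one data frame. When a result may not fit in memory, one option is to process the batches one at a time and keep only running counts. The sketch below assumes that the object returned by read() exposes read_next() and read_complete() methods, as the TileDB-SOMA R read iterators do; count_values_in_batches is a hypothetical helper name.

```r
# Hypothetical helper: count the values of one column batch by batch,
# so that only the per-batch counts are kept in memory.
# `iter` is assumed to provide read_next() and read_complete().
count_values_in_batches <- function(iter, column) {
  pieces <- list()
  while (!iter$read_complete()) {
    batch <- as.data.frame(iter$read_next())
    counts <- table(batch[[column]])
    pieces[[length(pieces) + 1]] <- data.frame(
      value = names(counts),
      n = as.vector(counts)
    )
  }
  # Merge the per-batch counts into one total per value.
  combined <- do.call(rbind, pieces)
  totals <- tapply(combined$n, combined$value, sum)
  sort(totals, decreasing = TRUE)
}

# Usage with the Census (requires an open Census, see above):
# iter <- human$obs$read(column_names = c("cell_type"))
# cell_type_counts <- count_values_in_batches(iter, "cell_type")
```

The helper returns a named vector of counts, so names() on the result gives the unique values without ever holding the full column.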
Example: Summarize a subset of cell types, selected with a value_filter
This example uses a SOMA "value filter" to read the subset of cells with tissue_ontology_term_id equal to UBERON:0002048 (lung tissue), and summarizes the query result.
# Read cell_type terms for cells which have a specific tissue term
LUNG_TISSUE <- "UBERON:0002048"
obs_df <- human$obs$read(column_names = c("cell_type"), value_filter = paste0("tissue_ontology_term_id == '", LUNG_TISSUE, "'"))
obs_df <- as.data.frame(obs_df$concat())
# Find all unique values in the cell_type column.
unique_cell_type <- unique(obs_df$cell_type)
cat(
"There are ",
length(unique_cell_type),
" cell types in the Census where tissue_ontology_term_id == ",
LUNG_TISSUE,
"!\nThe first few are:",
paste(head(unique_cell_type), collapse = ", "),
"\n"
)
#> There are 202 cell types in the Census where tissue_ontology_term_id == UBERON:0002048 !
#> The first few are: Schwann cell, immature Schwann cell, neuron, Schwann cell precursor, neural progenitor cell, lung ciliated cell
# Report the 10 most common
top_10 <- sort(table(obs_df$cell_type), decreasing = TRUE)[1:10]
cat(
"The top 10 cell types where tissue_ontology_term_id ==",
LUNG_TISSUE,
"are: ",
paste(names(top_10), collapse = ", ")
)
#> The top 10 cell types where tissue_ontology_term_id == UBERON:0002048 are: unknown, alveolar macrophage, CD8-positive, alpha-beta T cell, CD4-positive, alpha-beta T cell, macrophage, type II pneumocyte, classical monocyte, natural killer cell, malignant cell, epithelial cell of lower respiratory tract
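The ranking above can also be kept as a small data frame with each cell type's share of the result. The sketch below uses a toy obs_df in place of a real query result, so the values are illustrative only. It uses head() rather than [1:10], because indexing past the end of a shorter table pads the result with NA entries.

```r
# Toy stand-in for the obs data frame returned by a Census query.
obs_df <- data.frame(
  cell_type = c("unknown", "macrophage", "unknown",
                "neuron", "macrophage", "unknown")
)

# Count cells per cell type, most common first.
counts <- sort(table(obs_df$cell_type), decreasing = TRUE)

# head() returns at most 10 entries and never pads with NA.
top_n <- head(counts, 10)

# Keep the names, counts, and fractions together in one data frame.
summary_df <- data.frame(
  cell_type = names(top_n),
  n_cells = as.vector(top_n),
  fraction = as.vector(top_n) / nrow(obs_df)
)
print(summary_df)
```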
You can also define much more complex value filters. For example:
- combine terms with & and |
- use the %in% operator to query on multiple values
# You can also do more complex queries, such as testing for inclusion in a list of values
obs_df <- human$obs$read(
column_names = c("cell_type_ontology_term_id"),
value_filter = "tissue_ontology_term_id %in% c('UBERON:0002082', 'UBERON:0002084', 'UBERON:0002080')"
)
obs_df <- as.data.frame(obs_df$concat())
# Summarize
top_10 <- sort(table(obs_df$cell_type_ontology_term_id), decreasing = TRUE)[1:10]
print(top_10)
#>
#> CL:0000746 CL:0008034 CL:0002131 CL:0002548 CL:0000115 CL:0000763 CL:0000057 CL:0000669
#> 160974 99458 96953 79733 79626 35560 33075 27515
#> unknown CL:0002144
#> 23613 18593
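Filter strings like the ones above are easy to mistype when written by hand, so it can help to assemble them from R vectors. The helpers below (filter_eq, filter_in, filter_and) are hypothetical names introduced for illustration, not part of cellxgene.census or TileDB-SOMA. They only build strings and assume that the values contain no single quotes.

```r
# Hypothetical helpers that build value_filter strings.

# column == 'value'
filter_eq <- function(column, value) {
  sprintf("%s == '%s'", column, value)
}

# column %in% c('v1', 'v2', ...)
filter_in <- function(column, values) {
  quoted <- paste0("'", values, "'", collapse = ", ")
  sprintf("%s %%in%% c(%s)", column, quoted)
}

# Combine several terms with &.
filter_and <- function(...) {
  paste(..., sep = " & ")
}

ventricle_terms <- c("UBERON:0002082", "UBERON:0002080")
value_filter <- filter_and(
  filter_in("tissue_ontology_term_id", ventricle_terms),
  filter_eq("cell_type", "macrophage")
)
print(value_filter)

# Usage with the Census (requires an open Census, see above):
# obs_df <- human$obs$read(
#   column_names = c("cell_type"),
#   value_filter = value_filter
# )
```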
Full Census metadata stats
This example queries all organisms in the Census, and summarizes the diversity of various metadata labels.
cols_to_query <- c(
"cell_type_ontology_term_id",
"assay_ontology_term_id",
"tissue_ontology_term_id"
)
total_cells <- 0
for (organism in census$get("census_data")$names()) {
print(organism)
obs_df <- census$get("census_data")$get(organism)$obs$read(column_names = cols_to_query)
obs_df <- as.data.frame(obs_df$concat())
total_cells <- total_cells + nrow(obs_df)
for (col in cols_to_query) {
cat(" Unique ", col, " values: ", length(unique(obs_df[[col]])), "\n")
}
}
#> [1] "homo_sapiens"
#> Unique cell_type_ontology_term_id values: 698
#> Unique assay_ontology_term_id values: 24
#> Unique tissue_ontology_term_id values: 267
#> [1] "mus_musculus"
#> Unique cell_type_ontology_term_id values: 364
#> Unique assay_ontology_term_id values: 11
#> Unique tissue_ontology_term_id values: 84
cat("Complete Census contains ", total_cells, " cells.")
#> Complete Census contains 115556140 cells.