Axis Query Example
Goal: demonstrate basic axis metadata handling.
The CZ CELLxGENE Census stores obs (cell) metadata in a SOMA DataFrame, which can be queried and read as an R data frame. The Census also has a convenience package which simplifies opening the census.
R data frames are in-memory objects. Take care that queries are small enough for results to fit in memory.
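One rough way to judge whether a result will fit in memory is to measure a small sample and extrapolate. The sketch below uses a toy stand-in for obs rather than the real Census; the column values and the total row count are assumptions for illustration only, and the linear extrapolation is approximate because R shares repeated strings in memory.

```r
# Toy stand-in for a slice of obs (values are illustrative, not read from the Census).
n_sample <- 1000
obs_sample <- data.frame(
  soma_joinid = seq_len(n_sample),
  cell_type_ontology_term_id = rep(c("CL:0000540", "CL:0000738"), length.out = n_sample),
  stringsAsFactors = FALSE
)

# Approximate bytes used per row in the sample.
bytes_per_row <- as.numeric(object.size(obs_sample)) / n_sample

# Assumed total row count, used only to illustrate the extrapolation.
n_total <- 60e6
estimated_gib <- bytes_per_row * n_total / 1024^3

cat(sprintf(
  "Roughly %.1f GiB for %s rows\n",
  estimated_gib, format(n_total, big.mark = ",", scientific = FALSE)
))
```

If the estimate approaches the available memory, requesting fewer columns or adding a value_filter is the usual remedy.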
Opening the census
The cellxgene.census R package contains a convenient API to open the latest version of the Census.
census <- cellxgene.census::open_soma()
You can learn more about the cellxgene.census methods by accessing their corresponding documentation, for example ?cellxgene.census::open_soma.
Summarize Census cell metadata
Tips:
- You can read an entire SOMA dataframe into R using as.data.frame(soma_df$read()).
- Queries will be much faster if you request only the DataFrame columns required for your analysis (e.g. column_names = c("soma_joinid", "cell_type_ontology_term_id")).
- You can also further refine query results by using a value_filter, which will filter the census for matching records.
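A value_filter is passed as a string, so it can be convenient to assemble it from variables. The helper below, build_eq_filter, is hypothetical (it is not part of cellxgene.census) and is only a minimal sketch of building an equality filter with base R.

```r
# Hypothetical helper (not part of cellxgene.census): build an equality
# filter string of the form "<column> == '<value>'".
build_eq_filter <- function(column, value) {
  sprintf("%s == '%s'", column, value)
}

lung_filter <- build_eq_filter("tissue_ontology_term_id", "UBERON:0002048")
print(lung_filter)
#> [1] "tissue_ontology_term_id == 'UBERON:0002048'"
```

The resulting string could then be supplied as the value_filter argument of a read() call, alongside column_names.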
Summarize all cell types
This example reads the cell metadata (obs) into an R data frame to summarize in a variety of ways.
human <- census$get("census_data")$get("homo_sapiens")
# Read obs into an R data frame (tibble).
obs_df <- as.data.frame(human$obs$read(
  column_names = c("soma_joinid", "cell_type_ontology_term_id")
))
# Find all unique values in the cell_type_ontology_term_id column.
unique_cell_type_ontology_term_id <- unique(obs_df$cell_type_ontology_term_id)
cat(paste(
  "There are",
  length(unique_cell_type_ontology_term_id),
  "cell types in the Census! The first few are:"
))
#> There are 604 cell types in the Census! The first few are:
head(unique_cell_type_ontology_term_id)
#> [1] "CL:0000540" "CL:0000738" "CL:0000763" "CL:0000136" "CL:0000235"
#> [6] "CL:0000115"
Summarize a subset of cell types, selected with a value_filter
This example utilizes a SOMA “value filter” to read the subset of cells with tissue_ontology_term_id equal to UBERON:0002048 (lung tissue), and summarizes the query result.
# Read cell_type terms for cells which have a specific tissue term
LUNG_TISSUE <- "UBERON:0002048"
obs_df <- as.data.frame(human$obs$read(
  column_names = c("cell_type_ontology_term_id"),
  value_filter = paste("tissue_ontology_term_id == '", LUNG_TISSUE, "'", sep = "")
))
# Find all unique values in the cell_type_ontology_term_id column.
unique_cell_type_ontology_term_id <- unique(obs_df$cell_type_ontology_term_id)
cat(paste(
  "There are ",
  length(unique_cell_type_ontology_term_id),
  " cell types in the Census where tissue_ontology_term_id == ",
  LUNG_TISSUE,
  "!\nThe first few are:",
  sep = ""
))
#> There are 185 cell types in the Census where tissue_ontology_term_id == UBERON:0002048!
#> The first few are:
head(unique_cell_type_ontology_term_id)
#> [1] "CL:0000003" "CL:4028004" "CL:0002145" "CL:0000625" "CL:0000624"
#> [6] "CL:4028006"
# Report the 10 most common
top_10 <- sort(table(obs_df$cell_type_ontology_term_id), decreasing = TRUE)[1:10]
cat(paste("The top 10 cell types where tissue_ontology_term_id ==", LUNG_TISSUE))
#> The top 10 cell types where tissue_ontology_term_id == UBERON:0002048
print(top_10)
#>
#> CL:0000003 CL:0000583 CL:0000625 CL:0000624 CL:0000235 CL:0002063 CL:0000860
#> 562038 526859 323433 323067 254173 246279 203526
#> CL:0000623 CL:0001064 CL:0002632
#> 164944 149067 132243
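Indexing a sorted table with [1:10] pads the result with NA entries when fewer than ten categories are present, which can happen with a narrow filter. The sketch below uses head() instead, on a toy vector of cell type terms that is made up for illustration and not read from the Census.

```r
# Toy cell type labels (illustrative values only).
cell_types <- c(
  "CL:0000003", "CL:0000583", "CL:0000003",
  "CL:0000625", "CL:0000003", "CL:0000583"
)

# Count occurrences and sort from most to least common.
counts <- sort(table(cell_types), decreasing = TRUE)

# head() returns at most 10 entries and never pads with NA.
top_n <- head(counts, 10)
print(names(top_n))
#> [1] "CL:0000003" "CL:0000583" "CL:0000625"

# Share of all cells in the toy vector that each reported type accounts for.
proportions <- as.vector(top_n) / length(cell_types)
```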
You can also define much more complex value filters. For example:
- combine terms with "and" and "or"
- use the %in% operator to query on multiple values
# You can also do more complex queries, such as testing for inclusion in a list of values
obs_df <- as.data.frame(human$obs$read(
  column_names = c("cell_type_ontology_term_id"),
  value_filter = "tissue_ontology_term_id %in% c('UBERON:0002082', 'UBERON:0002084', 'UBERON:0002080')"
))
# Summarize
top_10 <- sort(table(obs_df$cell_type_ontology_term_id), decreasing = TRUE)[1:10]
print(top_10)
#>
#> CL:0000746 CL:0008034 CL:0002548 CL:0000115 CL:0002131 CL:0000763 CL:0000669
#> 159096 84750 79618 64190 61830 32088 27515
#> CL:0000003 CL:0000057 CL:0002144
#> 22707 20117 18593
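When the list of values lives in an R vector, the %in% filter string can be generated rather than typed by hand, which helps avoid typos in ontology term IDs. The helper build_in_filter below is hypothetical (not part of cellxgene.census); it is a minimal sketch using base R string functions.

```r
# Hypothetical helper (not part of cellxgene.census): build a filter string
# of the form "<column> %in% c('<v1>', '<v2>', ...)".
build_in_filter <- function(column, values) {
  quoted <- paste0("'", values, "'", collapse = ", ")
  # "%%" is a literal percent sign in sprintf().
  sprintf("%s %%in%% c(%s)", column, quoted)
}

tissues <- c("UBERON:0002082", "UBERON:0002084", "UBERON:0002080")
in_filter <- build_in_filter("tissue_ontology_term_id", tissues)
print(in_filter)
#> [1] "tissue_ontology_term_id %in% c('UBERON:0002082', 'UBERON:0002084', 'UBERON:0002080')"
```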
Full census stats
This example queries all organisms in the Census, and summarizes the diversity of various metadata labels.
cols_to_query <- c(
  "cell_type_ontology_term_id",
  "assay_ontology_term_id",
  "tissue_ontology_term_id"
)
total_cells <- 0
for (organism in census$get("census_data")$names()) {
  print(organism)
  obs_df <- as.data.frame(
    census$get("census_data")$get(organism)$obs$read(column_names = cols_to_query)
  )
  total_cells <- total_cells + nrow(obs_df)
  for (col in cols_to_query) {
    cat(paste(" Unique ", col, " values: ", length(unique(obs_df[[col]])), "\n", sep = ""))
  }
}
#> [1] "homo_sapiens"
#> Unique cell_type_ontology_term_id values: 604
#> Unique assay_ontology_term_id values: 20
#> Unique tissue_ontology_term_id values: 227
#> [1] "mus_musculus"
#> Unique cell_type_ontology_term_id values: 226
#> Unique assay_ontology_term_id values: 9
#> Unique tissue_ontology_term_id values: 51
cat(paste("Complete Census contains", total_cells, "cells."))
#> Complete Census contains 60361716 cells.
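The inner loop over columns can also be written as a single vapply() call that returns a named vector of unique-value counts. The sketch below runs on a toy data frame whose values are made up for illustration; it only shows the counting step, not the Census query.

```r
# Toy stand-in for obs with the three queried columns (illustrative values only).
toy_obs <- data.frame(
  cell_type_ontology_term_id = c("CL:0000540", "CL:0000738", "CL:0000540"),
  assay_ontology_term_id = c("EFO:0009922", "EFO:0009922", "EFO:0009899"),
  tissue_ontology_term_id = c("UBERON:0002048", "UBERON:0002048", "UBERON:0002048"),
  stringsAsFactors = FALSE
)

# Count the unique values in each column; the result is a named integer vector.
unique_counts <- vapply(toy_obs, function(col) length(unique(col)), integer(1))

for (col in names(unique_counts)) {
  cat(paste(" Unique ", col, " values: ", unique_counts[[col]], "\n", sep = ""))
}
```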