Axis Query Example

Goal: demonstrate basic axis metadata handling.

The CZ CELLxGENE Census stores obs (cell) metadata in a SOMA DataFrame, which can be queried and read into an R data frame. The cellxgene.census R package is a convenience package that simplifies opening the Census.

R data frames are in-memory objects. Take care that queries are small enough for results to fit in memory.
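
One rough way to judge whether a result will fit in memory is to measure a small sample and extrapolate. The sketch below uses a synthetic data frame as a stand-in for a query result: the column names mirror the Census obs columns, but the values, the sample size, and the target row count are assumptions chosen for illustration.

```r
# Synthetic stand-in for a small query result (values are made up).
sample_df <- data.frame(
  soma_joinid = seq_len(1000),
  cell_type_ontology_term_id = rep(c("CL:0000540", "CL:0000738"), 500),
  stringsAsFactors = FALSE
)

# Measure the in-memory size of the sample in bytes.
sample_bytes <- as.numeric(object.size(sample_df))

# Extrapolate linearly to a hypothetical result of one million rows.
# This is only an estimate; actual memory use depends on column types
# and on how many distinct strings the result contains.
target_rows <- 1e6
estimated_bytes <- sample_bytes / nrow(sample_df) * target_rows
cat(sprintf("Estimated size: %.1f MB\n", estimated_bytes / 1024^2))
```

If the estimate approaches the available memory, requesting fewer columns or adding a value filter reduces the result size.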

Opening the census

The cellxgene.census R package contains a convenient API to open the latest version of the Census.

census <- cellxgene.census::open_soma()

You can learn more about the cellxgene.census functions by reading their documentation. For example, run ?cellxgene.census::open_soma.

Summarize Census cell metadata

Tips:

  • You can read an entire SOMA DataFrame into R using as.data.frame(soma_df$read()).
  • Queries will be much faster if you request only the DataFrame columns required for your analysis (e.g. column_names = c("soma_joinid", "cell_type_ontology_term_id")).
  • You can refine query results further with a value_filter, which restricts the result to matching records.

Summarize all cell types

This example reads the cell metadata (obs) into an R data frame to summarize in a variety of ways.

human <- census$get("census_data")$get("homo_sapiens")

# Read obs into an R data frame (tibble).
obs_df <- as.data.frame(human$obs$read(
  column_names = c("soma_joinid", "cell_type_ontology_term_id")
))

# Find all unique values in the cell_type_ontology_term_id column.
unique_cell_type_ontology_term_id <- unique(obs_df$cell_type_ontology_term_id)

cat(paste(
  "There are",
  length(unique_cell_type_ontology_term_id),
  "cell types in the Census! The first few are:"
))
#> There are 604 cell types in the Census! The first few are:
head(unique_cell_type_ontology_term_id)
#> [1] "CL:0000540" "CL:0000738" "CL:0000763" "CL:0000136" "CL:0000235"
#> [6] "CL:0000115"

Summarize a subset of cell types, selected with a value_filter

This example uses a SOMA “value filter” to read the subset of cells with tissue_ontology_term_id equal to UBERON:0002048 (lung tissue), and summarizes the query result.

# Read cell_type terms for cells which have a specific tissue term
LUNG_TISSUE <- "UBERON:0002048"

obs_df <- as.data.frame(human$obs$read(
  column_names = c("cell_type_ontology_term_id"),
  value_filter = paste("tissue_ontology_term_id == '", LUNG_TISSUE, "'", sep = "")
))

# Find all unique values in the cell_type_ontology_term_id column.
unique_cell_type_ontology_term_id <- unique(obs_df$cell_type_ontology_term_id)
cat(paste(
  "There are ",
  length(unique_cell_type_ontology_term_id),
  " cell types in the Census where tissue_ontology_term_id == ",
  LUNG_TISSUE,
  "!\nThe first few are:",
  sep = ""
))
#> There are 185 cell types in the Census where tissue_ontology_term_id == UBERON:0002048!
#> The first few are:
head(unique_cell_type_ontology_term_id)
#> [1] "CL:0000003" "CL:4028004" "CL:0002145" "CL:0000625" "CL:0000624"
#> [6] "CL:4028006"

# Report the 10 most common cell types
top_10 <- sort(table(obs_df$cell_type_ontology_term_id), decreasing = TRUE)[1:10]
cat(paste("The top 10 cell types where tissue_ontology_term_id ==", LUNG_TISSUE))
#> The top 10 cell types where tissue_ontology_term_id == UBERON:0002048
print(top_10)
#> 
#> CL:0000003 CL:0000583 CL:0000625 CL:0000624 CL:0000235 CL:0002063 CL:0000860 
#>     562038     526859     323433     323067     254173     246279     203526 
#> CL:0000623 CL:0001064 CL:0002632 
#>     164944     149067     132243

You can also define much more complex value filters. For example:

  • combine terms with logical AND and OR
  • use the %in% operator to query on multiple values

# Test for inclusion in a list of values
obs_df <- as.data.frame(human$obs$read(
  column_names = c("cell_type_ontology_term_id"),
  value_filter = "tissue_ontology_term_id %in% c('UBERON:0002082', 'UBERON:0002084', 'UBERON:0002080')"
))

# Summarize
top_10 <- sort(table(obs_df$cell_type_ontology_term_id), decreasing = TRUE)[1:10]
print(top_10)
#> 
#> CL:0000746 CL:0008034 CL:0002548 CL:0000115 CL:0002131 CL:0000763 CL:0000669 
#>     159096      84750      79618      64190      61830      32088      27515 
#> CL:0000003 CL:0000057 CL:0002144 
#>      22707      20117      18593
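
Typing long %in% filters by hand invites typos in the ontology term IDs. One option is to build the filter string from a character vector. The helper below, build_in_filter, is hypothetical and not part of cellxgene.census or tiledbsoma; it only assembles a string in the same form as the filter used above.

```r
# Hypothetical helper (not part of cellxgene.census or tiledbsoma):
# builds a "<column> %in% c('a', 'b')" filter string from a character vector.
build_in_filter <- function(column, values) {
  # Reject empty input and values containing a single quote,
  # since a quote inside a value would break the quoting.
  stopifnot(
    is.character(values),
    length(values) > 0,
    !any(grepl("'", values, fixed = TRUE))
  )
  quoted <- paste0("'", values, "'", collapse = ", ")
  # "%%" is a literal percent sign in sprintf().
  sprintf("%s %%in%% c(%s)", column, quoted)
}

tissue_terms <- c("UBERON:0002082", "UBERON:0002084", "UBERON:0002080")
tissue_filter <- build_in_filter("tissue_ontology_term_id", tissue_terms)
cat(tissue_filter, "\n")
```

The resulting string can be passed as the value_filter argument of the read() call, and the term vector can be inspected or validated before any query runs.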

Full census stats

This example queries all organisms in the Census, and summarizes the diversity of various metadata labels.

cols_to_query <- c(
  "cell_type_ontology_term_id",
  "assay_ontology_term_id",
  "tissue_ontology_term_id"
)

total_cells <- 0
for (organism in census$get("census_data")$names()) {
  print(organism)
  obs_df <- as.data.frame(
    census$get("census_data")$get(organism)$obs$read(column_names = cols_to_query)
  )
  total_cells <- total_cells + nrow(obs_df)
  for (col in cols_to_query) {
    cat(paste("  Unique ", col, " values: ", length(unique(obs_df[[col]])), "\n", sep = ""))
  }
}
#> [1] "homo_sapiens"
#>   Unique cell_type_ontology_term_id values: 604
#>   Unique assay_ontology_term_id values: 20
#>   Unique tissue_ontology_term_id values: 227
#> [1] "mus_musculus"
#>   Unique cell_type_ontology_term_id values: 226
#>   Unique assay_ontology_term_id values: 9
#>   Unique tissue_ontology_term_id values: 51
cat(paste("Complete Census contains", total_cells, "cells."))
#> Complete Census contains 60361716 cells.
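
The inner loop over columns can also be written as a single vapply() call that returns a named vector of counts, which is easier to reuse in later steps. A minimal sketch on a synthetic data frame (the rows below are made up and stand in for a real obs query result):

```r
# Synthetic stand-in for an obs data frame.
toy_obs <- data.frame(
  cell_type_ontology_term_id = c("CL:0000540", "CL:0000738", "CL:0000540"),
  assay_ontology_term_id = c("EFO:0009922", "EFO:0009922", "EFO:0009899"),
  tissue_ontology_term_id = c("UBERON:0002048", "UBERON:0002048", "UBERON:0002048"),
  stringsAsFactors = FALSE
)

# Count distinct values per column; the result is a named integer vector.
unique_counts <- vapply(toy_obs, function(col) length(unique(col)), integer(1))
print(unique_counts)
```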