Summarizing cell and gene metadata

This vignette provides examples of basic axis metadata handling. The CZ CELLxGENE Census stores obs (cell) and var (gene) metadata in SOMA DataFrame objects via the TileDB-SOMA API (documentation), which can be queried and read into R data frames. The Census also has a convenience package, cellxgene.census, which simplifies opening the Census.

R data frames are in-memory objects. Take care that queries are small enough for results to fit in memory.

Contents

Opening the Census
Summarizing cell metadata
- Example: Summarize all cell types
- Example: Summarize a subset of cell types, selected with a value_filter
Full Census metadata stats

Opening the Census

The cellxgene.census R package contains a convenient API to open any version of the Census (by default, the newest stable version).

library("cellxgene.census")
census <- open_soma()

If you open the Census, you should close it with census$close(). This can be automated using on.exit(census$close(), add = TRUE) immediately after census <- open_soma().

You can learn more about the cellxgene.census methods by reading their documentation, for example ?cellxgene.census::open_soma.
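The open/close pairing described above can be wrapped in a small helper, so that the Census is closed even when the work in between fails. This is a minimal sketch, not part of cellxgene.census: `with_census` and its `opener` argument are hypothetical names introduced here for illustration. Because `on.exit()` runs its expression when the enclosing function exits, the pattern is shown inside a function.

```r
# Hypothetical helper (not part of cellxgene.census): open the Census,
# run `fn` on it, and always close it afterwards.
# `opener` defaults to cellxgene.census::open_soma; it is exposed as an
# argument only so the pattern can be tried without network access.
with_census <- function(fn, opener = cellxgene.census::open_soma) {
  census <- opener()
  # Registered immediately after opening, so it also runs if fn() errors.
  on.exit(census$close(), add = TRUE)
  fn(census)
}

# Example call (requires the package and network access):
# organisms <- with_census(function(census) census$get("census_data")$names())
```

The default for `opener` is only evaluated when no other opener is supplied, so the helper can be exercised with a stand-in object that has a `close()` function.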

Summarizing cell metadata

Once the Census is open, you can use its TileDB-SOMA methods, because the Census is itself a SOMACollection. You can thus access the SOMADataFrame objects encoding cell and gene metadata.

Tips:

- You can read an entire SOMADataFrame into R using as.data.frame(soma_df$read()$concat()).
- Queries will be much faster if you request only the DataFrame columns required for your analysis (e.g. column_names = c("soma_joinid", "cell_type_ontology_term_id")).
- You can further refine query results by using a value_filter, which will filter the Census for matching records.
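Besides query speed, requesting fewer columns also keeps the resulting R data frame smaller, which matters because the result has to fit in memory. The sketch below illustrates this with `object.size()` on a toy data frame; `toy_obs` and its values are made up and are not Census data.

```r
# Toy stand-in for a cell metadata table; the values are made up.
n <- 10000
toy_obs <- data.frame(
  soma_joinid = seq_len(n),
  cell_type = rep(c("neuron", "macrophage"), length.out = n),
  tissue = rep(c("lung", "brain", "heart", "liver"), length.out = n),
  stringsAsFactors = FALSE
)

# Memory used by the full table versus only the column needed for the analysis.
full_size <- object.size(toy_obs)
subset_size <- object.size(toy_obs[, "cell_type", drop = FALSE])

cat("All columns:   ", format(full_size, units = "auto"), "\n")
cat("cell_type only:", format(subset_size, units = "auto"), "\n")
```

The same check can be run on a small query result before scaling up to a larger one.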

Example: Summarize all cell types

This example reads the cell metadata (obs) into an R data frame to summarize in a variety of ways.

human <- census$get("census_data")$get("homo_sapiens")

# Read the cell_type column of obs into an R data frame.
obs_df <- human$obs$read(column_names = c("cell_type"))
obs_df <- as.data.frame(obs_df$concat())

# Find all unique values in the cell_type column.
unique_cell_type <- unique(obs_df$cell_type)

cat(
  "There are",
  length(unique_cell_type),
  "cell types in the Census! The first few are: ",
  paste(head(unique_cell_type), collapse = ", ")
)
#> There are 698 cell types in the Census! The first few are:  plasma cell, mature B cell, macrophage, follicular dendritic cell, plasmacytoid dendritic cell, conventional dendritic cell

Example: Summarize a subset of cell types, selected with a value_filter

This example uses a SOMA “value filter” to read the subset of cells with tissue_ontology_term_id equal to UBERON:0002048 (lung tissue), and summarizes the query result.

# Read cell_type terms for cells which have a specific tissue term
LUNG_TISSUE <- "UBERON:0002048"

obs_df <- human$obs$read(
  column_names = c("cell_type"),
  value_filter = paste0("tissue_ontology_term_id == '", LUNG_TISSUE, "'")
)
obs_df <- as.data.frame(obs_df$concat())

# Find all unique values in the cell_type column (a character vector).
unique_cell_type <- unique(obs_df$cell_type)
cat(
  "There are ",
  length(unique_cell_type),
  " cell types in the Census where tissue_ontology_term_id == ",
  LUNG_TISSUE,
  "!\nThe first few are:",
  paste(head(unique_cell_type), collapse = ", "),
  "\n"
)
#> There are  202  cell types in the Census where tissue_ontology_term_id ==  UBERON:0002048 !
#> The first few are: Schwann cell, immature Schwann cell, neuron, Schwann cell precursor, neural progenitor cell, lung ciliated cell

# Report the 10 most common
top_10 <- sort(table(obs_df$cell_type), decreasing = TRUE)[1:10]
cat(
  "The top 10 cell types where tissue_ontology_term_id ==",
  LUNG_TISSUE,
  "are: ",
  paste(names(top_10), collapse = ", ")
)
#> The top 10 cell types where tissue_ontology_term_id == UBERON:0002048 are:  unknown, alveolar macrophage, CD8-positive, alpha-beta T cell, CD4-positive, alpha-beta T cell, macrophage, type II pneumocyte, classical monocyte, natural killer cell, malignant cell, epithelial cell of lower respiratory tract
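A summary like the one above can also be kept as a small data frame with counts and fractions, which is easier to reuse than printed text. The sketch below uses a made-up stand-in (`obs_toy`) rather than a real query result. It also caps the number of rows at the number of distinct values, because indexing a table with `[1:10]` yields `NA` entries when fewer than 10 values exist.

```r
# Toy stand-in for the obs_df returned by a query; the values are made up.
obs_toy <- data.frame(
  cell_type = c("macrophage", "neuron", "macrophage",
                "T cell", "macrophage", "neuron"),
  stringsAsFactors = FALSE
)

# Count cells per cell type, most common first.
counts <- sort(table(obs_toy$cell_type), decreasing = TRUE)

# Keep at most n_top entries, never more than the table actually contains.
n_top <- 2
top <- counts[seq_len(min(n_top, length(counts)))]

# One row per cell type: name, cell count, and fraction of all cells.
summary_df <- data.frame(
  cell_type = names(top),
  n_cells = as.integer(top),
  fraction = as.numeric(top) / nrow(obs_toy),
  stringsAsFactors = FALSE
)
print(summary_df)
```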

You can also define much more complex value filters. For example:

- combine terms with & and |
- use the %in% operator to query on multiple values

# You can also do more complex queries, such as testing for inclusion in a list of values
obs_df <- human$obs$read(
  column_names = c("cell_type_ontology_term_id"),
  value_filter = "tissue_ontology_term_id %in% c('UBERON:0002082', 'UBERON:0002084', 'UBERON:0002080')"
)

obs_df <- as.data.frame(obs_df$concat())

# Summarize
top_10 <- sort(table(obs_df$cell_type_ontology_term_id), decreasing = TRUE)[1:10]
print(top_10)
#> 
#> CL:0000746 CL:0008034 CL:0002131 CL:0002548 CL:0000115 CL:0000763 CL:0000057 CL:0000669 
#>     160974      99458      96953      79733      79626      35560      33075      27515 
#>    unknown CL:0002144 
#>      23613      18593
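Writing long filter strings by hand is error-prone, since a single mistyped term ID silently matches nothing. As a minimal sketch, the filter text can be assembled from a character vector instead. `in_filter` is a hypothetical helper introduced here, not part of cellxgene.census or TileDB-SOMA, and it assumes the values contain no single quotes.

```r
# Hypothetical helper: build "<column> %in% c('v1', 'v2', ...)".
# Assumes the values contain no single quotes.
in_filter <- function(column, values) {
  quoted <- paste0("'", values, "'", collapse = ", ")
  paste0(column, " %in% c(", quoted, ")")
}

tissues <- c("UBERON:0002082", "UBERON:0002084", "UBERON:0002080")
tissue_filter <- in_filter("tissue_ontology_term_id", tissues)

# Combine with a second condition using &.
# The cell type term CL:0000746 is taken from the output above.
combined_filter <- paste(
  tissue_filter,
  "&",
  "cell_type_ontology_term_id == 'CL:0000746'"
)

cat(combined_filter, "\n")
```

The resulting string is intended to be passed as the value_filter argument of a read, in the same way as the hand-written filters above.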

Full Census metadata stats

This example queries all organisms in the Census, and summarizes the diversity of various metadata labels.

cols_to_query <- c(
  "cell_type_ontology_term_id",
  "assay_ontology_term_id",
  "tissue_ontology_term_id"
)

total_cells <- 0
for (organism in census$get("census_data")$names()) {
  print(organism)

  obs_df <- census$get("census_data")$get(organism)$obs$read(column_names = cols_to_query)
  obs_df <- as.data.frame(obs_df$concat())

  total_cells <- total_cells + nrow(obs_df)
  for (col in cols_to_query) {
    cat("  Unique ", col, " values: ", length(unique(obs_df[[col]])), "\n")
  }
}
#> [1] "homo_sapiens"
#>   Unique  cell_type_ontology_term_id  values:  698 
#>   Unique  assay_ontology_term_id  values:  24 
#>   Unique  tissue_ontology_term_id  values:  267 
#> [1] "mus_musculus"
#>   Unique  cell_type_ontology_term_id  values:  364 
#>   Unique  assay_ontology_term_id  values:  11 
#>   Unique  tissue_ontology_term_id  values:  84
cat("Complete Census contains ", total_cells, " cells.")
#> Complete Census contains  115556140  cells.
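If the per-organism statistics are needed for further analysis rather than only printed, they can be collected into one data frame. The sketch below shows the idea on made-up stand-ins for the per-organism obs data frames; `obs_by_organism` and its contents are illustrative and are not Census data.

```r
# Toy stand-ins for the per-organism obs data frames; the values are made up.
obs_by_organism <- list(
  organism_a = data.frame(
    cell_type_ontology_term_id = c("CL:1", "CL:2", "CL:1"),
    assay_ontology_term_id = c("EFO:1", "EFO:1", "EFO:1"),
    stringsAsFactors = FALSE
  ),
  organism_b = data.frame(
    cell_type_ontology_term_id = c("CL:1", "CL:3"),
    assay_ontology_term_id = c("EFO:1", "EFO:2"),
    stringsAsFactors = FALSE
  )
)

# One row per organism: cell count plus number of unique values per column.
stats <- do.call(rbind, lapply(names(obs_by_organism), function(organism) {
  obs_df <- obs_by_organism[[organism]]
  n_unique <- vapply(obs_df, function(x) length(unique(x)), integer(1))
  cbind(
    data.frame(organism = organism, n_cells = nrow(obs_df),
               stringsAsFactors = FALSE),
    as.data.frame(as.list(n_unique))
  )
}))

print(stats)
cat("Total cells:", sum(stats$n_cells), "\n")
```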

Close the census

After use, the census object should be closed to release memory and other resources.

census$close()