Census supports categoricals for cell metadata

Published: April 4th, 2024

By: Emanuele Bezzi & Pablo Garcia-Nieto

Starting with the 2024-04-01 Census build, a subset of the columns in the obs dataframe is encoded as categorical instead of string.

Overall, users will observe a smaller memory footprint when loading Census data into memory. 🚀

However, this change may break some existing pipelines, as explained below.

Potential breaking changes

For Python users, note that pandas encodes these columns as pandas.Categorical, so some downstream operations may need to be adapted. See this link for more details. In particular:

Series methods like Series.value_counts() will use all categories, even if some categories are not present in the data

and

DataFrame methods like sum, groupby, pivot, value_counts also show “unused” categories when observed=False, which is the default in pandas 2.x (newer pandas releases have deprecated that default in favor of observed=True).
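The two behaviors above can be reproduced on a small hand-made dataframe. This is a minimal sketch, not Census code: the toy obs table, its values, and the n_genes column are assumptions made up for illustration. It shows three common ways to adapt a pipeline: dropping unused categories, passing observed=True explicitly, or converting a column back to strings.

```python
import pandas as pd

# Toy stand-in for a slice of obs; the values and the n_genes column are illustrative only.
obs = pd.DataFrame(
    {
        "cell_type": pd.Categorical(
            ["T cell", "T cell", "B cell"],
            categories=["B cell", "T cell", "neuron"],  # "neuron" is never used
        ),
        "n_genes": [100, 200, 300],
    }
)

# value_counts() reports every category, including the unused "neuron" (count 0).
counts_all = obs["cell_type"].value_counts()

# Option 1: drop unused categories before counting.
counts_used = obs["cell_type"].cat.remove_unused_categories().value_counts()

# Option 2: pass observed=True so groupby only returns categories present in the data.
sums_observed = obs.groupby("cell_type", observed=True)["n_genes"].sum()

# Option 3: convert the column back to plain strings to restore the old behavior,
# at the cost of the memory savings.
obs_str = obs.astype({"cell_type": str})
```

Passing observed explicitly, rather than relying on the default, keeps the result stable across pandas versions.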

For R users, note that these columns will be encoded as factors, and downstream operations may similarly need to be adapted. See this link for more details.

For Python and R users interfacing with Arrow, these columns will be encoded as dictionary. See more details for R in this link and for Python in this link.

Identifying the `obs` columns encoded as categorical

Users can always check the type of each cell metadata variable by inspecting the schema of obs. Categoricals will be shown as dictionary.

In Python:

import cellxgene_census
census = cellxgene_census.open_soma(census_version="latest")
census["census_data"]["homo_sapiens"].obs.schema

# soma_joinid: int64 not null
# dataset_id: dictionary<values=string, indices=int16, ordered=0> not null
# assay: dictionary<values=string, indices=int8, ordered=0> not null
# assay_ontology_term_id: dictionary<values=string, indices=int8, ordered=0> not null
# cell_type: dictionary<values=string, indices=int16, ordered=0> not null
# cell_type_ontology_term_id: dictionary<values=string, indices=int16, ordered=0> not null
# development_stage: dictionary<values=string, indices=int16, ordered=0> not null
# development_stage_ontology_term_id: dictionary<values=string, indices=int16, 
# [OUTPUT TRUNCATED]

In R:

library("cellxgene.census")
census = open_soma(census_version="latest")
census$get("census_data")$get("homo_sapiens")$obs$schema()

# Schema
# soma_joinid: int64 not null
# dataset_id: dictionary<values=string, indices=int16> not null
# assay: dictionary<values=string, indices=int8> not null
# assay_ontology_term_id: dictionary<values=string, indices=int8> not null
# cell_type: dictionary<values=string, indices=int16> not null
# cell_type_ontology_term_id: dictionary<values=string, indices=int16> not null
# development_stage: dictionary<values=string, indices=int16> not null
# development_stage_ontology_term_id: dictionary<values=string, indices=int16> not null
# [OUTPUT TRUNCATED]
