Learning about the CZ CELLxGENE Census

This notebook showcases the Census contents and how to obtain high-level information about it. It covers the organization of data within the Census, what cell and gene metadata are available, and it provides simple demonstrations to summarize cell counts across cell metadata.

Contents

  • Opening the Census

  • Census organization

  • Cell metadata

  • Gene metadata

  • Census summary content tables

  • Understanding Census contents beyond the summary tables

⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable is_primary_data which is described in the Census schema.

Opening the Census

The cellxgene_census Python package contains a convenient API to open the latest version of the Census. If you open the Census, you should close it. open_soma() returns a context manager, so you can open and close it in several ways, much like a Python file handle. The context manager form is preferred, as it closes the Census automatically even if an error is raised.

[1]:
import cellxgene_census

# Preferred: use a Python context manager
with cellxgene_census.open_soma() as census:
    ...

# or
census = cellxgene_census.open_soma()
...
census.close()
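To see why the context manager form is preferred, here is a minimal sketch that uses a hypothetical stand-in class, FakeCensus, instead of a real Census handle (so it runs without network access). It only assumes that the handle exposes close() and the context manager protocol, as described above.

```python
class FakeCensus:
    """Hypothetical stand-in for the handle returned by open_soma()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow the exception


# Context manager: the handle is closed even though an error is raised.
handle = FakeCensus()
try:
    with handle as census:
        raise RuntimeError("simulated failure while reading")
except RuntimeError:
    pass
print(handle.closed)  # True

# Manual open/close: the error skips close(), so the handle stays open.
leaked = FakeCensus()
try:
    census = leaked
    raise RuntimeError("simulated failure while reading")
    census.close()  # never reached
except RuntimeError:
    pass
print(leaked.closed)  # False
```

With the manual form you would need your own try/finally block to get the same guarantee.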

You can learn more about the cellxgene_census methods by accessing their corresponding documentation via help(). For example, help(cellxgene_census.open_soma).

[2]:
census = cellxgene_census.open_soma(census_version="2025-11-08")

Census organization

The Census schema defines the structure of the Census. In short, you can think of the Census as a structured collection of items that stores different pieces of information. All of these items and the parent collection are SOMA objects of various types and can all be accessed with the TileDB-SOMA API (documentation).

The cellxgene_census package contains some convenient wrappers of the TileDB-SOMA API. An example of this is the function we used to open the Census: cellxgene_census.open_soma()

Main Census components

With the command above you created census, which is a SOMACollection. It is analogous to a Python dictionary, and it has two items: census_info and census_data.

Census summary info

  • census["census_info"]: A collection of tables providing information about the Census as a whole.

    • census["census_info"]["summary"]: A data frame with high-level information of this Census, e.g. build date, total cell count, etc.

    • census["census_info"]["datasets"]: A data frame with all datasets from CELLxGENE Discover used to create the Census.

    • census["census_info"]["summary_cell_counts"]: A data frame with cell counts stratified by relevant cell metadata.

Census data

Data for each organism is stored in independent SOMAExperiment objects, which are a specialized form of a SOMACollection. Each of these stores a data matrix (cells by genes), cell metadata, gene metadata, and some other useful components not covered in this notebook.

This is how the data is organized for one organism – Homo sapiens:

  • census["census_data"]["homo_sapiens"].obs: Cell metadata

  • census["census_data"]["homo_sapiens"].ms["RNA"].X: Data matrices; currently only raw counts exist, in X["raw"]

  • census["census_data"]["homo_sapiens"].ms["RNA"].var: Gene metadata
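The layout described above can be pictured as a nested mapping. The sketch below uses a plain Python dict as a stand-in for the Census (it is not a SOMA object, and the None leaves are placeholders for data frames and matrices), purely to show how the pieces nest. Only the human branch is listed.

```python
# Stand-in for the Census layout; the leaves would be SOMA data frames or matrices.
layout = {
    "census_info": {
        "summary": None,
        "datasets": None,
        "summary_cell_counts": None,
    },
    "census_data": {
        "homo_sapiens": {
            "obs": None,  # cell metadata
            "ms": {"RNA": {"X": {"raw": None}, "var": None}},
        },
        # other organisms follow the same structure
    },
}


def tree_lines(node, name, depth=0):
    """Return one indented line per item in a nested dict."""
    lines = ["  " * depth + name]
    if isinstance(node, dict):
        for key, child in node.items():
            lines.extend(tree_lines(child, key, depth + 1))
    return lines


lines = tree_lines(layout, "census")
print("\n".join(lines))
```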

Cell metadata

You can obtain all cell metadata variables by directly querying the columns of the corresponding SOMADataFrame.

All of these variables can be used for querying the Census in case you want to work with specific cells.

[3]:
keys = list(census["census_data"]["homo_sapiens"].obs.keys())

keys
[3]:
['soma_joinid',
 'dataset_id',
 'assay',
 'assay_ontology_term_id',
 'cell_type',
 'cell_type_ontology_term_id',
 'development_stage',
 'development_stage_ontology_term_id',
 'disease',
 'disease_ontology_term_id',
 'donor_id',
 'is_primary_data',
 'observation_joinid',
 'self_reported_ethnicity',
 'self_reported_ethnicity_ontology_term_id',
 'sex',
 'sex_ontology_term_id',
 'suspension_type',
 'tissue',
 'tissue_ontology_term_id',
 'tissue_type',
 'tissue_general',
 'tissue_general_ontology_term_id',
 'raw_sum',
 'nnz',
 'raw_mean_nnz',
 'raw_variance_nnz',
 'n_measured_vars']

All of these variables are defined in the CELLxGENE dataset schema except for the following:

  • soma_joinid: a SOMA-defined value used for join operations.

  • dataset_id: the dataset ID as encoded in census["census_info"]["datasets"].

  • tissue_general and tissue_general_ontology_term_id: the high-level tissue mapping.

Gene metadata

Similarly, we can obtain all gene metadata variables by directly querying the columns of the corresponding SOMADataFrame.

These are the variables you can use for querying the Census in case there are specific genes you are interested in.

[4]:
keys = list(census["census_data"]["homo_sapiens"].ms["RNA"].var.keys())

keys
[4]:
['soma_joinid',
 'feature_id',
 'feature_name',
 'feature_type',
 'feature_length',
 'nnz',
 'n_measured_obs']

All of these variables are defined in the CELLxGENE dataset schema except for the following:

  • soma_joinid: a SOMA-defined value used for join operations.

  • feature_length: the length in base pairs of the gene.

Census summary content tables

You can take a quick look at the high-level Census information by looking at census["census_info"]["summary"]:

[5]:
census_info = census["census_info"]["summary"].read().concat().to_pandas()

census_info
[5]:
soma_joinid label value
0 0 census_schema_version 2.4.0
1 1 census_build_date 2025-11-08
2 2 dataset_schema_version 7.0.0
3 3 total_cell_count 217768036
4 4 unique_cell_count 125463259

Of special interest are the following label-value combinations:

  • total_cell_count is the total number of cells in the Census.

  • unique_cell_count is the number of unique cells, as some cells may be present more than once due to meta-analysis or consortia-like data.
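As a small worked example, the two counts in the summary table shown in this section tell you how many cell records are duplicates. The sketch below copies the label-value pairs into plain Python and assumes the values are stored as strings, since the column mixes version numbers and counts.

```python
# Label-value pairs copied from the summary table of this Census build.
summary_rows = [
    ("census_schema_version", "2.4.0"),
    ("census_build_date", "2025-11-08"),
    ("dataset_schema_version", "7.0.0"),
    ("total_cell_count", "217768036"),
    ("unique_cell_count", "125463259"),
]

summary = dict(summary_rows)
total = int(summary["total_cell_count"])
unique = int(summary["unique_cell_count"])
duplicates = total - unique

print(f"{duplicates:,} duplicate cells ({duplicates / total:.1%} of all cells)")
# 92,304,777 duplicate cells (42.4% of all cells)
```

This is why filtering on is_primary_data matters whenever you count cells.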

Cell counts by cell metadata

By looking at census["census_info"]["summary_cell_counts"] you can get a general idea of cell counts stratified by some relevant cell metadata. Not all cell metadata is included in this table; all available cell and gene metadata are listed in the sections “Cell metadata” and “Gene metadata” above.

The line below retrieves this table and casts it into a pandas.DataFrame.

[6]:
census_counts = census["census_info"]["summary_cell_counts"].read().concat().to_pandas()

census_counts
[6]:
soma_joinid organism category label ontology_term_id total_cell_count unique_cell_count
0 0 callithrix_jacchus all na na 2275451 1712738
1 1 callithrix_jacchus assay 10x 3' v3 EFO:0009922 2275451 1712738
2 2 callithrix_jacchus cell_type ependymal cell CL:0000065 19113 19113
3 3 callithrix_jacchus cell_type T cell CL:0000084 113 113
4 4 callithrix_jacchus cell_type endothelial cell CL:0000115 42093 41320
... ... ... ... ... ... ... ...
2615 2615 pan_troglodytes sex female PATO:0000383 78086 78086
2616 2616 pan_troglodytes sex male PATO:0000384 80013 80013
2617 2617 pan_troglodytes suspension_type nucleus na 158099 158099
2618 2618 pan_troglodytes tissue dorsolateral prefrontal cortex UBERON:0009834 158099 158099
2619 2619 pan_troglodytes tissue_general brain UBERON:0000955 158099 158099

2620 rows × 7 columns

For each combination of organism, category, and value, total_cell_count and unique_cell_count give the cell counts of that combination.

The values for each category are specified in ontology_term_id and label, which are the value’s IDs and labels, respectively.

Example: cell metadata included in the summary counts table

To get all the available cell metadata in the summary counts table you can do the following. Remember this is not all the cell metadata available, as some variables were omitted in the creation of this table.

[7]:
census_counts[["organism", "category"]].value_counts(sort=False)
[7]:
organism            category
callithrix_jacchus  all                          1
                    assay                        1
                    cell_type                   40
                    disease                      1
                    self_reported_ethnicity      1
                    sex                          2
                    suspension_type              1
                    tissue                      33
                    tissue_general               1
homo_sapiens        all                          1
                    assay                       39
                    cell_type                  903
                    disease                    261
                    self_reported_ethnicity     37
                    sex                          3
                    suspension_type              1
                    tissue                     423
                    tissue_general              71
macaca_mulatta      all                          1
                    assay                        2
                    cell_type                   54
                    disease                      1
                    self_reported_ethnicity      1
                    sex                          3
                    suspension_type              1
                    tissue                      29
                    tissue_general               2
mus_musculus        all                          1
                    assay                       18
                    cell_type                  492
                    disease                     18
                    self_reported_ethnicity      1
                    sex                          3
                    suspension_type              1
                    tissue                     102
                    tissue_general              36
pan_troglodytes     all                          1
                    assay                        1
                    cell_type                   25
                    disease                      1
                    self_reported_ethnicity      1
                    sex                          2
                    suspension_type              1
                    tissue                       1
                    tissue_general               1
Name: count, dtype: int64

Example: cell counts for each sequencing assay in human data

To get the cell counts for each sequencing assay type in human data, you can perform the following pandas.DataFrame operations:

[8]:
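# Organism labels in this table are lowercase with underscores, e.g. 'homo_sapiens'
census_human_assays = census_counts.query("organism == 'homo_sapiens' & category == 'assay'")
census_human_assays.sort_values("total_cell_count", ascending=False)
[8]:
soma_joinid organism category label ontology_term_id total_cell_count unique_cell_count
... ... ... ... ... ... ... ...

39 rows × 7 columns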

Example: number of microglial cells in the Census

If you have a specific term from any of the categories shown above you can directly find out the number of cells for that term.

[9]:
census_counts.query("label == 'microglial cell'")
[9]:
soma_joinid organism category label ontology_term_id total_cell_count unique_cell_count
7 7 callithrix_jacchus cell_type microglial cell CL:0000129 65313 57904
182 182 homo_sapiens cell_type microglial cell CL:0000129 1183509 910878
1830 1830 macaca_mulatta cell_type microglial cell CL:0000129 129589 55858
1976 1976 mus_musculus cell_type microglial cell CL:0000129 144763 100961
2592 2592 pan_troglodytes cell_type microglial cell CL:0000129 5748 5748

Understanding Census contents beyond the summary tables

While using the pre-computed tables in census["census_info"] is an easy and quick way to understand the contents of the Census, it falls short if you want to learn more about certain slices of the Census.

For example, you may want to learn more about:

  • What are the cell types available for human liver?

  • What is the total number of cells in all lung datasets stratified by sequencing technology?

  • What is the sex distribution of all cells from brain in mouse?

  • What are the diseases available for T cells?

All of these questions can be answered by directly querying the cell metadata as shown in the examples below.
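As a sketch of how one of these questions could be phrased, the helpers below build a value_filter string and use it to count cells by one metadata column. The function names primary_cells_filter and count_by are hypothetical and not part of cellxgene_census; the read pattern follows the one used in this notebook. Only the filter string is built here, since count_by needs an open Census.

```python
def primary_cells_filter(**equals):
    """Build a value_filter string restricted to primary (de-duplicated) cells."""
    clauses = ["is_primary_data == True"]
    for column, value in equals.items():
        clauses.append(f"{column} == {value!r}")
    return " and ".join(clauses)


def count_by(census, organism, group_column, **equals):
    """Count cells per value of group_column for the cells matching the filter."""
    obs = census["census_data"][organism].obs
    table = (
        obs.read(column_names=[group_column], value_filter=primary_cells_filter(**equals))
        .concat()
        .to_pandas()
    )
    return table[group_column].value_counts()


mouse_brain_filter = primary_cells_filter(tissue_general="brain")
print(mouse_brain_filter)
# is_primary_data == True and tissue_general == 'brain'

# With an open Census (not executed here):
# count_by(census, "mus_musculus", "sex", tissue_general="brain")
```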

Example: all cell types available in human

To exemplify the process of accessing and slicing cell metadata for summary stats, let’s start with a trivial example and take a look at all human cell types available in the Census:

[10]:
human_cell_types = (
    census["census_data"]["homo_sapiens"].obs.read(column_names=["cell_type", "is_primary_data"]).concat().to_pandas()
)
human_cell_types
[10]:
cell_type is_primary_data
0 endothelial cell False
1 malignant cell False
2 fibroblast False
3 fibroblast False
4 macrophage False
... ... ...
158982714 pvalb GABAergic cortical interneuron True
158982715 VIP GABAergic cortical interneuron True
158982716 L2/3-6 intratelencephalic projecting glutamate... True
158982717 astrocyte of the cerebral cortex True
158982718 sst GABAergic cortical interneuron True

158982719 rows × 2 columns

The number of rows is the total number of human cells. To get the cell counts per cell type, we can perform some pandas operations on this object.

In addition, we will only focus on cells that are marked with is_primary_data=True as this ensures we de-duplicate cells that appear more than once in CELLxGENE Discover.

[11]:
human_cell_types = (
    census["census_data"]["homo_sapiens"]
    .obs.read(column_names=["cell_type"], value_filter="is_primary_data == True")
    .concat()
    .to_pandas()
)

human_cell_types = human_cell_types[["cell_type"]]
human_cell_types.shape
[11]:
(96591226, 1)

This is the number of unique cells. Now let’s look at the counts per cell type:

[12]:
human_cell_type_counts = human_cell_types.value_counts()
human_cell_type_counts
[12]:
cell_type
oligodendrocyte                                         5705502
neuron                                                  3858369
naive thymus-derived CD4-positive, alpha-beta T cell    3847813
fibroblast                                              2663513
glutamatergic neuron                                    2539819
                                                         ...
effector T cell                                               0
A2 amacrine cell                                              0
OFF retinal ganglion cell                                     0
type II NK T cell                                             0
CD38-negative naive B cell                                    0
Name: count, Length: 898, dtype: int64

This shows you that the most abundant cell types are “oligodendrocyte”, “neuron”, and “naive thymus-derived CD4-positive, alpha-beta T cell”.

Now let’s take a look at the number of unique cell types:

[13]:
human_cell_type_counts.shape
[13]:
(898,)

That is the number of cell type categories in the human cell metadata. It includes cell types whose count is 0 after filtering to primary data.
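The zero counts appear because pandas reports every category of a categorical column, whether or not any row carries it. The sketch below reproduces this on a tiny made-up Series (the values are illustrative, not Census data) and shows one way to keep only the cell types that actually occur.

```python
import pandas as pd

# Toy categorical column: "effector T cell" is a known category with no rows.
cell_type = pd.Series(
    pd.Categorical(
        ["neuron", "neuron", "fibroblast"],
        categories=["neuron", "fibroblast", "effector T cell"],
    ),
    name="cell_type",
)

counts = cell_type.value_counts()   # includes categories with a count of 0
observed = counts[counts > 0]       # keep only cell types that occur

print(len(counts), len(observed))
```

Applying the same `counts > 0` mask to the counts above would give the number of cell types with at least one primary cell.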

All the information in this example can be quickly obtained from the summary table at census["census_info"]["summary_cell_counts"].

The examples below are more complex and can only be achieved by accessing the cell metadata.

Example: cell types available in human liver

Similar to the example above, we can learn what cell types are available for a specific tissue, e.g. liver.

To achieve this goal we just need to limit our cell metadata to that tissue. We will use the information in the cell metadata variable tissue_general. This variable contains the high-level tissue label for all cells in the Census:

[14]:
human_liver_cell_types = (
    census["census_data"]["homo_sapiens"]
    .obs.read(column_names=["cell_type"], value_filter="is_primary_data == True and tissue_general == 'liver'")
    .concat()
    .to_pandas()
)

human_liver_cell_types["cell_type"].value_counts()
[14]:
cell_type
malignant cell                                  196802
T cell                                          160708
hepatocyte                                      112485
macrophage                                      109647
periportal region hepatocyte                     90251
                                                 ...
epithelial cell of pancreas                          0
epithelial cell of prostate                          0
epithelial cell of proximal tubule                   0
epithelial cell of proximal tubule segment 1         0
ependymal cell                                       0
Name: count, Length: 898, dtype: int64

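These are the cell types and their cell counts in the human liver. Cell types with a count of 0 are categories of the cell_type column that have no primary cells in liver.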

Example: diseased T cells in human tissues

In this example we are going to get the counts for all diseased cells annotated as T cells. For the sake of the example we will focus on “CD8-positive, alpha-beta T cell” and “CD4-positive, alpha-beta T cell”:

[15]:
t_cells_list = ["CD8-positive, alpha-beta T cell", "CD4-positive, alpha-beta T cell"]

t_cells_diseased = (
    census["census_data"]["homo_sapiens"]
    .obs.read(
        column_names=["disease", "tissue_general"],
        value_filter=f"is_primary_data == True and cell_type in {t_cells_list} and disease != 'normal'",
    )
    .concat()
    .to_pandas()
)

t_cells_diseased = t_cells_diseased[["disease", "tissue_general"]].value_counts(sort=False)
t_cells_diseased
[15]:
disease                           tissue_general
B-cell non-Hodgkin lymphoma       lymph node          232979
COVID-19                          blood               834850
                                  digestive system       626
                                  lung                 71204
                                  nose                    13
                                                       ...
rheumatoid arthritis              blood                  242
squamous cell lung carcinoma      lung                 49279
                                  lymph node             100
systemic lupus erythematosus      blood               355471
triple-negative breast carcinoma  exocrine gland        2003
Name: count, Length: 63, dtype: int64

These are the cell counts annotated with the indicated disease across human tissues for “CD8-positive, alpha-beta T cell” or “CD4-positive, alpha-beta T cell”.
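The value_filter in this example is built with an f-string, so the Python list is interpolated using its printed form. Printing the string before passing it to read() is a quick way to check what the query looks like:

```python
t_cells_list = ["CD8-positive, alpha-beta T cell", "CD4-positive, alpha-beta T cell"]

value_filter = f"is_primary_data == True and cell_type in {t_cells_list} and disease != 'normal'"
print(value_filter)
# is_primary_data == True and cell_type in ['CD8-positive, alpha-beta T cell', 'CD4-positive, alpha-beta T cell'] and disease != 'normal'
```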

NOTE: In Census 2025-11-08 and later (CELLxGENE schema 7.0.0 and above), a subset of datasets encode multiple values in the disease field delimited by ' || '. If our query touched such datasets, then we’d want to handle the disease field appropriately.
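One possible way to handle such values is to split each disease string on the delimiter before counting. This is a minimal sketch in plain Python; the function name count_diseases and the example values are made up for illustration and are not Census data.

```python
from collections import Counter

DELIMITER = " || "


def count_diseases(values):
    """Count each disease term, splitting multi-valued entries on the delimiter."""
    counts = Counter()
    for value in values:
        for term in value.split(DELIMITER):
            counts[term.strip()] += 1
    return counts


# Hypothetical disease annotations, one per cell.
example = ["COVID-19", "COVID-19 || pneumonia", "pneumonia"]
disease_counts = count_diseases(example)
print(disease_counts)
```

Note that a cell annotated with two diseases contributes to both terms, so the counts can add up to more than the number of cells.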

And, don’t forget to close the census!

[16]:
census.close()
del census