Learning about the CZ CELLxGENE Census

This notebook showcases the Census contents and how to obtain high-level information about it. It covers the organization of data within the Census, what cell and gene metadata are available, and it provides simple demonstrations to summarize cell counts across cell metadata.

Contents

Opening the Census
Census organization
Cell metadata
Gene metadata
Census summary content tables
Understanding Census contents beyond the summary tables

⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable is_primary_data which is described in the Census schema.

Opening the Census

The cellxgene_census Python package provides a convenient API to open the latest stable release of the Census. If you open the Census, you should also close it. open_soma() returns a context manager, so you can open and close it in several ways, much like a Python file handle. Using it in a with statement is preferred, as the Census is then closed automatically even if an error is raised.

[1]:

import cellxgene_census

# Preferred: use a Python context manager
with cellxgene_census.open_soma() as census:
    ...

# or
census = cellxgene_census.open_soma()
...
census.close()

The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.

You can learn more about the cellxgene_census methods by accessing their documentation via help(). For example, help(cellxgene_census.open_soma).

[2]:

census = cellxgene_census.open_soma()

The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
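The message printed above suggests pinning a release so that results stay reproducible. The following is a minimal sketch of that, assuming network access and that the 2023-07-25 release is still hosted. `open_pinned_census` is a hypothetical helper name used for illustration, not part of the package.

```python
CENSUS_VERSION = "2023-07-25"  # the "stable" release named in the message above


def open_pinned_census(census_version: str = CENSUS_VERSION):
    """Open one specific Census release instead of whatever "stable" points to.

    Hypothetical helper for illustration; it needs network access when called.
    """
    import cellxgene_census  # imported lazily so defining the helper has no side effects

    return cellxgene_census.open_soma(census_version=census_version)


# Usage (commented out because it contacts the Census host):
# with open_pinned_census() as census:
#     print(list(census.keys()))
```

Keeping the version string in one constant makes it easy to update every call at once when you decide to move to a newer release.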

Census organization

The Census schema defines the structure of the Census. In short, you can think of the Census as a structured collection of items that stores different pieces of information. All of these items and the parent collection are SOMA objects of various types and can all be accessed with the TileDB-SOMA API (documentation).

The cellxgene_census package contains some convenient wrappers of the TileDB-SOMA API. An example of this is the function we used to open the Census: cellxgene_census.open_soma()

Main Census components

With the command above you created census, which is a SOMACollection. It is analogous to a Python dictionary, and it has two items: census_info and census_data.

Census summary info

census["census_info"]: A collection of tables providing information about the Census as a whole.
- census["census_info"]["summary"]: A data frame with high-level information of this Census, e.g. build date, total cell count, etc.
- census["census_info"]["datasets"]: A data frame with all datasets from CELLxGENE Discover used to create the Census.
- census["census_info"]["summary_cell_counts"]: A data frame with cell counts stratified by relevant cell metadata
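Because the top-level object behaves like a dictionary, its layout can be listed with ordinary mapping operations. The sketch below shows the idea; `describe_collection` is a hypothetical helper, and the plain dict merely stands in for an open Census so that the snippet runs offline.

```python
from collections.abc import Mapping


def describe_collection(collection: Mapping) -> dict:
    """Map each item name in a dict-like collection to the type name of its value."""
    return {name: type(collection[name]).__name__ for name in collection.keys()}


# Stand-in with the same top-level keys as the Census; the values are placeholders.
fake_census = {
    "census_info": {"summary": None},
    "census_data": {"homo_sapiens": None},
}
print(describe_collection(fake_census))

# With a real handle you would call, for example:
# describe_collection(census) or describe_collection(census["census_info"])
```

With a real handle the reported type names would be SOMA classes rather than `dict`.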

Census data

Data for each organism is stored in independent SOMAExperiment objects, which are a specialized form of a SOMACollection. Each of these stores a data matrix (cells by genes), cell metadata, gene metadata, and some other useful components not covered in this notebook.

This is how the data is organized for one organism – Homo sapiens:

census["census_data"]["homo_sapiens"].obs: Cell metadata
census["census_data"]["homo_sapiens"].ms["RNA"].X: Data matrices; currently only raw counts exist, stored in X["raw"]
census["census_data"]["homo_sapiens"].ms["RNA"].var: Gene metadata

Cell metadata

You can obtain all cell metadata variables by directly querying the columns of the corresponding SOMADataFrame.

All of these variables can be used for querying the Census in case you want to work with specific cells.

[3]:

keys = list(census["census_data"]["homo_sapiens"].obs.keys())

keys

[3]:

['soma_joinid',
 'dataset_id',
 'assay',
 'assay_ontology_term_id',
 'cell_type',
 'cell_type_ontology_term_id',
 'development_stage',
 'development_stage_ontology_term_id',
 'disease',
 'disease_ontology_term_id',
 'donor_id',
 'is_primary_data',
 'self_reported_ethnicity',
 'self_reported_ethnicity_ontology_term_id',
 'sex',
 'sex_ontology_term_id',
 'suspension_type',
 'tissue',
 'tissue_ontology_term_id',
 'tissue_general',
 'tissue_general_ontology_term_id']

All of these variables are defined in the CELLxGENE dataset schema except for the following:

soma_joinid: a SOMA-defined value used for join operations.
dataset_id: the dataset ID as encoded in census["census_info"]["datasets"].
tissue_general and tissue_general_ontology_term_id: the high-level tissue mapping.
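Any of the columns listed above can appear in the `value_filter` string that `obs.read()` accepts, as the later examples in this notebook show. The following is a small sketch of assembling such a string from Python values. `build_value_filter` is a hypothetical helper written for illustration, not a Census or TileDB-SOMA function, and it only covers equality tests joined with `and`.

```python
def build_value_filter(**conditions) -> str:
    """Join `column == value` tests with 'and', quoting each value via repr()."""
    return " and ".join(
        f"{column} == {value!r}" for column, value in conditions.items()
    )


obs_filter = build_value_filter(is_primary_data=True, tissue_general="liver")
print(obs_filter)
```

The resulting string has the same form as the filters written by hand in the liver and T cell examples later in this notebook.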

Gene metadata

Similarly, we can obtain all gene metadata variables by directly querying the columns of the corresponding SOMADataFrame.

These are the variables you can use for querying the Census in case there are specific genes you are interested in.

[4]:

keys = list(census["census_data"]["homo_sapiens"].ms["RNA"].var.keys())

keys

[4]:

['soma_joinid', 'feature_id', 'feature_name', 'feature_length']

All of these variables are defined in the CELLxGENE dataset schema except for the following:

soma_joinid: a SOMA-defined value used for join operations.
feature_length: the length in base pairs of the gene.

Census summary content tables

You can take a quick look at the high-level Census information by looking at census["census_info"]["summary"]:

[5]:

census_info = census["census_info"]["summary"].read().concat().to_pandas()

census_info

[5]:

	soma_joinid	label	value
0	0	census_schema_version	1.0.0
1	1	census_build_date	2023-07-25
2	2	dataset_schema_version	3.0.0
3	3	total_cell_count	61656118
4	4	unique_cell_count	37447773
5	5	number_donors_homo_sapiens	13035
6	6	number_donors_mus_musculus	1417

Of special interest are the following label-value combinations:

total_cell_count is the total number of cells in the Census.
unique_cell_count is the number of unique cells, as some cells may be present more than once due to meta-analysis or consortia-like data.
number_donors_homo_sapiens and number_donors_mus_musculus are the numbers of individuals for human and mouse. These counts are not guaranteed to be unique, as the same donor ID may be present in more than one dataset.
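The difference between the two cell counts tells you how many entries are duplicates. The sketch below works this out from two rows copied from the summary output. It assumes the value column is stored as text, which the mix of versions, dates, and counts in that column suggests, so the counts are converted with `int()` before doing arithmetic.

```python
import pandas as pd

# Two rows copied from the summary output; the "value" column holds strings.
census_info = pd.DataFrame(
    {
        "label": ["total_cell_count", "unique_cell_count"],
        "value": ["61656118", "37447773"],
    }
)

# Index by label so each value can be looked up by name.
summary = census_info.set_index("label")["value"]
total_cells = int(summary["total_cell_count"])
unique_cells = int(summary["unique_cell_count"])

duplicate_cells = total_cells - unique_cells
unique_fraction = unique_cells / total_cells
print(duplicate_cells, round(unique_fraction, 3))
```

With the real table, the same lookups apply to the census_info data frame read from the Census.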

Cell counts by cell metadata

By looking at census["census_info"]["summary_cell_counts"] you can get a general idea of cell counts stratified by some relevant cell metadata. Not all cell metadata is included in this table; all available cell and gene metadata are listed in the sections “Cell metadata” and “Gene metadata” above.

The line below retrieves this table and casts it into a pandas.DataFrame.

[6]:

census_counts = census["census_info"]["summary_cell_counts"].read().concat().to_pandas()

census_counts

[6]:

	soma_joinid	organism	category	ontology_term_id	unique_cell_count	total_cell_count	label
0	0	Homo sapiens	all	na	33364242	56400873	na
1	1	Homo sapiens	assay	EFO:0008722	264166	279635	Drop-seq
2	2	Homo sapiens	assay	EFO:0008780	25652	51304	inDrop
3	3	Homo sapiens	assay	EFO:0008919	89477	206754	Seq-Well
4	4	Homo sapiens	assay	EFO:0008931	78750	188248	Smart-seq2
...	...	...	...	...	...	...	...
1357	1357	Mus musculus	tissue_general	UBERON:0002113	179684	208324	kidney
1358	1358	Mus musculus	tissue_general	UBERON:0002365	15577	31154	exocrine gland
1359	1359	Mus musculus	tissue_general	UBERON:0002367	37715	130135	prostate gland
1360	1360	Mus musculus	tissue_general	UBERON:0002368	13322	26644	endocrine gland
1361	1361	Mus musculus	tissue_general	UBERON:0002371	90225	144962	bone marrow

1362 rows × 7 columns

For each organism, and for each value within each category of cell metadata, total_cell_count and unique_cell_count give the cell counts for that combination.

The values for each category are specified in ontology_term_id and label, which are the value’s IDs and labels, respectively.

Example: cell metadata included in the summary counts table

To get all the available cell metadata in the summary counts table you can do the following. Remember this is not all the cell metadata available, as some variables were omitted in the creation of this table.

[7]:

census_counts[["organism", "category"]].value_counts(sort=False)

[7]:

organism      category
Homo sapiens  all                          1
              assay                       19
              cell_type                  613
              disease                     64
              self_reported_ethnicity     26
              sex                          3
              suspension_type              1
              tissue                     220
              tissue_general              54
Mus musculus  all                          1
              assay                        9
              cell_type                  248
              disease                      5
              self_reported_ethnicity      1
              sex                          3
              suspension_type              1
              tissue                      66
              tissue_general              27
Name: count, dtype: int64

Example: cell counts for each sequencing assay in human data

To get the cell counts for each sequencing assay type in human data, you can perform the following pandas.DataFrame operations:

[8]:

census_human_assays = census_counts.query("organism == 'Homo sapiens' & category == 'assay'")
census_human_assays.sort_values("total_cell_count", ascending=False)

[8]:

	soma_joinid	organism	category	ontology_term_id	unique_cell_count	total_cell_count	label
10	10	Homo sapiens	assay	EFO:0009922	11845077	25597563	10x 3' v3
7	7	Homo sapiens	assay	EFO:0009899	7559102	12638794	10x 3' v2
14	14	Homo sapiens	assay	EFO:0011025	3872375	6139786	10x 5' v1
13	13	Homo sapiens	assay	EFO:0010550	4062980	5064268	sci-RNA-seq
8	8	Homo sapiens	assay	EFO:0009900	2930054	3139770	10x 5' v2
17	17	Homo sapiens	assay	EFO:0030004	915037	1084235	10x 5' transcription profiling
16	16	Homo sapiens	assay	EFO:0030003	744798	811422	10x 3' transcription profiling
15	15	Homo sapiens	assay	EFO:0030002	625175	642559	microwell-seq
1	1	Homo sapiens	assay	EFO:0008722	264166	279635	Drop-seq
3	3	Homo sapiens	assay	EFO:0008919	89477	206754	Seq-Well
4	4	Homo sapiens	assay	EFO:0008931	78750	188248	Smart-seq2
18	18	Homo sapiens	assay	EFO:0700003	146278	177276	BD Rhapsody Whole Transcriptome Analysis
9	9	Homo sapiens	assay	EFO:0009901	42397	121394	10x 3' v1
12	12	Homo sapiens	assay	EFO:0010183	58981	117962	single cell library construction
19	19	Homo sapiens	assay	EFO:0700004	96145	96145	BD Rhapsody Targeted mRNA
2	2	Homo sapiens	assay	EFO:0008780	25652	51304	inDrop
6	6	Homo sapiens	assay	EFO:0008995	0	29128	10x technology
5	5	Homo sapiens	assay	EFO:0008953	4693	9386	STRT-seq
11	11	Homo sapiens	assay	EFO:0010010	3105	5244	CEL-seq2
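Raw counts are easier to compare as shares of the whole. The following is a small sketch using a few rows copied from the tables above, not the full table. It divides each assay's unique_cell_count by the unique count of the human "all" row; the column name `unique_share` is introduced here for illustration.

```python
import pandas as pd

# A few rows copied from the tables above (not the full table).
census_counts = pd.DataFrame(
    {
        "organism": ["Homo sapiens"] * 4,
        "category": ["all", "assay", "assay", "assay"],
        "label": ["na", "10x 3' v3", "10x 3' v2", "Smart-seq2"],
        "unique_cell_count": [33364242, 11845077, 7559102, 78750],
        "total_cell_count": [56400873, 25597563, 12638794, 188248],
    }
)

human = census_counts[census_counts["organism"] == "Homo sapiens"]

# The "all" row carries the overall number of unique human cells.
all_unique = human.loc[human["category"] == "all", "unique_cell_count"].iloc[0]

assays = human[human["category"] == "assay"].copy()
assays["unique_share"] = assays["unique_cell_count"] / all_unique
print(assays[["label", "unique_share"]].sort_values("unique_share", ascending=False))
```

Running the same steps on the full census_counts table read from the Census gives the share for every assay.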

Example: number of microglial cells in the Census

If you have a specific term from any of the categories shown above you can directly find out the number of cells for that term.

[9]:

census_counts.query("label == 'microglial cell'")

[9]:

	soma_joinid	organism	category	ontology_term_id	unique_cell_count	total_cell_count	label
69	69	Homo sapiens	cell_type	CL:0000129	268114	370771	microglial cell
1038	1038	Mus musculus	cell_type	CL:0000129	48998	62617	microglial cell

Understanding Census contents beyond the summary tables

While using the pre-computed tables in census["census_info"] is an easy and quick way to understand the contents of the Census, it falls short if you want to learn more about certain slices of the Census.

For example, you may want to learn more about:

What are the cell types available for human liver?
What is the total number of cells in all lung datasets stratified by sequencing technology?
What is the sex distribution of all cells from brain in mouse?
What are the diseases available for T cells?

All of these questions can be answered by directly querying the cell metadata as shown in the examples below.
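As a sketch of how one of these questions could be phrased, the function below counts mouse brain cells per value of sex. It assumes that the mouse experiment is stored under the key "mus_musculus", mirroring "homo_sapiens", and that 'brain' is a valid tissue_general label for mouse. `sex_counts_in_mouse_brain` is a hypothetical helper name, and the function contacts the Census only when it is called.

```python
def sex_counts_in_mouse_brain(census):
    """Count primary (de-duplicated) mouse brain cells per value of `sex`.

    `census` is an open handle as returned by cellxgene_census.open_soma().
    The organism key and the tissue label are assumptions, see the text above.
    """
    obs = census["census_data"]["mus_musculus"].obs
    cells = (
        obs.read(
            column_names=["sex"],
            value_filter="is_primary_data == True and tissue_general == 'brain'",
        )
        .concat()
        .to_pandas()
    )
    # One row per cell, so counting values gives cells per sex.
    return cells["sex"].value_counts()


# Usage, with an open handle named census:
# sex_counts_in_mouse_brain(census)
```

The other questions follow the same pattern: choose the columns to return, restrict the cells with value_filter, and summarize the result with pandas.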

Example: all cell types available in human

To exemplify the process of accessing and slicing cell metadata for summary stats, let’s start with a trivial example and take a look at all human cell types available in the Census:

[10]:

human_cell_types = (
    census["census_data"]["homo_sapiens"].obs.read(column_names=["cell_type", "is_primary_data"]).concat().to_pandas()
)
human_cell_types

[10]:

	cell_type	is_primary_data
0	syncytiotrophoblast cell	False
1	placental villous trophoblast	False
2	syncytiotrophoblast cell	False
3	syncytiotrophoblast cell	False
4	extravillous trophoblast	False
...	...	...
56400868	pericyte	True
56400869	pericyte	True
56400870	pericyte	True
56400871	pericyte	True
56400872	pericyte	True

56400873 rows × 2 columns

The number of rows is the total number of human cells. Now, if you wish to get the cell counts per cell type, you can perform some pandas operations on this object.

In addition, we will only focus on cells that are marked with is_primary_data=True as this ensures we de-duplicate cells that appear more than once in CELLxGENE Discover.

[11]:

human_cell_types = (
    census["census_data"]["homo_sapiens"]
    .obs.read(column_names=["cell_type"], value_filter="is_primary_data == True")
    .concat()
    .to_pandas()
)

human_cell_types = human_cell_types[["cell_type"]]
human_cell_types.shape

[11]:

(33364242, 1)

This is the number of unique cells. Now let’s look at the counts per cell type:

[12]:

human_cell_type_counts = human_cell_types.value_counts()
human_cell_type_counts

[12]:

cell_type
neuron                                             2673669
glutamatergic neuron                               1541605
CD4-positive, alpha-beta T cell                    1258976
CD8-positive, alpha-beta T cell                    1235987
classical monocyte                                 1030996
                                                    ...
microfold cell of epithelium of small intestine         19
mature conventional dendritic cell                      17
serous cell of epithelium of bronchus                   15
sperm                                                   11
type N enteroendocrine cell                             10
Name: count, Length: 599, dtype: int64

This shows you that the most abundant cell types are “neuron”, “glutamatergic neuron”, and “CD4-positive, alpha-beta T cell”.

Now let’s take a look at the number of unique cell types:

[13]:

human_cell_type_counts.shape

[13]:

(599,)

That is the total number of different cell types for human.

All the information in this example can be quickly obtained from the summary table at census["census_info"]["summary_cell_counts"].

The examples below are more complex and can only be achieved by accessing the cell metadata.

Example: cell types available in human liver

Similar to the example above, we can learn what cell types are available for a specific tissue, e.g. liver.

To achieve this goal we just need to limit our cell metadata to that tissue. We will use the information in the cell metadata variable tissue_general. This variable contains the high-level tissue label for all cells in the Census:

[14]:

human_liver_cell_types = (
    census["census_data"]["homo_sapiens"]
    .obs.read(column_names=["cell_type"], value_filter="is_primary_data == True and tissue_general == 'liver'")
    .concat()
    .to_pandas()
)

human_liver_cell_types["cell_type"].value_counts()

[14]:

cell_type
T cell                               85739
hepatoblast                          58447
neoplastic cell                      52431
erythroblast                         45605
monocyte                             31388
                                     ...
pulmonary artery endothelial cell        1
germinal center B cell                   1
enteroendocrine cell                     1
type I pneumocyte                        1
group 2 innate lymphoid cell             1
Name: count, Length: 126, dtype: int64

These are the cell types and their cell counts in the human liver.

Example: diseased T cells in human tissues

In this example we are going to get the counts for all diseased cells annotated as T cells. For the sake of the example we will focus on “CD8-positive, alpha-beta T cell” and “CD4-positive, alpha-beta T cell”:

[15]:

t_cells_list = ["CD8-positive, alpha-beta T cell", "CD4-positive, alpha-beta T cell"]

t_cells_diseased = (
    census["census_data"]["homo_sapiens"]
    .obs.read(
        column_names=["disease", "tissue_general"],
        value_filter=f"is_primary_data == True and cell_type in {t_cells_list} and disease != 'normal'",
    )
    .concat()
    .to_pandas()
)

t_cells_diseased = t_cells_diseased[["disease", "tissue_general"]].value_counts(sort=False)
t_cells_diseased

[15]:

disease                                tissue_general
B-cell non-Hodgkin lymphoma            blood                  62499
COVID-19                               blood                 819428
                                       lung                   30578
                                       nose                      13
                                       respiratory system         4
                                       saliva                    41
Crohn disease                          colon                  17490
                                       small intestine        52029
Down syndrome                          bone marrow              181
breast cancer                          breast                  1850
chronic obstructive pulmonary disease  lung                    9382
chronic rhinitis                       nose                     909
clear cell renal carcinoma             blood                   6548
                                       kidney                 20540
                                       lymph node                36
cystic fibrosis                        lung                       7
follicular lymphoma                    lymph node              1089
influenza                              blood                   8871
interstitial lung disease              lung                    1803
kidney benign neoplasm                 blood                     20
                                       kidney                    10
kidney oncocytoma                      blood                     16
                                       kidney                  2408
lung adenocarcinoma                    adrenal gland            205
                                       brain                   3274
                                       liver                    507
                                       lung                  215013
                                       lymph node             24969
                                       pleural fluid          11558
lung large cell carcinoma              lung                    5922
lymphangioleiomyomatosis               lung                     513
non-small cell lung carcinoma          lung                   36573
nonpapillary renal cell carcinoma      adipose tissue           243
                                       adrenal gland           4828
                                       blood                    288
                                       blood clot              1717
                                       kidney                 69136
pleomorphic carcinoma                  lung                    1715
pneumonia                              lung                     856
pulmonary fibrosis                     lung                    1671
respiratory system disorder            blood                  34301
squamous cell lung carcinoma           lung                   52053
                                       lymph node               100
systemic lupus erythematosus           blood                 355471
Name: count, dtype: int64

These are the cell counts annotated with the indicated disease across human tissues for “CD8-positive, alpha-beta T cell” or “CD4-positive, alpha-beta T cell”.
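A long two-level listing like this can be easier to scan as a disease-by-tissue table. The following is a small sketch using a handful of counts copied from the output above; with the real result you would call `unstack` directly on t_cells_diseased.

```python
import pandas as pd

# A handful of (disease, tissue_general) counts copied from the output above.
t_cells_diseased = pd.Series(
    {
        ("COVID-19", "blood"): 819428,
        ("COVID-19", "lung"): 30578,
        ("lung adenocarcinoma", "liver"): 507,
        ("lung adenocarcinoma", "lung"): 215013,
    },
    name="count",
).rename_axis(["disease", "tissue_general"])

# Rows become diseases, columns become tissues; absent pairs are filled with 0.
disease_by_tissue = t_cells_diseased.unstack(fill_value=0)
print(disease_by_tissue)
```

Filling absent combinations with 0 keeps the table numeric, at the cost of no longer distinguishing "no cells" from "combination not present in the query result".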

And don’t forget to close the Census!

[16]:

census.close()
del census