Learning about the CZ CELLxGENE Census
This notebook showcases the Census contents and how to obtain high-level information about it. It covers the organization of data within the Census, what cell and gene metadata are available, and it provides simple demonstrations to summarize cell counts across cell metadata.
Contents
Opening the Census
Census organization
Cell metadata
Gene metadata
Census summary content tables
Understanding Census contents beyond the summary tables
⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable is_primary_data, which is described in the Census schema.
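To make the effect of is_primary_data concrete, here is a minimal plain-Python sketch. The records are hypothetical toy data, and the "cell" key is an illustrative assumption (it is not a Census column); only the is_primary_data and dataset_id names come from the Census.

```python
# Hypothetical toy records: the same cell appears in two datasets,
# but only one copy is flagged as primary data.
# The "cell" key is an illustrative label, not a Census column.
cells = [
    {"cell": "cell_A", "dataset_id": "dataset_1", "is_primary_data": True},
    {"cell": "cell_A", "dataset_id": "dataset_2", "is_primary_data": False},
    {"cell": "cell_B", "dataset_id": "dataset_1", "is_primary_data": True},
    {"cell": "cell_C", "dataset_id": "dataset_2", "is_primary_data": True},
]

# Every row counts towards the total, duplicates included.
total_cell_count = len(cells)

# Keeping only primary rows leaves one row per cell.
primary_cells = [row for row in cells if row["is_primary_data"]]
unique_cell_count = len(primary_cells)

print(total_cell_count, unique_cell_count)
```

Filtering on the flag, rather than trying to match cells across datasets, is the same approach the later examples in this notebook take.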
Opening the Census
The cellxgene_census Python package contains a convenient API to open the latest version of the Census. If you open the Census, you should close it. open_soma() returns a context manager, so you can open and close it in several ways, much like a Python file handle. The context manager form is preferred, as it closes the Census automatically even if an error is raised.
[1]:
import cellxgene_census

# Preferred: use a Python context manager
with cellxgene_census.open_soma() as census:
    ...

# or
census = cellxgene_census.open_soma()
...
census.close()
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
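The difference between the two styles shows up when something fails between opening and closing. The sketch below uses a hypothetical stand-in class, FakeCensus, instead of a real Census handle, so it runs offline; it only assumes that the real object supports close() and the context manager protocol, as described above.

```python
class FakeCensus:
    """Hypothetical stand-in for the object returned by open_soma()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow the exception


def failing_work(census):
    """Simulate an error raised while the Census is open."""
    raise RuntimeError("simulated failure while reading")


# Context manager: the handle is closed even though the body raised.
managed = FakeCensus()
try:
    with managed as census:
        failing_work(census)
except RuntimeError:
    pass

# Manual open/close: close() is never reached when the body raises.
manual = FakeCensus()
try:
    failing_work(manual)
    manual.close()
except RuntimeError:
    pass

print(managed.closed, manual.closed)
```

If you do use the manual style, wrapping the work in try/finally and calling close() in the finally clause gives the same guarantee.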
You can learn more about the cellxgene_census methods by accessing their corresponding documentation via help(). For example, help(cellxgene_census.open_soma).
[2]:
census = cellxgene_census.open_soma()
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
Census organization
The Census schema defines the structure of the Census. In short, you can think of the Census as a structured collection of items that stores different pieces of information. All of these items and the parent collection are SOMA objects of various types and can all be accessed with the TileDB-SOMA API (documentation).
The cellxgene_census
package contains some convenient wrappers of the TileDB-SOMA
API. An example of this is the function we used to open the Census: cellxgene_census.open_soma()
Main Census components
With the command above you created census, which is a SOMACollection. It is analogous to a Python dictionary, and it has two items: census_info and census_data.
Census summary info
census["census_info"]: A collection of tables providing information about the Census as a whole.
census["census_info"]["summary"]: A data frame with high-level information about this Census, e.g. build date, total cell count, etc.
census["census_info"]["datasets"]: A data frame with all datasets from CELLxGENE Discover used to create the Census.
census["census_info"]["summary_cell_counts"]: A data frame with cell counts stratified by relevant cell metadata.
Census data
Data for each organism is stored in independent SOMAExperiment objects, which are a specialized form of a SOMACollection. Each of these stores a data matrix (cells by genes), cell metadata, gene metadata, and some other useful components not covered in this notebook.
This is how the data is organized for one organism, Homo sapiens:
census["census_data"]["homo_sapiens"].obs: Cell metadata
census["census_data"]["homo_sapiens"].ms["RNA"].X: Data matrices; currently only raw counts exist, in X["raw"]
census["census_data"]["homo_sapiens"].ms["RNA"].var: Gene metadata
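The organization described above can be pictured as a nested dictionary. The following is only a plain-dict stand-in, not real SOMA objects: in the actual API, obs and ms are reached as attributes, as shown in the paths above. The "mus_musculus" key is an assumption made by analogy with "homo_sapiens".

```python
def organism_layout():
    """Layout of one organism's experiment, as described in the text."""
    return {
        "obs": "cell metadata",
        "ms": {"RNA": {"X": {"raw": "raw counts"}, "var": "gene metadata"}},
    }


# Plain-dict stand-in for the Census layout (not real SOMA objects).
census_layout = {
    "census_info": {
        "summary": "data frame",
        "datasets": "data frame",
        "summary_cell_counts": "data frame",
    },
    "census_data": {
        "homo_sapiens": organism_layout(),
        "mus_musculus": organism_layout(),  # assumed key, by analogy
    },
}


def list_paths(node, prefix=()):
    """Yield every leaf path in the nested dictionary as a tuple of keys."""
    for key, value in node.items():
        path = prefix + (key,)
        if isinstance(value, dict):
            yield from list_paths(value, path)
        else:
            yield path


for path in list_paths(census_layout):
    print("/".join(path))
```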
Cell metadata
You can obtain all cell metadata variables by directly querying the columns of the corresponding SOMADataFrame.
All of these variables can be used for querying the Census in case you want to work with specific cells.
[3]:
keys = list(census["census_data"]["homo_sapiens"].obs.keys())
keys
[3]:
['soma_joinid',
'dataset_id',
'assay',
'assay_ontology_term_id',
'cell_type',
'cell_type_ontology_term_id',
'development_stage',
'development_stage_ontology_term_id',
'disease',
'disease_ontology_term_id',
'donor_id',
'is_primary_data',
'self_reported_ethnicity',
'self_reported_ethnicity_ontology_term_id',
'sex',
'sex_ontology_term_id',
'suspension_type',
'tissue',
'tissue_ontology_term_id',
'tissue_general',
'tissue_general_ontology_term_id']
All of these variables are defined in the CELLxGENE dataset schema except for the following:
soma_joinid: a SOMA-defined value used for join operations.
dataset_id: the dataset ID as encoded in census["census_info"]["datasets"].
tissue_general and tissue_general_ontology_term_id: the high-level tissue mapping.
Gene metadata
Similarly, we can obtain all gene metadata variables by directly querying the columns of the corresponding SOMADataFrame.
These are the variables you can use for querying the Census in case there are specific genes you are interested in.
[4]:
keys = list(census["census_data"]["homo_sapiens"].ms["RNA"].var.keys())
keys
[4]:
['soma_joinid', 'feature_id', 'feature_name', 'feature_length']
All of these variables are defined in the CELLxGENE dataset schema except for the following:
soma_joinid: a SOMA-defined value used for join operations.
feature_length: the length in base pairs of the gene.
Census summary content tables
You can take a quick look at the high-level Census information by looking at census["census_info"]["summary"]:
[5]:
census_info = census["census_info"]["summary"].read().concat().to_pandas()
census_info
[5]:
 | soma_joinid | label | value |
---|---|---|---|
0 | 0 | census_schema_version | 1.0.0 |
1 | 1 | census_build_date | 2023-07-25 |
2 | 2 | dataset_schema_version | 3.0.0 |
3 | 3 | total_cell_count | 61656118 |
4 | 4 | unique_cell_count | 37447773 |
5 | 5 | number_donors_homo_sapiens | 13035 |
6 | 6 | number_donors_mus_musculus | 1417 |
Of special interest are the label-value combinations for:
total_cell_count is the total number of cells in the Census.
unique_cell_count is the number of unique cells, as some cells may be present more than once due to meta-analysis or consortia-like data.
number_donors_homo_sapiens and number_donors_mus_musculus are the number of individuals for human and mouse. These are not guaranteed to be unique, as one individual ID may be present or identical in different datasets.
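As a worked example of how the two cell counts relate, the sketch below recomputes the number of duplicated cell entries from the values shown in the summary table. The rows are copied by hand rather than read from the Census, and treating the value column as strings is an assumption.

```python
# Label/value pairs copied from the summary table above.
# Values are written as strings here, which is an assumption about the column type.
summary_rows = [
    ("total_cell_count", "61656118"),
    ("unique_cell_count", "37447773"),
]
summary = dict(summary_rows)

total = int(summary["total_cell_count"])
unique = int(summary["unique_cell_count"])

# Entries that are repeats of a cell already counted elsewhere.
duplicate_count = total - unique

# Share of all entries that correspond to unique cells.
unique_fraction = unique / total

print(duplicate_count)
print(f"{unique_fraction:.1%}")
```

In this release the difference is 24,208,345 entries, which is why filtering on is_primary_data matters for any count-based summary.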
Cell counts by cell metadata
By looking at census["census_info"]["summary_cell_counts"] you can get a general idea of cell counts stratified by some relevant cell metadata. Not all cell metadata is included in this table; you can take a look at all available cell and gene metadata in the sections above, “Cell metadata” and “Gene metadata”.
The line below retrieves this table and casts it into a pandas.DataFrame.
[6]:
census_counts = census["census_info"]["summary_cell_counts"].read().concat().to_pandas()
census_counts
[6]:
 | soma_joinid | organism | category | ontology_term_id | unique_cell_count | total_cell_count | label |
---|---|---|---|---|---|---|---|
0 | 0 | Homo sapiens | all | na | 33364242 | 56400873 | na |
1 | 1 | Homo sapiens | assay | EFO:0008722 | 264166 | 279635 | Drop-seq |
2 | 2 | Homo sapiens | assay | EFO:0008780 | 25652 | 51304 | inDrop |
3 | 3 | Homo sapiens | assay | EFO:0008919 | 89477 | 206754 | Seq-Well |
4 | 4 | Homo sapiens | assay | EFO:0008931 | 78750 | 188248 | Smart-seq2 |
... | ... | ... | ... | ... | ... | ... | ... |
1357 | 1357 | Mus musculus | tissue_general | UBERON:0002113 | 179684 | 208324 | kidney |
1358 | 1358 | Mus musculus | tissue_general | UBERON:0002365 | 15577 | 31154 | exocrine gland |
1359 | 1359 | Mus musculus | tissue_general | UBERON:0002367 | 37715 | 130135 | prostate gland |
1360 | 1360 | Mus musculus | tissue_general | UBERON:0002368 | 13322 | 26644 | endocrine gland |
1361 | 1361 | Mus musculus | tissue_general | UBERON:0002371 | 90225 | 144962 | bone marrow |
1362 rows × 7 columns
For each combination of organism and values for each category of cell metadata you can take a look at total_cell_count and unique_cell_count for the cell counts of that combination.
The values for each category are specified in ontology_term_id and label, which are the value’s IDs and labels, respectively.
Example: cell metadata included in the summary counts table
To get all the available cell metadata in the summary counts table you can do the following. Remember this is not all the cell metadata available, as some variables were omitted in the creation of this table.
[7]:
census_counts[["organism", "category"]].value_counts(sort=False)
[7]:
organism category
Homo sapiens all 1
assay 19
cell_type 613
disease 64
self_reported_ethnicity 26
sex 3
suspension_type 1
tissue 220
tissue_general 54
Mus musculus all 1
assay 9
cell_type 248
disease 5
self_reported_ethnicity 1
sex 3
suspension_type 1
tissue 66
tissue_general 27
Name: count, dtype: int64
Example: cell counts for each sequencing assay in human data
To get the cell counts for each sequencing assay type in human data, you can perform the following pandas.DataFrame operations:
[8]:
census_human_assays = census_counts.query("organism == 'Homo sapiens' & category == 'assay'")
census_human_assays.sort_values("total_cell_count", ascending=False)
[8]:
 | soma_joinid | organism | category | ontology_term_id | unique_cell_count | total_cell_count | label |
---|---|---|---|---|---|---|---|
10 | 10 | Homo sapiens | assay | EFO:0009922 | 11845077 | 25597563 | 10x 3' v3 |
7 | 7 | Homo sapiens | assay | EFO:0009899 | 7559102 | 12638794 | 10x 3' v2 |
14 | 14 | Homo sapiens | assay | EFO:0011025 | 3872375 | 6139786 | 10x 5' v1 |
13 | 13 | Homo sapiens | assay | EFO:0010550 | 4062980 | 5064268 | sci-RNA-seq |
8 | 8 | Homo sapiens | assay | EFO:0009900 | 2930054 | 3139770 | 10x 5' v2 |
17 | 17 | Homo sapiens | assay | EFO:0030004 | 915037 | 1084235 | 10x 5' transcription profiling |
16 | 16 | Homo sapiens | assay | EFO:0030003 | 744798 | 811422 | 10x 3' transcription profiling |
15 | 15 | Homo sapiens | assay | EFO:0030002 | 625175 | 642559 | microwell-seq |
1 | 1 | Homo sapiens | assay | EFO:0008722 | 264166 | 279635 | Drop-seq |
3 | 3 | Homo sapiens | assay | EFO:0008919 | 89477 | 206754 | Seq-Well |
4 | 4 | Homo sapiens | assay | EFO:0008931 | 78750 | 188248 | Smart-seq2 |
18 | 18 | Homo sapiens | assay | EFO:0700003 | 146278 | 177276 | BD Rhapsody Whole Transcriptome Analysis |
9 | 9 | Homo sapiens | assay | EFO:0009901 | 42397 | 121394 | 10x 3' v1 |
12 | 12 | Homo sapiens | assay | EFO:0010183 | 58981 | 117962 | single cell library construction |
19 | 19 | Homo sapiens | assay | EFO:0700004 | 96145 | 96145 | BD Rhapsody Targeted mRNA |
2 | 2 | Homo sapiens | assay | EFO:0008780 | 25652 | 51304 | inDrop |
6 | 6 | Homo sapiens | assay | EFO:0008995 | 0 | 29128 | 10x technology |
5 | 5 | Homo sapiens | assay | EFO:0008953 | 4693 | 9386 | STRT-seq |
11 | 11 | Homo sapiens | assay | EFO:0010010 | 3105 | 5244 | CEL-seq2 |
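The two count columns can be combined to see how much of each assay's data is duplicated. This is a plain-Python sketch over a few rows copied by hand from the table above; it does not read from the Census, and the unique_fraction name is introduced here for illustration.

```python
# (label, unique_cell_count, total_cell_count) copied from the table above.
assay_rows = [
    ("10x 3' v3", 11845077, 25597563),
    ("inDrop", 25652, 51304),
    ("BD Rhapsody Targeted mRNA", 96145, 96145),
    ("10x technology", 0, 29128),
]

# Fraction of each assay's entries that are unique (primary) cells.
unique_fraction = {
    label: unique / total for label, unique, total in assay_rows
}

# Print assays from most to least duplicated.
for label, fraction in sorted(unique_fraction.items(), key=lambda item: item[1]):
    print(f"{label}: {fraction:.2f}")
```

A fraction of 1.0 means no entry for that assay is a duplicate, while 0.0 means every entry is a non-primary copy of data held elsewhere.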
Example: number of microglial cells in the Census
If you have a specific term from any of the categories shown above you can directly find out the number of cells for that term.
[9]:
census_counts.query("label == 'microglial cell'")
[9]:
 | soma_joinid | organism | category | ontology_term_id | unique_cell_count | total_cell_count | label |
---|---|---|---|---|---|---|---|
69 | 69 | Homo sapiens | cell_type | CL:0000129 | 268114 | 370771 | microglial cell |
1038 | 1038 | Mus musculus | cell_type | CL:0000129 | 48998 | 62617 | microglial cell |
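Because the table holds one row per organism, a Census-wide figure for a term is the sum over those rows. The sketch below does that arithmetic in plain Python with the two rows copied by hand from the output above.

```python
# (organism, unique_cell_count, total_cell_count) for "microglial cell",
# copied from the query output above.
microglia_rows = [
    ("Homo sapiens", 268114, 370771),
    ("Mus musculus", 48998, 62617),
]

# Sum each count column across organisms.
unique_microglia = sum(unique for _, unique, _ in microglia_rows)
total_microglia = sum(total for _, _, total in microglia_rows)

print(unique_microglia, total_microglia)
```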
Understanding Census contents beyond the summary tables
While using the pre-computed tables in census["census_info"] is an easy and quick way to understand the contents of the Census, it falls short if you want to learn more about certain slices of the Census.
For example, you may want to learn more about:
What are the cell types available for human liver?
What is the total number of cells in all lung datasets, stratified by sequencing technology?
What is the sex distribution of all cells from brain in mouse?
What are the diseases available for T cells?
All of these questions can be answered by directly querying the cell metadata as shown in the examples below.
Example: all cell types available in human
To exemplify the process of accessing and slicing cell metadata for summary stats, let’s start with a trivial example and take a look at all human cell types available in the Census:
[10]:
human_cell_types = (
    census["census_data"]["homo_sapiens"].obs.read(column_names=["cell_type", "is_primary_data"]).concat().to_pandas()
)
human_cell_types
[10]:
 | cell_type | is_primary_data |
---|---|---|
0 | syncytiotrophoblast cell | False |
1 | placental villous trophoblast | False |
2 | syncytiotrophoblast cell | False |
3 | syncytiotrophoblast cell | False |
4 | extravillous trophoblast | False |
... | ... | ... |
56400868 | pericyte | True |
56400869 | pericyte | True |
56400870 | pericyte | True |
56400871 | pericyte | True |
56400872 | pericyte | True |
56400873 rows × 2 columns
The number of rows is the total number of cells for humans. Now, if you wish to get the cell counts per cell type, you can perform some pandas operations on this object.
In addition, we will only focus on cells that are marked with is_primary_data=True, as this ensures we de-duplicate cells that appear more than once in CELLxGENE Discover.
[11]:
human_cell_types = (
    census["census_data"]["homo_sapiens"]
    .obs.read(column_names=["cell_type"], value_filter="is_primary_data == True")
    .concat()
    .to_pandas()
)
human_cell_types = human_cell_types[["cell_type"]]
human_cell_types.shape
[11]:
(33364242, 1)
This is the number of unique cells. Now let’s look at the counts per cell type:
[12]:
human_cell_type_counts = human_cell_types.value_counts()
human_cell_type_counts
[12]:
cell_type
neuron 2673669
glutamatergic neuron 1541605
CD4-positive, alpha-beta T cell 1258976
CD8-positive, alpha-beta T cell 1235987
classical monocyte 1030996
...
microfold cell of epithelium of small intestine 19
mature conventional dendritic cell 17
serous cell of epithelium of bronchus 15
sperm 11
type N enteroendocrine cell 10
Name: count, Length: 599, dtype: int64
This shows you that the most abundant cell types are “neuron”, “glutamatergic neuron”, and “CD4-positive, alpha-beta T cell”.
Now let’s take a look at the number of unique cell types:
[13]:
human_cell_type_counts.shape
[13]:
(599,)
That is the total number of different cell types for human.
All the information in this example can be quickly obtained from the summary table at census["census_info"]["summary_cell_counts"].
The examples below are more complex and can only be achieved by accessing the cell metadata.
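The counting step used in this example, value_counts() followed by a look at the shape, amounts to tallying labels and then counting the distinct ones. The sketch below reproduces that idea with collections.Counter on a short, hypothetical list of labels so that it runs without the Census or pandas.

```python
from collections import Counter

# Hypothetical toy column of cell type labels, one entry per cell.
cell_types = ["neuron", "T cell", "neuron", "pericyte", "neuron", "T cell"]

# Tally cells per label, analogous to value_counts().
cell_type_counts = Counter(cell_types)

# List labels from most to least abundant.
for cell_type, count in cell_type_counts.most_common():
    print(cell_type, count)

# Number of distinct labels, analogous to the shape of the counts.
number_of_cell_types = len(cell_type_counts)
print(number_of_cell_types)
```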
Example: cell types available in human liver
Similar to the example above, we can learn what cell types are available for a specific tissue, e.g. liver.
To achieve this goal we just need to limit our cell metadata to that tissue. We will use the information in the cell metadata variable tissue_general. This variable contains the high-level tissue label for all cells in the Census:
[14]:
human_liver_cell_types = (
    census["census_data"]["homo_sapiens"]
    .obs.read(column_names=["cell_type"], value_filter="is_primary_data == True and tissue_general == 'liver'")
    .concat()
    .to_pandas()
)
human_liver_cell_types["cell_type"].value_counts()
[14]:
cell_type
T cell 85739
hepatoblast 58447
neoplastic cell 52431
erythroblast 45605
monocyte 31388
...
pulmonary artery endothelial cell 1
germinal center B cell 1
enteroendocrine cell 1
type I pneumocyte 1
group 2 innate lymphoid cell 1
Name: count, Length: 126, dtype: int64
These are the cell types and their cell counts in the human liver.
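The same pattern extends to other slices by changing the filter string. The helper below, build_value_filter, is hypothetical and not part of cellxgene_census or TileDB-SOMA; it only assembles strings in the style used in the cell above. Whether values containing quote characters are accepted by the query engine is not covered here.

```python
def build_value_filter(primary_only=True, **equals):
    """Join equality conditions with 'and', in the style of the filters above.

    Hypothetical helper: it only formats a string and does not query the Census.
    """
    parts = ["is_primary_data == True"] if primary_only else []
    for column, value in equals.items():
        # repr() wraps plain strings in single quotes, e.g. 'liver'.
        parts.append(f"{column} == {value!r}")
    return " and ".join(parts)


# Reproduces the filter used for the human liver example.
liver_filter = build_value_filter(tissue_general="liver")
print(liver_filter)

# A filter for brain cells, e.g. to pass to the mouse obs data frame.
brain_filter = build_value_filter(tissue_general="brain")
print(brain_filter)
```

The resulting string would be passed as value_filter to obs.read(), together with the column_names you want returned.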
Example: diseased T cells in human tissues
In this example we are going to get the counts for all diseased cells annotated as T cells. For the sake of the example we will focus on “CD8-positive, alpha-beta T cell” and “CD4-positive, alpha-beta T cell”:
[15]:
t_cells_list = ["CD8-positive, alpha-beta T cell", "CD4-positive, alpha-beta T cell"]
t_cells_diseased = (
    census["census_data"]["homo_sapiens"]
    .obs.read(
        column_names=["disease", "tissue_general"],
        value_filter=f"is_primary_data == True and cell_type in {t_cells_list} and disease != 'normal'",
    )
    .concat()
    .to_pandas()
)
t_cells_diseased = t_cells_diseased[["disease", "tissue_general"]].value_counts(sort=False)
t_cells_diseased
[15]:
disease tissue_general
B-cell non-Hodgkin lymphoma blood 62499
COVID-19 blood 819428
lung 30578
nose 13
respiratory system 4
saliva 41
Crohn disease colon 17490
small intestine 52029
Down syndrome bone marrow 181
breast cancer breast 1850
chronic obstructive pulmonary disease lung 9382
chronic rhinitis nose 909
clear cell renal carcinoma blood 6548
kidney 20540
lymph node 36
cystic fibrosis lung 7
follicular lymphoma lymph node 1089
influenza blood 8871
interstitial lung disease lung 1803
kidney benign neoplasm blood 20
kidney 10
kidney oncocytoma blood 16
kidney 2408
lung adenocarcinoma adrenal gland 205
brain 3274
liver 507
lung 215013
lymph node 24969
pleural fluid 11558
lung large cell carcinoma lung 5922
lymphangioleiomyomatosis lung 513
non-small cell lung carcinoma lung 36573
nonpapillary renal cell carcinoma adipose tissue 243
adrenal gland 4828
blood 288
blood clot 1717
kidney 69136
pleomorphic carcinoma lung 1715
pneumonia lung 856
pulmonary fibrosis lung 1671
respiratory system disorder blood 34301
squamous cell lung carcinoma lung 52053
lymph node 100
systemic lupus erythematosus blood 355471
Name: count, dtype: int64
These are the cell counts annotated with the indicated disease across human tissues for “CD8-positive, alpha-beta T cell” or “CD4-positive, alpha-beta T cell”.
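The filter in this example relies on an f-string that embeds a Python list directly. The sketch below prints the expanded string so you can see what is actually sent as value_filter; it runs without the Census.

```python
t_cells_list = ["CD8-positive, alpha-beta T cell", "CD4-positive, alpha-beta T cell"]

# Formatting a list inside an f-string inserts its repr(), i.e. a bracketed,
# comma-separated sequence of quoted strings.
value_filter = f"is_primary_data == True and cell_type in {t_cells_list} and disease != 'normal'"

print(value_filter)
```

This works because the list's textual form happens to match the filter syntax for membership tests; labels that themselves contain quote characters would be rendered differently by repr() and should be checked before use.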
And don’t forget to close the Census!
[16]:
census.close()
del census