Census data releases
Last edited: Nov 8th, 2025.
What is a Census data release?
It is a Census build that is publicly hosted online. A Census build is a TileDB-SOMA collection with the Census data from CZ CELLxGENE Discover as specified in the Census schema.
Any given Census build is named with a unique tag, normally the date of build, e.g., "2025-01-30".
Long-term supported (LTS) Census releases
To enable data stability and scientific reproducibility, CZ CELLxGENE Discover plans to keep certain Census data releases publicly available for at least 5 years after publication.
The most recent LTS Census data release is the default opened by the APIs and recognized as census_version = "stable". To open previous LTS Census data releases, you can directly specify the version via its build date census_version = "[YYYY]-[MM]-[DD]".
Python
import cellxgene_census
census = cellxgene_census.open_soma(census_version = "stable")
R
library("cellxgene.census")
census <- open_soma(census_version = "stable")
Weekly Census releases (latest)
CZ CELLxGENE Discover ingests a handful of new datasets every week. To quickly enable access to these new data via the Census, CZ CELLxGENE Discover plans to perform weekly Census data releases, available for public access for 1 month.
The most recent weekly release can be opened by the APIs by specifying census_version = "latest".
Python
import cellxgene_census
census = cellxgene_census.open_soma(census_version = "latest")
R
library("cellxgene.census")
census <- open_soma(census_version = "latest")
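The forms of census_version described above ("stable", "latest", or a build date) can be sanity-checked locally before opening a release. The following is a minimal sketch, not part of the package: the helper names is_valid_census_version and open_release are hypothetical, and the check assumes build tags are date-shaped (the page says tags are normally, not always, the build date). A tag that passes is only well-formed; whether that release is still hosted is decided by the server.

```python
import re
from datetime import date

# Aliases named on this page; build tags are ASSUMED to look like YYYY-MM-DD.
ALIASES = {"stable", "latest"}
BUILD_TAG = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_census_version(tag: str) -> bool:
    """Return True for "stable", "latest", or a well-formed YYYY-MM-DD date."""
    if tag in ALIASES:
        return True
    if not BUILD_TAG.fullmatch(tag):
        return False
    try:
        date.fromisoformat(tag)  # rejects impossible dates such as 2025-13-40
    except ValueError:
        return False
    return True


def open_release(tag: str = "stable"):
    """Open a Census release after a local check of the tag (hypothetical helper).

    Needs the cellxgene_census package and network access.
    """
    if not is_valid_census_version(tag):
        raise ValueError(f"Unrecognized census_version: {tag!r}")
    import cellxgene_census  # imported lazily so the check above works offline

    return cellxgene_census.open_soma(census_version=tag)
```

For reproducible analyses it may be preferable to pass an explicit LTS build date, for example open_release("2025-01-30"), since "stable" and "latest" point to different releases over time.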
List of LTS Census data releases
LTS 2025-11-08
Open this data release by specifying census_version = "2025-11-08" in future calls to open_soma().
Version information
| Information | Value |
|---|---|
| Census schema version | |
| Census build date | 2025-11-08 |
| Dataset schema version | |
| Number of datasets | 1845 |
Schema changes
Census schema 2.4.0 has a few important changes that may need adjustments in analysis code:
- The obs disease and disease_ontology_term_id fields may now contain multiple values delimited by ' || ', so exact string equality queries on these fields may yield incomplete results.
- The var feature_name field is no longer necessarily unique. Previously, colliding gene symbols were disambiguated by appending their feature_id (Ensembl gene ID). feature_name is now populated with the exact gene symbols, even if used multiple times, while feature_id remains unique.
These reflect changes in the newer CELLxGENE Dataset schema version.
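One way analysis code might adapt to both changes is sketched below in plain Python, applied to rows that have already been read from obs and var. The function names and all sample values (disease labels, gene symbols, IDs) are made up for illustration; only the ' || ' delimiter and the field names come from the notes above.

```python
DELIMITER = " || "  # delimiter named in the schema change notes above


def split_multi_value(field: str) -> list[str]:
    """Split a possibly multi-valued obs field into its individual values."""
    return field.split(DELIMITER)


def has_value(field: str, wanted: str) -> bool:
    """True if `wanted` is one of the delimited values (not a substring match)."""
    return wanted in split_multi_value(field)


def ids_for_symbol(var_rows: list[dict], symbol: str) -> list[str]:
    """Return every feature_id that carries the given feature_name."""
    return [row["feature_id"] for row in var_rows if row["feature_name"] == symbol]


def index_by_feature_id(var_rows: list[dict]) -> dict[str, dict]:
    """Key var rows by feature_id, which remains unique, not by feature_name."""
    return {row["feature_id"]: row for row in var_rows}


# Illustrative data only; these are not real Census values.
obs_disease = ["normal", "disease A || disease B", "disease B"]
var_rows = [
    {"feature_id": "ID_1", "feature_name": "GENE_X"},
    {"feature_id": "ID_2", "feature_name": "GENE_X"},  # repeated symbol
    {"feature_id": "ID_3", "feature_name": "GENE_Y"},
]

exact = [d for d in obs_disease if d == "disease B"]            # misses row 2
robust = [d for d in obs_disease if has_value(d, "disease B")]  # finds both rows
```

Splitting on the full delimiter rather than testing for a substring avoids matching a term that merely contains the wanted text.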
Cell counts
| Species | Total cells | Unique cells |
|---|---|---|
| Homo sapiens | 162,025,130 | 99,633,637 |
| Mus musculus | 46,299,127 | 21,029,771 |
| Macaca mulatta | 7,010,229 | 2,929,014 |
| Callithrix jacchus | 2,275,451 | 1,712,738 |
| Pan troglodytes | 158,099 | 158,099 |
Cell metadata
| Category | Homo sapiens | Mus musculus | Callithrix jacchus | Macaca mulatta | Pan troglodytes |
|---|---|---|---|---|---|
| Assay | 37 | 16 | 1 | 2 | 1 |
| Cell type | 898 | 473 | 40 | 54 | 25 |
| Development stage | 194 | 66 | 3 | 4 | 1 |
| Disease | 192 | 16 | 1 | 1 | 1 |
| Self-reported ethnicity | 33 | 1 | 1 | 1 | 1 |
| Sex | 3 | 3 | 2 | 3 | 2 |
| Suspension type | 2 | 2 | 1 | 2 | 1 |
| Tissue | 417 | 101 | 33 | 29 | 1 |
| Tissue general | 70 | 36 | 1 | 2 | 1 |
Embeddings
Find out more in the Census models page.
Available embeddings can be accessed via cellxgene_census.experimental.get_embedding(), or by specifying the obs_embeddings/var_embeddings field in cellxgene_census.get_anndata().
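As a rough illustration of the second route mentioned above, the sketch below passes embedding names through the obs_embeddings argument of cellxgene_census.get_anndata(). The wrapper name fetch_with_embeddings is hypothetical, and the organism string, filter string, and embedding name shown in the usage comment are assumptions; the names actually hosted for a given release should be checked against the models page. Running it requires the package and network access.

```python
def fetch_with_embeddings(census_version, organism, obs_value_filter, embedding_names):
    """Return an AnnData for the filtered cells with the requested cell embeddings.

    Hypothetical convenience wrapper around cellxgene_census.get_anndata().
    """
    import cellxgene_census  # lazy import: only needed when data is fetched

    with cellxgene_census.open_soma(census_version=census_version) as census:
        return cellxgene_census.get_anndata(
            census,
            organism=organism,
            measurement_name="RNA",
            obs_value_filter=obs_value_filter,
            obs_embeddings=list(embedding_names),
        )


# Example call (assumed values, not executed here):
# adata = fetch_with_embeddings(
#     "2023-12-15",
#     "homo_sapiens",
#     "tissue_general == 'central nervous system'",
#     ["scvi"],
# )
# The embeddings would then be expected under adata.obsm, keyed by name.
```

Pinning census_version to an LTS build date keeps the cells and the embeddings aligned to the same release.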
Cells
| Method | Homo sapiens | Mus musculus |
|---|---|---|
| scVI | | |
| TranscriptFormer tf-sapiens | | N/A |
| TranscriptFormer tf-exemplar | | |
LTS 2025-01-30
Open this data release by specifying census_version = "2025-01-30" in future calls to open_soma().
Version information
| Information | Value |
|---|---|
| Census schema version | |
| Census build date | 2025-01-30 |
| Dataset schema version | |
| Number of datasets | 1573 |
Cell and donor counts
| Type | Homo sapiens | Mus musculus |
|---|---|---|
| Total cells | 109,085,698 | 45,351,496 |
| Unique cells | 65,601,657 | 20,208,302 |
Cell metadata
| Category | Homo sapiens | Mus musculus |
|---|---|---|
| Assay | 31 | 17 |
| Cell type | 827 | 453 |
| Development stage | 179 | 58 |
| Disease | 140 | 12 |
| Self-reported ethnicity | 36 | 1 |
| Sex | 3 | 3 |
| Suspension type | 1 | 1 |
| Tissue | 379 | 99 |
| Tissue general | 68 | 36 |
Embeddings
Find out more in the Census model page.
Available embeddings can be accessed via cellxgene_census.experimental.get_embedding(), or by specifying the obs_embeddings/var_embeddings field in cellxgene_census.get_anndata().
Cells
| Method | Homo sapiens | Mus musculus |
|---|---|---|
| scVI | | |
LTS 2024-07-01
Open this data release by specifying census_version = "2024-07-01" in future calls to open_soma().
Version information
| Information | Value |
|---|---|
| Census schema version | |
| Census build date | 2024-07-01 |
| Dataset schema version | |
| Number of datasets | 812 |
Cell and donor counts
| Type | Homo sapiens | Mus musculus |
|---|---|---|
| Total cells | 74,322,510 | 41,233,630 |
| Unique cells | 44,265,932 | 16,332,034 |
Cell metadata
| Category | Homo sapiens | Mus musculus |
|---|---|---|
| Assay | 24 | 11 |
| Cell type | 698 | 364 |
| Development stage | 176 | 48 |
| Disease | 109 | 7 |
| Self-reported ethnicity | 31 | NA |
| Sex | 3 | 3 |
| Suspension type | 2 | 2 |
| Tissue | 267 | 84 |
| Tissue general | 55 | 29 |
Embeddings
Find out more in the Census model page.
Available embeddings can be accessed via cellxgene_census.experimental.get_embedding(), or by specifying the obs_embeddings/var_embeddings field in cellxgene_census.get_anndata().
Cells
| Method | Homo sapiens | Mus musculus |
|---|---|---|
| scVI | | |
| Geneformer | | NA |
LTS 2023-12-15
Open this data release by specifying census_version = "2023-12-15" in future calls to open_soma().
Version information
| Information | Value |
|---|---|
| Census schema version | |
| Census build date | 2023-12-15 |
| Dataset schema version | |
| Number of datasets | 651 |
Cell and donor counts
| Type | Homo sapiens | Mus musculus |
|---|---|---|
| Total cells | 62,998,417 | 5,684,805 |
| Unique cells | 36,227,903 | 4,128,230 |
Cell metadata
| Category | Homo sapiens | Mus musculus |
|---|---|---|
| Assay | 20 | 10 |
| Cell type | 631 | 248 |
| Development stage | 173 | 36 |
| Disease | 72 | 5 |
| Self-reported ethnicity | 30 | NA |
| Sex | 3 | 3 |
| Suspension type | 2 | 2 |
| Tissue | 230 | 74 |
| Tissue general | 53 | 27 |
Embeddings
Find out more in the Census model page.
Available embeddings can be accessed via cellxgene_census.experimental.get_embedding(), or by specifying the obs_embeddings/var_embeddings field in cellxgene_census.get_anndata().
Cells
| Method | Homo sapiens | Mus musculus |
|---|---|---|
| scVI | | |
| Fine-tuned Geneformer | | NA |
| scGPT | | NA |
| Universal Cell Embeddings | | NA |
| NMF | | NA |
Features
| Method | Homo sapiens | Mus musculus |
|---|---|---|
| NMF | | NA |
LTS 2023-07-25
Open this data release by specifying census_version = "2023-07-25" in future calls to open_soma().
Version information
| Information | Value |
|---|---|
| Census schema version | |
| Census build date | 2023-07-25 |
| Dataset schema version | |
| Number of datasets | 593 |
Cell and donor counts
| Type | Homo sapiens | Mus musculus |
|---|---|---|
| Total cells | 56,400,873 | 5,255,245 |
| Unique cells | 33,364,242 | 4,083,531 |
Cell metadata
| Category | Homo sapiens | Mus musculus |
|---|---|---|
| Assay | 19 | 9 |
| Cell type | 613 | 248 |
| Development stage | 164 | 33 |
| Disease | 64 | 5 |
| Self-reported ethnicity | 26 | NA |
| Sex | 3 | 3 |
| Suspension type | 2 | 2 |
| Tissue | 220 | 66 |
| Tissue general | 54 | 27 |
LTS 2023-05-15
Open this data release by specifying census_version = "2023-05-15" in future calls to open_soma().
🔴 Errata 🔴
Duplicate observations with is_primary_data = True
In order to prevent duplicate data in analyses, each observation (cell) should be marked is_primary_data = True exactly once in the Census. Since this LTS release, 243,569 observations have been identified that are represented at least twice with is_primary_data = True.
This issue will be corrected in the following LTS data release, by identifying and marking only one cell out of the duplicates as is_primary_data = True.
If you wish to use this data release, consider filtering out all of these 243,569 cells by using the soma_joinids provided in the file duplicate_cells_census_LTS_2023-05-15.csv.zip. You can filter specific cells by using the value_filter or obs_value_filter of the querying API functions; for more information, follow this tutorial.
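As an alternative to a query-side filter, the duplicates can also be dropped client-side once obs rows have been read. The sketch below assumes the unzipped CSV has a header row with a column named soma_joinid; that layout is an assumption, not something this page states, so the column name is a parameter. The function names are hypothetical. If the file covers more than one organism, each organism's obs would presumably need to be filtered against its own set of ids.

```python
import csv
import io


def load_duplicate_joinids(csv_text: str, column: str = "soma_joinid") -> set[int]:
    """Parse the duplicate-cell ids from CSV text (column name is an assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row[column]) for row in reader}


def drop_duplicates(obs_rows: list[dict], duplicate_ids: set[int]) -> list[dict]:
    """Keep only the obs rows whose soma_joinid is not flagged as a duplicate."""
    return [row for row in obs_rows if row["soma_joinid"] not in duplicate_ids]


# Illustrative stand-ins for the real file and for rows read from obs.
sample_csv = "soma_joinid\n3\n7\n"
sample_obs = [{"soma_joinid": i} for i in range(10)]

flagged = load_duplicate_joinids(sample_csv)
clean_obs = drop_duplicates(sample_obs, flagged)
```

Holding the ids in a set keeps each membership test constant-time, which matters when screening tens of millions of cells against 243,569 flagged ids.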
Version information
| Information | Value |
|---|---|
| Census schema version | |
| Census build date | 2023-05-15 |
| Dataset schema version | |
| Number of datasets | 562 |
Cell and donor counts
| Type | Homo sapiens | Mus musculus |
|---|---|---|
| Total cells | 53,794,728 | 4,086,032 |
| Unique cells | 33,758,887 | 2,914,318 |
Cell metadata
| Category | Homo sapiens | Mus musculus |
|---|---|---|
| Assay | 20 | 9 |
| Cell type | 604 | 226 |
| Development stage | 164 | 30 |
| Disease | 68 | 5 |
| Self-reported ethnicity | 26 | NA |
| Sex | 3 | 3 |
| Suspension type | 2 | 2 |
| Tissue | 227 | 51 |
| Tissue general | 61 | 27 |
Compatibility with package versions
Due to the nature of the Census storage backend, the format version will change from time to time. Format upgrades are always backwards compatible, but they're not always forwards compatible, which means that reading a recent Census data version using an older version of the package might result in an error. We aim to guarantee the following policy:
Every Census package version released after an LTS will be able to read every Census data release until the next LTS.
The current LTS release (2025-11-08) is compatible with the following package versions:
1.17.x
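A script could check the installed package against the series listed above before opening the current LTS release. This is a minimal sketch with hypothetical helper names; it assumes the package is installed under the distribution name cellxgene-census and compares only the major.minor part of the version. A False result means the version is not listed here, not that it is known to fail.

```python
from importlib import metadata

# Taken from the "1.17.x" entry listed above for LTS 2025-11-08.
SUPPORTED_SERIES = ("1.17",)


def in_supported_series(version: str, series=SUPPORTED_SERIES) -> bool:
    """True if the major.minor part of `version` is one of the listed series."""
    major_minor = ".".join(version.split(".")[:2])
    return major_minor in series


def installed_version(package: str = "cellxgene-census"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("cellxgene-census is not installed")
elif not in_supported_series(version):
    print(f"cellxgene-census {version} is not in the listed series {SUPPORTED_SERIES}")
```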