Census data releases
Last edited: July 8th, 2024.
What is a Census data release?
It is a Census build that is publicly hosted online. A Census build is a TileDB-SOMA collection with the Census data from CZ CELLxGENE Discover as specified in the Census schema.
Any given Census build is named with a unique tag, normally the date of build, e.g., "2023-05-15".
Long-term supported (LTS) Census releases
To enable data stability and scientific reproducibility, CZ CELLxGENE Discover plans to perform regular LTS Census data releases:
Published online every six months for public access, starting on May 15, 2023.
Available for public access for at least 5 years upon publication.
The most recent LTS Census data release is the default opened by the APIs and recognized as census_version = "stable". To open previous LTS Census data releases, you can directly specify the version via its build date census_version = "[YYYY]-[MM]-[DD]".
Python
import cellxgene_census
census = cellxgene_census.open_soma(census_version = "stable")
R
library("cellxgene.census")
census <- open_soma(census_version = "stable")
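The version tags described above ("stable", "latest", or a build date) can be checked before opening a release. The following is a minimal sketch, not part of the Census API: the helper names is_valid_census_version and open_release are hypothetical, and the check only validates the tag format, not whether a release with that date actually exists.

```python
import re
from datetime import date

# Build-date tags look like "2023-05-15" (YYYY-MM-DD).
_DATE_TAG = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_valid_census_version(tag: str) -> bool:
    """Return True if `tag` is "stable", "latest", or a well-formed build date.

    Hypothetical helper: it validates the format only and does not check
    that a release with this tag is actually hosted.
    """
    if tag in ("stable", "latest"):
        return True
    if not _DATE_TAG.fullmatch(tag):
        return False
    try:
        date.fromisoformat(tag)  # rejects impossible dates such as 2023-13-40
    except ValueError:
        return False
    return True


def open_release(tag: str = "stable"):
    """Open a Census release after validating its tag (hypothetical wrapper)."""
    if not is_valid_census_version(tag):
        raise ValueError(f"Unrecognized census_version: {tag!r}")
    # Imported lazily so the validation helper works without the package;
    # opening a release requires cellxgene_census and network access.
    import cellxgene_census

    return cellxgene_census.open_soma(census_version=tag)
```

For example, open_release("2023-12-15") would open that LTS release, while open_release("2023-5-15") raises a ValueError before any network call is made.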
Weekly Census releases (latest)
CZ CELLxGENE Discover ingests a handful of new datasets every week. To quickly enable access to these new data via the Census, CZ CELLxGENE Discover plans to perform weekly Census data releases:
Available for public access for 1 month.
The most recent weekly release can be opened by the APIs by specifying census_version = "latest".
Python
import cellxgene_census
census = cellxgene_census.open_soma(census_version = "latest")
R
library("cellxgene.census")
census <- open_soma(census_version = "latest")
List of LTS Census data releases
LTS 2024-07-01
Open this data release by specifying census_version = "2024-07-01" in future calls to open_soma().
Version information
Information | Value |
---|---|
Census schema version | |
Census build date | 2024-05-20 |
Dataset schema version | |
Number of datasets | 812 |
Cell and donor counts
Type | Homo sapiens | Mus musculus |
---|---|---|
Total cells | 74,322,510 | 41,233,630 |
Unique cells | 44,265,932 | 16,332,034 |
Number of donors | 17,651 | 4,216 |
Cell metadata
Category | Homo sapiens | Mus musculus |
---|---|---|
Assay | 24 | 11 |
Cell type | 698 | 364 |
Development stage | 176 | 48 |
Disease | 109 | 7 |
Self-reported ethnicity | 31 | NA |
Sex | 3 | 3 |
Suspension type | 2 | 2 |
Tissue | 267 | 84 |
Tissue general | 55 | 29 |
Embeddings
Find out more in the Census model page.
Available embeddings can be accessed via cellxgene_census.experimental.get_embedding(), or by specifying the obs_embeddings/var_embeddings arguments of cellxgene_census.get_anndata().
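A minimal sketch of the get_anndata() route follows. It assumes that the scVI embedding is registered under the name "scvi" and that the installed package version supports the obs_embeddings argument; the wrapper name fetch_with_scvi and the tissue filter are illustrative only.

```python
def fetch_with_scvi(census_version: str = "2024-07-01"):
    """Fetch a small human slice with a cell embedding attached (sketch).

    Assumptions: the embedding name "scvi" and the tissue filter are
    illustrative; the call requires cellxgene_census and network access.
    """
    # Imported lazily so that defining this function has no side effects.
    import cellxgene_census

    with cellxgene_census.open_soma(census_version=census_version) as census:
        return cellxgene_census.get_anndata(
            census,
            organism="homo_sapiens",
            measurement_name="RNA",
            obs_value_filter="tissue_general == 'tongue' and is_primary_data == True",
            obs_embeddings=["scvi"],  # assumed embedding name
        )
```

If the call succeeds, the requested embedding is expected to appear in the obsm slot of the returned AnnData object, keyed by the embedding name.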
Cells
Method | Homo sapiens | Mus musculus |
---|---|---|
scVI | | |
Geneformer | | NA |
LTS 2023-12-15
Open this data release by specifying census_version = "2023-12-15" in future calls to open_soma().
Version information
Information | Value |
---|---|
Census schema version | |
Census build date | 2023-12-15 |
Dataset schema version | |
Number of datasets | 651 |
Cell and donor counts
Type | Homo sapiens | Mus musculus |
---|---|---|
Total cells | 62,998,417 | 5,684,805 |
Unique cells | 36,227,903 | 4,128,230 |
Number of donors | 15,588 | 1,990 |
Cell metadata
Category | Homo sapiens | Mus musculus |
---|---|---|
Assay | 20 | 10 |
Cell type | 631 | 248 |
Development stage | 173 | 36 |
Disease | 72 | 5 |
Self-reported ethnicity | 30 | NA |
Sex | 3 | 3 |
Suspension type | 2 | 2 |
Tissue | 230 | 74 |
Tissue general | 53 | 27 |
Embeddings
Find out more in the Census model page.
Available embeddings can be accessed via cellxgene_census.experimental.get_embedding(), or by specifying the obs_embeddings/var_embeddings arguments of cellxgene_census.get_anndata().
Cells
Method | Homo sapiens | Mus musculus |
---|---|---|
scVI | | |
Fine-tuned Geneformer | | NA |
scGPT | | NA |
Universal Cell Embeddings | | NA |
NMF | | NA |
Features
Method | Homo sapiens | Mus musculus |
---|---|---|
NMF | | NA |
LTS 2023-07-25
Open this data release by specifying census_version = "2023-07-25" in future calls to open_soma().
Version information
Information | Value |
---|---|
Census schema version | |
Census build date | 2023-07-25 |
Dataset schema version | |
Number of datasets | 593 |
Cell and donor counts
Type | Homo sapiens | Mus musculus |
---|---|---|
Total cells | 56,400,873 | 5,255,245 |
Unique cells | 33,364,242 | 4,083,531 |
Number of donors | 13,035 | 1,417 |
Cell metadata
Category | Homo sapiens | Mus musculus |
---|---|---|
Assay | 19 | 9 |
Cell type | 613 | 248 |
Development stage | 164 | 33 |
Disease | 64 | 5 |
Self-reported ethnicity | 26 | NA |
Sex | 3 | 3 |
Suspension type | 2 | 2 |
Tissue | 220 | 66 |
Tissue general | 54 | 27 |
LTS 2023-05-15
Open this data release by specifying census_version = "2023-05-15" in future calls to open_soma().
🔴 Errata 🔴
Duplicate observations with is_primary_data = True
In order to prevent duplicate data in analyses, each observation (cell) should be marked is_primary_data = True exactly once in the Census. Since this LTS release, 243,569 observations have been identified that are represented at least twice with is_primary_data = True.
This issue will be corrected in the following LTS data release, by identifying and marking only one cell out of the duplicates as is_primary_data = True.
If you wish to use this data release, you can consider filtering out all of these 243,569 cells by using the soma_joinids provided in this file duplicate_cells_census_LTS_2023-05-15.csv.zip. You can filter specific cells by using the value_filter or obs_value_filter of the querying API functions; for more information, follow this tutorial.
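As an alternative to filtering in the query itself, the duplicates can also be dropped on the client side once the soma_joinids of a result are known. The sketch below uses only the standard library; the column name "soma_joinid" is an assumption about the layout of the unzipped CSV file, and both helper names are hypothetical.

```python
import csv
import io


def load_duplicate_joinids(csv_text: str, column: str = "soma_joinid") -> set:
    """Parse the duplicate-cell CSV into a set of integer join IDs.

    Assumption: the file has a header row and the IDs live in `column`;
    adjust `column` to match the actual file.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row[column]) for row in reader}


def drop_duplicates(joinids, duplicates):
    """Keep only the join IDs that are not flagged as duplicates, in order."""
    return [j for j in joinids if j not in duplicates]


# Tiny made-up example; the real file lists 243,569 cells.
example_csv = "soma_joinid\n3\n7\n"
dups = load_duplicate_joinids(example_csv)
kept = drop_duplicates([1, 3, 5, 7, 9], dups)
print(kept)  # [1, 5, 9]
```

The same set can be used to filter an obs table after it has been read, by keeping only the rows whose soma_joinid is not in the set.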
Version information
Information | Value |
---|---|
Census schema version | |
Census build date | 2023-05-15 |
Dataset schema version | |
Number of datasets | 562 |
Cell and donor counts
Type | Homo sapiens | Mus musculus |
---|---|---|
Total cells | 53,794,728 | 4,086,032 |
Unique cells | 33,758,887 | 2,914,318 |
Number of donors | 12,493 | 1,362 |
Cell metadata
Category | Homo sapiens | Mus musculus |
---|---|---|
Assay | 20 | 9 |
Cell type | 604 | 226 |
Development stage | 164 | 30 |
Disease | 68 | 5 |
Self-reported ethnicity | 26 | NA |
Sex | 3 | 3 |
Suspension type | 2 | 2 |
Tissue | 227 | 51 |
Tissue general | 61 | 27 |
Compatibility with package versions
Due to the nature of the Census storage backend, the format version will change from time to time. Format upgrades are always backwards compatible, but they're not always forwards compatible, which means that reading a recent Census data version using an older version of the package might result in an error. We aim to guarantee the following policy:
Every Census package version released after an LTS will be able to read every Census data release until the next LTS.
The current LTS release (2023-12-15) is compatible with the following package versions:
1.10.x
1.11.x
1.12.x
1.13.x
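The list above can be checked programmatically before opening a release. This is a minimal sketch: the helper names are hypothetical, the set of compatible series is copied from the list above and must be updated by hand, and the distribution name "cellxgene-census" is assumed to be the name the Python package is installed under.

```python
from importlib import metadata

# Series copied from the compatibility list above (major.minor only).
COMPATIBLE_SERIES = {"1.10", "1.11", "1.12", "1.13"}


def is_compatible(package_version: str, series=COMPATIBLE_SERIES) -> bool:
    """Return True if `package_version` (e.g. "1.12.3") is in a listed series."""
    parts = package_version.split(".")
    if len(parts) < 2:
        return False
    return ".".join(parts[:2]) in series


def installed_census_version():
    """Return the installed package version, or None if it is not installed."""
    try:
        return metadata.version("cellxgene-census")  # assumed distribution name
    except metadata.PackageNotFoundError:
        return None


version = installed_census_version()
if version is None:
    print("cellxgene-census is not installed")
elif not is_compatible(version):
    print(f"Package version {version} is outside the listed series")
```

Comparing on the major.minor prefix mirrors the "1.10.x" notation used in the list, so any patch release within a listed series is accepted.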