Exploring the Census Datasets table
This tutorial demonstrates basic use of the census_datasets dataframe, which contains metadata about the Census source datasets. This metadata can be joined to the cell metadata dataframe (obs) via the dataset_id column.
Contents
Fetching the datasets table.
Fetching the expression data from a single dataset.
Downloading the original source H5AD file of a dataset.
⚠️ Note that the Census RNA data includes duplicate cells present across multiple datasets. Duplicate cells can be filtered in or out using the cell metadata variable is_primary_data, which is described in the Census schema.
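As a minimal sketch of that filtering, the snippet below uses a small hand-made pandas dataframe standing in for the cell metadata (obs). The rows and values are invented for illustration; only the column names dataset_id and is_primary_data come from the Census. The trailing comment shows how the same condition might be passed to a Census query as a value filter.

```python
import pandas as pd

# Toy stand-in for the Census cell metadata (obs); all values are invented.
obs = pd.DataFrame(
    {
        "soma_joinid": [0, 1, 2, 3],
        "dataset_id": ["ds-a", "ds-a", "ds-b", "ds-b"],
        "is_primary_data": [True, True, False, True],
    }
)

# Keep only the rows flagged as the primary copy of a cell.
primary_obs = obs[obs["is_primary_data"]]
print(len(obs), "cells before filtering,", len(primary_obs), "after")

# Against a real Census, the same condition can be expressed as a value filter, e.g.
#   cellxgene_census.get_anndata(census, organism="Mus musculus",
#                                obs_value_filter="is_primary_data == True")
```

Filtering on the server side with a value filter avoids fetching duplicate cells in the first place, which matters at Census scale.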
Fetching the datasets table
Each Census contains a top-level dataframe itemizing the datasets contained therein. You can read this into a pandas.DataFrame.
[1]:
import cellxgene_census
census = cellxgene_census.open_soma()
census_datasets = census["census_info"]["datasets"].read().concat().to_pandas()
# for convenience, indexing on the soma_joinid which links this to other census data.
census_datasets = census_datasets.set_index("soma_joinid")
census_datasets
The "stable" release is currently 2023-07-25. Specify 'census_version="2023-07-25"' in future calls to open_soma() to ensure data consistency.
[1]:
soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count |
---|---|---|---|---|---|---|---|
0 | e2c257e7-6f79-487c-b81c-39451cd4ab3c | Spatial multiomics map of trophoblast developm... | 10.1038/s41586-023-05869-0 | f171db61-e57e-4535-a06a-35d8b6ef8f2b | donor_p13_trophoblasts | f171db61-e57e-4535-a06a-35d8b6ef8f2b.h5ad | 31497 |
1 | e2c257e7-6f79-487c-b81c-39451cd4ab3c | Spatial multiomics map of trophoblast developm... | 10.1038/s41586-023-05869-0 | ecf2e08e-2032-4a9e-b466-b65b395f4a02 | All donors trophoblasts | ecf2e08e-2032-4a9e-b466-b65b395f4a02.h5ad | 67070 |
2 | e2c257e7-6f79-487c-b81c-39451cd4ab3c | Spatial multiomics map of trophoblast developm... | 10.1038/s41586-023-05869-0 | 74cff64f-9da9-4b2a-9b3b-8a04a1598040 | All donors all cell states (in vivo) | 74cff64f-9da9-4b2a-9b3b-8a04a1598040.h5ad | 286326 |
3 | f7cecffa-00b4-4560-a29a-8ad626b8ee08 | Mapping single-cell transcriptomes in the intr... | 10.1016/j.ccell.2022.11.001 | 5af90777-6760-4003-9dba-8f945fec6fdf | Single-cell transcriptomic datasets of Renal c... | 5af90777-6760-4003-9dba-8f945fec6fdf.h5ad | 270855 |
4 | 3f50314f-bdc9-40c6-8e4a-b0901ebfbe4c | Single-cell sequencing links multiregional imm... | 10.1016/j.ccell.2021.03.007 | bd65a70f-b274-4133-b9dd-0d1431b6af34 | Single-cell sequencing links multiregional imm... | bd65a70f-b274-4133-b9dd-0d1431b6af34.h5ad | 167283 |
... | ... | ... | ... | ... | ... | ... | ... |
588 | 180bff9c-c8a5-4539-b13b-ddbc00d643e6 | Molecular characterization of selectively vuln... | 10.1038/s41593-020-00764-7 | f9ad5649-f372-43e1-a3a8-423383e5a8a2 | Molecular characterization of selectively vuln... | f9ad5649-f372-43e1-a3a8-423383e5a8a2.h5ad | 8168 |
589 | a72afd53-ab92-4511-88da-252fb0e26b9a | Single-cell atlas of peripheral immune respons... | 10.1038/s41591-020-0944-y | 456e8b9b-f872-488b-871d-94534090a865 | Single-cell atlas of peripheral immune respons... | 456e8b9b-f872-488b-871d-94534090a865.h5ad | 44721 |
590 | 38833785-fac5-48fd-944a-0f62a4c23ed1 | Construction of a human cell landscape at sing... | 10.1038/s41586-020-2157-4 | 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf | Construction of a human cell landscape at sing... | 2adb1f8a-a6b1-4909-8ee8-484814e2d4bf.h5ad | 598266 |
591 | 5d445965-6f1a-4b68-ba3a-b8f765155d3a | A molecular cell atlas of the human lung from ... | 10.1038/s41586-020-2922-4 | e04daea4-4412-45b5-989e-76a9be070a89 | Krasnow Lab Human Lung Cell Atlas, Smart-seq2 | e04daea4-4412-45b5-989e-76a9be070a89.h5ad | 9409 |
592 | 5d445965-6f1a-4b68-ba3a-b8f765155d3a | A molecular cell atlas of the human lung from ... | 10.1038/s41586-020-2922-4 | 8c42cfd0-0b0a-46d5-910c-fc833d83c45e | Krasnow Lab Human Lung Cell Atlas, 10X | 8c42cfd0-0b0a-46d5-910c-fc833d83c45e.h5ad | 65662 |
593 rows × 7 columns
The sum of cells across all datasets should match the number of cells across all SOMA experiments (human, mouse).
[2]:
# Count cells across all experiments
experiments_total_cells = 0
print("Count by experiment:")
for organism_name in census["census_data"].keys():
    num_cells = len(cellxgene_census.get_obs(census, organism_name, column_names=["soma_joinid"]))
    print(f"\t{num_cells} cells in {organism_name}")
    experiments_total_cells += num_cells
print(f"\nFound {experiments_total_cells} cells in all experiments.")
# Count cells across all datasets
print(f"Found {census_datasets.dataset_total_cell_count.sum()} cells in all datasets.")
Count by experiment:
5255245 cells in mus_musculus
56400873 cells in homo_sapiens
Found 61656118 cells in all experiments.
Found 61656118 cells in all datasets.
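The same consistency check can be sketched per dataset rather than in total. The example below uses toy stand-ins for the datasets table and for obs (the ids and counts are invented, and the names census_datasets_toy and obs_toy are hypothetical). It assumes that dataset_total_cell_count counts the cells of each dataset that are present in the Census.

```python
import pandas as pd

# Toy stand-ins; dataset ids and counts are invented for illustration.
census_datasets_toy = pd.DataFrame(
    {"dataset_id": ["ds-a", "ds-b"], "dataset_total_cell_count": [3, 2]}
)
obs_toy = pd.DataFrame({"dataset_id": ["ds-a", "ds-a", "ds-b", "ds-a", "ds-b"]})

# Count cells per dataset in obs, then line the counts up with the datasets table.
observed = obs_toy["dataset_id"].value_counts().rename("observed_cell_count")
comparison = census_datasets_toy.set_index("dataset_id").join(observed)
comparison["matches"] = (
    comparison["dataset_total_cell_count"] == comparison["observed_cell_count"]
)
print(comparison)
```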
Fetching the expression data from a single dataset
Let's pick one dataset to slice out of the Census and turn it into an in-memory AnnData object. This can be used with the Scanpy toolchain. You can also save this AnnData locally using the AnnData write API.
[3]:
census_datasets[census_datasets.dataset_id == "0bd1a1de-3aee-40e0-b2ec-86c7a30c7149"]
[3]:
soma_joinid | collection_id | collection_name | collection_doi | dataset_id | dataset_title | dataset_h5ad_path | dataset_total_cell_count |
---|---|---|---|---|---|---|---|
522 | 0b9d8a04-bb9d-44da-aa27-705bb65b54eb | Tabula Muris Senis | 10.1038/s41586-020-2496-1 | 0bd1a1de-3aee-40e0-b2ec-86c7a30c7149 | Bone marrow - A single-cell transcriptomic atl... | 0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad | 40220 |
Create a query on the mouse experiment, "RNA" measurement, for the dataset_id.
[4]:
adata = cellxgene_census.get_anndata(
    census, organism="Mus musculus", obs_value_filter="dataset_id == '0bd1a1de-3aee-40e0-b2ec-86c7a30c7149'"
)
adata
[4]:
AnnData object with n_obs × n_vars = 40220 × 52392
obs: 'soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id'
var: 'soma_joinid', 'feature_id', 'feature_name', 'feature_length'
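Because obs carries a dataset_id column, the dataset metadata can be attached to each cell with an ordinary pandas merge. The following is a minimal sketch on invented toy data; the dataframes datasets and obs are stand-ins for census_datasets and adata.obs, and the ids, titles and cell types are made up.

```python
import pandas as pd

# Toy stand-in for census_datasets (invented ids and titles).
datasets = pd.DataFrame(
    {
        "dataset_id": ["ds-a", "ds-b"],
        "collection_name": ["Collection A", "Collection B"],
        "dataset_title": ["Bone marrow", "Lung"],
    }
)

# Toy stand-in for adata.obs (invented cells).
obs = pd.DataFrame(
    {
        "cell_type": ["B cell", "T cell", "monocyte"],
        "dataset_id": ["ds-a", "ds-a", "ds-b"],
    }
)

# A left join keeps every cell and attaches its dataset's metadata.
# validate="many_to_one" raises if dataset_id is not unique in the datasets table.
obs_with_datasets = obs.merge(datasets, on="dataset_id", how="left", validate="many_to_one")
print(obs_with_datasets)
```

With the real objects, the equivalent call would presumably be adata.obs.merge(census_datasets, on="dataset_id", how="left"). Note that merge returns a new dataframe with a fresh integer index, so the original obs index is not preserved.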
Downloading the original source H5AD file of a dataset
You can download the original H5AD file for any given dataset. This is the same H5AD you can download from CZ CELLxGENE Discover, and it may contain additional submitter-provided information that was not included in the Census.
To do this, you can either fetch the file's location in the cloud or download it directly to your system using the cellxgene-census package.
[5]:
# Option 1: Direct download
cellxgene_census.download_source_h5ad(
    "0bd1a1de-3aee-40e0-b2ec-86c7a30c7149", to_path="Tabula_Muris_Senis-bone_marrow.h5ad"
)
[6]:
# Option 2: Get location and download via preferred method
uri = cellxgene_census.get_source_h5ad_uri("0bd1a1de-3aee-40e0-b2ec-86c7a30c7149")
uri
# You can now download the H5AD from a shell with the AWS CLI, e.g. `aws s3 cp <uri> ./`, where <uri> is the 'uri' value shown below.
[6]:
{'uri': 's3://cellxgene-data-public/cell-census/2023-07-25/h5ads/0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad',
's3_region': 'us-west-2'}
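The returned locator holds both the S3 URI and its region, so a shell command for the AWS CLI can be assembled from it. The sketch below copies the locator from the output above and only builds and prints the command; it does not run it. The --no-sign-request flag is an assumption that the bucket allows anonymous access; drop it if you use your own AWS credentials.

```python
import shlex

# Locator copied from the output of get_source_h5ad_uri() above.
locator = {
    "uri": "s3://cellxgene-data-public/cell-census/2023-07-25/h5ads/0bd1a1de-3aee-40e0-b2ec-86c7a30c7149.h5ad",
    "s3_region": "us-west-2",
}

# Assemble the AWS CLI command as a list, then quote it safely for a shell.
# --no-sign-request is an assumption: it requests anonymous (unsigned) access.
command = shlex.join(
    [
        "aws", "s3", "cp",
        "--region", locator["s3_region"],
        "--no-sign-request",
        locator["uri"],
        "./",
    ]
)
print(command)
```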
Close the census
[7]:
census.close()