CZ CELLxGENE Discover Census in AWS
The single-cell data from CZ CELLxGENE Discover Census are available for public access via Amazon Web Services (AWS).
This page describes what Census data are available in AWS and how to access them.
Census data available in AWS
The single-cell data from CZ CELLxGENE Discover included in Census (see inclusion criteria) are available either as Census-wide TileDB files or individual H5AD files of the source datasets.
Data specifications
Data | Format | Access API | Data schema | Root S3 bucket | Regions |
---|---|---|---|---|---|
Census-wide | TileDB | CELLxGENE-Census, TileDB-SOMA | CZ CELLxGENE Discover Census Schema | s3://cellxgene-census-public-us-west-2/cell-census/[tag]/soma/ | us-west-2 |
Source datasets | H5AD | AnnData | CZ CELLxGENE Discover Dataset Schema | s3://cellxgene-census-public-us-west-2/cell-census/[tag]/h5ads/ | us-west-2 |

See the next section for a definition of [tag].
Data release versioning
A data release is a Census build that is publicly hosted in AWS. A Census build is a TileDB-SOMA collection and its corresponding source H5AD files with the Census data from CZ CELLxGENE Discover.
Each Census build is named with a unique [tag], normally the build date, e.g. “2023-05-15”.
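To make the role of [tag] concrete, the following Python sketch assembles the two S3 locations listed in the data specifications table for a given build tag. The helper name `census_uris` is hypothetical and is not part of any Census package; only the bucket layout is taken from this page.

```python
# Hypothetical helper for illustration; not part of cellxgene_census.
CENSUS_ROOT = "s3://cellxgene-census-public-us-west-2/cell-census"


def census_uris(tag: str) -> dict[str, str]:
    """Return the TileDB-SOMA and H5AD locations for a Census build tag."""
    # A tag is a single path component, so reject empty values and slashes.
    if not tag or "/" in tag:
        raise ValueError(f"invalid Census tag: {tag!r}")
    return {
        "soma": f"{CENSUS_ROOT}/{tag}/soma/",
        "h5ads": f"{CENSUS_ROOT}/{tag}/h5ads/",
    }


print(census_uris("2023-07-25")["soma"])
# s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/soma/
```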
There are two types of data releases:
Long-Term Supported (LTS).
Weekly.
For more information and a list of all available LTS Census data releases, refer to Census data releases.
How to access AWS Census data
AWS CLI for programmatic downloads
Users can bulk-download Census data via the AWS CLI.
For example, to download the H5AD files of the Census LTS release 2023-07-25, users can execute the following from a shell session:
aws s3 sync --no-sign-request s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/h5ads/ ./h5ads/
And to download the TileDB files:
aws s3 sync --no-sign-request s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/soma/ ./soma/
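The two commands differ only in the release tag and the sub-folder, so scripted downloads can build them from parameters. The following is a minimal Python sketch, assuming the AWS CLI is installed and on the PATH. `build_sync_command` is a hypothetical helper name; the command it produces is the same one shown above.

```python
import shlex

BUCKET_ROOT = "s3://cellxgene-census-public-us-west-2/cell-census"


def build_sync_command(tag: str, kind: str, dest: str) -> list[str]:
    """Build the `aws s3 sync` argument list for one Census release folder.

    `kind` selects the sub-folder: "h5ads" (source datasets) or "soma" (TileDB).
    """
    if kind not in ("h5ads", "soma"):
        raise ValueError("kind must be 'h5ads' or 'soma'")
    source = f"{BUCKET_ROOT}/{tag}/{kind}/"
    # --no-sign-request sends unsigned requests, as in the commands above.
    return ["aws", "s3", "sync", "--no-sign-request", source, dest]


cmd = build_sync_command("2023-07-25", "h5ads", "./h5ads/")
# Print the command instead of running it; nothing is downloaded here.
print(shlex.join(cmd))
```

To start the download, the list can be passed to `subprocess.run(cmd, check=True)`.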
CELLxGENE Census API (Python and R)
This is the recommended method for accessing Census data. See the Census API quick start guide for full details.
For example, in Python users can create an iterator for the cell metadata Data Frame as follows:
import cellxgene_census

with cellxgene_census.open_soma() as census:
    cell_metadata = census["census_data"]["homo_sapiens"].obs.read(
        value_filter="sex == 'female' and cell_type in ['microglial cell', 'neuron']",
        column_names=["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"],
    )
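The `value_filter` argument is a plain string, so filters for other combinations of values can be assembled programmatically. The sketch below is an illustration only: `make_value_filter` is a hypothetical helper, not part of the Census API, and it assumes the values contain no single quotes. It reproduces the filter used above.

```python
def make_value_filter(sex: str, cell_types: list[str]) -> str:
    """Build a value_filter string of the form used in the example above."""
    # Assumption: values are plain labels; quotes would break the expression.
    for value in [sex, *cell_types]:
        if "'" in value:
            raise ValueError(f"unsupported quote in value: {value!r}")
    quoted = ", ".join(f"'{cell_type}'" for cell_type in cell_types)
    return f"sex == '{sex}' and cell_type in [{quoted}]"


value_filter = make_value_filter("female", ["microglial cell", "neuron"])
print(value_filter)
# sex == 'female' and cell_type in ['microglial cell', 'neuron']
```

The resulting string can be passed as `value_filter=value_filter` in the `obs.read(...)` call.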
If a local copy of the Census data exists, users can access it by providing the path to the soma/ folder:
import cellxgene_census

with cellxgene_census.open_soma(uri="local/path/to/soma/") as census:
    ...
If a copy of the Census data exists in a private S3 bucket, users can access it by providing the URI of the soma/ folder in that bucket. This also requires customizing the TileDB configuration options to specify the bucket’s AWS region and to enable signed requests for S3 API calls. This can be done as follows:
import cellxgene_census

uri = "s3://my-private-data-bucket/cell-census/2023-07-25/soma/"
tiledb_config = {
    "vfs.s3.no_sign_request": "false",
    "vfs.s3.region": "us-east-1",
}

with cellxgene_census.open_soma(uri=uri, tiledb_config=tiledb_config) as census:
    ...
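When several private copies live in different regions, the configuration can be generated per bucket. The following sketch wraps the two options shown above in small functions. The names `private_bucket_config` and `open_private_census` are hypothetical, and the sketch assumes AWS credentials are already available in the environment.

```python
def private_bucket_config(region: str) -> dict[str, str]:
    """Return the TileDB options from the example above for a given region."""
    if not region:
        raise ValueError("an AWS region is required")
    return {
        # "false" turns signed requests on, as a private bucket requires.
        "vfs.s3.no_sign_request": "false",
        "vfs.s3.region": region,
    }


def open_private_census(uri: str, region: str):
    """Open a privately hosted Census copy; use the result in a `with` block."""
    # Imported lazily so the config helper above has no dependencies.
    import cellxgene_census

    return cellxgene_census.open_soma(
        uri=uri, tiledb_config=private_bucket_config(region)
    )


print(private_bucket_config("us-east-1"))
```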
TileDB-SOMA API (Python and R)
The Census API provides convenience wrappers around TileDB-SOMA for accessing the Census data hosted in AWS. Users can also interact with the Census TileDB data directly via the TileDB-SOMA APIs. Please refer to the TileDB-SOMA documentation for full details on usage.
For example, in Python users can create an iterator for the cell metadata Data Frame as follows:
import cellxgene_census
import tiledbsoma

uri = "s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/soma/"
ctx = cellxgene_census.get_default_soma_context()

with tiledbsoma.open(uri, context=ctx) as census:
    cell_metadata = census["census_data"]["homo_sapiens"].obs.read(
        value_filter="sex == 'female' and cell_type in ['microglial cell', 'neuron']",
        column_names=["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"],
    )