CZ CELLxGENE Discover Census in AWS

The single-cell data from CZ CELLxGENE Discover Census are available for public access via Amazon Web Services (AWS).

This page describes what Census data are available in AWS and how to access them.

Census data available in AWS

The single-cell data from CZ CELLxGENE Discover included in Census (see inclusion criteria) are available either as Census-wide TileDB files or individual H5AD files of the source datasets.

Data specifications

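| Data            | Format | Access API                    | Data schema                         | Root S3 bucket                                                  | Region    |
|-----------------|--------|-------------------------------|-------------------------------------|-----------------------------------------------------------------|-----------|
| Census-wide     | TileDB | CELLxGENE-Census, TileDB-SOMA | CZ CELLxGENE Discover Census Schema | s3://cellxgene-census-public-us-west-2/cell-census/[tag]/soma/  | us-west-2 |
| Source datasets | H5AD   | AnnData                       | CZ CELLxGENE Discover Dataset Schema | s3://cellxgene-census-public-us-west-2/cell-census/[tag]/h5ads/ | us-west-2 |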

See the next section for a definition of [tag].

Data release versioning

A data release is a Census build that is publicly hosted in AWS. A Census build consists of a TileDB-SOMA collection containing the Census data from CZ CELLxGENE Discover, together with the corresponding source H5AD files.

Any given Census build is named with a unique [tag], normally the date of build, e.g. “2023-05-15”.
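
As an illustration of how [tag] fits into the S3 layout listed under the data specifications, the following minimal sketch builds the two root URIs for a given build. The helper name census_uris is hypothetical and not part of any Census API; only the bucket name and folder layout come from this page.

```python
# Sketch: derive the S3 locations of a Census build from its [tag].
# The bucket name and path layout are taken from the data specifications;
# the function name `census_uris` is a hypothetical helper.
CENSUS_BUCKET = "s3://cellxgene-census-public-us-west-2/cell-census"


def census_uris(tag: str) -> dict[str, str]:
    """Return the TileDB-SOMA and H5AD root URIs for the build named `tag`."""
    return {
        "soma": f"{CENSUS_BUCKET}/{tag}/soma/",
        "h5ads": f"{CENSUS_BUCKET}/{tag}/h5ads/",
    }


uris = census_uris("2023-07-25")
print(uris["soma"])
print(uris["h5ads"])
```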

There are two types of data releases:

  • Long-Term Supported (LTS).

  • Weekly.

For more information, and for a list of all available LTS Census data releases, please refer to Census data releases.

How to access AWS Census data

AWS CLI for programmatic downloads

Users can bulk-download Census data via the AWS CLI.

For example, to download the H5AD files of the Census LTS release 2023-07-25, users can execute the following from a shell session:

aws s3 sync --no-sign-request s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/h5ads/ ./h5ads/

And to download the TileDB files:

aws s3 sync --no-sign-request s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/soma/ ./soma/
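
For scripted downloads, the same commands could be assembled from Python. The sketch below only builds and prints the command line; the helper name sync_command is hypothetical, and it assumes the AWS CLI is installed if the command is later executed.

```python
import shlex

# Bucket root as listed in the data specifications.
BUCKET = "s3://cellxgene-census-public-us-west-2/cell-census"


def sync_command(tag: str, kind: str, dest: str) -> list[str]:
    """Build the `aws s3 sync` argument list for one build.

    `kind` is "soma" (TileDB files) or "h5ads" (source H5AD files).
    This is a hypothetical helper, not part of the AWS CLI or the Census API.
    """
    if kind not in ("soma", "h5ads"):
        raise ValueError(f"kind must be 'soma' or 'h5ads', got {kind!r}")
    source = f"{BUCKET}/{tag}/{kind}/"
    # --no-sign-request allows anonymous access to the public bucket.
    return ["aws", "s3", "sync", "--no-sign-request", source, dest]


cmd = sync_command("2023-07-25", "h5ads", "./h5ads/")
print(shlex.join(cmd))
```

To actually run the download, the list can be passed to subprocess.run(cmd, check=True).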

CELLxGENE Census API (Python and R)

This is the recommended method for accessing Census data. Please see the Census API quick start guide for full details.

For example, in Python users can create an iterator for the cell metadata Data Frame as follows:

import cellxgene_census

with cellxgene_census.open_soma() as census:
    cell_metadata = census["census_data"]["homo_sapiens"].obs.read(
        value_filter = "sex == 'female' and cell_type in ['microglial cell', 'neuron']",
        column_names = ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]
    )
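
The value_filter argument is a plain string, so it can be assembled programmatically when the criteria come from user input or configuration. The following is a minimal sketch under the assumption that all conditions are combined with "and" and that string values contain no quote characters; build_value_filter is a hypothetical helper, not a Census API function.

```python
def build_value_filter(**criteria) -> str:
    """Combine column criteria into a single value_filter string.

    A list or tuple value becomes an `in` clause; any other value
    becomes an equality clause. Clauses are joined with `and`.
    Assumes string values contain no quote characters.
    """
    clauses = []
    for column, value in criteria.items():
        if isinstance(value, (list, tuple)):
            clauses.append(f"{column} in {list(value)!r}")
        else:
            clauses.append(f"{column} == {value!r}")
    return " and ".join(clauses)


value_filter = build_value_filter(
    sex="female",
    cell_type=["microglial cell", "neuron"],
)
print(value_filter)
```

The resulting string can be passed as value_filter to obs.read() in place of a hand-written literal.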

If a local copy of the Census data exists, users can access it by providing the path to the soma/ folder:

import cellxgene_census

with cellxgene_census.open_soma(uri="local/path/to/soma/") as census:
   ...

If a copy of the Census data exists in a private S3 bucket, users can access it by providing the URI of the soma/ folder in that bucket. This also requires customizing the TileDB configuration options to specify the bucket’s AWS region and to indicate that signed requests should be used for S3 API calls. This can be done as follows:

import cellxgene_census

uri = "s3://my-private-data-bucket/cell-census/2023-07-25/soma/"

tiledb_config={"vfs.s3.no_sign_request": "false",
               "vfs.s3.region": "us-east-1"}

with cellxgene_census.open_soma(uri=uri, tiledb_config=tiledb_config) as census:
   ...

TileDB-SOMA API (Python and R)

The Census API provides convenience wrappers around TileDB-SOMA for accessing the Census data hosted in AWS. Users can also interact with the Census TileDB data directly via the TileDB-SOMA APIs. Please refer to the TileDB-SOMA documentation for full details on usage.

For example, in Python users can create an iterator for the cell metadata Data Frame as follows:

import cellxgene_census
import tiledbsoma

uri = "s3://cellxgene-census-public-us-west-2/cell-census/2023-07-25/soma/"
ctx = cellxgene_census.get_default_soma_context()

with tiledbsoma.open(uri, context=ctx) as census:
    cell_metadata = census["census_data"]["homo_sapiens"].obs.read(
        value_filter = "sex == 'female' and cell_type in ['microglial cell', 'neuron']",
        column_names = ["assay", "cell_type", "tissue", "tissue_general", "suspension_type", "disease"]
    )