FAQ

Last updated: September 2024.

Why should I use the Census?

The Census provides efficient low-latency access via Python and R APIs to most single-cell RNA data from CZ CELLxGENE Discover. To accelerate computational research, the Census enables researchers to:

  • Access slices of data from more than 500 single-cell datasets spanning about 33M unique cells (50M total) and >60K genes from human or mouse.

  • Access data with standardized cell and gene metadata and harmonized labels.

  • Easily load multi-dataset slices into Scanpy or Seurat.

  • Implement out-of-core (a.k.a. online) operations for larger-than-memory processes.

For example, a user can easily get “all T-cells from Lung with COVID-19” into AnnData, Seurat, or into data chunks that fit in memory via PyArrow or R Arrow.
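As a minimal sketch of the example above, the query could be expressed with cellxgene_census.get_anndata and a metadata filter. The label values ("T cell", "lung", "COVID-19") and the helper name build_obs_filter are assumptions made for illustration; check the actual labels in the Census cell metadata before relying on them.

```python
def build_obs_filter(**equals):
    """Join column == 'value' terms with 'and' (hypothetical helper)."""
    return " and ".join(f"{col} == '{val}'" for col, val in equals.items())


# Assumed label values; verify them against the Census obs metadata.
OBS_FILTER = build_obs_filter(
    cell_type="T cell", tissue_general="lung", disease="COVID-19"
)


def load_slice(obs_filter=OBS_FILTER):
    """Fetch the matching human cells as AnnData (needs network access)."""
    import cellxgene_census  # imported lazily so the helper above works offline

    with cellxgene_census.open_soma() as census:
        return cellxgene_census.get_anndata(
            census,
            organism="Homo sapiens",
            obs_value_filter=obs_filter,
        )
```

Calling load_slice() opens the Census, runs the query, and returns the result in memory, so it is only suitable for slices that fit in RAM.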

The Census is not suited for:

  • Access to non-standardized cell metadata and gene metadata available in the original datasets.

  • Access to the author-contributed normalized expression values or embeddings.

  • Access to all data from just one dataset.

  • Access to non-RNA or spatial data present in CZ CELLxGENE Discover, as these are not yet supported in the Census.

If you’d like to perform any of the above tasks, you can access web downloads directly from the CZ CELLxGENE Discover Datasets feature. Click here for more information about downloading published data on CELLxGENE Discover.

What data is contained in the Census?

Most non-spatial RNA data from CZ CELLxGENE Discover is included. You can see a general description of these data and their organization in the schema description, or you can use the APIs to explore the data as indicated in this tutorial.

How do I cite the use of the Census for a publication?

Please follow the citation guidelines offered by CZ CELLxGENE Discover.

Why does the Census not have a normalized layer or embeddings?

The Census does not have normalized counts or embeddings because:

  • The original normalized values and embeddings are not harmonized or integrated across datasets and are therefore numerically incompatible.

  • We have not implemented a general-purpose normalization or embedding generation method to be used across all Census data.

If you have any suggestions for methods that our team should explore, please share them with us via a feature request in the GitHub repository.

How does the Census differ from other tools?

The Census differs from existing single-cell tools by providing fast, efficient access to the largest corpus of standardized single-cell data from CZ CELLxGENE Discover via TileDB-SOMA. Thus, single-cell data from about 33M unique cells (50M total) across >60K genes, with 11 standardized cell metadata variables and harmonized GENCODE annotations, are ready for:

  • Opening and reading data at low latency from the cloud.

  • Querying and accessing data using metadata filters.

  • Loading and creating AnnData objects.

  • Loading and creating Seurat objects.

  • From Python, creating PyArrow objects, SciPy sparse matrices, NumPy arrays, and Pandas data frames.

  • From R, creating R Arrow objects, sparse matrices (via the Matrix package), and standard data frames and (dense) matrices.
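The metadata-filter and Pandas items above can be sketched together: the function below reads filtered cell metadata for human data and converts it to a Pandas data frame. It is a sketch under assumptions; read_obs is a hypothetical helper name, and the filter and column names passed to it must exist in the Census schema.

```python
def read_obs(census, value_filter, columns):
    """Return the cell metadata rows matching value_filter as a pandas DataFrame."""
    human = census["census_data"]["homo_sapiens"]
    return (
        human.obs.read(value_filter=value_filter, column_names=columns)
        .concat()      # merge the streamed Arrow tables into a single table
        .to_pandas()   # convert the Arrow table to a pandas DataFrame
    )
```

With an open Census handle (for example from cellxgene_census.open_soma()), a call such as read_obs(census, "tissue_general == 'lung'", ["cell_type", "disease"]) would return only the two requested columns, which keeps the download small.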

Can I query human and mouse data in a single query?

No. Mouse and human data cannot be queried together in a single query because each organism uses its own organism-specific gene annotations.
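A common workaround is to run one query per organism and keep the results separate. The sketch below assumes that approach; query_per_organism and load_both are hypothetical helper names, and the results should not be concatenated naively because the gene sets differ between organisms.

```python
ORGANISMS = ("Homo sapiens", "Mus musculus")


def query_per_organism(fetch, organisms=ORGANISMS):
    """Call fetch(organism) once per organism and key the results by organism."""
    return {organism: fetch(organism) for organism in organisms}


def load_both(obs_filter):
    """Run the same metadata filter against each organism (needs network access)."""
    import cellxgene_census  # imported lazily so query_per_organism works offline

    with cellxgene_census.open_soma() as census:
        return query_per_organism(
            lambda organism: cellxgene_census.get_anndata(
                census, organism=organism, obs_value_filter=obs_filter
            )
        )
```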

Where are the Census data hosted?

The Census data are publicly hosted, free of cost, in an Amazon Web Services (AWS) S3 bucket in the us-west-2 region.

Can I retrieve the original H5AD datasets from which the Census was built?

Yes, you can use the API function download_source_h5ad to do so. For usage, please see the reference documentation on the documentation site or directly from Python or R:

Python

import cellxgene_census
help(cellxgene_census.download_source_h5ad)

R

library(cellxgene.census)
?download_source_h5ad
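In Python, a download could look like the sketch below. It assumes you already have a dataset ID (for example from the dataset_id column of the cell metadata); the file naming scheme and the helper names h5ad_target and fetch_source_h5ad are illustrative assumptions, not part of the API.

```python
from pathlib import Path


def h5ad_target(dataset_id, out_dir="."):
    """Build a local file path for a dataset (hypothetical naming scheme)."""
    return Path(out_dir) / f"{dataset_id}.h5ad"


def fetch_source_h5ad(dataset_id, out_dir="."):
    """Download the source H5AD for dataset_id (needs network access)."""
    import cellxgene_census  # imported lazily so h5ad_target works offline

    target = h5ad_target(dataset_id, out_dir)
    cellxgene_census.download_source_h5ad(dataset_id, to_path=str(target))
    return target
```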

How can I increase the performance of my queries?

Because the data is accessed over the internet, the main limiting factors for data queries are usually bandwidth and client location. We recommend the following tactics to increase query efficiency:

  • Use a computer connected to high-speed internet.

  • Use an ethernet connection rather than a wifi connection.

  • If possible, use cloud computing located on the west coast of the US.

  • Highly recommended: AWS EC2 instances in the us-west-2 region.

Can I use conda to install the Census Python API?

There is no conda package available for cellxgene-census. However, you can use conda in combination with pip to install the package in a conda environment:

conda create -n census_env python=3.10
conda activate census_env
pip install cellxgene-census

How can I ask for support?

You can either submit a GitHub issue or, for quick support, join the CZI Science Community on Slack (czi.co/science-slack) and ask questions in the #cellxgene-census-users channel.

How can I ask for new features?

You can submit a feature request in the GitHub repository.

How can I contribute my data to the Census?

To inquire about submitting your data to CZ CELLxGENE Discover, click here. If your data request is accepted, the data will automatically be included in the Census if it meets the biological criteria defined in the Census schema.

Why do I get an ArraySchema error when opening the Census?

You may get this error if you are trying to open a Census data build with an old version of the Census API. Please update your Python or R Census package.
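Before updating, it can help to confirm which versions are installed. The following is a small sketch using only the Python standard library; installed_version is a hypothetical helper name, and listing tiledbsoma alongside cellxgene-census is an assumption about which package versions are relevant.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for name in ("cellxgene-census", "tiledbsoma"):
    print(name, installed_version(name) or "not installed")
```

If the reported version is old, pip install -U cellxgene-census upgrades the Python package.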

If the error persists, please file a GitHub issue.

Why do I get an error when running import cellxgene_census on Databricks?

This can occur if the cellxgene_census Python package is installed in a Databricks notebook using %sh pip install cellxgene_census. This command does not restart the Python process after installing cellxgene_census, so any pip dependencies that were pre-installed by the Databricks Runtime environment but upgraded for cellxgene_census are not reloaded with their new versions. You may see numba- or pyarrow-related errors, for example.

To fix this, install the package using one of the following Databricks notebook “magic” commands:

pip install -U cellxgene-census

or

%pip install -U cellxgene-census

These commands restart the Python process after installing the cellxgene-census package, similar to using dbutils.library.restartPython(). They also ensure that the package is installed on all nodes of a multi-node cluster.

Alternatively, you can configure your cluster to install the cellxgene-census package each time it is started by adding this package to the “Libraries” tab on the cluster configuration page per these instructions.

How do I connect to the Census from behind a proxy?

TileDB doesn’t use the typical proxy environment variables, so you’ll need to specify the proxy settings directly in the TileDB configuration:

import cellxgene_census

# Replace the ellipses with your proxy host and port info
config = {
    "vfs.s3.proxy_host": ...,
    "vfs.s3.proxy_port": ...,
}

census = cellxgene_census.open_soma(tiledb_config=config)

It may not be obvious that a proxy is the issue. It typically manifests as a TileDBError reporting that a timeout was reached during a request to the S3 bucket.
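If your proxy is already described by a URL (for instance the value you would normally put in an HTTPS_PROXY variable), a small helper can translate it into the two TileDB settings shown above. This is a sketch under assumptions: tiledb_proxy_config and open_census_via_proxy are hypothetical helper names, and passing the port as a string is an assumption about what TileDB accepts.

```python
from urllib.parse import urlsplit


def tiledb_proxy_config(proxy_url):
    """Translate scheme://host:port into TileDB S3 proxy settings."""
    parts = urlsplit(proxy_url)
    if not parts.hostname or parts.port is None:
        raise ValueError(f"expected scheme://host:port, got {proxy_url!r}")
    return {
        "vfs.s3.proxy_host": parts.hostname,
        "vfs.s3.proxy_port": str(parts.port),  # assumption: port given as a string
    }


def open_census_via_proxy(proxy_url):
    """Open the Census through the given proxy (needs network access)."""
    import cellxgene_census  # imported lazily so the helper above works offline

    return cellxgene_census.open_soma(tiledb_config=tiledb_proxy_config(proxy_url))
```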

You can read more about how to configure how TileDB communicates with S3 here.