# Census data and schema

This page provides a user-friendly overview of the Census contents and its schema. The full schema specification is available [here](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md).

**Contents:**

1. [Schema](#schema)
2. [Data included in the Census](#data-included-in-the-census)
3. [SOMA objects](#soma-objects)

## Schema

The Census is a collection of **[SOMA objects](#soma-objects)** organized in the following hierarchy.

![image](census-spatial-schema.svg)

The Census is a `SOMACollection` with two top-level items:

1. `"census_info"` for the Census summary info.
2. `"census_data"` for the single-cell data and metadata.

### Census Summary Info `"census_info"`

A `SOMACollection` of tables with information about the Census as a whole. It has the following items:

- `"summary"`: High-level information about this Census, e.g., build date, total cell count, etc.
- `"datasets"`: A table with all datasets from CELLxGENE Discover used to create the Census.
- `"summary_cell_counts"`: Cell counts stratified by relevant cell metadata.

---

### Census Single-Cell Data `"census_data"`

Data for each organism is stored in an independent `SOMAExperiment` object, which is a specialized form of a `SOMACollection`. Each one stores a data matrix (cells by genes), cell metadata, gene metadata, and a feature presence matrix.

This is how the data is organized for one organism, *Homo sapiens*:

- `["homo_sapiens"].obs`: Cell metadata.
- `["homo_sapiens"].ms["RNA"].X`: Data matrices: raw counts in `X["raw"]`, and library-size normalized counts in `X["normalized"]` (only available in Census schema V1.1.0 and above).
- `["homo_sapiens"].ms["RNA"].var`: Gene metadata.
- `["homo_sapiens"].ms["RNA"]["feature_dataset_presence_matrix"]`: A sparse boolean array indicating which genes were measured in each dataset.
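
The hierarchy described above can be sketched with plain nested dictionaries. This is only an illustration of the layout: the dictionaries stand in for SOMA collections and are not the SOMA API, and the names `CENSUS_LAYOUT` and `leaf_paths` are hypothetical helpers introduced for this example. The leaf descriptions are taken from the lists above.

```python
# Plain dictionaries stand in for SOMA collections; this is NOT the SOMA API.
# Keys mirror the item names listed above; values are short descriptions.
CENSUS_LAYOUT = {
    "census_info": {
        "summary": "high-level information (build date, total cell count, ...)",
        "datasets": "datasets from CELLxGENE Discover used to create the Census",
        "summary_cell_counts": "cell counts stratified by cell metadata",
    },
    "census_data": {
        "homo_sapiens": {
            "obs": "cell metadata",
            "ms": {
                "RNA": {
                    "X": {
                        "raw": "raw counts",
                        "normalized": "library-size normalized counts",
                    },
                    "var": "gene metadata",
                    "feature_dataset_presence_matrix": "genes measured per dataset",
                }
            },
        }
    },
}


def leaf_paths(node, prefix=()):
    """Yield (path, description) for every leaf of the nested layout."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from leaf_paths(child, prefix + (key,))
    else:
        yield "/".join(prefix), node


if __name__ == "__main__":
    for path, description in leaf_paths(CENSUS_LAYOUT):
        print(f"{path}: {description}")
```

Running the script prints one line per item, which makes it easy to see that every organism-level object sits under `census_data` while the summary tables sit under `census_info`.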
---

## Data Included in the Census

All data from [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/) that meets the following criteria is included in the Census:

- Cells from human or mouse.
- **Spatial and non-spatial RNA data**; see the full list of included sequencing technologies [here](https://github.com/chanzuckerberg/cellxgene-census/blob/main/docs/cellxgene_census_schema.md#assays).
- Raw counts.
- Only standardized cell and gene metadata as described in the CELLxGENE Discover dataset [schema](https://github.com/chanzuckerberg/single-cell-curation/blob/main/schema/3.0.0/schema.md).

⚠️ Note that the data includes:

- **Full-gene sequencing read counts** (e.g., Smart-Seq2) and **molecule counts** (e.g., 10X).
- **Duplicate cells** present across multiple datasets. These can be filtered in or out using the cell metadata variable `is_primary_data`.

---

## SOMA Objects

You can find the full SOMA specification [here](https://github.com/single-cell-data/SOMA/blob/main/abstract_specification.md#foundational-types).

The following is a short description of the main SOMA objects used by the Census:

- **`DenseNDArray`**: A dense, N-dimensional array, with offset (zero-based) integer indexing on each dimension.
- **`SparseNDArray`**: The same as `DenseNDArray` but sparse, with support for point indexing (disjoint index access).
- **`DataFrame`**: A multi-column table with user-defined column names and value types, with support for point indexing.
- **`Collection`**: A persistent container of named SOMA objects.
- **`Experiment`**: A class that represents a single-cell experiment. It always contains two objects:
  - `obs`: A `DataFrame` with primary annotations on the observation axis.
  - `ms`: A `Collection` of measurements, each composed of `X` matrices and axis annotation matrices or data frames (e.g., `var`, `varm`, `obsm`, etc.).
- **`Scene`**: A `Collection` of `obsl`, `varl`, and `img`.
- **`Spatial`**: A collection of `Scene` objects.
- **`obs_spatial_presence`**: A `DataFrame` to map observations to `Scene` objects.
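
To make the difference between dense and sparse storage, and the idea of point indexing, more concrete, here is a toy sketch in plain Python. It is not the SOMA implementation: `TinySparseBoolArray` and its methods are hypothetical names invented for this example, and treating rows as datasets and columns as genes is an assumption made only to echo the feature presence matrix.

```python
class TinySparseBoolArray:
    """Toy 2-D sparse boolean array: only the True coordinates are stored."""

    def __init__(self, shape, true_coords):
        self.shape = shape
        self._true = set()
        for row, col in true_coords:
            # Zero-based integer indexing on each dimension.
            if not (0 <= row < shape[0] and 0 <= col < shape[1]):
                raise IndexError(f"coordinate {(row, col)} outside shape {shape}")
            self._true.add((row, col))

    def read_points(self, rows, cols):
        """Point indexing: return the stored coordinates whose row is in
        `rows` and whose column is in `cols`. Indices may be disjoint."""
        rows, cols = set(rows), set(cols)
        return sorted((r, c) for r, c in self._true if r in rows and c in cols)

    def to_dense(self):
        """Materialize every cell, as a dense array would store it."""
        n_rows, n_cols = self.shape
        return [[(r, c) in self._true for c in range(n_cols)] for r in range(n_rows)]


# Assumed orientation for illustration: 3 datasets (rows) by 5 genes (columns).
presence = TinySparseBoolArray(
    shape=(3, 5),
    true_coords=[(0, 0), (0, 2), (1, 2), (2, 1), (2, 4)],
)

# Read a disjoint selection: datasets 0 and 2, genes 2 and 4.
selected = presence.read_points(rows=[0, 2], cols=[2, 4])
dense = presence.to_dense()
```

The sparse form keeps 5 coordinates, while the dense form holds all 15 cells. That gap is why a sparse array is the natural fit when most entries are empty.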