Datasets

The czbenchmarks.datasets module defines the dataset abstraction used across all benchmark pipelines. It provides a uniform and type-safe way to manage dataset inputs and outputs, ensuring compatibility with models and tasks.

Overview

cz-benchmarks currently supports single-cell RNA-seq data stored in the AnnData H5AD format. The dataset system is extensible: other data modalities can be supported by defining new dataset types.

Key Components

  • BaseDataset
    An abstract class that provides methods for:

    • Storing typed inputs and model outputs (set_input, set_output)

    • Type validation via DataType enums

    • Serialization and deserialization using dill

    • Loading/unloading memory-intensive data

    All dataset types must inherit from BaseDataset.

  • SingleCellDataset
    A concrete implementation of BaseDataset for single-cell data.

    Responsibilities:

    • Loads AnnData (H5AD) files via anndata.read_h5ad

    • Stores cell and gene metadata in .obs and .var, respectively, and the expression matrix in .X

    • Performs organism-based validation using the Organism enum

    • Validates gene name prefixes and presence of expected columns

    Automatically sets:

    • DataType.ANNDATA

    • DataType.METADATA

    • DataType.ORGANISM

  • PerturbationSingleCellDataset
    Subclass of SingleCellDataset designed for perturbation benchmarks.

    Responsibilities:

    • Validates that the columns named by condition_key and split_key (e.g., condition and split) are present

    • Stores control and perturbed cells

    • Computes and stores DataType.PERTURBATION_TRUTH as ground-truth reference

    Automatically filters adata to only include control cells for inference.

    Example valid perturbation formats:

    • "ctrl": control

    • "GENE+ctrl": single-gene perturbation

    • "GENE1+GENE2": combinatorial perturbation

  • DataType
    Enum that defines all valid input and output types (e.g., ANNDATA, METADATA, EMBEDDING) together with their expected Python types (e.g., AnnData, pd.DataFrame, np.ndarray).

  • Organism
    Enum that specifies supported species (e.g., HUMAN, MOUSE) and gene prefixes (e.g., ENSG and ENSMUSG, respectively).
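
The perturbation label formats listed under PerturbationSingleCellDataset can be split into gene names with a few lines of string handling. The sketch below is an illustration only: parse_condition is a hypothetical helper, not part of czbenchmarks, and it assumes that "+" is the separator and "ctrl" is the control token, as in the examples above.

```python
def parse_condition(condition: str, control: str = "ctrl") -> list[str]:
    """Return the perturbed gene names encoded in a condition label.

    Hypothetical helper: assumes "+" separates tokens and `control`
    marks unperturbed cells.
    """
    tokens = [token.strip() for token in condition.split("+")]
    if not all(tokens):
        # Catches empty labels and dangling separators such as "GENE+"
        raise ValueError(f"Malformed condition label: {condition!r}")
    # Drop the control token; whatever remains is the set of perturbed genes
    return [token for token in tokens if token != control]


labels = ["ctrl", "GENE+ctrl", "GENE1+GENE2"]
parsed = {label: parse_condition(label) for label in labels}
```

Under these assumptions a control label yields an empty list, a single-gene label yields one name, and a combinatorial label yields two.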

Adding a New Dataset

To define a custom dataset:

  1. Inherit from BaseDataset and implement:

    • _validate(self) — raise exceptions for missing or malformed data

    • load_data(self) — populate self.inputs with required values

    • unload_data(self) — clear memory-heavy inputs (e.g., adata) before serialization

  2. Register all required inputs using self.set_input(data_type, value)

  3. Store model outputs using self.set_output(model_type, data_type, value)

  4. Use the DataType enum to enforce type safety and input validation
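
To make step 4 concrete, here is a minimal sketch of how an enum can carry an expected Python type and how a setter can check values against it. Everything in it is a stand-in: this DataType and the TypedStore class are simplified, hypothetical versions written for illustration, and the real czbenchmarks classes may be implemented differently.

```python
from enum import Enum


class DataType(Enum):
    """Stand-in enum: each member carries a name and its expected Python type."""
    METADATA = ("metadata", dict)
    EMBEDDING = ("embedding", list)

    @property
    def dtype(self) -> type:
        return self.value[1]


class TypedStore:
    """Hypothetical container mirroring the set_input/get_input pattern."""

    def __init__(self) -> None:
        self._inputs = {}

    def set_input(self, data_type: DataType, value) -> None:
        # Reject values whose type does not match the enum member
        if not isinstance(value, data_type.dtype):
            raise TypeError(
                f"{data_type.name} expects {data_type.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
        self._inputs[data_type] = value

    def get_input(self, data_type: DataType):
        if data_type not in self._inputs:
            raise KeyError(f"Input {data_type.name} has not been set")
        return self._inputs[data_type]


store = TypedStore()
store.set_input(DataType.METADATA, {"cell_type": ["T cell", "B cell"]})

try:
    store.set_input(DataType.EMBEDDING, "not a list")
    rejected = False
except TypeError:
    rejected = True
```

Checking the type at the point of assignment means a malformed value fails immediately, rather than later inside a model or task.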

Example Skeleton

from czbenchmarks.datasets.base import BaseDataset
from czbenchmarks.datasets.types import DataType, Organism
import anndata as ad

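class MyCustomDataset(BaseDataset):
    def load_data(self):
        # Read the H5AD file and register each input under its DataType
        adata = ad.read_h5ad(self.path)
        self.set_input(DataType.ANNDATA, adata)
        self.set_input(DataType.METADATA, adata.obs)
        self.set_input(DataType.ORGANISM, Organism.HUMAN)

    def unload_data(self):
        # Drop memory-heavy inputs so they are not written during serialization
        self._inputs.pop(DataType.ANNDATA, None)
        self._inputs.pop(DataType.METADATA, None)

    def _validate(self):
        # Raise an exception rather than assert, which is skipped under python -O
        adata = self.get_input(DataType.ANNDATA)
        if "my_custom_key" not in adata.obs.columns:
            raise ValueError("Missing required .obs column: my_custom_key")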

Accessing Inputs and Outputs

Use the following methods for safe access:

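# Import path for ModelType is assumed; adjust it to match your installed version
from czbenchmarks.models.types import ModelType

dataset.get_input(DataType.ANNDATA)
dataset.get_input(DataType.METADATA)
dataset.get_output(ModelType.SCVI, DataType.EMBEDDING)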

Serialization Support

Datasets can be serialized to disk after model inference. Internally, dill is used to support complex Python objects like AnnData.

dataset.serialize("/tmp/my_dataset.dill")
loaded = BaseDataset.deserialize("/tmp/my_dataset.dill")

# Don't forget to reload memory-intensive fields
loaded.load_data()
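
The unload, serialize, and reload cycle can be illustrated without the library. The sketch below is not the czbenchmarks implementation: TinyDataset is a hypothetical stand-in, the standard-library pickle module is used in place of dill, and only lightweight state is written to disk, on the assumption that heavy fields are excluded from the file.

```python
import os
import pickle
import tempfile


class TinyDataset:
    """Hypothetical stand-in for a dataset with one memory-heavy field."""

    def __init__(self, source: list) -> None:
        self.source = source  # lightweight description of the data
        self.matrix = None    # memory-heavy field, filled by load_data()

    def load_data(self) -> None:
        # Stand-in for reading a large file from disk
        self.matrix = [[x * 2 for x in self.source] for _ in range(3)]

    def unload_data(self) -> None:
        self.matrix = None

    def serialize(self, path: str) -> None:
        # Only the lightweight state is written; the matrix is left out
        with open(path, "wb") as fh:
            pickle.dump({"source": self.source}, fh)

    @classmethod
    def deserialize(cls, path: str) -> "TinyDataset":
        with open(path, "rb") as fh:
            state = pickle.load(fh)
        return cls(state["source"])


dataset = TinyDataset([1, 2, 3])
dataset.load_data()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tiny.pkl")
    dataset.serialize(path)
    loaded = TinyDataset.deserialize(path)

was_empty = loaded.matrix is None  # heavy field is absent right after loading
loaded.load_data()                 # reload it explicitly, as in the snippet above
```

Keeping heavy fields out of the file keeps it small, at the cost of an explicit load_data() call after deserialization.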

Tips for Developers

  • AnnData Views: Slicing an AnnData object returns a view of the original. Call .copy() on the slice to get an independent object and avoid “view” issues in Scanpy.

  • Organism Validation: Always set DataType.ORGANISM and validate var_names with Organism.prefix.

  • Gene Names: Ensure .var contains a feature_name or ensembl_id column, depending on model requirements.

  • Metadata Compatibility: Validate that all label keys required by tasks (e.g., cell_type, sex, batch) exist in .obs.
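
The organism and metadata tips above can be combined into a single pre-flight check. The following is a rough sketch under stated assumptions: this Organism enum is a simplified stand-in for the real one, check_dataset is a hypothetical helper, and plain lists stand in for adata.var_names and adata.obs.columns.

```python
from enum import Enum


class Organism(Enum):
    """Stand-in enum: (species name, expected gene ID prefix)."""
    HUMAN = ("homo_sapiens", "ENSG")
    MOUSE = ("mus_musculus", "ENSMUSG")

    @property
    def prefix(self) -> str:
        return self.value[1]


def check_dataset(var_names, obs_columns, organism, required_keys):
    """Return a list of human-readable problems (empty if none are found)."""
    problems = []
    # Gene IDs must start with the prefix of the declared organism
    bad = [gene for gene in var_names if not gene.startswith(organism.prefix)]
    if bad:
        problems.append(
            f"{len(bad)} gene ID(s) lack prefix {organism.prefix!r}, e.g. {bad[0]}"
        )
    # Every label key required by downstream tasks must be a metadata column
    missing = [key for key in required_keys if key not in obs_columns]
    if missing:
        problems.append(f"missing obs columns: {missing}")
    return problems


problems = check_dataset(
    var_names=["ENSG00000141510", "ENSMUSG00000059552"],
    obs_columns=["cell_type", "batch"],
    organism=Organism.HUMAN,
    required_keys=["cell_type", "sex", "batch"],
)
```

Here the check reports two problems: one gene ID without the human prefix and the missing sex column. With a real AnnData object, the same function could be called with adata.var_names and adata.obs.columns.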