Add a Custom Dataset

This guide explains how to integrate your own dataset into cz-benchmarks.

Requirements

For single-cell datasets:

The dataset file must be an .h5ad file conforming to the AnnData on-disk format.
The AnnData object’s var_names must specify the ensembl_id for each gene OR var must contain a column named ensembl_id.
The AnnData object must meet the validation requirements of the specific models that the dataset will be used to benchmark. This means that:
- obs and var each contain the required metadata columns, as specified by the models’ required_obs_keys and required_var_keys properties, respectively.
- The ensembl_id values must be valid for the models’ accepted organisms, as specified by the available_organisms property.
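
The metadata requirement above can be checked by hand before any model is involved. The following is a minimal sketch and not part of cz-benchmarks: missing_keys and check_metadata are hypothetical helper names, and the code relies only on the obs.columns and var.columns attributes that an AnnData object exposes. The key lists in the usage lines are placeholders; real ones would come from a model's required_obs_keys and required_var_keys properties.

```python
from types import SimpleNamespace


def missing_keys(present, required):
    """Return the required keys that are absent from `present`, in order."""
    present = set(present)
    return [key for key in required if key not in present]


def check_metadata(adata, required_obs_keys, required_var_keys):
    """Collect readable problems instead of stopping at the first one."""
    problems = []
    for key in missing_keys(adata.obs.columns, required_obs_keys):
        problems.append(f"obs is missing column '{key}'")
    for key in missing_keys(adata.var.columns, required_var_keys):
        problems.append(f"var is missing column '{key}'")
    return problems


# Stand-in object exposing only the attributes used above; with real data,
# pass the AnnData object itself.
demo = SimpleNamespace(
    obs=SimpleNamespace(columns=["cell_type", "batch"]),
    var=SimpleNamespace(columns=["ensembl_id"]),
)

# Placeholder key lists (assumptions, not taken from any real model).
print(check_metadata(demo, ["cell_type", "donor_id"], ["ensembl_id"]))
# ["obs is missing column 'donor_id'"]
```

An empty list means every required column is present. It does not replace the validation that the models themselves perform.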

Steps to Add Your Dataset

1. Prepare Your Data

Save your data as an AnnData object in .h5ad format.
Ensure the following:
- Metadata columns (e.g., cell type, batch) are included in obs.
- Gene identifiers are defined in var, either as var_names or in an ensembl_id column (see Requirements).

2. Create a Custom Configuration File

Update src/czbenchmarks/conf/datasets.yaml by adding a new dataset entry:

datasets:
  ...

  my_dataset:
    _target_: czbenchmarks.datasets.SingleCellDataset
    path: ~/path_to_your_data/my_data.h5ad
    organism: ${organism:HUMAN}

Explanation:
- datasets: Defines the datasets to be loaded.
- my_dataset: A unique identifier for your dataset.
- _target_: Specifies the Dataset class to instantiate. Currently, cz-benchmarks supports two Dataset types: src.czbenchmarks.datasets.single_cell.SingleCellDataset and src.czbenchmarks.datasets.single_cell.PerturbationSingleCellDataset.
- path: Path to your .h5ad file. This may be a local filesystem path or an S3 URL (s3://...).
- organism: Specifies the organism, which must be a value defined in src.czbenchmarks.datasets.types.Organism (e.g., HUMAN, MOUSE).

You may add multiple datasets to this file, as children of datasets.
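
A quick sanity check on an entry before loading it can catch typos early. The sketch below is hypothetical and independent of cz-benchmarks: describe_entry is an invented helper, it treats the three fields from the example entry as required (an assumption), and it expands ~ in local paths itself without claiming that the library does the same. The entry is passed as a plain dict, such as one obtained by parsing the YAML file.

```python
import os

# Fields used by the example entry; treating all three as mandatory is an assumption.
REQUIRED_FIELDS = ("_target_", "path", "organism")


def describe_entry(name, entry):
    """Check a dataset entry (a plain dict) and classify its path."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        raise ValueError(f"dataset '{name}' is missing fields: {', '.join(missing)}")

    path = entry["path"]
    if path.startswith("s3://"):
        # Remote object: leave the URL untouched.
        return {"name": name, "location": "s3", "path": path}
    # Local file: expand a leading ~ to the user's home directory.
    return {"name": name, "location": "local", "path": os.path.expanduser(path)}


entry = {
    "_target_": "czbenchmarks.datasets.SingleCellDataset",
    "path": "~/path_to_your_data/my_data.h5ad",
    "organism": "${organism:HUMAN}",
}
info = describe_entry("my_dataset", entry)
print(info["location"], info["path"])
```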

3. Load and Validate Your Dataset in Python

Use the following Python code to load your dataset:

from czbenchmarks.datasets.utils import load_dataset

# Instantiate the `SingleCellDataset` object from the configuration specified in `datasets.yaml`
dataset = load_dataset("my_dataset")

# Load the H5AD file into memory as an AnnData object, storing it in the `ANNDATA` input "slot" of the dataset.
dataset.load_data()

# Check that the dataset meets the basic requirements.
dataset.validate()

# Inspect the loaded AnnData object.
print(dataset.get_input("ANNDATA"))

Fix any loading or validation errors, as needed.

Tips for Customization

Preprocessing: If your dataset requires specialized preprocessing, consider subclassing BaseDataset in your project.
Validation: Ensure that organism-specific requirements (e.g., gene name prefixes) are met.
Test with Models: Specific models may have additional validation requirements, so run each applicable model against your dataset to confirm that it is fully compliant.
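
As an example of an organism-specific check, the sketch below flags gene IDs that do not follow the Ensembl stable gene ID pattern for a given organism (ENSG for human, ENSMUSG for mouse, followed by 11 digits). It is a hypothetical helper, not the validation that cz-benchmarks or any model performs. It also tolerates a trailing ".version" suffix, which is an assumption, since whether versioned IDs are accepted may vary by model.

```python
import re

# Ensembl stable gene ID prefixes for the two organisms named in this guide.
GENE_ID_PREFIX = {"HUMAN": "ENSG", "MOUSE": "ENSMUSG"}


def invalid_gene_ids(gene_ids, organism):
    """Return the IDs that do not look like Ensembl gene IDs for `organism`."""
    prefix = GENE_ID_PREFIX[organism]
    # Prefix, exactly 11 digits, and an optional ".version" suffix.
    pattern = re.compile(rf"{prefix}\d{{11}}(\.\d+)?")
    return [gene_id for gene_id in gene_ids if not pattern.fullmatch(gene_id)]


ids = ["ENSG00000000003", "ENSMUSG00000000001", "TSPAN6"]
print(invalid_gene_ids(ids, "HUMAN"))  # ['ENSMUSG00000000001', 'TSPAN6']
print(invalid_gene_ids(ids, "MOUSE"))  # ['ENSG00000000003', 'TSPAN6']
```

With real data, the IDs would come from var_names or from the ensembl_id column of var.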