Add a Custom Dataset
This guide explains how to integrate your own dataset into cz-benchmarks.
Requirements
For single-cell datasets:
The dataset file must be an .h5ad file conforming to the AnnData on-disk format.
The AnnData object's var_names must specify the ensembl_id for each gene OR var must contain a column named ensembl_id.
The AnnData object must meet the validation requirements of the specific models that the dataset will be used to benchmark. This means that:
  obs and var each contain the required metadata columns, as specified by the models' required_obs_keys and required_var_keys properties, respectively.
  The ensembl_id values must be valid for the models' accepted organisms, as specified by the available_organisms property.
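The requirements above can be checked before involving any model. The following is a minimal sketch, not part of cz-benchmarks: check_requirements is a hypothetical helper, the key names cell_type and batch are illustrative, and the test that var_names "look like" Ensembl IDs (starting with ENS) is a simplifying assumption. It only touches obs.columns, var.columns, and var_names, so it should work on a real AnnData object as well as on the stand-in used here.

```python
from types import SimpleNamespace


def check_requirements(adata, required_obs_keys, required_var_keys):
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    # Gene identifiers: either var has an "ensembl_id" column,
    # or var_names themselves hold the IDs.
    # Assumption for this sketch: Ensembl IDs start with "ENS".
    has_column = "ensembl_id" in adata.var.columns
    names_look_like_ids = len(adata.var_names) > 0 and all(
        str(name).startswith("ENS") for name in adata.var_names
    )
    if not (has_column or names_look_like_ids):
        problems.append("no ensembl_id column in var, and var_names are not Ensembl IDs")

    # Metadata columns that a model lists as required.
    missing_obs = sorted(set(required_obs_keys) - set(adata.obs.columns))
    if missing_obs:
        problems.append(f"obs is missing required columns: {missing_obs}")

    missing_var = sorted(set(required_var_keys) - set(adata.var.columns))
    if missing_var:
        problems.append(f"var is missing required columns: {missing_var}")

    return problems


# Stand-in for an AnnData object so the sketch runs without extra dependencies.
fake = SimpleNamespace(
    obs=SimpleNamespace(columns=["cell_type"]),
    var=SimpleNamespace(columns=[]),
    var_names=["ENSG00000141510", "ENSG00000012048"],
)
print(check_requirements(fake, ["cell_type", "batch"], []))
```

In practice, the two key lists would come from the required_obs_keys and required_var_keys properties of the models you intend to benchmark.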
Steps to Add Your Dataset
1. Prepare Your Data
Save your data as an AnnData object in .h5ad format.
Ensure the following:
  Metadata columns (e.g., cell type, batch) are included in obs.
  Gene names are properly defined in var.
2. Create a Custom Configuration File
Update src/czbenchmarks/conf/datasets.yaml by adding a new dataset entry:
datasets:
  ...
  my_dataset:
    _target_: czbenchmarks.datasets.SingleCellDataset
    path: ~/path_to_your_data/my_data.h5ad
    organism: ${organism:HUMAN}
Explanation:
datasets: Defines the datasets to be loaded.
my_dataset: A unique identifier for your dataset.
_target_: Specifies the Dataset class to instantiate. Currently, cz-benchmarks supports the src.czbenchmarks.datasets.single_cell.SingleCellDataset and src.czbenchmarks.datasets.single_cell.PerturbationSingleCellDataset Dataset types.
path: Path to your .h5ad file. This may be a local filesystem path or an S3 URL (s3://...).
organism: Specify the organism, which must be a value from src.czbenchmarks.datasets.types.Organism (e.g., HUMAN, MOUSE).
You may add multiple datasets to this file, as children of datasets.
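Since path accepts either a local filesystem path or an S3 URL, it can help to confirm how a configured value will be interpreted before loading. The sketch below is a standalone pre-flight check using only the standard library; describe_dataset_path is a hypothetical helper and example-bucket is a made-up bucket name. It does not reflect how cz-benchmarks resolves paths internally.

```python
import os
from urllib.parse import urlparse


def describe_dataset_path(path):
    """Classify a configured path as an S3 URL or a local file path."""
    parsed = urlparse(path)
    if parsed.scheme == "s3":
        # netloc is the bucket; path is the object key with a leading slash.
        return {"kind": "s3", "bucket": parsed.netloc, "key": parsed.path.lstrip("/")}
    # Anything else is treated as local: expand "~" and make the path absolute.
    local = os.path.abspath(os.path.expanduser(path))
    return {"kind": "local", "path": local, "exists": os.path.isfile(local)}


print(describe_dataset_path("~/path_to_your_data/my_data.h5ad"))
print(describe_dataset_path("s3://example-bucket/data/my_data.h5ad"))
```

For a local path, an exists value of False is an early sign that loading will fail.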
3. Load and Validate Your Dataset in Python
Use the following Python code to load your dataset:
from czbenchmarks.datasets.utils import load_dataset
# Instantiate the `SingleCellDataset` object from the configuration specified in `datasets.yaml`
dataset = load_dataset("my_dataset")
# Load the H5AD file into memory as an AnnData object, storing in the `ANNDATA` input "slot" of the dataset.
dataset.load_data()
# Ensure the basic requirements are met by the Dataset
dataset.validate()
print(dataset.get_input("ANNDATA"))
Fix any loading or validation errors, as needed.
Tips for Customization
Preprocessing: If your dataset requires specialized preprocessing, consider subclassing BaseDataset in your project.
Validation: Ensure organism-specific validations (e.g. gene name prefixes) are met.
Test with Models: Specific models may have additional validation requirements, so you will need to invoke applicable models with your specific dataset to ensure that it is fully compliant.
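As an illustration of the organism-specific gene prefix check mentioned in the tips above, the sketch below flags gene IDs that do not match the Ensembl stable ID pattern for a given organism. It is an independent pre-check, not the validation cz-benchmarks performs: find_invalid_gene_ids is a hypothetical helper, the string keys HUMAN and MOUSE stand in for the Organism values, and the pattern (prefix, 11 digits, optional version suffix) is an assumption about the ID format.

```python
import re

# Assumption: Ensembl gene ID prefixes for the two organisms named in this guide.
ENSEMBL_GENE_PREFIXES = {"HUMAN": "ENSG", "MOUSE": "ENSMUSG"}


def find_invalid_gene_ids(gene_ids, organism):
    """Return the IDs that do not match the expected pattern for the organism."""
    prefix = ENSEMBL_GENE_PREFIXES[organism]
    # Prefix, then 11 digits, then an optional version suffix such as ".13".
    # Matching the digits avoids accepting other species whose prefix merely
    # begins with the same letters.
    pattern = re.compile(re.escape(prefix) + r"\d{11}(\.\d+)?")
    return [g for g in gene_ids if not pattern.fullmatch(str(g))]


ids = ["ENSG00000141510", "ENSMUSG00000059552", "TP53"]
print(find_invalid_gene_ids(ids, "HUMAN"))
```

Running this against var_names (or the ensembl_id column) before calling validate() can make a mismatch between the data and the configured organism easier to spot.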