Datasets
The czbenchmarks.datasets
module defines the dataset abstraction used across all benchmark pipelines. It provides a uniform and type-safe way to manage dataset inputs and outputs, ensuring compatibility with models and tasks.
Overview
cz-benchmarks currently supports single-cell RNA-seq data stored in the AnnData
H5AD format. The dataset system is extensible and can be used for other data modalities by creating new dataset types.
Key Components
BaseDataset
An abstract class that provides methods for:Storing typed inputs and model outputs (
set_input
,set_output
)Type validation via
DataType
enumsSerialization and deserialization using
dill
Loading/unloading memory-intensive data
All dataset types must inherit from
BaseDataset
.SingleCellDataset
A concrete implementation ofBaseDataset
for single-cell data.Responsibilities:
Loads anndata files via
anndata.read_h5ad
Stores metadata as
.obs
or.var
and the expression matrix as.X
Performs organism-based validation using the
Organism
enumValidates gene name prefixes and presence of expected columns
Automatically sets:
DataType.ANNDATA
DataType.METADATA
DataType.ORGANISM
PerturbationSingleCellDataset
Subclass ofSingleCellDataset
designed for perturbation benchmarks.Responsibilities:
Validates presence of
condition_key
andsplit_key
(e.g.,condition
,split
)Stores control and perturbed cells
Computes and stores
DataType.PERTURBATION_TRUTH
as ground-truth reference
Automatically filters
adata
to only include control cells for inference.Example valid perturbation formats:
"ctrl"
: control"GENE+ctrl"
: single-gene perturbation"GENE1+GENE2"
: combinatorial perturbation
DataType
Defines all valid input and output types (e.g.,ANNDATA
,METADATA
,EMBEDDING
, etc.) with expected Python types (AnnData
,pd.DataFrame
,np.ndarray
, etc.)Organism
Enum that specifies supported species (e.g., HUMAN, MOUSE) and gene prefixes (e.g.,ENSG
andENSMUSG
, respectively).
Adding a New Dataset
To define a custom dataset:
Inherit from
BaseDataset
and implement:_validate(self)
— raise exceptions for missing or malformed dataload_data(self)
— populateself.inputs
with required valuesunload_data(self)
— clear memory-heavy inputs (e.g.,adata
) before serialization
Register all required inputs using
self.set_input(data_type, value)
Store model outputs using
self.set_output(model_type, data_type, value)
Use the
DataType
enum to enforce type safety and input validation
Example Skeleton
from czbenchmarks.datasets.base import BaseDataset
from czbenchmarks.datasets.types import DataType, Organism
import anndata as ad
class MyCustomDataset(BaseDataset):
def load_data(self):
adata = ad.read_h5ad(self.path)
self.set_input(DataType.ANNDATA, adata)
self.set_input(DataType.METADATA, adata.obs)
self.set_input(DataType.ORGANISM, Organism.HUMAN)
def unload_data(self):
self._inputs.pop(DataType.ANNDATA, None)
self._inputs.pop(DataType.METADATA, None)
def _validate(self):
adata = self.get_input(DataType.ANNDATA)
assert "my_custom_key" in adata.obs.columns, "Missing key!"
Accessing Inputs and Outputs
Use the following methods for safe access:
dataset.get_input(DataType.ANNDATA)
dataset.get_input(DataType.METADATA)
dataset.get_output(ModelType.SCVI, DataType.EMBEDDING)
Serialization Support
Datasets can be serialized to disk after model inference. Internally, dill
is used to support complex Python objects like AnnData
.
dataset.serialize("/tmp/my_dataset.dill")
loaded = BaseDataset.deserialize("/tmp/my_dataset.dill")
# Don't forget to reload memory-intensive fields
loaded.load_data()
Tips for Developers
AnnData Views: Use
.copy()
when slicing to avoid “view” issues in Scanpy.Organism Validation: Always set
DataType.ORGANISM
and validatevar_names
withOrganism.prefix
.Gene Names: Ensure
.var
hasfeature_name
orensembl_id
depending on model requirements.Metadata Compatibility: Validate that all label keys required by tasks (e.g.,
cell_type
,sex
,batch
) exist in.obs
.