# Datasets

The `czbenchmarks.datasets` module defines the dataset abstraction used across all benchmark pipelines. It provides a uniform and type-safe way to manage dataset inputs and outputs, ensuring compatibility with models and tasks.

## Overview

cz-benchmarks currently supports single-cell RNA-seq data stored in the [`AnnData`](https://anndata.readthedocs.io/en/stable/) H5AD format. The dataset system is extensible and can be used for other data modalities by creating new dataset types.

## Key Components

- [BaseDataset](../autoapi/czbenchmarks/datasets/base/index)

  An abstract class that provides methods for:

  - Storing typed inputs and model outputs (`set_input`, `set_output`)
  - Type validation via `DataType` enums
  - Serialization and deserialization using [`dill`](https://dill.readthedocs.io/en/latest/)
  - Loading/unloading memory-intensive data

  All dataset types must inherit from `BaseDataset`.

- [SingleCellDataset](../autoapi/czbenchmarks/datasets/single_cell/index)

  A concrete implementation of `BaseDataset` for single-cell data.

  Responsibilities:

  - Loads anndata files via `anndata.read_h5ad`
  - Stores metadata as `.obs` or `.var` and the expression matrix as `.X`
  - Performs organism-based validation using the `Organism` enum
  - Validates gene name prefixes and presence of expected columns

  Automatically sets:

  - `DataType.ANNDATA`
  - `DataType.METADATA`
  - `DataType.ORGANISM`

- [PerturbationSingleCellDataset](../autoapi/czbenchmarks/datasets/single_cell/index)

  Subclass of `SingleCellDataset` designed for perturbation benchmarks.

  Responsibilities:

  - Validates presence of `condition_key` and `split_key` (e.g., `condition`, `split`)
  - Stores control and perturbed cells
  - Computes and stores `DataType.PERTURBATION_TRUTH` as ground-truth reference

  Automatically filters `adata` to only include control cells for inference.


  Example valid perturbation formats:

  - `"ctrl"`: control
  - `"GENE+ctrl"`: single-gene perturbation
  - `"GENE1+GENE2"`: combinatorial perturbation

- [DataType](../autoapi/czbenchmarks/datasets/types/index)

  Defines all valid input and output types (e.g., `ANNDATA`, `METADATA`, `EMBEDDING`) together with their expected Python types (e.g., `AnnData`, `pd.DataFrame`, `np.ndarray`).

- [Organism](../autoapi/czbenchmarks/datasets/types/index)

  Enum that specifies supported species (e.g., HUMAN, MOUSE) and gene prefixes (e.g., `ENSG` and `ENSMUSG`, respectively).

## Adding a New Dataset

To define a custom dataset:

1. **Inherit from `BaseDataset`** and implement:

   - `_validate(self)`: raise exceptions for missing or malformed data
   - `load_data(self)`: populate `self.inputs` with required values
   - `unload_data(self)`: clear memory-heavy inputs (e.g., `adata`) before serialization

2. **Register all required inputs** using `self.set_input(data_type, value)`

3. **Store model outputs** using `self.set_output(model_type, data_type, value)`

4. **Use the `DataType` enum** to enforce type safety and input validation

### Example Skeleton

```python
from czbenchmarks.datasets.base import BaseDataset
from czbenchmarks.datasets.types import DataType, Organism
import anndata as ad


class MyCustomDataset(BaseDataset):
    def load_data(self):
        # Read the H5AD file and register each required input by type.
        adata = ad.read_h5ad(self.path)
        self.set_input(DataType.ANNDATA, adata)
        self.set_input(DataType.METADATA, adata.obs)
        self.set_input(DataType.ORGANISM, Organism.HUMAN)

    def unload_data(self):
        # Drop memory-heavy inputs so the dataset can be serialized cheaply.
        self._inputs.pop(DataType.ANNDATA, None)
        self._inputs.pop(DataType.METADATA, None)

    def _validate(self):
        adata = self.get_input(DataType.ANNDATA)
        # Raise explicitly: `assert` statements are skipped under `python -O`.
        if "my_custom_key" not in adata.obs.columns:
            raise ValueError("Missing key!")
```

## Accessing Inputs and Outputs

Use the following methods for safe access:

```python
# `dataset` is an already-loaded dataset instance; ModelType is the enum
# identifying which model produced a given output.
dataset.get_input(DataType.ANNDATA)
dataset.get_input(DataType.METADATA)
dataset.get_output(ModelType.SCVI, DataType.EMBEDDING)
```

## Serialization Support

Datasets can be serialized to disk after model inference. Internally, [`dill`](https://dill.readthedocs.io/en/latest/) is used to support complex Python objects like `AnnData`.

```python
from czbenchmarks.datasets.base import BaseDataset

dataset.serialize("/tmp/my_dataset.dill")
loaded = BaseDataset.deserialize("/tmp/my_dataset.dill")

# Don't forget to reload memory-intensive fields
loaded.load_data()
```

## Tips for Developers

- **AnnData Views:** Use `.copy()` when slicing to avoid "view" issues in Scanpy.
- **Organism Validation:** Always set `DataType.ORGANISM` and validate `var_names` with `Organism.prefix`.
- **Gene Names:** Ensure `.var` has `feature_name` or `ensembl_id` depending on model requirements.
- **Metadata Compatibility:** Validate that all label keys required by tasks (e.g., `cell_type`, `sex`, `batch`) exist in `.obs`.

## Related References

- [Add Custom Dataset Guide](../how_to_guides/add_custom_dataset)
- [BaseDataset API](../autoapi/czbenchmarks/datasets/base/index)
- [SingleCellDataset API](../autoapi/czbenchmarks/datasets/single_cell/index)
- [DataType Enum](../autoapi/czbenchmarks/datasets/types/index)
- [Organism Enum](../autoapi/czbenchmarks/datasets/types/index)
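
## Example: Checking Gene ID Prefixes

The organism validation tip under Tips for Developers can be illustrated with a minimal, library-free sketch. It is not the cz-benchmarks implementation: `ExampleOrganism` is a hypothetical stand-in for the real `Organism` enum, its name strings are assumptions, and `find_invalid_gene_ids` is an illustrative helper that does not exist in the package. Only the `ENSG` and `ENSMUSG` prefixes come from the text above.

```python
from enum import Enum
from typing import Iterable, List


class ExampleOrganism(Enum):
    """Hypothetical stand-in for the real Organism enum: (name, gene prefix)."""

    HUMAN = ("homo_sapiens", "ENSG")    # name string is an assumption
    MOUSE = ("mus_musculus", "ENSMUSG")  # name string is an assumption

    @property
    def prefix(self) -> str:
        # Second element of the tuple is the expected gene ID prefix.
        return self.value[1]


def find_invalid_gene_ids(
    var_names: Iterable[str], organism: ExampleOrganism
) -> List[str]:
    """Return the gene IDs that do not start with the organism's prefix.

    Illustrative helper only; in practice `var_names` would typically be
    `adata.var_names` from the loaded AnnData object.
    """
    return [
        str(name) for name in var_names
        if not str(name).startswith(organism.prefix)
    ]


if __name__ == "__main__":
    mixed = ["ENSG00000141510", "ENSMUSG00000059552"]
    bad = find_invalid_gene_ids(mixed, ExampleOrganism.HUMAN)
    if bad:
        print(f"{len(bad)} gene ID(s) lack prefix "
              f"{ExampleOrganism.HUMAN.prefix!r}: {bad}")
```

A check along these lines could be called from a custom dataset's `_validate` method, raising an exception when the returned list is non-empty.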