# Add a Custom Dataset This guide explains how to integrate your own dataset into cz-benchmarks. ## Requirements For single-cell datasets: - The dataset file must be an `.h5ad` file conforming to the [AnnData on-disk format](https://anndata.readthedocs.io/en/latest/fileformat-prose.html#on-disk-format). - The AnnData object's `var_names` must specify the `ensembl_id` for each gene OR `var` must contain a column named `ensembl_id`. - The AnnData object must meet the validation requirements of the specific models that the dataset will be used to benchmark. This means that: - `obs` and `var` each contain the required metadata columns, as specified by the models' `required_obs_keys` and `required_var_keys` properties, respectively. - The `ensemble_id` values must be valid for the models' accepted organisms, as specified by the `available_organisms` property. ## Steps to Add Your Dataset ### 1. Prepare Your Data - Save your data as an AnnData object in `.h5ad` format. - Ensure the following: - Metadata columns (e.g., cell type, batch) are included in `obs`. - Gene names are properly defined in `var`. ### 2. Create a Custom Configuration File - Update `src/czbenchmarks/conf/datasets.yaml` by adding a new dataset entry: ```yaml datasets: ... my_dataset: _target_: czbenchmarks.datasets.SingleCellDataset path: ~/path_to_your_data/my_data.h5ad organism: ${organism:HUMAN} ``` - **Explanation:** - `datasets`: Defines the datasets to be loaded. - `my_dataset`: A unique identifier for your dataset. - `_target_`: Specifies the `Dataset` class to instantiate. Currently, `cz-benchmarks` supports `src.czbenchmarks.datasets.single_cell.SingleCellDataset` and `src.czbenchmarks.datasets.single_cell.PerturbationSingleCellDataset` Dataset types. - `path`: Path to your `.h5ad` file. This may be be a local filesystem path or an S3 URL (`s3://...`). - `organism`: Specify the organism, which must be a value from the `src.czbenchmarks.datasets.types.Organism` (e.g., HUMAN, MOUSE). You may add multiple datasets to thie files, as children of `datasets`. ### 3. Load and Validate Your Dataset in Python - Use the following Python code to load your dataset: ```python from czbenchmarks.datasets.utils import load_dataset # Instantiate the `SingleCellDataset` object from the configuration specified in `datasets.yaml` dataset = load_dataset("my_dataset") # Load the H5AD file into memory as an AnnData object, storing in the `ANNDATA` input "slot" of the dataset. dataset.load_data() # Ensure the basic requirements are met by the Dataset dataset.validate() print(dataset.get_input("ANNDATA")) ``` Fix any loading or validation errors, as needed. ## Tips for Customization - **Preprocessing:** If your dataset requires specialized preprocessing, consider subclassing `BaseDataset` in your project. - **Validation:** Ensure organism-specific validations (e.g. gene name prefixes) are met. - **Test with Models:** Specific models may have additional validation requirements, so you will need to invoke applicable models with your specific dataset to ensure that it is fully compliant.