Add a Custom Dataset Type

Adding Datasets for Supported Types

This section describes how to add new datasets for use with the czbenchmarks.datasets module, focusing on single-cell RNA-seq data in AnnData .h5ad format.

Requirements

For single-cell datasets:

  • As with all datasets, Ensembl gene IDs must be valid for the specified Organism (e.g., ENSG for human, ENSMUSG for mouse).

  • The dataset file must be an .h5ad file conforming to the AnnData on-disk format.

  • The AnnData object’s var_names must specify the Ensembl gene ID for each gene, or var must contain a column named ensembl_id.

The AnnData object must also meet validation requirements for the specific dataset class:

  • For SingleCellLabeledDataset: obs must contain the label column (e.g., cell_type).

  • For SingleCellPerturbationDataset:

    • obs must contain a column with the value specified by condition_key in the dataset configuration. The control cells should be labeled with the value specified by the control condition value (control_name) for control cells.

    • A mapping of treatment cells to their control cells is expected in the AnnData unstructured data (uns) under control_cells_map. The structure of this mapping is a nested dictionary where the top level key is a condition and the value is a key/value pair of treatment cell id and control cell id, respectively.

    • A table of differential expression results is also expected in the AnnData unstructured data under de_results_wilcoxon. The differential expression results table must include the column specified by the parameter de_gene_col in the dataset configuration file, in addition to columns titled “logfoldchange” and “pval_adj”. These columns are analogous to those returned from scanpy.tl.rank_genes_groups.

1. Prepare Your Data

  • Save your data as an AnnData object in .h5ad format.

  • Ensure:

    • All required metadata columns (e.g., cell type, batch, condition) are included in obs.

    • Ensembl ids are properly defined in var or as var_names.

2. Update Datasets Configuration File

Add your datasets to the existing configuration file (e.g., src/czbenchmarks/conf/datasets.yaml) or to a new file (e.g., custom/path/my_custom_datasets.yaml):

datasets:
  my_labeled_dataset:
  _target_: czbenchmarks.datasets.SingleCellLabeledDataset
  path: /path/to/your/labeled_data.h5ad
  organism: ${organism:HUMAN}
  label_column_key: "cell_type" # Column in adata.obs with labels

  my_perturbation_dataset:
  _target_: czbenchmarks.datasets.SingleCellPerturbationDataset
  path: /path/to/your/perturb_data.h5ad
  organism: ${organism:MOUSE}
  condition_key: condition
  control_name: ctrl
  de_gene_col: gene_id

Explanation of keys:

  • datasets: Top-level key for dataset definitions.

  • Each child (e.g., my_labeled_dataset) is a unique dataset identifier.

  • _target_: The fully qualified class name of the dataset type. Supported types include:

    • czbenchmarks.datasets.SingleCellLabeledDataset (for labeled single-cell data)

    • czbenchmarks.datasets.SingleCellPerturbationDataset (for perturbation datasets)

  • path: Path to the .h5ad file (local or S3).

  • organism: Must be a value from czbenchmarks.datasets.types.Organism (e.g., HUMAN, MOUSE).

  • label_column_key: (For SingleCellLabeledDataset) Name of the label column in obs.

  • condition_key, control_name, de_gene_col: (For SingleCellPerturbationDataset) Required keys for perturbation data and DE results.

You may add multiple datasets as children of datasets.

3. Using Custom Datasets

Customized datasets can be loaded using the load_custom_dataset function, from either a supplemental yaml configuration file, as created in the example above, or from a dictionary of configuration parameters.

If the same parameters exist in both the default and supplemental yaml configuration files, the values in the supplemental file will override those in the default file. Values provided as a dictionary will override both.

While it is possible to provide both a supplemental yaml file and dictionary parameters, using multiple forms of input for parameters can make it complex to track final values and is not recommended.

a. Registering Datasets with a Supplemental YAML Configuration

Define datasets in a YAML file and load them by name using load_custom_dataset:

Example: user_dataset.yaml

datasets:
  user_dataset:
    _target_: czbenchmarks.datasets.SingleCellLabeledDataset
    organism: ${organism:HUMAN}
    path: s3://<bucket name>/<path>/example-small.h5ad

Load the dataset in Python:

from czbenchmarks.datasets.utils import load_custom_dataset
dataset = load_custom_dataset(dataset_name="user_dataset", custom_dataset_config_path='user_dataset.yaml')

b. Loading Customized Datasets from a Parameter Dictionary

from czbenchmarks.datasets.utils import load_custom_dataset
from czbenchmarks.datasets.types import Organism

my_dataset_name = "my_dataset"
custom_dataset_config = {
    "_target_": "czbenchmarks.datasets.SingleCellLabeledDataset",
    "organism": Organism.HUMAN,
    "path": "my_data.h5ad",
}
dataset = load_custom_dataset(
    dataset_name=my_dataset_name,
    custom_dataset_kwargs=custom_dataset_config
)
print(dataset.adata)

You can use any supported dataset class and provide additional keyword arguments as needed.


Steps to Add a New Dataset Type

1. Define a New Dataset Class

Create a new Python class that inherits from Dataset or one of its subclasses (such as SingleCellDataset) in czbenchmarks.datasets. Implement the required methods:

  • load_data(self): Load your data from disk (using self.path) and populate instance variables (e.g., self.adata, self.labels). You can call super().load_data() to leverage base loading logic if using a subclass like SingleCellDataset.

  • _validate(self): Add custom validation logic for your dataset. Call super()._validate() to include base checks, then add any dataset-specific assertions or error checks.

  • store_task_inputs(self): (Optional, but recommended) Save any derived or preprocessed data needed by tasks to self.task_inputs_dir.

Example:

from czbenchmarks.datasets import SingleCellDataset
import anndata as ad

class MyCustomDataset(SingleCellDataset):
  def load_data(self):
    # Load the base AnnData object using the parent method
    super().load_data()
    # Add custom loading logic
    if "my_custom_key" not in self.adata.obs:
      raise ValueError("Dataset is missing 'my_custom_key' in obs.")
    self.my_annotation = self.adata.obs["my_custom_key"]

  def _validate(self):
    # Run parent validation
    super()._validate()
    # Add custom validation logic
    assert all(self.my_annotation.notna()), "Custom annotation has missing values!"

  def store_task_inputs(self):
    # Optional: Save any derived data needed by tasks
    pass

2. Register and Use Your Dataset

  • Add your new dataset class to the appropriate module in czbenchmarks.datasets.

  • Register it in the src/czbenchmarks/datasets/__init__.py file.

  • You can now use your new dataset type with load_customized_dataset.

3. Test and Validate

  • Ensure your dataset loads and validates correctly.

  • Test it with the intended tasks to ensure compatibility.

Tips

  • Place your new class in the appropriate module under czbenchmarks.datasets.

  • If your dataset type is specialized (e.g., single-cell), inherit from the relevant subclass (SingleCellDataset).

  • Refer to existing classes in single_cell.py or single_cell_labeled.py for more examples.

  • Register your class in the module’s __init__.py if you want it to be importable directly from czbenchmarks.datasets.