Add a Custom Dataset Type
Adding Datasets for Supported Types
This section describes how to add new datasets for use with the czbenchmarks.datasets module, focusing on single-cell RNA-seq data in AnnData .h5ad format.
Requirements
For single-cell datasets:
- As with all datasets, Ensembl gene IDs must be valid for the specified Organism (e.g., ENSG for human, ENSMUSG for mouse).
- The dataset file must be an .h5ad file conforming to the AnnData on-disk format.
- The AnnData object’s var_names must specify the Ensembl gene ID for each gene, or var must contain a column named ensembl_id.
The AnnData object must also meet validation requirements for the specific dataset class:
- For SingleCellLabeledDataset:
  - obs must contain the label column (e.g., cell_type).
- For SingleCellPerturbationDataset:
  - obs must contain a column whose name is the value of condition_key in the dataset configuration. Control cells must be labeled in that column with the value of control_name.
  - A mapping of treatment cells to their control cells is expected in the AnnData unstructured data (uns) under control_cells_map. This mapping is a nested dictionary: each top-level key is a condition, and its value is a dictionary that maps treatment cell IDs to control cell IDs.
  - A table of differential expression results is also expected in the AnnData unstructured data under de_results_wilcoxon. The table must include the column named by the de_gene_col parameter in the dataset configuration file, in addition to columns titled “logfoldchange” and “pval_adj”. These columns are analogous to those returned from scanpy.tl.rank_genes_groups.
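The perturbation requirements above can be sketched as two small checks. This is a minimal illustration and not the library's own validation: the helper names, the condition names, and the cell IDs are hypothetical, and the functions work on plain Python containers so they can be tried without an AnnData file.

```python
from collections.abc import Iterable, Mapping

# Column titles the DE table is expected to carry, as stated above.
REQUIRED_DE_COLUMNS = ("logfoldchange", "pval_adj")


def missing_de_columns(columns: Iterable[str], de_gene_col: str) -> list[str]:
    """Return the required DE columns that are absent from `columns`."""
    present = set(columns)
    required = (de_gene_col, *REQUIRED_DE_COLUMNS)
    return [name for name in required if name not in present]


def is_valid_control_cells_map(mapping: Mapping) -> bool:
    """Check the nested shape: condition -> {treatment cell id -> control cell id}."""
    return all(
        isinstance(condition, str) and isinstance(pairs, Mapping)
        for condition, pairs in mapping.items()
    )


# Hypothetical example values; the conditions and cell ids are made up.
control_cells_map = {
    "drug_A": {"cell_0001": "cell_0101", "cell_0002": "cell_0102"},
    "drug_B": {"cell_0003": "cell_0103"},
}

print(is_valid_control_cells_map(control_cells_map))
print(missing_de_columns(["gene_id", "logfoldchange"], de_gene_col="gene_id"))
```

With a real file, the same helpers could be given adata.uns["control_cells_map"] and the column names of the table stored under adata.uns["de_results_wilcoxon"].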
1. Prepare Your Data
Save your data as an AnnData object in .h5ad format.
Ensure:
- All required metadata columns (e.g., cell type, batch, condition) are included in obs.
- Ensembl IDs are properly defined in var or as var_names.
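Before saving the file, the gene identifiers can be given a quick sanity check. The sketch below only tests the organism prefix mentioned in the requirements (ENSG for human, ENSMUSG for mouse), which is a rough approximation of "valid Ensembl ID". The function name and the prefix table are assumptions made for illustration.

```python
from collections.abc import Iterable

# Prefixes taken from the requirements above; other organisms are not covered here.
ENSEMBL_PREFIXES = {"HUMAN": "ENSG", "MOUSE": "ENSMUSG"}


def invalid_ensembl_ids(gene_ids: Iterable[str], organism: str) -> list[str]:
    """Return the gene IDs that do not start with the organism's Ensembl prefix."""
    prefix = ENSEMBL_PREFIXES[organism]
    return [gene_id for gene_id in gene_ids if not str(gene_id).startswith(prefix)]


# With an AnnData object this could be called as
# invalid_ensembl_ids(adata.var_names, "HUMAN") or
# invalid_ensembl_ids(adata.var["ensembl_id"], "HUMAN").
print(invalid_ensembl_ids(["ENSG00000141510", "TP53"], "HUMAN"))
```

An empty list means every identifier carries the expected prefix; gene symbols such as TP53 are reported so they can be mapped to Ensembl IDs first.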
2. Update Datasets Configuration File
Add your datasets to the existing configuration file (e.g., src/czbenchmarks/conf/datasets.yaml) or to a new file (e.g., custom/path/my_custom_datasets.yaml):
datasets:
  my_labeled_dataset:
    _target_: czbenchmarks.datasets.SingleCellLabeledDataset
    path: /path/to/your/labeled_data.h5ad
    organism: ${organism:HUMAN}
    label_column_key: "cell_type"  # Column in adata.obs with labels

  my_perturbation_dataset:
    _target_: czbenchmarks.datasets.SingleCellPerturbationDataset
    path: /path/to/your/perturb_data.h5ad
    organism: ${organism:MOUSE}
    condition_key: condition
    control_name: ctrl
    de_gene_col: gene_id
Explanation of keys:
- datasets: Top-level key for dataset definitions.
- Each child (e.g., my_labeled_dataset) is a unique dataset identifier.
- _target_: The fully qualified class name of the dataset type. Supported types include:
  - czbenchmarks.datasets.SingleCellLabeledDataset (for labeled single-cell data)
  - czbenchmarks.datasets.SingleCellPerturbationDataset (for perturbation datasets)
- path: Path to the .h5ad file (local or S3).
- organism: Must be a value from czbenchmarks.datasets.types.Organism (e.g., HUMAN, MOUSE).
- label_column_key: (For SingleCellLabeledDataset) Name of the label column in obs.
- condition_key, control_name, de_gene_col: (For SingleCellPerturbationDataset) Required keys for perturbation data and DE results.
You may add multiple datasets as children of datasets.
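A configuration can be checked for the keys described above before it is handed to the library. The following is a hypothetical pre-flight helper that operates on the dictionary form of a parsed YAML file. Treating _target_, path, and organism as mandatory for every entry is an assumption of this sketch, and label_column_key is deliberately not enforced.

```python
# Keys described above for each dataset class. Treating all of them as
# mandatory is an assumption made for this sketch.
COMMON_KEYS = ("_target_", "path", "organism")
PERTURBATION_KEYS = ("condition_key", "control_name", "de_gene_col")
PERTURBATION_TARGET = "czbenchmarks.datasets.SingleCellPerturbationDataset"


def missing_config_keys(config: dict) -> dict[str, list[str]]:
    """Map each dataset name under `datasets` to the keys it lacks."""
    problems = {}
    for name, entry in config.get("datasets", {}).items():
        expected = list(COMMON_KEYS)
        if entry.get("_target_") == PERTURBATION_TARGET:
            expected.extend(PERTURBATION_KEYS)
        missing = [key for key in expected if key not in entry]
        if missing:
            problems[name] = missing
    return problems


# Dictionary equivalent of a parsed YAML file (hypothetical, incomplete entry).
config = {
    "datasets": {
        "my_perturbation_dataset": {
            "_target_": PERTURBATION_TARGET,
            "path": "/path/to/your/perturb_data.h5ad",
            "organism": "${organism:MOUSE}",
            "condition_key": "condition",
        }
    }
}
print(missing_config_keys(config))
```

An empty result means every entry has the keys this sketch looks for; it says nothing about whether the values themselves are valid.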
3. Using Custom Datasets
Customized datasets can be loaded using the load_custom_dataset function, from either a supplemental YAML configuration file, as created in the example above, or from a dictionary of configuration parameters.
If the same parameters exist in both the default and supplemental YAML configuration files, the values in the supplemental file override those in the default file. Values provided as a dictionary override both.
While it is possible to provide both a supplemental YAML file and dictionary parameters, mixing input sources makes the final values hard to track and is not recommended.
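The precedence described above (default file, then supplemental file, then dictionary) behaves like a layered dictionary merge. The snippet below only illustrates that ordering with plain dictionaries. It is not how load_custom_dataset is implemented, and resolve_parameters is a hypothetical name.

```python
def resolve_parameters(default: dict, supplemental: dict, overrides: dict) -> dict:
    """Later sources win: default < supplemental YAML < dictionary."""
    return {**default, **supplemental, **overrides}


# Hypothetical parameter values for one dataset entry.
default_cfg = {"organism": "HUMAN", "path": "default.h5ad"}
supplemental_cfg = {"path": "supplemental.h5ad", "label_column_key": "cell_type"}
dict_overrides = {"path": "override.h5ad"}

print(resolve_parameters(default_cfg, supplemental_cfg, dict_overrides))
```

Here path is defined in all three sources and ends up with the dictionary value, while organism and label_column_key pass through from the only source that sets them.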
a. Registering Datasets with a Supplemental YAML Configuration
Define datasets in a YAML file and load them by name using load_custom_dataset:
Example: user_dataset.yaml
datasets:
  user_dataset:
    _target_: czbenchmarks.datasets.SingleCellLabeledDataset
    organism: ${organism:HUMAN}
    path: s3://<bucket name>/<path>/example-small.h5ad
Load the dataset in Python:
from czbenchmarks.datasets.utils import load_custom_dataset
dataset = load_custom_dataset(
    dataset_name="user_dataset",
    custom_dataset_config_path="user_dataset.yaml",
)
b. Loading Customized Datasets from a Parameter Dictionary
from czbenchmarks.datasets.utils import load_custom_dataset
from czbenchmarks.datasets.types import Organism
my_dataset_name = "my_dataset"
custom_dataset_config = {
    "_target_": "czbenchmarks.datasets.SingleCellLabeledDataset",
    "organism": Organism.HUMAN,
    "path": "my_data.h5ad",
}
dataset = load_custom_dataset(
    dataset_name=my_dataset_name,
    custom_dataset_kwargs=custom_dataset_config
)
print(dataset.adata)
You can use any supported dataset class and provide additional keyword arguments as needed.
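As an illustration of passing additional keyword arguments, the dictionary approach might look like this for a perturbation dataset. The keys mirror the perturbation entry in the YAML example earlier, and the path and values are placeholders. The czbenchmarks imports sit inside the function, so the parameter dictionary can be inspected without the package installed. Whether these keys are sufficient for a given file is not verified here.

```python
# Keys mirror the perturbation entry in the YAML example; the path and
# values are placeholders.
perturbation_kwargs = {
    "_target_": "czbenchmarks.datasets.SingleCellPerturbationDataset",
    "path": "/path/to/your/perturb_data.h5ad",
    "condition_key": "condition",
    "control_name": "ctrl",
    "de_gene_col": "gene_id",
}


def load_perturbation_dataset():
    """Load the dataset from the dictionary above (requires czbenchmarks)."""
    # Imported lazily so that defining this function does not need the package.
    from czbenchmarks.datasets.types import Organism
    from czbenchmarks.datasets.utils import load_custom_dataset

    return load_custom_dataset(
        dataset_name="my_perturbation_dataset",
        custom_dataset_kwargs={**perturbation_kwargs, "organism": Organism.MOUSE},
    )
```

Calling load_perturbation_dataset() then returns the dataset object in the same way as the labeled example above.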
Steps to Add a New Dataset Type
1. Define a New Dataset Class
Create a new Python class that inherits from Dataset or one of its subclasses (such as SingleCellDataset) in czbenchmarks.datasets. Implement the required methods:
- load_data(self): Load your data from disk (using self.path) and populate instance variables (e.g., self.adata, self.labels). You can call super().load_data() to leverage base loading logic if using a subclass like SingleCellDataset.
- _validate(self): Add custom validation logic for your dataset. Call super()._validate() to include base checks, then add any dataset-specific assertions or error checks.
- store_task_inputs(self): (Optional, but recommended) Save any derived or preprocessed data needed by tasks to self.task_inputs_dir.
Example:
from czbenchmarks.datasets import SingleCellDataset
import anndata as ad

class MyCustomDataset(SingleCellDataset):
    def load_data(self):
        # Load the base AnnData object using the parent method
        super().load_data()
        # Add custom loading logic
        if "my_custom_key" not in self.adata.obs:
            raise ValueError("Dataset is missing 'my_custom_key' in obs.")
        self.my_annotation = self.adata.obs["my_custom_key"]

    def _validate(self):
        # Run parent validation
        super()._validate()
        # Add custom validation logic
        assert all(self.my_annotation.notna()), "Custom annotation has missing values!"

    def store_task_inputs(self):
        # Optional: Save any derived data needed by tasks
        pass
2. Register and Use Your Dataset
- Add your new dataset class to the appropriate module in czbenchmarks.datasets.
- Register it in the src/czbenchmarks/datasets/__init__.py file.
- You can now use your new dataset type with load_custom_dataset.
3. Test and Validate
- Ensure your dataset loads and validates correctly.
- Test it with the intended tasks to ensure compatibility.
Tips
- Place your new class in the appropriate module under czbenchmarks.datasets.
- If your dataset type is specialized (e.g., single-cell), inherit from the relevant subclass (SingleCellDataset).
- Refer to existing classes in single_cell.py or single_cell_labeled.py for more examples.
- Register your class in the module’s __init__.py if you want it to be importable directly from czbenchmarks.datasets.