Datasets
The czbenchmarks.datasets module defines the dataset abstraction used across all benchmark pipelines. It provides a uniform, type-safe way to manage dataset inputs and ensures they are compatible with tasks.
Overview
cz-benchmarks currently supports single-cell RNA-seq data stored in the AnnData H5AD format. The dataset system is extensible and can be used for other data modalities by creating new dataset types.
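As a rough illustration of what "creating a new dataset type" can mean, the sketch below defines a toy abstract base class and one concrete subclass. Every name here (ToyDataset, ToyImageDataset, load_data, validate) is hypothetical and chosen for illustration only; the real base class is the Dataset class described under Key Components, and its actual method names should be taken from the API documentation.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class ToyDataset(ABC):
    """Hypothetical stand-in for an abstract dataset base class."""

    def __init__(self, path: str, organism: str) -> None:
        self.path = Path(path)
        self.organism = organism

    @abstractmethod
    def load_data(self) -> None:
        """Read the file at self.path into memory."""

    @abstractmethod
    def validate(self) -> None:
        """Raise ValueError if the loaded data is unusable."""


class ToyImageDataset(ToyDataset):
    """Hypothetical new modality: a collection of image file names."""

    def load_data(self) -> None:
        # A real implementation would read from self.path; this toy
        # version only records two made-up file names.
        self.images = ["img_0.png", "img_1.png"]

    def validate(self) -> None:
        if not getattr(self, "images", None):
            raise ValueError("no images loaded; call load_data() first")


dataset = ToyImageDataset("images/", organism="HUMAN")
dataset.load_data()
dataset.validate()
```

Because the base class declares its methods abstract, Python refuses to instantiate it directly, which is one way to guarantee that every concrete type implements loading and validation.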
Key Components
Dataset
An abstract class that ensures all concrete classes provide the following functionality:
Loading a dataset file into memory.
Validation of the specified dataset file.
Specification of an Organism.
Organism-based validation using the Organism enum.
Storing task-specific outputs to disk for later use by Tasks.
All dataset types must inherit from Dataset.
Organism
Enum that specifies supported species (e.g., HUMAN, MOUSE) and gene prefixes (e.g., ENSG and ENSMUSG, respectively).
SingleCellDataset
An abstract implementation of Dataset for single-cell data.
Responsibilities:
Loads an AnnData object from H5AD files via anndata.read_h5ad.
Stores the AnnData object in the adata instance variable.
Validates gene name prefixes and that expression values are raw counts.
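The two validation steps above (gene name prefixes and raw counts) can be sketched in plain Python. This is a minimal illustration, not the library's implementation: ToyOrganism, genes_match_organism, and looks_like_raw_counts are hypothetical names, the species strings are made up, and the expression matrix is modeled as a list of rows rather than an AnnData object. Only the ENSG and ENSMUSG prefixes come from the text above.

```python
from enum import Enum


class ToyOrganism(Enum):
    """Hypothetical enum pairing a species label with its gene prefix."""

    HUMAN = ("homo_sapiens", "ENSG")
    MOUSE = ("mus_musculus", "ENSMUSG")

    @property
    def prefix(self) -> str:
        return self.value[1]


def genes_match_organism(gene_ids, organism):
    """True if every gene identifier starts with the organism's prefix."""
    return all(gene.startswith(organism.prefix) for gene in gene_ids)


def looks_like_raw_counts(matrix):
    """True if every value is a non-negative whole number."""
    return all(
        value >= 0 and float(value).is_integer()
        for row in matrix
        for value in row
    )


human_genes = ["ENSG00000141510", "ENSG00000012048"]
human_ok = genes_match_organism(human_genes, ToyOrganism.HUMAN)
mouse_ok = genes_match_organism(human_genes, ToyOrganism.MOUSE)
counts_ok = looks_like_raw_counts([[0, 3, 1], [2, 0, 5]])
normalized_ok = looks_like_raw_counts([[0.5, 1.2, 0.0]])
```

Here human_ok and counts_ok are True, while mouse_ok and normalized_ok are False: human identifiers do not carry the mouse prefix, and fractional values indicate the data has been normalized.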
SingleCellLabeledDataset
Subclass of SingleCellDataset for labeled single-cell data.
Responsibilities:
Stores labels (expected prediction values) from a specified obs column.
Validates that the label column exists.
SingleCellPerturbationDataset
Subclass of SingleCellDataset designed for perturbation benchmarks.
Responsibilities:
Validates the presence of specific AnnData features: condition_key in the adata.obs column names, and keys named control_cells_map and de_results_wilcoxon in adata.uns.
Validates that the column named by the de_gene_col parameter, as well as columns named "logfoldchange" and "pval_adj", are present in the differential expression results.
Validates that the value set by control_name appears in the condition column of adata.obs to identify the control cells.
Matches control cells with perturbation data and determines which genes can be masked for benchmarking.
Computes and stores the control-matched AnnData (stored as dataset.adata). Other outputs (control_cells_map, de_results, target_conditions_dict) are stored in the unstructured portion of the AnnData (adata.uns).
Example valid perturbation formats:
{condition_name} for input or {condition_name}_{perturb} for matched control samples, where perturb can be any type of perturbation.
{perturb} for a single perturbation.
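The presence checks listed above can be sketched with plain dictionaries standing in for adata.obs and adata.uns. This is an illustrative sketch under stated assumptions, not the library's code: check_perturbation_inputs is a hypothetical helper, and the example values "condition", "gene_id", "ctrl", and "GENE_A" are made up. The key and column names control_cells_map, de_results_wilcoxon, logfoldchange, and pval_adj come from the text above.

```python
def check_perturbation_inputs(obs, uns, condition_key, de_gene_col, control_name):
    """Return a list of problems; an empty list means all checks passed.

    obs maps column names to lists of values (stand-in for adata.obs).
    uns maps keys to arbitrary objects (stand-in for adata.uns).
    """
    problems = []

    # The condition column must exist and contain the control label.
    if condition_key not in obs:
        problems.append(f"obs is missing column {condition_key!r}")
    elif control_name not in obs[condition_key]:
        problems.append(f"{control_name!r} not found in obs[{condition_key!r}]")

    # Both precomputed entries must be present in uns.
    for key in ("control_cells_map", "de_results_wilcoxon"):
        if key not in uns:
            problems.append(f"uns is missing key {key!r}")

    # The differential expression table needs three specific columns.
    de_results = uns.get("de_results_wilcoxon")
    if de_results is not None:
        for column in (de_gene_col, "logfoldchange", "pval_adj"):
            if column not in de_results:
                problems.append(f"de results are missing column {column!r}")

    return problems


obs = {"condition": ["ctrl", "ctrl", "GENE_A", "GENE_A"]}
good_uns = {
    "control_cells_map": {},
    "de_results_wilcoxon": {"gene_id": [], "logfoldchange": [], "pval_adj": []},
}
bad_uns = {
    "control_cells_map": {},
    "de_results_wilcoxon": {"gene_id": [], "logfoldchange": []},
}

good = check_perturbation_inputs(obs, good_uns, "condition", "gene_id", "ctrl")
bad = check_perturbation_inputs(obs, bad_uns, "condition", "gene_id", "ctrl")
```

Collecting all problems into a list, rather than raising on the first one, lets a user fix a malformed file in a single pass; good is empty here, while bad reports the missing pval_adj column.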
Using Available Datasets
Listing Available Datasets
To list all datasets registered in the system:
from czbenchmarks.datasets.utils import list_available_datasets
available_datasets = list_available_datasets()
Loading a Dataset
To load a dataset by name, use the load_dataset utility. The returned object will be an instance of the appropriate dataset class, such as SingleCellLabeledDataset or SingleCellPerturbationDataset:
from czbenchmarks.datasets import load_dataset, SingleCellLabeledDataset
dataset: SingleCellLabeledDataset = load_dataset("tsv2_prostate")
Accessing Dataset Attributes
After loading, you can access the Dataset’s attributes, which vary depending on the dataset type:
For SingleCellLabeledDataset:
adata_object = dataset.adata # AnnData object with expression data
labels_series = dataset.labels # Labels from the specified obs column
For SingleCellPerturbationDataset:
control_cells_map = dataset.control_cells_map # Dictionary: condition → {treatment cell barcodes : matched control barcodes}
target_conditions_dict = dataset.target_conditions_dict # Dictionary of masked gene ids for each condition
de_results = dataset.de_results # Differential expression results
control_matched_adata = dataset.adata # AnnData object for matched controls
Refer to the class docstrings and API documentation for more details on available attributes and methods.
Tips for Developers
AnnData Views: Use .copy() when slicing data to avoid issues with modified "views" in Scanpy.