Design and Architecture Overview
cz-benchmarks is built for modularity and reproducibility. This guide gives you a clear, high-level overview of the cz-benchmarks architecture. By understanding these core ideas, you’ll be able to use and extend the package more effectively.
Key Design Concepts
- Declarative Configuration: Hydra and OmegaConf centralize and manage dataset configuration (see the brief sketch after this list).
- Loose Coupling: Components communicate through well-defined interfaces. This minimizes dependencies and makes testing easier.
- Validation and Type Safety: Custom DataType definitions validate dataset contents, and Pydantic models enforce type-safe transfer of task inputs and outputs.
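As a quick illustration of declarative configuration, the snippet below composes a dataset configuration with OmegaConf. The keys shown are illustrative only and are not the actual cz-benchmarks configuration schema.

```python
from omegaconf import OmegaConf

# Minimal sketch of declarative configuration; the keys below are illustrative
# and do not reflect the actual cz-benchmarks config schema.
base = OmegaConf.create({"dataset": {"path": "data/example.h5ad", "organism": "human"}})
override = OmegaConf.create({"dataset": {"organism": "mouse"}})
config = OmegaConf.merge(base, override)  # later configs override earlier ones
print(OmegaConf.to_yaml(config))
```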
At its heart, the framework follows a simple principle: separating data from evaluation. Dataset objects handle and standardize biological data, making sure it’s ready for analysis. Task objects then run evaluations on the outputs from your models, letting you focus on your results.
Its core components include:
- Datasets:
Load, validate, and provide easy access to standardized biological data such as AnnData objects and associated metadata (for example, from an .h5ad file). The Dataset component handles reading different data formats and uses custom DataType definitions to check that the data has everything needed for evaluation, like the right gene names and required metadata. Support for images will be added in the future. See Datasets for more details.
- CellRepresentation:
Represents the output from your model (for example, a cell embedding as a np.ndarray). The framework follows a “bring-your-own-model” approach: you run your model independently and provide the resulting CellRepresentation to the evaluation Task.
- TaskInput and TaskOutput:
These are Pydantic models that ensure type-safe data transfer.
`TaskInput` bundles the necessary information from a `Dataset` (like ground-truth labels), while `TaskOutput` structures the results of a task's internal computation before metrics are calculated. A simplified sketch of this pattern appears after this list.
- Tasks:
Define the different types of evaluations you can run, such as clustering, embedding quality, label prediction, and perturbation analysis. Think of Tasks as the “evaluation engine” of the framework. Each Task (like ClusteringTask or PerturbationTask) contains all the logic needed to run a specific biological benchmark. You give a Task your model’s CellRepresentation, and it computes the relevant performance metrics for you. Tasks are built by extending the base Task class, making it easy to create new types of evaluations or customize existing ones. See Tasks for more details.
- Metrics:
A central MetricRegistry handles the registration and computation of metrics, enabling consistent and reusable evaluation criteria. See Metrics for more details.
- MetricRegistry and MetricResult:
The registry provides a centralized way to compute metrics (ADJUSTED_RAND_INDEX, MEAN_SQUARED_ERROR, etc.). All tasks use this registry to produce a standardized list of MetricResult objects.
- Configuration Management:
Uses Hydra and OmegaConf to dynamically compose configurations for datasets.
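To make the relationships between these components concrete, the sketch below defines a toy task with its own Pydantic `TaskInput` and `TaskOutput`. The class and method names (`MyTaskInput`, `MyTaskOutput`, `MyTask`) are simplified stand-ins, not the real cz-benchmarks base classes; see Tasks and Metrics for the actual interfaces.

```python
from typing import List

import numpy as np
from pydantic import BaseModel


class MyTaskInput(BaseModel):
    """Ground-truth information pulled from a Dataset (e.g., cell-type labels)."""
    labels: List[str]


class MyTaskOutput(BaseModel):
    """Intermediate results a task computes before metrics are calculated."""
    predicted_labels: List[str]


class MyTask:
    """Toy task: compare predicted labels against the ground truth."""

    def run(self, embedding: np.ndarray, task_input: MyTaskInput) -> List[dict]:
        # A real Task would derive predictions from the CellRepresentation
        # (for example, by clustering the embedding); this toy version simply
        # echoes the ground truth to keep the example short.
        output = MyTaskOutput(predicted_labels=task_input.labels)
        accuracy = float(
            np.mean(np.array(output.predicted_labels) == np.array(task_input.labels))
        )
        # Real tasks return MetricResult objects produced via the MetricRegistry;
        # a plain dict stands in for that here.
        return [{"metric": "accuracy", "value": accuracy}]


# Usage: an embedding (cells x dims) plus ground-truth labels.
embedding = np.random.rand(3, 8)
results = MyTask().run(embedding, MyTaskInput(labels=["B", "T", "T"]))
print(results)  # [{'metric': 'accuracy', 'value': 1.0}]
```

Because the inputs and outputs are Pydantic models, missing or mistyped fields fail fast with a clear validation error rather than surfacing later as an incorrect metric.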
Class Diagrams
```mermaid
classDiagram
    ABC <|-- Dataset
    Enum <|-- Organism
    SingleCellDataset <|-- SingleCellLabeledDataset
    SingleCellDataset <|-- SingleCellPerturbationDataset
```
```mermaid
classDiagram
    ABC <|-- Task
    BaseModel <|-- MetricResult
    BaseModel <|-- TaskInput
    BaseModel <|-- TaskOutput
    Task <|-- BatchIntegrationTask
    Task <|-- ClusteringTask
    Task <|-- CrossSpeciesIntegrationTask
    Task <|-- CrossSpeciesLabelPredictionTask
    Task <|-- EmbeddingTask
    Task <|-- MetadataLabelPredictionTask
    Task <|-- PerturbationExpressionPredictionTask
    Task <|-- SequentialOrganizationTask
    TaskInput <|-- BatchIntegrationTaskInput
    TaskInput <|-- ClusteringTaskInput
    TaskInput <|-- CrossSpeciesIntegrationTaskInput
    TaskInput <|-- CrossSpeciesLabelPredictionTaskInput
    TaskInput <|-- EmbeddingTaskInput
    TaskInput <|-- MetadataLabelPredictionTaskInput
    TaskInput <|-- PerturbationExpressionPredictionTaskInput
    TaskInput <|-- SequentialOrganizationTaskInput
    TaskOutput <|-- BatchIntegrationOutput
    TaskOutput <|-- ClusteringOutput
    TaskOutput <|-- CrossSpeciesIntegrationOutput
    TaskOutput <|-- CrossSpeciesLabelPredictionOutput
    TaskOutput <|-- EmbeddingOutput
    TaskOutput <|-- MetadataLabelPredictionOutput
    TaskOutput <|-- PerturbationExpressionPredictionOutput
    TaskOutput <|-- SequentialOrganizationOutput
```
```mermaid
classDiagram
    BaseModel <|-- AggregatedMetricResult
    BaseModel <|-- MetricInfo
    BaseModel <|-- MetricResult
    Enum <|-- MetricType
```
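The metric classes in the last diagram are used through the central MetricRegistry described above. The toy registry below illustrates the general registration-and-compute pattern; it is not the actual cz-benchmarks `MetricRegistry` API, and the real registry returns `MetricResult` objects rather than bare floats.

```python
from enum import Enum
from typing import Callable, Dict

from sklearn.metrics import adjusted_rand_score


class MetricType(Enum):
    # A small subset of the metric types mentioned above.
    ADJUSTED_RAND_INDEX = "adjusted_rand_index"


class ToyMetricRegistry:
    """Illustrative registry: metric functions are registered once, then
    looked up by MetricType and computed with keyword arguments."""

    def __init__(self) -> None:
        self._metrics: Dict[MetricType, Callable[..., float]] = {}

    def register(self, metric_type: MetricType, fn: Callable[..., float]) -> None:
        self._metrics[metric_type] = fn

    def compute(self, metric_type: MetricType, **kwargs) -> float:
        return float(self._metrics[metric_type](**kwargs))


registry = ToyMetricRegistry()
registry.register(MetricType.ADJUSTED_RAND_INDEX, adjusted_rand_score)

score = registry.compute(
    MetricType.ADJUSTED_RAND_INDEX,
    labels_true=[0, 0, 1, 1],
    labels_pred=[1, 1, 0, 0],
)
print(score)  # 1.0 -- ARI is invariant to a permutation of cluster labels
```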
The Standard Workflow
A typical benchmarking workflow follows these steps:
- Load Dataset:
Use `dataset = load_dataset(...)` to load a dataset. This gives you a `Dataset` object with loaded data (e.g., `dataset.adata`) and relevant metadata (e.g., `dataset.labels`).
- Generate Model Output:
Run your own ML model on the data from the `Dataset` object (e.g., `dataset.adata.X`) to produce a `CellRepresentation` (such as a cell embedding), for example: `embedding = my_model(dataset.adata)`. This step happens outside the cz-benchmarks package.
- Prepare Task Inputs:
Create an instance of the task-specific `TaskInput` class, populating it with the necessary ground-truth data from the `Dataset` object, for example: `task_input = TaskInput(labels=dataset.labels)`.
- Instantiate and Run Task:
Instantiate the desired `Task` and call its `.run()` method, passing your `CellRepresentation` and the prepared `TaskInput`, for example: `results = task.run(embedding, task_input)`.
- Analyze Results:
The task returns a list of `MetricResult` objects, which you can then analyze, plot, or save.
This modular design allows you to evaluate any model on any compatible dataset using a standardized and reproducible set of tasks and metrics.
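Putting the steps together, a minimal end-to-end sketch looks like the following. The names mirror the functions and classes referenced above (`load_dataset`, `ClusteringTask`, `ClusteringTaskInput`, `my_model`), but the import paths, argument names, and exact signatures are illustrative; treat this as a picture of the flow rather than copy-paste code.

```python
import numpy as np

# Illustrative imports; actual module paths in cz-benchmarks may differ.
# from czbenchmarks.datasets import load_dataset
# from czbenchmarks.tasks import ClusteringTask, ClusteringTaskInput

# 1. Load a dataset: returns a Dataset with standardized data and metadata.
dataset = load_dataset("example_dataset")

# 2. Generate a CellRepresentation with your own model, outside cz-benchmarks.
#    `my_model` is a placeholder for whatever produces a (cells x dims) embedding.
embedding: np.ndarray = my_model(dataset.adata)

# 3. Bundle the ground truth the task needs into the task-specific TaskInput
#    (the `labels` keyword is illustrative).
task_input = ClusteringTaskInput(labels=dataset.labels)

# 4. Instantiate the task and run the evaluation.
task = ClusteringTask()
results = task.run(embedding, task_input)

# 5. Analyze the returned MetricResult objects.
for result in results:
    print(result)
```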