Tasks
The czbenchmarks.tasks module defines benchmarking tasks that evaluate the performance of models based on their outputs. Tasks take in datasets with model-generated outputs and compute metrics specific to each task type.
Core Concepts
BaseTask
All task classes inherit from this abstract base class. It defines the standard lifecycle of a task:
Input and output validation via required_inputs and required_outputs
Execution via the _run_task() method
Metric computation via the _compute_metrics() method
It also supports multi-dataset operations (via requires_multiple_datasets) and setting baseline embeddings (computed with PCA) for comparison with model outputs.
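A simplified sketch of that lifecycle, using only the hook names listed above, is shown below. This is an illustration of how the pieces fit together, not the actual BaseTask source.

# Simplified illustration of the task lifecycle; not the real BaseTask implementation.
from abc import ABC, abstractmethod


class TaskLifecycleSketch(ABC):
    requires_multiple_datasets = False  # documented flag for multi-dataset tasks

    @property
    @abstractmethod
    def required_inputs(self):
        """Set of DataType values the task needs as inputs."""

    @property
    @abstractmethod
    def required_outputs(self):
        """Set of DataType values the task expects as model outputs."""

    @abstractmethod
    def _run_task(self, data, model_type):
        """Execute the task logic against the dataset's model outputs."""

    @abstractmethod
    def _compute_metrics(self):
        """Return a list of MetricResult objects."""

    def run(self, data, model_type):
        # 1. Validate that `data` provides required_inputs and required_outputs.
        # 2. Execute the task logic.
        self._run_task(data, model_type)
        # 3. Compute and return metrics.
        return self._compute_metrics()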
Task Organization
Tasks in the czbenchmarks.tasks module are organized based on their scope and applicability:
Generic Tasks: Tasks that can be applied across multiple modalities (e.g., embedding evaluation, clustering, label prediction) are placed directly in the tasks/ directory. Each task is implemented in its own file (e.g., embedding.py, clustering.py) with a clear, descriptive name. Generic tasks avoid dependencies specific to any particular modality.
Specialized Tasks: Tasks designed for specific modalities are placed in dedicated subdirectories. For example:
single_cell/ for single-cell-specific tasks
imaging/ for imaging-related tasks (not implemented in the current release; reserved for future imaging models)
New subdirectories can be created as needed for other modalities.
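For illustration, the resulting layout might look like this (simplified sketch; only the files and directories mentioned above are shown, and the mapping of tasks to files is indicative):

czbenchmarks/tasks/
    __init__.py
    base.py            # BaseTask
    clustering.py      # generic task (ClusteringTask)
    embedding.py       # generic task (EmbeddingTask)
    single_cell/       # single-cell-specific tasks
        __init__.py
    imaging/           # reserved for future imaging tasks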
Available Tasks
Each task class implements a specific evaluation goal. All tasks are located under the czbenchmarks.tasks namespace or its submodules.
ClusteringTask: Performs Leiden clustering on the embedding produced by a model and compares it to ground-truth cell-type labels using metrics like Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI).
EmbeddingTask: Computes embedding quality using the Silhouette Score based on known cell-type annotations.
MetadataLabelPredictionTask: Performs k-fold cross-validation classification using multiple classifiers (logistic regression, KNN, random forest) on the model embeddings to predict metadata labels (e.g., cell type, sex). Evaluates metrics like accuracy, F1, precision, recall, and AUROC.
BatchIntegrationTask: Evaluates how well the model integrates batch-specific embeddings using entropy per cell and batch-aware Silhouette scores. Assesses whether embeddings mix batches while preserving biological labels.
PerturbationTask: Designed for gene perturbation prediction models. Compares predicted gene expression shifts to ground truth. Computes metrics such as mean squared error, Pearson R², and Jaccard overlap of differentially expressed (DE) gene sets.
CrossSpeciesIntegrationTask: A multi-dataset task. Evaluates how well models embed cells from different species into a shared space using metrics like entropy per cell and silhouette scores. Requires embeddings from multiple species as input.
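As a usage illustration, running one of these tasks might look like the sketch below. The public run(...) entry point, its arguments, and the ClusteringTask constructor are assumptions to verify against the BaseTask docstrings; dataset is a placeholder for a loaded dataset that already contains the model's outputs.

# Hedged usage sketch: assumes BaseTask exposes a public run(...) method that
# drives the validate -> _run_task -> _compute_metrics lifecycle described above.
from czbenchmarks.tasks import ClusteringTask  # may also live in czbenchmarks.tasks.clustering

task = ClusteringTask()           # constructor arguments, if any, omitted here
results = task.run(dataset)       # expected: a list of MetricResult objects (e.g., ARI, NMI)
for metric in results:
    print(metric)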
Extending Tasks
To define a new evaluation task:
Inherit from BaseTask: All new tasks must subclass czbenchmarks.tasks.base.BaseTask.
Choose the Right Location:
If the task is generic and works across multiple modalities, add it to the tasks/ directory.
If the task is specific to a particular modality, add it to the appropriate subdirectory (e.g., single_cell/, imaging/).
Create the Task File:
Each task should be implemented in its own file; an example skeleton appears at the end of this section. Override the following methods:
required_inputs: a set of DataType values required as inputs
required_outputs: a set of DataType values expected as model outputs
_run_task(data, model_type): executes the task logic using the input data and model outputs
_compute_metrics(): returns a list of MetricResult objects
Update __init__.py:
For generic tasks, add the new task to tasks/__init__.py.
For specialized tasks, add the new task to the __init__.py file in the corresponding modality-specific subdirectory.
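For a generic task, the export might look like this sketch of tasks/__init__.py (the module name my_new_task.py is hypothetical):

# Sketch of tasks/__init__.py: expose the new task alongside the existing ones.
# The module name my_new_task.py is hypothetical; adjust it to your actual file,
# and extend the existing __all__ list if the package defines one.
from .my_new_task import MyNewTask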
Documentation:
Add detailed docstrings to your task class and methods.
Update the relevant documentation files to include the new task.
Optional Features:
Set requires_multiple_datasets = True if your task operates on a list of datasets.
Call self.set_baseline(dataset) in your task to enable PCA baseline comparisons.
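A minimal sketch of both hooks, assuming only the names documented here; the class body and the MyCrossDatasetTask name are illustrative, not library code.

# Illustrative only: requires_multiple_datasets and set_baseline are the
# documented hooks; the rest of this class is placeholder code.
from czbenchmarks.datasets import DataType
from czbenchmarks.metrics.types import MetricResult
from czbenchmarks.tasks.base import BaseTask


class MyCrossDatasetTask(BaseTask):  # hypothetical task name
    requires_multiple_datasets = True  # _run_task then receives a list of datasets

    @property
    def required_inputs(self):
        return {DataType.METADATA}

    @property
    def required_outputs(self):
        return {DataType.EMBEDDING}

    def _run_task(self, data, model_type):
        # With requires_multiple_datasets = True, `data` is a list of datasets.
        self.embeddings = [d.get_output(model_type, DataType.EMBEDDING) for d in data]

    def _compute_metrics(self):
        return [MetricResult(metric_type="my_metric", value=0.0)]  # placeholder


# For single-dataset tasks, a PCA baseline can be requested from inside the task:
# self.set_baseline(dataset)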
Return Metrics:
Use the MetricRegistry to compute standard metrics and return them as strongly typed MetricResult objects.
Example Skeleton:
from czbenchmarks.tasks.base import BaseTask
from czbenchmarks.datasets import DataType
from czbenchmarks.models.types import ModelType
from czbenchmarks.metrics.types import MetricResult


class MyNewTask(BaseTask):
    @property
    def required_inputs(self):
        return {DataType.METADATA}

    @property
    def required_outputs(self):
        return {DataType.EMBEDDING}

    def _run_task(self, data, model_type: ModelType):
        self.embedding = data.get_output(model_type, DataType.EMBEDDING)
        self.labels = data.get_input(DataType.METADATA)["cell_type"]

    def _compute_metrics(self):
        result = ...  # your metric computation here
        return [MetricResult(metric_type="my_metric", value=result)]
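Once the file is created and exported, the new task could be run like the built-in tasks. As in the earlier ClusteringTask example, the public run(...) call is an assumption to confirm against the BaseTask docstrings.

# Hedged usage sketch; `dataset` is a placeholder for a loaded dataset that
# already carries the model outputs MyNewTask requires.
task = MyNewTask()
results = task.run(dataset)  # expected to return a list of MetricResult objects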
Best Practices
Keep tasks focused and single-purpose to ensure clarity and maintainability.
Clearly document the input and output requirements for each task.
Follow the patterns and conventions established in existing tasks for consistency.
Use type hints to improve code readability and make task interfaces explicit.
Add logging for key steps in the task lifecycle to facilitate debugging and monitoring.
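For instance, lifecycle logging can use the standard-library logging module; the placement and messages below are an illustrative pattern built on the hypothetical MyNewTask skeleton, not a czbenchmarks requirement.

import logging

from czbenchmarks.datasets import DataType
from czbenchmarks.tasks.base import BaseTask

logger = logging.getLogger(__name__)


class MyNewTask(BaseTask):  # continuing the hypothetical skeleton above
    def _run_task(self, data, model_type):
        logger.info("Running %s for model type %s", type(self).__name__, model_type)
        self.embedding = data.get_output(model_type, DataType.EMBEDDING)
        logger.debug("Embedding retrieved: %s", type(self.embedding).__name__)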