czbenchmarks.tasks

Attributes

TASK_REGISTRY

Classes

ClusteringOutput

Output for clustering task.

ClusteringTask

Task for evaluating clustering performance against ground truth labels.

ClusteringTaskInput

Pydantic model for ClusteringTask inputs.

EmbeddingOutput

Output for embedding task.

EmbeddingTask

Task for evaluating cell representation quality using labeled data.

EmbeddingTaskInput

Pydantic model for EmbeddingTask inputs.

BatchIntegrationOutput

Output for batch integration task.

BatchIntegrationTask

Task for evaluating batch integration quality.

BatchIntegrationTaskInput

Pydantic model for BatchIntegrationTask inputs.

MetadataLabelPredictionOutput

Output for label prediction task.

MetadataLabelPredictionTask

Task for predicting labels from embeddings using cross-validation.

MetadataLabelPredictionTaskInput

Pydantic model for MetadataLabelPredictionTask inputs.

SequentialOrganizationOutput

Output for sequential organization task.

SequentialOrganizationTask

Task for evaluating sequential consistency in embeddings.

SequentialOrganizationTaskInput

Pydantic model for Sequential Organization inputs.

CrossSpeciesIntegrationOutput

Output for cross-species integration task.

CrossSpeciesIntegrationTask

Task for evaluating cross-species integration quality.

CrossSpeciesIntegrationTaskInput

Pydantic model for CrossSpeciesIntegrationTask inputs.

CrossSpeciesLabelPredictionTaskInput

Pydantic model for CrossSpeciesLabelPredictionTask inputs.

CrossSpeciesLabelPredictionOutput

Output for cross-species label prediction task.

CrossSpeciesLabelPredictionTask

Task for cross-species label prediction evaluation.

MetricResult

Represents the result of a single metric computation.

Task

Abstract base class for all benchmark tasks.

TaskInput

Base class for task inputs.

TaskOutput

Base class for task outputs.

PerturbationExpressionPredictionOutput

Output for perturbation task.

PerturbationExpressionPredictionTask

Task for evaluating perturbation-induced expression predictions against ground truth values.

PerturbationExpressionPredictionTaskInput

Pydantic model for Perturbation task inputs.

Package Contents

class czbenchmarks.tasks.ClusteringOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for clustering task.

predicted_labels: List[int]
class czbenchmarks.tasks.ClusteringTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Task for evaluating clustering performance against ground truth labels.

This task performs clustering on embeddings and evaluates the results using multiple clustering metrics (ARI and NMI).

Parameters:

random_seed (int) – Random seed for reproducibility

display_name = 'Clustering'
description = 'Evaluate clustering performance against ground truth labels using ARI and NMI metrics.'
input_model
class czbenchmarks.tasks.ClusteringTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for ClusteringTask inputs.

obs: pandas.DataFrame
input_labels: czbenchmarks.types.ListLike
use_rep: str = 'X'
n_iterations: int = 2
flavor: Literal['leidenalg', 'igraph'] = 'igraph'
key_added: str = 'leiden'
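
A minimal usage sketch based on the signatures above; the embedding and ground-truth labels are synthetic and purely illustrative:

```python
import numpy as np
import pandas as pd

from czbenchmarks.tasks import ClusteringTask, ClusteringTaskInput

rng = np.random.default_rng(42)
embedding = rng.normal(size=(100, 16))                  # model-produced cell representation
labels = rng.integers(0, 3, size=100)                   # synthetic ground-truth labels
obs = pd.DataFrame(index=[f"cell_{i}" for i in range(100)])

task = ClusteringTask(random_seed=42)
task_input = ClusteringTaskInput(obs=obs, input_labels=labels)
results = task.run(embedding, task_input)               # List[MetricResult] with ARI and NMI
for result in results:
    print(result.metric_type, result.value)
```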
class czbenchmarks.tasks.EmbeddingOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for embedding task.

cell_representation: czbenchmarks.tasks.types.CellRepresentation
class czbenchmarks.tasks.EmbeddingTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Task for evaluating cell representation quality using labeled data.

This task computes quality metrics for cell representations using ground truth labels. Currently supports silhouette score evaluation.

Parameters:

random_seed (int) – Random seed for reproducibility

display_name = 'Embedding'
description = 'Evaluate cell representation quality using silhouette score with ground truth labels.'
input_model
class czbenchmarks.tasks.EmbeddingTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for EmbeddingTask inputs.

input_labels: czbenchmarks.types.ListLike
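
A short sketch under the same assumptions (synthetic embedding and labels, illustrative only):

```python
import numpy as np

from czbenchmarks.tasks import EmbeddingTask, EmbeddingTaskInput

rng = np.random.default_rng(0)
embedding = rng.normal(size=(200, 32))
labels = rng.integers(0, 4, size=200)

task = EmbeddingTask(random_seed=42)
results = task.run(embedding, EmbeddingTaskInput(input_labels=labels))  # silhouette score
```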
class czbenchmarks.tasks.BatchIntegrationOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for batch integration task.

cell_representation: czbenchmarks.tasks.types.CellRepresentation
class czbenchmarks.tasks.BatchIntegrationTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Task for evaluating batch integration quality.

This task computes metrics to assess how well different batches are integrated in the embedding space while preserving biological signals.

Parameters:

random_seed (int) – Random seed for reproducibility

display_name = 'Batch Integration'
description = 'Evaluate batch integration quality using various integration metrics.'
input_model
class czbenchmarks.tasks.BatchIntegrationTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for BatchIntegrationTask inputs.

batch_labels: czbenchmarks.types.ListLike
labels: czbenchmarks.types.ListLike
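
A sketch with synthetic batch assignments and biological labels, based only on the documented fields:

```python
import numpy as np

from czbenchmarks.tasks import BatchIntegrationTask, BatchIntegrationTaskInput

rng = np.random.default_rng(0)
embedding = rng.normal(size=(300, 32))
task_input = BatchIntegrationTaskInput(
    batch_labels=rng.choice(["batch_a", "batch_b"], size=300),  # batch per cell
    labels=rng.integers(0, 5, size=300),                        # biological label per cell
)
results = BatchIntegrationTask(random_seed=42).run(embedding, task_input)
```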
class czbenchmarks.tasks.MetadataLabelPredictionOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for label prediction task.

results: List[Dict[str, Any]]
class czbenchmarks.tasks.MetadataLabelPredictionTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Task for predicting labels from embeddings using cross-validation.

Evaluates multiple classifiers (Logistic Regression, KNN) using k-fold cross-validation. Reports standard classification metrics.

Parameters:

random_seed (int) – Random seed for reproducibility

display_name = 'Label Prediction'
description = 'Predict labels from embeddings using cross-validated classifiers and standard metrics.'
input_model
compute_baseline(expression_data: czbenchmarks.tasks.types.CellRepresentation, **kwargs) czbenchmarks.tasks.types.CellRepresentation[source]

Set a baseline cell representation using raw gene expression.

Instead of using embeddings from a model, this method uses the raw gene expression matrix as features for classification. This provides a baseline performance to compare against model-generated embeddings for classification tasks.

Parameters:

expression_data – gene expression data or embedding

Returns:

Baseline embedding

class czbenchmarks.tasks.MetadataLabelPredictionTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for MetadataLabelPredictionTask inputs.

labels: czbenchmarks.types.ListLike
n_folds: int = 5
min_class_size: int = 10
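
A sketch of the cross-validation inputs; the labels are synthetic and each class is assumed to meet min_class_size:

```python
import numpy as np

from czbenchmarks.tasks import (
    MetadataLabelPredictionTask,
    MetadataLabelPredictionTaskInput,
)

rng = np.random.default_rng(0)
embedding = rng.normal(size=(400, 16))
labels = rng.choice(["T cell", "B cell", "NK cell"], size=400)

task = MetadataLabelPredictionTask(random_seed=42)
task_input = MetadataLabelPredictionTaskInput(labels=labels, n_folds=5, min_class_size=10)
results = task.run(embedding, task_input)   # classification metrics per classifier
```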
class czbenchmarks.tasks.SequentialOrganizationOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for sequential organization task.

embedding: czbenchmarks.tasks.types.CellRepresentation
class czbenchmarks.tasks.SequentialOrganizationTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Task for evaluating sequential consistency in embeddings.

This task computes sequential quality metrics for embeddings using time point labels. Evaluates how well embeddings preserve sequential organization between cells.

Parameters:

random_seed (int) – Random seed for reproducibility

display_name = 'Sequential Organization'
description = 'Evaluate sequential consistency in embeddings using time point labels and k-NN based metrics.'
input_model
class czbenchmarks.tasks.SequentialOrganizationTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for Sequential Organization inputs.

obs: pandas.DataFrame
input_labels: czbenchmarks.types.ListLike
k: int = 15
normalize: bool = True
adaptive_k: bool = False
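
A sketch with synthetic time point labels, using the documented fields only:

```python
import numpy as np
import pandas as pd

from czbenchmarks.tasks import (
    SequentialOrganizationTask,
    SequentialOrganizationTaskInput,
)

rng = np.random.default_rng(0)
embedding = rng.normal(size=(120, 16))
timepoints = np.repeat([0, 1, 2, 3], 30)    # synthetic time point per cell
obs = pd.DataFrame(index=[f"cell_{i}" for i in range(120)])

task = SequentialOrganizationTask(random_seed=42)
task_input = SequentialOrganizationTaskInput(obs=obs, input_labels=timepoints, k=15)
results = task.run(embedding, task_input)
```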
class czbenchmarks.tasks.CrossSpeciesIntegrationOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for cross-species integration task.

cell_representation: czbenchmarks.tasks.types.CellRepresentation
labels: czbenchmarks.types.ListLike
species: czbenchmarks.types.ListLike
class czbenchmarks.tasks.CrossSpeciesIntegrationTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Task for evaluating cross-species integration quality.

This task computes metrics to assess how well different species’ data are integrated in the embedding space while preserving biological signals. It operates on multiple datasets from different species.

Parameters:

random_seed (int) – Random seed for reproducibility

display_name = 'Cross-species Integration'
description = 'Evaluate cross-species integration quality using various integration metrics.'
input_model
requires_multiple_datasets = True
abstract compute_baseline(**kwargs)[source]

Set a baseline embedding for cross-species integration.

This method is not implemented for cross-species integration tasks as standard preprocessing workflows are not directly applicable across different species.

Raises:

NotImplementedError – Always raised as baseline is not implemented

class czbenchmarks.tasks.CrossSpeciesIntegrationTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for CrossSpeciesIntegrationTask inputs.

labels: List[czbenchmarks.types.ListLike]
organism_list: List[czbenchmarks.datasets.types.Organism]
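
Because requires_multiple_datasets is True, run receives one embedding per species. A sketch, assuming the Organism enum exposes HUMAN and MOUSE members; all data are synthetic:

```python
import numpy as np

from czbenchmarks.datasets.types import Organism  # HUMAN/MOUSE members assumed
from czbenchmarks.tasks import (
    CrossSpeciesIntegrationTask,
    CrossSpeciesIntegrationTaskInput,
)

rng = np.random.default_rng(0)
embeddings = [rng.normal(size=(100, 32)), rng.normal(size=(120, 32))]  # one per species
labels = [rng.integers(0, 3, size=100), rng.integers(0, 3, size=120)]

task = CrossSpeciesIntegrationTask(random_seed=42)
task_input = CrossSpeciesIntegrationTaskInput(
    labels=labels,
    organism_list=[Organism.HUMAN, Organism.MOUSE],
)
results = task.run(embeddings, task_input)
```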
class czbenchmarks.tasks.CrossSpeciesLabelPredictionTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for CrossSpeciesLabelPredictionTask inputs.

labels: List[czbenchmarks.types.ListLike]
organisms: List[czbenchmarks.datasets.types.Organism]
sample_ids: List[czbenchmarks.types.ListLike] | None = None
aggregation_method: Literal['none', 'mean', 'median'] = 'mean'
n_folds: int = 5
class czbenchmarks.tasks.CrossSpeciesLabelPredictionOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for cross-species label prediction task.

results: List[Dict[str, Any]]
class czbenchmarks.tasks.CrossSpeciesLabelPredictionTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Task for cross-species label prediction evaluation.

This task evaluates cross-species transfer by training classifiers on one species and testing on another species. It computes accuracy, F1, precision, recall, and AUROC for multiple classifiers (Logistic Regression, KNN, Random Forest).

The task can optionally aggregate cell-level embeddings to sample/donor level before running classification.

Parameters:

random_seed (int) – Random seed for reproducibility

display_name = 'cross-species label prediction'
requires_multiple_datasets = True
abstract compute_baseline(**kwargs)[source]

Set a baseline for cross-species label prediction.

This method is not implemented for cross-species prediction tasks as standard preprocessing workflows need to be applied per species.

Raises:

NotImplementedError – Always raised as baseline is not implemented
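
A sketch of the train-on-one-species, test-on-another setup; embeddings and labels are synthetic, and the Organism members are assumed as above:

```python
import numpy as np

from czbenchmarks.datasets.types import Organism  # HUMAN/MOUSE members assumed
from czbenchmarks.tasks import (
    CrossSpeciesLabelPredictionTask,
    CrossSpeciesLabelPredictionTaskInput,
)

rng = np.random.default_rng(0)
embeddings = [rng.normal(size=(150, 32)), rng.normal(size=(150, 32))]
labels = [rng.choice(["neuron", "glia"], size=150) for _ in range(2)]

task = CrossSpeciesLabelPredictionTask(random_seed=42)
task_input = CrossSpeciesLabelPredictionTaskInput(
    labels=labels,
    organisms=[Organism.HUMAN, Organism.MOUSE],
    n_folds=5,
)
results = task.run(embeddings, task_input)  # accuracy, F1, AUROC, etc. per classifier
```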

czbenchmarks.tasks.TASK_REGISTRY

Registry of available tasks, populated automatically as Task subclasses are defined.
class czbenchmarks.tasks.MetricResult(/, **data: Any)[source]

Bases: pydantic.BaseModel

Represents the result of a single metric computation.

Encapsulates the computed value, associated metric type, and any parameters used during computation. Provides functionality for generating aggregation keys to group similar metrics.

metric_type

The type of metric computed.

Type:

MetricType

value

The computed metric value.

Type:

float

params

Parameters used during computation.

Type:

Optional[Dict[str, Any]]

aggregation_key()

Generates a key based on the metric type and parameters to aggregate similar metrics together.

metric_type: MetricType
value: float
params: Dict[str, Any] | None = None
property aggregation_key: str

Returns a key based on the metric type and params, used to aggregate the same metrics together.
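
A sketch of grouping results by aggregation_key, e.g. to average a metric over repeated runs; `results` is a hypothetical List[MetricResult] collected from task runs:

```python
from collections import defaultdict

# `results` is a placeholder: a List[MetricResult] from one or more task runs.
grouped = defaultdict(list)
for result in results:
    grouped[result.aggregation_key].append(result.value)

averages = {key: sum(values) / len(values) for key, values in grouped.items()}
```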

class czbenchmarks.tasks.Task(*, random_seed: int = RANDOM_SEED)[source]

Bases: abc.ABC

Abstract base class for all benchmark tasks.

Defines the interface that all tasks must implement. Tasks are responsible for:

  1. Declaring their required input/output data types

  2. Running task-specific computations

  3. Computing evaluation metrics

Tasks should store any intermediate results as instance variables to be used in metric computation.

Parameters:

random_seed (int) – Random seed for reproducibility

random_seed = 42
requires_multiple_datasets = False
classmethod __init_subclass__(**kwargs)[source]

Automatically register task subclasses when they are defined.

compute_baseline(expression_data: czbenchmarks.tasks.types.CellRepresentation, **kwargs) czbenchmarks.tasks.types.CellRepresentation[source]

Set a baseline embedding using PCA on gene expression data.

This method performs standard preprocessing on the raw gene expression data and uses PCA for dimensionality reduction. It then sets the PCA embedding as the BASELINE model output in the dataset, which can be used for comparison with other model embeddings.

Parameters:
  • expression_data – Raw gene expression data used to build the AnnData for the workflow

  • **kwargs – Additional arguments passed to run_standard_scrna_workflow

run(cell_representation: czbenchmarks.tasks.types.CellRepresentation | List[czbenchmarks.tasks.types.CellRepresentation], task_input: TaskInput) List[czbenchmarks.metrics.types.MetricResult][source]

Run the task on input data and compute metrics.

Parameters:
  • cell_representation – gene expression data or embedding to use for the task

  • task_input – Pydantic model with inputs for the task

Returns:

For a single embedding: a one-element list containing the metric results for the task. For multiple embeddings: a list of metric results, one per dataset.

Return type:

List[MetricResult]

Raises:

ValueError – If input does not match multiple embedding requirement
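
A sketch of the baseline-versus-model pattern shared by tasks that implement compute_baseline; the expression matrix, embedding, and labels are synthetic:

```python
import numpy as np
import pandas as pd

from czbenchmarks.tasks import ClusteringTask, ClusteringTaskInput

rng = np.random.default_rng(42)
expression = rng.poisson(1.0, size=(100, 500)).astype(np.float32)  # raw counts, cells x genes
model_embedding = rng.normal(size=(100, 16))
task_input = ClusteringTaskInput(
    obs=pd.DataFrame(index=[f"cell_{i}" for i in range(100)]),
    input_labels=rng.integers(0, 3, size=100),
)

task = ClusteringTask(random_seed=42)
baseline = task.compute_baseline(expression)       # PCA over the standard scRNA workflow
baseline_metrics = task.run(baseline, task_input)
model_metrics = task.run(model_embedding, task_input)
```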

class czbenchmarks.tasks.TaskInput(/, **data: Any)[source]

Bases: pydantic.BaseModel

Base class for task inputs.

Create a new model by parsing and validating input data from keyword arguments. Raises pydantic.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config

Configuration for the model; should be a dictionary conforming to pydantic.ConfigDict.

class czbenchmarks.tasks.TaskOutput(/, **data: Any)[source]

Bases: pydantic.BaseModel

Base class for task outputs.

Create a new model by parsing and validating input data from keyword arguments. Raises pydantic.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config

Configuration for the model; should be a dictionary conforming to pydantic.ConfigDict.

class czbenchmarks.tasks.PerturbationExpressionPredictionOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for perturbation task.

pred_mean_change_dict: Dict[str, numpy.ndarray]
true_mean_change_dict: Dict[str, numpy.ndarray]
class czbenchmarks.tasks.PerturbationExpressionPredictionTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Perturbation Expression Prediction Task.

This task evaluates perturbation-induced expression predictions against their ground truth values. This is done by calculating metrics derived from predicted and ground truth log fold change values for each condition. Currently, Spearman rank correlation is supported.

The following arguments are supplied by the task input class (PerturbationExpressionPredictionTaskInput) when running the task and are described here for documentation purposes:

  • predictions_adata (ad.AnnData):

    The anndata containing model predictions

  • dataset_adata (ad.AnnData):

    The anndata object from SingleCellPerturbationDataset.

  • pred_effect_operation (Literal[“difference”, “ratio”]):

    How to compute predicted effect between treated and control mean predictions over genes.

    • “ratio” uses \(\log\left(\frac{\text{mean}(\text{treated}) + \varepsilon}{\text{mean}(\text{control}) + \varepsilon}\right)\) when means are positive.

    • “difference” uses \(\text{mean}(\text{treated}) - \text{mean}(\text{control})\) and is generally safe across scales (probabilities, z-scores, raw expression).

    Default is “ratio”.

  • gene_index (Optional[pd.Index]):

    The index of the genes in the predictions AnnData.

  • cell_index (Optional[pd.Index]):

    The index of the cells in the predictions AnnData.

Parameters:

random_seed (int) – Random seed for reproducibility.

Returns:

Dictionary of mean predicted and ground-truth changes in gene expression values for each condition.

Return type:

PerturbationExpressionPredictionOutput

display_name = 'Perturbation Expression Prediction'
description = 'Evaluate the quality of predicted changes in expression levels for genes that are...
input_model
condition_key = None
abstract compute_baseline(**kwargs)[source]

Set a baseline embedding for perturbation expression prediction.

This method is not implemented for perturbation expression prediction tasks.

Raises:

NotImplementedError – Always raised as baseline is not implemented

class czbenchmarks.tasks.PerturbationExpressionPredictionTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for Perturbation task inputs.

Contains input parameters for the PerturbationExpressionPredictionTask. The row and column ordering of the model predictions can optionally be provided as cell_index and gene_index, respectively, so the task can align a model matrix that is a subset of, or re-ordered relative to, the dataset adata.

adata: anndata.AnnData
pred_effect_operation: Literal['difference', 'ratio'] = 'ratio'
gene_index: pandas.Index | None = None
cell_index: pandas.Index | None = None
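
A sketch of wiring the task together. `dataset_adata` (from a SingleCellPerturbationDataset), the `predictions` matrix, and its `pred_cell_index`/`pred_gene_index` orderings are hypothetical placeholders, not runnable as-is:

```python
from czbenchmarks.tasks import (
    PerturbationExpressionPredictionTask,
    PerturbationExpressionPredictionTaskInput,
)

# Placeholders: `dataset_adata` comes from a SingleCellPerturbationDataset;
# `predictions` is a model output matrix whose ordering is described by
# `pred_cell_index` (rows) and `pred_gene_index` (columns).
task_input = PerturbationExpressionPredictionTaskInput(
    adata=dataset_adata,
    pred_effect_operation="difference",  # generally safer than "ratio" for z-scored outputs
    cell_index=pred_cell_index,
    gene_index=pred_gene_index,
)

task = PerturbationExpressionPredictionTask(random_seed=42)
results = task.run(predictions, task_input)  # Spearman correlation per condition
```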