czbenchmarks.tasks

Attributes

TASK_REGISTRY

Classes

ClusteringOutput

Output for clustering task.

ClusteringTask

Task for evaluating clustering performance against ground truth labels.

ClusteringTaskInput

Pydantic model for ClusteringTask inputs.

EmbeddingOutput

Output for embedding task.

EmbeddingTask

Task for evaluating cell representation quality using labeled data.

EmbeddingTaskInput

Pydantic model for EmbeddingTask inputs.

BatchIntegrationOutput

Output for batch integration task.

BatchIntegrationTask

Task for evaluating batch integration quality.

BatchIntegrationTaskInput

Pydantic model for BatchIntegrationTask inputs.

MetadataLabelPredictionOutput

Output for label prediction task.

MetadataLabelPredictionTask

Task for predicting labels from embeddings using cross-validation.

MetadataLabelPredictionTaskInput

Pydantic model for MetadataLabelPredictionTask inputs.

SequentialOrganizationOutput

Output for sequential organization task.

SequentialOrganizationTask

Task for evaluating sequential consistency in embeddings.

SequentialOrganizationTaskInput

Pydantic model for Sequential Organization inputs.

CrossSpeciesIntegrationOutput

Output for cross-species integration task.

CrossSpeciesIntegrationTask

Task for evaluating cross-species integration quality.

CrossSpeciesIntegrationTaskInput

Pydantic model for CrossSpeciesIntegrationTask inputs.

CrossSpeciesLabelPredictionTaskInput

Pydantic model for CrossSpeciesLabelPredictionTask inputs.

CrossSpeciesLabelPredictionOutput

Output for cross-species label prediction task.

CrossSpeciesLabelPredictionTask

Task for cross-species label prediction evaluation.

MetricResult

Represents the result of a single metric computation.

Task

Abstract base class for all benchmark tasks.

TaskInput

Base class for task inputs.

TaskOutput

Base class for task outputs.

PerturbationExpressionPredictionOutput

Output for perturbation task.

PerturbationExpressionPredictionTask

Task for evaluating perturbation-induced expression predictions against ground truth values.

PerturbationExpressionPredictionTaskInput

Pydantic model for Perturbation task inputs.

Package Contents

class czbenchmarks.tasks.ClusteringOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for clustering task.

predicted_labels: List[int]
class czbenchmarks.tasks.ClusteringTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Task for evaluating clustering performance against ground truth labels.

This task performs clustering on embeddings and evaluates the results using multiple clustering metrics (ARI and NMI).

Parameters:

random_seed (int) – Random seed for reproducibility

display_name = 'Clustering'
description = 'Evaluate clustering performance against ground truth labels using ARI and NMI metrics.'
input_model
class czbenchmarks.tasks.ClusteringTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for ClusteringTask inputs.

obs: pandas.DataFrame
input_labels: czbenchmarks.types.ListLike
use_rep: str = 'X'
n_iterations: int = 2
flavor: Literal['leidenalg', 'igraph'] = 'igraph'
key_added: str = 'leiden'
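
A minimal usage sketch based on the signatures above; the embedding and ground-truth labels are synthetic and purely illustrative:

```python
import numpy as np
import pandas as pd

from czbenchmarks.tasks import ClusteringTask, ClusteringTaskInput

rng = np.random.default_rng(42)
embedding = rng.normal(size=(100, 16))                  # model-produced cell representation
labels = rng.integers(0, 3, size=100)                   # synthetic ground-truth labels
obs = pd.DataFrame(index=[f"cell_{i}" for i in range(100)])

task = ClusteringTask(random_seed=42)
task_input = ClusteringTaskInput(obs=obs, input_labels=labels)
results = task.run(embedding, task_input)               # List[MetricResult] with ARI and NMI
for result in results:
    print(result.metric_type, result.value)
```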
class czbenchmarks.tasks.EmbeddingOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for embedding task.

cell_representation: czbenchmarks.tasks.types.CellRepresentation
class czbenchmarks.tasks.EmbeddingTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Task for evaluating cell representation quality using labeled data.

This task computes quality metrics for cell representations using ground truth labels. Currently supports silhouette score evaluation.

Parameters:

random_seed (int) – Random seed for reproducibility

display_name = 'Embedding'
description = 'Evaluate cell representation quality using silhouette score with ground truth labels.'
input_model
class czbenchmarks.tasks.EmbeddingTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for EmbeddingTask inputs.

input_labels: czbenchmarks.types.ListLike
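
A short sketch under the same assumptions (synthetic embedding and labels, illustrative only):

```python
import numpy as np

from czbenchmarks.tasks import EmbeddingTask, EmbeddingTaskInput

rng = np.random.default_rng(0)
embedding = rng.normal(size=(200, 32))
labels = rng.integers(0, 4, size=200)

task = EmbeddingTask(random_seed=42)
results = task.run(embedding, EmbeddingTaskInput(input_labels=labels))  # silhouette score
```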
class czbenchmarks.tasks.BatchIntegrationOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for batch integration task.

cell_representation: czbenchmarks.tasks.types.CellRepresentation
class czbenchmarks.tasks.BatchIntegrationTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Task for evaluating batch integration quality.

This task computes metrics to assess how well different batches are integrated in the embedding space while preserving biological signals.

Parameters:

random_seed (int) – Random seed for reproducibility

display_name = 'Batch Integration'
description = 'Evaluate batch integration quality using various integration metrics.'
input_model
class czbenchmarks.tasks.BatchIntegrationTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for BatchIntegrationTask inputs.

batch_labels: czbenchmarks.types.ListLike
labels: czbenchmarks.types.ListLike
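
A sketch with synthetic batch assignments and biological labels, based only on the documented fields:

```python
import numpy as np

from czbenchmarks.tasks import BatchIntegrationTask, BatchIntegrationTaskInput

rng = np.random.default_rng(0)
embedding = rng.normal(size=(300, 32))
task_input = BatchIntegrationTaskInput(
    batch_labels=rng.choice(["batch_a", "batch_b"], size=300),  # batch per cell
    labels=rng.integers(0, 5, size=300),                        # biological label per cell
)
results = BatchIntegrationTask(random_seed=42).run(embedding, task_input)
```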
class czbenchmarks.tasks.MetadataLabelPredictionOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for label prediction task.

results: List[Dict[str, Any]]
class czbenchmarks.tasks.MetadataLabelPredictionTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Task for predicting labels from embeddings using cross-validation.

Evaluates multiple classifiers (Logistic Regression, KNN) using k-fold cross-validation. Reports standard classification metrics.

Parameters:

random_seed (int) – Random seed for reproducibility

display_name = 'Label Prediction'
description = 'Predict labels from embeddings using cross-validated classifiers and standard metrics.'
input_model
compute_baseline(expression_data: czbenchmarks.tasks.types.CellRepresentation, **kwargs) czbenchmarks.tasks.types.CellRepresentation[source]

Set a baseline cell representation using raw gene expression.

Instead of using embeddings from a model, this method uses the raw gene expression matrix as features for classification. This provides a baseline performance to compare against model-generated embeddings for classification tasks.

Parameters:

expression_data – gene expression data or embedding

Returns:

Baseline embedding

class czbenchmarks.tasks.MetadataLabelPredictionTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for MetadataLabelPredictionTask inputs.

labels: czbenchmarks.types.ListLike
n_folds: int = 5
min_class_size: int = 10
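
A sketch of the cross-validation inputs; the labels are synthetic and each class is assumed to meet min_class_size:

```python
import numpy as np

from czbenchmarks.tasks import (
    MetadataLabelPredictionTask,
    MetadataLabelPredictionTaskInput,
)

rng = np.random.default_rng(0)
embedding = rng.normal(size=(400, 16))
labels = rng.choice(["T cell", "B cell", "NK cell"], size=400)

task = MetadataLabelPredictionTask(random_seed=42)
task_input = MetadataLabelPredictionTaskInput(labels=labels, n_folds=5, min_class_size=10)
results = task.run(embedding, task_input)   # classification metrics per classifier
```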
class czbenchmarks.tasks.SequentialOrganizationOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for sequential organization task.

embedding: czbenchmarks.tasks.types.CellRepresentation
class czbenchmarks.tasks.SequentialOrganizationTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Task for evaluating sequential consistency in embeddings.

This task computes sequential quality metrics for embeddings using time point labels. Evaluates how well embeddings preserve sequential organization between cells.

Parameters:

random_seed (int) – Random seed for reproducibility

display_name = 'Sequential Organization'
description = 'Evaluate sequential consistency in embeddings using time point labels and k-NN based metrics.'
input_model
class czbenchmarks.tasks.SequentialOrganizationTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for Sequential Organization inputs.

obs: pandas.DataFrame
input_labels: czbenchmarks.types.ListLike
k: int = 15
normalize: bool = True
adaptive_k: bool = False
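
A sketch with synthetic time point labels, using the documented fields only:

```python
import numpy as np
import pandas as pd

from czbenchmarks.tasks import (
    SequentialOrganizationTask,
    SequentialOrganizationTaskInput,
)

rng = np.random.default_rng(0)
embedding = rng.normal(size=(120, 16))
timepoints = np.repeat([0, 1, 2, 3], 30)    # synthetic time point per cell
obs = pd.DataFrame(index=[f"cell_{i}" for i in range(120)])

task = SequentialOrganizationTask(random_seed=42)
task_input = SequentialOrganizationTaskInput(obs=obs, input_labels=timepoints, k=15)
results = task.run(embedding, task_input)
```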
class czbenchmarks.tasks.CrossSpeciesIntegrationOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for cross-species integration task.

cell_representation: czbenchmarks.tasks.types.CellRepresentation
labels: czbenchmarks.types.ListLike
species: czbenchmarks.types.ListLike
class czbenchmarks.tasks.CrossSpeciesIntegrationTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Task for evaluating cross-species integration quality.

This task computes metrics to assess how well different species’ data are integrated in the embedding space while preserving biological signals. It operates on multiple datasets from different species.

Parameters:

random_seed (int) – Random seed for reproducibility

display_name = 'Cross-species Integration'
description = 'Evaluate cross-species integration quality using various integration metrics.'
input_model
requires_multiple_datasets = True
abstract compute_baseline(**kwargs)[source]

Set a baseline embedding for cross-species integration.

This method is not implemented for cross-species integration tasks as standard preprocessing workflows are not directly applicable across different species.

Raises:

NotImplementedError – Always raised as baseline is not implemented

class czbenchmarks.tasks.CrossSpeciesIntegrationTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for CrossSpeciesIntegrationTask inputs.

labels: List[czbenchmarks.types.ListLike]
organism_list: List[czbenchmarks.datasets.types.Organism]
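
Because requires_multiple_datasets is True, run receives one embedding per species. A sketch, assuming the Organism enum exposes HUMAN and MOUSE members; all data are synthetic:

```python
import numpy as np

from czbenchmarks.datasets.types import Organism  # HUMAN/MOUSE members assumed
from czbenchmarks.tasks import (
    CrossSpeciesIntegrationTask,
    CrossSpeciesIntegrationTaskInput,
)

rng = np.random.default_rng(0)
embeddings = [rng.normal(size=(100, 32)), rng.normal(size=(120, 32))]  # one per species
labels = [rng.integers(0, 3, size=100), rng.integers(0, 3, size=120)]

task = CrossSpeciesIntegrationTask(random_seed=42)
task_input = CrossSpeciesIntegrationTaskInput(
    labels=labels,
    organism_list=[Organism.HUMAN, Organism.MOUSE],
)
results = task.run(embeddings, task_input)
```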
class czbenchmarks.tasks.CrossSpeciesLabelPredictionTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for CrossSpeciesLabelPredictionTask inputs.

labels: List[czbenchmarks.types.ListLike]
organisms: List[czbenchmarks.datasets.types.Organism]
sample_ids: List[czbenchmarks.types.ListLike] | None = None
aggregation_method: Literal['none', 'mean', 'median'] = 'mean'
n_folds: int = 5
class czbenchmarks.tasks.CrossSpeciesLabelPredictionOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for cross-species label prediction task.

results: List[Dict[str, Any]]
class czbenchmarks.tasks.CrossSpeciesLabelPredictionTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Task for cross-species label prediction evaluation.

This task evaluates cross-species transfer by training classifiers on one species and testing on another species. It computes accuracy, F1, precision, recall, and AUROC for multiple classifiers (Logistic Regression, KNN, Random Forest).

The task can optionally aggregate cell-level embeddings to sample/donor level before running classification.

Parameters:

random_seed (int) – Random seed for reproducibility

display_name = 'cross-species label prediction'
requires_multiple_datasets = True
abstract compute_baseline(**kwargs)[source]

Set a baseline for cross-species label prediction.

This method is not implemented for cross-species prediction tasks as standard preprocessing workflows need to be applied per species.

Raises:

NotImplementedError – Always raised as baseline is not implemented
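
A sketch of the train-on-one-species, test-on-another setup; embeddings and labels are synthetic, and the Organism members are assumed as above:

```python
import numpy as np

from czbenchmarks.datasets.types import Organism  # HUMAN/MOUSE members assumed
from czbenchmarks.tasks import (
    CrossSpeciesLabelPredictionTask,
    CrossSpeciesLabelPredictionTaskInput,
)

rng = np.random.default_rng(0)
embeddings = [rng.normal(size=(150, 32)), rng.normal(size=(150, 32))]
labels = [rng.choice(["neuron", "glia"], size=150) for _ in range(2)]

task = CrossSpeciesLabelPredictionTask(random_seed=42)
task_input = CrossSpeciesLabelPredictionTaskInput(
    labels=labels,
    organisms=[Organism.HUMAN, Organism.MOUSE],
    n_folds=5,
)
results = task.run(embeddings, task_input)  # accuracy, F1, AUROC, etc. per classifier
```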

czbenchmarks.tasks.TASK_REGISTRY

Registry of available tasks, populated automatically as Task subclasses are defined.
class czbenchmarks.tasks.MetricResult(/, **data: Any)[source]

Bases: pydantic.BaseModel

Represents the result of a single metric computation.

Encapsulates the computed value, associated metric type, and any parameters used during computation. Provides functionality for generating aggregation keys to group similar metrics.

metric_type

The type of metric computed.

Type:

MetricType

value

The computed metric value.

Type:

float

params

Parameters used during computation.

Type:

Optional[Dict[str, Any]]

aggregation_key()

Generates a key based on the metric type and parameters to aggregate similar metrics together.

metric_type: MetricType
value: float
params: Dict[str, Any] | None = None
property aggregation_key: str

Returns a key based on the metric type and params, used to aggregate the same metrics together.
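
A sketch of grouping results by aggregation_key, e.g. to average a metric over repeated runs; `results` is a hypothetical List[MetricResult] collected from task runs:

```python
from collections import defaultdict

# `results` is a placeholder: a List[MetricResult] from one or more task runs.
grouped = defaultdict(list)
for result in results:
    grouped[result.aggregation_key].append(result.value)

averages = {key: sum(values) / len(values) for key, values in grouped.items()}
```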

class czbenchmarks.tasks.Task(*, random_seed: int = RANDOM_SEED)[source]

Bases: abc.ABC

Abstract base class for all benchmark tasks.

Defines the interface that all tasks must implement. Tasks are responsible for:

  1. Declaring their required input/output data types

  2. Running task-specific computations

  3. Computing evaluation metrics

Tasks should store any intermediate results as instance variables to be used in metric computation.

Parameters:

random_seed (int) – Random seed for reproducibility

random_seed = 42
requires_multiple_datasets = False
classmethod __init_subclass__(**kwargs)[source]

Automatically register task subclasses when they are defined.

compute_baseline(expression_data: czbenchmarks.tasks.types.CellRepresentation, **kwargs) czbenchmarks.tasks.types.CellRepresentation[source]

Set a baseline embedding using PCA on gene expression data.

This method performs standard preprocessing on the raw gene expression data and uses PCA for dimensionality reduction. It then sets the PCA embedding as the BASELINE model output in the dataset, which can be used for comparison with other model embeddings.

Parameters:
  • expression_data – Raw gene expression data used to build the AnnData for the workflow

  • **kwargs – Additional arguments passed to run_standard_scrna_workflow

run(cell_representation: czbenchmarks.tasks.types.CellRepresentation | List[czbenchmarks.tasks.types.CellRepresentation], task_input: TaskInput) List[czbenchmarks.metrics.types.MetricResult][source]

Run the task on input data and compute metrics.

Parameters:
  • cell_representation – gene expression data or embedding to use for the task

  • task_input – Pydantic model with inputs for the task

Returns:

For a single embedding: a one-element list containing the metric results for the task. For multiple embeddings: a list of metric results, one per dataset.

Return type:

List[MetricResult]

Raises:

ValueError – If input does not match multiple embedding requirement
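
A sketch of the baseline-versus-model pattern shared by tasks that implement compute_baseline; the expression matrix, embedding, and labels are synthetic:

```python
import numpy as np
import pandas as pd

from czbenchmarks.tasks import ClusteringTask, ClusteringTaskInput

rng = np.random.default_rng(42)
expression = rng.poisson(1.0, size=(100, 500)).astype(np.float32)  # raw counts, cells x genes
model_embedding = rng.normal(size=(100, 16))
task_input = ClusteringTaskInput(
    obs=pd.DataFrame(index=[f"cell_{i}" for i in range(100)]),
    input_labels=rng.integers(0, 3, size=100),
)

task = ClusteringTask(random_seed=42)
baseline = task.compute_baseline(expression)       # PCA over the standard scRNA workflow
baseline_metrics = task.run(baseline, task_input)
model_metrics = task.run(model_embedding, task_input)
```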

class czbenchmarks.tasks.TaskInput(/, **data: Any)[source]

Bases: pydantic.BaseModel

Base class for task inputs.

Create a new model by parsing and validating input data from keyword arguments. Raises pydantic.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config

Configuration for the model; should be a dictionary conforming to pydantic.ConfigDict.

class czbenchmarks.tasks.TaskOutput(/, **data: Any)[source]

Bases: pydantic.BaseModel

Base class for task outputs.

Create a new model by parsing and validating input data from keyword arguments. Raises pydantic.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

model_config

Configuration for the model; should be a dictionary conforming to pydantic.ConfigDict.

class czbenchmarks.tasks.PerturbationExpressionPredictionOutput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskOutput

Output for perturbation task.

pred_mean_change_dict: Dict[str, numpy.ndarray]
true_mean_change_dict: Dict[str, numpy.ndarray]
class czbenchmarks.tasks.PerturbationExpressionPredictionTask(*, random_seed: int = RANDOM_SEED)[source]

Bases: czbenchmarks.tasks.task.Task

Perturbation Expression Prediction Task.

This task evaluates perturbation-induced expression predictions against their ground truth values. This is done by calculating metrics derived from predicted and ground truth log fold change values for each condition. Currently, Spearman rank correlation is supported.

The following arguments are supplied by the task input class (PerturbationExpressionPredictionTaskInput) when running the task and are described here for documentation purposes:

  • predictions_adata (ad.AnnData):

    The anndata containing model predictions

  • dataset_adata (ad.AnnData):

    The anndata object from SingleCellPerturbationDataset.

  • pred_effect_operation (Literal[“difference”, “ratio”]):

    How to compute predicted effect between treated and control mean predictions over genes.

    • “ratio” uses \(\log\left(\frac{\text{mean}(\text{treated}) + \varepsilon}{\text{mean}(\text{control}) + \varepsilon}\right)\) when means are positive.

    • “difference” uses \(\text{mean}(\text{treated}) - \text{mean}(\text{control})\) and is generally safe across scales (probabilities, z-scores, raw expression).

    Default is “ratio”.

  • gene_index (Optional[pd.Index]):

    The index of the genes in the predictions AnnData.

  • cell_index (Optional[pd.Index]):

    The index of the cells in the predictions AnnData.

Parameters:

random_seed (int) – Random seed for reproducibility.

Returns:

Dictionary of mean predicted and ground-truth changes in gene expression values for each condition.

Return type:

PerturbationExpressionPredictionOutput

display_name = 'Perturbation Expression Prediction'
description = 'Evaluate the quality of predicted changes in expression levels for genes that are...
input_model
condition_key = None
abstract compute_baseline(**kwargs)[source]

Set a baseline embedding for perturbation expression prediction.

This method is not implemented for perturbation expression prediction tasks.

Raises:

NotImplementedError – Always raised as baseline is not implemented

class czbenchmarks.tasks.PerturbationExpressionPredictionTaskInput(/, **data: Any)[source]

Bases: czbenchmarks.tasks.task.TaskInput

Pydantic model for Perturbation task inputs.

Contains input parameters for the PerturbationExpressionPredictionTask. The row and column ordering of the model predictions can optionally be provided as cell_index and gene_index, respectively, so the task can align a model matrix that is a subset of, or re-ordered relative to, the dataset adata.

adata: anndata.AnnData
pred_effect_operation: Literal['difference', 'ratio'] = 'ratio'
gene_index: pandas.Index | None = None
cell_index: pandas.Index | None = None
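
A sketch of wiring the task together. `dataset_adata` (from a SingleCellPerturbationDataset), the `predictions` matrix, and its `pred_cell_index`/`pred_gene_index` orderings are hypothetical placeholders, not runnable as-is:

```python
from czbenchmarks.tasks import (
    PerturbationExpressionPredictionTask,
    PerturbationExpressionPredictionTaskInput,
)

# Placeholders: `dataset_adata` comes from a SingleCellPerturbationDataset;
# `predictions` is a model output matrix whose ordering is described by
# `pred_cell_index` (rows) and `pred_gene_index` (columns).
task_input = PerturbationExpressionPredictionTaskInput(
    adata=dataset_adata,
    pred_effect_operation="difference",  # generally safer than "ratio" for z-scored outputs
    cell_index=pred_cell_index,
    gene_index=pred_gene_index,
)

task = PerturbationExpressionPredictionTask(random_seed=42)
results = task.run(predictions, task_input)  # Spearman correlation per condition
```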