# Metrics
The czbenchmarks.metrics module provides a unified and extensible framework for computing performance metrics across all evaluation tasks.
## Overview
At the core of this module is a centralized registry, MetricRegistry, which stores all supported metrics. Each metric is registered with a unique type, required arguments, default parameters, a description, and a set of descriptive tags.
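As a rough illustration of how a metric might be registered, the sketch below assumes that `MetricRegistry` is importable alongside `MetricType` and exposes a `register` method whose parameters mirror the fields listed above; the real method name and signature may differ.

```python
import sklearn.metrics

from czbenchmarks.metrics.types import MetricRegistry, MetricType

# Hypothetical registration: the parameter names below simply mirror the fields
# described above (required arguments, default parameters, description, tags)
# and are not guaranteed to match the actual czbenchmarks signature.
registry = MetricRegistry()
registry.register(
    MetricType.ADJUSTED_RAND_INDEX,
    func=sklearn.metrics.adjusted_rand_score,
    required_args={"labels_true", "labels_pred"},
    default_params={},
    description="Similarity between two clusterings, adjusted for chance.",
    tags={"clustering"},
)
```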
## Purpose

- Allows tasks to declare and compute metrics in a unified, type-safe, and extensible manner.
- Ensures metrics are reproducible and callable via shared interfaces across tasks such as clustering, embedding, and label prediction.
## Key Components

- **`MetricRegistry`**: A class that registers and manages metric functions, performs argument validation, and handles invocation.
- **`MetricType`**: An `Enum` defining all supported metric names. Each task refers to `MetricType` members to identify which metrics to compute (see the listing sketch after this list).
- **Tags**: Each metric is tagged with its associated category to allow filtering:
  - `clustering`: ARI, NMI
  - `embedding`: Silhouette Score
  - `integration`: Entropy per Cell, Batch Silhouette
  - `label_prediction`: Accuracy, F1, Precision, Recall, AUROC
  - `perturbation`: Spearman correlation
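Because `MetricType` is a standard Python `Enum`, the supported metric names can be enumerated directly; the snippet below relies only on ordinary `Enum` behavior (filtering by tag goes through the registry itself and is not shown here).

```python
from czbenchmarks.metrics.types import MetricType

# Print every metric name the enum defines.
for metric in MetricType:
    print(metric.name)
```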
## Supported Metrics
The following metrics are pre-registered:
| Metric Type | Task | Description |
|---|---|---|
| Adjusted Rand Index (ARI) | clustering | Measures the similarity between two clusterings, adjusted for chance. A higher value indicates better alignment. |
| Normalized Mutual Information (NMI) | clustering | Quantifies the amount of shared information between two clusterings, normalized to ensure comparability. |
| Silhouette Score | embedding | Evaluates how well-separated clusters are in an embedding space. Higher scores indicate better-defined clusters. |
| Entropy per Cell | integration | Assesses the mixing of batch labels at the single-cell level. Higher entropy indicates better integration. |
| Batch Silhouette | integration | Combines silhouette scoring with batch information to evaluate clustering quality while accounting for batch effects. |
| Spearman Correlation | perturbation | Rank correlation between predicted and actual values. |
| Accuracy | label_prediction | Average accuracy across k-fold cross-validation splits, indicating overall classification performance. |
| F1 Score | label_prediction | Average F1 score across folds, balancing precision and recall for classification tasks. |
| Precision | label_prediction | Average precision across folds, reflecting the proportion of true positives among predicted positives. |
| Recall | label_prediction | Average recall across folds, indicating the proportion of true positives correctly identified. |
| AUROC | label_prediction | Average area under the ROC curve across folds, measuring the ability to distinguish between classes. |
## How to Compute a Metric
Use `metrics_registry.compute()` inside your task’s `_compute_metrics()` method:

```python
from czbenchmarks.metrics.types import MetricResult, MetricType, metrics_registry

value = metrics_registry.compute(
    MetricType.ADJUSTED_RAND_INDEX,
    labels_true=true_labels,
    labels_pred=predicted_labels,
)

# Wrap in a result object
result = MetricResult(metric_type=MetricType.ADJUSTED_RAND_INDEX, value=value)
```
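For context, here is a minimal sketch of how this call might sit inside a task's `_compute_metrics()` hook. The `ClusteringTaskExample` class, its method signature, and the returned list are illustrative assumptions; only `metrics_registry.compute`, `MetricType`, and `MetricResult` come from the documented API above.

```python
from typing import List

from czbenchmarks.metrics.types import MetricResult, MetricType, metrics_registry


class ClusteringTaskExample:
    """Illustrative task: the real base class and hook signature may differ."""

    def _compute_metrics(self, true_labels, predicted_labels) -> List[MetricResult]:
        # Compute each requested metric through the shared registry and wrap it
        # in a MetricResult so reporting stays uniform across tasks.
        results = []
        for metric_type in (MetricType.ADJUSTED_RAND_INDEX,):
            value = metrics_registry.compute(
                metric_type,
                labels_true=true_labels,
                labels_pred=predicted_labels,
            )
            results.append(MetricResult(metric_type=metric_type, value=value))
        return results
```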