czbenchmarks.metrics.utils
Attributes
Functions
nearest_neighbors_hnsw | Find nearest neighbors using HNSW algorithm.
compute_entropy_per_cell | Compute entropy of batch labels in local neighborhoods.
jaccard_score | Compute Jaccard similarity between true and predicted values.
mean_fold_metric | Compute mean of a metric across folds.
single_metric | Get a single metric value from filtered results.
aggregate_results | Aggregate a collection of MetricResults by their type and parameters.
sequential_alignment | Measure how sequentially close neighbors are in embedding space.
Module Contents
- czbenchmarks.metrics.utils.logger
- czbenchmarks.metrics.utils.nearest_neighbors_hnsw(data: numpy.ndarray, expansion_factor: int = 200, max_links: int = 48, n_neighbors: int = 100, random_seed: int = RANDOM_SEED) -> tuple[numpy.ndarray, numpy.ndarray]
Find nearest neighbors using HNSW algorithm.
- Parameters:
data – Input data matrix of shape (n_samples, n_features)
expansion_factor – Size of dynamic candidate list for search
max_links – Number of bi-directional links created for every new element
n_neighbors – Number of nearest neighbors to find
random_seed – Random seed for reproducibility
- Returns:
Tuple containing:
Indices array of shape (n_samples, n_neighbors)
Distances array of shape (n_samples, n_neighbors)
- Return type:
tuple[numpy.ndarray, numpy.ndarray]
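To make the return contract concrete, the sketch below computes exact nearest neighbors by brute force with NumPy. It is not HNSW and not the library's implementation: the helper name `brute_force_neighbors` is hypothetical, Euclidean distance is an assumption, and this sketch counts each point as its own first neighbor, which the documentation above does not specify.

```python
import numpy as np


def brute_force_neighbors(data: np.ndarray, n_neighbors: int):
    """Exact k-nearest neighbors (hypothetical stand-in, NOT HNSW)."""
    # Pairwise squared Euclidean distances, shape (n_samples, n_samples).
    diffs = data[:, None, :] - data[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=-1)
    # Indices of the n_neighbors closest points per row, closest first.
    indices = np.argsort(sq_dists, axis=1, kind="stable")[:, :n_neighbors]
    distances = np.sqrt(np.take_along_axis(sq_dists, indices, axis=1))
    return indices, distances


rng = np.random.default_rng(0)
data = rng.normal(size=(50, 8))  # 50 samples, 8 features
indices, distances = brute_force_neighbors(data, n_neighbors=5)
print(indices.shape, distances.shape)
```

Both arrays have one row per sample and one column per neighbor, matching the shapes listed under Returns. An approximate index such as HNSW trades some exactness for speed on large inputs, so its results may differ slightly from this exact search.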
- czbenchmarks.metrics.utils.compute_entropy_per_cell(X: numpy.ndarray, labels: pandas.Categorical | pandas.Series | numpy.ndarray, n_neighbors: int = 200, random_seed: int = RANDOM_SEED) -> numpy.ndarray
Compute entropy of batch labels in local neighborhoods.
For each cell, finds nearest neighbors and computes entropy of batch label distribution in that neighborhood.
- Parameters:
X – Cell embedding matrix of shape (n_cells, n_features)
labels – Batch label for each cell, given as a pandas Categorical, pandas Series, or NumPy array
n_neighbors – Number of nearest neighbors to consider
random_seed – Random seed for reproducibility
- Returns:
Array of entropy values for each cell, normalized by log of number of batches
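The normalized neighborhood entropy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: `neighborhood_entropy` is a hypothetical helper, neighbors are found by exact brute-force search rather than HNSW, and each cell is counted in its own neighborhood.

```python
import numpy as np


def neighborhood_entropy(X, labels, n_neighbors):
    """Normalized entropy of batch labels among each cell's nearest neighbors."""
    labels = np.asarray(labels)
    batches, codes = np.unique(labels, return_inverse=True)
    n_batches = len(batches)  # must be >= 2, otherwise log(n_batches) is 0
    # Exact neighbors by brute force (assumption; the library uses HNSW search).
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    neighbors = np.argsort(sq_dists, axis=1, kind="stable")[:, :n_neighbors]
    # Count how often each batch appears in every neighborhood.
    neighbor_codes = codes[neighbors]
    counts = np.stack(
        [(neighbor_codes == b).sum(axis=1) for b in range(n_batches)], axis=1
    )
    probs = counts / n_neighbors
    # Treat 0 * log(0) as 0 by replacing zero probabilities with 1 inside the log.
    safe = np.where(probs > 0, probs, 1.0)
    entropy = -np.sum(probs * np.log(safe), axis=1)
    # Normalize by the log of the number of batches so values lie in [0, 1].
    return entropy / np.log(n_batches)


# Two well-separated clusters, each drawn from a single batch: no mixing.
X_separated = np.concatenate([np.zeros((10, 2)), np.full((10, 2), 100.0)])
separated = neighborhood_entropy(X_separated, ["a"] * 10 + ["b"] * 10, n_neighbors=5)

# All cells at the same position with alternating batches: perfect mixing.
X_mixed = np.zeros((20, 2))
mixed = neighborhood_entropy(X_mixed, ["a", "b"] * 10, n_neighbors=6)
print(separated.mean(), mixed.mean())
```

Values near 0 indicate neighborhoods dominated by one batch, and values near 1 indicate neighborhoods in which batches are evenly mixed.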
- czbenchmarks.metrics.utils.jaccard_score(y_true: set[str], y_pred: set[str])
Compute Jaccard similarity between true and predicted values.
- Parameters:
y_true – Set of true values
y_pred – Set of predicted values
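Jaccard similarity between two sets is the size of their intersection divided by the size of their union. The sketch below illustrates that definition with a hypothetical helper, `jaccard`. Returning 0.0 for two empty sets is an assumption, since the behavior of `jaccard_score` in that case is not documented here.

```python
def jaccard(y_true: set[str], y_pred: set[str]) -> float:
    """Standard Jaccard similarity: |intersection| / |union|."""
    union = y_true | y_pred
    if not union:
        # Assumption: the library's handling of two empty sets is not documented.
        return 0.0
    return len(y_true & y_pred) / len(union)


true_genes = {"CD4", "CD8A", "IL7R"}
pred_genes = {"CD8A", "IL7R", "MS4A1"}
score = jaccard(true_genes, pred_genes)
print(score)  # 2 shared items out of 4 distinct items -> 0.5
```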
- czbenchmarks.metrics.utils.mean_fold_metric(results_df, metric='accuracy', classifier=None)
Compute mean of a metric across folds.
- Parameters:
results_df – DataFrame containing cross-validation results. Must have the column:
  - “classifier”: Name of the classifier (e.g., “lr”, “knn”)
  and one of the following metric columns:
  - “accuracy”: For accuracy scores
  - “f1”: For F1 scores
  - “precision”: For precision scores
  - “recall”: For recall scores
metric – Name of metric column to average (“accuracy”, “f1”, etc.)
classifier – Optional classifier name to filter results
- Returns:
Mean value of the metric across folds
- Raises:
KeyError – If the specified metric column is not present in results_df
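The expected DataFrame layout and the averaging step can be illustrated with plain pandas. This is a sketch of the documented behavior, not the library's source: `mean_over_folds` is a hypothetical helper, and the fold scores are made-up illustrative values.

```python
import pandas as pd

# One row per (classifier, fold); the scores are illustrative only.
results_df = pd.DataFrame(
    {
        "classifier": ["lr", "lr", "knn", "knn"],
        "accuracy": [0.80, 0.90, 0.70, 0.60],
    }
)


def mean_over_folds(df, metric="accuracy", classifier=None):
    """Average a metric column, optionally restricted to one classifier."""
    if classifier is not None:
        df = df[df["classifier"] == classifier]
    # Selecting a missing column raises KeyError, as documented above.
    return df[metric].mean()


overall = mean_over_folds(results_df)
lr_only = mean_over_folds(results_df, classifier="lr")
print(overall, lr_only)
```

Without a classifier filter the mean is taken over every row, so results from different classifiers are pooled together.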
- czbenchmarks.metrics.utils.single_metric(results_df, metric: str, **kwargs)
Get a single metric value from filtered results.
- Parameters:
results_df – DataFrame containing classification results
metric – Name of metric column to extract (“accuracy”, “f1”, etc.)
**kwargs – Filter parameters (e.g., classifier, train_species, test_species)
- Returns:
Single metric value from the filtered results
- Raises:
ValueError – If filtering results in 0 or >1 rows
KeyError – If the specified metric column is not present in results_df
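The filter-then-extract behavior, including the error raised when the filters do not isolate exactly one row, can be sketched with pandas. The helper name `pick_single`, the example columns, and the scores are hypothetical; treating every keyword argument as an equality filter is an assumption based on the parameter description above.

```python
import pandas as pd

# Illustrative cross-species results; the values are made up.
results_df = pd.DataFrame(
    {
        "classifier": ["lr", "lr", "knn"],
        "train_species": ["human", "mouse", "human"],
        "test_species": ["mouse", "human", "mouse"],
        "f1": [0.71, 0.65, 0.58],
    }
)


def pick_single(df, metric, **filters):
    """Apply equality filters, then return the metric from the one remaining row."""
    for column, value in filters.items():
        df = df[df[column] == value]
    if len(df) != 1:
        raise ValueError(f"Expected exactly 1 matching row, found {len(df)}")
    # Selecting a missing metric column raises KeyError.
    return df[metric].iloc[0]


value = pick_single(
    results_df, "f1", classifier="lr", train_species="human", test_species="mouse"
)
print(value)
```

Filtering on `classifier="lr"` alone would match two rows in this example and therefore raise `ValueError`.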
- czbenchmarks.metrics.utils.aggregate_results(results: Iterable[czbenchmarks.metrics.types.MetricResult]) -> list[czbenchmarks.metrics.types.AggregatedMetricResult]
Aggregate a collection of MetricResults by their type and parameters.
- czbenchmarks.metrics.utils.sequential_alignment(X: numpy.ndarray, labels: numpy.ndarray, k: int = 10, normalize: bool = True, adaptive_k: bool = False) -> float
Measure how sequentially close neighbors are in embedding space.
Works with UNSORTED data - does not assume X and labels are pre-sorted.
- Parameters:
X – Embedding matrix of shape (n_samples, n_features) (can be unsorted)
labels – Sequential labels of shape (n_samples,) (can be unsorted). Must be numeric or convertible to numeric. String labels will raise an error.
k – Number of neighbors to consider
normalize – Whether to normalize the score to the [0, 1] range
adaptive_k – Whether to use an adaptive k based on local density
- Returns:
Sequential alignment score (higher = better sequential consistency)
- Return type:
float
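One way to see what such a score measures is the sketch below. It compares the mean label gap between each point and its k nearest embedding neighbors with the mean gap over all pairs. This is an illustrative formulation only: the exact formula, normalization, and `adaptive_k` behavior of `sequential_alignment` are not described here, and `sequential_alignment_sketch` is a hypothetical helper.

```python
import numpy as np


def sequential_alignment_sketch(X, labels, k=10):
    """Illustrative score in [0, 1]; higher means neighbors have closer labels."""
    labels = np.asarray(labels, dtype=float)  # non-numeric strings raise ValueError
    n = len(labels)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(sq_dists, np.inf)  # exclude each point from its own neighbors
    neighbors = np.argsort(sq_dists, axis=1)[:, :k]
    # Mean label gap between each point and its k nearest neighbors.
    neighbor_gap = np.abs(labels[neighbors] - labels[:, None]).mean()
    # Baseline: mean label gap over all distinct pairs of points.
    all_gaps = np.abs(labels[:, None] - labels[None, :])
    baseline_gap = all_gaps.sum() / (n * (n - 1))
    return float(np.clip(1.0 - neighbor_gap / baseline_gap, 0.0, 1.0))


rng = np.random.default_rng(0)
t = np.arange(30, dtype=float)
X = t[:, None]  # points laid out along a line in label order

score_sorted = sequential_alignment_sketch(X, t, k=2)
# Shuffling rows of X and labels together leaves the score unchanged.
perm = rng.permutation(30)
score_unsorted = sequential_alignment_sketch(X[perm], t[perm], k=2)
# Shuffling labels alone breaks the link between position and sequence.
score_random = sequential_alignment_sketch(X, rng.permutation(t), k=2)
print(score_sorted, score_unsorted, score_random)
```

Because neighbors are determined by distances in the embedding rather than by row order, the input does not need to be pre-sorted.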