czbenchmarks.metrics.utils

Functions

nearest_neighbors_hnsw(→ tuple[numpy.ndarray, numpy.ndarray])

Find nearest neighbors using HNSW algorithm.

compute_entropy_per_cell(→ numpy.ndarray)

Compute entropy of batch labels in local neighborhoods.

jaccard_score(y_true, y_pred)

Compute Jaccard similarity between true and predicted values.

mean_fold_metric(results_df[, metric, classifier])

Compute mean of a metric across folds.

Module Contents

czbenchmarks.metrics.utils.nearest_neighbors_hnsw(data: numpy.ndarray, expansion_factor: int = 200, max_links: int = 48, n_neighbors: int = 100) tuple[numpy.ndarray, numpy.ndarray][source]

Find nearest neighbors using HNSW algorithm.

Parameters:
  • data – Input data matrix of shape (n_samples, n_features)

  • expansion_factor – Size of dynamic candidate list for search

  • max_links – Number of bi-directional links created for every new element

  • n_neighbors – Number of nearest neighbors to find

Returns:

  • Indices array of shape (n_samples, n_neighbors)

  • Distances array of shape (n_samples, n_neighbors)

Return type:

tuple[numpy.ndarray, numpy.ndarray]

czbenchmarks.metrics.utils.compute_entropy_per_cell(X: numpy.ndarray, labels: pandas.Categorical | pandas.Series | numpy.ndarray) numpy.ndarray[source]

Compute entropy of batch labels in local neighborhoods.

For each cell, finds nearest neighbors and computes entropy of batch label distribution in that neighborhood.

Parameters:
  • X – Cell embedding matrix of shape (n_cells, n_features)

  • labels – Batch label for each cell, as a pandas Categorical, pandas Series, or numpy array

Returns:

Array of entropy values for each cell, normalized by log of number of batches
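The computation can be sketched as follows, using exact neighbors for clarity (the library function finds them with HNSW, and its neighborhood size may differ from the `n_neighbors=15` assumed here); `entropy_per_cell` is a hypothetical reference name:

```python
import numpy as np


def entropy_per_cell(X: np.ndarray, labels: np.ndarray, n_neighbors: int = 15) -> np.ndarray:
    """Hypothetical reference version of compute_entropy_per_cell.

    Uses exact neighbors for illustration; the library uses approximate
    HNSW search, so values may differ slightly.
    """
    uniq, codes = np.unique(np.asarray(labels), return_inverse=True)
    n_batches = len(uniq)
    if n_batches < 2:
        return np.zeros(X.shape[0])  # a single batch has zero entropy everywhere
    # Exact nearest neighbors (the cell itself is included in its neighborhood)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    neighbors = np.argsort(d2, axis=1)[:, :n_neighbors]
    entropies = np.empty(X.shape[0])
    for i, hood in enumerate(neighbors):
        counts = np.bincount(codes[hood], minlength=n_batches)
        p = counts / counts.sum()
        p = p[p > 0]  # 0 * log(0) is taken as 0
        entropies[i] = -np.sum(p * np.log(p))
    # Normalize by log(n_batches): 0 = one batch per neighborhood, 1 = perfectly mixed
    return entropies / np.log(n_batches)
```

Well-mixed batches yield values near 1.0, while batch-separated embeddings yield values near 0.0, which is what makes this a batch-integration metric.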

czbenchmarks.metrics.utils.jaccard_score(y_true: set[str], y_pred: set[str])[source]

Compute Jaccard similarity between true and predicted values.

Parameters:
  • y_true – True values

  • y_pred – Predicted values

Returns:

Jaccard similarity (size of the intersection divided by size of the union), a float in [0.0, 1.0]
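The set form of Jaccard similarity is a one-liner; this sketch is a hypothetical stand-in for the library function, and its handling of two empty sets is an assumption not stated in the docs:

```python
def jaccard(y_true: set[str], y_pred: set[str]) -> float:
    # Hypothetical reference implementation: |intersection| / |union|.
    # Treating two empty sets as perfectly similar (1.0) is an assumption.
    if not y_true and not y_pred:
        return 1.0
    return len(y_true & y_pred) / len(y_true | y_pred)


jaccard({"a", "b", "c"}, {"b", "c", "d"})  # 2 shared of 4 total -> 0.5
```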

czbenchmarks.metrics.utils.mean_fold_metric(results_df, metric='accuracy', classifier=None)[source]

Compute mean of a metric across folds.

Parameters:
  • results_df –

    DataFrame containing cross-validation results. Must have columns:

    • “classifier”: Name of the classifier (e.g., “lr”, “knn”)

    • One of the following metric columns:

      • “accuracy”: For accuracy scores

      • “f1”: For F1 scores

      • “precision”: For precision scores

      • “recall”: For recall scores

  • metric – Name of metric column to average (“accuracy”, “f1”, etc.)

  • classifier – Optional classifier name to filter results

Returns:

Mean value of the metric across folds

Raises:

KeyError – If the specified metric column is not present in results_df