czbenchmarks.tasks.utils

Attributes

logger

MULTI_DATASET_TASK_NAMES

TASK_NAMES

Functions

print_correlation_metrics_baseline_and_model(metrics_df)

Print a summary table of all metrics.

print_metrics_summary(metrics_list)

Print a nice summary table of all metrics.

cluster_embedding(→ List[int])

Cluster cells in embedding space using the Leiden algorithm.

filter_minimum_class(→ tuple[numpy.ndarray, ...)

Filter data to remove classes with too few samples.

run_standard_scrna_workflow(...)

Run a standard preprocessing workflow for single-cell RNA-seq data.

is_not_count_data(→ bool)

Guess if a matrix contains log-normalized (non-integer) values by inspecting random cell sums.

aggregate_cells_to_samples(→ tuple[numpy.ndarray, ...)

Aggregate cell-level embeddings to sample level.

Module Contents

czbenchmarks.tasks.utils.logger
czbenchmarks.tasks.utils.MULTI_DATASET_TASK_NAMES
czbenchmarks.tasks.utils.TASK_NAMES
czbenchmarks.tasks.utils.print_correlation_metrics_baseline_and_model(metrics_df: pandas.DataFrame, moderate_correlation_threshold: float = 0.3, precision: int = 4)[source]

Print a summary table of all metrics.

Parameters:
  • metrics_dict – Dictionary of model prediction metric values

  • baseline_metrics_dict – Dictionary of baseline metric values

  • moderate_correlation_threshold – Threshold for considering a correlation as moderate

  • precision – Precision for the summary table

czbenchmarks.tasks.utils.print_metrics_summary(metrics_list)[source]

Print a nice summary table of all metrics.

Parameters:

metrics_list – List of MetricResult objects or dict with metric lists

czbenchmarks.tasks.utils.cluster_embedding(adata: anndata.AnnData, n_iterations: int = 2, flavor: Literal['leidenalg', 'igraph'] = FLAVOR, use_rep: str = 'X', key_added: str = KEY_ADDED, *, random_seed: int = RANDOM_SEED) List[int][source]

Cluster cells in embedding space using the Leiden algorithm.

Computes nearest neighbors in the embedding space and runs the Leiden community detection algorithm to identify clusters.

Parameters:
  • adata – AnnData object containing the embedding

  • n_iterations – Number of iterations for the Leiden algorithm

  • flavor – Flavor of the Leiden algorithm

  • use_rep – Key in adata.obsm containing the embedding coordinates. If None, the embedding is assumed to be in adata.X

  • key_added – Key in adata.obs to store the cluster assignments

  • random_seed (int) – Random seed for reproducibility

Returns:

List of cluster assignments as integers

czbenchmarks.tasks.utils.filter_minimum_class(features: numpy.ndarray, labels: numpy.ndarray | pandas.Series, min_class_size: int = 10) tuple[numpy.ndarray, numpy.ndarray | pandas.Series][source]

Filter data to remove classes with too few samples.

Removes classes that have fewer samples than the minimum threshold. Useful for ensuring enough samples per class for ML tasks.

Parameters:
  • features – Feature matrix of shape (n_samples, n_features)

  • labels – Labels array of shape (n_samples,)

  • min_class_size – Minimum number of samples required per class

Returns:

Tuple containing:

  • Filtered feature matrix

  • Filtered labels as categorical data
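The filtering rule described above can be illustrated with plain Python lists. This is a minimal sketch of the logic only, not the library's implementation: the name filter_minimum_class_sketch is hypothetical, and the real function works on numpy arrays or a pandas Series and returns categorical labels.

```python
from collections import Counter


def filter_minimum_class_sketch(features, labels, min_class_size=10):
    """Keep only the rows whose class has at least min_class_size samples.

    Hypothetical stand-in for illustration; it mirrors the documented
    behaviour but uses plain lists instead of numpy/pandas objects.
    """
    counts = Counter(labels)
    # Indices of rows whose class meets the minimum size.
    keep = [i for i, label in enumerate(labels) if counts[label] >= min_class_size]
    return [features[i] for i in keep], [labels[i] for i in keep]


features = [[0.1], [0.2], [0.3], [0.4], [0.5]]
labels = ["T cell", "T cell", "B cell", "T cell", "NK cell"]

# "B cell" and "NK cell" each have one sample, so they are dropped.
kept_features, kept_labels = filter_minimum_class_sketch(
    features, labels, min_class_size=2
)
```

Features and labels are filtered with the same index list, so the two outputs stay aligned row by row.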

czbenchmarks.tasks.utils.run_standard_scrna_workflow(adata: anndata.AnnData, n_top_genes: int = 3000, n_pcs: int = 50, obsm_key: str = OBSM_KEY, random_state: int = RANDOM_SEED) czbenchmarks.tasks.types.CellRepresentation[source]

Run a standard preprocessing workflow for single-cell RNA-seq data.

This function performs common preprocessing steps for scRNA-seq analysis:

  1. Normalization of counts per cell

  2. Log transformation

  3. Identification of highly variable genes

  4. Subsetting to highly variable genes

  5. Principal component analysis

Parameters:
  • adata – AnnData object containing the raw count data

  • n_top_genes – Number of highly variable genes to select

  • n_pcs – Number of principal components to compute

  • random_state – Random seed for reproducibility
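As a rough illustration of the first two steps only (per-cell normalization and log transformation), the sketch below uses plain Python. The function name normalize_and_log_sketch and the target_sum value of 1e4 are assumptions made for this example; the documentation does not state which target the library normalizes to, and highly variable gene selection and PCA are not shown.

```python
import math


def normalize_and_log_sketch(counts, target_sum=1e4):
    """Scale each cell (row) to the same total, then apply log(1 + x).

    Hypothetical helper for illustration; target_sum is an assumed value,
    not one taken from the library.
    """
    transformed = []
    for row in counts:
        total = sum(row)
        # Cells with no counts are left at zero to avoid dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        transformed.append([math.log1p(value * scale) for value in row])
    return transformed


# Two cells with the same gene proportions but different sequencing depth.
counts = [[1, 3], [10, 30]]
transformed = normalize_and_log_sketch(counts)
```

After normalization the two cells have identical profiles, which is the purpose of the per-cell scaling: differences in sequencing depth no longer dominate the comparison.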

czbenchmarks.tasks.utils.is_not_count_data(matrix: czbenchmarks.tasks.types.CellRepresentation, sample_size: int | float = 1000, tol: float = 0.01, random_seed: int = RANDOM_SEED) bool[source]

Guess if a matrix contains log-normalized (non-integer) values by inspecting random cell sums.

This function randomly picks a subset of rows (cells), sums their values, and checks if any of those sums are not close to integers, which would indicate the data is not raw counts.

Parameters:
  • matrix – Expression matrix (cells x genes).

  • sample_size – Number of cells to check (default: 1000, or all cells if there are fewer).

  • tol – Allowed deviation from integer for sum to be considered integer-like.

Returns:

True if at least one sampled cell sum is non-integer (suggesting log-normalized data).

Return type:

bool
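The heuristic described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: the name is_not_count_data_sketch is hypothetical, the default seed of 0 is a placeholder for RANDOM_SEED (whose value is not given here), and only an integer sample_size is handled.

```python
import random


def is_not_count_data_sketch(matrix, sample_size=1000, tol=0.01, random_seed=0):
    """Return True if any sampled row sum deviates from an integer by more than tol.

    Hypothetical stand-in for illustration; matrix is a list of rows
    (cells), each a list of per-gene values.
    """
    rng = random.Random(random_seed)
    n_cells = len(matrix)
    # Check at most sample_size cells, or all of them if there are fewer.
    n_sample = min(int(sample_size), n_cells)
    for i in rng.sample(range(n_cells), n_sample):
        total = sum(matrix[i])
        if abs(total - round(total)) > tol:
            return True
    return False


raw_counts = [[0, 3, 1], [2, 0, 5]]
log_normalized = [[0.0, 1.386, 0.693], [1.099, 0.0, 1.792]]

raw_flag = is_not_count_data_sketch(raw_counts)
log_flag = is_not_count_data_sketch(log_normalized)
```

Because the check works on row sums, it is a guess rather than a proof: a non-count matrix whose sampled row sums all happen to fall within tol of an integer would be reported as count data.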

czbenchmarks.tasks.utils.aggregate_cells_to_samples(embeddings: czbenchmarks.tasks.types.CellRepresentation, labels: czbenchmarks.types.ListLike, sample_ids: czbenchmarks.types.ListLike, aggregation_method: Literal['mean', 'median'] = 'mean') tuple[numpy.ndarray, pandas.Series, pandas.Series][source]

Aggregate cell-level embeddings to sample level.

This function groups cells by sample ID and aggregates their embeddings using the specified method. It also ensures that each sample has a consistent label (taking the first occurrence for each sample).

Parameters:
  • embeddings – Cell-level embeddings of shape (n_cells, d)

  • labels – Cell-level labels, length n_cells

  • sample_ids – Sample/donor identifiers for grouping cells, length n_cells

  • aggregation_method – Method to aggregate embeddings (“mean” or “median”)

Returns:

Tuple containing:

  • sample_embeddings: Aggregated embeddings (n_samples, d)

  • sample_labels: Labels for each sample (length n_samples)

  • sample_ids_out: Sample identifiers (length n_samples)

Raises:

ValueError – If inputs have mismatched lengths
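The grouping and aggregation described above can be sketched with plain Python lists. This is a minimal illustration, not the library's implementation: the name aggregate_cells_to_samples_sketch is hypothetical, the real function returns a numpy array and pandas Series, and ordering the output by first appearance of each sample ID is an assumption of this sketch.

```python
from statistics import mean, median


def aggregate_cells_to_samples_sketch(
    embeddings, labels, sample_ids, aggregation_method="mean"
):
    """Group cells by sample ID and aggregate their embeddings per dimension.

    Hypothetical stand-in for illustration; each sample keeps the label of
    its first cell, as the documentation describes.
    """
    if not (len(embeddings) == len(labels) == len(sample_ids)):
        raise ValueError("embeddings, labels, and sample_ids must have equal length")
    if aggregation_method not in ("mean", "median"):
        raise ValueError("aggregation_method must be 'mean' or 'median'")
    reduce = mean if aggregation_method == "mean" else median

    rows_by_sample = {}  # sample ID -> list of embedding rows (insertion ordered)
    first_label = {}     # sample ID -> label of the first cell seen
    for row, label, sample_id in zip(embeddings, labels, sample_ids):
        rows_by_sample.setdefault(sample_id, []).append(row)
        first_label.setdefault(sample_id, label)

    sample_ids_out = list(rows_by_sample)
    # zip(*rows) transposes the rows so each dimension is reduced separately.
    sample_embeddings = [
        [reduce(column) for column in zip(*rows_by_sample[sample_id])]
        for sample_id in sample_ids_out
    ]
    sample_labels = [first_label[sample_id] for sample_id in sample_ids_out]
    return sample_embeddings, sample_labels, sample_ids_out


embeddings = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
labels = ["healthy", "healthy", "disease"]
sample_ids = ["d1", "d1", "d2"]

sample_embeddings, sample_labels, sample_ids_out = aggregate_cells_to_samples_sketch(
    embeddings, labels, sample_ids
)
```

Taking the first label per sample assumes that all cells from one sample share a label; if they do not, the remaining labels are silently ignored in this sketch.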