czbenchmarks.tasks.utils
Attributes
- logger
- MULTI_DATASET_TASK_NAMES
- TASK_NAMES
Functions
- print_correlation_metrics_baseline_and_model: Print a summary table of all metrics.
- print_metrics_summary: Print a nice summary table of all metrics.
- cluster_embedding: Cluster cells in embedding space using the Leiden algorithm.
- filter_minimum_class: Filter data to remove classes with too few samples.
- run_standard_scrna_workflow: Run a standard preprocessing workflow for single-cell RNA-seq data.
- is_not_count_data: Guess if a matrix contains log-normalized (non-integer) values by inspecting random cell sums.
- aggregate_cells_to_samples: Aggregate cell-level embeddings to sample level.
Module Contents
- czbenchmarks.tasks.utils.logger
- czbenchmarks.tasks.utils.MULTI_DATASET_TASK_NAMES
- czbenchmarks.tasks.utils.TASK_NAMES
- czbenchmarks.tasks.utils.print_correlation_metrics_baseline_and_model(metrics_df: pandas.DataFrame, moderate_correlation_threshold: float = 0.3, precision: int = 4)
Print a summary table of all metrics.
- Parameters:
metrics_df – DataFrame of metric values for the model predictions and the baseline
moderate_correlation_threshold – Threshold for considering a correlation as moderate
precision – Precision for the summary table
- czbenchmarks.tasks.utils.print_metrics_summary(metrics_list)
Print a nice summary table of all metrics.
- Parameters:
metrics_list – List of MetricResult objects or dict with metric lists
- czbenchmarks.tasks.utils.cluster_embedding(adata: anndata.AnnData, n_iterations: int = 2, flavor: Literal['leidenalg', 'igraph'] = FLAVOR, use_rep: str = 'X', key_added: str = KEY_ADDED, *, random_seed: int = RANDOM_SEED) → List[int]
Cluster cells in embedding space using the Leiden algorithm.
Computes nearest neighbors in the embedding space and runs the Leiden community detection algorithm to identify clusters.
- Parameters:
adata – AnnData object containing the embedding
n_iterations – Number of iterations for the Leiden algorithm
flavor – Flavor of the Leiden algorithm
use_rep – Key in adata.obsm containing the embedding coordinates. If None, the embedding is assumed to be in adata.X
key_added – Key in adata.obs to store the cluster assignments
random_seed (int) – Random seed for reproducibility
- Returns:
List of cluster assignments as integers
- czbenchmarks.tasks.utils.filter_minimum_class(features: numpy.ndarray, labels: numpy.ndarray | pandas.Series, min_class_size: int = 10) → tuple[numpy.ndarray, numpy.ndarray | pandas.Series]
Filter data to remove classes with too few samples.
Removes classes that have fewer samples than the minimum threshold. Useful for ensuring enough samples per class for ML tasks.
- Parameters:
features – Feature matrix of shape (n_samples, n_features)
labels – Labels array of shape (n_samples,)
min_class_size – Minimum number of samples required per class
- Returns:
Tuple containing:
Filtered feature matrix
Filtered labels as categorical data
- Return type:
tuple[numpy.ndarray, numpy.ndarray | pandas.Series]
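The filtering rule described for filter_minimum_class can be illustrated with a small dependency-free sketch. The helper name `filter_minimum_class_sketch` is hypothetical and works on plain lists; it is not the library's implementation, which operates on NumPy arrays and returns the labels as categorical data.

```python
from collections import Counter


def filter_minimum_class_sketch(features, labels, min_class_size=10):
    """Keep only the rows whose label occurs at least min_class_size times.

    Hypothetical stand-in for illustration; it mirrors the documented rule
    (classes with fewer samples than the threshold are removed).
    """
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")

    # Count how many samples each class has.
    counts = Counter(labels)

    # Indices of samples whose class meets the threshold, in original order.
    keep = [i for i, label in enumerate(labels) if counts[label] >= min_class_size]

    return [features[i] for i in keep], [labels[i] for i in keep]


features = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
labels = ["T cell", "T cell", "B cell", "T cell", "NK cell"]

# With a threshold of 2, the single-sample "B cell" and "NK cell" classes are dropped.
kept_features, kept_labels = filter_minimum_class_sketch(
    features, labels, min_class_size=2
)
```

Features and labels are filtered with the same index list, so they stay aligned row by row.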
- czbenchmarks.tasks.utils.run_standard_scrna_workflow(adata: anndata.AnnData, n_top_genes: int = 3000, n_pcs: int = 50, obsm_key: str = OBSM_KEY, random_state: int = RANDOM_SEED) → czbenchmarks.tasks.types.CellRepresentation
Run a standard preprocessing workflow for single-cell RNA-seq data.
This function performs common preprocessing steps for scRNA-seq analysis:
1. Normalization of counts per cell
2. Log transformation
3. Identification of highly variable genes
4. Subsetting to highly variable genes
5. Principal component analysis
- Parameters:
adata – AnnData object containing the raw count data
n_top_genes – Number of highly variable genes to select
n_pcs – Number of principal components to compute
random_state – Random seed for reproducibility
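To make the first four steps of the workflow concrete, here is a rough, dependency-free sketch on a tiny count matrix. The helper name `normalize_log_and_select`, the `target_sum` of 10,000, and the plain variance ranking are all assumptions made for illustration; the library may normalize and select genes differently, and the PCA step is omitted here.

```python
import math
from statistics import pvariance


def normalize_log_and_select(counts, n_top_genes, target_sum=10_000.0):
    """Illustrative version of steps 1-4 on a list-of-lists count matrix."""
    transformed = []
    for row in counts:
        total = sum(row)
        # Step 1: scale each cell so its counts sum to target_sum (assumed value).
        scale = target_sum / total if total > 0 else 0.0
        # Step 2: log(1 + x) transformation of the normalized values.
        transformed.append([math.log1p(value * scale) for value in row])

    # Step 3: rank genes by variance across cells (a simplification of
    # highly-variable-gene selection).
    n_genes = len(counts[0])
    variances = [pvariance([row[g] for row in transformed]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    top_genes = sorted(ranked[:n_top_genes])

    # Step 4: subset the matrix to the selected genes.
    subset = [[row[g] for g in top_genes] for row in transformed]
    return subset, top_genes


# Three cells by four genes; genes 2 and 3 are proportional to library size,
# so they become constant after normalization and are not selected.
counts = [
    [10, 0, 5, 5],
    [0, 10, 5, 5],
    [20, 0, 10, 10],
]
subset, top_genes = normalize_log_and_select(counts, n_top_genes=2)
```

In the real workflow the reduced matrix would then go through principal component analysis with n_pcs components.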
- czbenchmarks.tasks.utils.is_not_count_data(matrix: czbenchmarks.tasks.types.CellRepresentation, sample_size: int | float = 1000, tol: float = 0.01, random_seed: int = RANDOM_SEED) → bool
Guess if a matrix contains log-normalized (non-integer) values by inspecting random cell sums.
This function randomly picks a subset of rows (cells), sums their values, and checks if any of those sums are not close to integers, which would indicate the data is not raw counts.
- Parameters:
matrix – Expression matrix (cells x genes).
sample_size – How many cells to check (default: 1000 or all if fewer).
tol – Allowed deviation from integer for sum to be considered integer-like.
- Returns:
True if at least one sampled cell sum is non-integer (suggesting log-normalized data).
- Return type:
bool
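The heuristic described for is_not_count_data can be sketched in a few lines of standard-library Python. The helper name `looks_non_count` is hypothetical, the seed value 42 is an assumption (the value of RANDOM_SEED is not given here), and sample_size is treated as an absolute number of cells, although the real signature also accepts a float.

```python
import math
import random


def looks_non_count(matrix, sample_size=1000, tol=0.01, random_seed=42):
    """Return True if any sampled row sum deviates from an integer by more than tol."""
    rng = random.Random(random_seed)  # seed value is an assumption
    n_cells = len(matrix)

    # Sample at most sample_size distinct rows (all rows if there are fewer).
    n_sampled = min(int(sample_size), n_cells)
    sampled_rows = rng.sample(range(n_cells), n_sampled)

    for i in sampled_rows:
        total = sum(matrix[i])
        # Raw counts sum to an integer; log-normalized values generally do not.
        if abs(total - round(total)) > tol:
            return True
    return False


raw_counts = [[3, 0, 1], [0, 2, 5]]
log_values = [[math.log1p(v) for v in row] for row in raw_counts]

raw_flag = looks_non_count(raw_counts)
log_flag = looks_non_count(log_values)
```

Because only row sums are checked, the test is a guess, not a proof: non-integer values whose sum happens to fall within tol of an integer would go undetected.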
- czbenchmarks.tasks.utils.aggregate_cells_to_samples(embeddings: czbenchmarks.tasks.types.CellRepresentation, labels: czbenchmarks.types.ListLike, sample_ids: czbenchmarks.types.ListLike, aggregation_method: Literal['mean', 'median'] = 'mean') → tuple[numpy.ndarray, pandas.Series, pandas.Series]
Aggregate cell-level embeddings to sample level.
This function groups cells by sample ID and aggregates their embeddings using the specified method. It also ensures that each sample has a consistent label (taking the first occurrence for each sample).
- Parameters:
embeddings – Cell-level embeddings of shape (n_cells, d)
labels – Cell-level labels, length n_cells
sample_ids – Sample/donor identifiers for grouping cells, length n_cells
aggregation_method – Method to aggregate embeddings (“mean” or “median”)
- Returns:
Tuple containing:
sample_embeddings: Aggregated embeddings (n_samples, d)
sample_labels: Labels for each sample (length n_samples)
sample_ids_out: Sample identifiers (length n_samples)
- Return type:
tuple[numpy.ndarray, pandas.Series, pandas.Series]
- Raises:
ValueError – If inputs have mismatched lengths
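The grouping behavior of aggregate_cells_to_samples can be illustrated with a small sketch on plain lists. The helper name `aggregate_sketch` is hypothetical, and the output order (first appearance of each sample ID) is an assumption; the library returns a NumPy array and pandas Series and may order samples differently.

```python
from statistics import mean, median


def aggregate_sketch(embeddings, labels, sample_ids, aggregation_method="mean"):
    """Group cell embeddings by sample ID and reduce each group per dimension."""
    if not (len(embeddings) == len(labels) == len(sample_ids)):
        raise ValueError("embeddings, labels and sample_ids must have the same length")
    if aggregation_method not in ("mean", "median"):
        raise ValueError("aggregation_method must be 'mean' or 'median'")
    reducer = mean if aggregation_method == "mean" else median

    cells_by_sample = {}  # sample ID -> list of cell embeddings (insertion-ordered)
    first_label = {}      # sample ID -> label of the first cell seen for that sample
    for embedding, label, sample_id in zip(embeddings, labels, sample_ids):
        cells_by_sample.setdefault(sample_id, []).append(embedding)
        first_label.setdefault(sample_id, label)

    sample_ids_out = list(cells_by_sample)
    # zip(*cells) transposes the group so each dimension is reduced separately.
    sample_embeddings = [
        [reducer(column) for column in zip(*cells_by_sample[sample_id])]
        for sample_id in sample_ids_out
    ]
    sample_labels = [first_label[sample_id] for sample_id in sample_ids_out]
    return sample_embeddings, sample_labels, sample_ids_out


embeddings = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0]]
labels = ["healthy", "healthy", "disease"]
sample_ids = ["donor_a", "donor_a", "donor_b"]

sample_embeddings, sample_labels, sample_ids_out = aggregate_sketch(
    embeddings, labels, sample_ids
)
```

Taking the first label per sample mirrors the documented behavior and assumes that all cells from one sample share the same label.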