czbenchmarks.tasks.utils

Attributes

logger

MULTI_DATASET_TASK_NAMES

TASK_NAMES

Functions

cluster_embedding(→ List[int])

Cluster cells in embedding space using the Leiden algorithm.

filter_minimum_class(→ tuple[numpy.ndarray, ...)

Filter data to remove classes with too few samples.

run_standard_scrna_workflow(→ anndata.AnnData)

Run a standard preprocessing workflow for single-cell RNA-seq data.

Module Contents

czbenchmarks.tasks.utils.logger
czbenchmarks.tasks.utils.MULTI_DATASET_TASK_NAMES
czbenchmarks.tasks.utils.TASK_NAMES
czbenchmarks.tasks.utils.cluster_embedding(adata: anndata.AnnData, obsm_key: str = OBSM_KEY, random_seed: int = RANDOM_SEED, n_iterations: int = 2, flavor: str = FLAVOR, key_added: str = KEY_ADDED) → List[int]

Cluster cells in embedding space using the Leiden algorithm.

Computes nearest neighbors in the embedding space and runs the Leiden community detection algorithm to identify clusters.

Parameters:
  • adata – AnnData object containing the embedding

  • obsm_key – Key in adata.obsm containing the embedding coordinates

  • random_seed – Random seed for reproducibility

  • n_iterations – Number of iterations for the Leiden algorithm

  • flavor – Flavor of the Leiden algorithm

  • key_added – Key in adata.obs to store the cluster assignments

Returns:

List of cluster assignments as integers
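The nearest-neighbor step described above can be illustrated in isolation. The following is a minimal NumPy sketch, not the library's implementation: the helper name knn_indices, the brute-force Euclidean search, and the toy embedding are all assumptions made for illustration, and the Leiden community detection that runs on the resulting graph is not reproduced here.

```python
import numpy as np


def knn_indices(embedding: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k nearest neighbors of each row (Euclidean).

    Hypothetical helper for illustration only; not part of czbenchmarks.
    """
    # Pairwise squared distances via broadcasting: shape (n_cells, n_cells).
    diff = embedding[:, None, :] - embedding[None, :, :]
    dist2 = (diff ** 2).sum(axis=-1)
    # Exclude each cell from its own neighbor list.
    np.fill_diagonal(dist2, np.inf)
    # Sort each row by distance and keep the k closest cells.
    return np.argsort(dist2, axis=1)[:, :k]


# Toy embedding: two well-separated groups of three cells each.
embedding = np.array(
    [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
     [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
)
neighbors = knn_indices(embedding, k=2)
```

In this toy case every cell's neighbors lie within its own group, which is the structure a community detection algorithm such as Leiden would then recover as two clusters.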

czbenchmarks.tasks.utils.filter_minimum_class(features: numpy.ndarray, labels: numpy.ndarray | pandas.Series, min_class_size: int = 10) → tuple[numpy.ndarray, numpy.ndarray | pandas.Series]

Filter data to remove classes with too few samples.

Removes classes that have fewer samples than the minimum threshold. Useful for ensuring enough samples per class for ML tasks.

Parameters:
  • features – Feature matrix of shape (n_samples, n_features)

  • labels – Labels array of shape (n_samples,)

  • min_class_size – Minimum number of samples required per class

Returns:

Tuple containing:

  • Filtered feature matrix

  • Filtered labels as categorical data
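The filtering behavior described above can be sketched with NumPy and pandas. This is an illustrative re-implementation under stated assumptions, not the library's code: the function name filter_small_classes is hypothetical, and it assumes that a class with exactly min_class_size samples is kept, since only classes with fewer samples are described as removed.

```python
import numpy as np
import pandas as pd


def filter_small_classes(features, labels, min_class_size=10):
    """Hypothetical sketch of the documented behavior; not czbenchmarks code."""
    labels = pd.Series(labels).reset_index(drop=True)
    # Count samples per class and find the classes that meet the threshold.
    counts = labels.value_counts()
    large_enough = counts[counts >= min_class_size].index
    # Boolean mask over samples, applied to both features and labels.
    keep = labels.isin(large_enough).to_numpy()
    # Return the labels as categorical data, as the documentation describes.
    return features[keep], pd.Categorical(labels[keep])


features = np.arange(12, dtype=float).reshape(6, 2)
labels = np.array(["a", "a", "a", "b", "b", "c"])
X, y = filter_small_classes(features, labels, min_class_size=2)
```

Here class "c" has a single sample and is dropped, so X keeps five of the six rows and y contains only the categories "a" and "b".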

czbenchmarks.tasks.utils.run_standard_scrna_workflow(adata: anndata.AnnData, n_top_genes: int = 3000, n_pcs: int = 50, random_state: int = 42) → anndata.AnnData

Run a standard preprocessing workflow for single-cell RNA-seq data.

This function performs common preprocessing steps for scRNA-seq analysis:

  1. Normalization of counts per cell

  2. Log transformation

  3. Identification of highly variable genes

  4. Subsetting to highly variable genes

  5. Principal component analysis

Parameters:
  • adata – AnnData object containing the raw count data

  • n_top_genes – Number of highly variable genes to select

  • n_pcs – Number of principal components to compute

  • random_state – Random seed for reproducibility
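To make the five preprocessing steps concrete, here is a plain NumPy sketch on a dense toy count matrix. It is not the library's implementation and does not operate on an AnnData object: the function name toy_scrna_workflow, the target_sum of 1e4, and the use of raw variance as a stand-in for highly variable gene selection are all assumptions chosen for illustration.

```python
import numpy as np


def toy_scrna_workflow(counts, n_top_genes=3, n_pcs=2, target_sum=1e4):
    """Hypothetical NumPy walk-through of the five documented steps."""
    # 1. Normalize counts per cell so every row sums to target_sum.
    totals = counts.sum(axis=1, keepdims=True)
    normalized = counts / totals * target_sum
    # 2. Log transformation, log(1 + x).
    logged = np.log1p(normalized)
    # 3. Rank genes by variance (a simple stand-in for HVG selection).
    top = np.argsort(logged.var(axis=0))[::-1][:n_top_genes]
    # 4. Subset to the selected genes, preserving original gene order.
    subset = logged[:, np.sort(top)]
    # 5. PCA via SVD of the mean-centered matrix.
    centered = subset - subset.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_pcs] * s[:n_pcs]


rng = np.random.default_rng(42)
# 20 cells by 8 genes; adding 1 guarantees no cell has zero total counts.
counts = rng.poisson(5.0, size=(20, 8)).astype(float) + 1.0
pcs = toy_scrna_workflow(counts)
```

The result has one row per cell and n_pcs columns, ordered by decreasing explained variance.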