czbenchmarks.tasks.utils

Attributes

`logger`
`MULTI_DATASET_TASK_NAMES`
`TASK_NAMES`

Functions

`cluster_embedding`(→ List[int])	Cluster cells in embedding space using the Leiden algorithm.
`filter_minimum_class`(→ tuple[numpy.ndarray, ...)	Filter data to remove classes with too few samples.
`run_standard_scrna_workflow`(→ anndata.AnnData)	Run a standard preprocessing workflow for single-cell RNA-seq data.

Module Contents

czbenchmarks.tasks.utils.logger

czbenchmarks.tasks.utils.MULTI_DATASET_TASK_NAMES

czbenchmarks.tasks.utils.TASK_NAMES

czbenchmarks.tasks.utils.cluster_embedding(adata: anndata.AnnData, obsm_key: str = OBSM_KEY, n_iterations: int = 2, flavor: Literal['leidenalg', 'igraph'] = FLAVOR, key_added: str = KEY_ADDED, *, random_seed: int = RANDOM_SEED) → List[int]

Cluster cells in embedding space using the Leiden algorithm.

Computes nearest neighbors in the embedding space and runs the Leiden community detection algorithm to identify clusters.

Parameters:

adata – AnnData object containing the embedding
obsm_key – Key in adata.obsm containing the embedding coordinates
n_iterations – Number of iterations for the Leiden algorithm
flavor – Flavor of the Leiden algorithm
key_added – Key in adata.obs to store the cluster assignments
random_seed – Random seed for reproducibility

Returns:

List of cluster assignments as integers
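
A minimal usage sketch, based only on the signature documented above. The helper name `cluster_random_embedding`, the obsm key `"emb"`, and the synthetic data are hypothetical choices for illustration. Calling the helper requires `czbenchmarks` and `anndata` to be installed, so both are imported lazily inside it.

```python
import numpy as np


def cluster_random_embedding(n_cells=200, n_dims=16, seed=0):
    """Call cluster_embedding on a synthetic embedding (hypothetical helper)."""
    # Imported lazily so that defining this helper does not require the packages.
    import anndata as ad
    from czbenchmarks.tasks.utils import cluster_embedding

    rng = np.random.default_rng(seed)
    embedding = rng.normal(size=(n_cells, n_dims))

    # X is a placeholder matrix; the embedding to cluster is stored in obsm.
    adata = ad.AnnData(X=np.zeros((n_cells, 1), dtype=np.float32))
    adata.obsm["emb"] = embedding  # "emb" is an arbitrary key chosen for this example

    # random_seed is keyword-only in the documented signature.
    return cluster_embedding(adata, obsm_key="emb", random_seed=seed)
```

Passing `obsm_key` explicitly avoids relying on the value of the `OBSM_KEY` default, which this page does not state.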

czbenchmarks.tasks.utils.filter_minimum_class(features: numpy.ndarray, labels: numpy.ndarray | pandas.Series, min_class_size: int = 10) → tuple[numpy.ndarray, numpy.ndarray | pandas.Series]

Filter data to remove classes with too few samples.

Removes every class that has fewer than min_class_size samples, together with the corresponding rows of the feature matrix. This helps ensure that each remaining class has enough samples for downstream machine-learning tasks.

Parameters:

features – Feature matrix of shape (n_samples, n_features)
labels – Labels array of shape (n_samples,)
min_class_size – Minimum number of samples required per class

Returns:

Tuple containing:

Filtered feature matrix
Filtered labels as categorical data
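
The filtering idea can be sketched with NumPy alone. This is an illustrative re-implementation, not the czbenchmarks source: the name `filter_minimum_class_sketch` is hypothetical, and it returns the labels as a plain array rather than as categorical data.

```python
import numpy as np


def filter_minimum_class_sketch(features, labels, min_class_size=10):
    """Drop samples whose class has fewer than min_class_size members."""
    labels = np.asarray(labels)
    # Count how many samples each class has.
    classes, counts = np.unique(labels, return_counts=True)
    # Keep classes that are not below the threshold.
    keep = classes[counts >= min_class_size]
    # Boolean mask over samples, applied to features and labels alike.
    mask = np.isin(labels, keep)
    return features[mask], labels[mask]


# Toy data: 7 "T cell", 4 "B cell", and 1 "NK cell" sample.
features = np.arange(24, dtype=float).reshape(12, 2)
labels = np.array(["T cell"] * 7 + ["B cell"] * 4 + ["NK cell"])
filtered_features, filtered_labels = filter_minimum_class_sketch(
    features, labels, min_class_size=4
)
```

With a threshold of 4, the single "NK cell" sample is removed and the other 11 rows are kept.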

czbenchmarks.tasks.utils.run_standard_scrna_workflow(adata: anndata.AnnData, n_top_genes: int = 3000, n_pcs: int = 50, random_state: int = RANDOM_SEED) → anndata.AnnData

Run a standard preprocessing workflow for single-cell RNA-seq data.

This function performs common preprocessing steps for scRNA-seq analysis:

1. Normalization of counts per cell
2. Log transformation
3. Identification of highly variable genes
4. Subsetting to highly variable genes
5. Principal component analysis

Parameters:

adata – AnnData object containing the raw count data
n_top_genes – Number of highly variable genes to select
n_pcs – Number of principal components to compute
random_state – Random seed for reproducibility
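
The five steps can be illustrated on a dense count matrix with NumPy. This is a simplified sketch under stated assumptions, not the library's implementation: the name `standard_workflow_sketch` and the `target_sum` parameter are hypothetical, highly variable genes are ranked by plain variance of the log-transformed values (a simplification), and PCA is computed with a full SVD, so no random seed is involved.

```python
import numpy as np


def standard_workflow_sketch(counts, n_top_genes=3000, n_pcs=50, target_sum=1e4):
    """Normalize, log-transform, select variable genes, and run PCA."""
    counts = np.asarray(counts, dtype=float)

    # 1. Normalize each cell (row) to the same total count.
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    normalized = counts / totals * target_sum

    # 2. Log transformation, log(1 + x).
    logged = np.log1p(normalized)

    # 3. Rank genes by variance and pick the top ones (assumed simplification).
    n_top = min(n_top_genes, logged.shape[1])
    variances = logged.var(axis=0)
    hvg_idx = np.sort(np.argsort(variances)[::-1][:n_top])

    # 4. Subset to the selected genes.
    subset = logged[:, hvg_idx]

    # 5. PCA via SVD of the mean-centered matrix.
    centered = subset - subset.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    n_comp = min(n_pcs, s.shape[0])
    pcs = u[:, :n_comp] * s[:n_comp]
    return hvg_idx, pcs


# Synthetic counts: 30 cells by 100 genes.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=2.0, size=(30, 100))
hvg_idx, pcs = standard_workflow_sketch(counts, n_top_genes=20, n_pcs=5)
```

Here `hvg_idx` holds the column indices of the selected genes and `pcs` holds one row of principal component scores per cell.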