czbenchmarks.tasks.utils
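Attributes
- logger
- MULTI_DATASET_TASK_NAMES
- TASK_NAMES
Functions
- cluster_embedding: Cluster cells in embedding space using the Leiden algorithm.
- filter_minimum_class: Filter data to remove classes with too few samples.
- run_standard_scrna_workflow: Run a standard preprocessing workflow for single-cell RNA-seq data.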
Module Contents
- czbenchmarks.tasks.utils.logger
- czbenchmarks.tasks.utils.MULTI_DATASET_TASK_NAMES
- czbenchmarks.tasks.utils.TASK_NAMES
- czbenchmarks.tasks.utils.cluster_embedding(adata: anndata.AnnData, obsm_key: str = OBSM_KEY, random_seed: int = RANDOM_SEED, n_iterations: int = 2, flavor: str = FLAVOR, key_added: str = KEY_ADDED) → List[int]
Cluster cells in embedding space using the Leiden algorithm.
Computes nearest neighbors in the embedding space and runs the Leiden community detection algorithm to identify clusters.
- Parameters:
adata – AnnData object containing the embedding
obsm_key – Key in adata.obsm containing the embedding coordinates
random_seed – Random seed for reproducibility
n_iterations – Number of iterations for the Leiden algorithm
flavor – Flavor of the Leiden algorithm
key_added – Key in adata.obs to store the cluster assignments
- Returns:
List of cluster assignments as integers
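The first step described above, finding nearest neighbors in the embedding space, can be illustrated with a small NumPy-only sketch. This is not the library's implementation: the helper `knn_indices`, its `k` parameter, and the toy embedding are hypothetical names introduced for illustration, and the Leiden community detection step itself is not reproduced here.

```python
import numpy as np


def knn_indices(embedding: np.ndarray, k: int = 15) -> np.ndarray:
    """Return the indices of the k nearest neighbors of each row (brute force).

    Hypothetical helper for illustration only; not part of czbenchmarks.
    """
    # Pairwise squared Euclidean distances, shape (n_cells, n_cells).
    sq_norms = (embedding ** 2).sum(axis=1)
    dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * embedding @ embedding.T
    # Exclude each cell from its own neighbor list.
    np.fill_diagonal(dists, np.inf)
    # Sort each row by distance and keep the k closest cells.
    return np.argsort(dists, axis=1)[:, :k]


rng = np.random.default_rng(42)
# Toy embedding: two well-separated groups of 20 cells in 5 dimensions.
embedding = np.vstack([
    rng.normal(0.0, 0.1, size=(20, 5)),
    rng.normal(5.0, 0.1, size=(20, 5)),
])
neighbors = knn_indices(embedding, k=5)
```

In this toy data every cell's neighbors come from its own group, which is the kind of graph structure a community detection algorithm such as Leiden would then partition into clusters.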
- czbenchmarks.tasks.utils.filter_minimum_class(features: numpy.ndarray, labels: numpy.ndarray | pandas.Series, min_class_size: int = 10) → tuple[numpy.ndarray, numpy.ndarray | pandas.Series]
Filter data to remove classes with too few samples.
Removes classes that have fewer samples than the minimum threshold. Useful for ensuring enough samples per class for ML tasks.
- Parameters:
features – Feature matrix of shape (n_samples, n_features)
labels – Labels array of shape (n_samples,)
min_class_size – Minimum number of samples required per class
- Returns:
Filtered feature matrix
Filtered labels as categorical data
- Return type:
Tuple containing
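The filtering behavior described above can be approximated with the following sketch. It is written from the documented behavior only, not from the library source: the name `filter_minimum_class_sketch` is hypothetical, and it assumes that a class with exactly `min_class_size` samples is kept.

```python
import numpy as np
import pandas as pd


def filter_minimum_class_sketch(features, labels, min_class_size=10):
    """Drop samples whose class has fewer than min_class_size members.

    Hypothetical re-implementation for illustration; not the library code.
    """
    labels = pd.Series(labels).reset_index(drop=True)
    # Count samples per class and keep classes that meet the threshold.
    counts = labels.value_counts()
    keep_classes = counts[counts >= min_class_size].index
    mask = labels.isin(keep_classes).to_numpy()
    # Categories are inferred from the remaining values, so removed
    # classes do not linger as unused categories.
    filtered_labels = pd.Categorical(labels[mask])
    return features[mask], filtered_labels


features = np.arange(24).reshape(12, 2)
labels = np.array(["a"] * 6 + ["b"] * 4 + ["c"] * 2)
X, y = filter_minimum_class_sketch(features, labels, min_class_size=4)
```

Here class `c` has only two samples, so its rows are removed from both the feature matrix and the labels.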
- czbenchmarks.tasks.utils.run_standard_scrna_workflow(adata: anndata.AnnData, n_top_genes: int = 3000, n_pcs: int = 50, random_state: int = 42) → anndata.AnnData
Run a standard preprocessing workflow for single-cell RNA-seq data.
This function performs common preprocessing steps for scRNA-seq analysis:
1. Normalization of counts per cell
2. Log transformation
3. Identification of highly variable genes
4. Subsetting to highly variable genes
5. Principal component analysis
- Parameters:
adata – AnnData object containing the raw count data
n_top_genes – Number of highly variable genes to select
n_pcs – Number of principal components to compute
random_state – Random seed for reproducibility
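To make the five steps concrete, here is a simplified NumPy-only sketch that operates on a dense count matrix instead of an AnnData object. It is an illustration under stated assumptions, not the library's implementation: the function name `standard_workflow_sketch` and the `target_sum` parameter are hypothetical, and highly variable genes are ranked by plain variance, which is a simplification of the dispersion-based methods commonly used in practice.

```python
import numpy as np


def standard_workflow_sketch(counts, n_top_genes=3000, n_pcs=50, target_sum=1e4):
    """Simplified preprocessing on a (n_cells, n_genes) count matrix.

    Hypothetical helper for illustration; target_sum is an assumed value.
    """
    counts = np.asarray(counts, dtype=float)
    # 1. Normalize each cell to the same total count.
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    normalized = counts / totals * target_sum
    # 2. Log transformation, log(1 + x).
    logged = np.log1p(normalized)
    # 3. Rank genes by variance (a simplification, see the note above).
    n_top_genes = min(n_top_genes, logged.shape[1])
    ranked = np.argsort(logged.var(axis=0))[::-1]
    hvg_idx = np.sort(ranked[:n_top_genes])
    # 4. Subset to the selected genes.
    subset = logged[:, hvg_idx]
    # 5. PCA via SVD of the centered matrix. A full SVD is deterministic,
    #    so this sketch needs no random seed.
    centered = subset - subset.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    n_pcs = min(n_pcs, s.shape[0])
    pcs = u[:, :n_pcs] * s[:n_pcs]
    return hvg_idx, pcs


rng = np.random.default_rng(42)
counts = rng.poisson(2.0, size=(100, 200))
hvg_idx, pcs = standard_workflow_sketch(counts, n_top_genes=50, n_pcs=10)
```

The sketch returns the indices of the selected genes and the cell coordinates in principal component space. The documented function instead returns an AnnData object.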