czbenchmarks.datasets.single_cell_perturbation

Attributes

logger

Classes

SingleCellPerturbationDataset

Single cell dataset with perturbation data, containing control and perturbed cells.

Functions

sample_de_genes(→ Dict[str, List[str]])

Sample a percentage of genes for masking for each condition from a differential expression results dataframe.

Module Contents

czbenchmarks.datasets.single_cell_perturbation.logger
czbenchmarks.datasets.single_cell_perturbation.sample_de_genes(de_results: pandas.DataFrame, percent_genes_to_mask: float, min_de_genes_to_mask: int, condition_col: str, gene_col: str, seed: int = RANDOM_SEED) → Dict[str, List[str]]

Sample a percentage of genes for masking for each condition from a differential expression results dataframe.

Parameters:
  • de_results (pd.DataFrame) – Differential expression results dataframe.

  • percent_genes_to_mask (float) – Percentage of genes to mask.

  • min_de_genes_to_mask (int) – Minimum number of masked differentially expressed genes. If not met, no genes are masked.

  • condition_col (str) – Column name for the condition.

  • gene_col (str) – Column name for the gene names.

  • seed (int) – Random seed.

Returns:

Dictionary that maps each condition to a list of genes to be masked for that condition.

Return type:

Dict[str, List[str]]
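The sampling described above can be sketched in a few lines. This is a minimal, dependency-free illustration, not the library's implementation: the helper name sample_genes_to_mask is hypothetical, the differential expression results are represented as a list of plain dicts rather than a pandas DataFrame, and the default seed of 42 is only a placeholder. It also assumes that percent_genes_to_mask is a fraction between 0 and 1 and that a condition whose sampled gene count falls below min_de_genes_to_mask is left out of the result.

```python
import random
from collections import defaultdict
from typing import Dict, List


def sample_genes_to_mask(
    de_rows: List[dict],
    percent_genes_to_mask: float,
    min_de_genes_to_mask: int,
    condition_col: str,
    gene_col: str,
    seed: int = 42,
) -> Dict[str, List[str]]:
    """Hypothetical stand-in for illustration; not the library function."""
    # Group gene names by condition, preserving first-seen order.
    genes_by_condition: Dict[str, List[str]] = defaultdict(list)
    for row in de_rows:
        genes_by_condition[row[condition_col]].append(row[gene_col])

    rng = random.Random(seed)  # local generator keeps sampling reproducible
    masked: Dict[str, List[str]] = {}
    for condition, genes in genes_by_condition.items():
        n_to_mask = int(len(genes) * percent_genes_to_mask)
        # Assumption: conditions below the minimum get no masked genes at all.
        if n_to_mask < min_de_genes_to_mask:
            continue
        masked[condition] = rng.sample(genes, n_to_mask)
    return masked


# Made-up rows: KO_A has 10 DE genes, KO_B has only 4.
rows = [{"condition": "KO_A", "gene": f"G{i}"} for i in range(10)]
rows += [{"condition": "KO_B", "gene": f"H{i}"} for i in range(4)]
result = sample_genes_to_mask(rows, 0.5, 5, "condition", "gene")
```

With these inputs, only KO_A has enough genes to be masked; KO_B is skipped because half of its four genes is below the minimum of five.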

class czbenchmarks.datasets.single_cell_perturbation.SingleCellPerturbationDataset(path: pathlib.Path, organism: czbenchmarks.datasets.types.Organism, condition_key: str = 'condition', control_name: str = 'ctrl', de_gene_col: str = 'gene', de_metric_col: str = 'logfoldchange', de_pval_col: str = 'pval_adj', percent_genes_to_mask: float = 0.5, min_de_genes_to_mask: int = 5, pval_threshold: float = 0.0001, min_logfoldchange: float = 1.0, task_inputs_dir: pathlib.Path | None = None, random_seed: int = RANDOM_SEED, target_conditions_override: Dict[str, List[str]] | None = None)

Bases: czbenchmarks.datasets.single_cell.SingleCellDataset

Single cell dataset with perturbation data, containing control and perturbed cells.

This class extends SingleCellDataset to handle datasets with perturbation data. It includes functionality for validating condition formats and for handling perturbation data with matched control cells.

Input data requirements:

  • H5AD file containing single-cell gene expression data.

  • Must have a column condition_key in adata.obs specifying control and perturbed conditions.

  • Condition format must be one of:

    - {control_name} for control samples.

    - {perturb} for a single perturbation.

de_results

Differential expression results calculated on ground truth data using matched controls.

Type:

pd.DataFrame

target_conditions_dict

Dictionary that maps each condition to a list of masked genes for that condition.

Type:

Dict[str, List[str]]

control_cells_ids

Dictionary mapping each condition to a dictionary of treatment cell barcodes (keys) to matched control cell barcodes (values). It is used primarily for creation of differential expression results in data processing and may be removed in a future release.

Type:

dict
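To make the nested structure of the control mapping concrete, the following sketch builds a small example and looks up matched controls. It is illustrative only: the barcodes are made up, lookup_controls is a hypothetical helper rather than the class's get_controls method, and storing each treated barcode's controls as a list is an assumption taken from the Dict[str, Dict[str, List[str]]] annotation on the control_mapping property.

```python
from typing import Dict, List, Optional

# Hypothetical data: condition -> treated barcode -> matched control barcodes.
control_mapping: Dict[str, Dict[str, List[str]]] = {
    "KO_A": {
        "treated_1": ["ctrl_7"],
        "treated_2": ["ctrl_3", "ctrl_9"],
    },
}


def lookup_controls(
    mapping: Dict[str, Dict[str, List[str]]],
    condition: str,
    treated_barcode: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper; not the library's get_controls method."""
    per_condition = mapping[condition]
    if treated_barcode is not None:
        # Controls matched to one specific treated cell.
        return list(per_condition[treated_barcode])
    # No barcode given: all controls for the condition, de-duplicated
    # while keeping first-seen order.
    return list(
        dict.fromkeys(c for controls in per_condition.values() for c in controls)
    )


one_cell = lookup_controls(control_mapping, "KO_A", "treated_2")
all_cells = lookup_controls(control_mapping, "KO_A")
```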

Instantiate a SingleCellPerturbationDataset instance.

Parameters:
  • path (Path) – Path to the dataset file.

  • organism (Organism) – Enum value indicating the organism.

  • condition_key (str) – Key for the column in adata.obs specifying conditions. Defaults to “condition”.

  • control_name (str) – Name of the control condition. Defaults to “ctrl”.

  • de_gene_col (str) – Column name for the names of genes which are differentially expressed in the differential expression results. Defaults to “gene”.

  • de_metric_col (str) – Column name for the metric of the differential expression results. Defaults to “logfoldchange”.

  • de_pval_col (str) – Column name for the p-value of the differential expression results. Defaults to “pval_adj”.

  • percent_genes_to_mask (float) – Percentage of genes to mask. Default is 0.5.

  • min_de_genes_to_mask (int) – Minimum number of differentially expressed genes required to mask that condition. If not met, no genes are masked. Default is 5.

  • pval_threshold (float) – P-value threshold for differential expression. Default is 1e-4.

  • min_logfoldchange (float) – Minimum log-fold change for differential expression. Default is 1.0.

  • task_inputs_dir (Optional[Path]) – Path to the directory containing the task inputs. Default is None. If not provided, a default path will be used.

  • random_seed (int) – Random seed for reproducibility.

  • target_conditions_override (Optional[Dict[str, List[str]]]) – Dictionary that maps a target condition to a list of genes that the user specified to be masked. This overrides the default sampling of genes for masking in target_conditions_dict. Default is None.

property de_results: pandas.DataFrame
target_conditions_dict: dict
control_cells_ids: dict
UNS_DE_RESULTS_KEY = 'de_results'
UNS_CONTROL_MAP_KEY = 'control_cells_map'
UNS_TARGET_GENES_KEY = 'target_conditions_dict'
UNS_METRIC_COL_KEY = 'metric_column'
UNS_CONFIG_KEY = 'config'
UNS_RANDOM_SEED_KEY = 'random_seed'
random_seed = 42
condition_key = 'condition'
control_name = 'ctrl'
deg_test_name = 'wilcoxon'
de_gene_col = 'gene'
de_metric_col = 'logfoldchange'
de_pval_col = 'pval_adj'
target_conditions_override = None
percent_genes_to_mask = 0.5
min_de_genes_to_mask = 5
pval_threshold = 0.0001
min_logfoldchange = 1.0
load_and_filter_deg_results()

Load and filter differential expression results from adata.uns.

  • Enforces that de_pval_col and de_metric_col are present in the dataframe and are not null.

  • Filters out rows where the p-value is greater than the pval_threshold.

  • Filters out rows where the metric is less than the min_logfoldchange.

  • Returns the filtered dataframe.

Returns:

Differential expression results dataframe after filtering.

Return type:

pd.DataFrame
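The filtering steps listed for this method can be sketched as follows. This is a dependency-free illustration under stated assumptions, not the method's actual code: filter_de_rows is a hypothetical helper, rows are plain dicts instead of a pandas DataFrame, raising ValueError on a missing or null value is an assumption, and the thresholds are applied exactly as worded above (rows at the threshold are kept, and the metric is compared as-is rather than by absolute value).

```python
import math
from typing import Any, Dict, List


def filter_de_rows(
    de_rows: List[Dict[str, Any]],
    metric_col: str = "logfoldchange",
    pval_col: str = "pval_adj",
    pval_threshold: float = 1e-4,
    min_logfoldchange: float = 1.0,
) -> List[Dict[str, Any]]:
    """Hypothetical helper mirroring the documented filtering steps."""
    # Step 1: both columns must be present and non-null in every row.
    for row in de_rows:
        for col in (pval_col, metric_col):
            value = row.get(col)
            if value is None or (isinstance(value, float) and math.isnan(value)):
                raise ValueError(f"Column {col!r} is missing or null in {row!r}")
    # Steps 2 and 3: drop rows whose p-value exceeds the threshold or whose
    # metric is below the minimum log-fold change.
    return [
        row
        for row in de_rows
        if row[pval_col] <= pval_threshold and row[metric_col] >= min_logfoldchange
    ]


# Made-up rows covering each case.
rows = [
    {"gene": "G1", "logfoldchange": 2.0, "pval_adj": 1e-6},  # passes both
    {"gene": "G2", "logfoldchange": 0.5, "pval_adj": 1e-6},  # metric too small
    {"gene": "G3", "logfoldchange": 3.0, "pval_adj": 0.01},  # p-value too large
    {"gene": "G4", "logfoldchange": 1.0, "pval_adj": 1e-4},  # exactly at both thresholds
]
kept_genes = [row["gene"] for row in filter_de_rows(rows)]
```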

load_data() → None

Load the dataset and populate the perturbation truth data.

  • Validates the presence of required keys and values in adata:

    - condition_key in adata.obs

    - control_name present in adata.obs[condition_key]

    - de_results_{self.deg_test_name} in adata.uns

    - control_cells_map in adata.uns

  • Loads and filters differential expression results from adata.uns, keeping only genes whose differential expression meets user-defined thresholds.

  • Populates the target_conditions_dict attribute.
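The validation checks listed for load_data can be illustrated with plain dictionaries standing in for adata.obs and adata.uns. This is a sketch under stated assumptions rather than the method's real code: validate_perturbation_inputs is a hypothetical helper, the stand-in containers are ordinary dicts, and raising ValueError is an assumption about how a failed check is reported.

```python
from typing import Any, Dict, List


def validate_perturbation_inputs(
    obs: Dict[str, List[str]],
    uns: Dict[str, Any],
    condition_key: str = "condition",
    control_name: str = "ctrl",
    deg_test_name: str = "wilcoxon",
) -> None:
    """Hypothetical helper; dicts stand in for adata.obs and adata.uns."""
    if condition_key not in obs:
        raise ValueError(f"obs is missing the column {condition_key!r}")
    if control_name not in set(obs[condition_key]):
        raise ValueError(f"no cells labelled {control_name!r} in {condition_key!r}")
    # Both precomputed inputs must already be stored in uns.
    for key in (f"de_results_{deg_test_name}", "control_cells_map"):
        if key not in uns:
            raise ValueError(f"uns is missing the key {key!r}")


# Made-up stand-ins for a valid dataset.
obs = {"condition": ["ctrl", "KO_A", "KO_A", "ctrl"]}
uns = {"de_results_wilcoxon": [], "control_cells_map": {}}
validate_perturbation_inputs(obs, uns)  # passes silently

# Removing a required key makes the check fail.
try:
    validate_perturbation_inputs(obs, {"control_cells_map": {}})
    missing_key_detected = False
except ValueError:
    missing_key_detected = True
```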

property metric_column: str
property control_mapping: Dict[str, Dict[str, List[str]]]
property target_genes: Dict[str, List[str]]
set_control_mapping(raw_mapping: Dict) → None
get_controls(condition: str, treated_barcode: str | None = None) → List[str]
get_indices_for(condition: str, treated_barcodes: List[str] | None = None) → tuple[numpy.ndarray, numpy.ndarray]
store_task_inputs() → pathlib.Path

Store all task inputs into a single .h5ad file.

The AnnData object contains the following in uns:

  • target_conditions_dict

  • de_results (DataFrame with required columns)

  • control_cells_ids

Returns:

Path to the task inputs directory.

Return type:

Path