czbenchmarks.datasets.single_cell_perturbation
==============================================

.. py:module:: czbenchmarks.datasets.single_cell_perturbation


Attributes
----------

.. autoapisummary::

   czbenchmarks.datasets.single_cell_perturbation.logger


Classes
-------

.. autoapisummary::

   czbenchmarks.datasets.single_cell_perturbation.SingleCellPerturbationDataset


Functions
---------

.. autoapisummary::

   czbenchmarks.datasets.single_cell_perturbation.sample_de_genes


Module Contents
---------------

.. py:data:: logger

.. py:function:: sample_de_genes(de_results: pandas.DataFrame, percent_genes_to_mask: float, min_de_genes_to_mask: int, condition_col: str, gene_col: str, seed: int = RANDOM_SEED) -> Dict[str, List[str]]

   Sample a percentage of genes for masking for each condition from a
   differential expression results dataframe.

   :param de_results: Differential expression results dataframe.
   :type de_results: pd.DataFrame
   :param percent_genes_to_mask: Percentage of genes to mask.
   :type percent_genes_to_mask: float
   :param min_de_genes_to_mask: Minimum number of masked differentially
                                expressed genes. If not met, no genes are masked.
   :type min_de_genes_to_mask: int
   :param condition_col: Column name for the condition.
   :type condition_col: str
   :param gene_col: Column name for the gene names.
   :type gene_col: str
   :param seed: Random seed.
   :type seed: int

   :returns: Dictionary that maps each condition to a list of genes
             to be masked for that condition.
   :rtype: Dict[str, List[str]]


.. py:class:: SingleCellPerturbationDataset(path: pathlib.Path, organism: czbenchmarks.datasets.types.Organism, condition_key: str = 'condition', control_name: str = 'ctrl', de_gene_col: str = 'gene', de_metric_col: str = 'logfoldchange', de_pval_col: str = 'pval_adj', percent_genes_to_mask: float = 0.5, min_de_genes_to_mask: int = 5, pval_threshold: float = 0.0001, min_logfoldchange: float = 1.0, task_inputs_dir: Optional[pathlib.Path] = None, random_seed: int = RANDOM_SEED, target_conditions_override: Optional[Dict[str, List[str]]] = None)

   Bases: :py:obj:`czbenchmarks.datasets.single_cell.SingleCellDataset`


   Single cell dataset with perturbation data, containing control and
   perturbed cells.

   This class extends `SingleCellDataset` to handle datasets with perturbation
   data. It includes functionality for validating condition formats,
   and perturbation data with matched control cells.

   Input data requirements:

   - H5AD file containing single-cell gene expression data.
   - Must have a column ``condition_key`` in ``adata.obs`` specifying control and
     perturbed conditions.
   - Condition format must be one of:

     - ``{control_name}`` for control samples.
     - ``{perturb}`` for a single perturbation.

   .. attribute:: de_results

      Differential expression results calculated on ground truth
      data using matched controls.

      :type: pd.DataFrame

   .. attribute:: target_conditions_dict

      Dictionary that maps each condition to a list of
      masked genes for that condition.

      :type: Dict[str, List[str]]

   .. attribute:: control_cells_ids

      Dictionary mapping each condition to a dictionary
      of treatment cell barcodes (keys) to matched control cell barcodes (values).
      It is used primarily for creation of differential expression results in data
      processing and may be removed in a future release.

      :type: dict

   Instantiate a SingleCellPerturbationDataset instance.

   :param path: Path to the dataset file.
   :type path: Path
   :param organism: Enum value indicating the organism.
   :type organism: Organism
   :param condition_key: Key for the column in `adata.obs` specifying
                         conditions. Defaults to "condition".
   :type condition_key: str
   :param control_name: Name of the control condition. Defaults to "ctrl".
   :type control_name: str
   :param de_gene_col: Column name for the names of genes which are
                       differentially expressed in the differential expression results.
                       Defaults to "gene".
   :type de_gene_col: str
   :param de_metric_col: Column name for the metric of the differential
                         expression results. Defaults to "logfoldchange".
   :type de_metric_col: str
   :param de_pval_col: Column name for the p-value of the differential
                       expression results. Defaults to "pval_adj".
   :type de_pval_col: str
   :param percent_genes_to_mask: Percentage of genes to mask. Default is 0.5.
   :type percent_genes_to_mask: float
   :param min_de_genes_to_mask: Minimum number of differentially
                                expressed genes required to mask that condition. If not met, no genes
                                are masked. Default is 5.
   :type min_de_genes_to_mask: int
   :param pval_threshold: P-value threshold for differential expression.
                          Default is 1e-4.
   :type pval_threshold: float
   :param min_logfoldchange: Minimum log-fold change for differential
                             expression. Default is 1.0.
   :type min_logfoldchange: float
   :param task_inputs_dir: Path to the directory containing the task inputs.
                           Default is None. If not provided, a default path will be used.
   :type task_inputs_dir: Optional[Path]
   :param random_seed: Random seed for reproducibility.
   :type random_seed: int
   :param target_conditions_override: Dictionary that maps a target
                                      condition to a list of genes that the user specified to be masked.
                                      This overrides the default sampling of genes for masking in
                                      target_conditions_dict. Default is None.
   :type target_conditions_override: Optional[Dict[str, List[str]]]


   .. py:property:: de_results
      :type: pandas.DataFrame


   .. py:attribute:: target_conditions_dict
      :type: dict


   .. py:attribute:: control_cells_ids
      :type: dict


   .. py:attribute:: UNS_DE_RESULTS_KEY
      :value: 'de_results'


   .. py:attribute:: UNS_CONTROL_MAP_KEY
      :value: 'control_cells_map'


   .. py:attribute:: UNS_TARGET_GENES_KEY
      :value: 'target_conditions_dict'


   .. py:attribute:: UNS_METRIC_COL_KEY
      :value: 'metric_column'


   .. py:attribute:: UNS_CONFIG_KEY
      :value: 'config'


   .. py:attribute:: UNS_RANDOM_SEED_KEY
      :value: 'random_seed'


   .. py:attribute:: random_seed
      :value: 42


   .. py:attribute:: condition_key
      :value: 'condition'


   .. py:attribute:: control_name
      :value: 'ctrl'


   .. py:attribute:: deg_test_name
      :value: 'wilcoxon'


   .. py:attribute:: de_gene_col
      :value: 'gene'


   .. py:attribute:: de_metric_col
      :value: 'logfoldchange'


   .. py:attribute:: de_pval_col
      :value: 'pval_adj'


   .. py:attribute:: target_conditions_override
      :value: None


   .. py:attribute:: percent_genes_to_mask
      :value: 0.5


   .. py:attribute:: min_de_genes_to_mask
      :value: 5


   .. py:attribute:: pval_threshold
      :value: 0.0001


   .. py:attribute:: min_logfoldchange
      :value: 1.0


   .. py:method:: load_and_filter_deg_results()

      Load and filter differential expression results from adata.uns.

      - Enforces that de_pval_col and de_metric_col are present in the
        dataframe and are not null.
      - Filters out rows where the p-value is greater than the pval_threshold.
      - Filters out rows where the metric is less than the min_logfoldchange.
      - Returns the filtered dataframe.

      :returns: Differential expression results dataframe after filtering.
      :rtype: pd.DataFrame


   .. py:method:: load_data() -> None

      Load the dataset and populate the perturbation truth data.

      - Validates the presence of required keys and values in `adata`:

        - `condition_key` in `adata.obs`
        - `control_name` present in `adata.obs[condition_key]`
        - `de_results_{self.deg_test_name}` in `adata.uns`
        - `control_cells_map` in `adata.uns`

      - Loads and filters differential expression results from `adata.uns`,
        keeping only genes whose differential expression meets user-defined
        thresholds.
      - Populates the `target_conditions_dict` attribute.


   .. py:property:: metric_column
      :type: str


   .. py:property:: control_mapping
      :type: Dict[str, Dict[str, List[str]]]


   .. py:property:: target_genes
      :type: Dict[str, List[str]]


   .. py:method:: set_control_mapping(raw_mapping: Dict) -> None


   .. py:method:: get_controls(condition: str, treated_barcode: Optional[str] = None) -> List[str]


   .. py:method:: get_indices_for(condition: str, treated_barcodes: Optional[List[str]] = None) -> tuple[numpy.ndarray, numpy.ndarray]


   .. py:method:: store_task_inputs() -> pathlib.Path

      Store all task inputs into a single .h5ad file.

      The ``uns`` slot of the AnnData object contains:

      - target_conditions_dict
      - de_results (DataFrame with required columns)
      - control_cells_ids

      :returns: Path to the task inputs directory.
      :rtype: Path
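
The filter-then-sample flow described for ``load_and_filter_deg_results`` and ``sample_de_genes`` can be illustrated with a minimal sketch. This is not the library's implementation: ``filter_de_results`` and ``sample_genes_to_mask`` are hypothetical helper names. The sketch assumes that ``percent_genes_to_mask`` is a fraction between 0 and 1 (consistent with the 0.5 default), that the log-fold-change threshold is applied to the raw value exactly as documented, and that a condition falling short of ``min_de_genes_to_mask`` is simply left out of the result.

```python
import numpy as np
import pandas as pd


def filter_de_results(de_results, pval_col, metric_col, pval_threshold, min_logfoldchange):
    """Hypothetical helper: apply the documented filtering rules literally."""
    for col in (pval_col, metric_col):
        if col not in de_results.columns:
            raise ValueError(f"Missing required column: {col}")
        if de_results[col].isnull().any():
            raise ValueError(f"Column {col} contains null values")
    # Drop rows with p-value above the threshold or metric below the minimum.
    keep = (de_results[pval_col] <= pval_threshold) & (
        de_results[metric_col] >= min_logfoldchange
    )
    return de_results[keep]


def sample_genes_to_mask(de_results, percent_genes_to_mask, min_de_genes_to_mask,
                         condition_col, gene_col, seed=42):
    """Hypothetical helper: sample a fraction of DE genes for each condition."""
    rng = np.random.default_rng(seed)
    targets = {}
    for condition, group in de_results.groupby(condition_col, sort=True):
        genes = sorted(set(group[gene_col]))  # stable order for reproducibility
        n_to_mask = int(len(genes) * percent_genes_to_mask)
        # Assumption: a condition with too few masked genes is omitted entirely.
        if n_to_mask < min_de_genes_to_mask:
            continue
        targets[condition] = rng.choice(genes, size=n_to_mask, replace=False).tolist()
    return targets


# Toy DE table: condition "A" has 10 rows passing both thresholds, "B" has 4.
de = pd.DataFrame({
    "condition": ["A"] * 12 + ["B"] * 4,
    "gene": [f"g{i}" for i in range(12)] + [f"g{i}" for i in range(4)],
    "logfoldchange": [2.0] * 10 + [0.5, 2.0] + [3.0] * 4,
    "pval_adj": [1e-6] * 11 + [0.5] + [1e-6] * 4,
})

filtered = filter_de_results(de, "pval_adj", "logfoldchange", 1e-4, 1.0)
targets = sample_genes_to_mask(filtered, 0.5, 5, "condition", "gene", seed=42)
print(sorted(targets))
```

With these toy inputs, only condition ``A`` yields enough masked genes (5 of its 10 filtered genes), while ``B`` would yield 2 and is skipped. Sorting the gene list before sampling makes the result depend only on the seed and the set of genes, not on row order.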