czbenchmarks.datasets.single_cell_perturbation
==============================================

.. py:module:: czbenchmarks.datasets.single_cell_perturbation


Attributes
----------

.. autoapisummary::

   czbenchmarks.datasets.single_cell_perturbation.logger


Classes
-------

.. autoapisummary::

   czbenchmarks.datasets.single_cell_perturbation.SingleCellPerturbationDataset


Functions
---------

.. autoapisummary::

   czbenchmarks.datasets.single_cell_perturbation.sample_de_genes


Module Contents
---------------

.. py:data:: logger

.. py:function:: sample_de_genes(de_results: pandas.DataFrame, percent_genes_to_mask: float, min_de_genes_to_mask: int, condition_col: str, gene_col: str, seed: int = RANDOM_SEED) -> Dict[str, List[str]]

   Sample a percentage of genes for masking for each condition from a
   differential expression results dataframe.

   :param de_results: Differential expression results dataframe.
   :type de_results: pd.DataFrame
   :param percent_genes_to_mask: Percentage of genes to mask.
   :type percent_genes_to_mask: float
   :param min_de_genes_to_mask: Minimum number of differentially expressed
                                genes a condition needs before any of its genes are masked.
                                If not met, no genes are masked for that condition.
   :type min_de_genes_to_mask: int
   :param condition_col: Column name for the condition.
   :type condition_col: str
   :param gene_col: Column name for the gene names.
   :type gene_col: str
   :param seed: Random seed.
   :type seed: int

   :returns: Dictionary that maps each condition to a list of
             genes to be masked for that condition.
   :rtype: Dict[str, List[str]]
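
The per-condition sampling described above can be sketched in plain Python.
This is an illustration under stated assumptions, not the library's
implementation: ``sample_genes_per_condition`` is a hypothetical stand-in that
takes ``(condition, gene)`` pairs instead of a dataframe, treats
``percent_genes_to_mask`` as a fraction between 0 and 1, applies
``min_de_genes_to_mask`` to the number of genes available for a condition, and
returns an empty list for conditions below that minimum.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def sample_genes_per_condition(
    rows: Iterable[Tuple[str, str]],
    percent_genes_to_mask: float,
    min_de_genes_to_mask: int,
    seed: int = 42,
) -> Dict[str, List[str]]:
    """Hypothetical sketch: sample a share of genes for each condition."""
    # Group unique gene names by condition, preserving first-seen order.
    genes_by_condition: Dict[str, List[str]] = defaultdict(list)
    for condition, gene in rows:
        if gene not in genes_by_condition[condition]:
            genes_by_condition[condition].append(gene)

    # A seeded generator makes the sampling reproducible.
    rng = random.Random(seed)
    sampled: Dict[str, List[str]] = {}
    for condition in sorted(genes_by_condition):
        genes = genes_by_condition[condition]
        if len(genes) < min_de_genes_to_mask:
            # Assumption: too few genes means nothing is masked.
            sampled[condition] = []
        else:
            n_to_mask = int(len(genes) * percent_genes_to_mask)
            sampled[condition] = rng.sample(genes, n_to_mask)
    return sampled


# Toy input: ten genes for one condition, two for another (made-up names).
rows = [("KLF1", f"gene{i}") for i in range(10)]
rows += [("GATA1", "geneA"), ("GATA1", "geneB")]
masked = sample_genes_per_condition(
    rows, percent_genes_to_mask=0.5, min_de_genes_to_mask=5
)
```

With this toy input, ``masked["KLF1"]`` holds five of the ten genes and
``masked["GATA1"]`` is empty because that condition has fewer than five genes.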


.. py:class:: SingleCellPerturbationDataset(path: pathlib.Path, organism: czbenchmarks.datasets.types.Organism, condition_key: str = 'condition', control_name: str = 'ctrl', de_gene_col: str = 'gene', de_metric_col: str = 'logfoldchange', de_pval_col: str = 'pval_adj', percent_genes_to_mask: float = 0.5, min_de_genes_to_mask: int = 5, pval_threshold: float = 0.0001, min_logfoldchange: float = 1.0, task_inputs_dir: Optional[pathlib.Path] = None, random_seed: int = RANDOM_SEED, target_conditions_override: Optional[Dict[str, List[str]]] = None)

   Bases: :py:obj:`czbenchmarks.datasets.single_cell.SingleCellDataset`


   Single cell dataset with perturbation data, containing control and
   perturbed cells.

   This class extends `SingleCellDataset` to handle datasets with perturbation
   data. It includes functionality for validating condition formats
   and for handling perturbation data with matched control cells.

   Input data requirements:

   - H5AD file containing single-cell gene expression data.
   - Must have a column ``condition_key`` in ``adata.obs`` specifying
     control and perturbed conditions.
   - Condition format must be one of:

     - ``{control_name}`` for control samples.
     - ``{perturb}`` for a single perturbation.

   .. attribute:: de_results

      Differential expression results calculated on ground
      truth data using matched controls.

      :type: pd.DataFrame

   .. attribute:: target_conditions_dict

      Dictionary that maps each
      condition to a list of masked genes for that condition.

      :type: Dict[str, List[str]]

   .. attribute:: control_cells_ids

      Dictionary mapping each condition to a dictionary
      of treatment cell barcodes (keys) to matched control cell barcodes (values).
      It is used primarily for creation of differential expression results
      in data processing and may be removed in a future release.

      :type: dict

   Create a ``SingleCellPerturbationDataset`` instance.

   :param path: Path to the dataset file.
   :type path: Path
   :param organism: Enum value indicating the organism.
   :type organism: Organism
   :param condition_key: Key for the column in `adata.obs` specifying
                         conditions. Defaults to "condition".
   :type condition_key: str
   :param control_name: Name of the control condition. Defaults to
                        "ctrl".
   :type control_name: str
   :param de_gene_col: Column name for the names of genes which are
                       differentially expressed in the differential expression results.
                       Defaults to "gene".
   :type de_gene_col: str
   :param de_metric_col: Column name for the metric of the differential expression results.
                         Defaults to "logfoldchange".
   :type de_metric_col: str
   :param de_pval_col: Column name for the p-value of the differential expression results.
                       Defaults to "pval_adj".
   :type de_pval_col: str
   :param percent_genes_to_mask: Percentage of genes to mask.
                                 Default is 0.5.
   :type percent_genes_to_mask: float
   :param min_de_genes_to_mask: Minimum number of differentially
                                expressed genes required to mask that condition. If not met, no genes
                                are masked. Default is 5.
   :type min_de_genes_to_mask: int
   :param pval_threshold: P-value threshold for differential expression.
                          Default is 1e-4.
   :type pval_threshold: float
   :param min_logfoldchange: Minimum log-fold change for differential
                             expression. Default is 1.0.
   :type min_logfoldchange: float
   :param task_inputs_dir: Path to the directory containing the task inputs.
                           Default is None. If not provided, a default path will be used.
   :type task_inputs_dir: Optional[Path]
   :param random_seed: Random seed for reproducibility.
   :type random_seed: int
   :param target_conditions_override: Dictionary that maps a target condition
                                      to a user-specified list of genes to mask.
                                      This overrides the default sampling of genes for masking in ``target_conditions_dict``.
                                      Default is None.
   :type target_conditions_override: Optional[Dict[str, List[str]]]


   .. py:property:: de_results
      :type: pandas.DataFrame


   .. py:attribute:: target_conditions_dict
      :type:  dict


   .. py:attribute:: control_cells_ids
      :type:  dict


   .. py:attribute:: UNS_DE_RESULTS_KEY
      :value: 'de_results'


   .. py:attribute:: UNS_CONTROL_MAP_KEY
      :value: 'control_cells_map'


   .. py:attribute:: UNS_TARGET_GENES_KEY
      :value: 'target_conditions_dict'


   .. py:attribute:: UNS_METRIC_COL_KEY
      :value: 'metric_column'


   .. py:attribute:: UNS_CONFIG_KEY
      :value: 'config'


   .. py:attribute:: UNS_RANDOM_SEED_KEY
      :value: 'random_seed'


   .. py:attribute:: random_seed
      :value: 42


   .. py:attribute:: condition_key
      :value: 'condition'


   .. py:attribute:: control_name
      :value: 'ctrl'


   .. py:attribute:: deg_test_name
      :value: 'wilcoxon'


   .. py:attribute:: de_gene_col
      :value: 'gene'


   .. py:attribute:: de_metric_col
      :value: 'logfoldchange'


   .. py:attribute:: de_pval_col
      :value: 'pval_adj'


   .. py:attribute:: target_conditions_override
      :value: None


   .. py:attribute:: percent_genes_to_mask
      :value: 0.5


   .. py:attribute:: min_de_genes_to_mask
      :value: 5


   .. py:attribute:: pval_threshold
      :value: 0.0001


   .. py:attribute:: min_logfoldchange
      :value: 1.0


   .. py:method:: load_and_filter_deg_results()

      Load and filter differential expression results from ``adata.uns``.

      - Enforces that ``de_pval_col`` and ``de_metric_col`` are present in the dataframe and are not null.
      - Filters out rows where the p-value is greater than ``pval_threshold``.
      - Filters out rows where the metric is less than ``min_logfoldchange``.
      - Returns the filtered dataframe.

      :returns: Differential expression results dataframe after filtering.
      :rtype: pd.DataFrame


   .. py:method:: load_data() -> None

      Load the dataset and populate the perturbation truth data.

      - Validates the presence of required keys and values in ``adata``:

        - ``condition_key`` in ``adata.obs``
        - ``control_name`` present in ``adata.obs[condition_key]``
        - ``de_results_{self.deg_test_name}`` in ``adata.uns``
        - ``control_cells_map`` in ``adata.uns``

      - Loads and filters differential expression results from ``adata.uns``,
        keeping only genes whose differential expression meets
        user-defined thresholds.
      - Populates the ``target_conditions_dict`` attribute.


   .. py:property:: metric_column
      :type: str


   .. py:property:: control_mapping
      :type: Dict[str, Dict[str, List[str]]]


   .. py:property:: target_genes
      :type: Dict[str, List[str]]


   .. py:method:: set_control_mapping(raw_mapping: Dict) -> None


   .. py:method:: get_controls(condition: str, treated_barcode: Optional[str] = None) -> List[str]


   .. py:method:: get_indices_for(condition: str, treated_barcodes: Optional[List[str]] = None) -> tuple[numpy.ndarray, numpy.ndarray]


   .. py:method:: store_task_inputs() -> pathlib.Path

      Store all task inputs into a single .h5ad file.

      The ``uns`` mapping of the stored AnnData object contains:

      - target_conditions_dict
      - de_results (DataFrame with required columns)
      - control_cells_ids

      :returns: Path to the task inputs directory.
      :rtype: Path
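
The filtering that ``load_and_filter_deg_results`` describes above can be
sketched with pandas. This is a minimal illustration, not the library's code:
``filter_de_results`` is a hypothetical helper and the toy dataframe is made
up. It also assumes that "enforces" means raising an error on missing columns
or null values, and it compares the raw metric against ``min_logfoldchange``
as the description is worded. Whether the library compares absolute values
instead is not stated here.

```python
import pandas as pd


def filter_de_results(
    de_results: pd.DataFrame,
    de_pval_col: str = "pval_adj",
    de_metric_col: str = "logfoldchange",
    pval_threshold: float = 1e-4,
    min_logfoldchange: float = 1.0,
) -> pd.DataFrame:
    """Hypothetical sketch of the documented filtering steps."""
    # Both columns must exist and contain no nulls (assumed to be an error).
    for col in (de_pval_col, de_metric_col):
        if col not in de_results.columns:
            raise ValueError(f"Missing required column: {col!r}")
        if de_results[col].isna().any():
            raise ValueError(f"Column {col!r} contains null values")

    # Keep rows at or below the p-value threshold and at or above the
    # minimum metric value; everything else is filtered out.
    keep = (de_results[de_pval_col] <= pval_threshold) & (
        de_results[de_metric_col] >= min_logfoldchange
    )
    return de_results.loc[keep].reset_index(drop=True)


# Toy differential expression table (made-up values).
de = pd.DataFrame(
    {
        "condition": ["KLF1", "KLF1", "KLF1"],
        "gene": ["A", "B", "C"],
        "logfoldchange": [2.0, 0.5, 3.0],
        "pval_adj": [1e-6, 1e-6, 0.05],
    }
)
filtered = filter_de_results(de)
```

In this toy table only gene ``A`` passes both thresholds: ``B`` falls below
the minimum log-fold change and ``C`` exceeds the p-value threshold.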