czbenchmarks.datasets

Classes

Dataset

Abstract base class for datasets.

SingleCellLabeledDataset

Single cell dataset containing gene expression data and a label column.

SingleCellPerturbationDataset

Single cell dataset with perturbation data, containing control and perturbed cells.

Organism

Enumeration of supported organisms, each defined by a name and a prefix.

Functions

list_available_datasets(→ Dict[str, Dict[str, str]])

Return the datasets defined in the datasets.yaml Hydra configuration, sorted alphabetically by name.

load_dataset(→ czbenchmarks.datasets.dataset.Dataset)

Load, download (if needed), and instantiate a dataset using Hydra configuration.

load_custom_dataset(...)

Instantiate a dataset with a custom configuration.

Package Contents

class czbenchmarks.datasets.Dataset(dataset_type_name: str, path: str | pathlib.Path, organism: czbenchmarks.datasets.types.Organism, task_inputs_dir: pathlib.Path | None = None, **kwargs: Any)

Bases: abc.ABC

Abstract base class for datasets.

Each concrete Dataset subclass is responsible for extracting and managing the data required for a specific type of task from the provided input file. Subclasses should define instance variables to store these task-specific data items, which can then be accessed as object attributes or written to files for downstream use.

All Dataset instances must specify an Organism enum value to indicate the organism from which the data was derived.

Subclasses must implement:
  • load_data: Loads the dataset from the input file and populates relevant instance variables.

  • store_task_inputs: Stores the extracted task-specific inputs in files or directories as needed.

  • _validate: Validates dataset-specific constraints and requirements.

path

The path to the dataset file.

task_inputs_dir

The directory where task-specific input files are stored.

organism

The organism from which the data was derived.

Initialize a Dataset instance.

Parameters:
  • dataset_type_name (str) – Name of the dataset type (used for directory naming).

  • path (str | Path) – Path to the dataset file.

  • organism (Organism) – Enum value indicating the organism.

  • task_inputs_dir (Optional[Path]) – Directory for storing task-specific inputs.

  • kwargs (Any) – Additional attributes for the dataset.

Raises:

ValueError – If the dataset path does not exist.

path: pathlib.Path
task_inputs_dir: pathlib.Path
organism: czbenchmarks.datasets.types.Organism
kwargs
abstract load_data() → None

Load the dataset from its source file into memory.

Subclasses must implement this method to load their specific data format. For example, SingleCellDataset loads an AnnData object from an h5ad file.

The loaded data should be stored as instance attributes that can be accessed by other methods.

abstract store_task_inputs() → pathlib.Path

Store the task-specific inputs extracted from the dataset.

Subclasses must implement this method to store task-specific files in a subdirectory of the dataset path. The subdirectory name is determined by the subclass.

Returns:

The path to the directory storing the task input files.

Return type:

Path

validate() → None

Performs general validation checks, such as ensuring the organism is a valid Organism enum value. Calls _validate for subclass-specific validation.

Raises:

ValueError – If validation fails.
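The contract described above (three abstract methods plus a validate method that runs general checks and then calls _validate) can be sketched with the standard library alone. This is a minimal illustration, not the czbenchmarks implementation: DatasetSketch and LineCountDataset are hypothetical names, the organism is a plain string instead of an Organism enum value, and the "task_inputs" directory name is an assumption.

```python
from abc import ABC, abstractmethod
from pathlib import Path
import tempfile


class DatasetSketch(ABC):
    """Hypothetical stand-in mirroring the documented Dataset contract."""

    def __init__(self, path: Path, organism: str):
        # The documented constructor raises ValueError for a missing path.
        if not Path(path).exists():
            raise ValueError(f"Dataset path does not exist: {path}")
        self.path = Path(path)
        self.organism = organism  # the real class expects an Organism enum value

    @abstractmethod
    def load_data(self) -> None: ...

    @abstractmethod
    def store_task_inputs(self) -> Path: ...

    @abstractmethod
    def _validate(self) -> None: ...

    def validate(self) -> None:
        # General checks first, then the subclass-specific hook.
        if not self.organism:
            raise ValueError("organism must be set")
        self._validate()


class LineCountDataset(DatasetSketch):
    """Toy subclass: the 'task input' is the number of lines in a text file."""

    def load_data(self) -> None:
        self.n_lines = len(self.path.read_text().splitlines())

    def store_task_inputs(self) -> Path:
        out_dir = self.path.parent / "task_inputs"  # assumed directory name
        out_dir.mkdir(exist_ok=True)
        (out_dir / "n_lines.txt").write_text(str(self.n_lines))
        return out_dir

    def _validate(self) -> None:
        if self.n_lines == 0:
            raise ValueError("dataset is empty")


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "toy.txt"
    source.write_text("a\nb\nc\n")
    ds = LineCountDataset(source, organism="homo_sapiens")
    ds.load_data()
    ds.validate()
    stored = (ds.store_task_inputs() / "n_lines.txt").read_text()
```

Keeping validate concrete and _validate abstract lets the base class enforce shared checks while each subclass adds only its own constraints.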

class czbenchmarks.datasets.SingleCellLabeledDataset(path: pathlib.Path, organism: czbenchmarks.datasets.types.Organism, label_column_key: str = 'cell_type', task_inputs_dir: pathlib.Path | None = None)

Bases: czbenchmarks.datasets.single_cell.SingleCellDataset

Single cell dataset containing gene expression data and a label column.

This class extends SingleCellDataset to include a label column that contains the expected prediction values for each cell. The labels are extracted from the specified column in adata.obs and stored as a pd.Series in the labels attribute.

labels

Extracted labels for each cell.

Type:

pd.Series

label_column_key

Key for the column in adata.obs containing the labels.

Type:

str

Initialize a SingleCellLabeledDataset instance.

Parameters:
  • path (Path) – Path to the dataset file.

  • organism (Organism) – Enum value indicating the organism.

  • label_column_key (str) – Key for the column in adata.obs containing the labels. Defaults to “cell_type”.

  • task_inputs_dir (Optional[Path]) – Directory for storing task-specific inputs.

labels: pandas.Series
label_column_key: str
load_data() → None

Load the dataset and extract labels.

This method loads the dataset using the parent class’s load_data method and extracts the labels from the specified column in adata.obs.

Populates:

labels (pd.Series): Extracted labels for each cell.

store_task_inputs() → pathlib.Path

Store task-specific inputs, such as cell type annotations.

This method stores the extracted labels in a JSON file. The filename is dynamically generated based on the label_column_key.

Returns:

Path to the directory storing the task input files.

Return type:

Path
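As a rough illustration of storing labels in a JSON file whose name depends on label_column_key, the sketch below uses only the standard library. The helper store_labels, the "<label_column_key>.json" naming pattern, and the plain dict standing in for the pd.Series of labels are all assumptions made for this example; the library's actual file layout may differ.

```python
import json
import tempfile
from pathlib import Path


def store_labels(labels: dict, label_column_key: str, out_dir: Path) -> Path:
    """Write labels to <out_dir>/<label_column_key>.json (hypothetical naming)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    target = out_dir / f"{label_column_key}.json"
    target.write_text(json.dumps(labels))
    # Return the directory, matching the documented return value.
    return out_dir


with tempfile.TemporaryDirectory() as tmp:
    # A plain dict stands in for the pd.Series of per-cell labels.
    labels = {"cell_0": "T cell", "cell_1": "B cell"}
    out_dir = store_labels(labels, "cell_type", Path(tmp) / "task_inputs")
    written = json.loads((out_dir / "cell_type.json").read_text())
```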

class czbenchmarks.datasets.SingleCellPerturbationDataset(path: pathlib.Path, organism: czbenchmarks.datasets.types.Organism, condition_key: str = 'condition', control_name: str = 'ctrl', de_gene_col: str = 'gene', de_metric_col: str = 'logfoldchange', de_pval_col: str = 'pval_adj', percent_genes_to_mask: float = 0.5, min_de_genes_to_mask: int = 5, pval_threshold: float = 0.0001, min_logfoldchange: float = 1.0, task_inputs_dir: pathlib.Path | None = None, random_seed: int = RANDOM_SEED, target_conditions_override: Dict[str, List[str]] | None = None)

Bases: czbenchmarks.datasets.single_cell.SingleCellDataset

Single cell dataset with perturbation data, containing control and perturbed cells.

This class extends SingleCellDataset to handle datasets with perturbation data. It includes functionality for validating condition formats and for working with perturbation data that has matched control cells.

Input data requirements:

  • H5AD file containing single-cell gene expression data.

  • Must have a column condition_key in adata.obs specifying control and perturbed conditions.

  • Condition format must be one of:

      • {control_name} for control samples.

      • {perturb} for a single perturbation.

de_results

Differential expression results calculated on ground truth data using matched controls.

Type:

pd.DataFrame

target_conditions_dict

Dictionary that maps each condition to a list of masked genes for that condition.

Type:

Dict[str, List[str]]

control_cells_ids

Dictionary mapping each condition to a dictionary whose keys are treated cell barcodes and whose values are the matched control cell barcodes. It is used primarily to create differential expression results during data processing and may be removed in a future release.

Type:

dict
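The nested mapping described for control_cells_ids can be pictured with a small example. The sketch below assumes the Dict[str, Dict[str, List[str]]] shape given for the control_mapping property; the condition name, the barcodes, and the helper controls_for are hypothetical and do not describe the behaviour of the library's own get_controls method.

```python
from typing import Dict, List, Optional

# condition -> treated cell barcode -> matched control cell barcodes
ControlMap = Dict[str, Dict[str, List[str]]]

control_map: ControlMap = {
    "GENE_A": {
        "treated_1": ["ctrl_7", "ctrl_9"],
        "treated_2": ["ctrl_3"],
    },
}


def controls_for(
    mapping: ControlMap, condition: str, treated_barcode: Optional[str] = None
) -> List[str]:
    """Hypothetical lookup helper for the nested mapping."""
    per_condition = mapping[condition]
    if treated_barcode is not None:
        return list(per_condition[treated_barcode])
    # No barcode given: collect every control for the condition,
    # dropping duplicates while preserving first-seen order.
    seen: Dict[str, None] = {}
    for controls in per_condition.values():
        for barcode in controls:
            seen.setdefault(barcode, None)
    return list(seen)


one_cell = controls_for(control_map, "GENE_A", "treated_1")
all_cells = controls_for(control_map, "GENE_A")
```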

Initialize a SingleCellPerturbationDataset instance.

Parameters:
  • path (Path) – Path to the dataset file.

  • organism (Organism) – Enum value indicating the organism.

  • condition_key (str) – Key for the column in adata.obs specifying conditions. Defaults to “condition”.

  • control_name (str) – Name of the control condition. Defaults to “ctrl”.

  • de_gene_col (str) – Column in the differential expression results that holds the names of the differentially expressed genes. Defaults to “gene”.

  • de_metric_col (str) – Column name for the metric of the differential expression results. Defaults to “logfoldchange”.

  • de_pval_col (str) – Column name for the p-value of the differential expression results. Defaults to “pval_adj”.

  • percent_genes_to_mask (float) – Fraction of genes to mask, expressed as a value between 0 and 1. Default is 0.5.

  • min_de_genes_to_mask (int) – Minimum number of differentially expressed genes required to mask that condition. If not met, no genes are masked. Default is 5.

  • pval_threshold (float) – P-value threshold for differential expression. Default is 1e-4.

  • min_logfoldchange (float) – Minimum log-fold change for differential expression. Default is 1.0.

  • task_inputs_dir (Optional[Path]) – Path to the directory containing the task inputs. Default is None. If not provided, a default path will be used.

  • random_seed (int) – Random seed for reproducibility.

  • target_conditions_override (Optional[Dict[str, List[str]]]) – Dictionary that maps a target condition to a list of genes that the user specified to be masked. This overrides the default sampling of genes for masking in target_conditions_dict. Default is None.

property de_results: pandas.DataFrame
target_conditions_dict: dict
control_cells_ids: dict
UNS_DE_RESULTS_KEY = 'de_results'
UNS_CONTROL_MAP_KEY = 'control_cells_map'
UNS_TARGET_GENES_KEY = 'target_conditions_dict'
UNS_METRIC_COL_KEY = 'metric_column'
UNS_CONFIG_KEY = 'config'
UNS_RANDOM_SEED_KEY = 'random_seed'
random_seed = 42
condition_key = 'condition'
control_name = 'ctrl'
deg_test_name = 'wilcoxon'
de_gene_col = 'gene'
de_metric_col = 'logfoldchange'
de_pval_col = 'pval_adj'
target_conditions_override = None
percent_genes_to_mask = 0.5
min_de_genes_to_mask = 5
pval_threshold = 0.0001
min_logfoldchange = 1.0
load_and_filter_deg_results()

Load and filter differential expression results from adata.uns.

  • Enforces that de_pval_col and de_metric_col are present in the dataframe and are not null.

  • Filters out rows where the p-value is greater than pval_threshold.

  • Filters out rows where the metric is less than min_logfoldchange.

  • Returns the filtered dataframe.

Returns:

Differential expression results dataframe after filtering.

Return type:

pd.DataFrame
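The filtering rules listed for this method can be illustrated in plain Python. The sketch below works on a list of dicts rather than a DataFrame so that it needs no third-party packages; filter_de_rows is a hypothetical helper, raising ValueError for missing or null values is an assumption about how the check is enforced, and the comparison follows the description literally (no absolute value is taken on the metric).

```python
def filter_de_rows(
    rows,
    metric_col="logfoldchange",
    pval_col="pval_adj",
    pval_threshold=1e-4,
    min_logfoldchange=1.0,
):
    """Hypothetical row-wise version of the documented filter."""
    kept = []
    for row in rows:
        # Both columns must be present and non-null.
        if row.get(metric_col) is None or row.get(pval_col) is None:
            raise ValueError(f"{metric_col} and {pval_col} must be present and not null")
        if row[pval_col] > pval_threshold:
            continue  # p-value above the threshold
        if row[metric_col] < min_logfoldchange:
            continue  # metric below the minimum
        kept.append(row)
    return kept


rows = [
    {"gene": "A", "logfoldchange": 2.5, "pval_adj": 1e-6},  # kept
    {"gene": "B", "logfoldchange": 0.4, "pval_adj": 1e-8},  # metric too small
    {"gene": "C", "logfoldchange": 3.0, "pval_adj": 0.01},  # p-value too large
    {"gene": "D", "logfoldchange": 1.0, "pval_adj": 1e-4},  # on both boundaries, kept
]
kept_genes = [row["gene"] for row in filter_de_rows(rows)]
```

Rows that sit exactly on a threshold are kept here, because the description only excludes p-values greater than pval_threshold and metrics less than min_logfoldchange.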

load_data() → None

Load the dataset and populate the perturbation truth data.

  • Validates the presence of required keys and values in adata:

      • condition_key in adata.obs

      • control_name present in adata.obs[condition_key]

      • de_results_{self.deg_test_name} in adata.uns

      • control_cells_map in adata.uns

  • Loads and filters differential expression results from adata.uns, keeping only genes whose differential expression meets user-defined thresholds.

  • Populates the target_conditions_dict attribute
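How target_conditions_dict might be filled from the filtered genes can be sketched as follows. The source documents only the parameters (percent_genes_to_mask, min_de_genes_to_mask, random_seed), so the sampling procedure here is an assumption: the fraction is applied to each condition's differentially expressed genes, conditions below the minimum get an empty list, and sample_genes_to_mask is a hypothetical helper.

```python
import random


def sample_genes_to_mask(
    de_genes_by_condition,
    percent_genes_to_mask=0.5,
    min_de_genes_to_mask=5,
    random_seed=42,
):
    """Hypothetical sampler: pick a fraction of DE genes per condition."""
    rng = random.Random(random_seed)  # seeded for reproducibility
    target_conditions = {}
    for condition in sorted(de_genes_by_condition):
        genes = list(de_genes_by_condition[condition])
        if len(genes) < min_de_genes_to_mask:
            # Too few DE genes: nothing is masked for this condition.
            target_conditions[condition] = []
            continue
        n_to_mask = int(len(genes) * percent_genes_to_mask)
        target_conditions[condition] = sorted(rng.sample(genes, n_to_mask))
    return target_conditions


de_genes = {
    "GENE_A": [f"g{i}" for i in range(10)],  # 10 DE genes -> 5 masked
    "GENE_B": ["g1", "g2", "g3"],            # below the minimum -> none masked
}
masked = sample_genes_to_mask(de_genes)
masked_again = sample_genes_to_mask(de_genes)
```

Iterating over conditions in sorted order keeps the result independent of dictionary insertion order, so the same seed always yields the same masks.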

property metric_column: str
property control_mapping: Dict[str, Dict[str, List[str]]]
property target_genes: Dict[str, List[str]]
set_control_mapping(raw_mapping: Dict) → None
get_controls(condition: str, treated_barcode: str | None = None) → List[str]
get_indices_for(condition: str, treated_barcodes: List[str] | None = None) → tuple[numpy.ndarray, numpy.ndarray]
store_task_inputs() → pathlib.Path

Store all task inputs into a single .h5ad file.

The AnnData object contains in uns:

  • target_conditions_dict

  • de_results (DataFrame with required columns)

  • control_cells_ids

Returns:

Path to the task inputs directory.

Return type:

Path

class czbenchmarks.datasets.Organism(name: str, prefix: str)

Bases: enum.Enum

Enumeration of the organisms supported by czbenchmarks datasets. Each member's value is a (name, prefix) pair, where name is the species name and prefix is the string listed with it below (for most members an Ensembl-style gene ID prefix such as ENSG).

HUMAN = ('homo_sapiens', 'ENSG')
MOUSE = ('mus_musculus', 'ENSMUSG')
TROPICAL_CLAWED_FROG = ('xenopus_tropicalis', 'ENSXETG')
AFRICAN_CLAWED_FROG = ('xenopus_laevis', 'ENSXLAG')
ZEBRAFISH = ('danio_rerio', 'ENSDARG')
MOUSE_LEMUR = ('microcebus_murinus', 'ENSMICG')
WILD_BOAR = ('sus_scrofa', 'ENSSSCG')
CRAB_EATING_MACAQUE = ('macaca_fascicularis', 'ENSMFAG')
RHESUS_MACAQUE = ('macaca_mulatta', 'ENSMMUG')
PLATYPUS = ('ornithorhynchus_anatinus', 'ENSOANG')
OPOSSUM = ('monodelphis_domestica', 'ENSMODG')
GORILLA = ('gorilla_gorilla', 'ENSGGOG')
CHIMPANZEE = ('pan_troglodytes', 'ENSPTRG')
MARMOSET = ('callithrix_jacchus', 'ENSCJAG')
CHICKEN = ('gallus_gallus', 'ENSGALG')
RABBIT = ('oryctolagus_cuniculus', 'ENSOCUG')
FRUIT_FLY = ('drosophila_melanogaster', 'FBgn')
RAT = ('rattus_norvegicus', 'ENSRNOG')
NAKED_MOLE_RAT = ('heterocephalus_glaber', 'ENSHGLG')
CAENORHABDITIS_ELEGANS = ('caenorhabditis_elegans', 'WBGene')
YEAST = ('saccharomyces_cerevisiae', '')
MALARIA_PARASITE = ('plasmodium_falciparum', 'PF3D7')
SEA_LAMPREY = ('petromyzon_marinus', 'ENSPMAG')
FRESHWATER_SPONGE = ('spongilla_lacustris', 'ENSLPGG')
CORAL = ('stylophora_pistillata', 'LOC')
SEA_URCHIN = ('lytechinus_variegatus', '')
__str__()
__repr__()
property name

The name of the Enum member.

property prefix
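An Enum whose members carry a (name, prefix) tuple can be sketched as below. OrganismSketch is a hypothetical stand-in with two of the members listed above; it stores the tuple fields under the attribute names species and gene_prefix to avoid clashing with Enum's built-in name attribute, so it does not reproduce the real class's name and prefix properties. Using the prefix to test gene IDs is likewise an assumption about its purpose.

```python
from enum import Enum


class OrganismSketch(Enum):
    """Hypothetical stand-in for an Enum with (name, prefix) tuple values."""

    HUMAN = ("homo_sapiens", "ENSG")
    MOUSE = ("mus_musculus", "ENSMUSG")

    def __init__(self, species: str, gene_prefix: str):
        # Enum unpacks each member's tuple value into __init__ arguments.
        self.species = species
        self.gene_prefix = gene_prefix

    def __str__(self) -> str:
        return self.species

    def matches_gene_id(self, gene_id: str) -> bool:
        # Assumed use of the prefix: check whether a gene ID belongs here.
        return gene_id.startswith(self.gene_prefix)


human = OrganismSketch.HUMAN
looked_up = OrganismSketch(("mus_musculus", "ENSMUSG"))  # lookup by value
```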
czbenchmarks.datasets.list_available_datasets() → Dict[str, Dict[str, str]]

Return the datasets defined in the datasets.yaml Hydra configuration, sorted alphabetically by name.

Returns:

Dictionary keyed by dataset name, in alphabetical order, with a dictionary of string fields for each dataset.

Return type:

Dict[str, Dict[str, str]]

Notes

  • Loads configuration using Hydra.

  • Extracts dataset names from the datasets section of the configuration.

  • Sorts the dataset names alphabetically for easier readability.
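The steps in these notes (read the datasets section, extract the names, sort them) can be sketched without Hydra by using a plain dictionary in place of the loaded configuration. The dataset names, the fields inside each entry, and the helper list_datasets are hypothetical.

```python
def list_datasets(cfg: dict) -> dict:
    """Hypothetical helper: dataset entries keyed by name, sorted by name."""
    datasets = cfg.get("datasets", {})
    # dicts preserve insertion order, so inserting in sorted order
    # yields an alphabetically ordered mapping.
    return {name: dict(datasets[name]) for name in sorted(datasets)}


# Stand-in for the configuration that would be loaded from datasets.yaml.
config = {
    "datasets": {
        "zeta_example": {"organism": "MOUSE", "path": "/data/zeta.h5ad"},
        "alpha_example": {"organism": "HUMAN", "path": "/data/alpha.h5ad"},
    }
}

available = list_datasets(config)
names = list(available)
```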

czbenchmarks.datasets.load_dataset(dataset_name: str) → czbenchmarks.datasets.dataset.Dataset

Load, download (if needed), and instantiate a dataset using Hydra configuration.

Parameters:

dataset_name (str) – Name of the dataset as specified in the configuration.

Returns:

Instantiated dataset object with data loaded.

Return type:

Dataset

Raises:

ValueError – If the specified dataset is not found in the configuration.

Notes

  • Uses Hydra for instantiation and configuration management.

  • Downloads dataset file if a remote path is specified using download_file_from_remote.

  • The returned dataset object is an instance of the Dataset class or its subclass.
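The lookup and download decision described above can be sketched with a plain dictionary in place of the Hydra configuration. This is an illustration only: resolve_dataset_path is a hypothetical helper, the set of URL schemes treated as remote is an assumption, and the actual download (done by download_file_from_remote in the library) is not reproduced.

```python
from urllib.parse import urlparse

REMOTE_SCHEMES = {"s3", "http", "https"}  # assumed set of remote schemes


def resolve_dataset_path(cfg: dict, dataset_name: str):
    """Hypothetical helper: return (path, needs_download) for a dataset."""
    datasets = cfg.get("datasets", {})
    if dataset_name not in datasets:
        # Mirrors the documented ValueError for unknown dataset names.
        raise ValueError(f"Dataset {dataset_name!r} not found in configuration")
    path = datasets[dataset_name]["path"]
    needs_download = urlparse(path).scheme in REMOTE_SCHEMES
    return path, needs_download


config = {
    "datasets": {
        "remote_example": {"path": "s3://example-bucket/remote.h5ad"},
        "local_example": {"path": "/data/local.h5ad"},
    }
}

remote = resolve_dataset_path(config, "remote_example")
local = resolve_dataset_path(config, "local_example")
```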

czbenchmarks.datasets.load_custom_dataset(dataset_name: str, custom_dataset_config_path: str | None = None, custom_dataset_kwargs: Dict[str, Any] | None = None, cache_dir: str | None = None) → czbenchmarks.datasets.dataset.Dataset

Instantiate a dataset with a custom configuration. This can include but is not limited to a local path for a custom dataset file and/or a dictionary of custom parameters to update the default configuration. If the dataset name does not exist in the default config, this function will add the dataset to the configuration.

Parameters:
  • dataset_name – The name of the dataset, either custom or from the config

  • custom_dataset_config_path – Optional path to a YAML file containing a custom configuration that can be used to update the existing default configuration.

  • custom_dataset_kwargs – Custom configuration dictionary to update the default configuration of the dataset class.

  • cache_dir – Optional directory to cache the dataset file. If not provided, the global cache manager directory will be used.

Returns:

Instantiated dataset object with data loaded.
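The update-or-add behaviour described for the configuration can be pictured as a dictionary merge. The sketch below is an assumption about the semantics (custom values override defaults, and an unknown dataset name creates a new entry); merge_dataset_config is a hypothetical helper, and the real function works on a Hydra configuration rather than plain dicts.

```python
import copy


def merge_dataset_config(default_cfg: dict, dataset_name: str, custom_kwargs=None) -> dict:
    """Hypothetical helper: overlay custom settings on a default config."""
    merged = copy.deepcopy(default_cfg)  # leave the defaults untouched
    # Create the entry if the dataset name is not in the default config.
    entry = merged.setdefault("datasets", {}).setdefault(dataset_name, {})
    entry.update(custom_kwargs or {})
    return merged


defaults = {"datasets": {"known_example": {"path": "/data/known.h5ad", "organism": "HUMAN"}}}

# Override one field of an existing dataset.
updated = merge_dataset_config(defaults, "known_example", {"path": "example-small.h5ad"})

# Add a dataset that the default config does not contain.
added = merge_dataset_config(defaults, "my_dataset", {"path": "example-small.h5ad"})
```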

Example

```python
from czbenchmarks.datasets.types import Organism
from czbenchmarks.datasets.utils import load_custom_dataset

custom_dataset_config_path = "/path/to/new_dataset.yaml"

my_dataset_name = "my_dataset"
custom_dataset_kwargs = {
    "organism": Organism.HUMAN,
    "path": "example-small.h5ad",
}

dataset = load_custom_dataset(
    dataset_name=my_dataset_name,
    custom_dataset_config_path=custom_dataset_config_path,
    custom_dataset_kwargs=custom_dataset_kwargs,
)
```