czbenchmarks.datasets.dataset
Classes
Dataset: Abstract base class for datasets.
Module Contents
- class czbenchmarks.datasets.dataset.Dataset(dataset_type_name: str, path: str | pathlib.Path, organism: czbenchmarks.datasets.types.Organism, task_inputs_dir: pathlib.Path | None = None, **kwargs: Any)
Bases: abc.ABC
Abstract base class for datasets.
Each concrete Dataset subclass is responsible for extracting and managing the data required for a specific type of task from the provided input file. Subclasses should define instance variables to store these task-specific data items, which can then be accessed as object attributes or written to files for downstream use.
All Dataset instances must specify an Organism enum value to indicate the organism from which the data was derived.
- Subclasses must implement:
load_data: Loads the dataset from the input file and populates relevant instance variables.
store_task_inputs: Stores the extracted task-specific inputs in files or directories as needed.
_validate: Validates dataset-specific constraints and requirements.
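The contract above can be sketched with a stand-in base class. The snippet does not import czbenchmarks: `DatasetSketch`, `TextLinesDataset`, the `lines` attribute, and the `lines.txt` file name are hypothetical names used only for illustration, and the real base class may differ in its details.

```python
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path


class DatasetSketch(ABC):
    """Hypothetical stand-in exposing the three methods subclasses must implement."""

    def __init__(self, path, task_inputs_dir):
        self.path = Path(path)
        self.task_inputs_dir = Path(task_inputs_dir)

    @abstractmethod
    def load_data(self) -> None: ...

    @abstractmethod
    def store_task_inputs(self) -> Path: ...

    @abstractmethod
    def _validate(self) -> None: ...


class TextLinesDataset(DatasetSketch):
    """Hypothetical subclass: each non-empty line of a text file is one record."""

    def load_data(self) -> None:
        # Populate an instance variable that other methods can read.
        self.lines = [ln for ln in self.path.read_text().splitlines() if ln.strip()]

    def store_task_inputs(self) -> Path:
        # Write the extracted inputs to disk and return the directory holding them.
        self.task_inputs_dir.mkdir(parents=True, exist_ok=True)
        (self.task_inputs_dir / "lines.txt").write_text("\n".join(self.lines))
        return self.task_inputs_dir

    def _validate(self) -> None:
        # Dataset-specific constraint: at least one record must be present.
        if not self.lines:
            raise ValueError("dataset contains no records")


def run_example():
    """Load, validate, and store a tiny dataset inside a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "data.txt"
        source.write_text("a\n\nb\n")
        ds = TextLinesDataset(source, Path(tmp) / "task_inputs")
        ds.load_data()
        ds._validate()
        out_dir = ds.store_task_inputs()
        return len(ds.lines), out_dir.name
```

Because all three methods are abstract, instantiating `DatasetSketch` directly raises `TypeError`; only a subclass that implements every one of them can be created.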
- path
The path to the dataset file.
- task_inputs_dir
The directory where task-specific input files are stored.
- organism
The organism from which the data was derived.
Initialize a Dataset instance.
- Parameters:
dataset_type_name (str) – Name of the dataset type (used for directory naming).
path (str | Path) – Path to the dataset file.
organism (Organism) – Enum value indicating the organism.
task_inputs_dir (Optional[Path]) – Directory for storing task-specific inputs.
kwargs (Any) – Additional attributes for the dataset.
- Raises:
ValueError – If the dataset path does not exist.
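As a rough illustration of the documented constructor behaviour, the sketch below accepts either a `str` or a `Path`, raises `ValueError` when the path is missing, and keeps extra keyword arguments. `DatasetInitSketch` and `OrganismSketch` are hypothetical stand-ins, not the library's classes, and what the real constructor does when `task_inputs_dir` is `None` is not stated here, so the sketch simply stores the value.

```python
from enum import Enum
from pathlib import Path


class OrganismSketch(Enum):
    """Hypothetical stand-in for czbenchmarks.datasets.types.Organism."""

    HUMAN = "human"


class DatasetInitSketch:
    """Hypothetical stand-in that mirrors only the documented constructor checks."""

    def __init__(self, dataset_type_name, path, organism, task_inputs_dir=None, **kwargs):
        self.path = Path(path)  # normalise str input to a Path
        if not self.path.exists():
            raise ValueError(f"Dataset path does not exist: {self.path}")
        self.dataset_type_name = dataset_type_name
        self.organism = organism
        # Assumption: stored unchanged; the real default for None is not documented here.
        self.task_inputs_dir = task_inputs_dir
        self.kwargs = kwargs
```

Calling `DatasetInitSketch("demo", "/no/such/file", OrganismSketch.HUMAN)` fails fast with `ValueError`, before any data is loaded.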
- path: pathlib.Path
- task_inputs_dir: pathlib.Path
- organism: czbenchmarks.datasets.types.Organism
- kwargs
- abstract load_data() → None
Load the dataset from its source file into memory.
Subclasses must implement this method to load their specific data format. For example, SingleCellDataset loads an AnnData object from an h5ad file.
The loaded data should be stored as instance attributes that can be accessed by other methods.
- abstract store_task_inputs() → pathlib.Path
Store the task-specific inputs extracted from the dataset.
Subclasses must implement this method to store task-specific files in a subdirectory of the dataset path. The subdirectory name is determined by the subclass.
- Returns:
The path to the directory storing the task input files.
- Return type:
Path
- validate() → None
Performs general validation checks, such as ensuring the organism is a valid Organism enum value. Calls _validate for subclass-specific validation.
- Raises:
ValueError – If validation fails.
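The split between validate and _validate follows the template-method pattern: the base class runs the general checks, then delegates to the subclass hook. A minimal sketch, assuming only what is described above; `ValidatingSketch`, `NonEmptySketch`, `OrganismSketch`, and the `records` attribute are hypothetical names, and the real general checks may cover more than the organism.

```python
from abc import ABC, abstractmethod
from enum import Enum


class OrganismSketch(Enum):
    """Hypothetical stand-in for czbenchmarks.datasets.types.Organism."""

    HUMAN = "human"
    MOUSE = "mouse"


class ValidatingSketch(ABC):
    """Hypothetical base class showing the validate -> _validate delegation."""

    def __init__(self, organism):
        self.organism = organism

    def validate(self) -> None:
        # General check shared by every dataset.
        if not isinstance(self.organism, OrganismSketch):
            raise ValueError("organism must be an OrganismSketch enum value")
        # Subclass-specific checks run only after the general ones pass.
        self._validate()

    @abstractmethod
    def _validate(self) -> None: ...


class NonEmptySketch(ValidatingSketch):
    """Hypothetical subclass whose only constraint is having at least one record."""

    def __init__(self, organism, records):
        super().__init__(organism)
        self.records = records

    def _validate(self) -> None:
        if not self.records:
            raise ValueError("no records loaded")
```

Callers invoke only `validate()`; passing a plain string such as `"human"` as the organism, or an empty `records` list, raises `ValueError` in this sketch.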