czbenchmarks.datasets.utils

Functions

load_dataset(→ czbenchmarks.datasets.dataset.Dataset)

Load, download (if needed), and instantiate a dataset using Hydra configuration.

list_available_datasets(→ Dict[str, Dict[str, str]])

Return the datasets defined in the datasets.yaml Hydra configuration, keyed by name and sorted alphabetically.

load_local_dataset(→ czbenchmarks.datasets.dataset.Dataset)

Instantiate a dataset directly from arguments without requiring a YAML file.

Module Contents

czbenchmarks.datasets.utils.load_dataset(dataset_name: str, config_path: str | None = None) → czbenchmarks.datasets.dataset.Dataset

Load, download (if needed), and instantiate a dataset using Hydra configuration.

Parameters:
  • dataset_name (str) – Name of the dataset as specified in the configuration.

  • config_path (Optional[str]) – Optional path to a custom config YAML file. If not provided, only the package’s default config is used.

Returns:

Instantiated dataset object with data loaded.

Return type:

Dataset

Raises:
  • FileNotFoundError – If the custom config file does not exist.

  • ValueError – If the specified dataset is not found in the configuration.

Notes

  • If a custom config is provided, it is merged with the default config.

  • If the dataset path is remote, the dataset file is downloaded with download_file_from_remote.

  • Uses Hydra for instantiation and configuration management.

  • The returned dataset object is an instance of the Dataset class or its subclass.
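The two documented exceptions can be handled explicitly at the call site. The following is a minimal sketch, not part of the library: the helper name try_load_dataset, its loader parameter, and the dataset name used in the demonstration are hypothetical. The loader parameter exists only so the helper can be run with a stand-in when czbenchmarks is not installed.

```python
from typing import Any, Callable, Optional


def try_load_dataset(
    dataset_name: str,
    config_path: Optional[str] = None,
    loader: Optional[Callable[..., Any]] = None,
) -> Optional[Any]:
    """Call load_dataset and turn the two documented errors into None."""
    if loader is None:
        # Imported lazily so the helper can also be exercised with a stub.
        from czbenchmarks.datasets.utils import load_dataset as loader

    try:
        return loader(dataset_name, config_path=config_path)
    except FileNotFoundError as err:
        # Documented: raised when the custom config file does not exist.
        print(f"Custom config not found: {err}")
    except ValueError as err:
        # Documented: raised when the dataset is not in the configuration.
        print(f"Unknown dataset {dataset_name!r}: {err}")
    return None


# Demonstration with a stand-in loader (hypothetical), so the sketch runs
# without the package or any configured dataset.
def _fake_loader(name: str, config_path: Optional[str] = None) -> Any:
    raise ValueError("not in configuration")


print(try_load_dataset("hypothetical_dataset", loader=_fake_loader))
```

In real use the loader argument would be omitted, so that the helper calls load_dataset itself.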

czbenchmarks.datasets.utils.list_available_datasets() → Dict[str, Dict[str, str]]

Return the datasets defined in the datasets.yaml Hydra configuration, keyed by name and sorted alphabetically.

Returns:

Dictionary mapping each available dataset name to a dictionary of string attributes for that dataset, sorted alphabetically by name.

Return type:

Dict[str, Dict[str, str]]

Notes

  • Loads configuration using Hydra.

  • Extracts dataset names from the datasets section of the configuration.

  • Sorts the dataset names alphabetically for easier readability.
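One use of this function is validating a name before passing it to load_dataset. The sketch below is illustrative only: check_dataset_name is a hypothetical helper, and the stand-in value takes the place of a real call to list_available_datasets(). It assumes only that iterating over the return value yields dataset names, which holds for a dictionary keyed by name as well as for a plain list.

```python
from typing import Iterable, List


def check_dataset_name(name: str, available: Iterable[str]) -> List[str]:
    """Return the sorted known names; raise ValueError if name is not one."""
    names = sorted(available)  # iterating over a dict yields its keys
    if name not in names:
        raise ValueError(f"Unknown dataset {name!r}; choose one of {names}")
    return names


# Stand-in for list_available_datasets(); both names are hypothetical.
stand_in = {"dataset_b": {}, "dataset_a": {}}
print(check_dataset_name("dataset_a", stand_in))
```

With the package installed, the second argument would be the result of list_available_datasets().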

czbenchmarks.datasets.utils.load_local_dataset(dataset_class: str, organism: czbenchmarks.datasets.types.Organism, path: str | pathlib.Path, **kwargs) → czbenchmarks.datasets.dataset.Dataset

Instantiate a dataset directly from arguments without requiring a YAML file.

This function is completely independent from load_dataset() and directly instantiates the dataset class without using OmegaConf objects.

Parameters:
  • dataset_class (str) – The full import path to the Dataset class to instantiate.

  • organism (Organism) – The organism of the dataset.

  • path (str | pathlib.Path) – The local or remote path to the dataset file.

  • **kwargs – Additional key-value pairs for the dataset config.

Returns:

Instantiated dataset object with data loaded.

Example

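from czbenchmarks.datasets.types import Organism
from czbenchmarks.datasets.utils import load_local_dataset

dataset = load_local_dataset(
    dataset_class="czbenchmarks.datasets.SingleCellLabeledDataset",
    organism=Organism.HUMAN,
    path="example-small.h5ad",
)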
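The dataset_class argument is a dotted import path given as a string. How the library resolves that string is not shown on this page. As a general illustration only, a dotted path can be turned into a class with the standard library as sketched below; resolve_class is a hypothetical helper and is not claimed to match the library's implementation.

```python
import importlib


def resolve_class(dotted_path: str) -> type:
    """Import the module part of a dotted path and return the named attribute."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"Expected 'package.module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Demonstrated on a standard-library class so the sketch runs anywhere.
print(resolve_class("pathlib.Path").__name__)
```

A path without a module part is rejected early, which gives a clearer error than a failed import.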