czbenchmarks.datasets.utils

Attributes

logger

Functions

list_available_datasets(→ Dict[str, Dict[str, str]])

Return a sorted list of all dataset names defined in the datasets.yaml Hydra configuration.

load_dataset(→ czbenchmarks.datasets.dataset.Dataset)

Load, download (if needed), and instantiate a dataset using Hydra configuration.

load_custom_dataset(...)

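Instantiate a dataset with a custom configuration. This can include but is not limited to a local path for a custom dataset file and/or a dictionary of custom parameters to update the default configuration.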

Module Contents

czbenchmarks.datasets.utils.logger
czbenchmarks.datasets.utils.list_available_datasets() → Dict[str, Dict[str, str]]

Return a sorted list of all dataset names defined in the datasets.yaml Hydra configuration.

Returns:

Alphabetically sorted list of available dataset names.

Return type:

List[str]

Notes

  • Loads configuration using Hydra.

  • Extracts dataset names from the datasets section of the configuration.

  • Sorts the dataset names alphabetically for easier readability.
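
The steps in the notes above can be illustrated without Hydra. The following is a minimal sketch, assuming the loaded configuration behaves like a nested mapping with a top-level `datasets` key; the `example_config` contents and the `sorted_dataset_names` helper are hypothetical and are not part of czbenchmarks. It follows the list-of-names behavior described in the docstring; since the signature annotates a nested mapping, the real return value may carry additional per-dataset fields.

```python
from typing import Any, Dict, List, Mapping


def sorted_dataset_names(config: Mapping[str, Any]) -> List[str]:
    """Return dataset names from the ``datasets`` section, sorted alphabetically."""
    # A missing ``datasets`` section is treated as empty rather than an error.
    datasets: Mapping[str, Any] = config.get("datasets", {})
    # Iterating a mapping yields its keys, so sorted() returns the sorted names.
    return sorted(datasets)


# Hypothetical stand-in for the configuration Hydra would load from datasets.yaml.
example_config: Dict[str, Any] = {
    "datasets": {
        "zebra_dataset": {"path": "/data/zebra.h5ad"},
        "alpha_dataset": {"path": "/data/alpha.h5ad"},
    }
}

names = sorted_dataset_names(example_config)
print(names)
```

Sorting the keys rather than relying on file order keeps the listing stable even if entries in the YAML file are rearranged.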

czbenchmarks.datasets.utils.load_dataset(dataset_name: str) → czbenchmarks.datasets.dataset.Dataset

Load, download (if needed), and instantiate a dataset using Hydra configuration.

Parameters:

dataset_name (str) – Name of the dataset as specified in the configuration.

Returns:

Instantiated dataset object with data loaded.

Return type:

Dataset

Raises:

ValueError – If the specified dataset is not found in the configuration.

Notes

  • Uses Hydra for instantiation and configuration management.

  • If a remote path is specified, downloads the dataset file using download_file_from_remote.

  • The returned dataset object is an instance of the Dataset class or its subclass.
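
The lookup and error behavior described above can be sketched as follows. This is an illustration only, assuming the configuration is a nested mapping and that remote paths are recognizable by an `s3://` prefix; `resolve_dataset_entry`, `is_remote_path`, and the sample configuration are hypothetical names, not the library's implementation.

```python
from typing import Any, Dict, Mapping


def is_remote_path(path: str) -> bool:
    """Assumption for this sketch: remote paths start with ``s3://``."""
    return path.startswith("s3://")


def resolve_dataset_entry(config: Mapping[str, Any], dataset_name: str) -> Dict[str, Any]:
    """Look up a dataset entry and flag whether its file would need downloading."""
    datasets = config.get("datasets", {})
    if dataset_name not in datasets:
        # Mirrors the documented behavior: unknown names raise ValueError.
        raise ValueError(f"Dataset {dataset_name!r} not found in configuration.")
    entry = dict(datasets[dataset_name])  # copy so the config is not mutated
    entry["needs_download"] = is_remote_path(entry.get("path", ""))
    return entry


# Hypothetical configuration contents.
example_config = {
    "datasets": {
        "remote_example": {"path": "s3://example-bucket/remote.h5ad"},
        "local_example": {"path": "/data/local.h5ad"},
    }
}

remote_entry = resolve_dataset_entry(example_config, "remote_example")
local_entry = resolve_dataset_entry(example_config, "local_example")

# Unknown names surface as ValueError, which callers can catch.
try:
    resolve_dataset_entry(example_config, "missing_example")
except ValueError as err:
    error_message = str(err)
```

Code that calls load_dataset with user-supplied names can wrap the call in the same kind of `try`/`except ValueError` block to report unknown datasets cleanly.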

czbenchmarks.datasets.utils.load_custom_dataset(dataset_name: str, custom_dataset_config_path: str | None = None, custom_dataset_kwargs: Dict[str, Any] | None = None, cache_dir: str | None = None) → czbenchmarks.datasets.dataset.Dataset

Instantiate a dataset with a custom configuration. This can include but is not limited to a local path for a custom dataset file and/or a dictionary of custom parameters to update the default configuration. If the dataset name does not exist in the default config, this function will add the dataset to the configuration.

Parameters:
  • dataset_name (str) – Name of the dataset, either a custom name or one already defined in the default configuration.

  • custom_dataset_config_path (str | None) – Optional path to a YAML file containing a custom configuration used to update the default configuration.

  • custom_dataset_kwargs (Dict[str, Any] | None) – Optional dictionary of custom parameters used to update the default configuration of the dataset class.

  • cache_dir (str | None) – Optional directory in which to cache the dataset file. If not provided, the global cache manager directory is used.

Returns:

Instantiated dataset object with data loaded.
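
One way to picture how the custom parameters are combined with the defaults is a dictionary merge. The sketch below assumes a shallow merge in which custom values override defaults and unknown dataset names are added as new entries. The actual merge semantics in czbenchmarks (for example, whether nested keys are merged recursively, or the order in which the YAML file and the keyword dictionary are applied) are not stated here, and `merge_dataset_config` is a hypothetical helper.

```python
import copy
from typing import Any, Dict, Optional


def merge_dataset_config(
    default_config: Dict[str, Any],
    dataset_name: str,
    custom_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a copy of ``default_config`` with ``custom_kwargs`` applied to one dataset."""
    merged = copy.deepcopy(default_config)  # leave the defaults untouched
    datasets = merged.setdefault("datasets", {})
    # Unknown dataset names get a fresh entry instead of raising an error.
    entry = datasets.setdefault(dataset_name, {})
    # Shallow update: custom values replace defaults key by key.
    entry.update(custom_kwargs or {})
    return merged


# Hypothetical default configuration.
defaults = {
    "datasets": {
        "known_dataset": {"organism": "HUMAN", "path": "/data/known.h5ad"},
    }
}

# Override one field of an existing dataset.
updated = merge_dataset_config(defaults, "known_dataset", {"path": "example-small.h5ad"})

# Add a dataset that the defaults do not define.
added = merge_dataset_config(
    defaults, "my_dataset", {"organism": "HUMAN", "path": "example-small.h5ad"}
)
```

Working on a deep copy means repeated calls with different custom parameters never leak changes into the shared defaults.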

Example

```python
from czbenchmarks.datasets.types import Organism
from czbenchmarks.datasets.utils import load_custom_dataset

custom_dataset_config_path = "/path/to/new_dataset.yaml"

my_dataset_name = "my_dataset"
custom_dataset_kwargs = {
    "organism": Organism.HUMAN,
    "path": "example-small.h5ad",
}

dataset = load_custom_dataset(
    dataset_name=my_dataset_name,
    custom_dataset_config_path=custom_dataset_config_path,
    custom_dataset_kwargs=custom_dataset_kwargs,
)
```