czbenchmarks.datasets.utils
Attributes
Functions
list_available_datasets() | Return a sorted list of all dataset names defined in the datasets.yaml Hydra configuration.
load_dataset(dataset_name) | Load, download (if needed), and instantiate a dataset using Hydra configuration.
load_custom_dataset(dataset_name[, ...]) | Instantiate a dataset with a custom configuration.
Module Contents
- czbenchmarks.datasets.utils.logger
- czbenchmarks.datasets.utils.list_available_datasets() → Dict[str, Dict[str, str]]
Return a sorted list of all dataset names defined in the datasets.yaml Hydra configuration.
- Returns:
Alphabetically sorted list of available dataset names.
- Return type:
List[str]
Notes
Loads configuration using Hydra.
Extracts dataset names from the datasets section of the configuration.
Sorts the dataset names alphabetically for easier readability.
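The extract-and-sort steps in the notes above can be sketched as follows. This is only an illustration: a plain dictionary stands in for the loaded Hydra configuration, and `list_dataset_names`, `example_cfg`, and the dataset names in it are hypothetical, not part of czbenchmarks.

```python
from typing import Any, Dict, List


def list_dataset_names(cfg: Dict[str, Any]) -> List[str]:
    """Return the keys of the ``datasets`` section in alphabetical order."""
    # A missing ``datasets`` section is treated as "no datasets available".
    return sorted(cfg.get("datasets", {}).keys())


# Hypothetical configuration; a plain dict stands in for the Hydra config.
example_cfg = {
    "datasets": {
        "zebra_sample": {"path": "zebra.h5ad"},
        "alpha_sample": {"path": "alpha.h5ad"},
        "mid_sample": {"path": "mid.h5ad"},
    }
}

names = list_dataset_names(example_cfg)
```

Sorting in the helper keeps the output stable regardless of the order in which entries appear in the YAML file.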
- czbenchmarks.datasets.utils.load_dataset(dataset_name: str) → czbenchmarks.datasets.dataset.Dataset
Load, download (if needed), and instantiate a dataset using Hydra configuration.
- Parameters:
dataset_name (str) – Name of the dataset as specified in the configuration.
- Returns:
Instantiated dataset object with data loaded.
- Return type:
czbenchmarks.datasets.dataset.Dataset
- Raises:
ValueError – If the specified dataset is not found in the configuration.
Notes
Uses Hydra for instantiation and configuration management.
If a remote path is specified, downloads the dataset file using download_file_from_remote.
The returned dataset object is an instance of the Dataset class or its subclass.
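The lookup, validation, and remote-path check described above might look roughly like the sketch below. It does not reproduce the library's implementation: `resolve_dataset_entry`, `is_remote`, the set of URL schemes treated as remote, and the sample configuration are all assumptions made for illustration.

```python
from typing import Any, Dict
from urllib.parse import urlparse

# Assumption: these schemes count as "remote"; the library may use other rules.
REMOTE_SCHEMES = {"s3", "http", "https"}


def resolve_dataset_entry(cfg: Dict[str, Any], dataset_name: str) -> Dict[str, Any]:
    """Return the configuration entry for ``dataset_name`` or raise ValueError."""
    datasets = cfg.get("datasets", {})
    if dataset_name not in datasets:
        raise ValueError(
            f"Dataset {dataset_name!r} not found. Available: {sorted(datasets)}"
        )
    return datasets[dataset_name]


def is_remote(path: str) -> bool:
    """Report whether ``path`` uses one of the assumed remote URL schemes."""
    return urlparse(path).scheme in REMOTE_SCHEMES


# Hypothetical configuration; a plain dict stands in for the Hydra config.
example_cfg = {
    "datasets": {
        "remote_sample": {"path": "s3://example-bucket/remote.h5ad"},
        "local_sample": {"path": "/data/local.h5ad"},
    }
}

entry = resolve_dataset_entry(example_cfg, "remote_sample")
needs_download = is_remote(entry["path"])
```

Raising before any download is attempted means a misspelled dataset name fails immediately, with the list of valid names in the message.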
- czbenchmarks.datasets.utils.load_custom_dataset(dataset_name: str, custom_dataset_config_path: str | None = None, custom_dataset_kwargs: Dict[str, Any] | None = None, cache_dir: str | None = None) → czbenchmarks.datasets.dataset.Dataset
Instantiate a dataset with a custom configuration. This can include but is not limited to a local path for a custom dataset file and/or a dictionary of custom parameters to update the default configuration. If the dataset name does not exist in the default config, this function will add the dataset to the configuration.
- Parameters:
dataset_name – Name of the dataset, either a custom name or one defined in the default configuration.
custom_dataset_config_path – Optional path to a YAML file containing a custom configuration that can be used to update the existing default configuration.
custom_dataset_kwargs – Custom configuration dictionary to update the default configuration of the dataset class.
cache_dir – Optional directory to cache the dataset file. If not provided, the global cache manager directory will be used.
- Returns:
Instantiated dataset object with data loaded.
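The update-or-add behaviour described above can be pictured with the following sketch. It covers only the keyword-argument path and leaves out the optional YAML file; `merge_dataset_config`, the dictionary layout, and the sample values are hypothetical, and plain strings stand in for enum values such as Organism.HUMAN.

```python
import copy
from typing import Any, Dict, Optional


def merge_dataset_config(
    default_cfg: Dict[str, Any],
    dataset_name: str,
    custom_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Return a copy of ``default_cfg`` with ``custom_kwargs`` applied to one dataset."""
    merged = copy.deepcopy(default_cfg)  # leave the default configuration untouched
    datasets = merged.setdefault("datasets", {})
    # Unknown dataset names get a new, initially empty entry.
    entry = datasets.setdefault(dataset_name, {})
    entry.update(custom_kwargs or {})
    return merged


# Hypothetical default configuration.
default_cfg = {"datasets": {"known": {"organism": "HUMAN", "path": "a.h5ad"}}}

# Override one field of an existing dataset.
updated = merge_dataset_config(default_cfg, "known", {"path": "b.h5ad"})

# Register a dataset that the default configuration does not contain.
added = merge_dataset_config(default_cfg, "my_dataset", {"path": "example-small.h5ad"})
```

Working on a deep copy keeps one custom load from leaking overrides into later calls that rely on the defaults.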
Example
```python
from czbenchmarks.datasets.types import Organism
from czbenchmarks.datasets.utils import load_custom_dataset

# Optional YAML file containing a custom configuration.
custom_dataset_config_path = "/path/to/new_dataset.yaml"

my_dataset_name = "my_dataset"

# Parameters that update the default configuration of the dataset class.
custom_dataset_kwargs = {
    "organism": Organism.HUMAN,
    "path": "example-small.h5ad",
}

dataset = load_custom_dataset(
    dataset_name=my_dataset_name,
    custom_dataset_config_path=custom_dataset_config_path,
    custom_dataset_kwargs=custom_dataset_kwargs,
)
```