czbenchmarks.datasets
Submodules
Attributes
DataValue
Classes
SingleCellDataset | Single cell dataset containing gene expression data and metadata.
PerturbationSingleCellDataset | Single cell dataset with perturbation data, containing control and perturbed cells.
BaseDataset | Helper class that provides a standard way to create an ABC using inheritance.
DataType | Create a collection of name/value pairs.
Organism | Create a collection of name/value pairs.
Functions
load_dataset(dataset_name, config_path=None) | Download and instantiate a dataset using Hydra configuration.
list_available_datasets() | Lists all available datasets defined in the datasets.yaml configuration file.
Package Contents
- czbenchmarks.datasets.load_dataset(dataset_name: str, config_path: str | None = None) → czbenchmarks.datasets.base.BaseDataset
Download and instantiate a dataset using Hydra configuration.
- Parameters:
dataset_name – Name of dataset as specified in config
config_path – Optional path to config yaml file. If not provided, will use only the package’s default config.
- Returns:
Instantiated dataset object
- Return type:
czbenchmarks.datasets.base.BaseDataset
- czbenchmarks.datasets.list_available_datasets() → List[str]
Lists all available datasets defined in the datasets.yaml configuration file.
- Returns:
A sorted list of dataset names available in the configuration.
- Return type:
List[str]
- class czbenchmarks.datasets.SingleCellDataset(path: str, organism: czbenchmarks.datasets.types.Organism)
Bases:
czbenchmarks.datasets.base.BaseDataset
Single cell dataset containing gene expression data and metadata.
Handles loading and validation of AnnData objects with gene expression data and associated metadata for a specific organism.
- load_data() → None
Load the dataset into memory.
This method should be implemented by subclasses to load their specific data format. For example, SingleCellDataset loads an AnnData object from an h5ad file.
The loaded data should be stored as instance attributes that can be accessed by other methods.
- unload_data() → None
Unload the dataset from memory.
This method should be implemented by subclasses to free memory by clearing loaded data. For example, SingleCellDataset sets its AnnData object to None.
This is used to clear memory-intensive data before serialization, since serializing large raw data artifacts can be error-prone and inefficient.
Any instance attributes containing loaded data should be cleared or set to None.
- property organism: czbenchmarks.datasets.types.Organism
- property adata: anndata.AnnData
- class czbenchmarks.datasets.PerturbationSingleCellDataset(path: str, organism: czbenchmarks.datasets.types.Organism, condition_key: str = 'condition', split_key: str = 'split')
Bases:
SingleCellDataset
Single cell dataset with perturbation data, containing control and perturbed cells.
Input data requirements:
H5AD file containing single cell gene expression data.
Must have a condition column in adata.obs specifying control (“ctrl”) and perturbed conditions.
Must have a split column in adata.obs to identify test samples.
Condition format must be one of:
ctrl for control samples
{gene}+ctrl for single gene perturbations
{gene1}+{gene2} for combinatorial perturbations
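The condition formats listed above can be checked with a small parser. This is a minimal sketch and is not taken from czbenchmarks: the function name `parse_condition` is hypothetical, and it assumes gene symbols never contain `+` and that the gene always precedes `ctrl` in the single-gene form.

```python
from typing import List


def parse_condition(condition: str) -> List[str]:
    """Return the perturbed genes named by a condition label.

    Hypothetical helper, not part of czbenchmarks. Accepts the three
    documented formats: "ctrl", "{gene}+ctrl" and "{gene1}+{gene2}".
    """
    if condition == "ctrl":
        return []  # control sample: nothing is perturbed

    parts = condition.split("+")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"unrecognized condition format: {condition!r}")

    first, second = parts
    if first == "ctrl":
        # Assumption: only the documented order "{gene}+ctrl" is valid.
        raise ValueError(f"unrecognized condition format: {condition!r}")
    if second == "ctrl":
        return [first]  # single gene perturbation
    return [first, second]  # combinatorial perturbation
```

Running such a check over the unique values of the condition column before loading would surface malformed labels early, rather than partway through an evaluation.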
- load_data() → None
Load the dataset into memory.
This method should be implemented by subclasses to load their specific data format. For example, SingleCellDataset loads an AnnData object from an h5ad file.
The loaded data should be stored as instance attributes that can be accessed by other methods.
- unload_data() → None
Unload the dataset from memory.
This method should be implemented by subclasses to free memory by clearing loaded data. For example, SingleCellDataset sets its AnnData object to None.
This is used to clear memory-intensive data before serialization, since serializing large raw data artifacts can be error-prone and inefficient.
Any instance attributes containing loaded data should be cleared or set to None.
- property perturbation_truth: Dict[str, pandas.DataFrame]
- class czbenchmarks.datasets.BaseDataset(path: str, **kwargs: Any)
Bases:
abc.ABC
Helper class that provides a standard way to create an ABC using inheritance.
- path
- kwargs
- property inputs: Dict[czbenchmarks.datasets.types.DataType, czbenchmarks.datasets.types.DataValue]
Get the inputs dictionary.
- property outputs: czbenchmarks.models.types.ModelOutputs
Get the outputs dictionary.
- set_input(data_type: czbenchmarks.datasets.types.DataType, value: czbenchmarks.datasets.types.DataValue) → None
Safely set an input with type checking.
- set_output(model_type: czbenchmarks.models.types.ModelType | None, data_type: czbenchmarks.datasets.types.DataType, value: czbenchmarks.datasets.types.DataValue) → None
Safely set an output with type checking.
- Parameters:
model_type (ModelType | None) – The type of model associated with the output. This parameter is used to differentiate between outputs from various models. It can be set to None if the output is not tied to a specific model type defined in the ModelType enum.
data_type (DataType) – Specifies the data type of the output.
value (Any) – The value to assign to the output.
- get_input(data_type: czbenchmarks.datasets.types.DataType) → czbenchmarks.datasets.types.DataValue
Safely get an input with error handling.
- get_output(model_type: czbenchmarks.models.types.ModelType | None, data_type: czbenchmarks.datasets.types.DataType) → czbenchmarks.datasets.types.DataValue
Safely get an output with error handling.
- Parameters:
model_type (ModelType | None) – The type of model associated with the output. This parameter is used to differentiate between outputs from various models. It can be set to None if the output is not tied to a specific model type defined in the ModelType enum.
data_type (DataType) – Specifies the data type of the output.
- Returns:
The value of the output.
- Return type:
DataValue
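The reference above does not say how the "type checking" and "error handling" of the input and output accessors work. The sketch below shows one plausible reading, assuming each data type carries an expected Python type (compare the dtype property on DataType) and that outputs are stored per model type. `ToyDataType`, `ToyStore`, the use of plain strings for model types, and the exception types raised are all illustrative assumptions, not the library's implementation.

```python
from enum import Enum
from typing import Any, Dict, Optional


class ToyDataType(Enum):
    """Stand-in for DataType: each value pairs a label with an expected type."""

    ORGANISM = ("organism", str)
    EMBEDDING = ("embedding", list)

    @property
    def dtype(self) -> type:
        return self.value[1]


class ToyStore:
    """Stand-in for the inputs/outputs bookkeeping of a dataset."""

    def __init__(self) -> None:
        self._inputs: Dict[ToyDataType, Any] = {}
        # Outputs are keyed first by model type (None is allowed), then by data type.
        self._outputs: Dict[Optional[str], Dict[ToyDataType, Any]] = {}

    @staticmethod
    def _check(data_type: ToyDataType, value: Any) -> None:
        if not isinstance(value, data_type.dtype):
            raise TypeError(
                f"{data_type.name} expects {data_type.dtype.__name__}, "
                f"got {type(value).__name__}"
            )

    def set_input(self, data_type: ToyDataType, value: Any) -> None:
        self._check(data_type, value)
        self._inputs[data_type] = value

    def get_input(self, data_type: ToyDataType) -> Any:
        if data_type not in self._inputs:
            raise KeyError(f"input {data_type.name} has not been set")
        return self._inputs[data_type]

    def set_output(
        self, model_type: Optional[str], data_type: ToyDataType, value: Any
    ) -> None:
        self._check(data_type, value)
        self._outputs.setdefault(model_type, {})[data_type] = value

    def get_output(self, model_type: Optional[str], data_type: ToyDataType) -> Any:
        try:
            return self._outputs[model_type][data_type]
        except KeyError:
            raise KeyError(
                f"no {data_type.name} output stored for model {model_type!r}"
            ) from None
```

Keying outputs by model type first lets several models write the same data type, for example an embedding, to one dataset without overwriting each other.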
- abstract load_data() → None
Load the dataset into memory.
This method should be implemented by subclasses to load their specific data format. For example, SingleCellDataset loads an AnnData object from an h5ad file.
The loaded data should be stored as instance attributes that can be accessed by other methods.
- abstract unload_data() → None
Unload the dataset from memory.
This method should be implemented by subclasses to free memory by clearing loaded data. For example, SingleCellDataset sets its AnnData object to None.
This is used to clear memory-intensive data before serialization, since serializing large raw data artifacts can be error-prone and inefficient.
Any instance attributes containing loaded data should be cleared or set to None.
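The load/unload contract described above can be illustrated without the library. The sketch below uses stand-in classes (`ToyBaseDataset` and `ToyTextDataset` are hypothetical names) and the standard-library pickle module in place of dill, to show how much smaller the instance state becomes once the loaded data is cleared.

```python
import pickle
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List, Optional


class ToyBaseDataset(ABC):
    """Stand-in for BaseDataset: subclasses must implement both hooks."""

    def __init__(self, path: str) -> None:
        self.path = path

    @abstractmethod
    def load_data(self) -> None:
        """Read the file at self.path into instance attributes."""

    @abstractmethod
    def unload_data(self) -> None:
        """Clear every attribute that holds loaded data."""


class ToyTextDataset(ToyBaseDataset):
    """Toy subclass that keeps the lines of a text file in memory."""

    def __init__(self, path: str) -> None:
        super().__init__(path)
        self.lines: Optional[List[str]] = None

    def load_data(self) -> None:
        self.lines = Path(self.path).read_text().splitlines()

    def unload_data(self) -> None:
        self.lines = None


def state_size(dataset: ToyBaseDataset) -> int:
    """Size in bytes of the pickled instance attributes."""
    return len(pickle.dumps(vars(dataset)))


with tempfile.TemporaryDirectory() as workdir:
    data_file = Path(workdir) / "data.txt"
    data_file.write_text("\n".join(f"row {i}" for i in range(1000)))

    dataset = ToyTextDataset(str(data_file))
    dataset.load_data()
    row_count = len(dataset.lines)
    loaded_size = state_size(dataset)

    dataset.unload_data()  # only the path remains in the instance state
    unloaded_size = state_size(dataset)
```

After unloading, only the path is left to serialize, so the dataset can be restored cheaply and the data reloaded from its original location with load_data().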
- serialize(path: str) → None
Serialize this dataset instance to disk using dill.
- Parameters:
path – Path where the serialized dataset should be saved
- static deserialize(path: str) → BaseDataset
Load a serialized dataset from disk.
- Parameters:
path – Path to the serialized dataset file
- Returns:
The deserialized dataset instance
- Return type:
BaseDataset
- class czbenchmarks.datasets.DataType(*args, **kwds)
Bases:
enum.Enum
Create a collection of name/value pairs.
Example enumeration:
>>> class Color(Enum):
...     RED = 1
...     BLUE = 2
...     GREEN = 3
Access them by:
attribute access:
>>> Color.RED
<Color.RED: 1>
value lookup:
>>> Color(1)
<Color.RED: 1>
name lookup:
>>> Color['RED']
<Color.RED: 1>
Enumerations can be iterated over, and know how many members they have:
>>> len(Color)
3
>>> list(Color)
[<Color.RED: 1>, <Color.BLUE: 2>, <Color.GREEN: 3>]
Methods can be added to enumerations, and members can have their own attributes – see the documentation for details.
- METADATA
- ANNDATA
- ORGANISM
- EMBEDDING
- CONDITION_KEY
- SPLIT_KEY
- PERTURBATION_PRED
- PERTURBATION_TRUTH
- property spec: DataTypeSpec
- property dtype: Type
- czbenchmarks.datasets.DataValue
- class czbenchmarks.datasets.Organism(name: str, prefix: str)
Bases:
enum.Enum
Create a collection of name/value pairs.
- HUMAN = ('homo_sapiens', 'ENSG')
- MOUSE = ('mus_musculus', 'ENSMUSG')
- TROPICAL_CLAWED_FROG = ('xenopus_tropicalis', 'ENSXETG')
- AFRICAN_CLAWED_FROG = ('xenopus_laevis', 'ENSXLAG')
- ZEBRAFISH = ('danio_rerio', 'ENSDARG')
- MOUSE_LEMUR = ('microcebus_murinus', 'ENSMICG')
- WILD_BOAR = ('sus_scrofa', 'ENSSSCG')
- CRAB_EATING_MACAQUE = ('macaca_fascicularis', 'ENSMFAG')
- RHESUS_MACAQUE = ('macaca_mulatta', 'ENSMMUG')
- PLATYPUS = ('ornithorhynchus_anatinus', 'ENSOANG')
- OPOSSUM = ('monodelphis_domestica', 'ENSMODG')
- GORILLA = ('gorilla_gorilla', 'ENSGGOG')
- CHIMPANZEE = ('pan_troglodytes', 'ENSPTRG')
- MARMOSET = ('callithrix_jacchus', 'ENSCJAG')
- CHICKEN = ('gallus_gallus', 'ENSGALG')
- RABBIT = ('oryctolagus_cuniculus', 'ENSOCUG')
- FRUIT_FLY = ('drosophila_melanogaster', 'FBgn')
- RAT = ('rattus_norvegicus', 'ENSRNOG')
- NAKED_MOLE_RAT = ('heterocephalus_glaber', 'ENSHGLG')
- CAENORHABDITIS_ELEGANS = ('caenorhabditis_elegans', 'WBGene')
- YEAST = ('saccharomyces_cerevisiae', '')
- MALARIA_PARASITE = ('plasmodium_falciparum', 'PF3D7')
- SEA_LAMPREY = ('petromyzon_marinus', 'ENSPMAG')
- FRESHWATER_SPONGE = ('spongilla_lacustris', 'ENSLPGG')
- CORAL = ('stylophora_pistillata', 'LOC')
- SEA_URCHIN = ('lytechinus_variegatus', '')
- property name
The name of the Enum member.
- property prefix
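Each Organism member pairs a species name with a gene ID prefix. One likely use of the prefix is checking that gene identifiers belong to the expected organism, although the reference above does not state this. The sketch below assumes that purpose. `OrganismSketch` copies three members from the listing above and is not the real class, and `gene_ids_match` is a hypothetical helper.

```python
from enum import Enum
from typing import Iterable


class OrganismSketch(Enum):
    """Three members copied from the listing above; not the real class."""

    HUMAN = ("homo_sapiens", "ENSG")
    MOUSE = ("mus_musculus", "ENSMUSG")
    YEAST = ("saccharomyces_cerevisiae", "")


def gene_ids_match(organism: OrganismSketch, gene_ids: Iterable[str]) -> bool:
    """Return True when every gene ID starts with the organism's prefix."""
    _species, prefix = organism.value
    return all(gene_id.startswith(prefix) for gene_id in gene_ids)


human_ok = gene_ids_match(OrganismSketch.HUMAN, ["ENSG00000141510"])
mixed_ok = gene_ids_match(
    OrganismSketch.HUMAN, ["ENSG00000141510", "ENSMUSG00000059552"]
)
```

Members with an empty prefix, such as YEAST and SEA_URCHIN, pass this check for any identifier, because every string starts with the empty string. A prefix check cannot validate gene IDs for those organisms.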