Debugging Guideο
This guide provides solutions to common issues you might encounter when using cz-benchmarks. Problems are organized by the stage at which they typically occur, from loading data to running tasks.
Dataset Loading and Validation Errorsο
These errors usually happen when calling load_dataset(), load_custom_dataset(), dataset.load_data(), or dataset.validate().
βοΈ Dataset Not Found in Configurationο
Error:
ValueError: Dataset {dataset_name} not found in config.Cause: The dataset name you passed to
load_dataset()does not have a corresponding entry in the Hydra configuration.Solution:
Check for typos in the dataset name. You can see available datasets with
list_available_datasets().Ensure your custom YAML config is correctly structured and is being loaded.
π File Not Found or Path Issuesο
Error:
FileNotFoundError: Local dataset file not found at path.Cause: The
pathspecified to the custom YAML configuration file (e.g.,datasets.yaml) is incorrect, or the file is not accessible.Solution:
Verify that the path in the YAML config points to the correct
.h5adfile.Ensure the file exists and that the necessary read permissions exist.
If using a custom config with
load_custom_dataset(custom_dataset_config_path=...), make sure the path to the YAML file itself is correct.
π¬ AnnData Validation Errorsο
These errors originate from the content of your .h5ad file not meeting the requirements of the Dataset class.
Error:
ValueError: Dataset does not contain valid gene names...Cause: The
SingleCellDataset._validate()method failed because gene names inadata.var_namesdo not start with the required prefix for the specifiedorganism(e.g.,"ENSG"forHUMAN).Solution: Ensure your
AnnDataobjectβs gene identifiers are correct. The framework can automatically use theensembl_idcolumn if it exists inadata.var. Check thatadata.var['ensembl_id']contains the correct, prefixed gene IDs.
Error:
ValueError: Dataset does not contain '{key}' column in obs.Cause: A required metadata column is missing from
adata.obs. This often happens with:SingleCellLabeledDataset: Thelabel_column_key(e.g.,"cell_type") is missing.SingleCellPerturbationDataset: Thecondition_keyis missing.
Solution: Add the required column with the correct data to your
AnnDataobjectβs.obsDataFrame and save the.h5adfile again.
Error:
ValueError: Unexpected condition label: ``{condition}`` not present in control mapping.Cause: One or more values in the
conditioncolumn in aSingleCellPerturbationDatasetare not present in thecontrol_cells_map.Solution:
The values in the
conditioncolumn in aSingleCellPerturbationDatasetmust all be present in thecontrol_cells_map. Ensure the values are correct inadata.obs[``{condition_key}``]and in thecontrol_cells_mapand re-save the dataset.
Task Execution Errorsο
These errors occur when calling task.run() and are often related to mismatches between the model output (cell_representation) and the dataset inputs.
βοΈ Input Shape and Type Mismatchesο
Error:
ValueError: This task requires a list of cell representationsorValueError: This task requires a single cell representation.Cause: You are passing the wrong input structure to a task.
Solution:
For tasks with
requires_multiple_datasets = True(likeCrossSpeciesIntegrationTask),cell_representationmust be a list of embeddings ([emb1, emb2, ...]).For all other tasks,
cell_representationmust be a singlenumpy.ndarrayorpd.DataFrame.
Error: An error related to mismatched array/DataFrame dimensions during filtering, concatenation, or model fitting (e.g., inside
filter_minimum_classorsklearn).Cause: The number of cells (rows) in your
cell_representationdoes not match the number of cells in the loadeddatasetβs metadata (e.g.,dataset.labelsordataset.adata.obs).Solution: Ensure your modelβs output embedding has the same number of observations and is in the same order as the cells in the original dataset file loaded by
cz-benchmarks.
π― Task-Specific Errorsο
Task:
MetadataLabelPredictionTaskIssue: Very few classes are used for training, or an error occurs in
filter_minimum_class.Cause: Many of your label classes have fewer samples than
min_class_size(default is 10).Solution: Check the distribution of your labels. If necessary, either use a dataset with more balanced classes or adjust the
min_class_sizeparameter in theMetadataLabelPredictionTaskInput.
Task:
CrossSpeciesIntegrationTaskError:
AssertionError: At least two organisms are required...Cause: The task is being run on datasets that are all from the same species.
Solution: This task is specifically for evaluating cross-species alignment and requires inputs from at least two different organisms.
Environment and Dependency Issuesο
π¦ Missing Packagesο
Error:
ImportError: No module named '...'(e.g.,hnswlib,scanpy).Cause: A required dependency is not installed in your Python environment.
Solution: Install the missing package. For all development dependencies, run
pip install -e ".[dev]"from the repository root.
π§ Memory Errorsο
Error:
MemoryErroror your process is killed by the OS.Cause: Loading or processing data (e.g., a large
.h5adfile or a dense embedding matrix) requires more RAM than is available.Solution:
Run your code on a machine with more RAM.
If possible, for initial debugging, create a smaller version of your dataset by subsampling cells.
Check for parts of your code that might be making unnecessary copies of large objects.
General Tips for Debuggingο
Start Small: When debugging, use a small, fast-running dataset (like
tsv2_bladder) or a subset of your custom data to quickly iterate.Isolate the Problem: Determine if the issue is in data loading or task execution. First, ensure your dataset loads and validates successfully:
dataset = load_custom_dataset("my_dataset", custom_dataset_config_path="my_config.yaml") dataset.load_data() dataset.validate() print(dataset.adata)
Check Shapes and Types: Before running a task, print the shapes and types of your inputs to catch mismatches early.
print(f"Embedding shape: {my_embedding.shape}") print(f"Labels length: {len(dataset.labels)}") print(f"Embedding type: {type(my_embedding)}")
Use a Debugger: Use an interactive debugger like
pdbor your IDEβs built-in debugger to step through the code, inspect variables, and understand the execution flow.import pdb; pdb.set_trace()
Dependency Conflicts: Ensure all dependencies are installed in a clean virtual environment. Recreate the environment if needed.
hnswlib package installation error: If the
hnswlibpackage fails to install with an error likefatal error: Python.h: No such file or directory, ensure you have installed Python development headers files and static libraries. On Ubuntu, this can be done viasudo apt-get install python3-dev.