Using `cz-benchmarks`

You may duplicate this notebook and replace the simulated model execution cell with your own model code.

This notebook guides you through loading single-cell datasets, running your model, and evaluating results using standardized tasks and metrics.

To adapt it, swap in your model’s output; no other setup is required. Use the provided examples as templates for your workflow.

[ ]:

# Install czbenchmarks into the kernel selected for this notebook
!pip install czbenchmarks

1. Datasets

Datasets are wrapped for consistent loading and compatibility:

SingleCellLabeledDataset: Gene expression data with cell labels (supports clustering, embedding, label prediction).
SingleCellPerturbationDataset: Perturbation datasets with control and perturbed cells.

[1]:

import numpy as np
from czbenchmarks.datasets import load_dataset
from czbenchmarks.datasets.single_cell_labeled import SingleCellLabeledDataset

List Available Datasets

This code snippet lists all available datasets in the czbenchmarks library.

[2]:

from czbenchmarks.datasets.utils import list_available_datasets
import pandas as pd

# List all available datasets in czbenchmarks
available_datasets = list_available_datasets()

# Display available datasets as a table
df_datasets = pd.DataFrame({"Dataset": available_datasets})
df_datasets

[2]:

	Dataset
chicken_spermatogenesis	{'organism': 'gallus_gallus', 'url': 's3://cz-...
chimpanzee_spermatogenesis	{'organism': 'pan_troglodytes', 'url': 's3://c...
gorilla_spermatogenesis	{'organism': 'gorilla_gorilla', 'url': 's3://c...
human_spermatogenesis	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
marmoset_spermatogenesis	{'organism': 'callithrix_jacchus', 'url': 's3:...
mouse_spermatogenesis	{'organism': 'mus_musculus', 'url': 's3://cz-b...
opossum_spermatogenesis	{'organism': 'monodelphis_domestica', 'url': '...
platypus_spermatogenesis	{'organism': 'ornithorhynchus_anatinus', 'url'...
replogle_k562_essential_perturbpredict	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
rhesus_macaque_spermatogenesis	{'organism': 'macaca_mulatta', 'url': 's3://cz...
tsv2_bladder	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_blood	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_bone_marrow	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_ear	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_eye	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_fat	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_heart	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_large_intestine	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_liver	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_lung	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_lymph_node	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_mammary	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_muscle	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_ovary	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_prostate	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_salivary_gland	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_skin	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_small_intestine	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_spleen	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_stomach	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_testis	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_thymus	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_tongue	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_trachea	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_uterus	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
tsv2_vasculature	{'organism': 'homo_sapiens', 'url': 's3://cz-b...
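
The table above suggests that each entry is a dictionary of metadata keyed by dataset name, so a single "Dataset" column hides the individual fields. The sketch below shows one way to expand such a mapping into one column per field and filter on it. The `available_datasets` dictionary here is a small hand-written stand-in (an assumption, reproducing only the `organism` field visible in the output), not the real return value of `list_available_datasets()`.

```python
import pandas as pd

# Toy stand-in for the mapping shown in the table above.
# Only the "organism" field is reproduced; real entries carry more keys.
available_datasets = {
    "tsv2_prostate": {"organism": "homo_sapiens"},
    "tsv2_blood": {"organism": "homo_sapiens"},
    "mouse_spermatogenesis": {"organism": "mus_musculus"},
}

# One row per dataset, one column per metadata key
df_datasets = pd.DataFrame.from_dict(available_datasets, orient="index")
df_datasets.index.name = "dataset"

# Keep only the human datasets
human_datasets = df_datasets[df_datasets["organism"] == "homo_sapiens"]
print(sorted(human_datasets.index))
```

If the real mapping has the same shape, passing it to `pd.DataFrame.from_dict(..., orient="index")` in place of the toy dictionary should give a filterable table.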

Load a Dataset

Load the pre-configured tsv2_prostate dataset, which you can find in the list above. The library will automatically download, cache, and load this dataset as a SingleCellLabeledDataset object. This makes it easy to reuse the data for your analysis without extra setup.

The loaded dataset provides:

dataset.adata: AnnData object with gene expression data.
dataset.labels: pandas Series of cell type labels.

[8]:

# The 'dataset' object is a validated AnnData wrapper, ensuring efficient downstream processing.
dataset: SingleCellLabeledDataset = load_dataset("tsv2_prostate")
dataset.adata

INFO:czbenchmarks.file_utils:File already exists in cache: /Users/sgupta/.cz-benchmarks/datasets/homo_sapiens_10df7690-6d10-4029-a47e-0f071bb2df83_Prostate_v2_curated.h5ad
INFO:czbenchmarks.datasets.single_cell:Loading dataset from /Users/sgupta/.cz-benchmarks/datasets/homo_sapiens_10df7690-6d10-4029-a47e-0f071bb2df83_Prostate_v2_curated.h5ad in memory mode.

[8]:

AnnData object with n_obs × n_vars = 2044 × 21808
    obs: 'donor_id', 'tissue_in_publication', 'anatomical_position', 'method', 'cdna_plate', 'library_plate', 'notes', 'cdna_well', 'assay_ontology_term_id', 'sample_id', 'replicate', '10X_run', 'ambient_removal', 'donor_method', 'donor_assay', 'donor_tissue', 'donor_tissue_assay', 'cell_type_ontology_term_id', 'compartment', 'broad_cell_class', 'free_annotation', 'manually_annotated', 'published_2022', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt', 'total_counts_ercc', 'pct_counts_ercc', '_scvi_batch', '_scvi_labels', 'scvi_leiden_donorassay_full', 'ethnicity_original', 'sample_number', 'organism_ontology_term_id', 'suspension_type', 'tissue_type', 'tissue_ontology_term_id', 'disease_ontology_term_id', 'is_primary_data', 'sex_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'development_stage_ontology_term_id', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid', 'dataset_id'
    var: 'ensembl_id', 'genome', 'mt', 'ercc', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'mean', 'std', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type', 'feature_id'
    uns: '_scvi_manager_uuid', '_scvi_uuid', '_training_mode', 'assay_ontology_term_id_colors', 'citation', 'compartment_colors', 'donor_id_colors', 'leiden', 'method_colors', 'neighbors', 'pca', 'schema_reference', 'schema_version', 'sex_ontology_term_id_colors', 'tissue_in_publication_colors', 'title', 'umap'
    obsm: 'X_pca', 'X_scvi', 'X_umap', 'X_umap_scvi_full_donorassay', 'X_uncorrected_alltissues_umap', 'X_uncorrected_umap'
    varm: 'PCs'
    layers: 'X_original', 'decontXcounts', 'scale_data'
    obsp: 'connectivities', 'distances'

2. Model

Tasks expect a CellRepresentation, which is a numpy.ndarray with cells as rows and embedding features as columns.

For this example, random numbers simulate what a real model would produce. In your own work, replace them with the actual output of your model, such as the embeddings generated by your neural network or other method.

Tip: You can copy this notebook and swap out the code below for your own model’s import, inference, or training steps. Just make sure the final output is a NumPy array in the correct shape.

[6]:

# Simulated 10-dimensional embedding for each cell
# Replace this with your model's actual code to generate output embeddings for tasks like clustering, embedding, or label prediction.
from czbenchmarks.tasks.types import CellRepresentation

model_output: CellRepresentation = np.random.rand(dataset.adata.shape[0], 10)
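
As an illustration of swapping in something other than random numbers, the sketch below builds an embedding with a plain NumPy PCA and checks that it has the expected layout (cells as rows, embedding features as columns). The function name `pca_embedding` and the toy count matrix are assumptions made for this example; this is a stand-in model, not the PCA baseline that the library computes.

```python
import numpy as np


def pca_embedding(expression: np.ndarray, n_components: int = 10) -> np.ndarray:
    """Project a dense cells x genes matrix onto its top principal components."""
    # Center each gene so the SVD captures variance rather than the mean
    centered = expression - expression.mean(axis=0, keepdims=True)
    # u * s gives the per-cell coordinates along the principal axes
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Toy data: 50 cells x 200 genes of Poisson counts (hypothetical, for shape checks only)
rng = np.random.default_rng(0)
toy_expression = rng.poisson(lam=2.0, size=(50, 200)).astype(float)
toy_embedding = pca_embedding(toy_expression, n_components=10)

# The checks any model output should pass before being handed to a task
assert toy_embedding.ndim == 2
assert toy_embedding.shape[0] == toy_expression.shape[0]
assert np.isfinite(toy_embedding).all()
print(toy_embedding.shape)
```

The same three checks (two dimensions, one row per cell, finite values) can be applied to your own model's output, using `dataset.adata.shape[0]` as the expected row count.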

3. Task

Each task defines an evaluation workflow with run() and compute_baseline() methods.

Task Name	Class	Purpose
Clustering	`ClusteringTask`	Evaluate cell group separation
Embedding Quality	`EmbeddingTask`	Assess embedding structure
Label Prediction	`MetadataLabelPredictionTask`	Predict labels from embeddings
Batch Integration	`BatchIntegrationTask`	Evaluate batch integration
Cross-Species	`CrossSpeciesIntegrationTask`	Integrate data across species

Task Metrics

Metrics are managed by MetricRegistry and returned as MetricResult objects.

MetricType: Enum of metric names (e.g., ADJUSTED_RAND_INDEX, SILHOUETTE_SCORE)
MetricResult: Stores metric type, value, and parameters

All tasks compute and return metrics automatically.

Example: Run a Clustering Task

Evaluate the embedding by measuring clustering performance using Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). The task compares Leiden clusters from the embedding to true labels. Higher scores indicate better clustering.

Compare clustering_results to clustering_baseline_results to assess model performance against the PCA baseline.
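
To make the ARI easier to interpret, here is a minimal NumPy sketch of the standard pair-counting formula. It is written for illustration only and is not the implementation used by the task; the function names and toy labels are assumptions, and the sketch assumes at least two cells. It shows the two reference points: 1.0 when the clusters match the labels up to renaming, and 0.0 when the clustering carries no information about the labels.

```python
import numpy as np


def _pairs(counts: np.ndarray) -> float:
    """Sum over all entries of 'count choose 2'."""
    counts = counts.astype(float)
    return float((counts * (counts - 1) / 2).sum())


def adjusted_rand_index(labels_true, labels_pred) -> float:
    """Adjusted Rand Index from the contingency table of two labelings."""
    _, true_ids = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, pred_ids = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    true_ids, pred_ids = true_ids.ravel(), pred_ids.ravel()

    # contingency[i, j] = number of cells with true label i and predicted cluster j
    contingency = np.zeros((true_ids.max() + 1, pred_ids.max() + 1), dtype=np.int64)
    np.add.at(contingency, (true_ids, pred_ids), 1)

    pairs_both = _pairs(contingency)              # pairs grouped together in both
    pairs_true = _pairs(contingency.sum(axis=1))  # pairs sharing a true label
    pairs_pred = _pairs(contingency.sum(axis=0))  # pairs sharing a cluster
    total_pairs = _pairs(np.array([contingency.sum()]))

    expected = pairs_true * pairs_pred / total_pairs  # agreement expected by chance
    max_index = 0.5 * (pairs_true + pairs_pred)
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (pairs_both - expected) / (max_index - expected)


true_labels = ["a", "a", "b", "b", "c", "c"]
same_up_to_renaming = [2, 2, 0, 0, 1, 1]
one_big_cluster = [0, 0, 0, 0, 0, 0]

ari_match = adjusted_rand_index(true_labels, same_up_to_renaming)
ari_uninformative = adjusted_rand_index(true_labels, one_big_cluster)
print(ari_match, ari_uninformative)
```

Because the score is corrected for chance, a random embedding is expected to land near zero and can come out slightly negative.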

[7]:

from czbenchmarks.tasks import ClusteringTask
from czbenchmarks.tasks.clustering import ClusteringTaskInput

# 1. Initialize the task
clustering_task = ClusteringTask()

# 2. Define the inputs for the task
clustering_task_input = ClusteringTaskInput(
    obs=dataset.adata.obs,  # The full observation metadata
    input_labels=dataset.labels,  # The ground-truth labels for comparison
)

# 3. Run the task on your model's output
clustering_results = clustering_task.run(
    cell_representation=model_output,
    task_input=clustering_task_input,
)

# 4. Compute and run the baseline for comparison
expression_data = dataset.adata.X
clustering_baseline = clustering_task.compute_baseline(expression_data)
clustering_baseline_results = clustering_task.run(
    cell_representation=clustering_baseline,
    task_input=clustering_task_input,
)

print("--- Clustering Model Results ---")
for result in clustering_results:
    print(result.model_dump_json(indent=2))

print("\n--- Clustering Baseline Results ---")
for result in clustering_baseline_results:
    print(result.model_dump_json(indent=2))

--- Clustering Model Results ---
{
  "metric_type": "adjusted_rand_index",
  "value": -0.00019227160583039173,
  "params": {}
}
{
  "metric_type": "normalized_mutual_info",
  "value": 0.022823018925207977,
  "params": {}
}

--- Clustering Baseline Results ---
{
  "metric_type": "adjusted_rand_index",
  "value": 0.6421494136697635,
  "params": {}
}
{
  "metric_type": "normalized_mutual_info",
  "value": 0.8331383925676068,
  "params": {}
}
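
Reading two JSON dumps side by side is error-prone, so one option is to collect the metrics into a single table. The sketch below does this with pandas, using plain dictionaries whose values are copied from the printed output above and rounded to four decimals. The helper `to_series` and the column names are assumptions for this example; with the real result objects, a dictionary per result would presumably come from the Pydantic `model_dump()` counterpart of the `model_dump_json()` call used above.

```python
import pandas as pd

# Values copied from the printed output above, rounded to four decimals
model_results = [
    {"metric_type": "adjusted_rand_index", "value": -0.0002},
    {"metric_type": "normalized_mutual_info", "value": 0.0228},
]
baseline_results = [
    {"metric_type": "adjusted_rand_index", "value": 0.6421},
    {"metric_type": "normalized_mutual_info", "value": 0.8331},
]


def to_series(results, name):
    """Turn a list of metric dictionaries into a Series indexed by metric name."""
    return pd.Series({r["metric_type"]: r["value"] for r in results}, name=name)


comparison = pd.concat(
    [to_series(model_results, "model"), to_series(baseline_results, "pca_baseline")],
    axis=1,
)
# Negative values mean the model scores below the baseline
comparison["difference"] = comparison["model"] - comparison["pca_baseline"]
print(comparison)
```

Here both differences are negative, which is expected: the simulated random embedding carries no information about cell type, while the PCA baseline is derived from the expression data.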
