Benchmarks
The VCP CLI provides commands for listing, reproducing, and running benchmarks on the Virtual Cells Platform.
Overview
Benchmarking in VCP allows comparison of different models across various tasks and datasets. The benchmarking system consists of three main components:
Models: Pre-trained machine learning models (e.g., scVI, TRANSCRIPTFORMER)
Datasets: Single-cell datasets for evaluation (e.g., Tabula Sapiens datasets)
Tasks: Specific evaluation tasks (e.g., clustering, embedding, label prediction)
The Datasets and Task implementations are provided by the cz-benchmarks package.
Commands
vcp benchmarks list
Lists the benchmarks that have been computed by and published on the Virtual Cells Platform.
This output provides the combinations of datasets, models, and tasks for which benchmarks were computed.
See vcp benchmarks get below for how to retrieve the benchmark metric results for specific benchmarks.
Basic Usage
vcp benchmarks list
See Output Fields for a description of the output fields.
Options
Benchmarks List Command Options
vcp benchmarks list
List available model, dataset and task benchmark combinations.
Shows benchmarks from the VCP as well as locally cached results from previous user benchmark runs. You can filter results by dataset, model, or task using glob patterns.
vcp benchmarks list [OPTIONS]
Options
- -b, --benchmark-key <benchmark_key>
Retrieve by benchmark key. Mutually exclusive with filter options.
- -m, --model-filter <model_filter>
Filter by model key (substring match with '*' wildcards, e.g. 'scvi*v1').
- -d, --dataset-filter <dataset_filter>
Filter by dataset key (substring match with '*' wildcards, e.g. 'tsv2*liver').
- -t, --task-filter <task_filter>
Filter by task key (substring match with '*' wildcards, e.g. 'label*pred').
- -f, --format <format>
Output format
- Options:
table | json
- --fit, --full
Column display for table format (default: fit). Use --full to show full column content; pair with a pager like 'less -S' for horizontal scrolling. Only applies to --format=table.
- --user-runs, --no-user-runs
Include or exclude locally cached results from previous user benchmark runs (default: include).
Notes:
A benchmark key is a unique identifier that combines a specific model, dataset, and task. For example, f47892309c571cdf represents a specific combination of the TRANSCRIPTFORMER model, the tsv2_blood dataset, and the embedding task. Benchmark keys are returned in results when using the filter options and can be used to identify a specific benchmark with the vcp benchmarks get and vcp benchmarks list commands.
The filter options allow use of * as a wildcard. Filters use substring matching and are case-insensitive. Filter values match against both the name and the key of a given entity type (model, dataset, task).
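As a sketch of the matching rule described above (case-insensitive substring matching where '*' matches any run of characters), the following Python approximates it. Note that filter_matches is a hypothetical helper for illustration, not the CLI's actual implementation.

```python
import re

def filter_matches(filter_value: str, candidate: str) -> bool:
    """Case-insensitive substring match where '*' matches any run of
    characters, approximating the documented filter semantics."""
    # Escape regex metacharacters, then turn each '*' into '.*'.
    pattern = ".*".join(re.escape(part) for part in filter_value.split("*"))
    # Substring semantics: the pattern may occur anywhere in the candidate.
    return re.search(pattern, candidate, flags=re.IGNORECASE) is not None

# A 'scvi*v1' filter matches model keys containing 'scvi' then 'v1'.
print(filter_matches("scvi*v1", "SCVI-v1-homo_sapiens"))  # True
print(filter_matches("tsv2*liver", "tsv2_blood"))         # False
```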
Examples
List all available benchmarks:
vcp benchmarks list
Filter by dataset, model, and task with table output:
vcp benchmarks list --dataset-filter tsv2_blood --model-filter TRANSCRIPT --task-filter embedding
Find specific benchmark by key:
vcp benchmarks list --benchmark-key f47892309c571cdf
Search for scVI models on any dataset with JSON output:
vcp benchmarks list --model-filter "scvi*" --format json
Display full column content (useful with pagers):
vcp benchmarks list --full | less -S
vcp benchmarks system-check
Displays system hardware information and validates whether your system meets the baseline requirements for running VCP benchmarks.
Basic Usage
vcp benchmarks system-check
Description
This command checks your system’s hardware specifications and compares them against baseline requirements for running benchmarks. It displays:
System RAM: Total RAM available and whether it meets the minimum requirement (≥32 GB)
GPU Information (Linux only): NVIDIA GPU model, memory, and availability
CUDA Version (Linux only): Installed CUDA version
NVIDIA Driver (Linux only): Driver version
Docker Support (Linux only): Whether Docker with GPU support is available
The output shows each component with:
Component: Hardware component being checked
Expected: Minimum requirement
Actual: Your system’s specification
Status: ✅ Pass or ❌ Fail
Verbose Mode
When using the --verbose or -v flag, the command displays additional NVIDIA diagnostic information after the standard system check table:
nvidia-smi: GPU utilization and memory usage
nvidia-container-cli info: NVIDIA container runtime and GPU device information
ldconfig -p | grep libnvidia-ml: available NVIDIA libraries
This information is useful for troubleshooting issues when VCP models fail to run.
Example:
vcp benchmarks system-check --verbose
Notes
GPU detection and validation only works on Linux systems
macOS systems will only show RAM information
Model inference requires Linux with NVIDIA GPU support. Check each model’s docs for specific hardware requirements. See Virtual Cell Platform Models for details on supported models.
Docker GPU support is recommended for containerized model execution
Example Output
The command displays a table showing system specifications:
System Hardware Information
┌─────────────────┬──────────────┬─────────────────────┬──────────┐
│ Component │ Expected │ Actual │ Status │
├─────────────────┼──────────────┼─────────────────────┼──────────┤
│ System RAM │ ≥ 32 GB │ 64.0 GB │ ✅ Pass │
│ GPU Model │ NVIDIA GPU x1│ NVIDIA RTX A6000 │ ✅ Pass │
│ GPU Memory │ ≥ 16 GB │ 48.0 GB │ ✅ Pass │
│ CUDA Version │ ≥ 11.0 │ 12.2 │ ✅ Pass │
│ NVIDIA Driver │ Present │ 535.104.05 │ ✅ Pass │
│ Docker GPU │ Available │ Available │ ✅ Pass │
└─────────────────┴──────────────┴─────────────────────┴──────────┘
vcp benchmarks run
Computes a benchmark task and generates performance metrics using a specific model and dataset.
Basic Usage
To reproduce a benchmark published on the Virtual Cells Platform:
vcp benchmarks run <TASK> --model-key MODEL_KEY --dataset-key DATASET_KEY
Where <TASK> is one of: embedding, clustering, label_prediction, batch_integration, cross-species_integration, cross-species_label_prediction, perturbation_expression_prediction, or sequential_organization.
Each task may accept task-specific parameters. Some of these parameters require a list, DataFrame, or matrix value (e.g. --labels requires a list of string values). These values can be supplied indirectly using the "AnnData Reference" syntax (below), which specifies the data to be extracted from a dataset's AnnData object.
AnnData Reference Syntax
The AnnData Reference syntax allows you to reference specific attributes and data within an AnnData object using a string notation starting with @. This is particularly useful for specifying labels, metadata, or representations stored in the dataset.
Basic Syntax:
@<attribute>[:<key>]
Supported Attributes:
| Reference | Description | Example |
|---|---|---|
| @obs:<column> | Access a column from the obs DataFrame | @obs:cell_type |
| @var:<column> | Access a column from the var DataFrame | @var:gene_symbol |
| @obsm:<key> | Access a matrix from the obsm mapping | @obsm:X_pca |
| @varm:<key> | Access a matrix from the varm mapping | @varm:PCs |
| @layers:<key> | Access a layer from the layers mapping | @layers:counts |
| @X | Access the main expression matrix | @X |
| @obs | Access the entire obs DataFrame | @obs |
| @var | Access the entire var DataFrame | @var |
| @obs_names | Access the observation index | @obs_names |
| @var_names | Access the variable index | @var_names |
Multi-Dataset References (for cross-species tasks):
When working with multiple datasets, you can specify which dataset to reference using an index prefix:
@<dataset_index>:<attribute>[:<key>]
Examples:
@0:obs:cell_type - Cell type labels from the first dataset
@1:obs:cell_type - Cell type labels from the second dataset
@0:obsm:X_pca - PCA embeddings from the first dataset
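For illustration, the reference grammar above can be parsed in a few lines of Python. This is a hypothetical sketch of the syntax, not code from the VCP CLI; AnnDataRef and parse_anndata_ref are invented names.

```python
from typing import NamedTuple, Optional

class AnnDataRef(NamedTuple):
    dataset_index: int       # 0 unless a numeric prefix is given
    attribute: str           # e.g. 'obs', 'obsm', 'X'
    key: Optional[str]       # e.g. 'cell_type', or None for '@X'

def parse_anndata_ref(ref: str) -> AnnDataRef:
    """Parse '@[<dataset_index>:]<attribute>[:<key>]' into its parts."""
    if not ref.startswith("@"):
        raise ValueError(f"not an AnnData reference: {ref!r}")
    parts = ref[1:].split(":")
    index = 0
    if parts and parts[0].isdigit():     # optional dataset index prefix
        index = int(parts.pop(0))
    attribute = parts.pop(0)
    key = parts.pop(0) if parts else None
    return AnnDataRef(index, attribute, key)

print(parse_anndata_ref("@obs:cell_type"))
print(parse_anndata_ref("@1:obsm:X_pca"))
print(parse_anndata_ref("@X"))
```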
Common Examples:
# Reference cell type labels from obs
--input-labels @obs:cell_type
# Reference batch information
--batch-labels @obs:batch
# Use PCA representation from obsm
--use-rep @obsm:X_pca
# For cross-species tasks, specify labels from each dataset
--labels @0:obs:cell_type --labels @1:obs:cell_type
Options
The available options for the vcp benchmarks run command, including all task-specific options, are as follows:
Benchmarks Run Command Options
vcp benchmarks run
Run a benchmark task on a cell representation, which can be provided in one of the following ways: 1) generate a cell representation by performing model inference on a specified dataset with a specified model; 2) supply a previously computed cell representation (skips model inference); or 3) have the task compute a baseline cell representation from a specified dataset.
Use vcp benchmarks run <task> --help to see all available options for that task.
vcp benchmarks run [OPTIONS] COMMAND [ARGS]...
Options
- -b, --benchmark-key <benchmark_key>
Run a benchmark using the model, dataset, and task of a VCP-published benchmark (run vcp benchmarks list for available benchmark keys).
batch_integration
Task for evaluating batch integration quality.
This task computes metrics to assess how well different batches are integrated in the embedding space while preserving biological signals.
Specify one of --model-key, --cell-representation, or --compute-baseline to generate or provide the benchmarked cell representation to the task.
Specify one of --dataset-key or --user-dataset to specify the associated dataset file(s) that contain ground truth data needed by the task for evaluation. These dataset options may be specified multiple times for multi-dataset tasks.
If --model-key is specified, dataset(s) will provide the input data to the model. If --compute-baseline is specified, dataset(s) will be used to compute a baseline cell representation. If --cell-representation is specified, a dataset is only used if task-specific option arguments reference ground truth data within the dataset.
vcp benchmarks run batch_integration [OPTIONS]
Options
- -m, --model-key <model_key>
Model key (e.g. SCVI-v1-homo_sapiens; run vcp benchmarks list for available model keys).
- -d, --dataset-key <dataset_key>
Dataset key from czbenchmarks datasets (e.g., tsv2_blood; run czbenchmarks list datasets for available dataset keys). Can be used multiple times.
- -u, --user-dataset <user_dataset>
Path to a user-provided .h5ad file. Provide as a JSON string with keys: 'dataset_class', 'organism', and 'path'. Example: '{"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset", "organism": "HUMAN", "path": "~/mydata.h5ad"}'. Can be used multiple times.
- -c, --cell-representation <cell_representation>
Path to precomputed cell embeddings (.npy file) or AnnData reference (e.g., '@X', '@obsm:X_pca'). Can be used multiple times.
- -B, --compute-baseline
Compute baseline for comparison. Cannot be used with --model-key or --cell-representation.
- -r, --random-seed <random_seed>
Set a random seed for reproducibility.
- -n, --no-cache
Disable caching. Forces all steps to run from scratch.
- -f, --format <format>
Output format (default: table).
- Options:
table | json
- --fit, --full
Column display for table format (default: fit). Use --full to show full column content; pair with a pager like 'less -S' for horizontal scrolling. Only applies to --format=table.
- --use-gpu, --no-use-gpu
Enable GPU support for model inference (default: enabled).
- --batch-labels <batch_labels>
Batch labels for each cell (e.g. obs.batch from an AnnData object). Supports AnnData reference syntax (e.g. '@obs:batch').
- --labels <labels>
Ground truth labels for metric calculation (e.g. obs.cell_type from an AnnData object). Supports AnnData reference syntax (e.g. '@obs:cell_type').
- --baseline-n-top-genes <baseline_n_top_genes>
Number of highly variable genes for PCA baseline. [Default: 3000] [Required: False]
- --baseline-n-pcs <baseline_n_pcs>
Number of principal components for PCA baseline. [Default: 50] [Required: False]
- --baseline-obsm-key <baseline_obsm_key>
AnnData .obsm key to store the baseline PCA embedding. [Default: emb] [Required: False]
clustering
Task for evaluating clustering performance against ground truth labels.
This task performs clustering on embeddings and evaluates the results using multiple clustering metrics (ARI and NMI).
Specify one of --model-key, --cell-representation, or --compute-baseline to generate or provide the benchmarked cell representation to the task.
Specify one of --dataset-key or --user-dataset to specify the associated dataset file(s) that contain ground truth data needed by the task for evaluation. These dataset options may be specified multiple times for multi-dataset tasks.
If --model-key is specified, dataset(s) will provide the input data to the model. If --compute-baseline is specified, dataset(s) will be used to compute a baseline cell representation. If --cell-representation is specified, a dataset is only used if task-specific option arguments reference ground truth data within the dataset.
vcp benchmarks run clustering [OPTIONS]
Options
- -m, --model-key <model_key>
Model key (e.g. SCVI-v1-homo_sapiens; run vcp benchmarks list for available model keys).
- -d, --dataset-key <dataset_key>
Dataset key from czbenchmarks datasets (e.g., tsv2_blood; run czbenchmarks list datasets for available dataset keys). Can be used multiple times.
- -u, --user-dataset <user_dataset>
Path to a user-provided .h5ad file. Provide as a JSON string with keys: 'dataset_class', 'organism', and 'path'. Example: '{"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset", "organism": "HUMAN", "path": "~/mydata.h5ad"}'. Can be used multiple times.
- -c, --cell-representation <cell_representation>
Path to precomputed cell embeddings (.npy file) or AnnData reference (e.g., '@X', '@obsm:X_pca'). Can be used multiple times.
- -B, --compute-baseline
Compute baseline for comparison. Cannot be used with --model-key or --cell-representation.
- -r, --random-seed <random_seed>
Set a random seed for reproducibility.
- -n, --no-cache
Disable caching. Forces all steps to run from scratch.
- -f, --format <format>
Output format (default: table).
- Options:
table | json
- --fit, --full
Column display for table format (default: fit). Use --full to show full column content; pair with a pager like 'less -S' for horizontal scrolling. Only applies to --format=table.
- --use-gpu, --no-use-gpu
Enable GPU support for model inference (default: enabled).
- --obs <obs>
Cell metadata DataFrame (e.g. the obs from an AnnData object). [Default: None] [Required: True] Supports AnnData reference syntax (e.g. '@obs').
- --input-labels <input_labels>
Ground truth labels for metric calculation (e.g. obs.cell_type from an AnnData object). Supports AnnData reference syntax (e.g. '@obs:cell_type').
- --use-rep <use_rep>
Data representation to use for clustering (e.g. the X or obsm['X_pca'] from an AnnData object). [Default: X] [Required: False] Supports AnnData reference syntax (e.g. '@X').
- --n-iterations <n_iterations>
Number of iterations for the Leiden algorithm. [Default: 2] [Required: False]
- --flavor <flavor>
Algorithm for Leiden community detection. [Default: igraph] [Required: False] [Options: 'leidenalg', 'igraph']
- Options:
leidenalg | igraph
- --key-added <key_added>
Key in AnnData.obs where cluster assignments are stored. [Default: leiden] [Required: False]
- --baseline-n-top-genes <baseline_n_top_genes>
Number of highly variable genes for PCA baseline. [Default: 3000] [Required: False]
- --baseline-n-pcs <baseline_n_pcs>
Number of principal components for PCA baseline. [Default: 50] [Required: False]
- --baseline-obsm-key <baseline_obsm_key>
AnnData .obsm key to store the baseline PCA embedding. [Default: emb] [Required: False]
cross-species_integration
Task for evaluating cross-species integration quality.
This task computes metrics to assess how well different species’ data are integrated in the embedding space while preserving biological signals. It operates on multiple datasets from different species.
Specify one of --model-key, --cell-representation, or --compute-baseline to generate or provide the benchmarked cell representation to the task.
Specify one of --dataset-key or --user-dataset to specify the associated dataset file(s) that contain ground truth data needed by the task for evaluation. These dataset options may be specified multiple times for multi-dataset tasks.
If --model-key is specified, dataset(s) will provide the input data to the model. If --compute-baseline is specified, dataset(s) will be used to compute a baseline cell representation. If --cell-representation is specified, a dataset is only used if task-specific option arguments reference ground truth data within the dataset.
vcp benchmarks run cross-species_integration [OPTIONS]
Options
- -m, --model-key <model_key>
Model key (e.g. SCVI-v1-homo_sapiens; run vcp benchmarks list for available model keys).
- -d, --dataset-key <dataset_key>
Dataset key from czbenchmarks datasets (e.g., tsv2_blood; run czbenchmarks list datasets for available dataset keys). Can be used multiple times.
- -u, --user-dataset <user_dataset>
Path to a user-provided .h5ad file. Provide as a JSON string with keys: 'dataset_class', 'organism', and 'path'. Example: '{"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset", "organism": "HUMAN", "path": "~/mydata.h5ad"}'. Can be used multiple times.
- -c, --cell-representation <cell_representation>
Path to precomputed cell embeddings (.npy file) or AnnData reference (e.g., '@X', '@obsm:X_pca'). Can be used multiple times.
- -B, --compute-baseline
Compute baseline for comparison. Cannot be used with --model-key or --cell-representation.
- -r, --random-seed <random_seed>
Set a random seed for reproducibility.
- -n, --no-cache
Disable caching. Forces all steps to run from scratch.
- -f, --format <format>
Output format (default: table).
- Options:
table | json
- --fit, --full
Column display for table format (default: fit). Use --full to show full column content; pair with a pager like 'less -S' for horizontal scrolling. Only applies to --format=table.
- --use-gpu, --no-use-gpu
Enable GPU support for model inference (default: enabled).
- --labels <labels>
List of ground truth labels for each species dataset (e.g., cell types). Can be specified multiple times. Supports AnnData reference syntax (e.g. '@obs:cell_type').
- --organism-list <organism_list>
List of organisms corresponding to each dataset for cross-species evaluation. Can be specified multiple times.
cross-species_label_prediction
Task for cross-species label prediction evaluation.
This task evaluates cross-species transfer by training classifiers on one species and testing on another species. It computes accuracy, F1, precision, recall, and AUROC for multiple classifiers (Logistic Regression, KNN, Random Forest).
The task can optionally aggregate cell-level embeddings to sample/donor level before running classification.
Specify one of --model-key, --cell-representation, or --compute-baseline to generate or provide the benchmarked cell representation to the task.
Specify one of --dataset-key or --user-dataset to specify the associated dataset file(s) that contain ground truth data needed by the task for evaluation. These dataset options may be specified multiple times for multi-dataset tasks.
If --model-key is specified, dataset(s) will provide the input data to the model. If --compute-baseline is specified, dataset(s) will be used to compute a baseline cell representation. If --cell-representation is specified, a dataset is only used if task-specific option arguments reference ground truth data within the dataset.
vcp benchmarks run cross-species_label_prediction [OPTIONS]
Options
- -m, --model-key <model_key>
Model key (e.g. SCVI-v1-homo_sapiens; run vcp benchmarks list for available model keys).
- -d, --dataset-key <dataset_key>
Dataset key from czbenchmarks datasets (e.g., tsv2_blood; run czbenchmarks list datasets for available dataset keys). Can be used multiple times.
- -u, --user-dataset <user_dataset>
Path to a user-provided .h5ad file. Provide as a JSON string with keys: 'dataset_class', 'organism', and 'path'. Example: '{"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset", "organism": "HUMAN", "path": "~/mydata.h5ad"}'. Can be used multiple times.
- -c, --cell-representation <cell_representation>
Path to precomputed cell embeddings (.npy file) or AnnData reference (e.g., '@X', '@obsm:X_pca'). Can be used multiple times.
- -B, --compute-baseline
Compute baseline for comparison. Cannot be used with --model-key or --cell-representation.
- -r, --random-seed <random_seed>
Set a random seed for reproducibility.
- -n, --no-cache
Disable caching. Forces all steps to run from scratch.
- -f, --format <format>
Output format (default: table).
- Options:
table | json
- --fit, --full
Column display for table format (default: fit). Use --full to show full column content; pair with a pager like 'less -S' for horizontal scrolling. Only applies to --format=table.
- --use-gpu, --no-use-gpu
Enable GPU support for model inference (default: enabled).
- --labels <labels>
List of ground truth labels for each species dataset (e.g., cell types). Can be specified multiple times. Supports AnnData reference syntax (e.g. '@obs:cell_type').
- --organisms <organisms>
List of organisms corresponding to each dataset for cross-species evaluation. Can be specified multiple times.
- --sample-ids <sample_ids>
Optional list of sample/donor IDs for aggregation, one per dataset. Can be specified multiple times. Supports AnnData reference syntax.
- --aggregation-method <aggregation_method>
Method to aggregate cells with the same sample_id ('none', 'mean', or 'median'). [Default: mean] [Required: False] [Options: 'none', 'mean', 'median']
- Options:
none | mean | median
- --n-folds <n_folds>
Number of cross-validation folds for intra-species evaluation. [Default: 5] [Required: False]
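The --aggregation-method behavior described above (collapsing cells that share a sample_id via mean or median before classification) can be sketched in Python. This is an illustrative helper, not the cz-benchmarks implementation; aggregate_embeddings is an invented name.

```python
from collections import defaultdict
from statistics import mean, median

def aggregate_embeddings(embeddings, sample_ids, method="mean"):
    """Collapse per-cell embedding rows that share a sample_id into one
    row per sample, mirroring the documented aggregation options."""
    if method == "none":
        # Hypothetical passthrough: one row per cell, keyed by cell index.
        return dict(enumerate(embeddings))
    reducer = {"mean": mean, "median": median}[method]
    grouped = defaultdict(list)
    for row, sid in zip(embeddings, sample_ids):
        grouped[sid].append(row)
    # Reduce each embedding dimension independently across a sample's cells.
    return {
        sid: [reducer(dim) for dim in zip(*rows)]
        for sid, rows in grouped.items()
    }

emb = [[1.0, 2.0], [3.0, 4.0], [10.0, 0.0]]
ids = ["donor_a", "donor_a", "donor_b"]
# donor_a's two cells collapse to their per-dimension mean, [2.0, 3.0]
print(aggregate_embeddings(emb, ids, "mean"))
```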
embedding
Task for evaluating cell representation quality using labeled data.
This task computes quality metrics for cell representations using ground truth labels. Currently supports silhouette score evaluation.
Specify one of --model-key, --cell-representation, or --compute-baseline to generate or provide the benchmarked cell representation to the task.
Specify one of --dataset-key or --user-dataset to specify the associated dataset file(s) that contain ground truth data needed by the task for evaluation. These dataset options may be specified multiple times for multi-dataset tasks.
If --model-key is specified, dataset(s) will provide the input data to the model. If --compute-baseline is specified, dataset(s) will be used to compute a baseline cell representation. If --cell-representation is specified, a dataset is only used if task-specific option arguments reference ground truth data within the dataset.
vcp benchmarks run embedding [OPTIONS]
Options
- -m, --model-key <model_key>
Model key (e.g. SCVI-v1-homo_sapiens; run vcp benchmarks list for available model keys).
- -d, --dataset-key <dataset_key>
Dataset key from czbenchmarks datasets (e.g., tsv2_blood; run czbenchmarks list datasets for available dataset keys). Can be used multiple times.
- -u, --user-dataset <user_dataset>
Path to a user-provided .h5ad file. Provide as a JSON string with keys: 'dataset_class', 'organism', and 'path'. Example: '{"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset", "organism": "HUMAN", "path": "~/mydata.h5ad"}'. Can be used multiple times.
- -c, --cell-representation <cell_representation>
Path to precomputed cell embeddings (.npy file) or AnnData reference (e.g., '@X', '@obsm:X_pca'). Can be used multiple times.
- -B, --compute-baseline
Compute baseline for comparison. Cannot be used with --model-key or --cell-representation.
- -r, --random-seed <random_seed>
Set a random seed for reproducibility.
- -n, --no-cache
Disable caching. Forces all steps to run from scratch.
- -f, --format <format>
Output format (default: table).
- Options:
table | json
- --fit, --full
Column display for table format (default: fit). Use --full to show full column content; pair with a pager like 'less -S' for horizontal scrolling. Only applies to --format=table.
- --use-gpu, --no-use-gpu
Enable GPU support for model inference (default: enabled).
- --input-labels <input_labels>
Ground truth labels for metric calculation (e.g. obs.cell_type from an AnnData object). Supports AnnData reference syntax (e.g. '@obs:cell_type').
- --baseline-n-top-genes <baseline_n_top_genes>
Number of highly variable genes for PCA baseline. [Default: 3000] [Required: False]
- --baseline-n-pcs <baseline_n_pcs>
Number of principal components for PCA baseline. [Default: 50] [Required: False]
- --baseline-obsm-key <baseline_obsm_key>
AnnData .obsm key to store the baseline PCA embedding. [Default: emb] [Required: False]
label_prediction
Task for predicting labels from embeddings using cross-validation.
Evaluates multiple classifiers (Logistic Regression, KNN) using k-fold cross-validation. Reports standard classification metrics.
Specify one of --model-key, --cell-representation, or --compute-baseline to generate or provide the benchmarked cell representation to the task.
Specify one of --dataset-key or --user-dataset to specify the associated dataset file(s) that contain ground truth data needed by the task for evaluation. These dataset options may be specified multiple times for multi-dataset tasks.
If --model-key is specified, dataset(s) will provide the input data to the model. If --compute-baseline is specified, dataset(s) will be used to compute a baseline cell representation. If --cell-representation is specified, a dataset is only used if task-specific option arguments reference ground truth data within the dataset.
vcp benchmarks run label_prediction [OPTIONS]
Options
- -m, --model-key <model_key>
Model key (e.g. SCVI-v1-homo_sapiens; run vcp benchmarks list for available model keys).
- -d, --dataset-key <dataset_key>
Dataset key from czbenchmarks datasets (e.g., tsv2_blood; run czbenchmarks list datasets for available dataset keys). Can be used multiple times.
- -u, --user-dataset <user_dataset>
Path to a user-provided .h5ad file. Provide as a JSON string with keys: 'dataset_class', 'organism', and 'path'. Example: '{"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset", "organism": "HUMAN", "path": "~/mydata.h5ad"}'. Can be used multiple times.
- -c, --cell-representation <cell_representation>
Path to precomputed cell embeddings (.npy file) or AnnData reference (e.g., '@X', '@obsm:X_pca'). Can be used multiple times.
- -B, --compute-baseline
Compute baseline for comparison. Cannot be used with --model-key or --cell-representation.
- -r, --random-seed <random_seed>
Set a random seed for reproducibility.
- -n, --no-cache
Disable caching. Forces all steps to run from scratch.
- -f, --format <format>
Output format (default: table).
- Options:
table | json
- --fit, --full
Column display for table format (default: fit). Use --full to show full column content; pair with a pager like 'less -S' for horizontal scrolling. Only applies to --format=table.
- --use-gpu, --no-use-gpu
Enable GPU support for model inference (default: enabled).
- --labels <labels>
Ground truth labels for prediction (e.g. obs.cell_type from an AnnData object). Supports AnnData reference syntax (e.g. '@obs:cell_type').
- --n-folds <n_folds>
Number of folds for stratified cross-validation. [Default: 5] [Required: False]
- --min-class-size <min_class_size>
Minimum number of samples required for a class to be included in evaluation. [Default: 10] [Required: False]
perturbation_expression_prediction
Task for evaluating perturbation-induced expression predictions against their ground truth values. This is done by calculating metrics derived from predicted and ground truth log fold change values for each condition. Currently, Spearman rank correlation is supported.
The following arguments are required and must be supplied by the task input class (PerturbationExpressionPredictionTaskInput) when running the task. These parameters are described below for documentation purposes:
- predictions_adata (ad.AnnData):
The anndata containing model predictions
- dataset_adata (ad.AnnData):
The anndata object from SingleCellPerturbationDataset.
- pred_effect_operation (Literal["difference", "ratio"]):
How to compute the predicted effect between treated and control mean predictions over genes.
"ratio" uses \(\log\left(\frac{\text{mean}(\text{treated}) + \varepsilon}{\text{mean}(\text{control}) + \varepsilon}\right)\) when means are positive.
"difference" uses \(\text{mean}(\text{treated}) - \text{mean}(\text{control})\) and is generally safe across scales (probabilities, z-scores, raw expression).
Default is "ratio".
- gene_index (Optional[pd.Index]):
The index of the genes in the predictions AnnData.
- cell_index (Optional[pd.Index]):
The index of the cells in the predictions AnnData.
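The two pred_effect_operation modes can be sketched as follows. This is an illustrative Python rendering of the formulas above, not the cz-benchmarks code; the epsilon value is an assumption, as the actual constant is not documented here.

```python
from math import log
from statistics import mean

EPS = 1e-8  # assumed epsilon; the library's actual value may differ

def predicted_effect(treated, control, operation="ratio"):
    """Compute the per-gene predicted effect from treated/control predictions."""
    if operation == "ratio":
        # log((mean(treated)+eps) / (mean(control)+eps)), valid for positive means
        return log((mean(treated) + EPS) / (mean(control) + EPS))
    if operation == "difference":
        # mean(treated) - mean(control), safe across scales
        return mean(treated) - mean(control)
    raise ValueError(f"unknown operation: {operation!r}")

print(predicted_effect([2.0, 4.0], [1.0, 1.0], "difference"))  # 2.0
```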
Specify one of --model-key, --cell-representation, or --compute-baseline to generate or provide the benchmarked cell representation to the task.
Specify one of --dataset-key or --user-dataset to specify the associated dataset file(s) that contain ground truth data needed by the task for evaluation. These dataset options may be specified multiple times for multi-dataset tasks.
If --model-key is specified, dataset(s) will provide the input data to the model. If --compute-baseline is specified, dataset(s) will be used to compute a baseline cell representation. If --cell-representation is specified, a dataset is only used if task-specific option arguments reference ground truth data within the dataset.
vcp benchmarks run perturbation_expression_prediction [OPTIONS]
Options
- -m, --model-key <model_key>
Model key (e.g. SCVI-v1-homo_sapiens; run vcp benchmarks list for available model keys).
- -d, --dataset-key <dataset_key>
Dataset key from czbenchmarks datasets (e.g., tsv2_blood; run czbenchmarks list datasets for available dataset keys). Can be used multiple times.
- -u, --user-dataset <user_dataset>
Path to a user-provided .h5ad file. Provide as a JSON string with keys: 'dataset_class', 'organism', and 'path'. Example: '{"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset", "organism": "HUMAN", "path": "~/mydata.h5ad"}'. Can be used multiple times.
- -c, --cell-representation <cell_representation>
Path to precomputed cell embeddings (.npy file) or AnnData reference (e.g., '@X', '@obsm:X_pca'). Can be used multiple times.
- -B, --compute-baseline
Compute baseline for comparison. Cannot be used with --model-key or --cell-representation.
- -r, --random-seed <random_seed>
Set a random seed for reproducibility.
- -n, --no-cache
Disable caching. Forces all steps to run from scratch.
- -f, --format <format>
Output format (default: table).
- Options:
table | json
- --fit, --full
Column display for table format (default: fit). Use --full to show full column content; pair with a pager like 'less -S' for horizontal scrolling. Only applies to --format=table.
- --use-gpu, --no-use-gpu
Enable GPU support for model inference (default: enabled).
- --adata <adata>
AnnData object from SingleCellPerturbationDataset containing perturbation data and metadata. [Default: None] [Required: True]
- --pred-effect-operation <pred_effect_operation>
Method to compute predicted effect: 'difference' (mean(treated) - mean(control)) or 'ratio' (log ratio of means). [Default: ratio] [Required: False] [Options: 'difference', 'ratio']
- Options:
difference | ratio
- --gene-index <gene_index>
Optional gene index for predictions to align model predictions with dataset genes.
- --cell-index <cell_index>
Optional cell index for predictions to align model predictions with dataset cells.
sequential_organization
Task for evaluating sequential consistency in embeddings.
This task computes sequential quality metrics for embeddings using time point labels. Evaluates how well embeddings preserve sequential organization between cells.
Specify one of --model-key, --cell-representation, or --compute-baseline to generate or provide the benchmarked cell representation to the task.
Specify one of --dataset-key or --user-dataset to specify the associated dataset file(s) that contain ground truth data needed by the task for evaluation. These dataset options may be specified multiple times for multi-dataset tasks.
If --model-key is specified, dataset(s) will provide the input data to the model. If --compute-baseline is specified, dataset(s) will be used to compute a baseline cell representation. If --cell-representation is specified, a dataset is only used if task-specific option arguments reference ground truth data within the dataset.
vcp benchmarks run sequential_organization [OPTIONS]
Options
- -m, --model-key <model_key>
Model key (e.g. SCVI-v1-homo_sapiens; run vcp benchmarks list for available model keys).
- -d, --dataset-key <dataset_key>
Dataset key from czbenchmarks datasets (e.g., tsv2_blood; run czbenchmarks list datasets for available dataset keys). Can be used multiple times.
- -u, --user-dataset <user_dataset>
Path to a user-provided .h5ad file. Provide as a JSON string with keys: 'dataset_class', 'organism', and 'path'. Example: '{"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset", "organism": "HUMAN", "path": "~/mydata.h5ad"}'. Can be used multiple times.
- -c, --cell-representation <cell_representation>
Path to precomputed cell embeddings (.npy file) or AnnData reference (e.g., '@X', '@obsm:X_pca'). Can be used multiple times.
- -B, --compute-baseline
Compute baseline for comparison. Cannot be used with --model-key or --cell-representation.
- -r, --random-seed <random_seed>
Set a random seed for reproducibility.
- -n, --no-cache
Disable caching. Forces all steps to run from scratch.
- -f, --format <format>
Output format (default: table).
- Options:
table | json
- --fit, --full
Column display for table format (default: fit). Use --full to show full column content; pair with a pager like 'less -S' for horizontal scrolling. Only applies to --format=table.
- --use-gpu, --no-use-gpu
Enable GPU support for model inference (default: enabled).
- --obs <obs>
Cell metadata DataFrame (e.g. the obs from an AnnData object). [Default: None] [Required: True] Supports AnnData reference syntax (e.g. '@obs').
- --input-labels <input_labels>
Ground truth labels for metric calculation (e.g. obs.cell_type from an AnnData object). Supports AnnData reference syntax (e.g. '@obs:cell_type').
- --k <k>
Number of nearest neighbors for k-NN based metrics. [Default: 15] [Required: False]
- --normalize
Whether to normalize the embedding for k-NN based metrics.
- --adaptive-k
Whether to use an adaptive number of nearest neighbors for k-NN based metrics.
- --baseline-n-top-genes <baseline_n_top_genes>
Number of highly variable genes for PCA baseline. [Default: 3000] [Required: False]
- --baseline-n-pcs <baseline_n_pcs>
Number of principal components for PCA baseline. [Default: 50] [Required: False]
- --baseline-obsm-key <baseline_obsm_key>
AnnData .obsm key to store the baseline PCA embedding. [Default: emb] [Required: False]
Examples
Run benchmark using a VCP benchmark key:
vcp benchmarks run --benchmark-key 40e2c4837bf36ae1
Embedding task with custom labels:
vcp benchmarks run embedding --model-key SCVI-v1-homo_sapiens --dataset-key tsv2_blood \
--input-labels cell_type --random-seed 42 --no-cache
Clustering task with advanced options:
vcp benchmarks run clustering --model-key SCVI-v1-homo_sapiens --dataset-key tsv2_blood \
--input-labels cell_type --use-rep X --n-iterations 3 --flavor igraph --key-added my_clusters \
--random-seed 42 --no-cache
Label prediction with cross-validation settings:
vcp benchmarks run label_prediction --model-key SCVI-v1-homo_sapiens --dataset-key tsv2_blood \
--labels cell_type --n-folds 3 --min-class-size 5 --random-seed 42 --no-cache
Batch integration with custom batch labels:
vcp benchmarks run batch_integration --model-key SCVI-v1-homo_sapiens --dataset-key tsv2_blood \
--batch-labels "@obs:batch" --labels cell_type --random-seed 42 --no-cache
Cross-species integration:
vcp benchmarks run cross-species_integration --model-key UCE-v1-4l \
--dataset-key mouse_spermatogenesis --organisms mus_musculus:ENSMUSG --cross-species-labels "@0:obs:cell_type" \
--dataset-key rhesus_macaque_spermatogenesis --organisms macaca_mulatta:ENSMMUG --cross-species-labels "@1:obs:cell_type" \
--random-seed 42 --no-cache
Use precomputed cell representations:
vcp benchmarks run label_prediction \
--cell-representation './user_model_output.npy' \
--user-dataset '{"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset", "organism": "HUMAN", "path": "~/user_dataset.h5ad"}' \
--labels @obs:cell_type --n-folds 5 --min-class-size 10 --random-seed 100 --no-cache
Running Baseline Benchmarks
Baseline benchmarks provide a reference point for comparing model performance. Instead of using a trained model, baselines compute a simple PCA-based cell representation directly from the dataset. This helps establish a performance floor and validate that models provide meaningful improvements over basic dimensionality reduction.
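The baseline idea above can be sketched in a few lines of NumPy: select the most variable genes, center the matrix, and project onto the top principal components. This is an illustrative sketch only, not the actual cz-benchmarks baseline implementation, which may apply additional normalization.

```python
import numpy as np

def pca_baseline(X: np.ndarray, n_top_genes: int = 3000, n_pcs: int = 50) -> np.ndarray:
    """Illustrative PCA baseline: highly variable gene selection + PCA.

    A sketch of the baseline concept; the real cz-benchmarks baseline
    may normalize and select genes differently.
    """
    n_top_genes = min(n_top_genes, X.shape[1])
    n_pcs = min(n_pcs, n_top_genes, X.shape[0])
    var = X.var(axis=0)
    top = np.argsort(var)[::-1][:n_top_genes]   # indices of most variable genes
    Xc = X[:, top] - X[:, top].mean(axis=0)     # center each selected gene
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_pcs].T                    # cells x n_pcs embedding

# Toy data: 100 cells x 200 genes, reduced to a 10-dimensional embedding.
emb = pca_baseline(np.random.default_rng(0).normal(size=(100, 200)),
                   n_top_genes=50, n_pcs=10)
print(emb.shape)  # (100, 10)
```

The `--baseline-n-top-genes` and `--baseline-n-pcs` options play the same roles as the `n_top_genes` and `n_pcs` parameters here.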
Basic baseline benchmark:
vcp benchmarks run embedding --compute-baseline --dataset-key tsv2_blood \
--input-labels cell_type --random-seed 42
Baseline with custom PCA parameters:
vcp benchmarks run clustering --compute-baseline --dataset-key tsv2_blood \
--input-labels cell_type \
--baseline-n-top-genes 5000 \
--baseline-n-pcs 100 \
--random-seed 42
Comparing baseline to model performance:
# Run baseline
vcp benchmarks run embedding --compute-baseline --dataset-key tsv2_blood \
--input-labels cell_type --random-seed 42
# Run model benchmark
vcp benchmarks run embedding --model-key SCVI-v1-homo_sapiens --dataset-key tsv2_blood \
--input-labels cell_type --random-seed 42
# Compare results
vcp benchmarks get --dataset-filter tsv2_blood --task-filter embedding
Baseline options:
--baseline-n-top-genes: Number of highly variable genes to select (default: 3000)
--baseline-n-pcs: Number of principal components to compute (default: 50)
--baseline-obsm-key: AnnData .obsm key to store the baseline embedding (default: 'emb')
Note: Baseline options vary by task and can be listed with
vcp benchmarks run <TASK> --help.
User Dataset Format
When using --user-dataset, provide a JSON string with the following keys:
dataset_class: The dataset class to use (typically czbenchmarks.datasets.SingleCellLabeledDataset)
organism: The organism type (HUMAN, MOUSE, etc.)
path: Path to the .h5ad file
Example:
{
"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset",
"organism": "HUMAN",
"path": "~/mydata.h5ad"
}
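Rather than hand-escaping this JSON on the command line, it can be built programmatically. A small sketch using only the standard library; the file path is a placeholder:

```python
import json

# Build the --user-dataset argument from a dict; keys follow the
# documented format (dataset_class, organism, path).
spec = {
    "dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset",
    "organism": "HUMAN",
    "path": "~/mydata.h5ad",  # placeholder path
}
user_dataset_arg = json.dumps(spec)

# Assemble the full command as an argument list (no shell quoting needed
# when passed to e.g. subprocess.run).
cmd = ["vcp", "benchmarks", "run", "label_prediction",
       "--user-dataset", user_dataset_arg]
print(user_dataset_arg)
```

Passing the command as a list avoids a common failure mode where shell quoting mangles the embedded JSON string.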
Task Arguments and Reference Format
Task-specific arguments are provided via command-line options. Label options support both direct column names and AnnData reference format:
Direct format: --labels cell_type or --input-labels cell_type
Reference format: --labels @obs:cell_type or --input-labels @obs:cell_type
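To make the two label formats concrete, here is a hypothetical helper sketching how such arguments could be resolved, including the `@0:`/`@1:` dataset-index prefix used by multi-dataset tasks. The actual VCP parser may differ:

```python
def parse_label_ref(value: str):
    """Split a label argument into (dataset_index, attribute, column).

    "cell_type"        -> (None, "obs", "cell_type")  # direct column name
    "@obs:cell_type"   -> (None, "obs", "cell_type")  # AnnData reference
    "@1:obs:cell_type" -> (1, "obs", "cell_type")     # multi-dataset reference
    """
    if not value.startswith("@"):
        # Direct format: treat the value as an obs column name.
        return (None, "obs", value)
    parts = value[1:].split(":")
    if parts[0].isdigit():
        # Leading dataset index, as in cross-species integration tasks.
        return (int(parts[0]), parts[1], parts[2])
    return (None, parts[0], parts[1])

print(parse_label_ref("@0:obs:cell_type"))  # (0, 'obs', 'cell_type')
```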
For embedding tasks:
vcp benchmarks run embedding --model-key MODEL_KEY --dataset-key DATASET_KEY \
--input-labels cell_type
For clustering tasks:
vcp benchmarks run clustering --model-key MODEL_KEY --dataset-key DATASET_KEY \
--input-labels cell_type --use-rep X --n-iterations 2 --flavor igraph --key-added leiden
For label prediction tasks:
vcp benchmarks run label_prediction --model-key MODEL_KEY --dataset-key DATASET_KEY \
--labels cell_type --n-folds 5 --min-class-size 10
For batch integration tasks:
vcp benchmarks run batch_integration --model-key MODEL_KEY --dataset-key DATASET_KEY \
--batch-labels "@obs:batch" --labels cell_type
For cross-species integration tasks:
vcp benchmarks run cross-species_integration --model-key MODEL_KEY \
--dataset-key DATASET_KEY_1 --organisms homo_sapiens:ENSG --cross-species-labels "@0:obs:cell_type" \
--dataset-key DATASET_KEY_2 --organisms mus_musculus:ENSMUSG --cross-species-labels "@1:obs:cell_type"
vcp benchmarks get
Retrieves and displays benchmark results that have been either computed and published by the Virtual Cells Platform or computed locally by the user with the vcp benchmarks run command.
If filters match benchmarks from both the VCP and a user’s locally run benchmarks, all of the matching benchmarks will be output together. This supports comparison of user benchmarks against VCP benchmarks.
Basic Usage
vcp benchmarks get
See Output Fields for a description of the output fields.
Options
Benchmarks Get Command Options
vcp benchmarks get
Fetch and display benchmark results with metrics.
Shows benchmarks from the VCP as well as locally cached results from previous user benchmark runs. Use filters to select by model, dataset, or task. Results include detailed performance metrics for each benchmark.
vcp benchmarks get [OPTIONS]
Options
- -b, --benchmark-key <benchmark_key>
Retrieve by benchmark key (exact match). Mutually-exclusive with filter options.
- -m, --model-filter <model_filter>
Filter by model key (substring match with '*' wildcards, e.g. 'scvi*v1').
- -d, --dataset-filter <dataset_filter>
Filter by dataset key (substring match with '*' wildcards, e.g. 'tsv2*liver').
- -t, --task-filter <task_filter>
Filter by task key (substring match with '*' wildcards, e.g. 'label*pred').
- -f, --format <format>
Output format
- Options:
table | json
- --fit, --full
Column display for table format (default: fit). Use --full to show full column content; pair with a pager like 'less -S' for horizontal scrolling. Only applies to --format=table.
- --user-runs, --no-user-runs
Include or exclude locally cached results from previous user benchmark runs (default: include).
Notes:
The filter options allow use of * as a wildcard. Filters use substring matching and are case-insensitive. Filter values match against both the name and the key of a given entity type (model, dataset, task).
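The documented semantics (case-insensitive substring match with '*' wildcards) can be approximated with the standard library's fnmatch module. A sketch, not the actual VCP implementation:

```python
import fnmatch

def matches_filter(key: str, pattern: str) -> bool:
    """Case-insensitive substring match with '*' wildcards.

    Wrapping the pattern in '*' makes it match anywhere in the key,
    mirroring the documented filter behavior (sketch only).
    """
    return fnmatch.fnmatch(key.lower(), f"*{pattern.lower()}*")

print(matches_filter("SCVI-v1-homo_sapiens", "scvi*v1"))  # True
print(matches_filter("tsv2_blood", "tsv2*liver"))         # False
```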
Examples
Get all available results:
vcp benchmarks get
Filter results by model and dataset:
vcp benchmarks get --model-filter test --dataset-filter tsv2_blood
Get results for a specific benchmark:
vcp benchmarks get --benchmark-key f47892309c571cdf
Filter by task and model with JSON output:
vcp benchmarks get --model-filter scvi --dataset-filter tsv2_blood --task-filter clustering --format json
Output Fields
The vcp benchmarks get and vcp benchmarks list commands output the following attributes:
Benchmark Key: Unique identifier for the benchmark
Model Key/Name: Model identifier and display name
Dataset Keys/Names: Dataset identifier and display name
Task Key/Name: Task identifier and display name
Metric: Metric name (for get results only)
Value: Metric value (for get results only)
For further details about the supported Tasks and Metrics see the cz-benchmarks Tasks documentation.
Advanced Usage Patterns
Reproducible Experiments
Always use the --random-seed option for reproducible results:
vcp benchmarks run clustering --model-key SCVI-v1-homo_sapiens --dataset-key tsv2_blood --random-seed 42
Bypassing Cache
Use --no-cache to ensure fresh computation:
vcp benchmarks run clustering --model-key SCVI-v1-homo_sapiens --dataset-key tsv2_blood --no-cache
GPU Usage
Control GPU usage for model inference with --use-gpu (enabled by default) or --no-use-gpu:
# Disable GPU (use CPU only)
vcp benchmarks run embedding --model-key SCVI-v1-homo_sapiens --dataset-key tsv2_blood --no-use-gpu
Reproducing VCP Results
Combine list and run commands for systematic evaluation:
# First, list available benchmarks
vcp benchmarks list --model-filter "scvi*" --format json > available_benchmarks.json
# Then run specific benchmarks
vcp benchmarks run --benchmark-key BENCHMARK_KEY_FROM_LIST
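The loop over listed benchmarks can be scripted. The sketch below assumes each entry in the JSON output carries a "benchmark_key" field; inspect your own `vcp benchmarks list --format json` output to confirm the actual field name:

```python
import json

# Stand-in for the contents of available_benchmarks.json; the
# "benchmark_key" field name is an assumption to verify against
# your own JSON output.
listing = json.loads(
    '[{"benchmark_key": "40e2c4837bf36ae1"},'
    ' {"benchmark_key": "f47892309c571cdf"}]'
)

# Build one `vcp benchmarks run` command per listed benchmark;
# these could be executed with subprocess.run(cmd, check=True).
commands = [["vcp", "benchmarks", "run", "--benchmark-key", b["benchmark_key"]]
            for b in listing]
for cmd in commands:
    print(" ".join(cmd))
```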
Baseline Benchmarking
Establish baseline performance using simple PCA-based representations:
# Run baseline with default parameters
vcp benchmarks run clustering --compute-baseline --dataset-key tsv2_blood \
--input-labels cell_type --random-seed 42
# Run baseline with custom PCA settings for more comprehensive evaluation
vcp benchmarks run clustering --compute-baseline --dataset-key tsv2_blood \
--input-labels cell_type --baseline-n-top-genes 5000 --baseline-n-pcs 100 --random-seed 42
# Compare baseline vs model performance
vcp benchmarks get --dataset-filter tsv2_blood --task-filter embedding
User Datasets
Evaluate models on user datasets while comparing to existing benchmarks:
# Specify a user's local dataset file with custom labels
vcp benchmarks run embedding --model-key SCVI-v1-homo_sapiens \
--user-dataset '{"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset", "organism": "HUMAN", "path": "~/custom.h5ad"}' \
--input-labels custom_cell_type
# Compare with existing results
vcp benchmarks get --model-filter SCVI-v1-homo_sapiens --task-filter embedding
Task-Specific Workflows
Use specialized options for different benchmark tasks:
# Advanced clustering with custom parameters
vcp benchmarks run clustering --model-key SCVI-v1-homo_sapiens --dataset-key tsv2_blood \
--input-labels cell_type --use-rep X --n-iterations 5 --flavor leidenalg \
--key-added custom_clusters --random-seed 42
# Cross-validation with custom settings for label prediction
vcp benchmarks run label_prediction \
--cell-representation embeddings.npy \
--user-dataset '{"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset", "organism": "HUMAN", "path": "~/data.h5ad"}' \
--labels @obs:cell_type --n-folds 10 --min-class-size 3 --random-seed 42
# Batch integration with custom batch labels
vcp benchmarks run batch_integration --model-key SCVI-v1-homo_sapiens --dataset-key tsv2_blood \
--batch-labels "@obs:batch" --labels cell_type --random-seed 42
Best Practices
Use specific filters: Narrow down results with appropriate filters to find relevant benchmarks quickly
Set random seeds: Ensure reproducibility by always setting random seeds for experiments
Establish baselines: Run baseline benchmarks (--compute-baseline) to establish reference performance before evaluating models
Reference format: Use @obs:column_name format when your dataset uses non-standard column names
Cache management: Use --no-cache sparingly, as caching significantly speeds up repeated experiments
Output format selection: Use JSON format for programmatic processing, table format for human review
Table display: By default, tables use --fit to display columns compactly. Use --full to show full column content; pair with a pager like less -S for horizontal scrolling
Task-specific tuning: Adjust parameters like --n-folds and --n-iterations based on dataset size and requirements
Progressive filtering: Start with broad filters and progressively narrow down to find specific benchmarks