Benchmarks
The VCP CLI provides commands for working with the benchmarking capabilities of the Virtual Cell Platform.
Overview
Benchmarking in VCP allows comparison of different models across various tasks and datasets. The benchmarking system consists of three main components:
Models: Pre-trained machine learning models (e.g., scVI, TRANSCRIPTFORMER)
Datasets: Single-cell datasets for evaluation (e.g., Tabula Sapiens datasets)
Tasks: Specific evaluation tasks (e.g., clustering, embedding, label prediction)
The Dataset and Task implementations are provided by the cz-benchmarks package.
Commands
vcp benchmarks list
Lists the benchmarks that have been computed by and published on the Virtual Cell Platform.
This output provides the combinations of datasets, models, and tasks for which benchmarks were computed.
See vcp benchmarks get below for how to retrieve the benchmark metric results for specific benchmarks.
Basic Usage
vcp benchmarks list
See Output Fields for a description of the output fields.
Options
Option | Description | Example
---|---|---
-b | Filter by specific benchmark key | vcp benchmarks list -b f47892309c571cdf
-m | Filter by model key pattern | vcp benchmarks list -m "scvi*"
-d | Filter by dataset key pattern | vcp benchmarks list -d tsv2_blood
-t | Filter by task key pattern | vcp benchmarks list -t embedding
-f | Output format (table or json) | vcp benchmarks list -f json
A benchmark key is a unique identifier that combines a specific model, dataset, and task. For example, f47892309c571cdf represents a specific combination of the TRANSCRIPTFORMER model, the tsv2_blood dataset, and the embedding task. Benchmark keys are returned in results when using the filter options and can be used to identify a specific benchmark when using the vcp benchmarks get, vcp benchmarks list, and vcp benchmarks run commands.
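Because a benchmark key pins down a model, dataset, and task at once, it is the most direct way to move between commands. For example, using the key from above:
# Look up the benchmark
vcp benchmarks list -b f47892309c571cdf
# Retrieve its published metric results
vcp benchmarks get -b f47892309c571cdf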
The filter options allow use of * as a wildcard. Filters use substring matching and are case-insensitive. Filter values match across both the name and key of a given entity type (model, dataset, task).
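For example, the following uses the documented wildcard and substring matching to list all scVI model variants on Tabula Sapiens v2 datasets (the tsv2* pattern is illustrative):
vcp benchmarks list -m "scvi*" -d "tsv2*"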
Examples
List all available benchmarks:
vcp benchmarks list
Filter by dataset, model, and task with table output:
vcp benchmarks list -d tsv2_blood -m TRANSCRIPT -t embedding
Find specific benchmark by key:
vcp benchmarks list -b f47892309c571cdf
Search for scVI models on any dataset with JSON output:
vcp benchmarks list -m "scvi*" -f json
vcp benchmarks run
Executes a benchmark task and generates performance metrics using a specific model and dataset.
Basic Usage
To reproduce a benchmark published on the Virtual Cell Platform:
vcp benchmarks run -m MODEL_KEY -d DATASET_KEY -t TASK_KEY
Options
Option | Description | Example
---|---|---
-b | Use a predefined benchmark combination | -b 40e2c4837bf36ae1
-m | Specify a model from the registry | -m SCVI-v1-homo_sapiens
-d | Specify a dataset from the registry | -d tsv2_blood
-t | Specify the benchmark task | -t embedding
-u, --user-dataset | Use a custom dataset file (JSON; see User Dataset Format below) | -u '{...}'
-c | Use precomputed embeddings | -c './user_model_output.npy'
--baseline-args | Parameters for baseline computation (JSON) | --baseline-args '{"n_folds": 5, "min_class_size": 10}'
-r, --random-seed | Set random seed for reproducibility | -r 42
-n, --no-cache | Disable caching, run from scratch | -n
Task-Specific Options
Embedding Task:
Option | Description | Example
---|---|---
--labels | Cell type labels column (also supports @obs:column format) | --labels cell_type
Clustering Task:
Option | Description | Example
---|---|---
--labels | Cell type labels column (also supports @obs:column format) | --labels cell_type
--use-rep | Representation to use for clustering (default: X) | --use-rep X
--n-iterations | Number of Leiden algorithm iterations (default: 2) | --n-iterations 3
--flavor | Leiden algorithm flavor: leidenalg or igraph (default: igraph) | --flavor igraph
--key-added | Key for storing cluster assignments (default: leiden) | --key-added my_clusters
Label Prediction Task:
Option | Description | Example
---|---|---
--labels | Cell type labels column (also supports @obs:column format) | --labels cell_type
--n-folds | Number of cross-validation folds (default: 5) | --n-folds 3
--min-class-size | Minimum samples per class for inclusion (default: 10) | --min-class-size 5
Batch Integration Task:
Option | Description | Example
---|---|---
--labels | Cell type labels column (also supports @obs:column format) | --labels cell_type
--batch-column | Batch information column (default: batch) | --batch-column batch_id
Cross-Species Integration Task:
Option | Description | Example
---|---|---
--organisms | Organism specification for cross-species datasets (format: organism:gene_id_prefix) | --organisms mus_musculus:ENSMUSG
--cross-species-labels | Cell type labels column for each dataset in cross-species tasks | --cross-species-labels "@0:obs:cell_type"
Examples
Run benchmark using a VCP benchmark key:
vcp benchmarks run -b 40e2c4837bf36ae1
Embedding task with custom labels:
vcp benchmarks run -m SCVI-v1-homo_sapiens -d tsv2_blood -t embedding --labels cell_type -r 42 -n
Clustering task with advanced options:
vcp benchmarks run -m SCVI-v1-homo_sapiens -d tsv2_blood -t clustering \
--labels cell_type --use-rep X --n-iterations 3 --flavor igraph --key-added my_clusters -r 42 -n
Label prediction with cross-validation settings:
vcp benchmarks run -m SCVI-v1-homo_sapiens -d tsv2_blood -t label_prediction \
--labels cell_type --n-folds 3 --min-class-size 5 -r 42 -n
Batch integration with custom batch column:
vcp benchmarks run -m SCVI-v1-homo_sapiens -d tsv2_blood -t batch_integration \
--batch-column batch_id --labels cell_type -r 42 -n
Cross-species integration:
vcp benchmarks run -t cross-species_integration -m UCE-v1-4l \
-d mouse_spermatogenesis --organisms mus_musculus:ENSMUSG --cross-species-labels "@0:obs:cell_type" \
-d rhesus_macaque_spermatogenesis --organisms macaca_mulatta:ENSMMUG --cross-species-labels "@1:obs:cell_type" \
-r 42 -n
Use precomputed cell representations with reference format:
vcp benchmarks run -c './user_model_output.npy' \
-u '{"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset", "organism": "HUMAN", "path": "~/user_dataset.h5ad"}' \
-t label_prediction --labels @obs:cell_type --n-folds 5 --min-class-size 10 -r 100 -n
User Dataset Format
When using --user-dataset, provide a JSON string with the following keys:
dataset_class: The dataset class to use (typically czbenchmarks.datasets.SingleCellLabeledDataset)
organism: The organism type (HUMAN, MOUSE, etc.)
path: Path to the .h5ad file
Example:
{
"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset",
"organism": "HUMAN",
"path": "~/mydata.h5ad"
}
Task Arguments and Reference Format
Task-specific arguments can be provided via command-line options or through the --baseline-args JSON parameter. The --labels option supports both direct column names and an AnnData reference format:
Direct format: --labels cell_type
Reference format: --labels @obs:cell_type
For embedding tasks:
# Command-line options (recommended)
--labels cell_type
# Via baseline-args
--baseline-args '{"input_labels": "@obs:cell_type"}'
For clustering tasks:
# Command-line options (recommended)
--labels cell_type --use-rep X --n-iterations 2 --flavor igraph --key-added leiden
# Via baseline-args
--baseline-args '{"obs": "@obs", "input_labels": "@obs:cell_type", "use_rep": "X", "n_iterations": 2, "flavor": "igraph", "key_added": "leiden"}'
For label prediction tasks:
# Command-line options (recommended)
--labels cell_type --n-folds 5 --min-class-size 10
# Via baseline-args
--baseline-args '{"labels": "@obs:cell_type", "n_folds": 5, "min_class_size": 10}'
For batch integration tasks:
# Command-line options (recommended)
--batch-column batch --labels cell_type
# Via baseline-args
--baseline-args '{"batch_labels": "@obs:batch", "labels": "@obs:cell_type"}'
For cross-species integration tasks:
# Command-line options (recommended)
--cross-species-organisms homo_sapiens:ENSG --cross-species-organisms mus_musculus:ENSMUSG
--cross-species-labels "@0:obs:cell_type" --cross-species-labels "@1:obs:cell_type"
# Via baseline-args
--baseline-args '{"organism_list": [["homo_sapiens", "ENSG"], ["mus_musculus", "ENSMUSG"]], "labels": ["@0:obs:cell_type", "@1:obs:cell_type"]}'
vcp benchmarks get
Retrieves and displays benchmark results, whether computed and published by the Virtual Cell Platform or computed locally by the user with the vcp benchmarks run command.
If filters match both VCP benchmarks and a user's locally run benchmarks, all matching benchmarks are output together. This supports comparison of user benchmarks against VCP benchmarks.
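For example, after reproducing a published benchmark locally, the same filters return the local and published results together for side-by-side comparison:
# Run the benchmark locally
vcp benchmarks run -m SCVI-v1-homo_sapiens -d tsv2_blood -t embedding --labels cell_type -r 42
# Retrieve local and published results together
vcp benchmarks get -m SCVI-v1-homo_sapiens -d tsv2_blood -t embedding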
Basic Usage
vcp benchmarks get
See Output Fields for a description of the output fields.
Options
Option | Description | Example
---|---|---
-b | Filter by benchmark key pattern | vcp benchmarks get -b f47892309c571cdf
-m | Filter by model key pattern | vcp benchmarks get -m scvi
-d | Filter by dataset key pattern | vcp benchmarks get -d tsv2_blood
-t | Filter by task key pattern | vcp benchmarks get -t clustering
-f | Output format (table or json) | vcp benchmarks get -f json
The filter options allow use of * as a wildcard. Filters use substring matching and are case-insensitive. Filter values match across both the name and key of a given entity type (model, dataset, task).
Examples
Get all available results:
vcp benchmarks get
Filter results by model and dataset:
vcp benchmarks get -m test -d tsv2_blood
Get results for specific benchmark:
vcp benchmarks get -b f47892309c571cdf
Filter by task and model with JSON output:
vcp benchmarks get -m scvi -d tsv2_blood -t clustering -f json
Output Fields
The vcp benchmarks get and vcp benchmarks list commands output the following attributes:
Benchmark Key: Unique identifier for the benchmark
Model Key/Name: Model identifier and display name
Dataset Keys/Names: Dataset identifiers and display names
Task Key/Name: Task identifier and display name
Metric: Metric name (get results only)
Value: Metric value (get results only)
For further details about the supported Tasks and Metrics, see the cz-benchmarks Tasks documentation.
Advanced Usage Patterns
Reproducible Experiments
Always use the --random-seed option for reproducible results:
vcp benchmarks run -m SCVI-v1-homo_sapiens -d tsv2_blood -t clustering -r 42
Bypassing Cache
Use --no-cache to ensure fresh computation:
vcp benchmarks run -m SCVI-v1-homo_sapiens -d tsv2_blood -t clustering --no-cache
Reproducing VCP Results
Combine the list and run commands for systematic evaluation:
# First, list available benchmarks
vcp benchmarks list -m "scvi*" -f json > available_benchmarks.json
# Then run specific benchmarks
vcp benchmarks run -b BENCHMARK_KEY_FROM_LIST
User Datasets
Evaluate models on user datasets while comparing to existing benchmarks:
# Specify a user's local dataset file with custom labels
vcp benchmarks run -m SCVI-v1-homo_sapiens \
-u '{"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset", "organism": "HUMAN", "path": "~/custom.h5ad"}' \
-t embedding --labels custom_cell_type
# Compare with existing results
vcp benchmarks get -m SCVI-v1-homo_sapiens -t embedding
Task-Specific Workflows
Use specialized options for different benchmark tasks:
# Advanced clustering with custom parameters
vcp benchmarks run -m SCVI-v1-homo_sapiens -d tsv2_blood -t clustering \
--labels cell_type --use-rep X --n-iterations 5 --flavor leidenalg --key-added custom_clusters -r 42
# Cross-validation with custom settings for label prediction
vcp benchmarks run -c embeddings.npy \
-u '{"dataset_class": "czbenchmarks.datasets.SingleCellLabeledDataset", "organism": "HUMAN", "path": "~/data.h5ad"}' \
-t label_prediction --labels @obs:cell_type --n-folds 10 --min-class-size 3 -r 42
# Batch integration with alternative column names
vcp benchmarks run -m SCVI-v1-homo_sapiens -d tsv2_blood -t batch_integration \
--batch-column sample_id --labels cell_type -r 42
Best Practices
Use specific filters: Narrow down results with appropriate filters to find relevant benchmarks quickly
Set random seeds: Ensure reproducibility by always setting random seeds for experiments
Reference format: Use the @obs:column_name format when your dataset uses non-standard column names
Cache management: Use --no-cache sparingly, as caching significantly speeds up repeated experiments
Output format selection: Use JSON format for programmatic processing and table format for human review (see the example below)
Task-specific tuning: Adjust parameters like --n-folds and --n-iterations based on dataset size and requirements
Progressive filtering: Start with broad filters and progressively narrow down to find specific benchmarks
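As a minimal sketch of programmatic processing, the JSON output can be piped into a tool such as jq (assumed to be installed; the fields correspond to the Output Fields listed above):
# Retrieve clustering results as JSON and pretty-print for scripting
vcp benchmarks get -m scvi -t clustering -f json | jq .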