Assets

Task Descriptions

| Task | Description |
| --- | --- |
| Cell clustering (in embedding space) | Cluster cells in embedding space and evaluate against known labels (e.g. cell type) |
| Metadata label prediction - Cell type classification | Use classifiers to predict cell type from embeddings |
| Cross-Species Batch Integration | Evaluate whether embeddings can align multiple species in a shared space |
| Sequential Organization | Evaluate sequential consistency in embeddings using time point labels and k-NN based metrics |
| Genetic perturbation prediction | Evaluate a model’s ability to predict expression for masked genes, given the remaining (unmasked) genes in a cell as context, under CRISPRi perturbation |

Dataset Descriptions

| Dataset | Description | Link |
| --- | --- | --- |
| Tabula Sapiens v2 | Part of a reference human cell atlas that includes single-cell transcriptomic data for over 500,000 cells representing 26 tissues sampled from male (n = 2) and female (n = 7) donors. Tissues include: bladder, blood, bone marrow, ear, eye, fat, heart, large intestine, liver, lung, lymph node, mammary, muscle, ovary, prostate, salivary gland, skin, small intestine, spleen, stomach, testis, thymus, tongue, trachea, uterus, and vasculature. | s3://cz-benchmarks-data/datasets/v1/cell_atlases/Homo_sapiens/Tabula_Sapiens_v2/ |
| Spermatogenesis | Includes single-nucleus RNA sequencing (snRNA-seq) data for testes from eleven species, including ten representative mammals and a bird. Species include human, mouse, rhesus macaque, gorilla, chimpanzee, marmoset, chicken, opossum, and platypus. | s3://cz-benchmarks-data/datasets/v1/evo_distance/testis/ |
| Replogle K562 essentials | Contains 310,385 single-cell RNA-seq profiles from the K562 chronic myeloid leukemia cell line, including 10,691 control cells and 299,694 cells with CRISPRi perturbations. It spans 2,057 distinct gene knockdown conditions with matched controls, capturing transcriptomic responses to CRISPRi-mediated perturbations of essential genes. | |
| Sound of Life | Contains longitudinal single-cell RNA-seq profiles from over 13 million peripheral blood mononuclear cells (PBMCs), collected from more than 300 healthy adults. It includes young (25–35 years) and older (55–65 years) adults, with 96 participants followed over two years with yearly vaccination. The dataset captures 71 immune cell subsets, including B cells, CD4⁺ and CD8⁺ T cells, NK cells, monocytes, and dendritic cells. Includes two subsets: immune_variation contains T cells with donor age labels, suitable for the Sequential Organization task (label_key=subject__ageAtFirstDraw) and for embedding analysis; flu_response contains B cells in which the response to flu vaccine is visible, suitable for embedding analysis (label_key=sample__visitName). | Data Card |
| Human Kidney Disease | Contains single-cell (sc) and single-nucleus (sn) RNA sequencing data generated from 304,652 cells collected from healthy reference kidneys (45 donors) and kidneys from 48 patients with acute kidney failure or chronic kidney disease. Data were generated across distinct kidney tissue sources including cortex, renal medulla, and renal papilla. The dataset captures a wide spectrum of kidney cell types and states, including rare and novel populations, as well as cellular programs altered in injury such as cycling, repair, transitioning, and degenerative states. | Data Card |
| Mouse Kidney | Contains single-nucleus (sn) RNA sequencing data generated from 309,666 cells from 24 mouse kidneys across two fibrosis models. It captures 50 cell types and states spanning epithelial, endothelial, immune, and stromal populations, and reveals shared and unique epithelial injury responses, including early proximal tubule states with dysregulated lipid and amino acid metabolism, as well as heterogeneous stromal populations contributing to fibrogenesis through epithelial–stromal crosstalk. Two dataset versions are provided: the full mouse kidney dataset and a version mapped to human orthologs. | Data Card |

Task Details

Cell Clustering (in embedding space)

This task evaluates how well the model’s embedding space separates different cell types. There is a forward pass of the data to produce embeddings. The embeddings are then clustered and compared to known cell type labels.

Task: Cell Clustering (in embedding space)

| Mode | Metric | Metric description |
| --- | --- | --- |
| Clustering Task | ARI | Adjusted Rand Index of biological labels and Leiden clusters. Described in Luecken et al. and implemented in scib-metrics. |
| Clustering Task | NMI | Normalized Mutual Information of biological labels and Leiden clusters. Described in Luecken et al. and implemented in scib-metrics. |
| Embedding Task | Silhouette score | Measures cluster separation based on within-cluster and between-cluster distances to evaluate the quality of clusters with respect to biological labels. Described in Luecken et al. and implemented in scib-metrics. |
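
The following is a minimal sketch of this evaluation, assuming `embeddings` is a cells × dimensions array produced by the model and `cell_types` holds the known labels. It uses scanpy for Leiden clustering and scikit-learn for the metrics rather than the exact scib-metrics implementation used by cz-benchmarks.

```python
import anndata as ad
import scanpy as sc
from sklearn.metrics import (
    adjusted_rand_score,
    normalized_mutual_info_score,
    silhouette_score,
)

# Wrap the model embeddings in an AnnData object so scanpy can cluster them.
adata = ad.AnnData(X=embeddings)
adata.obsm["X_model"] = embeddings

# Leiden clustering on a kNN graph built in the embedding space.
sc.pp.neighbors(adata, use_rep="X_model")
sc.tl.leiden(adata, key_added="leiden")

# Clustering-mode metrics: agreement between Leiden clusters and known labels.
ari = adjusted_rand_score(cell_types, adata.obs["leiden"])
nmi = normalized_mutual_info_score(cell_types, adata.obs["leiden"])

# Embedding-mode metric: separation of the known labels in the embedding itself.
asw = silhouette_score(embeddings, cell_types)
```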

Metadata label prediction - Cell type classification

This task evaluates how well model embeddings capture information relevant to cell identity. The data are passed through each model to produce embeddings, which are then used to train classifiers (Logistic Regression, k-NN, and Random Forest) to predict the cell type. To ensure a reliable evaluation, a 5-fold cross-validation strategy is employed. For each split, the classifier’s predictions on the held-out data, along with the true cell type labels, are used to compute a range of classification metrics. The final benchmark output for each metric is the average across the 5 cross-validation folds.

Task: Metadata label prediction - Cell type classification

| Metric | Description |
| --- | --- |
| Macro F1 | Measures the harmonic mean of precision and recall; (2 * tp) / (2 * tp + fp + fn) where tp = true positives, fn = false negatives, fp = false positives. Implemented here. |
| Accuracy | Proportion of correct predictions over total predictions. Implemented here. |
| Precision | Measures the proportion of true positive predictions among all positive predictions; tp / (tp + fp) where tp = true positives, fp = false positives. Implemented here. |
| Recall | Measures the proportion of actual positive instances that were correctly identified; tp / (tp + fn) where tp = true positives, fn = false negatives. Implemented here. |
| AUROC | Measures the probability that the model will rank a randomly chosen data point belonging to that category higher than a randomly chosen data point not belonging to that category. Implemented here. |
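
A hedged sketch of the cross-validation loop described above, again assuming `embeddings` and `cell_types` arrays; the exact classifier settings used in cz-benchmarks may differ.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

classifiers = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(),
}
scoring = ["f1_macro", "accuracy", "precision_macro", "recall_macro", "roc_auc_ovr"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, clf in classifiers.items():
    scores = cross_validate(clf, embeddings, cell_types, cv=cv, scoring=scoring)
    # The benchmark output per metric is the mean over the 5 folds.
    results[name] = {metric: scores[f"test_{metric}"].mean() for metric in scoring}
```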

Cross-Species Batch Integration

This task evaluates the model’s ability to learn representations that are consistent across different species. There is a forward pass of the data (each species is treated as an individual dataset) through the model. Once embeddings are generated for each species, they are concatenated into a single embedding matrix to enable cross-species comparison. Finally, the concatenated embeddings, along with the corresponding species labels, are used to compute evaluation metrics.

Task: Cross-Species Batch Integration

| Metric | Description |
| --- | --- |
| Entropy per cell | Measures the average entropy of the batch labels within the local neighborhood of each cell. Implemented here. |
| Batch silhouette | A modified silhouette score to measure the extent of batch mixing within biological labels. Described by Luecken et al. |
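
For illustration, here is a simplified version of the entropy-per-cell metric, assuming per-species embedding arrays (the names `human_emb` and `mouse_emb` are placeholders); the cz-benchmarks implementation may differ in neighborhood size and normalization.

```python
import numpy as np
from scipy.stats import entropy
from sklearn.neighbors import NearestNeighbors

def entropy_per_cell(embeddings, batch_labels, n_neighbors=50):
    """Mean entropy of batch (species) labels within each cell's kNN neighborhood."""
    batches, codes = np.unique(batch_labels, return_inverse=True)
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    ents = []
    for neighbors in idx:
        counts = np.bincount(codes[neighbors], minlength=len(batches))
        ents.append(entropy(counts / counts.sum()))
    return float(np.mean(ents))

# Embeddings from each species are concatenated before scoring.
combined = np.concatenate([human_emb, mouse_emb], axis=0)
species = np.array(["human"] * len(human_emb) + ["mouse"] * len(mouse_emb))
score = entropy_per_cell(combined, species)
```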

Cross-Species Label Prediction

This task evaluates how well model embeddings can be used to determine cell properties across species. It evaluates cross-species transfer by training classifiers on one species and testing on another, computing accuracy, F1, precision, recall, and AUROC for multiple classifiers (Logistic Regression, k-NN, Random Forest). The task can optionally aggregate cell-level embeddings to the sample/donor level before running classification.

Task: Cross-Species Label Prediction

| Metric | Description |
| --- | --- |
| Macro F1 | Measures the harmonic mean of precision and recall; (2 * tp) / (2 * tp + fp + fn) where tp = true positives, fn = false negatives, fp = false positives. Implemented here. |
| Accuracy | Proportion of correct predictions over total predictions. Implemented here. |
| Precision | Measures the proportion of true positive predictions among all positive predictions; tp / (tp + fp) where tp = true positives, fp = false positives. Implemented here. |
| Recall | Measures the proportion of actual positive instances that were correctly identified; tp / (tp + fn) where tp = true positives, fn = false negatives. Implemented here. |
| AUROC | Measures the probability that the model will rank a randomly chosen data point belonging to that category higher than a randomly chosen data point not belonging to that category. Implemented here. |
| Scorer | A flexible metric wrapper that allows custom scoring functions to be used for model evaluation. It enables integration of user-defined or standard metrics by wrapping them with make_scorer from scikit-learn, which standardizes the interface for cross-validation and model selection. This approach ensures consistent metric computation and supports additional options such as specifying whether higher scores are better, selecting the estimator response method (e.g., predict, predict_proba, or decision_function), and passing extra parameters to the scoring function. See scikit-learn documentation for details. |
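
A minimal sketch of the transfer setup and of wrapping a metric with make_scorer, using hypothetical variable names (`human_emb`, `human_labels`, `mouse_emb`, `mouse_labels`).

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer

# Train on one species, evaluate on another
# (assumes the label vocabulary is shared across species).
clf = LogisticRegression(max_iter=1000).fit(human_emb, human_labels)

# make_scorer turns a plain metric function into a scorer(estimator, X, y) callable,
# fixing extra parameters (here, macro averaging).
macro_f1 = make_scorer(f1_score, average="macro")
score = macro_f1(clf, mouse_emb, mouse_labels)
```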

Sequential Organization

This task evaluates sequential consistency in embeddings using time point labels and k-NN based metrics. It assesses how well embeddings preserve the sequential organization between cells, which is important for time-series or developmental trajectory data.

Task: Sequential Organization

| Metric | Description |
| --- | --- |
| Silhouette score | Measures cluster separation based on within-cluster and between-cluster distances using sequential labels to evaluate embedding quality with respect to sequential organization. |
| Sequential alignment | A k-NN based metric that evaluates how well the embedding preserves sequential relationships by measuring the consistency of sequential neighbors in the embedding space compared to the original sequential ordering. |
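
As a rough sketch (not the cz-benchmarks implementation), the silhouette score can be computed directly on time point labels, and a simple k-NN label-consistency proxy captures the spirit of the sequential alignment metric. `embeddings` and `time_labels` are assumed arrays.

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

def knn_label_consistency(embeddings, time_labels, n_neighbors=30):
    """Fraction of each cell's kNN that share its time point label (simplified proxy)."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    neighbors = idx[:, 1:]  # drop each cell itself
    same = time_labels[neighbors] == time_labels[:, None]
    return float(same.mean())

time_labels = np.asarray(time_labels)
sil = silhouette_score(embeddings, time_labels)
alignment = knn_label_consistency(embeddings, time_labels)
```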

Genetic Perturbation Prediction

Warning: This task is still in progress. Results are subject to further validation.

This task evaluates the performance of a model in predicting cellular responses to genetic perturbations. The model is used to predict expression values for datasets in which a subset of the differentially expressed genes has been randomly masked. These predictions are then correlated with their respective ground-truth values for each condition.

Task: Genetic Perturbation Prediction

| Metric | Description |
| --- | --- |
| Spearman correlation | Spearman correlation between ground truth and model-predicted gene expression values. |
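
A sketch of the scoring step, assuming a hypothetical mapping `masked_predictions` from each perturbation condition to (ground-truth, predicted) expression vectors for the masked genes.

```python
import numpy as np
from scipy.stats import spearmanr

# Per-condition Spearman correlation between ground truth and predictions.
per_condition = {}
for condition, (y_true, y_pred) in masked_predictions.items():
    rho, _ = spearmanr(y_true, y_pred)
    per_condition[condition] = rho

# Optional summary across conditions.
mean_rho = float(np.mean(list(per_condition.values())))
```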

Guidelines for Included Assets

As cz-benchmarks develops, robust governance policies will be developed to support direct community contribution.

At this stage, the cz-benchmarks project represents an initial prototype, and policy and project governance are intended to provide transparency and support the project in its current phase. Initial guidelines are as follows:

  • All content (datasets, tasks, metrics) included in cz-benchmarks currently represents a subset of recommendations from CZI staff.

  • Future versions will incorporate an expanded and refined set of assets. However, not all assets are appropriate for inclusion in a benchmarking platform. Benchmark assets are chosen based on overall quality in relation to comparable reference points, current standards in the research community, and relationship to supported priority benchmark domains as outlined in the roadmap. Formal asset contribution and asset governance policies are in development.

  • Note: TranscriptFormer was developed by the CZI AI team using separate task implementations. The cz-benchmarks task definitions, developed by the CZI SciTech team, were not included as a part of TranscriptFormer training and evaluation.

  • At this phase, the CZI SciTech team will guide initial decisions, coordinate updates, and ensure that all assets conform to policy requirements (licensing, versioning, etc.) through direct collaboration with working groups, composed of domain-specific experts from the broader scientific community and partners.

  • We value your feedback – feel free to open a GitHub issue or reach out to us at virtualcellmodels@chanzuckerberg.com.