Assets
Table of Contents
Task Descriptions
Task |
Description |
---|---|
Cell clustering (in embedding space) |
Cluster cells in embedding space and evaluate against known labels (e.g. cell type) |
Use classifiers to predict cell type from embeddings |
|
Evaluate whether embeddings can align multiple species in a shared space |
|
Evaluate sequential consistency in embeddings using time point labels and k-NN based metrics |
|
Evaluates a model’s ability to predict expression for masked genes, given the remaining (unmasked) genes in a cell as context, under CRISPRi perturbation |
Dataset Descriptions
Dataset |
Description |
Link |
---|---|---|
Tabula sapiens V2 |
Part of a reference human cell atlas that includes single-cell transcriptomic data for over 500,000 cells representing 26 tissues sampled from male (n = 2) and female (n = 7) donors. Tissues include: bladder, blood, bone marrow, ear, eye, fat, heart, large intestine, liver, lung, lymph node, mammary, muscle, ovary, prostate, salivary gland, skin, small intestine, spleen, stomach, testis thymus, tongue, trachea uterus, and vasculature. |
s3://cz-benchmarks-data/datasets/v1/cell_atlases/Homo_sapiens/Tabula_Sapiens_v2/ |
Spermatogenesis |
Includes single-nucleus RNA sequencing (snRNA-seq) data for testes from eleven species, including ten representative mammals and a bird. Species include human, mouse, Rhesus macaque, gorilla, chimpanzee, marmoset, chicken, opossum, and platypus. |
s3://cz-benchmarks-data/datasets/v1/evo_distance/testis/ |
Repogle K52 essentials |
Contains 310,385 single-cell RNA-seq profiles from the K562 chronic myeloid leukemia cell line, including 10,691 control cells and 299,694 cells with CRISPRi perturbations. It spans 2,057 distinct gene knockdown conditions with matched controls, capturing transcriptomic responses to CRISPRi-mediated perturbations of essential genes. |
|
Sound of Life |
Contains longitudinal single-cell RNA-seq profiles from over 13 million peripheral blood mononuclear cells (PBMCs), collected from more than 300 healthy adults. It includes young (25–35 years) and older (55–65 years) adults, with 96 participants followed over two years with yearly vaccination. The dataset captures 71 immune cell subsets, including B cells, CD4⁺ and CD8⁺ T cells, NK cells, monocytes, and dendritic cells. |
|
Human Kidney Disease |
Contains single-cell (sc) and single-nucleus (sn) RNA sequencing data generated from 304,652 cells that were collected from healthy reference kidneys (45 donors) and kidneys from 48 patients with acute kidney failure or chronic kidney disease. Data were generated across distinct kidney tissue sources including cortex, renal medulla, and renal papilla. The dataset captures a wide spectrum of kidney cell types and states, including rare and novel populations, as well as cellular programs altered in injury such as cycling, repair, transitioning, and degenerative states. |
|
Mouse Kidney |
Contains single-nucleus (sn) RNA sequencing data generated from 309,666 cells from 24 mouse kidneys across two fibrosis models. It captures 50 cell types and states spanning epithelial, endothelial, immune, and stromal populations, and reveals shared and unique epithelial injury responses, including early proximal tubule states with dysregulated lipid and amino acid metabolism, as well as heterogeneous stromal populations contributing to fibrogenesis through epithelial–stromal crosstalk. Two dataset versions are provided including the full mouse kidney dataset and a version mapped to human orthologs. |
Task Details
Cell Clustering (in embedding space)
This task evaluates how well the model’s embedding space separates different cell types. There is a forward pass of the data to produce embeddings. The embeddings are then clustered and compared to known cell type labels.
Task: Cell Clustering (in embedding space)
Mode |
Metrics |
Metric description |
---|---|---|
Clustering Task |
ARI |
Adjusted Rand Index of biological labels and leiden clusters. Described in Luecken et al. and implemented in scib-metrics. |
NMI |
Normalized Mutual Information of biological labels and leiden clusters. Described in Luecken et al. and implemented in scib-metrics. |
|
Embedding Task |
Silhouette score |
Measures cluster separation based on within-cluster and between-cluster distances to evaluate the quality of clusters with respect to biological labels. Described in Luecken et al. and implemented in scib-metrics. |
Metadata label prediction - Cell type classification
This task evaluates how well model embeddings capture information relevant to cell identity. This is achieved by a forward pass of the data through each model to retrieve embeddings, and then using the embeddings to train different classifiers, in this case we are using Logistic Regression, KNN, and RandomForest,to predict the cell type. To ensure a reliable evaluation, a 5-fold cross-validation strategy is employed. For each split, the classifier’s predictions on the held-out data, along with the true cell type labels, are used to compute a range of classification metrics. The final benchmark output for each metric is the average across the 5 cross-validation folds.
Task: Metadata label prediction - Cell type classification
Metrics |
Description |
---|---|
Macro F1 |
Measures the harmonic mean of precision and recall; ( 2*tp ) / ( 2 * tp + fp + fn ) where tp = true positives, fn = false negatives, fp = false positives. Implemented here. |
Accuracy |
Proportion of correct predictions over total predictions. Implemented here. |
Precision |
Measures the proportion of true positive predictions among all positive predictions; tp / (tp + fp) where tp = true positives, fp = false positives. Implemented here. |
Recall |
Measures the proportion of actual positive instances that were correctly identified; |
AUROC |
Measures the probability that the model will rank a randomly chosen data point belonging to that category higher than a randomly chosen data point not belonging to that category. Implemented here. |
Cross-Species Batch Integration
This task evaluates the model’s ability to learn representations that are consistent across different species. There is a forward pass of the data (each species is treated as an individual dataset) through the model. Once embeddings are generated for each species, they are concatenated into a single embedding matrix to enable cross-species comparison. Finally, the concatenated embeddings, along with the corresponding species labels, are used to compute evaluation metrics.
Task: Cross-Species Batch Integration
Metrics |
Description |
---|---|
Entropy per cell |
Measures the average entropy of the batch labels within the local neighborhood of each cell. Implemented here. |
Batch silhouette |
A modified silhouette score to measure the extent of batch mixing within biological labels. Described by Luecken et al. |
Cross-Species Label Prediction
This task evaluates how well model embeddings can be used to determine cell properties across species. It task evaluates cross-species transfer by training classifiers on one species and testing on another species. It computes accuracy, F1, precision, recall, and AUROC for multiple classifiers (Logistic Regression, KNN, Random Forest). The task can optionally aggregate cell-level embeddings to sample/donor level before running classification.
Task: Cross-Species Label Prediction
Metrics |
Description |
---|---|
Macro F1 |
Measures the harmonic mean of precision and recall; ( 2*tp ) / ( 2 * tp + fp + fn ) where tp = true positives, fn = false negatives, fp = false positives. Implemented here. |
Accuracy |
Proportion of correct predictions over total predictions. Implemented here. |
Precision |
Measures the proportion of true positive predictions among all positive predictions; tp / (tp + fp) where tp = true positives, fp = false positives. Implemented here. |
Recall |
Measures the proportion of actual positive instances that were correctly identified; |
AUROC |
Measures the probability that the model will rank a randomly chosen data point belonging to that category higher than a randomly chosen data point not belonging to that category. Implemented here. |
Scorer |
A flexible metric wrapper that allows custom scoring functions to be used for model evaluation. It enables integration of user-defined or standard metrics by wrapping them with |
Sequential Organization
This task evaluates sequential consistency in embeddings using time point labels and k-NN based metrics. It assesses how well embeddings preserve the sequential organization between cells, which is important for time-series or developmental trajectory data.
Task: Sequential Organization
Metrics |
Description |
---|---|
Silhouette score |
Measures cluster separation based on within-cluster and between-cluster distances using sequential labels to evaluate embedding quality with respect to sequential organization. |
Sequential alignment |
A k-NN based metric that evaluates how well the embedding preserves sequential relationships by measuring the consistency of sequential neighbors in the embedding space compared to the original sequential ordering. |
Genetic Perturbation Prediction
Warning: This task is still in progress. Results are subject to further validation.
This task evaluates the performance of a model in predicting cellular responses to genetic perturbations. The process involves using the model to predict expression values from datasets with a subset of their differential expressed genes randomly masked. These predictions are then correlated with their respective ground-truth values for each condition.
Task: Genetic Perturbation Prediction
Metrics |
Description |
---|---|
Spearman correlation |
Spearman correlation between ground truth and model-predicted gene expression values. |
Guidelines for Included Assets
As cz-benchmarks develops, robust governance policies will be developed to support direct community contribution.
At this stage, the cz-benchmarks project represents an initial prototype and policy and project governance are intended to provide transparency and support the project in its current phase. Initial guidelines are as follows:
All content (datasets, tasks, metrics) included in cz-benchmarks currently represents a subset of recommendations from CZI staff.
Future versions will incorporate an expanded and refined set of assets. However, not all assets are appropriate for inclusion in a benchmarking platform. Benchmark assets are chosen based on overall quality in relation to comparable reference points, current standards in the research community, and relationship to supported priority benchmark domains as outlined in the roadmap. Formal asset contribution and asset governance policies are in development.
Note: TranscriptFormer was developed by the CZI AI team using separate task implementations. The cz-benchmarks task definitions, developed by the CZI SciTech team, were not included as a part of TranscriptFormer training and evaluation.
At this phase, the CZI SciTech team will guide initial decisions, coordinate updates, and ensure that all assets conform to policy requirements (licensing, versioning, etc.) through direct collaboration with working groups, composed of domain-specific experts from the broader scientific community and partners.
We value your feedback – feel free to open a GitHub issue or reach out to us at virtualcellmodels@chanzuckerberg.com.