Definitions

Term	Definition
Benchmark Asset	A component of the benchmarking framework that contributes to evaluating AI models on biologically relevant tasks. Benchmark assets include datasets, models, tasks, and metrics, the combination of which can be configured to produce benchmark results.
Dataset	A structured collection of biological data curated for model evaluation. Datasets may include experimental results (e.g., single-cell RNA-seq, proteomics), reference annotations, or synthetic data generated for benchmarking purposes. Datasets may be publicly available or privately hosted.
Model	A method trained on biological data to make predictions or generate insights. Models can range from simple baselines and traditional statistical approaches to deep learning architectures, including transformer-based foundation models.
Task	A structured problem that a model can be evaluated on. Tasks are designed to reflect real-world biological questions, such as classifying cell types, segmenting images, or predicting disease state or gene function.
Metric	A quantitative measure used to assess model performance on a given task. Metrics should be chosen based on biological relevance and statistical rigor. Examples include accuracy, F1 score, AUROC, and domain-specific measures like the ability to recover known gene regulatory interactions.
Baseline	A simple method used as a reference point for benchmarking. Baselines provide a minimal level of performance, allowing researchers to quantify the improvement achieved by more advanced models on a particular task. Within cz-benchmarks, a Baseline is implemented in a Task-specific manner and is treated like a Model when interpreting downstream results.
Execution System	The infrastructure and computational environment used to run benchmarks. This includes hardware (e.g., cloud compute, GPUs), software dependencies (primarily the cz-benchmarks repository), and reproducibility standards that ensure benchmarking results are consistent and comparable.
Platform	The ecosystem that hosts and integrates benchmark assets, providing tools for dataset integration, model evaluation, and result visualization. The platform facilitates collaboration by enabling the research community to contribute and refine benchmarks.
Domain	The specific biological context or application area in which benchmarking is conducted. Domains may include transcriptomics, imaging, genomics, and multi-modal methods, each with unique datasets, models, and tasks.
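
To make the relationship between these terms concrete, the following is a minimal sketch of how a Dataset, a Model, a Baseline, and a Metric might be combined to produce benchmark results for a cell type classification Task. Every name here (`Dataset`, `run_benchmark`, `majority_baseline`, `threshold_model`) and the toy data are hypothetical and chosen for illustration only; this is not the cz-benchmarks API.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# NOTE: all classes, functions, and data below are illustrative assumptions,
# not part of cz-benchmarks.


@dataclass
class Dataset:
    """A toy Dataset: one feature vector and one cell type label per cell."""
    name: str
    features: Sequence[Sequence[float]]
    labels: Sequence[str]


# A Model (or Baseline) maps feature vectors to predicted labels.
Predictor = Callable[[Sequence[Sequence[float]]], List[str]]
# A Metric maps (true labels, predicted labels) to a score.
Metric = Callable[[Sequence[str], Sequence[str]], float]


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Metric: fraction of cells whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(labels: Sequence[str]) -> Predictor:
    """Baseline: always predict the most frequent label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda features: [most_common for _ in features]


def threshold_model(features: Sequence[Sequence[float]]) -> List[str]:
    """Stand-in for a Model: thresholds a single marker value."""
    return ["T cell" if row[0] > 0.5 else "B cell" for row in features]


def run_benchmark(
    dataset: Dataset,
    predictors: Dict[str, Predictor],
    metrics: Dict[str, Metric],
) -> Dict[str, Dict[str, float]]:
    """Task: score every predictor on the dataset with every metric."""
    results: Dict[str, Dict[str, float]] = {}
    for predictor_name, predict in predictors.items():
        y_pred = predict(dataset.features)
        results[predictor_name] = {
            metric_name: metric(dataset.labels, y_pred)
            for metric_name, metric in metrics.items()
        }
    return results


dataset = Dataset(
    name="toy_cells",
    features=[[0.9], [0.8], [0.7], [0.2], [0.6]],
    labels=["T cell", "T cell", "T cell", "B cell", "B cell"],
)

results = run_benchmark(
    dataset,
    predictors={
        "baseline": majority_baseline(dataset.labels),
        "model": threshold_model,
    },
    metrics={"accuracy": accuracy},
)

# Improvement of the model over the baseline on this task and metric.
improvement = results["model"]["accuracy"] - results["baseline"]["accuracy"]
print(results, improvement)
```

In this sketch the Baseline and the Model share the same interface, which mirrors the definition above: a Baseline is scored and interpreted like any other Model, and the difference between the two scores quantifies the improvement on the Task.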