Work in Progress Sequencing Schema
Contact: joan.wong@czbiohub.org
Document Status: Draft
Version: 1.0.0 (follows Semantic Versioning)
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14, RFC2119, and RFC8174 when, and only when, they appear in all capitals, as shown here.
Background
Sequencing-based profiling of biological conditions plays a crucial role in many flagship projects in the CZ ecosystem, generating rich datasets that support both primary research and secondary applications such as cross-study multi-omic analysis and AI-driven biological model training. To maximize the utility and reusability of sequencing datasets, this document outlines the sequencing data and metadata standards used across the CZ ecosystem to promote consistency, interoperability, and long-term value.
I. Key Terms
Definitions of terms used throughout this schema:
Sample: biological material collected from a donor or source organism, from which assays are performed
Dataset: file(s) generated from a single sequencing run, representing the raw data output from the sequencer, which may include multiple files corresponding to different lanes, read pairs, or libraries multiplexed in the same run
Raw data: unprocessed output from the sequencing platform (typically FASTQ files)
Assay: laboratory method or protocol used to generate data from a sample
Sequencing instrument: hardware device used to perform sequencing
Ontology term: a standardized identifier and label used to define biological or experimental concepts
II. Sample-level Metadata
Sample-level metadata describes the biological source and context of the data, providing the necessary information to find, group, and interpret samples across experiments and platforms. Cross-modality metadata applies across all data types in the CZ ecosystem, including imaging, mass spectrometry, and sequencing. This includes information about the assay used, the developmental stage, organism, tissue, tissue type, and disease context. These fields must be annotated consistently during sample registration and remain stable throughout downstream processing. Sequencing metadata captures general technical parameters required to interpret, reproduce, and process datasets from all sequencing runs, while assay-specific metadata captures details that are unique to certain protocols such as 10x Genomics.
Cross-Modality Metadata Requirements
(See full guidance.)
assay_ontology_term_id
Key | assay_ontology_term_id |
Description | Assay that was used to create the dataset |
Annotator | Submitter MUST annotate. |
Value | List[String]. A List element MUST be an Experimental Factor Ontology (EFO) term such as "EFO:0022605". |
development_stage_ontology_term_id
Key | development_stage_ontology_term_id |
Description | Development stage(s) of the patients or organisms from which assayed biosamples were derived |
Annotator | Submitter MUST annotate. |
Value | List[String]. If unavailable, then the List element MUST be "unknown".
For all other organisms, a List element MUST be the most accurate descendant of |
disease_ontology_term_id
Key | disease_ontology_term_id |
Description | Disease of the patients or organisms from which assayed biosamples were derived |
Annotator | Submitter MUST annotate. |
Value | List[String]. A List element MUST be one of:
|
organism_ontology_term_id
Key | organism_ontology_term_id |
Description | Organism from which assayed biosamples were derived |
Annotator | Submitter MUST annotate. |
Value | List[String]. A List element MUST be an NCBI organismal classification term such as "NCBITaxon:9606". |
For registration, there MUST be a one-to-one mapping between the List of tissue_ontology_term_id(s) and the List of tissue_type(s). For example, tissue_type[0] MUST be the tissue type for tissue_ontology_term_id[0]. In the Discover API, this is modeled as:
'tissue': [{'label': 'spleen',
'ontology_term_id': 'UBERON:0002106',
'tissue_type': 'tissue'}]
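The one-to-one pairing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pair_tissues` and the separate `labels` list are hypothetical, and the real Discover API may build these records differently.

```python
# Allowed tissue_type values, as listed in the tissue_type field definition.
ALLOWED_TISSUE_TYPES = {"tissue", "organoid", "cell culture"}


def pair_tissues(tissue_ontology_term_ids, tissue_types, labels):
    """Zip parallel lists into records shaped like the Discover API example.

    Hypothetical helper: element i of every list describes the same tissue.
    """
    if not (len(tissue_ontology_term_ids) == len(tissue_types) == len(labels)):
        raise ValueError("tissue lists must all have the same length")
    records = []
    for term_id, tissue_type, label in zip(
        tissue_ontology_term_ids, tissue_types, labels
    ):
        if tissue_type not in ALLOWED_TISSUE_TYPES:
            raise ValueError(f"unsupported tissue_type: {tissue_type!r}")
        records.append(
            {"label": label, "ontology_term_id": term_id, "tissue_type": tissue_type}
        )
    return records


print(pair_tissues(["UBERON:0002106"], ["tissue"], ["spleen"]))
```

Raising on a length mismatch, rather than silently truncating with `zip`, keeps a misaligned submission from being registered with the wrong tissue type.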
tissue_ontology_term_id
Key | tissue_ontology_term_id |
Description | Tissues from which assayed biosamples were derived |
Annotator | Submitter MUST annotate. |
Value | List[String]. If the corresponding tissue_type is "tissue" or "organoid" then:
For all other organisms, a List element MUST be the most accurate descendant of If the corresponding tissue_type is
Otherwise, for all other organisms, a List element MUST be a CL term. |
tissue_type
Key | tissue_type |
Description | Type of tissue from which assayed biosamples were derived |
Annotator | Submitter MUST annotate. |
Value | List[String]. A List element MUST be one of "tissue", "organoid", or "cell culture". |
In addition, all datasets MUST be programmatically annotated with human-readable metadata fields that contain the name each ontology assigns to the corresponding term identifier. For example, if assay_ontology_term_id[0] is "EFO:0022605", then assay[0] MUST be "10x 5' v3".
assay
Key | assay |
Value | List[String]. A List element MUST be the human-readable name assigned to the corresponding element in assay_ontology_term_id. |
development_stage
Key | development_stage |
Value | List[String]. A List element MUST be "unknown" if the corresponding element in development_stage_ontology_term_id is "unknown"; otherwise, it MUST be the human-readable name assigned to the corresponding element in development_stage_ontology_term_id. |
disease
Key | disease |
Value | List[String]. A List element MUST be the human-readable name assigned to the corresponding element in disease_ontology_term_id. |
organism
Key | organism |
Value | List[String]. A List element MUST be the human-readable name assigned to the corresponding element in organism_ontology_term_id. |
tissue
Key | tissue |
Value | List[String]. A List element MUST be the human-readable name assigned to the corresponding element in tissue_ontology_term_id. |
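The programmatic annotation of human-readable fields might look like the following sketch. The `LABELS` dictionary is a hypothetical stand-in for a real ontology lookup (a production pipeline would read labels from the ontology releases themselves), and `labels_for` is an illustrative name, not part of the schema.

```python
# Hypothetical in-memory lookup; real labels come from the ontology releases.
LABELS = {
    "EFO:0022605": "10x 5' v3",
    "NCBITaxon:9606": "Homo sapiens",
    "UBERON:0002106": "spleen",
}


def labels_for(term_ids, lookup, allow_unknown=False):
    """Return the human-readable name for each term ID, preserving order.

    allow_unknown=True passes "unknown" through unchanged, which the schema
    defines only for development_stage.
    """
    names = []
    for term_id in term_ids:
        if allow_unknown and term_id == "unknown":
            names.append("unknown")
        elif term_id in lookup:
            names.append(lookup[term_id])
        else:
            raise KeyError(f"no label found for {term_id!r}")
    return names


metadata = {"assay_ontology_term_id": ["EFO:0022605"]}
metadata["assay"] = labels_for(metadata["assay_ontology_term_id"], LABELS)
print(metadata["assay"])
```

Because the output list is built in input order, element i of each human-readable field always corresponds to element i of its `*_ontology_term_id` field.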
Sequencing Metadata Requirements
The following requirements apply to all sequencing assays.
sequencing_instrument
Key | sequencing_instrument |
Description | Name of the technology platform and instrument used for sequencing |
Annotator | Submitter MUST annotate. |
Value | String. MUST be an EFO term classified under EFO:0002699 (e.g., "Illumina NovaSeq X"). |
sequencing_run_id
Key | sequencing_run_id |
Description | Unique identifier for the sequencing run from which the dataset was generated, typically automatically assigned from the instrument |
Annotator | Submitter MUST annotate. Use "unavailable" if no run-level information exists. |
Value | String. MUST be unique within each sequencing facility. Illumina run IDs MUST follow the format: YYMMDD_InstrumentSerial#Run#_FlowCell# |
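A submitter-side check of sequencing_run_id could be sketched as below. The regular expression is an assumption: it reads the Illumina format as four underscore-separated fields (six-digit date, instrument serial, run number, flow cell ID), as in typical Illumina run folder names, and the sample ID is made up for illustration. The schema text remains the authority on the exact format.

```python
import re

# Assumed reading of the format: YYMMDD, instrument serial, run number, and
# flow cell ID, separated by underscores. Adjust if the facility differs.
ILLUMINA_RUN_ID = re.compile(r"\d{6}_[A-Za-z0-9]+_\d+_[A-Za-z0-9-]+")


def is_acceptable_run_id(run_id):
    """Accept the literal "unavailable" or an Illumina-style run ID."""
    if run_id == "unavailable":
        return True
    return ILLUMINA_RUN_ID.fullmatch(run_id) is not None


# Made-up identifiers, for illustration only.
print(is_acceptable_run_id("210101_A00123_0456_AHXXXXXXXX"))
print(is_acceptable_run_id("run42"))
```

This only checks the shape of the string; uniqueness within a sequencing facility has to be enforced against that facility's own run registry.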
file_type
Key | file_type |
Description | Type of file submitted |
Annotator | Submitter MUST annotate. |
Value | String. MUST be one of "fastq", "bam", "h5", or "h5ad". |
III. Additional Assay-Specific Metadata Requirements
10x Genomics single-cell RNA sequencing
This section applies to 10x Genomics 3’ and 5’ single-cell RNA sequencing assays (EFO term: 10x transcription profiling). The typical data processing workflow for these assays begins with generating paired-end FASTQ files, followed by using software such as Cell Ranger to align reads to a reference genome, assign reads to transcripts using a gene annotation file, and produce the main output files: aligned reads (.bam) and gene expression count matrices (.h5). If BAM and/or HDF5 files are included, submitters MUST provide the reference genome, gene annotation, and processing software used during analysis so that those files are interpretable and reproducible. If neither BAM nor HDF5 files are included, these properties MUST NOT be submitted. These requirements are described below.
reference_genome
Key | reference_genome |
Description | Genome name and version used for alignment or quantification |
Annotator | Submitter MUST annotate. |
Value | String. Ensembl genomes MUST be selected from a list of genome builds (see Appendix A). NCBI assembly accessions MUST follow one of two formats: [GCA][_][nine digits][.][version number] or [GCF][_][nine digits][.][version number] |
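The two NCBI accession formats above translate directly into a regular expression. The sketch below is one possible check; the function name is hypothetical and the accession strings are used purely as format examples.

```python
import re

# GCA or GCF, an underscore, exactly nine digits, a dot, and a version number.
NCBI_ASSEMBLY_ACCESSION = re.compile(r"GC[AF]_\d{9}\.\d+")


def is_ncbi_assembly_accession(value):
    """Return True if value matches the GCA_/GCF_ accession format."""
    return NCBI_ASSEMBLY_ACCESSION.fullmatch(value) is not None


print(is_ncbi_assembly_accession("GCA_000001405.29"))
print(is_ncbi_assembly_accession("GCF_12345.1"))
```

Using `fullmatch` rather than `search` rejects values with leading or trailing text around an otherwise valid accession.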
reference_annotation
Key | reference_annotation |
Description | Gene annotation and version used for the reference genome build |
Annotator | Submitter MUST annotate. |
Value | String. For Ensembl, MUST follow the format "Ensembl vN", where N is the Ensembl release number, with supported values ranging from 75 to 117. The value MUST correspond to the source and version for the selected reference_genome. |
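As an illustration of the "Ensembl vN" rule, the following sketch parses the release number and enforces the supported range of 75 to 117. The function name `parse_ensembl_release` is hypothetical.

```python
import re

ENSEMBL_ANNOTATION = re.compile(r"Ensembl v(\d+)")
MIN_RELEASE, MAX_RELEASE = 75, 117  # supported range stated in the schema


def parse_ensembl_release(reference_annotation):
    """Return the release number N from "Ensembl vN", or raise ValueError."""
    match = ENSEMBL_ANNOTATION.fullmatch(reference_annotation)
    if match is None:
        raise ValueError(f"expected 'Ensembl vN', got {reference_annotation!r}")
    release = int(match.group(1))
    if not MIN_RELEASE <= release <= MAX_RELEASE:
        raise ValueError(f"Ensembl release {release} is outside 75-117")
    return release


print(parse_ensembl_release("Ensembl v110"))
```

This covers only the format and range; whether a release is valid for a particular reference_genome depends on the Appendix A table.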
alignment_software
Key | alignment_software |
Description | Name and version of the software used to generate aligned reads |
Annotator | Submitter MUST annotate. |
Value | String. MUST include software name and version.
Examples: |
IV. Required Ontologies
To standardize collected data, we will follow biologically relevant ontology standards.
Ontology | OBO Prefix | Description |
---|---|---|
C. elegans Development Ontology | WBls | Standardized ontology of Caenorhabditis elegans developmental stages |
C. elegans Gross Anatomy Ontology | WBbt | Standardized ontology of C. elegans anatomical structures |
Cell Ontology | CL | Standardized ontology of cell types across animal species |
Drosophila Anatomy Ontology | FBbt | Standardized ontology of Drosophila melanogaster anatomical structures |
Drosophila Development Ontology | FBdv | Standardized ontology of D. melanogaster developmental stages |
Experimental Factor Ontology | EFO | Standardized ontology covering experimental variables, assays, sample attributes, and other related terms in biomedical research |
Human Developmental Stages | HsapDv | Standardized ontology of human developmental stages |
Mondo Disease Ontology | MONDO | Comprehensive ontology that integrates multiple disease resources into a unified, logically defined vocabulary for human and animal diseases. |
Mouse Developmental Stages | MmusDv | Standardized ontology of mouse developmental stages |
NCBI organismal classification | NCBITaxon | Standardized taxonomy of organisms |
Phenotype and Trait Ontology | PATO | Standardized ontology of phenotypic qualities and traits designed to support the logical representation and cross-species integration of phenotype data |
Uberon multi-species anatomy ontology | UBERON | Cross-species ontology covering anatomical structures in animals. It provides a standardized framework for describing anatomical entities and their relationships across different species. |
Zebrafish Anatomy Ontology | ZFA, ZFS | Standardized ontology of Danio rerio anatomical structures |
V. Required File Formats
This section describes the required file formats for sequencing outputs, including raw reads and processed results. FASTQ files require sequencing_run_id, sequencing_instrument, and file_type. BAM, H5, and H5AD files require reference_genome, reference_annotation, alignment_software, and file_type.
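The per-format field requirements can be expressed as a small lookup table. The sketch below assumes the instrument field is the `sequencing_instrument` key defined in the Sequencing Metadata Requirements; the `missing_fields` helper is a hypothetical name.

```python
PROCESSED_FIELDS = {
    "reference_genome",
    "reference_annotation",
    "alignment_software",
    "file_type",
}

# Required metadata keys per file_type, following the paragraph above.
REQUIRED_FIELDS = {
    "fastq": {"sequencing_run_id", "sequencing_instrument", "file_type"},
    "bam": PROCESSED_FIELDS,
    "h5": PROCESSED_FIELDS,
    "h5ad": PROCESSED_FIELDS,
}


def missing_fields(metadata):
    """Return a sorted list of required keys absent from a metadata dict."""
    file_type = metadata.get("file_type")
    if file_type not in REQUIRED_FIELDS:
        raise ValueError(f"unsupported file_type: {file_type!r}")
    return sorted(REQUIRED_FIELDS[file_type] - set(metadata))


print(missing_fields({"file_type": "fastq", "sequencing_run_id": "unavailable"}))
```

Returning the list of missing keys, instead of a bare boolean, lets a submission tool report every gap in one pass.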
FASTQ
Key | file_type (fastq) |
Description | Raw sequencing read data, including base calls and quality scores |
Value | String. File must be in standard FASTQ format, typically gzip-compressed (.fastq.gz). Must contain 4-line entries per read. For paired-end data, R1 and R2 files must be clearly labeled. Files should be parsable with standard tools such as FastQC, seqtk, or fastp. |
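A lightweight structural check for the four-line FASTQ layout might look like the sketch below. It is not a replacement for tools such as FastQC or fastp; the helper names are hypothetical, and the check covers only record shape (header line, sequence, separator, and quality string of equal length).

```python
import gzip
from itertools import zip_longest


def open_fastq(path):
    """Open a FASTQ file as text, transparently handling .gz compression."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_fastq_records(lines):
    """Count records in an iterable of lines, raising ValueError if malformed."""
    stripped = (line.rstrip("\r\n") for line in lines)
    count = 0
    # Consume the same iterator four times per step to group lines into records.
    for header, seq, sep, qual in zip_longest(*[stripped] * 4):
        if None in (header, seq, sep, qual):
            raise ValueError(f"record {count + 1} is truncated")
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"record {count + 1} has a bad header or separator")
        if len(seq) != len(qual):
            raise ValueError(f"record {count + 1} has mismatched lengths")
        count += 1
    return count


print(count_fastq_records(["@read1\n", "ACGT\n", "+\n", "IIII\n"]))
```

For a file on disk, the two helpers combine as `with open_fastq(path) as handle: count_fastq_records(handle)`, which streams the file without loading it into memory.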
BAM
Key | file_type (bam) |
Description | Aligned sequence data mapped to a reference genome |
Value | String. File must be in BAM format (binary SAM) with optional .bai index file. File must be sorted and contain a header with reference genome and aligner metadata. For 10x data, cell barcode and UMI tags (e.g., CB, UB) should be included. Must be readable with samtools or equivalent tools. |
H5
Key | file_type (h5) |
Description | Hierarchical data file in HDF5 format used for structured outputs |
Value | String. File must be a valid HDF5 (.h5) file that complies with the HDF5 v1.8+ standard. The internal structure of the file MUST be documented and, if it follows a known format (e.g., kallisto output, loom), it MUST conform to the schema expected by that tool. The file MUST be readable using standard HDF5 utilities such as h5py, h5dump, or HDFView. |
H5AD
Key | file_type (h5ad) |
Description | Hierarchical data file in AnnData (.h5ad) format |
Value | String. File must be a valid .h5ad file, which uses the HDF5 format to store data according to the AnnData schema. It MUST include the X matrix (expression data), along with obs (cell-level metadata) and var (gene-level metadata). The file MUST be readable using standard Python tools such as scanpy.read_h5ad() or anndata.read_h5ad(). |
Appendix A
Accepted values for reference_genome and reference_annotation are listed in this section and apply only to Ensembl genome builds. For non-Ensembl genomes, values must follow the specified format rather than being selected from a predefined list.
Species | reference_genome | reference_annotation | Source | Genome Release Dates |
---|---|---|---|---|
Homo sapiens | GRCh38.p14 | Ensembl v110-v117 | Ensembl | Jul 2023 |
Homo sapiens | GRCh38.p13 | Ensembl v98-v109 | Ensembl | Sep 2019 |
Homo sapiens | GRCh38.p12 | Ensembl v92-v97 | Ensembl | Apr 2018 |
Homo sapiens | GRCh38.p10 | Ensembl v88-v91 | Ensembl | Mar 2017 |
Homo sapiens | GRCh38.p7 | Ensembl v85-v87 | Ensembl | Jul 2016 |
Homo sapiens | GRCh38.p5 | Ensembl v83-v84 | Ensembl | Dec 2015 |
Homo sapiens | GRCh38.p3 | Ensembl v81-v82 | Ensembl | Jul 2015 |
Homo sapiens | GRCh38.p2 | Ensembl v79-v80 | Ensembl | Mar 2015 |
Homo sapiens | GRCh38 | Ensembl v76-v78 | Ensembl | Aug 2014 |
Homo sapiens | GRCh37.p13 | Ensembl v75-v117 | Ensembl | Dec 2013 |
Caenorhabditis elegans | WBcel235 | Ensembl v85-v117 | Ensembl | Jul 2016 |
Callithrix jacchus | mCalJac1.pat.X | Ensembl v105-v117 | Ensembl | Dec 2021 |
Danio rerio | GRCz11 | Ensembl v92-v117 | Ensembl | Apr 2018 |
Danio rerio | GRCz10 | Ensembl v85-v91 | Ensembl | Jul 2016 |
Drosophila melanogaster | BDGP6.46 | Ensembl v110-v113 | Ensembl | Jul 2023 |
Drosophila melanogaster | BDGP6.54 | Ensembl v114-v117 | Ensembl | May 2025 |
Gorilla gorilla gorilla | gorGor4 | Ensembl v91-v117 | Ensembl | Dec 2017 |
Macaca fascicularis | Macaca_fascicularis_6.0 | Ensembl v103-v117 | Ensembl | Feb 2021 |
Macaca mulatta | Mmul_10 | Ensembl v98-v117 | Ensembl | Sep 2019 |
Microcebus murinus | Mmur_3.0 | Ensembl v91-v117 | Ensembl | Dec 2017 |
Mus musculus | GRCm39 | Ensembl v103-v117 | Ensembl | Feb 2021 |
Mus musculus | GRCm38.p6 | Ensembl v92-v102 | Ensembl | Apr 2018 |
Mus musculus | GRCm38.p5 | Ensembl v87-v91 | Ensembl | Dec 2016 |
Oryctolagus cuniculus | OryCun2.0 | Ensembl v85-v117 | Ensembl | Jul 2016 |
Pan troglodytes | Pan_tro_3.0 | Ensembl v91-v117 | Ensembl | Dec 2017 |
Rattus norvegicus | GRCr8 | Ensembl v114-v117 | Ensembl | May 2025 |
Rattus norvegicus | mRatBN7.2 | Ensembl v105-v113 | Ensembl | Dec 2021 |
SARS-CoV-2 | ASM985889v3 | N/A | Ensembl | Apr 2020 |
Sus scrofa | Sscrofa11.1 | Ensembl v90-v114 | Ensembl | Aug 2017 |
synthetic construct | ThermoFisher ERCC RNA Spike-In Control Mixes (Cat # 4456740, 4456739) | N/A | ThermoFisher ERCC | |
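The requirement that reference_annotation correspond to the selected reference_genome can be checked against the table above. The sketch below copies only four rows into a hypothetical `SUPPORTED_RELEASES` mapping for brevity; a complete validator would carry every row of Appendix A.

```python
import re

# Subset of Appendix A: genome build -> (first, last) supported Ensembl release.
SUPPORTED_RELEASES = {
    "GRCh38.p14": (110, 117),
    "GRCh38.p13": (98, 109),
    "GRCm39": (103, 117),
    "GRCz11": (92, 117),
}


def annotation_matches_genome(reference_genome, reference_annotation):
    """Return True if the Ensembl release is listed for the genome build."""
    match = re.fullmatch(r"Ensembl v(\d+)", reference_annotation)
    if match is None or reference_genome not in SUPPORTED_RELEASES:
        return False
    first, last = SUPPORTED_RELEASES[reference_genome]
    return first <= int(match.group(1)) <= last


print(annotation_matches_genome("GRCh38.p14", "Ensembl v110"))
print(annotation_matches_genome("GRCh38.p13", "Ensembl v110"))
```

Rows with a reference_annotation of N/A, such as SARS-CoV-2 and the ERCC spike-in controls, fall outside this range check and would need separate handling.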
Sourced from https://github.com/chanzuckerberg/data-guidance/blob/main/standards/sequencing/1.0/cz_seq_schema.md