Work in Progress Sequencing Schema

Contact: joan.wong@czbiohub.org

Document Status: Draft

Version: 1.0.0 (follows Semantic Versioning)

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14, RFC2119, and RFC8174 when, and only when, they appear in all capitals, as shown here.


Background

Sequencing-based profiling of biological conditions plays a crucial role in many flagship projects in the CZ ecosystem, generating rich datasets that support both primary research and secondary applications such as cross-study multi-omic analysis and AI-driven biological model training. To maximize the utility and reusability of sequencing datasets, this document outlines the sequencing data and metadata standards used across the CZ ecosystem to promote consistency, interoperability, and long-term value.

I. Key Terms

Definitions of terms used throughout this schema:

  1. Sample: biological material collected from a donor or source organism, from which assays are performed

  2. Dataset: file(s) generated from a single sequencing run, representing the raw data output from the sequencer; a dataset may include multiple files corresponding to different lanes, read pairs, or libraries multiplexed in the same run

  3. Raw data: unprocessed output from the sequencing platform (typically FASTQ files)

  4. Assay: laboratory method or protocol used to generate data from a sample

  5. Sequencing instrument: hardware device used to perform sequencing

  6. Ontology term: a standardized identifier and label used to define biological or experimental concepts

II. Sample-level Metadata

Sample-level metadata describes the biological source and context of the data, providing the necessary information to find, group, and interpret samples across experiments and platforms. Cross-modality metadata applies across all data types in the CZ ecosystem, including imaging, mass spectrometry, and sequencing. This includes information about the assay used, the developmental stage, organism, tissue, tissue type, and disease context. These fields must be annotated consistently during sample registration and remain stable throughout downstream processing. Sequencing metadata captures general technical parameters required to interpret, reproduce, and process datasets from all sequencing runs, while assay-specific metadata captures details that are unique to certain protocols such as 10x Genomics.

Cross-Modality Metadata Requirements

(See full guidance.)

assay_ontology_term_id

Key assay_ontology_term_id
Description Assay that was used to create the dataset
Annotator Submitter MUST annotate.
Value List[String]. A List element MUST be an Experimental Factor Ontology (EFO) term such as "EFO:0022605".

development_stage_ontology_term_id

Key development_stage_ontology_term_id
Description Development stage(s) of the patients or organisms from which assayed biosamples were derived
Annotator Submitter MUST annotate.
Value List[String]. If unavailable, then the List element MUST be "unknown".
For organism_ontology_term_id Value
"NCBITaxon:6239" for Caenorhabditis elegans A List element MUST be WBls:0000669 for unfertilized egg Ce, the most accurate descendant of WBls:0000803 for C. elegans life stage occurring during embryogenesis, or the most accurate descendant of WBls:0000804 for C. elegans life stage occurring post-embryogenesis.
"NCBITaxon:7955" for Danio rerio A List element MUST be the most accurate descendant of ZFS:0100000 for zebrafish stage excluding ZFS:0000000 for Unknown.
"NCBITaxon:7227" for Drosophila melanogaster A List element MUST be either the most accurate descendant of FBdv:00007014 for adult age in days or the most accurate descendant of FBdv:00005259 for developmental stage excluding FBdv:00007012 for life stage.
"NCBITaxon:9606" for Homo sapiens A List element MUST be the most accurate descendant of HsapDv:0000001 for life cycle.
"NCBITaxon:10090" for Mus musculus or one of its descendants A List element MUST be the most accurate descendant of MmusDv:0000001 for life cycle.

For all other organisms, a List element MUST be the most accurate descendant of UBERON:0000105 for life cycle stage, excluding UBERON:0000071 for death stage.
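The organism-specific rules above can be partly checked by comparing ontology prefixes. The following is a minimal sketch only: it tests the prefix of each term, not whether the term is the most accurate descendant of the required root, which would need the ontology itself. The names STAGE_PREFIX_BY_ORGANISM and stage_prefix_ok are hypothetical, and descendants of Mus musculus are not handled.

```python
# Expected development-stage ontology prefix per organism, taken from the table above.
STAGE_PREFIX_BY_ORGANISM = {
    "NCBITaxon:6239": "WBls",     # Caenorhabditis elegans
    "NCBITaxon:7955": "ZFS",      # Danio rerio
    "NCBITaxon:7227": "FBdv",     # Drosophila melanogaster
    "NCBITaxon:9606": "HsapDv",   # Homo sapiens
    "NCBITaxon:10090": "MmusDv",  # Mus musculus (descendant taxa not handled here)
}
DEFAULT_STAGE_PREFIX = "UBERON"  # all other organisms


def stage_prefix_ok(organism_id: str, stage_id: str) -> bool:
    """Necessary but not sufficient: accept "unknown" or a term with the expected prefix."""
    if stage_id == "unknown":
        return True
    expected = STAGE_PREFIX_BY_ORGANISM.get(organism_id, DEFAULT_STAGE_PREFIX)
    prefix, sep, local_id = stage_id.partition(":")
    return sep == ":" and prefix == expected and local_id.isdigit()
```

A term that passes this check still has to be verified against the ontology hierarchy and the listed exclusions.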

disease_ontology_term_id

Key disease_ontology_term_id
Description Disease of the patients or organisms from which assayed biosamples were derived
Annotator Submitter MUST annotate.
Value List[String]. A List element MUST be one of:

organism_ontology_term_id

Key organism_ontology_term_id
Description Organism from which assayed biosamples were derived
Annotator Submitter MUST annotate.
Value List[String]. A List element MUST be an NCBI organismal classification term such as "NCBITaxon:9606".

For registration, there MUST be a one-to-one mapping between the List of tissue_ontology_term_id(s) and the List of tissue_type(s). For example, tissue_type[0] MUST be the tissue type for tissue_ontology_term_id[0]. In the Discover API, this is modeled as:

 'tissue': [{'label': 'spleen',
    'ontology_term_id': 'UBERON:0002106',
    'tissue_type': 'tissue'}]
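As an illustration of the one-to-one mapping, the sketch below pairs the two parallel lists by index and rejects lists of unequal length. The function name pair_tissues is hypothetical, and the label field is omitted because it would require an ontology lookup.

```python
ALLOWED_TISSUE_TYPES = {"tissue", "organoid", "cell culture"}


def pair_tissues(tissue_ontology_term_id, tissue_type):
    """Pair tissue_ontology_term_id[i] with tissue_type[i]; fail if the lists do not line up."""
    if len(tissue_ontology_term_id) != len(tissue_type):
        raise ValueError("tissue_ontology_term_id and tissue_type must have the same length")
    paired = []
    for term_id, kind in zip(tissue_ontology_term_id, tissue_type):
        if kind not in ALLOWED_TISSUE_TYPES:
            raise ValueError(f"unsupported tissue_type: {kind!r}")
        paired.append({"ontology_term_id": term_id, "tissue_type": kind})
    return paired
```

For example, pair_tissues(["UBERON:0002106"], ["tissue"]) yields a single entry that combines the spleen term with its tissue type.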

tissue_ontology_term_id

Key tissue_ontology_term_id
Description Tissues from which assayed biosamples were derived
Annotator Submitter MUST annotate.
Value List[String]. If the corresponding tissue_type is "tissue" or "organoid" then:
For organism_ontology_term_id Value
"NCBITaxon:6239" for Caenorhabditis elegans A List element MUST be either an UBERON term or the most accurate descendant of WBbt:0005766 for anatomy excluding WBbt:0007849 for hermaphrodite, WBbt:0007850 for male, WBbt:0008595 for female, WBbt:0004017 for cell and its descendants, and WBbt:0006803 for nucleus and its descendants.
"NCBITaxon:7955" for Danio rerio A List element MUST be either an UBERON term or the most accurate descendant of ZFA:0100000 for zebrafish anatomical entity excluding ZFA:0001093 for unspecified and ZFA:0009000 for cell and its descendants.
"NCBITaxon:7227" for Drosophila melanogaster A List element MUST be either an UBERON term or the most accurate descendant of FBbt:10000000 for anatomical entity excluding FBbt:00007002 for cell and its descendants.

For all other organisms, a List element MUST be the most accurate descendant of UBERON:0001062 for anatomical entity.

If the corresponding tissue_type is "cell culture", the following Cell Ontology (CL) terms MUST NOT be used:

For organism_ontology_term_id Value
"NCBITaxon:6239" for Caenorhabditis elegans A List element MUST be either a CL term or the most accurate descendant of WBbt:0004017 for cell excluding WBbt:0006803 for nucleus and its descendants.
"NCBITaxon:7955" for Danio rerio A List element MUST be either a CL term or the most accurate descendant of ZFA:0009000 for cell.
"NCBITaxon:7227" for Drosophila melanogaster A List element MUST be either a CL term or the most accurate descendant of FBbt:00007002 for cell.

For all other organisms, a List element MUST be a CL term.

tissue_type

Key tissue_type
Description Type of tissue from which assayed biosamples were derived
Annotator Submitter MUST annotate.
Value List[String]. A List element MUST be one of "tissue", "organoid", or "cell culture".

In addition, all datasets MUST be programmatically annotated with human-readable metadata fields that contain the name each ontology assigns to its term identifier. For example, if assay_ontology_term_id[0] is "EFO:0022605", then assay[0] MUST be "10x 5' v3".

assay

Key assay
Value List[String]. A List element MUST be the human-readable name assigned to the corresponding element in assay_ontology_term_id.

development_stage

Key development_stage
Value List[String]. A List element MUST be "unknown" if the corresponding element in development_stage_ontology_term_id is "unknown"; otherwise, it MUST be the human-readable name assigned to the corresponding element in development_stage_ontology_term_id.

disease

Key disease
Value List[String]. A List element MUST be the human-readable name assigned to the corresponding element in disease_ontology_term_id.

organism

Key organism
Value List[String]. A List element MUST be the human-readable name assigned to the corresponding element in organism_ontology_term_id.

tissue

Key tissue
Value List[String]. A List element MUST be the human-readable name assigned to the corresponding element in tissue_ontology_term_id.
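The human-readable fields above are parallel lists derived from the corresponding term identifier lists. A minimal sketch of that derivation follows; LABELS is a hypothetical stand-in holding only the two labels quoted in this document, whereas a real pipeline would load labels from the ontology releases.

```python
# Hypothetical label table; only the two labels quoted in this document are included.
LABELS = {
    "EFO:0022605": "10x 5' v3",
    "UBERON:0002106": "spleen",
}


def labels_for(term_ids, labels=LABELS, allow_unknown=False):
    """Return the label list that parallels term_ids, element by element."""
    result = []
    for term_id in term_ids:
        if allow_unknown and term_id == "unknown":
            # development_stage keeps "unknown" when the term identifier is "unknown"
            result.append("unknown")
        else:
            result.append(labels[term_id])  # raises KeyError for an unrecognized term
    return result
```

Here allow_unknown=True corresponds to the development_stage rule; the other fields would leave it at the default.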

Sequencing Metadata Requirements

The following requirements apply to all sequencing assays.

sequencing_instrument

Key sequencing_instrument
Description Name of the technology platform and instrument used for sequencing
Annotator Submitter MUST annotate.
Value String. MUST be an EFO term classified under EFO:0002699 (e.g., "Illumina NovaSeq X").

sequencing_run_id

Key sequencing_run_id
Description Unique identifier for the sequencing run from which the dataset was generated, typically assigned automatically by the instrument
Annotator Submitter MUST annotate. Use "unavailable" if no run-level information exists.
Value String. MUST be unique within each sequencing facility. Illumina run IDs MUST follow the format: YYMMDD_InstrumentSerial#Run#_FlowCell#
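A loose plausibility check for Illumina run IDs might look like the sketch below. It assumes the fields are separated by underscores and only verifies that the first field is a valid YYMMDD date and that at least three non-empty fields are present; the exact layout of the instrument, run, and flow cell fields is not checked. The function name is hypothetical.

```python
from datetime import datetime


def looks_like_illumina_run_id(run_id: str) -> bool:
    """Loose check: a YYMMDD date field followed by at least two more non-empty fields."""
    fields = run_id.split("_")
    if len(fields) < 3 or not all(fields):
        return False
    date_part = fields[0]
    if len(date_part) != 6 or not date_part.isdigit():
        return False
    try:
        datetime.strptime(date_part, "%y%m%d")  # rejects impossible dates such as month 13
    except ValueError:
        return False
    return True
```

The value "unavailable" does not pass this check by design; a validator would need to accept it separately when no run-level information exists.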

file_type

Key file_type
Description Type of file submitted
Annotator Submitter MUST annotate.
Value String. MUST be one of "fastq", "bam", "h5", or "h5ad".

III. Additional Assay-Specific Metadata Requirements

10x Genomics single-cell RNA sequencing

This section applies to 10x Genomics 3’ and 5’ single-cell RNA sequencing assays (EFO term: 10x transcription profiling). The typical data processing workflow for these assays begins with generating paired-end FASTQ files, followed by using software such as Cell Ranger to align reads to a reference genome, assign reads to transcripts using a gene annotation file, and produce the main output files: aligned reads (.bam) and gene expression count matrices (.h5). If BAM and/or HDF5 files are included, to ensure that those files are interpretable and reproducible, submitters MUST provide the reference genome, gene annotation, and processing software used during analysis. If neither BAM nor HDF5 files are included, these properties MUST NOT be submitted. These requirements are described below.

reference_genome

Key reference_genome
Description Genome name and version used for alignment or quantification
Annotator Submitter MUST annotate.
Value String. Ensembl genomes MUST be selected from a list of genome builds (see Appendix A).
NCBI assembly accessions MUST follow one of two formats:
[GCA][ _ ][nine digits][.][version number]
[GCF][ _ ][nine digits][.][version number]
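The two NCBI accession formats can be expressed as a single regular expression. This is a minimal sketch, assuming the version number is one or more digits; the names NCBI_ASSEMBLY_RE and is_ncbi_assembly_accession are hypothetical.

```python
import re

# "GCA" or "GCF", an underscore, nine digits, a period, then a version number.
NCBI_ASSEMBLY_RE = re.compile(r"GC[AF]_\d{9}\.\d+")


def is_ncbi_assembly_accession(value: str) -> bool:
    """True if value has the GCA_/GCF_ accession shape described above."""
    return NCBI_ASSEMBLY_RE.fullmatch(value) is not None
```

For instance, is_ncbi_assembly_accession("GCF_000001405.40") passes, while a value with eight digits or without a version suffix does not.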

reference_annotation

Key reference_annotation
Description Gene annotation and version used for the reference genome build
Annotator Submitter MUST annotate.
Value String. For Ensembl, MUST follow the format "Ensembl vN", where N is the Ensembl release number, with supported values ranging from 75 to 117. The value MUST correspond to the source and version for the selected reference_genome.
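The "Ensembl vN" format and its supported range can be checked as sketched below. The function name parse_ensembl_release is hypothetical; whether the release matches the selected reference_genome is a separate check against Appendix A.

```python
import re

ENSEMBL_ANNOTATION_RE = re.compile(r"Ensembl v(\d+)")
MIN_RELEASE, MAX_RELEASE = 75, 117  # supported range stated above


def parse_ensembl_release(value: str) -> int:
    """Return the release number from an "Ensembl vN" string, or raise ValueError."""
    match = ENSEMBL_ANNOTATION_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"expected 'Ensembl vN', got {value!r}")
    release = int(match.group(1))
    if not MIN_RELEASE <= release <= MAX_RELEASE:
        raise ValueError(f"release {release} is outside {MIN_RELEASE}-{MAX_RELEASE}")
    return release
```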

alignment_software

Key alignment_software
Description Name and version of the software used to generate aligned reads
Annotator Submitter MUST annotate.
Value String. MUST include software name and version.

Examples: "Cell Ranger v7.1.0", "STARsolo v3.1", "Kallisto v0.51.1"
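One way to confirm that a value carries both a name and a version is to split it on the " v" separator used in the examples above. That separator is an assumption inferred from the examples rather than a stated rule, and version strings with letter suffixes would not match this sketch. The function name is hypothetical.

```python
import re

# Assumed shape: "<name> v<dotted numeric version>", as in the examples above.
SOFTWARE_RE = re.compile(r"(?P<name>\S.*?) v(?P<version>\d+(?:\.\d+)*)")


def split_software(value: str):
    """Return (name, version) or None if the value does not match the assumed shape."""
    match = SOFTWARE_RE.fullmatch(value)
    if match is None:
        return None
    return match.group("name"), match.group("version")
```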

IV. Required Ontologies

To standardize collected data, this schema requires terms from the following ontologies.

Ontology OBO Prefix Description
C. elegans Development Ontology WBls Standardized ontology of Caenorhabditis elegans developmental stages
C. elegans Gross Anatomy Ontology WBbt Standardized ontology of C. elegans anatomical structures
Cell Ontology CL Standardized ontology of cell types across animal species
Drosophila Anatomy Ontology FBbt Standardized ontology of Drosophila melanogaster anatomical structures
Drosophila Development Ontology FBdv Standardized ontology of D. melanogaster developmental stages
Experimental Factor Ontology EFO Standardized ontology covering experimental variables, assays, sample attributes, and other related terms in biomedical research
Human Developmental Stages HsapDv Standardized ontology of human developmental stages
Mondo Disease Ontology MONDO Comprehensive ontology that integrates multiple disease resources into a unified, logically defined vocabulary for human and animal diseases.
Mouse Developmental Stages MmusDv Standardized ontology of mouse developmental stages
NCBI organismal classification NCBITaxon Standardized taxonomy of organisms
Phenotype and Trait Ontology PATO Standardized ontology of phenotypic qualities and traits designed to support the logical representation and cross-species integration of phenotype data
Uberon multi-species anatomy ontology UBERON Cross-species ontology covering anatomical structures in animals. It provides a standardized framework for describing anatomical entities and their relationships across different species.
Zebrafish Anatomy Ontology ZFA, ZFS Standardized ontology of Danio rerio anatomical structures

V. Required File Formats

This section describes the required file formats for sequencing outputs, including raw reads and processed results. FASTQ files require sequencing_run_id, sequencing_instrument, and file_type. BAM, H5, and H5AD files require reference_genome, reference_annotation, alignment_software, and file_type.
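The per-format requirements can be expressed as a lookup from file_type to the set of required keys. The sketch below is illustrative only; it uses the sequencing_instrument key defined under Sequencing Metadata Requirements for the instrument field, and the names REQUIRED_KEYS and missing_keys are hypothetical.

```python
PROCESSED_KEYS = {"reference_genome", "reference_annotation", "alignment_software", "file_type"}

# Required metadata keys per file_type, following the paragraph above.
REQUIRED_KEYS = {
    "fastq": {"sequencing_run_id", "sequencing_instrument", "file_type"},
    "bam": PROCESSED_KEYS,
    "h5": PROCESSED_KEYS,
    "h5ad": PROCESSED_KEYS,
}


def missing_keys(record: dict) -> set:
    """Return the required keys that are absent from a metadata record."""
    file_type = record.get("file_type")
    if file_type not in REQUIRED_KEYS:
        raise ValueError(f"unsupported file_type: {file_type!r}")
    return REQUIRED_KEYS[file_type] - set(record)
```

An empty result means that every key required for that file type is present; it says nothing about whether the values themselves are valid.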

FASTQ

Key file_type (fastq)
Description Raw sequencing read data, including base calls and quality scores
Value String. File must be in standard FASTQ format, typically gzip-compressed (.fastq.gz). Must contain 4-line entries per read. For paired-end data, R1 and R2 files must be clearly labeled. Files should be parsable with standard tools such as FastQC, seqtk, or fastp.
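The 4-line record structure can be spot-checked with the standard library alone. The following is a minimal sketch, not a replacement for tools such as FastQC; it is deliberately strict (a trailing blank line is reported as an error), and the function name count_fastq_records is hypothetical.

```python
import gzip


def count_fastq_records(path: str) -> int:
    """Count 4-line FASTQ records, raising ValueError on the first malformed record."""
    opener = gzip.open if path.endswith(".gz") else open
    records = 0
    with opener(path, "rt") as handle:
        while True:
            header = handle.readline()
            if not header:  # end of file
                break
            sequence = handle.readline().rstrip("\n")
            plus = handle.readline()
            quality = handle.readline().rstrip("\n")
            if not header.startswith("@"):
                raise ValueError(f"record {records + 1}: header line does not start with '@'")
            if not plus.startswith("+"):
                raise ValueError(f"record {records + 1}: separator line does not start with '+'")
            if len(sequence) != len(quality):
                raise ValueError(f"record {records + 1}: sequence and quality lengths differ")
            records += 1
    return records
```

For paired-end data, running this on the R1 and R2 files and comparing the two counts is a quick way to detect truncated uploads.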

BAM

Key file_type (bam)
Description Aligned sequence data mapped to a reference genome
Value String. File must be in BAM format (binary SAM) with optional .bai index file. File must be sorted and contain a header with reference genome and aligner metadata. For 10x data, cell barcode and UMI tags (e.g., CB, UB) should be included. Must be readable with samtools or equivalent tools.
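The header requirements can be inspected without third-party libraries, because a BAM file is a BGZF (gzip-compatible) stream that starts with the magic bytes BAM\1 followed by the length and text of the SAM header. The sketch below reads that header and reports the sort order and whether @SQ (reference) and @PG (program) lines are present. The function names are hypothetical; in practice samtools or pysam would normally be used.

```python
import gzip
import struct


def read_bam_header_text(path: str) -> str:
    """Return the plain-text SAM header embedded at the start of a BAM file."""
    with gzip.open(path, "rb") as handle:
        if handle.read(4) != b"BAM\x01":
            raise ValueError("missing BAM magic bytes")
        (l_text,) = struct.unpack("<i", handle.read(4))  # little-endian int32 header length
        text = handle.read(l_text)
    return text.rstrip(b"\x00").decode("utf-8", errors="replace")


def summarize_bam_header(path: str) -> dict:
    """Report sort order and presence of @SQ and @PG lines in the header."""
    lines = read_bam_header_text(path).splitlines()
    sort_order = None
    for line in lines:
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    sort_order = field[3:]
    return {
        "sort_order": sort_order,
        "has_reference_lines": any(line.startswith("@SQ") for line in lines),
        "has_program_lines": any(line.startswith("@PG") for line in lines),
    }
```

A sort_order of "coordinate" or "queryname" indicates a sorted file, while "unsorted", "unknown", or None indicates that the sorting requirement is not demonstrably met.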

H5

Key file_type (h5)
Description Hierarchical data file in HDF5 format used for structured outputs
Value String. File must be a valid HDF5 (.h5) file that complies with the HDF5 v1.8+ standard. The internal structure of the file MUST be documented and, if it follows a known format (e.g., kallisto output, loom), it MUST conform to the schema expected by that tool. The file MUST be readable using standard HDF5 utilities such as h5py, h5dump, or HDFView.
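A quick pre-check before opening a file with h5py or h5dump is to look for the 8-byte HDF5 format signature, which the HDF5 file format places at offset 0 or, when a user block is present, at offset 512, 1024, 2048, and so on. The sketch below performs only that check; it does not confirm the library version or the internal structure. The function name has_hdf5_signature is hypothetical.

```python
import os

HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def has_hdf5_signature(path: str) -> bool:
    """True if the HDF5 signature is found at offset 0, 512, 1024, 2048, ..."""
    size = os.path.getsize(path)
    offset = 0
    with open(path, "rb") as handle:
        while offset + len(HDF5_SIGNATURE) <= size:
            handle.seek(offset)
            if handle.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return True
            offset = 512 if offset == 0 else offset * 2
    return False
```

Because .h5ad files are also HDF5 containers, the same pre-check applies to them.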

H5AD

Key file_type (h5ad)
Description Hierarchical data file in AnnData (.h5ad) format
Value String. File must be a valid .h5ad file, which uses the HDF5 format to store data according to the AnnData schema. It MUST include the X matrix (expression data), along with obs (cell-level metadata) and var (gene-level metadata). The file MUST be readable using standard Python tools such as scanpy.read_h5ad() or anndata.read_h5ad().


Appendix A

Accepted values for reference_genome and reference_annotation are listed in this section and apply only to Ensembl genome builds. For non-Ensembl genomes, values must follow the specified format rather than being selected from a predefined list.

Species reference_genome reference_annotation Source Genome Release Dates
Homo sapiens GRCh38.p14 Ensembl v110-v117 Ensembl Jul 2023
Homo sapiens GRCh38.p13 Ensembl v98-v109 Ensembl Sep 2019
Homo sapiens GRCh38.p12 Ensembl v92-v97 Ensembl Apr 2018
Homo sapiens GRCh38.p10 Ensembl v88-v91 Ensembl Mar 2017
Homo sapiens GRCh38.p7 Ensembl v85-v87 Ensembl Jul 2016
Homo sapiens GRCh38.p5 Ensembl v83-v84 Ensembl Dec 2015
Homo sapiens GRCh38.p3 Ensembl v81-v82 Ensembl Jul 2015
Homo sapiens GRCh38.p2 Ensembl v79-v80 Ensembl Mar 2015
Homo sapiens GRCh38 Ensembl v76-v78 Ensembl Aug 2014
Homo sapiens GRCh37.p13 Ensembl v75-v117 Ensembl Dec 2013
Caenorhabditis elegans WBcel235 Ensembl v85-v117 Ensembl Jul 2016
Callithrix jacchus mCalJac1.pat.X Ensembl v105-v117 Ensembl Dec 2021
Danio rerio GRCz11 Ensembl v92-v117 Ensembl Apr 2018
Danio rerio GRCz10 Ensembl v85-v91 Ensembl Jul 2016
Drosophila melanogaster BDGP6.46 Ensembl v110-v113 Ensembl Jul 2023
Drosophila melanogaster BDGP6.54 Ensembl v114-v117 Ensembl May 2025
Gorilla gorilla gorilla gorGor4 Ensembl v91-v117 Ensembl Dec 2017
Macaca fascicularis Macaca_fascicularis_6.0 Ensembl v103-v117 Ensembl Feb 2021
Macaca mulatta Mmul_10 Ensembl v98-v117 Ensembl Sep 2019
Microcebus murinus Mmur_3.0 Ensembl v91-v117 Ensembl Dec 2017
Mus musculus GRCm39 Ensembl v103-v117 Ensembl Feb 2021
Mus musculus GRCm38.p6 Ensembl v92-v102 Ensembl Apr 2018
Mus musculus GRCm38.p5 Ensembl v87-v91 Ensembl Dec 2016
Oryctolagus cuniculus OryCun2.0 Ensembl v85-v117 Ensembl Jul 2016
Pan troglodytes Pan_tro_3.0 Ensembl v91-v117 Ensembl Dec 2017
Rattus norvegicus GRCr8 Ensembl v114-v117 Ensembl May 2025
Rattus norvegicus mRatBN7.2 Ensembl v105-v113 Ensembl Dec 2021
SARS-CoV-2 ASM985889v3 N/A Ensembl Apr 2020
Sus scrofa Sscrofa11.1 Ensembl v90-v114 Ensembl Aug 2017
synthetic construct ThermoFisher ERCC RNA Spike-In Control Mixes (Cat # 4456740, 4456739) N/A ThermoFisher ERCC Spike-Ins
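The table above can be used to confirm that a reference_annotation value falls within the release range listed for the selected reference_genome. The sketch below includes only three rows for brevity; a full validator would carry every row. The names RELEASES_BY_GENOME and annotation_matches_genome are hypothetical.

```python
import re

# Subset of the table above: reference_genome -> (first release, last release).
RELEASES_BY_GENOME = {
    "GRCh38.p14": (110, 117),
    "GRCm39": (103, 117),
    "GRCz11": (92, 117),
}


def annotation_matches_genome(reference_genome: str, reference_annotation: str) -> bool:
    """True if the "Ensembl vN" release lies in the range listed for the genome build."""
    if reference_genome not in RELEASES_BY_GENOME:
        raise KeyError(f"genome build not in lookup table: {reference_genome!r}")
    match = re.fullmatch(r"Ensembl v(\d+)", reference_annotation)
    if match is None:
        return False
    first, last = RELEASES_BY_GENOME[reference_genome]
    return first <= int(match.group(1)) <= last
```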


Sourced from https://github.com/chanzuckerberg/data-guidance/blob/main/standards/sequencing/1.0/cz_seq_schema.md