Work in Progress Sequencing Schema

Document Status: Draft

Version: 1.0.0 (follows Semantic Versioning)

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14, RFC2119, and RFC8174 when, and only when, they appear in all capitals, as shown here.

Background

Sequencing-based profiling of biological conditions plays a crucial role in many flagship projects in the CZ ecosystem, generating rich datasets that support both primary research and secondary applications such as cross-study multi-omic analysis and AI-driven biological model training. To maximize the utility and reusability of sequencing datasets, this document outlines the sequencing data and metadata standards used across the CZ ecosystem to promote consistency, interoperability, and long-term value.

I. Key Terms

Definitions of terms used throughout this schema:

Sample: biological material collected from a donor or source organism, from which assays are performed
Dataset: file(s) generated from a single sequencing run, representing the raw data output from the sequencer, which may include multiple files corresponding to different lanes, read pairs, or libraries multiplexed in the same run
Raw data: unprocessed output from the sequencing platform (typically FASTQ files)
Assay: laboratory method or protocol used to generate data from a sample
Sequencing instrument: hardware device used to perform sequencing
Ontology term: a standardized identifier and label used to define biological or experimental concepts

II. Sample-level Metadata

Sample-level metadata describes the biological source and context of the data, providing the necessary information to find, group, and interpret samples across experiments and platforms. Cross-modality metadata applies across all data types in the CZ ecosystem, including imaging, mass spectrometry, and sequencing. This includes information about the assay used, the developmental stage, organism, tissue, tissue type, and disease context. These fields must be annotated consistently during sample registration and remain stable throughout downstream processing. Sequencing metadata captures general technical parameters required to interpret, reproduce, and process datasets from all sequencing runs, while assay-specific metadata captures details that are unique to certain protocols such as 10x Genomics.

Cross-Modality Metadata Requirements

(See full guidance.)

assay_ontology_term_id

Key	assay_ontology_term_id
Description	Assay that was used to create the dataset
Annotator	Submitter MUST annotate.
Value	List[String]. A List element MUST be an Experimental Factor Ontology (EFO) term such as `"EFO:0022605"`.

development_stage_ontology_term_id

Key	development_stage_ontology_term_id
Description	Development stage(s) of the patients or organisms from which assayed biosamples were derived
Annotator	Submitter MUST annotate.
Value	List[String]. If unavailable, then the List element MUST be `"unknown"`.

For `organism_ontology_term_id`	Value
`"NCBITaxon:6239"` for Caenorhabditis elegans	A List element MUST be `WBls:0000669` for unfertilized egg Ce, the most accurate descendant of `WBls:0000803` for C. elegans life stage occurring during embryogenesis, or the most accurate descendant of `WBls:0000804` for C. elegans life stage occurring post-embryogenesis.
`"NCBITaxon:7955"` for Danio rerio	A List element MUST be the most accurate descendant of `ZFS:0100000` for zebrafish stage excluding `ZFS:0000000` for Unknown.
`"NCBITaxon:7227"` for Drosophila melanogaster	A List element MUST be either the most accurate descendant of `FBdv:00007014` for adult age in days or the most accurate descendant of `FBdv:00005259` for developmental stage excluding `FBdv:00007012` for life stage.
`"NCBITaxon:9606"` for Homo sapiens	A List element MUST be the most accurate descendant of `HsapDv:0000001` for life cycle.
`"NCBITaxon:10090"` for Mus musculus or one of its descendants	A List element MUST be the most accurate descendant of `MmusDv:0000001` for life cycle.

For all other organisms, a List element MUST be the most accurate descendant of UBERON:0000105 for life cycle stage, excluding UBERON:0000071 for death stage.
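The organism-specific rules above can be collected into a lookup table. The following is a minimal sketch, not part of the schema: `STAGE_PREFIX_BY_ORGANISM`, `EXCLUDED_STAGE_TERMS`, and `check_development_stage` are hypothetical names, and the check only compares ontology prefixes. Confirming that a term is a descendant of the required root term would need the ontology itself.

```python
# Hypothetical helper: prefix-level check only. Verifying that a term is a
# descendant of the required root needs the ontology, which is not loaded here.
STAGE_PREFIX_BY_ORGANISM = {
    "NCBITaxon:6239": "WBls",     # Caenorhabditis elegans
    "NCBITaxon:7955": "ZFS",      # Danio rerio
    "NCBITaxon:7227": "FBdv",     # Drosophila melanogaster
    "NCBITaxon:9606": "HsapDv",   # Homo sapiens
    "NCBITaxon:10090": "MmusDv",  # Mus musculus (descendant taxa not handled)
}

# Terms that the rules above explicitly exclude.
EXCLUDED_STAGE_TERMS = {"ZFS:0000000", "FBdv:00007012", "UBERON:0000071"}


def check_development_stage(organism_id: str, stage_id: str) -> bool:
    """Return True if stage_id uses the ontology prefix expected for organism_id."""
    if stage_id == "unknown":
        return True
    if stage_id in EXCLUDED_STAGE_TERMS:
        return False
    # Organisms without a dedicated ontology fall back to UBERON.
    expected = STAGE_PREFIX_BY_ORGANISM.get(organism_id, "UBERON")
    prefix, _, local_id = stage_id.partition(":")
    return prefix == expected and local_id.isdigit()
```

A term that passes this check can still be invalid, for example if it is not the most accurate descendant available.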

disease_ontology_term_id

Key	disease_ontology_term_id
Description	Disease of the patients or organisms from which assayed biosamples were derived
Annotator	Submitter MUST annotate.
Value	List[String]. A List element MUST be one of: `"PATO:0000461"` for normal or healthy; the most accurate descendant of `"MONDO:0000001"` for disease; or `"MONDO:0021178"` for injury or, preferably, its most accurate descendant.

organism_ontology_term_id

Key	organism_ontology_term_id
Description	Organism from which assayed biosamples were derived
Annotator	Submitter MUST annotate.
Value	List[String]. A List element MUST be an NCBI organismal classification term such as `"NCBITaxon:9606"`.

For registration, there MUST be a one-to-one mapping between the List of tissue_ontology_term_id(s) and the List of tissue_type(s). For example, tissue_type[0] MUST be the tissue type for tissue_ontology_term_id[0]. In the Discover API, this is modeled as:

 'tissue': [{'label': 'spleen',
    'ontology_term_id': 'UBERON:0002106',
    'tissue_type': 'tissue'}]
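As an illustration of the positional pairing, the sketch below zips the parallel lists into records shaped like the Discover API example. The function name `build_tissue_records` and its third argument (the human-readable labels) are assumptions made for this example, not part of the schema.

```python
def build_tissue_records(tissue_ids, tissue_types, tissue_labels):
    """Pair each tissue term with its tissue type and label by list position."""
    if not (len(tissue_ids) == len(tissue_types) == len(tissue_labels)):
        raise ValueError("tissue lists must have the same length")
    return [
        {"label": label, "ontology_term_id": term_id, "tissue_type": tissue_type}
        for term_id, tissue_type, label in zip(tissue_ids, tissue_types, tissue_labels)
    ]


records = build_tissue_records(["UBERON:0002106"], ["tissue"], ["spleen"])
print(records)
# [{'label': 'spleen', 'ontology_term_id': 'UBERON:0002106', 'tissue_type': 'tissue'}]
```

Rejecting lists of unequal length up front avoids silently dropping entries, which `zip` would otherwise do.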

tissue_ontology_term_id

Key	tissue_ontology_term_id
Description	Tissues from which assayed biosamples were derived
Annotator	Submitter MUST annotate.
Value	List[String]. If the corresponding tissue_type is `"tissue"` or `"organoid"`, then:

For `organism_ontology_term_id`	Value
`"NCBITaxon:6239"` for Caenorhabditis elegans	A List element MUST be either an UBERON term or the most accurate descendant of `WBbt:0005766` for anatomy excluding `WBbt:0007849` for hermaphrodite, `WBbt:0007850` for male, `WBbt:0008595` for female, `WBbt:0004017` for cell and its descendants, and `WBbt:0006803` for nucleus and its descendants.
`"NCBITaxon:7955"` for Danio rerio	A List element MUST be either an UBERON term or the most accurate descendant of `ZFA:0100000` for zebrafish anatomical entity excluding `ZFA:0001093` for unspecified and `ZFA:0009000` for cell and its descendants.
`"NCBITaxon:7227"` for Drosophila melanogaster	A List element MUST be either an UBERON term or the most accurate descendant of `FBbt:10000000` for anatomical entity excluding `FBbt:00007002` for cell and its descendants.

For all other organisms, a List element MUST be the most accurate descendant of UBERON:0001062 for anatomical entity.

If the corresponding tissue_type is "cell culture", the following Cell Ontology (CL) terms MUST NOT be used:

`"CL:0000255"` for eukaryotic cell
`"CL:0000257"` for Eumycetozoan cell
`"CL:0000548"` for animal cell

For `organism_ontology_term_id`	Value
`"NCBITaxon:6239"` for Caenorhabditis elegans	A List element MUST be either a CL term or the most accurate descendant of `WBbt:0004017` for cell excluding `WBbt:0006803` for nucleus and its descendants.
`"NCBITaxon:7955"` for Danio rerio	A List element MUST be either a CL term or the most accurate descendant of `ZFA:0009000` for cell.
`"NCBITaxon:7227"` for Drosophila melanogaster	A List element MUST be either a CL term or the most accurate descendant of `FBbt:00007002` for cell.

For all other organisms, a List element MUST be a CL term.
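The dependence of the permitted ontologies on tissue_type and organism can be sketched as follows. This is an illustrative prefix-level check under stated assumptions: the names `ORGANISM_ANATOMY_PREFIX`, `allowed_tissue_prefixes`, and `check_tissue_term` are hypothetical, and the descendant and exclusion rules for WBbt, ZFA, and FBbt terms are not enforced because they require the ontologies themselves.

```python
# Organism-specific anatomy ontologies accepted alongside UBERON or CL.
ORGANISM_ANATOMY_PREFIX = {
    "NCBITaxon:6239": "WBbt",  # Caenorhabditis elegans
    "NCBITaxon:7955": "ZFA",   # Danio rerio
    "NCBITaxon:7227": "FBbt",  # Drosophila melanogaster
}

# CL terms that MUST NOT be used when tissue_type is "cell culture".
FORBIDDEN_CELL_CULTURE_TERMS = {"CL:0000255", "CL:0000257", "CL:0000548"}


def allowed_tissue_prefixes(organism_id: str, tissue_type: str) -> set:
    """Return the ontology prefixes permitted for a tissue term."""
    if tissue_type in ("tissue", "organoid"):
        general = "UBERON"
    elif tissue_type == "cell culture":
        general = "CL"
    else:
        raise ValueError(f"unsupported tissue_type: {tissue_type!r}")
    prefixes = {general}
    if organism_id in ORGANISM_ANATOMY_PREFIX:
        prefixes.add(ORGANISM_ANATOMY_PREFIX[organism_id])
    return prefixes


def check_tissue_term(organism_id: str, tissue_type: str, term_id: str) -> bool:
    """Prefix-level check of one tissue_ontology_term_id element."""
    if tissue_type == "cell culture" and term_id in FORBIDDEN_CELL_CULTURE_TERMS:
        return False
    prefix = term_id.split(":", 1)[0]
    return prefix in allowed_tissue_prefixes(organism_id, tissue_type)
```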

tissue_type

Key	tissue_type
Description	Type of tissue from which assayed biosamples were derived
Annotator	Submitter MUST annotate.
Value	List[String]. A List element MUST be one of `"tissue"`, `"organoid"`, or `"cell culture"`.

In addition, all datasets MUST be programmatically annotated with human-readable metadata fields that contain the name each ontology assigns to its term identifier. For example, if assay_ontology_term_id[0] is `"EFO:0022605"`, then assay[0] MUST be `"10x 5' v3"`.

assay

Key	assay
Value	List[String]. A List element MUST be the human-readable name assigned to the corresponding element in `assay_ontology_term_id`.

development_stage

Key	development_stage
Value	List[String]. A List element MUST be `"unknown"` if the corresponding element in `development_stage_ontology_term_id` is `"unknown"`; otherwise, it MUST be the human-readable name assigned to the corresponding element in `development_stage_ontology_term_id`.

disease

Key	disease
Value	List[String]. A List element MUST be the human-readable name assigned to the corresponding element in `disease_ontology_term_id`.

organism

Key	organism
Value	List[String]. A List element MUST be the human-readable name assigned to the corresponding element in `organism_ontology_term_id`.

tissue

Key	tissue
Value	List[String]. A List element MUST be the human-readable name assigned to the corresponding element in `tissue_ontology_term_id`.
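The programmatic annotation described above might look like the sketch below. It is an assumption-laden illustration: `LABELS`, `derive_labels`, and `add_label_fields` are hypothetical names, and the hard-coded label table stands in for a lookup against the actual ontology releases.

```python
# Hypothetical label table; a real pipeline would read labels from the
# ontology releases rather than hard-coding them.
LABELS = {
    "EFO:0022605": "10x 5' v3",
    "UBERON:0002106": "spleen",
    "NCBITaxon:9606": "Homo sapiens",
}


def derive_labels(term_ids, labels=LABELS):
    """Map each term ID to its label; "unknown" passes through unchanged
    (the schema defines this case for development_stage)."""
    return ["unknown" if term == "unknown" else labels[term] for term in term_ids]


def add_label_fields(metadata, labels=LABELS):
    """Return a copy of metadata with one label field per *_ontology_term_id field."""
    suffix = "_ontology_term_id"
    annotated = dict(metadata)
    for key, term_ids in metadata.items():
        if key.endswith(suffix):
            annotated[key[: -len(suffix)]] = derive_labels(term_ids, labels)
    return annotated
```

For instance, `add_label_fields({"assay_ontology_term_id": ["EFO:0022605"]})` adds an `assay` field holding `["10x 5' v3"]`. A term missing from the table raises `KeyError`, which surfaces unmapped identifiers early.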

Sequencing Metadata Requirements

The following requirements apply to all sequencing assays.

sequencing_instrument

Key	sequencing_instrument
Description	Name of the technology platform and instrument used for sequencing
Annotator	Submitter MUST annotate.
Value	String. MUST be an EFO term classified under `EFO:0002699` (e.g., `"Illumina NovaSeq X"`).

sequencing_run_id

Key	sequencing_run_id
Description	Unique identifier for the sequencing run from which the dataset was generated, typically automatically assigned from the instrument
Annotator	Submitter MUST annotate. Use "unavailable" if no run-level information exists.
Value	String. MUST be unique within each sequencing facility. Illumina run IDs MUST follow the format: `YYMMDD_InstrumentSerial#Run#_FlowCell#`

file_type

Key	file_type
Description	Type of file submitted
Annotator	Submitter MUST annotate.
Value	String. MUST be one of `"fastq"`, `"bam"`, `"h5"`, or `"h5ad"`.

III. Additional Assay-Specific Metadata Requirements

10x Genomics single-cell RNA sequencing

This section applies to 10x Genomics 3’ and 5’ single-cell RNA sequencing assays (EFO term: 10x transcription profiling). The typical data processing workflow for these assays begins with generating paired-end FASTQ files, followed by using software such as Cell Ranger to align reads to a reference genome, assign reads to transcripts using a gene annotation file, and produce the main output files: aligned reads (.bam) and gene expression count matrices (.h5). If BAM and/or HDF5 files are included, to ensure that those files are interpretable and reproducible, submitters MUST provide the reference genome, gene annotation, and processing software used during analysis. If neither BAM nor HDF5 files are included, these properties MUST NOT be submitted. These requirements are described below.

reference_genome

Key	reference_genome
Description	Genome name and version used for alignment or quantification
Annotator	Submitter MUST annotate.
Value	String. Ensembl genomes MUST be selected from a list of genome builds (see Appendix A). NCBI assembly accessions MUST follow one of two formats: `[GCA][ _ ][nine digits][.][version number]` or `[GCF][ _ ][nine digits][.][version number]`
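The two NCBI accession formats can be expressed as a single regular expression. The sketch below is one possible check; the names `NCBI_ASSEMBLY_RE` and `is_ncbi_assembly_accession` are hypothetical, and the pattern tests the format only, not whether the accession exists.

```python
import re

# GCA (GenBank) or GCF (RefSeq), underscore, nine digits, dot, version number.
NCBI_ASSEMBLY_RE = re.compile(r"GC[AF]_\d{9}\.\d+")


def is_ncbi_assembly_accession(value: str) -> bool:
    """Return True if value matches the NCBI assembly accession format."""
    return NCBI_ASSEMBLY_RE.fullmatch(value) is not None


print(is_ncbi_assembly_accession("GCA_000001405.29"))  # True
print(is_ncbi_assembly_accession("GCA_00000140.29"))   # False: only eight digits
```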

reference_annotation

Key	reference_annotation
Description	Gene annotation and version used for the reference genome build
Annotator	Submitter MUST annotate.
Value	String. For Ensembl, MUST follow the format `"Ensembl vN"`, where N is the Ensembl release number, with supported values ranging from 75 to 117. The value MUST correspond to the source and version for the selected `reference_genome`.
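A format and range check for Ensembl values of reference_annotation could be sketched as below. The names `ENSEMBL_ANNOTATION_RE` and `parse_ensembl_release` are hypothetical; the bounds 75 and 117 come from the supported range stated above. Correspondence with the selected `reference_genome` is a separate check against Appendix A and is not covered here.

```python
import re

ENSEMBL_ANNOTATION_RE = re.compile(r"Ensembl v(\d+)")
MIN_RELEASE, MAX_RELEASE = 75, 117  # supported release range


def parse_ensembl_release(value: str) -> int:
    """Return the release number from an "Ensembl vN" string, or raise ValueError."""
    match = ENSEMBL_ANNOTATION_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"expected 'Ensembl vN', got {value!r}")
    release = int(match.group(1))
    if not MIN_RELEASE <= release <= MAX_RELEASE:
        raise ValueError(
            f"release {release} is outside {MIN_RELEASE}-{MAX_RELEASE}"
        )
    return release
```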

alignment_software

Key	alignment_software
Description	Name and version of the software used to generate aligned reads
Annotator	Submitter MUST annotate.
Value	String. MUST include software name and version. Examples: `"Cell Ranger v7.1.0"`, `"STARsolo v3.1"`, `"Kallisto v0.51.1"`

IV. Required Ontologies

To standardize collected data, we will follow biologically relevant ontology standards.

Ontology	OBO Prefix	Description
C. elegans Development Ontology	WBls	Standardized ontology of Caenorhabditis elegans developmental stages
C. elegans Gross Anatomy Ontology	WBbt	Standardized ontology of C. elegans anatomical structures
Cell Ontology	CL	Standardized ontology of cell types across animal species
Drosophila Anatomy Ontology	FBbt	Standardized ontology of Drosophila melanogaster anatomical structures
Drosophila Development Ontology	FBdv	Standardized ontology of D. melanogaster developmental stages
Experimental Factor Ontology	EFO	Standardized ontology covering experimental variables, assays, sample attributes, and other related terms in biomedical research
Human Developmental Stages	HsapDv	Standardized ontology of human developmental stages
Mondo Disease Ontology	MONDO	Comprehensive ontology that integrates multiple disease resources into a unified, logically defined vocabulary for human and animal diseases.
Mouse Developmental Stages	MmusDv	Standardized ontology of mouse developmental stages
NCBI organismal classification	NCBITaxon	Standardized taxonomy of organisms
Phenotype and Trait Ontology	PATO	Standardized ontology of phenotypic qualities and traits designed to support the logical representation and cross-species integration of phenotype data
Uberon multi-species anatomy ontology	UBERON	Cross-species ontology covering anatomical structures in animals. It provides a standardized framework for describing anatomical entities and their relationships across different species.
Zebrafish Anatomy Ontology	ZFA, ZFS	Standardized ontology of Danio rerio anatomical structures (ZFA) and developmental stages (ZFS)

V. Required File Formats

This section describes the required file formats for sequencing outputs, including raw reads and processed results. FASTQ files require sequencing_run_id, sequencing_instrument, and file_type. BAM, H5, and H5AD files require reference_genome, reference_annotation, alignment_software, and file_type.
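The per-format field requirements can be written as a small lookup. This is a minimal sketch: `REQUIRED_FIELDS` and `missing_fields` are hypothetical names, and it assumes the instrument field for FASTQ files is the `sequencing_instrument` key defined in Section II.

```python
ALIGNMENT_FIELDS = {"reference_genome", "reference_annotation", "alignment_software"}

# Required metadata keys per file_type value.
# Assumption: the FASTQ instrument field is `sequencing_instrument`.
REQUIRED_FIELDS = {
    "fastq": {"sequencing_run_id", "sequencing_instrument", "file_type"},
    "bam": ALIGNMENT_FIELDS | {"file_type"},
    "h5": ALIGNMENT_FIELDS | {"file_type"},
    "h5ad": ALIGNMENT_FIELDS | {"file_type"},
}


def missing_fields(record: dict) -> list:
    """Return the sorted required keys that are absent from a metadata record."""
    file_type = record.get("file_type")
    if file_type not in REQUIRED_FIELDS:
        raise ValueError(f"unsupported file_type: {file_type!r}")
    return sorted(REQUIRED_FIELDS[file_type] - set(record))


print(missing_fields({"file_type": "bam", "reference_genome": "GRCh38.p14"}))
# ['alignment_software', 'reference_annotation']
```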

FASTQ

Key	file_type (fastq)
Description	Raw sequencing read data, including base calls and quality scores
Value	String. File must be in standard FASTQ format, typically gzip-compressed (.fastq.gz). Must contain 4-line entries per read. For paired-end data, R1 and R2 files must be clearly labeled. Files should be parsable with standard tools such as FastQC, seqtk, or fastp.
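A lightweight structural check of the 4-line record layout, using only the Python standard library, might look like the sketch below. The function name `check_fastq_records` and the `max_records` limit are assumptions for this example; it does not replace tools such as FastQC, seqtk, or fastp.

```python
import gzip


def check_fastq_records(path: str, max_records: int = 1000) -> int:
    """Check up to max_records FASTQ entries; return the number of entries read."""
    opener = gzip.open if path.endswith(".gz") else open
    count = 0
    with opener(path, "rt") as handle:
        while count < max_records:
            header = handle.readline()
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            separator = handle.readline()
            quality = handle.readline().rstrip("\n")
            if not header.startswith("@"):
                raise ValueError(f"record {count + 1}: header must start with '@'")
            if not separator.startswith("+"):
                raise ValueError(f"record {count + 1}: third line must start with '+'")
            if len(sequence) != len(quality):
                raise ValueError(
                    f"record {count + 1}: sequence and quality lengths differ"
                )
            count += 1
    return count
```

Sampling only the first records keeps the check fast on large files, at the cost of missing truncation or corruption further in.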

BAM

Key	file_type (bam)
Description	Aligned sequence data mapped to a reference genome
Value	String. File must be in BAM format (binary SAM) with optional .bai index file. File must be sorted and contain a header with reference genome and aligner metadata. For 10x data, cell barcode and UMI tags (e.g., CB, UB) should be included. Must be readable with samtools or equivalent tools.

H5

Key	file_type (h5)
Description	Hierarchical data file in HDF5 format used for structured outputs
Value	String. File must be a valid HDF5 (`.h5`) file that complies with the HDF5 v1.8+ standard. The internal structure of the file MUST be documented and, if it follows a known format (e.g., kallisto output, loom), it MUST conform to the schema expected by that tool. The file MUST be readable using standard HDF5 utilities such as `h5py`, `h5dump`, or HDFView.

H5AD

Key	file_type (h5ad)
Description	Hierarchical data file in AnnData (`.h5ad`) format
Value	String. File must be a valid `.h5ad` file, which uses the HDF5 format to store data according to the AnnData schema. It MUST include the `X` matrix (expression data), along with `obs` (cell-level metadata) and `var` (gene-level metadata). The file MUST be readable using standard Python tools such as `scanpy.read_h5ad()` or `anndata.read_h5ad()`.

Appendix A

Accepted values for reference_genome and reference_annotation are listed in this section and apply only to Ensembl genome builds. For non-Ensembl genomes, values must follow the specified format rather than being selected from a predefined list.

Species	reference_genome	reference_annotation	Source	Genome Release Dates
Homo sapiens	GRCh38.p14	Ensembl v110-v117	Ensembl	Jul 2023
Homo sapiens	GRCh38.p13	Ensembl v98-v109	Ensembl	Sep 2019
Homo sapiens	GRCh38.p12	Ensembl v92-v97	Ensembl	Apr 2018
Homo sapiens	GRCh38.p10	Ensembl v88-v91	Ensembl	Mar 2017
Homo sapiens	GRCh38.p7	Ensembl v85-v87	Ensembl	Jul 2016
Homo sapiens	GRCh38.p5	Ensembl v83-v84	Ensembl	Dec 2015
Homo sapiens	GRCh38.p3	Ensembl v81-v82	Ensembl	Jul 2015
Homo sapiens	GRCh38.p2	Ensembl v79-v80	Ensembl	Mar 2015
Homo sapiens	GRCh38	Ensembl v76-v78	Ensembl	Aug 2014
Homo sapiens	GRCh37.p13	Ensembl v75-v117	Ensembl	Dec 2013
Caenorhabditis elegans	WBcel235	Ensembl v85-v117	Ensembl	Jul 2016
Callithrix jacchus	mCalJac1.pat.X	Ensembl v105-v117	Ensembl	Dec 2021
Danio rerio	GRCz11	Ensembl v92-v117	Ensembl	Apr 2018
Danio rerio	GRCz10	Ensembl v85-v91	Ensembl	Jul 2016
Drosophila melanogaster	BDGP6.46	Ensembl v110-v113	Ensembl	Jul 2023
Drosophila melanogaster	BDGP6.54	Ensembl v114-v117	Ensembl	May 2025
Gorilla gorilla gorilla	gorGor4	Ensembl v91-v117	Ensembl	Dec 2017
Macaca fascicularis	Macaca_fascicularis_6.0	Ensembl v103-v117	Ensembl	Feb 2021
Macaca mulatta	Mmul_10	Ensembl v98-v117	Ensembl	Sep 2019
Microcebus murinus	Mmur_3.0	Ensembl v91-v117	Ensembl	Dec 2017
Mus musculus	GRCm39	Ensembl v103-v117	Ensembl	Feb 2021
Mus musculus	GRCm38.p6	Ensembl v92-v102	Ensembl	Apr 2018
Mus musculus	GRCm38.p5	Ensembl v87-v91	Ensembl	Dec 2016
Oryctolagus cuniculus	OryCun2.0	Ensembl v85-v117	Ensembl	Jul 2016
Pan troglodytes	Pan_tro_3.0	Ensembl v91-v117	Ensembl	Dec 2017
Rattus norvegicus	GRCr8	Ensembl v114-v117	Ensembl	May 2025
Rattus norvegicus	mRatBN7.2	Ensembl v105-v113	Ensembl	Dec 2021
SARS-CoV-2	ASM985889v3	N/A	Ensembl	Apr 2020
Sus scrofa	Sscrofa11.1	Ensembl v90-v114	Ensembl	Aug 2017
synthetic construct	ThermoFisher ERCC RNA Spike-In Control Mixes (Cat # 4456740, 4456739)	N/A	ThermoFisher ERCC Spike-Ins
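To illustrate how the table can be used to check that a reference_annotation value corresponds to the selected reference_genome, the sketch below copies a few rows into a lookup. The names `ENSEMBL_RELEASES_BY_GENOME` and `annotation_matches_genome` are hypothetical, only five rows are included, and the table above remains the authority.

```python
# A few rows copied from the table above: genome build -> (first, last)
# supported Ensembl release. Not a complete list.
ENSEMBL_RELEASES_BY_GENOME = {
    "GRCh38.p14": (110, 117),
    "GRCh38.p13": (98, 109),
    "GRCm39": (103, 117),
    "GRCz11": (92, 117),
    "WBcel235": (85, 117),
}


def annotation_matches_genome(reference_genome: str, reference_annotation: str) -> bool:
    """Return True if the Ensembl release is listed for the given genome build."""
    first, last = ENSEMBL_RELEASES_BY_GENOME[reference_genome]
    prefix = "Ensembl v"
    if not reference_annotation.startswith(prefix):
        return False
    release = reference_annotation[len(prefix):]
    if not (release.isascii() and release.isdigit()):
        return False
    return first <= int(release) <= last


print(annotation_matches_genome("GRCh38.p14", "Ensembl v110"))  # True
print(annotation_matches_genome("GRCh38.p13", "Ensembl v110"))  # False
```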

Sourced from https://github.com/chanzuckerberg/data-guidance/blob/main/standards/sequencing/1.0/cz_seq_schema.md