Work in Progress Mass Spectrometry Schema
Contact: carlos.gonzalez@czbiohub.org
Document Status: Draft
Version: 1.0.0
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14, RFC2119, and RFC8174 when, and only when, they appear in all capitals, as shown here.
Schema versioning
The Chan Zuckerberg Biohub Mass Spectrometry Platform (CZB-MS) schema version is based on Semantic Versioning.
Major version is incremented when schema updates are incompatible with the MAGE-TAB Proteomics SDRF (SDRF) encodings or incompatible with current pipeline components. Examples include:
Renaming metadata fields
Deprecating metadata fields
Changing the type or format of a metadata field
Loss of metadata fields from mass spectrometer data output
Significant loss in metadata fields previously reported by processing software (e.g., FragPipe, DIA-NN, MaxQuant) due to changes to their schemas
Changes to schemas of input objects such as FASTA file encodings
Minor version is incremented when schema updates may require changes only to the CZB-MS metadata acquisition systems (BS3, BulkLoader). Examples include:
Additions to SDRF metadata fields
Updating pinned ontologies or gene references
Changing the validation requirements for a metadata field
Changes to the source of metadata (e.g., source1 -> source2 or the inverse)
Patch version is incremented for editorial updates and when adding organisms that do not require new metadata fields.
All changes are documented in the schema Changelog.
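The versioning rules above can be sketched as a small helper. This is a minimal illustration, not part of the CZB-MS tooling: the `Change` enum and `bump` function are hypothetical names, and the three categories are a paraphrase of the major/minor/patch rules in this section.

```python
from enum import Enum


class Change(Enum):
    """Hypothetical change categories paraphrasing the rules above."""

    INCOMPATIBLE = "major"      # e.g., renaming or deprecating a metadata field
    ACQUISITION_ONLY = "minor"  # e.g., adding an SDRF field, updating a pinned ontology
    EDITORIAL = "patch"         # e.g., editorial update, organism without new fields


def bump(version: str, change: Change) -> str:
    """Return the next semantic version for the given change category."""
    major, minor, patch = (int(part) for part in version.split("."))
    if change is Change.INCOMPATIBLE:
        return f"{major + 1}.0.0"
    if change is Change.ACQUISITION_ONLY:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


print(bump("1.0.0", Change.ACQUISITION_ONLY))  # 1.1.0
```

Lower-order components reset to zero on a higher-order bump, as Semantic Versioning specifies.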
Background
CZB-MS aims to support the consistent generation, sharing, and exploration of mass spectrometry data across the Biohub Network (BHN) and beyond. In keeping with this goal, we seek to unify all aspects of data collection, storage, feature annotation, and pre-processing efforts across BHN mass spectrometry sites.
To accomplish these goals, all mass spectrometer sites across BHN are REQUIRED to record metadata about samples, experiments, and projects acquired on BHN mass spectrometers. This document describes the schema, a type of contract and blueprint, that CZB-MS requires all datasets to adhere to in order to enable findability and downstream integration of datasets into analyses and models.
Overview
This schema supports multiple levels of metadata including sample, experiment, and project. Each level records various aspects of metadata pertinent to understanding a data set and its larger context within a corpus of data.
This document is organized into two parts, mass spectrometry-specific metadata and sample/experiment/project-focused metadata, and has the following sections:
Definitions: defines terms for greater document clarity
Ontologies: defines the ontologies used to describe data
Mass spectrometer-specific files: description of files used for the purposes of mass spectrometry data acquisition and spectral searching (e.g., search engines)
Sample-level metadata: description of categories collected for each sample, includes cross-modality mapping.
Experiment- and project-level metadata: description of categories collected for a group of experiments and samples
Post-search output tables: description of files generated post spectral alignment
Additional Important Notes
Redundant Metadata. It is RECOMMENDED to avoid multiple metadata fields containing identical or similar information. However, for the purposes of the SDRF this is often unavoidable, as sample characteristics are often also experimental factors (for example, characteristic[treatment] is both a sample property and an experimental factor). Additionally, while the original SDRF allowed for columns with the same characteristics term, the CZB-MS subset of SDRF MUST NOT have multiple columns with the same characteristics. Instead, we RECOMMEND the use of distinguishing terms derived from any of the ontology sources listed below.
No Personally Identifiable Information (PII). This is not strictly enforced by validation because it is difficult for software to predict what is and is not PII; however, curators MUST agree to the data acquisition policies of CZB-MS, which include the requirement to remove direct personal identifiers of study subjects in metadata.
A note on types. The types below are Python 3 types. Note that a Python 3 str is a sequence of Unicode code points, which is stored in UTF-8-encoded tab/comma-separated files.
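As a rough illustration of two of the notes above, the sketch below reads a tab-separated header as UTF-8 text and reports any characteristics column that appears more than once. The function name is hypothetical, and treating column names as case-insensitive is an assumption rather than a stated rule.

```python
import csv
import io
from collections import Counter


def duplicate_characteristics(header: list[str]) -> list[str]:
    """Return characteristics[...] column names that occur more than once."""
    counts = Counter(
        name.strip().lower()  # assumption: names compare case-insensitively
        for name in header
        if name.strip().lower().startswith("characteristics[")
    )
    return sorted(name for name, n in counts.items() if n > 1)


# In practice the text would come from open(path, encoding="utf-8", newline="").
text = "source name\tcharacteristics[organism]\tcharacteristics[organism]\n"
header = next(csv.reader(io.StringIO(text), delimiter="\t"))
print(duplicate_characteristics(header))  # ['characteristics[organism]']
```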
Definitions
Sample-level metadata. This refers to tabular metadata that describes a specific mass spectrometry sample. It is recorded in the SDRF format derived by the Human Proteome Organization’s Proteomic Standards Initiative working group.
Experiment-level metadata. This refers to metadata that records information about a specific experiment, which is defined as a group of samples. This is recorded using a tabular format described below.
Project-level metadata. This refers to metadata that records information about a specific project, which is defined as a group of experiments all aimed at a broader scientific goal. This is recorded using a tabular format described below.
Ontologies
Ontologies used for mass spectrometry data are pinned for this version/release of the schema:
Ontology | OBO Prefix | Version/Release | Download |
---|---|---|---|
EFO | | | |
CL | | | |
NCBITaxon | | | |
UBERON | | | |
NCIT | | | |
MS | | | |
MONDO | | | |
PRIDE | | | |
PATO | | | |
HANCESTRO | | | |
HsapDV | | | |
MmusDv | | | |
ZFA | | | |
ZFS | | | |
WBLS | | | |
WBbt | | | |
FBbt | | | |
FBdv | | | |
Cross-modality mapping
This refers specifically to how ontology terms from tables/fields defined in this sample-level metadata table below map to cross-modality ontology schema.
CZB-MS | CZI Crossmodal | Matching Ontology? |
---|---|---|
comment[technology type] | assay | Yes (EFO) |
factor value[technology_type_id] | assay_ontology_term_id | Yes (EFO) |
characteristics[disease] | disease | Yes (MONDO, PATO) |
factor value[disease_ontology_term_id] | disease_ontology_term_id | Yes (MONDO, PATO) |
characteristics[organism] | organism | Yes (NCBITaxon) |
factor value[organism_ontology_term_id] | organism_ontology_term_id | Yes (NCBITaxon) |
characteristics[developmental_stage] | development_stage | Yes (HsapDV, MmusDv, ZFS, WBLS, FBdv) |
factor value[developmental_stage_ontology_term_id] | development_stage_ontology_term_id | Yes (HsapDV, MmusDv, ZFS, WBLS, FBdv) |
characteristics[organism part] | tissue | Yes (UBERON, ZFA, FBbt, WBbt) |
factor value[organism_part_ontology_term_id] | tissue_ontology_term_id | Yes (UBERON, ZFA, FBbt, WBbt) |
factor value[tissue_class] | tissue_type | Yes (NA) |
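One way to apply this mapping is as a plain rename dictionary. The sketch below covers only a subset of the rows above; the function name and the sample row values are illustrative assumptions, and dropping unmapped columns is a design choice of this example, not a requirement of the schema.

```python
# Subset of the CZB-MS -> CZI cross-modal pairs listed in the table above.
CROSSMODAL_NAMES = {
    "characteristics[disease]": "disease",
    "factor value[disease_ontology_term_id]": "disease_ontology_term_id",
    "characteristics[organism]": "organism",
    "factor value[organism_ontology_term_id]": "organism_ontology_term_id",
    "characteristics[organism part]": "tissue",
    "factor value[organism_part_ontology_term_id]": "tissue_ontology_term_id",
    "factor value[tissue_class]": "tissue_type",
}


def to_crossmodal(row: dict[str, str]) -> dict[str, str]:
    """Rename mapped columns; columns without a mapping are dropped."""
    return {CROSSMODAL_NAMES[k]: v for k, v in row.items() if k in CROSSMODAL_NAMES}


row = {
    "source name": "sample_01",  # hypothetical sample name
    "characteristics[organism]": "Homo sapiens",
    "factor value[organism_ontology_term_id]": "NCBITaxon:9606",
}
print(to_crossmodal(row))
# {'organism': 'Homo sapiens', 'organism_ontology_term_id': 'NCBITaxon:9606'}
```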
Mass spectrometer-specific files
Mass spectrometers use several file types as inputs to processing pipelines, which range from protein sequences files (.FASTA files) to workflows and parameter files. Below are the specifications for files that may have an impact on downstream analyses.
FASTA files
FASTA files are plain-text files, analogous to genome files for sequencing, that contain a standardized header and a protein (amino acid) sequence, which allows collected spectra to be assigned to peptides and proteins. For CZB-MS, we STRONGLY RECOMMEND that all FASTA files follow the standard UniProt format. Doing so allows for extensive tool development; failing to adhere to this standard will limit analysis options. The following example FASTA entry illustrates the format:
>sp|P05067|A4_HUMAN Amyloid-beta precursor protein OS=Homo sapiens OX=9606 GN=APP PE=1 SV=3
MLPGLALLLLAAWTARALEVPTDGNAGLLAEPQIAMFCGRLNMHMNVQNGKWDSDPSGTK
EFVSDALLVPDKCKFLHQERMDVCETHLHWHTVAKETCSEKST…(truncated)
>: Indicates the beginning of a FASTA header line.
sp: Source database.
P05067: This is the UniProt accession number
A4_HUMAN: The UniProt entry name, also known as the mnemonic.
Amyloid-beta precursor protein: The full name of the protein.
OS=Homo sapiens: Organism Species name.
OX=9606: Organism taxonomy identifier.
GN=APP: Gene Name.
PE=1: Protein Existence evidence level.
SV=3: Sequence Version.
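The header fields listed above can be extracted with a regular expression. This is a minimal sketch that assumes the field order shown in the example entry; the pattern and function name are illustrative, and the pattern treats GN as optional since not every entry carries a gene name.

```python
import re

# Field order follows the example header above; GN= is treated as optional.
UNIPROT_HEADER = re.compile(
    r"^>(?P<db>sp|tr)\|(?P<accession>[^|]+)\|(?P<entry_name>\S+)\s+"
    r"(?P<protein_name>.+?)\s+OS=(?P<organism>.+?)\s+OX=(?P<taxon_id>\d+)"
    r"(?:\s+GN=(?P<gene>\S+))?\s+PE=(?P<existence>\d)\s+SV=(?P<version>\d+)$"
)


def parse_uniprot_header(line: str) -> dict:
    """Split a UniProt-style FASTA header line into its named fields."""
    match = UNIPROT_HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniProt-style header: {line!r}")
    return match.groupdict()


header = (
    ">sp|P05067|A4_HUMAN Amyloid-beta precursor protein "
    "OS=Homo sapiens OX=9606 GN=APP PE=1 SV=3"
)
fields = parse_uniprot_header(header)
print(fields["accession"], fields["gene"], fields["taxon_id"])  # P05067 APP 9606
```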
This format is present in all FASTA files downloaded from UniProt.org and is STRONGLY RECOMMENDED. To accommodate custom FASTA files, we have developed the following system for naming FASTA files:
Standard (“Base”) FASTA filename conventions:
Ensures consistent searches across common organisms across CZB-MS.
Serves as a common base for concatenated FASTA files.
Focuses on single species.
Format: DownloadDate_source_TaxonName_additionalAttributes_standard.fasta
“AdditionalAttributes” can be expanded using dashes to include extra information.
Each section uses camelCase (e.g., HomoSapiens).
The scientific taxon name will be used as the primary source identifier
Date should be formatted using ISO8601 standards (YYYYMMDD).
Examples:
20240727_uniprot_HomoSapiens_Swissprot_standard.fasta
20240624_uniprot_DanioRerio_Swissprot-Trembl_standard.fasta
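A standard file name can be assembled from its parts as follows. This is a sketch under stated assumptions: the function name is hypothetical, and the taxon section is built by capitalizing each word of the scientific name and joining them, which reproduces the examples above.

```python
from datetime import date


def build_standard_fasta_name(
    downloaded: date, source: str, taxon: str, attributes: list[str]
) -> str:
    """Build DownloadDate_source_TaxonName_additionalAttributes_standard.fasta."""
    # "Homo sapiens" -> "HomoSapiens"
    taxon_part = "".join(word.capitalize() for word in taxon.split())
    # Extra attributes are joined with dashes, per the convention above.
    attribute_part = "-".join(attributes)
    return f"{downloaded:%Y%m%d}_{source}_{taxon_part}_{attribute_part}_standard.fasta"


print(build_standard_fasta_name(date(2024, 7, 27), "uniprot", "Homo sapiens", ["Swissprot"]))
# 20240727_uniprot_HomoSapiens_Swissprot_standard.fasta
```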
Custom FASTA filename conventions:
Provides flexibility in searching sequences, with the added benefit of being concatenated from a common sequence source.
Sequestered to a custom folder to be reusable without intermingling with standard sequences.
Date corresponding to file creation date.
Sections use dashes to include additional data sources.
Custom files may have relevant information appended to the “AdditionalAttributes” section, such as custom sequences, concatenated versions of standard FASTA files, or a combination thereof.
Format: CreationDate_sourceTaxa_additionalAttributes_custom.fasta
As before, “AdditionalAttributes” can be extended by using dashes.
Date should be formatted using ISO8601 standards (YYYYMMDD).
Examples:
20240730_uniprot-Cov2Sequencing_HomoSapiens-COV2_CowContam-rev-unrev_custom.fasta
20240520_refseq-wgc_HomoSapiens-Cohort34_TLG34_custom.fasta
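Custom names can be taken apart by splitting on underscores and then on dashes. The sketch below assumes, based on the examples rather than the format string, that sources and taxa are separate underscore-delimited sections; the function name is hypothetical, and the date is only checked for being eight digits.

```python
def parse_custom_fasta_name(filename: str) -> dict:
    """Split a custom FASTA file name into date, sources, taxa, and attributes."""
    stem, _, extension = filename.rpartition(".")
    sections = stem.split("_")
    if extension != "fasta" or len(sections) != 5 or sections[-1] != "custom":
        raise ValueError(f"not a custom FASTA file name: {filename!r}")
    created, sources, taxa, attributes, _ = sections
    if not (len(created) == 8 and created.isdigit()):
        raise ValueError(f"date is not YYYYMMDD: {created!r}")
    return {
        "date": created,
        "sources": sources.split("-"),
        "taxa": taxa.split("-"),
        "attributes": attributes.split("-"),
    }


parsed = parse_custom_fasta_name(
    "20240520_refseq-wgc_HomoSapiens-Cohort34_TLG34_custom.fasta"
)
print(parsed["sources"], parsed["taxa"])  # ['refseq', 'wgc'] ['HomoSapiens', 'Cohort34']
```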
Spectral Libraries - Proteomics
Spectral libraries MUST be formatted in a similar manner to those of custom FASTA files. Since proteomic spectral libraries are sourced from standard FASTA files, the source FASTA file MUST be referenced via a combination of the source acronym and species name. Acceptable source acronyms are:
UP = Uniprot
RS = RefSeq
Alternatively, the FASTA database indexID can be used:
Formats:
Date_FastaID_taxa_AdditionalInfo_custom.tsv or .speclib
Date_sources_taxa_AdditionalInfo_custom.tsv or .speclib
Date should be formatted using ISO8601 standards (YYYYMMDD).
Examples:
20240345_23-43-45_HomoSapiens-EscherichiaColi-DanioRerio_SISO003_custom.tsv
20240523_UP-UP_DanioRerio-KlebsiellaPneumoniae_SIFA002_custom.speclib
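A spectral library name can be checked against these formats and its sources paired with its taxa. This sketch assumes one source (acronym or FASTA index ID) per taxon, which is inferred from the examples and not stated as a rule; the pattern and function name are illustrative, and the date is only checked for being eight digits.

```python
import re

SPECLIB_NAME = re.compile(
    r"^(?P<date>\d{8})_(?P<sources>[A-Za-z0-9-]+)_(?P<taxa>[A-Za-z0-9-]+)"
    r"_(?P<info>[A-Za-z0-9-]+)_custom\.(?P<ext>tsv|speclib)$"
)


def sources_by_taxon(filename: str) -> dict[str, str]:
    """Pair each taxon in a spectral library name with its FASTA source."""
    match = SPECLIB_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a spectral library file name: {filename!r}")
    sources = match.group("sources").split("-")
    taxa = match.group("taxa").split("-")
    if len(sources) != len(taxa):  # assumption: one source per taxon
        raise ValueError("number of sources and taxa differ")
    return dict(zip(taxa, sources))


print(sources_by_taxon(
    "20240523_UP-UP_DanioRerio-KlebsiellaPneumoniae_SIFA002_custom.speclib"
))
# {'DanioRerio': 'UP', 'KlebsiellaPneumoniae': 'UP'}
```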
Spectral Libraries - Metabolomics
Metabolites have little to no species specificity and thus do not require species-specific spectral libraries. CZB-MS metabolite spectral libraries come from local and global sources.
Local metabolite spectra: Spectra collected from authentic standards using conserved chromatography (e.g., Biohub-HILIC or Biohub-C18), allowing the highest confidence annotations by matching retention time, MS1, and MS2 spectra.
Global metabolite spectra: generated from external sources without known retention times, resulting in lower confidence annotations. See metabolomics standards initiative publications (2007, 2014) for more details.
These files MUST adhere to the following convention: ModeLibrary_Chromatography_VersionDate.msp
Example:
negMSP_HILIC_Oct2023.msp
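The convention can be checked with a short pattern. This is a sketch only: it assumes the mode is written as a `pos` or `neg` prefix and that the version date is a three-letter month followed by a four-digit year, both inferred from the single example above.

```python
import re

# Assumptions: mode is "pos" or "neg"; version date looks like "Oct2023".
METABOLITE_LIBRARY = re.compile(
    r"^(?P<mode>pos|neg)(?P<library>[A-Za-z0-9]+)_(?P<chromatography>[A-Za-z0-9-]+)"
    r"_(?P<version_date>[A-Z][a-z]{2}\d{4})\.msp$"
)


def parse_metabolite_library_name(filename: str) -> dict[str, str]:
    """Split ModeLibrary_Chromatography_VersionDate.msp into its parts."""
    match = METABOLITE_LIBRARY.match(filename)
    if match is None:
        raise ValueError(f"not a metabolite library file name: {filename!r}")
    return match.groupdict()


print(parse_metabolite_library_name("negMSP_HILIC_Oct2023.msp"))
# {'mode': 'neg', 'library': 'MSP', 'chromatography': 'HILIC', 'version_date': 'Oct2023'}
```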
Spectral Libraries - Lipidomics
Spectral libraries for lipidomics are LBM2 (.lbm2) files embedded in the MS-DIAL version used. They are date- and time-stamped at the time of processing. Formatting MUST adhere to the following convention: year_month_day_hour_minute_second_Loaded.msp2
Example:
2024_8_5_14_14_20_Loaded.msp2
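The time stamp in such a file name can be recovered as a `datetime`. This is a minimal sketch with a hypothetical function name; it relies on the six underscore-separated numeric fields shown in the convention above, which are not zero-padded in the example.

```python
from datetime import datetime


def loaded_library_timestamp(filename: str) -> datetime:
    """Parse year_month_day_hour_minute_second_Loaded.msp2 into a datetime."""
    suffix = "_Loaded.msp2"
    if not filename.endswith(suffix):
        raise ValueError(f"unexpected file name: {filename!r}")
    parts = filename[: -len(suffix)].split("_")
    if len(parts) != 6:
        raise ValueError(f"expected six time-stamp fields: {filename!r}")
    year, month, day, hour, minute, second = (int(part) for part in parts)
    return datetime(year, month, day, hour, minute, second)


print(loaded_library_timestamp("2024_8_5_14_14_20_Loaded.msp2"))  # 2024-08-05 14:14:20
```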
Sample-level Metadata
Sample-level metadata is defined as a CSV file containing the sample-specific characteristics for each file in an experiment. The schema used here is the Proteomics-optimized Sample-Data-Relationship Format (SDRF), created by the Human Proteome Organization’s Proteomic Standards Initiative working group to serve as a common platform to store sample-level metadata. The full format is described in the SDRF documentation; only the pertinent details are given here. It is broken down into three distinct sections:
Sample Characteristics (characteristics[attribute])
Sample characteristics refer to the intrinsic properties of a sample such as its origin, cell type, species origin, etc. Each header column beyond ‘source name’ is formatted in the following manner: characteristics[attribute]. It is REQUIRED to contain the following characteristics at a minimum:
Column | Description | Constraints and Comments |
---|---|---|
source name | Name restricted to a single source sample. Can be entered in multiple rows due to fractions and technical replicates. | String |
characteristics[organism] | NCBI-derived taxonomy term | String |
characteristics[disease] | MONDO-derived disease term | String |
characteristics[development_stage] | Refers to discrete organismal developmental stage | String |
characteristics[tissue type] | Refers to CZI-specific tissue type term | String |
characteristics[organism part] | Refers to source organ or tissue, as noted by UBERON ontology. In the case of cell lines, refer to the original tissue type. | String |
characteristics[cell type] | Refers to the ‘type’ of ontology-driven cell (e.g., columnar, cuboidal, epithelial etc.) | String |
characteristics[biological replicate] | Refers to the biological replicate. | String |
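Column headers of the form category[attribute] can be split into their two parts, which is useful when routing columns to the three SDRF sections described in this chapter. The sketch below is illustrative: the function name and the `"plain"` label for unbracketed headers such as source name are assumptions.

```python
import re

# The three bracketed SDRF header categories described in this section.
SDRF_HEADER = re.compile(r"^(characteristics|comment|factor value)\[(.+)\]$")


def split_header(column: str) -> tuple[str, str]:
    """Return (category, attribute); unbracketed headers get category 'plain'."""
    name = column.strip()
    match = SDRF_HEADER.match(name)
    if match:
        return match.group(1), match.group(2)
    return "plain", name


print(split_header("characteristics[organism part]"))  # ('characteristics', 'organism part')
print(split_header("source name"))                     # ('plain', 'source name')
```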
Data File Characteristics (comment[attribute])
Data file characteristics refer to technical properties of the sample, often reflecting file names, sample processing agents, and spectral searching parameters. Data file characteristics are formatted in the following manner: comment[attribute]. It is REQUIRED to contain the following characteristics at a minimum:
Column | Description | Constraints and Comments |
---|---|---|
assay name | Unique run identifier applied by CZB-MS pipeline. | String |
comment[fraction identifier] | Fraction number for a corresponding entry in ‘source name’ | Numeric |
comment[label] | Refers to a sample’s labeling strategies such as isobaric labeling, SILAC, iTRAQ, etc. Non-labeled samples are also included in this category | String |
comment[data file] | Name of data file generated by CZB-MS pipeline | String |
comment[instrument] | Name of instrument the sample was captured on | String |
technology type | Refers to ontology terms used that describe the assay technology | String |
Experimental factors (factor value[attribute])
Experimental factor values refer to categories of metadata dealing with experimental variables and are the target for downstream analyses. Under normal circumstances, factor values are directly duplicated from characteristics[attribute] columns designated by the experimentalist (for example, characteristics[disease] is equivalent to the experimental group factor value[disease]). Experimental factors are formatted as factor value[attribute]. Experimental factors by design do not follow any ontologies other than what they inherit from characteristics. However, we have co-opted factor values to store additional ontology term ID information in order to adhere to CZ-wide cross-modality requirements, and thus the following factors are REQUIRED:
Column | Description | Constraints and Comments |
---|---|---|
factor value[assay_ontology_term_id] | comment[technology type] label’s associated ontology term ID OR duplicate label | String |
factor value[development_stage_ontology_term_id] | characteristics[developmental stage] associated ontology term ID | String |
factor value[disease_ontology_term_id] | characteristics[disease] associated ontology term ID | String |
factor value[organism_ontology_term_id] | characteristics[organism] related term ID | String |
factor value[tissue_ontology_term_id] | characteristics[tissue type] referencing or cellular compartment | String |
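A header can be checked against the REQUIRED columns from the three tables in this section with a set difference. This is a sketch, not the CZB-MS validator: the column names are copied from the tables above (with characteristics[tissue type] spelled out in full), and exact string matching is an assumption.

```python
# REQUIRED columns copied from the characteristics, comment, and factor value tables.
REQUIRED_COLUMNS = {
    "source name",
    "characteristics[organism]",
    "characteristics[disease]",
    "characteristics[development_stage]",
    "characteristics[tissue type]",
    "characteristics[organism part]",
    "characteristics[cell type]",
    "characteristics[biological replicate]",
    "assay name",
    "comment[fraction identifier]",
    "comment[label]",
    "comment[data file]",
    "comment[instrument]",
    "technology type",
    "factor value[assay_ontology_term_id]",
    "factor value[development_stage_ontology_term_id]",
    "factor value[disease_ontology_term_id]",
    "factor value[organism_ontology_term_id]",
    "factor value[tissue_ontology_term_id]",
}


def missing_required(header: list[str]) -> list[str]:
    """Return the REQUIRED column names absent from an SDRF header."""
    return sorted(REQUIRED_COLUMNS - set(header))


header = sorted(REQUIRED_COLUMNS - {"comment[label]"})
print(missing_required(header))  # ['comment[label]']
```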
CZB MS Platform Specific Categories
In addition to the standard metadata categories (characteristics, comment, and factor value), we append additional, non-SDRF-validatable columns that record information critical to running the mass spectrometers but are not included in SDRFs recorded in the forthcoming Biohub Metadata Portal (i.e., they are removed from outputs). These are labeled BL[term].
Column | Constraints and Comments |
---|---|
BL[sample_description] | String |
BL[preparation description] | String |
BL[fraction description] | String |
BL[notes] | String |
BL[vessel format] | String |
BL[form] | String |
BL[quantity submitted] | Numeric |
BL[unit] | String |
BL[plate id] | String |
BL[well position] | String |
BL[injection volume] | Numeric |
BL[experiment alias] | String |
BL[method file] | String |
BL[fasta] | String |
BL[workflow] | String |
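Since BL[term] columns are removed from outputs, one way to strip them is to filter on the `BL[` prefix. This is a minimal sketch; the function name and the example row values are hypothetical.

```python
def drop_platform_columns(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Return copies of the rows without platform-specific BL[...] columns."""
    return [
        {name: value for name, value in row.items() if not name.startswith("BL[")}
        for row in rows
    ]


rows = [
    {
        "source name": "sample_01",     # hypothetical values
        "BL[well position]": "A1",
        "BL[injection volume]": "2",
    }
]
print(drop_platform_columns(rows))  # [{'source name': 'sample_01'}]
```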
Optional categorical variables (all)
While the above list is limited to the REQUIRED categories, the SDRF is by design extensible to the degree necessary within the limits of the SDRF documentation. Below is a list of other commonly seen metadata categories observed by CZB-SF, along with associated ontologies used. “NA” is noted if the category is not from an ontology.
Category | Commonly seen attribute |
---|---|
characteristics (ontology source, if any) | developmental stage (TBD), sex (PATO), age, ancestry category (HANCESTRO), cell line (CL), enrichment process (EFO), individual (NA), material type (NA) |
comment (ontology source, if any) | technical replicate (NA), modification parameters (NA), precursor mass tolerance (NA), fragment mass tolerance (NA), collision energy (NA), file uri (NA), fractionation method (PRIDE), cleavage agents (PRIDE), dissociation method (NA), proteomics data acquisition method (PRIDE) |
Post-translational modifications and cleavage agents
Modifications and cleavage agents are often automatically added by the search engine (e.g., FragPipe) to a technical-focused SDRF, which is appended to the sample-focused SDRF generated by CZB-MS. They are written as strings in the following convention:
“NT=Glu→pyro-Glu; MT=fixed; PP=Anywhere;AC=Unimod:27; TA=E”.
An extensive explanation of these is beyond the scope of this document; see the SDRF-Proteomics specification for details.
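Strings in this convention are semicolon-separated KEY=value pairs, so they can be read into a dictionary. The sketch below is illustrative and makes no claim about what each key means; the function name is hypothetical, and whitespace around separators is tolerated because the example above uses it inconsistently.

```python
def parse_modification(value: str) -> dict[str, str]:
    """Parse 'KEY=value; KEY=value' modification strings into a dictionary."""
    fields: dict[str, str] = {}
    for part in value.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        key, separator, field_value = part.partition("=")
        if not separator:
            raise ValueError(f"missing '=' in {part!r}")
        fields[key.strip()] = field_value.strip()
    return fields


modification = parse_modification(
    "NT=Glu→pyro-Glu; MT=fixed; PP=Anywhere;AC=Unimod:27; TA=E"
)
print(modification["AC"], modification["MT"])  # Unimod:27 fixed
```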
Augmentations for Metabolomics and Lipidomics
Currently, SDRF tables are officially defined only for proteomics. However, their structure can unofficially accommodate additional omics types such as metabolomics and lipidomics, with some minor additions and modifications. The benefit is that we have a unified format for all mass spectrometry data as opposed to fragmentation (no pun intended). To this end, we will also record metabolomics data using the SDRF with the following additions:
Column | Constraints and Comments |
---|---|
technology type | String |
comment[polarity] | String |
comment[chromatography type] | String |
comment[extraction protocol] | String |
Experiment- and Project-level Metadata
Project-level metadata is focused on assigning attributes to a group of related (by the project goal) experiments and their associated assays. This file will also be output as a tab-separated table.
Column | Description | Constraints and Comments |
---|---|---|
project[identifier] | Project Identifier assigned by CZB Metadata Portal | String |
project[Title] | Title of project given by user at Metadata Portal | String |
project[description] | Description of project given by user at Metadata Portal | String |
experiment[Identifier] | Experiment Identifier assigned by CZB Metadata Portal. Can be used to link to multimodal projects | String |
experiment[title] | Title of experiment given by user at Metadata Portal | String |
experiment[description] | Description of experiment given by user at Metadata Portal | String |
assay[identifier] | Identifier of data set | String |
assay[measurement type] | Type of technology used to profile samples | String |
assay[technology platform] | Name of mass spectrometer acquiring data for experiment as assigned by MS ontology | String |
assay[Experiment Protocol] | TBD in version 2.0 | TBD in version 2.0 |
assay[sdrf table] | Name of SDRF metadata table | String |
assay[raw data] | Location of raw files in cloud or local cluster (e.g., Bruno hpc) | String |
assay[processed data] | Location of processed tables in cloud or local cluster (e.g., Bruno hpc) | String |
Proteomics post-search output tables
Post-search refers to the various tables generated after spectral data has been acquired and subjected to spectral matching via software/algorithms. For CZB-MS proteomics, this document refers specifically to the following standardized tables generated by FragPipe v.22, which will be recorded for all proteomic experiments:
Data dependent acquisition (DDA) tables
Peptide-spectral matching (PSM) tables (external reference)
Peptide quantification tables (external reference)
Protein quantification tables (external reference)
MSstats-processed outputs
Data independent acquisition (DIA, using the DIA-NN module included in FragPipe) tables
Main outputs (external reference)
MSstats-processed outputs
DDA and DIA MSstats output tables
Equivalent tables are output for DDA and DIA. While MSstats outputs several tables, we will highlight the tables delivered that are likely to be of most interest for analyses and model ingestion.
Peptide quantification table
Refers to a long-form table that contains peptide quantification data output by MSstats (post-filtering).
File name: MSstats_peptide_feature_data.csv
Column | Description |
---|---|
PROTEIN | String |
PEPTIDE | String |
TRANSITION | String |
FEATURE | String |
LABEL | String |
GROUP | String |
RUN | Integer |
SUBJECT | Integer |
FRACTION | Integer |
originalRUN | String |
censored | Boolean |
INTENSITY | Numeric or NA |
ABUNDANCE | Numeric or NA |
newABUNDANCE | Numeric. Original or imputed log2 abundance |
predicted | Numeric. Predicted value for imputation |
feature_quality | String |
is_outlier | Boolean |
Protein quantification table
Refers to a protein x sample wide-format protein abundance table derived from ‘peptide -> protein roll up’ by MSstats (feature summing). The ‘Protein’ column identifies the protein, and the other columns represent samples, with values expressed as log2(intensity). NA values are present.
File name: Msstats_wide.csv
Sample metadata
Refers to the sample metadata used to process MSstats data and downstream CZB-MS efforts including all CZB-MAP outputs
File name: MSstats_metadata.csv
Column | Description |
---|---|
File | String |
Rep | Integer |
Condition | String |
short_id | String |
wide_id | String |
Timepoint | Integer |
Appendix A. Changelog
schema v1.0.0
Initial approved schema
“must”, “should”, and select other words have a defined, standard meaning.
Sourced from https://github.com/chanzuckerberg/data-guidance/blob/main/standards/mass-spectrometry/1.0.0/cz_ms_schema.md