Work in Progress Imaging Metadata Schema
Contact: mcaton@chanzuckerberg.com and utz.ermel@czii.org
Document Status: Draft
Version: 1.0.0
The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in BCP 14, RFC2119, and RFC8174 when, and only when, they appear in all capitals, as shown here.
Schema versioning
The cross-modality schema version is based on Semantic Versioning.
Major version is incremented when incompatible schema updates are introduced:
Renaming metadata fields
Deprecating metadata fields
Changing the type or format of a metadata field
Minor version is incremented when additive schema updates are introduced:
Adding metadata fields
Changing the validation requirements for a metadata field
Patch version is incremented for editorial updates.
All changes are documented in the schema Changelog.
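As an illustration, the versioning rules above can be sketched as a small Python helper. The `Change` enum, its member names, and the `bump` function are hypothetical names chosen for this sketch; they are not part of the schema or of any published tooling.

```python
from enum import Enum
from typing import Iterable


class Change(Enum):
    # Incompatible updates (major)
    RENAME_FIELD = "rename_field"
    DEPRECATE_FIELD = "deprecate_field"
    CHANGE_TYPE_OR_FORMAT = "change_type_or_format"
    # Additive updates (minor)
    ADD_FIELD = "add_field"
    CHANGE_VALIDATION = "change_validation"
    # Editorial updates (patch)
    EDITORIAL = "editorial"


MAJOR = {Change.RENAME_FIELD, Change.DEPRECATE_FIELD, Change.CHANGE_TYPE_OR_FORMAT}
MINOR = {Change.ADD_FIELD, Change.CHANGE_VALIDATION}


def bump(version: str, changes: Iterable[Change]) -> str:
    """Return the next version string for a set of schema changes."""
    major, minor, patch = (int(part) for part in version.split("."))
    changes = set(changes)
    if not changes:
        return version  # nothing changed, nothing to bump
    if changes & MAJOR:
        return f"{major + 1}.0.0"
    if changes & MINOR:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

With this sketch, adding a field to version 1.0.0 yields 1.1.0, and any release that renames a field yields the next major version regardless of what else it contains.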
Background
Across the CZI network, we aim to standardize imaging data and metadata for ease of sharing, management, and downstream model training. In line with this goal, we have outlined how the Dynamic Cell Atlas and the CryoET Portal implement the REQUIRED cross-modality schema. Given the variety of data formats and experimental metadata, we will continue to add to this set of requirements in the imaging working group. This document serves as a working draft and set of minimal standards.
Overview
This document is organized into two sections: cross-modality mapping for Dynamic Cell Atlas and cross-modality mapping for the CryoET portal.
Ontologies
These are the ontologies used.
With the exception of Cellosaurus, ontology terms for metadata MUST use OBO-format identifiers, meaning a CURIE (prefixed identifier) of the form Ontology:Identifier. For example, EFO:0000001 is a term in the Experimental Factor Ontology (EFO). Cellosaurus requires a prefixed identifier of the form Ontology_Identifier such as CVCL_1P02.
If ontologies are missing required terms, then ontologists are responsive to New Term Requests [NTR] such as [NTR] Version specific Visium assays which was created for CELLxGENE Discover requirements.
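The identifier rules above can be checked syntactically with a short Python sketch. The function name `is_valid_term_id` is hypothetical, the prefix set is copied from the ontology table in this section, and the assumption that the local part of a CURIE is purely numeric is inferred from the EFO:0000001 example rather than stated by the schema. The check does not confirm that a term actually exists in its ontology.

```python
import re

# Prefixes copied from the ontology table in this section (Cellosaurus handled separately).
OBO_PREFIXES = {
    "WBls", "WBbt", "CL", "FBbt", "FBdv", "EFO", "GO", "HsapDv",
    "MONDO", "MmusDv", "NCBITaxon", "PATO", "UBERON", "ZFA", "CLO",
}

# Assumption: the local identifier is numeric, as in EFO:0000001.
CURIE_RE = re.compile(r"^(?P<prefix>[A-Za-z]+):(?P<local>\d+)$")
# Cellosaurus uses an underscore instead of a colon, e.g. CVCL_1P02.
CELLOSAURUS_RE = re.compile(r"^CVCL_[A-Z0-9]+$")


def is_valid_term_id(term_id: str) -> bool:
    """Syntax-only check of an ontology term identifier."""
    if CELLOSAURUS_RE.match(term_id):
        return True
    match = CURIE_RE.match(term_id)
    return bool(match) and match.group("prefix") in OBO_PREFIXES
```

For example, `is_valid_term_id("EFO:0000001")` and `is_valid_term_id("CVCL_1P02")` both pass, while `"EFO_0000001"` is rejected because only Cellosaurus uses the underscore form.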
Ontology |
OBO Prefix |
---|---|
WBls |
|
WBbt |
|
CL |
|
CVCL_ |
|
FBbt |
|
FBdv |
|
EFO |
|
GO |
|
HsapDv |
|
MONDO |
|
MmusDv |
|
NCBITaxon |
|
PATO |
|
UBERON |
|
ZFA |
|
CLO |
|
Cross-modality mapping for Dynamic Cell Atlas
This section describes how ontology terms from the fields defined in the sample-level metadata table below map to the cross-modality ontology schema for the Dynamic Cell Atlas project.
| DCA | CZI Crossmodal | Matching Ontology? |
|---|---|---|
| factor value[assay_ontology_term_id] | assay_ontology_term_id | Yes (will update microscopy terms) |
| factor value[assay] | assay | No (FBbi) |
| factor value[developmental_stage_ontology_term_id] | development_stage_ontology_term_id | Yes (HsapDv, MmusDv, ZFS, WBls, FBdv) |
| factor value[developmental_stage] | development_stage | Yes (HsapDv, MmusDv, ZFS, WBls, FBdv) |
| factor value[disease_ontology_term_id] | disease_ontology_term_id | Yes (MONDO, PATO) |
| factor value[disease] | disease | Yes (MONDO, PATO) |
| factor value[organism_ontology_term_id] | organism_ontology_term_id | Yes (NCBITaxon) |
| factor value[organism] | organism | Yes (NCBITaxon) |
| factor value[tissue_ontology_term_id] | tissue_ontology_term_id | Yes (UBERON) |
| factor value[tissue] | tissue | Yes (UBERON) |
| factor value[tissue_type] | tissue_type | Yes (NA) |
Additional Dynamic Cell Atlas Schema
The Dynamic Cell Atlas comprises multiple fluorescence microscopy datasets transformed into the standard zarrv3 format. We therefore also include the minimum additional variables for identifying the original images and communicating channel metadata. A shared ontology and schema for recording channel metadata is still under development; this section describes the current method.
Pathways:
For each converted zarrv3 image, the atlas tracks the pathways to the original, source data for data provenance.
Source_Raw_Path
Key: Source_Raw_Path
Description: This is the path to the original raw image, which is usually on an external S3 bucket, Google Drive, or website. Most of the original files are .tif or .zarr (version 2), which can be identified from the file path. This information is recorded for data provenance.
Value: List[String]. Each pathway SHOULD end in “.zip”, “.tif”, “.zarr”, etc.
Source_Seg_Path
Key: Source_Seg_Path
Description: This is the path to the original segmentation image, which is usually on an external S3 bucket, Google Drive, or website. Most of the original files are .tif or .zarr (version 2), which can be identified from the file path. This information is recorded for data provenance. During the zarr conversion these arrays are embedded within the zarrv3 store as labels or segmentations. If the image does not have related segmentations or masks, the column will be left as “Not Applicable”.
Value: List[String]. Each pathway SHOULD end in “.zip”, “.tif”, “.zarr”, etc.
Internal_S3_Path
Key: Internal_S3_Path
Description: This is the path to the zarrv3 converted image in the Dynamic Cell Atlas database. Each of these images lives in an internal S3 bucket that CZI owns for MDR registration. Note that if a file path is provided under Source_Seg_Path, there will also be a “labels” or “segmentations” folder embedded in the zarr store that has the corresponding converted array.
Value: List[String]. Each pathway MUST end in “.zarr” or “ome.zarr”.
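A minimal sketch of how the path rules above might be checked in Python follows. The function names and the example bucket are hypothetical. Because the schema lists source suffixes with "etc.", an unfamiliar source suffix is only reported, whereas a converted image path that does not end in ".zarr" is treated as an error. Note that a path ending in "ome.zarr" also ends in ".zarr", so one suffix test covers both.

```python
from typing import List

# Suffixes named in the text; the schema says "etc.", so this tuple is not exhaustive.
KNOWN_SOURCE_SUFFIXES = (".zip", ".tif", ".zarr")
NOT_APPLICABLE = "Not Applicable"


def unexpected_source_paths(paths: List[str]) -> List[str]:
    """Return source paths whose suffix is not one of the known ones (SHOULD-level)."""
    return [
        p for p in paths
        if p != NOT_APPLICABLE and not p.rstrip("/").endswith(KNOWN_SOURCE_SUFFIXES)
    ]


def check_internal_s3_paths(paths: List[str]) -> None:
    """Raise ValueError if any converted-image path does not end in ".zarr" (MUST-level)."""
    bad = [p for p in paths if not p.rstrip("/").endswith(".zarr")]
    if bad:
        raise ValueError(f"Internal_S3_Path entries must end in .zarr: {bad}")
```

For instance, `check_internal_s3_paths(["s3://example-bucket/image.ome.zarr"])` returns without error, while a list containing a ".tif" path raises `ValueError`.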
Channel Metadata Fields:
For each image, the atlas metadata tracks the illumination type and target for each of the n channels present. The # in each key is the channel's position in the zarr image, counting from 0.
Channel Illumination Type
Key: Raw_Image_Channel#_IlluminationType
Description: The illumination type is the method used to capture the channel.
Value: List[String]. Each element can be one of the following: Transmitted, Fluorescence, Oblique, Nonlinear, or Other.
Channel Targets
Key: Raw_Image_Channel#_Target
Description: The target field is a descriptive channel parameter rather than an ontology-driven factor. It specifies the molecular or cellular feature imaged in that channel. The most common targets are: DNA, membrane, or a particular gene.
Value: List[String]. Each element SHOULD be one of the following: DNA, Membrane, or the approved gene symbol (HGNC) or UniProt accession for images with a fluorescence illumination type. For images with a brightfield illumination type, these channels will have “Transmitted Light” in this field.
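The channel keys can be generated and checked as in the following Python sketch. It assumes that the # placeholder is replaced directly by the zero-based channel index (for example Raw_Image_Channel0_Target); the function names are hypothetical, and only the illumination type is validated because targets are free-form apart from the recommended values.

```python
import re
from typing import Dict, List

ILLUMINATION_TYPES = {"Transmitted", "Fluorescence", "Oblique", "Nonlinear", "Other"}
_ILLUMINATION_KEY = re.compile(r"^Raw_Image_Channel(\d+)_IlluminationType$")


def channel_keys(channel: int) -> Dict[str, str]:
    """Build the two metadata keys for a zero-based channel index."""
    return {
        "illumination_type": f"Raw_Image_Channel{channel}_IlluminationType",
        "target": f"Raw_Image_Channel{channel}_Target",
    }


def invalid_illumination_values(record: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each illumination-type key to the values that are not in the allowed set."""
    problems: Dict[str, List[str]] = {}
    for key, values in record.items():
        if _ILLUMINATION_KEY.match(key):
            bad = [v for v in values if v not in ILLUMINATION_TYPES]
            if bad:
                problems[key] = bad
    return problems
```

A record that uses "Brightfield" as an illumination type would be flagged, since the allowed value for that case is "Transmitted".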
Cell Line Fields:
For each image with tissue type “cell culture”, the atlas metadata tracks the cell line and the cell ontology ID from the Cell Ontology (http://obofoundry.org/ontology/cl.html).
Key: cell_ontology_id
Description: The CL identifier from the Cell Ontology: http://obofoundry.org/ontology/cl.html
Value: List[String].
Key: cell_line
Description: The Cellosaurus name of the cell line: https://www.cellosaurus.org/
Value: List[String].
| DCA | CZI Crossmodal | Matching Ontology? |
|---|---|---|
| characteristics[Source_Raw_Path] | Not Applicable | No |
| characteristics[Source_Seg_Path] | Not Applicable | No |
| characteristics[Internal_S3_Path] | Not Applicable | No |
| characteristics[Raw_Image_Channel#_IlluminationType] | Not Applicable | No |
| characteristics[Raw_Image_Channel#_Target] | Not Applicable | No |
| factor[cell_ontology_id] | Not Applicable | Yes (CL) |
| factor[cell_line] | Not Applicable | No (Cellosaurus) |
Cross-modality mapping for cryoET data portal
On-Disk Dataset Metadata
AssayDetails Metadata
| XMS-1.1.0 Field | cryoET Field | Requirement | Description | Constraints and Comments |
|---|---|---|---|---|
| | assay | MUST | Defines the human-readable assay name that was used to create the dataset. | |
| | assay_ontology_term_id | MUST | EFO ID corresponding to the assay(s) used. | |
CellComponent Metadata
| cryoET Field | Requirement | Description | Constraints and Comments |
|---|---|---|---|
| name | MUST | Name of the cellular component. | |
| id | MUST | The GO identifier for the cellular component or | |
If the dataset’s cryoET sample_type is "organelle", then the value MUST be a valid descendant of "GO:0005575" for cellular component.
If the dataset’s cryoET sample_type is "virus", then the value MUST be "GO:0044423" for virion component.
If the dataset’s cryoET sample_type is any other type, then the value MUST be "not_reported".
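The three CellComponent rules above can be expressed as a small Python check. This is a sketch under stated assumptions: the function name is hypothetical, and for "organelle" it only verifies that the value looks like a GO identifier (the prefix GO: followed by seven digits). Confirming that a term really descends from GO:0005575 would require loading the GO ontology, which is out of scope here.

```python
import re

GO_ID = re.compile(r"^GO:\d{7}$")
VIRION_COMPONENT = "GO:0044423"
NOT_REPORTED = "not_reported"


def cell_component_id_ok(sample_type: str, component_id: str) -> bool:
    """Check a cell_component id against the rule for the dataset's sample_type."""
    if sample_type == "organelle":
        # Syntax check only; descent from GO:0005575 is not verified in this sketch.
        return bool(GO_ID.match(component_id))
    if sample_type == "virus":
        return component_id == VIRION_COMPONENT
    return component_id == NOT_REPORTED
```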
CellStrain Metadata
| cryoET Field | Requirement | Description | Constraints and Comments |
|---|---|---|---|
| name | MUST | Strain information for the sample. | |
| id | MUST | The cell line’s Cellosaurus term, strain ID, or | |
If the dataset’s cryoET sample_type is "cell_line", then the value MUST be a valid Cellosaurus term.
If the dataset’s cryoET sample_type is any other type, then the value MAY be any other strain ID or "not_reported".
CellType Metadata
| XMS-1.1.0 Field | cryoET Field | Requirement | Description | Constraints and Comments |
|---|---|---|---|---|
| | name | MUST | Name of the cell type from which a biological sample used in a CryoET study is derived, or the name of the cell line used. | |
| | id | MUST | The UBERON or Cell Ontology identifier for the tissue or | |
If the dataset’s cryoET sample_type is "primary_cell_culture", the following Cell Ontology (CL) terms MUST NOT be used:
"CL:0000255" for eukaryotic cell
"CL:0000257" for Eumycetozoan cell
"CL:0000548" for animal cell
For the corresponding |
Value |
---|---|
|
The value MUST be either a CL term or the most accurate descendant of |
|
The value MUST be either a CL term or the most accurate descendant of |
|
The value MUST be either a CL term or the most accurate descendant of |
Otherwise, for all other organisms, the value MUST be a CL or UBERON term.
If the dataset’s cryoET sample_type is any other type, the value MAY follow the same rules as above, otherwise it MUST be "not_reported".
CrossReferences Metadata
| cryoET Field | Requirement | Description | Constraints and Comments |
|---|---|---|---|
| publications | RECOMMENDED | Comma-separated list of DOIs for publications associated with the dataset. | string, MUST be DOI format |
| related_database_entries | RECOMMENDED | Comma-separated list of related database entries for the dataset. | string, MUST be in appropriate format (EMPIAR-XXXXX, PDB-XXXX, EMDB-XXXXX) |
| related_database_links | OPTIONAL | Comma-separated list of related database links for the dataset. | string |
| dataset_citations | OPTIONAL | Comma-separated list of DOIs for publications citing the dataset. | string |
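As a rough illustration, the comma-separated CrossReferences strings could be split and checked as below. The regular expressions are assumptions made for this sketch: the schema only says "DOI format" and gives the placeholder shapes EMPIAR-XXXXX, PDB-XXXX, and EMDB-XXXXX, so the exact digit counts and the bare 10.xxxx/suffix DOI form are guesses, and the helper names are hypothetical.

```python
import re
from typing import List

# Assumption: a bare DOI such as 10.1234/suffix, without a "doi:" or URL prefix.
DOI = re.compile(r"^10\.\d{4,9}/\S+$")
# Assumption: numeric EMPIAR/EMDB accessions of any length, four-character PDB codes.
DB_ENTRY = re.compile(r"^(EMPIAR-\d+|PDB-[0-9A-Za-z]{4}|EMDB-\d+)$")


def split_csv(value: str) -> List[str]:
    """Split a comma-separated metadata string into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def invalid_items(value: str, pattern: re.Pattern) -> List[str]:
    """Return the items of a comma-separated string that do not match the pattern."""
    return [item for item in split_csv(value) if not pattern.match(item)]
```

Calling `invalid_items(publications, DOI)` or `invalid_items(related_database_entries, DB_ENTRY)` returns an empty list when every entry has the expected shape.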
DateStamp Metadata
| cryoET Field | Requirement | Description | Constraints and Comments |
|---|---|---|---|
| deposition_date | MUST | The date a data item was received by the cryoET data portal. | date |
| release_date | MUST | The date a data item was released by the cryoET data portal. | date |
| last_modified_date | MUST | The date a piece of data was last modified on the cryoET data portal. | date |
DevelopmentStageDetails Metadata
| XMS-1.1.0 Field | cryoET Field | Requirement | Description | Constraints and Comments |
|---|---|---|---|---|
| | development_stage | MUST | Defines the development stage(s) of the patients or organisms from which assayed biosamples were derived. | |
| | development_stage_ontology_term_id | MUST | Organism-specific ontology ID corresponding to the development stage(s). | |
DevelopmentStageDetails.development_stage_ontology_term_id
Type: string
If the dataset’s cryoET sample_type is "cell_line", the value MUST be "na".
If unavailable, the value MUST be "unknown".
For the corresponding |
Value |
---|---|
|
The value MUST be |
|
The value MUST be the most accurate descendant of |
|
The value MUST be either the most accurate descendant of |
|
The value MUST be the most accurate descendant of |
|
The value MUST be the most accurate descendant of |
Otherwise, for all other organisms, the value MUST be the most accurate descendant of UBERON:0000105 for life cycle stage, excluding UBERON:0000071 for death stage.
DiseaseDetails Metadata
| XMS-1.1.0 Field | cryoET Field | Requirement | Description | Constraints and Comments |
|---|---|---|---|---|
| | disease | MUST | Defines the disease(s) of the patients or organisms from which assayed biosamples were derived. | |
| | disease_ontology_term_id | MUST | The ontology term ID(s) corresponding to the disease state(s). | |
FundingDetails Metadata
| cryoET Field | Requirement | Description | Constraints and Comments |
|---|---|---|---|
| funding_agency_name | RECOMMENDED | The name of the funding source. | string |
| grant_id | RECOMMENDED | Grant identifier provided by the funding agency. | string |
OrganismDetails Metadata
| XMS-1.1.0 Field | cryoET Field | Requirement | Description | Constraints and Comments |
|---|---|---|---|---|
| | name | MUST | Name of the organism(s) from which a biological sample used in a CryoET study is derived, e.g. Homo sapiens. | |
| | taxonomy_id | MUST | The NCBI taxon ID(s) of the organism(s). | See taxonomy_id below. |
OrganismDetails.taxonomy_id
Type: integer
If the corresponding sample_type is "organism", "tissue", "cell", "organoid", "organelle" or "virus", the value MUST be an NCBI organismal classification term such as "9606".
If the corresponding sample_type is "in_vitro", "in_silico" or "other", the value MAY be an NCBI organismal classification term such as "9606", otherwise it MUST be None.
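The taxonomy_id rule can be sketched in Python as follows. The function name is hypothetical, and the sketch follows the rule text literally: it treats the value as an integer (per "Type: integer"), and it uses the sample type names exactly as listed in the rule, including "cell", which does not appear in the SampleType enum. Any sample type the rule does not name raises an error rather than being silently accepted.

```python
from typing import Optional

# Sample types for which the rule requires a taxon ID (names as written in the rule).
TAXON_REQUIRED = {"organism", "tissue", "cell", "organoid", "organelle", "virus"}
# Sample types for which a taxon ID is optional and may be None.
TAXON_OPTIONAL = {"in_vitro", "in_silico", "other"}


def _is_taxon_id(value: object) -> bool:
    """True for a positive integer such as 9606 (bool is excluded explicitly)."""
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def taxonomy_id_ok(sample_type: str, taxonomy_id: Optional[int]) -> bool:
    """Check OrganismDetails.taxonomy_id against the rule for the sample type."""
    if sample_type in TAXON_REQUIRED:
        return _is_taxon_id(taxonomy_id)
    if sample_type in TAXON_OPTIONAL:
        return taxonomy_id is None or _is_taxon_id(taxonomy_id)
    raise ValueError(f"sample_type not covered by the taxonomy_id rule: {sample_type!r}")
```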
PicturePath
| cryoET Field | Requirement | Description | Constraints and Comments |
|---|---|---|---|
| snapshot | RECOMMENDED | Path to the preview image relative to the entity directory root. | string. API: MUST be URL format. Metadata: MUST be relative path from dataset root. |
| thumbnail | RECOMMENDED | Path to the thumbnail of preview image relative to the entity directory root. | string. API: MUST be URL format. Metadata: MUST be relative path from dataset root. |
SampleType Enum
| XMS-1.1.0 | cryoET value | Description |
|---|---|---|
| | organism | Tomographic data of sections through multicellular organisms |
| | tissue | Tomographic data of tissue sections |
| | cell_line | Tomographic data of immortalized cells or immortalized cell sections |
| | primary_cell_culture | Tomographic data of whole primary cells or primary cell sections |
| | organoid | Tomographic data of organoid-derived samples |
| | organelle | Tomographic data of purified organelles |
| | virus | Tomographic data of purified viruses or VLPs |
| not registered/mapped in 1.1.0 | in_vitro | Tomographic data of in vitro reconstituted systems or mixtures of proteins |
| not registered/mapped in 1.1.0 | in_silico | Simulated tomographic data |
| not registered/mapped in 1.1.0 | other | Other type of sample |
TissueDetails Metadata
| XMS-1.1.0 Field | cryoET Field | Requirement | Description | Constraints and Comments |
|---|---|---|---|---|
| | name | MUST | Name of the tissue from which a biological sample used in a CryoET study is derived. | |
| | id | MUST | The UBERON identifier for the tissue or | |
Type: string
If the dataset’s cryoET sample_type is "organism", "tissue" or "organoid" then:
For the corresponding |
Value |
---|---|
|
The value MUST be either an UBERON term or the most accurate descendant of |
|
The value MUST be either an UBERON term or the most accurate descendant of |
|
The value MUST be either an UBERON term or the most accurate descendant of |
For all other organisms |
The value MUST be the most accurate descendant of |
If the dataset’s cryoET sample_type is "primary_cell_culture", "cell_line" or "organelle", the value MAY follow the definition for "tissue", otherwise it MUST be "not_reported".
If the dataset’s cryoET sample_type is "virus", "in_vitro", "in_silico" or "other", then the value MUST be "not_reported".
Dataset Metadata
| cryoET Field | Requirement | Description | Constraints and Comments |
|---|---|---|---|
| deposition_id | MUST | An identifier for a CryoET deposition, assigned by the Data Portal. Used to identify the deposition the entity is a part of. | integer |
| last_updated_at | MUST | POSIX timestamp of the last time this metadata file was updated. | float |
| key_photos | MUST | A set of paths to representative images of a piece of data for metadata files. | |
| dataset_identifier | MUST | An identifier for a CryoET dataset, assigned by the Data Portal. Used to identify the dataset as the directory name in data tree. | integer |
| dataset_title | MUST | Title of a CryoET dataset. | string |
| dataset_description | MUST | A short description of a CryoET dataset, similar to an abstract for a journal article or dataset. | string |
| dates | MUST | A set of dates at which a data item was deposited, published and last modified. | |
| authors | MUST | Author of a scientific data entity. | list of |
| funding | RECOMMENDED | A funding source for a scientific data entity (base for JSON and DB representation). | list of |
| cross_references | OPTIONAL | A set of cross-references to other databases and publications. | |
| sample_type | MUST | Type of sample imaged in a CryoET study. | |
| sample_preparation | RECOMMENDED | Describes how the sample was prepared. | string |
| grid_preparation | RECOMMENDED | Describes Cryo-ET grid preparation. | string |
| other_setup | RECOMMENDED | Describes other setup not covered by sample preparation or grid preparation that may make this dataset unique in the same publication. | string |
| organism | MUST | The species from which the sample was derived. | |
| tissue | MUST | The type of tissue from which the sample was derived. | |
| cell_type | MUST | The cell type from which the sample was derived. | |
| cell_strain | MUST | The strain or cell line from which the sample was derived. | |
| cell_component | MUST | The cellular component from which the sample was derived. | |
| assay | MUST | Defines the assay(s) that was used to create the dataset. | |
| development_stage | MUST | Defines the development stage(s) of the patients or organisms from which assayed biosamples were derived. | |
| disease | MUST | Defines the disease(s) of the patients or organisms from which assayed biosamples were derived. | |
Database and API Mapping
The mapping of the Dataset metadata to the database, GraphQL API and Python API client is shown below.
DB Column |
DB Type |
PK/FK |
Nullable? |
GraphQL API Field |
GraphQL API Type |
Python Client Field |
Python Client Type |
Mapped AWS S3 Metadata Field |
---|---|---|---|---|---|---|---|---|
|
|
PK |
No |
|
|
|
|
|
|
|
FK |
No |
|
|
|
|
|
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
Yes |
|
|
|
|
||
|
|
Yes |
|
|
|
|
||
|
|
Yes |
|
|
|
|
||
|
|
Yes |
|
|
|
|
||
|
|
Yes |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
Yes |
|
|
|
|
||
|
|
Yes |
|
|
|
|
||
|
|
No |
|
|
|
|
|
|
|
|
No |
|
|
|
|
|
|
|
|
Yes |
|
|
|
|
computed during DB import |
|
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
|
|
|
|
No |
|
|
|
|
||
|
|
No |
|
|
|
|
Mapping to XMS 1.1.0
XMS-1.1.0 metadata Mapping
Mapping will be specified in terms of Python API client fields (as that is what will be used in automatic MDR registration).
XMS-1.1.0 |
Python Client Field |
Notes |
---|---|---|
|
|
convert to list of string |
|
|
convert to list of string |
|
|
convert to list of string |
|
|
convert to list of string |
|
|
convert to list of string |
|
|
convert to list of string |
|
|
convert to list of string |
|
|
Convert to list of string, prepend “NCBITaxon:”. If |
|
depends on |
See |
|
depends on |
See |
|
depends on |
See |
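The two conversions named in the Notes column ("convert to list of string" and prepending "NCBITaxon:") might look like the following in Python. Both function names are hypothetical, and the sketch does not model the condition that follows "If" in the organism note, because that condition is not recoverable from the table as shown.

```python
from typing import Any, List


def to_string_list(value: Any) -> List[str]:
    """Wrap a scalar in a one-element list of strings; stringify each item of a list or tuple."""
    if isinstance(value, (list, tuple)):
        return [str(item) for item in value]
    return [str(value)]


def to_organism_term_ids(taxonomy_id: int) -> List[str]:
    """Turn an integer NCBI taxon ID into a one-element list of NCBITaxon CURIEs."""
    return [f"NCBITaxon:{taxonomy_id}"]
```

For example, a client value of 9606 becomes a list holding the single string NCBITaxon:9606, which matches the CURIE form required in the Ontologies section.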
XMS-1.1.0 tissue_type mapping
Sample types are mapped as follows:
| XMS-1.1.0 | cryoET value | Description |
|---|---|---|
| | organism | Tomographic data of sections through multicellular organisms |
| | tissue | Tomographic data of tissue sections |
| | cell_line | Tomographic data of immortalized cells or immortalized cell sections |
| | primary_cell_culture | Tomographic data of whole primary cells or primary cell sections |
| | organoid | Tomographic data of organoid-derived samples |
| | organelle | Tomographic data of purified organelles |
| | virus | Tomographic data of purified viruses or VLPs |
| not registered/mapped in 1.1.0 | in_vitro | Tomographic data of in vitro reconstituted systems or mixtures of proteins |
| not registered/mapped in 1.1.0 | in_silico | Simulated tomographic data |
| not registered/mapped in 1.1.0 | other | Other type of sample |
XMS-1.1.0 tissue_ontology_term_id mapping
If cryoET sample_type is "organism" or "tissue", XMS-1.1.0 tissue_type is "tissue".
XMS-1.1.0 tissue and tissue_ontology_term_id are mapped to the following Python client fields:
XMS-1.1.0 |
Python Client Field |
Notes |
---|---|---|
|
|
convert to list of string |
|
|
convert to list of string |
If cryoET sample_type is "cell_line", XMS-1.1.0 tissue_type is "cell line".
XMS-1.1.0 tissue and tissue_ontology_term_id are mapped to the following Python client fields:
XMS-1.1.0 |
Python Client Field |
Notes |
---|---|---|
|
|
convert to list of string |
|
|
convert to list of string |
If cryoET sample_type is "primary_cell_culture", XMS-1.1.0 tissue_type is "cell culture".
XMS-1.1.0 tissue and tissue_ontology_term_id are mapped to the following Python client fields:
XMS-1.1.0 |
Python Client Field |
Notes |
---|---|---|
|
|
convert to list of string |
|
|
convert to list of string |
If cryoET sample_type is "organoid", XMS-1.1.0 tissue_type is "organoid".
XMS-1.1.0 tissue and tissue_ontology_term_id are mapped to the following Python client fields:
XMS-1.1.0 |
Python Client Field |
Notes |
---|---|---|
|
|
convert to list of string |
|
|
convert to list of string |
If cryoET sample_type is "organelle" or "virus", XMS-1.1.0 tissue_type is "organelle".
XMS-1.1.0 tissue and tissue_ontology_term_id are mapped to the following Python client fields:
XMS-1.1.0 |
Python Client Field |
Notes |
---|---|---|
|
|
convert to list of string |
|
|
convert to list of string |
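The sample_type to tissue_type rules in this section can be collected into a single lookup, sketched here in Python. The dictionary and function names are hypothetical. The sample types marked "not registered/mapped in 1.1.0" are deliberately absent from the dictionary, so the lookup returns None for them.

```python
from typing import Optional

# cryoET sample_type -> XMS-1.1.0 tissue_type, as stated in the rules of this section.
TISSUE_TYPE_BY_SAMPLE_TYPE = {
    "organism": "tissue",
    "tissue": "tissue",
    "cell_line": "cell line",
    "primary_cell_culture": "cell culture",
    "organoid": "organoid",
    "organelle": "organelle",
    "virus": "organelle",
}


def to_tissue_type(sample_type: str) -> Optional[str]:
    """Return the XMS-1.1.0 tissue_type, or None for sample types with no mapping."""
    return TISSUE_TYPE_BY_SAMPLE_TYPE.get(sample_type)
```

Keeping the mapping in one table makes it easy to see that "virus" shares the "organelle" tissue_type and that "in_vitro", "in_silico", and "other" have no XMS-1.1.0 equivalent.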
Changelog
v1.0.0
Published minimal set of metadata requirements
Sourced from https://github.com/chanzuckerberg/data-guidance/blob/main/standards/imaging/1.0.0/schema.md