Sequence-Based Assay Schema and Standards

1. Measurement Set Metadata

Overview

A Measurement Set represents a distinct measurement, such as the sequencing of a genomic library generated by an experiment performed on a cell line. It contains the raw data files generated by the measurement.


Required Fields

The following fields are required when creating a Measurement Set:

  • assay_term: The assay used to produce data in this measurement set
  • award: Grant associated with the submission
  • file_set_type: The category that best describes this measurement set (default: "experimental data")
  • lab: Lab associated with the submission
  • samples: The sample(s) associated with this file set (maximum of 1)
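
As a minimal sketch of the required-field list above, the following Python checks a submission payload before it is sent. The helper name and every identifier value in the example payload are hypothetical placeholders, not real objects or part of any official client.

```python
from typing import List

# Required fields for a Measurement Set, as listed above.
REQUIRED_MEASUREMENT_SET_FIELDS = (
    "assay_term",
    "award",
    "file_set_type",
    "lab",
    "samples",
)


def missing_required_fields(payload: dict) -> List[str]:
    """Return the required fields that are absent or empty in the payload."""
    return [
        field
        for field in REQUIRED_MEASUREMENT_SET_FIELDS
        if payload.get(field) in (None, "", [])
    ]


# Hypothetical payload: all identifier values are placeholders.
example_measurement_set = {
    "assay_term": "/assay-terms/EXAMPLE_0000001/",
    "award": "/awards/EXAMPLE-AWARD/",
    "file_set_type": "experimental data",
    "lab": "/labs/example-lab/",
    "samples": ["example-lab:sample-1"],
}

print(missing_required_fields(example_measurement_set))
```

An empty list means every required field is present; otherwise the list names the fields still to be supplied.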

Important Rules

  1. Mutually Exclusive Fields: Specification of samples is mutually exclusive with specification of donors.
  2. Status Requirements:
    • release_timestamp is required when status is released, archived, or revoked
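
The two rules above can be expressed as a small pre-submission check. This is an illustrative sketch only: the function name and the message strings are assumptions, and the server-side schema remains the authority on what is accepted.

```python
from typing import List

# Statuses that require release_timestamp, per the rule above.
STATUSES_REQUIRING_RELEASE_TIMESTAMP = {"released", "archived", "revoked"}


def check_rules(payload: dict) -> List[str]:
    """Return a list of rule violations found in the payload."""
    problems = []

    # Rule 1: samples and donors may not both be specified.
    if payload.get("samples") and payload.get("donors"):
        problems.append("samples and donors are mutually exclusive")

    # Rule 2: certain statuses require a release_timestamp.
    # "in progress" is the documented default status.
    status = payload.get("status", "in progress")
    if (
        status in STATUSES_REQUIRING_RELEASE_TIMESTAMP
        and not payload.get("release_timestamp")
    ):
        problems.append(f"status '{status}' requires release_timestamp")

    return problems
```

For instance, `check_rules({"status": "released"})` reports the missing timestamp, while a payload that leaves status at its default reports nothing for rule 2.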

Field Descriptions

Basic Information

  • accession: A unique identifier prefixed with PKB (server-assigned)
  • aliases: Lab-specific identifiers to reference an object (Format: lab-name:identifier)
  • description: A plain text description of the object
  • file_set_type: The category of this measurement set (default: "experimental data")
  • status: The status of the metadata object (admin-only, default: "in progress")
  • submitter_comment: Additional information from submitter

Scientific Information

  • assay_term: The assay used to produce data (links to AssayTerm)
  • samples: Sample(s) associated with this file set (maximum of 1)
  • donors: Donors of the samples (if not using samples; not directly submittable)
  • preferred_assay_title: Custom lab preferred label for the experiment (from controlled vocabulary)
  • library_construction_platform: Platform used to construct the library (links to PlatformTerm)
  • sequencing_library_types: Description of libraries sequenced (from controlled list)
  • donor_validation_method: Method for mapping data to donor
  • multiome_size: Number of datasets in a multiome experiment (minimum: 2)
  • targeted_genes: Genes targeted in this assay (maximum: 100)
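
The size limits stated in this list (at most 1 sample, a multiome_size of at least 2, at most 100 targeted genes) lend themselves to a quick local check. The sketch below assumes the payload is a plain Python dict; the function name and messages are hypothetical.

```python
from typing import List


def check_scientific_fields(payload: dict) -> List[str]:
    """Check the documented size limits on the scientific fields."""
    problems = []

    # samples: maximum of 1
    if len(payload.get("samples", [])) > 1:
        problems.append("samples allows at most 1 entry")

    # multiome_size: minimum of 2 (only checked when provided)
    multiome_size = payload.get("multiome_size")
    if multiome_size is not None and multiome_size < 2:
        problems.append("multiome_size must be at least 2")

    # targeted_genes: maximum of 100
    if len(payload.get("targeted_genes", [])) > 100:
        problems.append("targeted_genes allows at most 100 entries")

    return problems
```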

Related Data

  • control_file_sets: File sets as scientific controls
  • auxiliary_sets: Auxiliary sets of files produced alongside raw data
  • files: Files associated with this file set (not directly submittable)
  • control_for: File sets for which this is a control (not directly submittable)
  • input_file_set_for: File sets that use this as input (not directly submittable)
  • related_multiome_datasets: Related datasets in the multiome experiment (not directly submittable)

References

  • dbxrefs: External resource identifiers (Pattern: ^GEO:GSE\d+$)
  • documents: Documents providing additional information (links to Document)
  • publication_identifiers: Publication identifiers (various formats accepted)
  • protocols: Links to protocols for conducting the assay
  • external_image_url: Links to external image storage (URL format)
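
The dbxrefs pattern above can be tested locally with Python's re module. This sketch uses the documented pattern verbatim; the helper name is an assumption, and `fullmatch` is used so that a value with a trailing newline is rejected.

```python
import re
from typing import List

# Pattern copied from the dbxrefs field description above.
DBXREF_PATTERN = re.compile(r"^GEO:GSE\d+$")


def invalid_dbxrefs(dbxrefs: List[str]) -> List[str]:
    """Return the entries that do not match the documented dbxrefs pattern."""
    return [ref for ref in dbxrefs if not DBXREF_PATTERN.fullmatch(ref)]


# Placeholder values for illustration only.
print(invalid_dbxrefs(["GEO:GSE12345", "GEO:GSM1", "geo:GSE1"]))
```

Only GEO series identifiers of the form GEO:GSE followed by digits pass; the prefix is case-sensitive.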

Administrative Fields

  • award: Grant associated with the submission (links to Award)
  • lab: Lab associated with the submission (links to Lab)
  • uuid: Unique identifier for the object (server-assigned)
  • collections: Data collections the samples are part of (admin-only)
  • schema_version: JSON schema version (default: "17")
  • alternate_accessions: Previous accessions for merged objects (admin-only)
  • creation_timestamp: Object creation date (server-assigned)
  • release_timestamp: Object release date (admin-only)
  • submitted_by: User who submitted the object (server-assigned)
  • submitted_files_timestamp: Timestamp of first file creation (not submittable)
  • notes: DACC internal notes (admin-only)
  • revoke_detail: Explanation for revoked status (admin-only)

2. Analysis Set Metadata

a) Intermediate Analysis Set

b) Principal Analysis Set

Overview

An Analysis Set represents the results of a computational analysis, either of raw genomic data or of the output of other analyses. Analysis Sets can be either intermediate (part of a larger analysis chain) or principal (final interpretable results).


Required Fields

The following fields are required when creating an Analysis Set:

  • award: Grant associated with the submission
  • file_set_type: The level of this analysis set ("intermediate analysis" or "principal analysis")
  • lab: Lab associated with the submission

Additionally, input_file_sets is required if the file_set_type is "principal analysis".


Important Rules

  1. Mutually Exclusive Fields: Specification of samples is mutually exclusive with specification of donors.
  2. Status Requirements:
    • release_timestamp is required when status is released, archived, or revoked
  3. Principal Analysis Requirement:
    • If file_set_type is "principal analysis", then input_file_sets must be specified
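
A minimal sketch of the Analysis Set requirements, including the conditional rule for principal analyses, might look like the following. The function name and message strings are hypothetical, and the check covers only the rules stated in this section.

```python
from typing import List

# The two documented values of file_set_type for an Analysis Set.
ANALYSIS_FILE_SET_TYPES = {"intermediate analysis", "principal analysis"}


def check_analysis_set(payload: dict) -> List[str]:
    """Return violations of the Analysis Set requirements described above."""
    problems = []

    # Always-required fields.
    for field in ("award", "file_set_type", "lab"):
        if not payload.get(field):
            problems.append(f"missing required field: {field}")

    file_set_type = payload.get("file_set_type")
    if file_set_type and file_set_type not in ANALYSIS_FILE_SET_TYPES:
        problems.append(f"unknown file_set_type: {file_set_type}")

    # Conditional requirement: principal analyses must name their inputs.
    if file_set_type == "principal analysis" and not payload.get("input_file_sets"):
        problems.append("principal analysis requires input_file_sets")

    return problems
```

An intermediate analysis passes without input_file_sets, whereas a principal analysis without it is flagged.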

Field Descriptions

Basic Information

  • accession: A unique identifier prefixed with PKB (server-assigned)
  • aliases: Lab-specific identifiers to reference an object (Format: lab-name:identifier)
  • description: A plain text description of the object
  • file_set_type: The level of this analysis set ("intermediate analysis" or "principal analysis")
  • status: The status of the metadata object (admin-only, default: "in progress")
  • submitter_comment: Additional information from submitter

Analysis Information

  • input_file_sets: File set(s) required for this analysis (required for principal analysis)
  • samples: Sample(s) associated with this file set (mutually exclusive with donors)
  • donors: Donors of the samples (not directly submittable)
  • assay_titles: Titles of assays that produced data analyzed (not directly submittable)

Related Data

  • files: Files associated with this file set (not directly submittable)
  • control_for: File sets for which this is a control (not directly submittable)
  • input_file_set_for: File sets that use this as input (not directly submittable)

References

  • dbxrefs: External resource identifiers (Pattern: ^GEO:GSE\d+$)
  • documents: Documents providing additional information (links to Document)
  • publication_identifiers: Publication identifiers (various formats accepted)

Administrative Fields

  • award: Grant associated with the submission (links to Award)
  • lab: Lab associated with the submission (links to Lab)
  • uuid: Unique identifier for the object (server-assigned)
  • collections: Data collections the samples are part of (admin-only)
  • schema_version: JSON schema version (default: "7")
  • alternate_accessions: Previous accessions for merged objects (admin-only)
  • creation_timestamp: Object creation date (server-assigned)
  • release_timestamp: Object release date (admin-only)
  • submitted_by: User who submitted the object (server-assigned)
  • submitted_files_timestamp: Timestamp of first file creation (not submittable)
  • notes: DACC internal notes (admin-only)
  • revoke_detail: Explanation for revoked status (admin-only)

Types of Analysis Sets

Intermediate Analysis

  • Processed data which are not the final results of an experiment
  • May not be interpretable on their own
  • Part of a larger analysis pipeline

Principal Analysis

  • Processed data which are the final results of an experiment
  • Results can be interpretable on their own
  • Requires specification of input_file_sets

Relationship with Measurement Sets

Analysis Sets often process data from Measurement Sets. The connection is made through the input_file_sets field, which can reference Measurement Sets or other Analysis Sets that served as input for the current analysis.
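
To illustrate how input_file_sets chains Analysis Sets back to Measurement Sets, the sketch below walks the references in a small in-memory registry. The registry, its identifiers, and the "kind" key are all hypothetical stand-ins for however the objects are actually retrieved.

```python
from typing import Dict, List

# Hypothetical registry keyed by placeholder identifiers.
FILE_SETS: Dict[str, dict] = {
    "MS-1": {"kind": "measurement_set"},
    "MS-2": {"kind": "measurement_set"},
    "AS-intermediate": {
        "kind": "analysis_set",
        "file_set_type": "intermediate analysis",
        "input_file_sets": ["MS-1", "MS-2"],
    },
    "AS-principal": {
        "kind": "analysis_set",
        "file_set_type": "principal analysis",
        "input_file_sets": ["AS-intermediate"],
    },
}


def upstream_measurement_sets(file_set_id: str, file_sets: Dict[str, dict]) -> List[str]:
    """Follow input_file_sets references and collect the Measurement Sets reached."""
    found = []
    seen = set()
    stack = [file_set_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guard against revisiting shared or cyclic inputs
        seen.add(current)
        record = file_sets[current]
        if record["kind"] == "measurement_set":
            found.append(current)
        else:
            stack.extend(record.get("input_file_sets", []))
    return sorted(found)


print(upstream_measurement_sets("AS-principal", FILE_SETS))
```

Here the principal analysis reaches both Measurement Sets through the intermediate analysis, which mirrors the chain described above.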