Minimal and extended schemas proposal #50

Open
nsheff opened this issue Jun 28, 2023 · 1 comment
Labels
schema-term Proposals for terms in the core schema

Comments

@nsheff
Member

nsheff commented Jun 28, 2023

We decided to start with two schemas: a minimal schema, which we would post now as the one to implement, and an extended schema, which is in an evaluation stage to determine whether its attributes should eventually move into the minimal schema. Here are drafts of both for comment and revision:

Minimal seqcol schema

description: "A collection of biological sequences, defined by the GA4GH Sequence Collections standard."
$id: "/schemas/seqcol_base"
version: 0.1.0
type: object
properties:
  lengths:
    type: array
    collated: true
    description: "Number of elements, such as nucleotides or amino acids, in each sequence."
    items:
      type: integer
  names:
    type: array
    collated: true
    description: "Human-readable identifiers of each sequence (e.g. chromosome names or accessions)."
    items:
      type: string
  sequences:
    type: array
    collated: true
    description: "Digests of sequences computed using the GA4GH digest algorithm (sha512t24u)."
    items:
      type: string
  sorted_name_length_pairs:
    type: array
    description: "Sorted digests of names+lengths pairs, computed following the seqcol specification."
    items:
      type: string
required:
  - lengths
  - names
inherent:
  - lengths
  - names
  - sequences
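
As a rough illustration of how the `required` and `collated` keywords in the draft above might be enforced, here is a minimal Python sketch. It assumes that `collated: true` means the arrays are aligned element by element and must therefore have equal length. `MINIMAL_SCHEMA` and `check_seqcol` are hypothetical names, and the dict mirrors only the parts of the draft that the check needs.

```python
# Hypothetical mirror of the draft minimal schema (only the keys used below).
MINIMAL_SCHEMA = {
    "properties": {
        "lengths": {"type": "array", "collated": True},
        "names": {"type": "array", "collated": True},
        "sequences": {"type": "array", "collated": True},
        "sorted_name_length_pairs": {"type": "array"},
    },
    "required": ["lengths", "names"],
}


def check_seqcol(seqcol, schema=MINIMAL_SCHEMA):
    """Return a list of problems; an empty list means these checks passed."""
    problems = []

    # Every attribute listed under `required` must be present.
    for attr in schema["required"]:
        if attr not in seqcol:
            problems.append(f"missing required attribute: {attr}")

    # Assumption: collated arrays are aligned 1-to-1, so their lengths must match.
    sizes = {
        attr: len(seqcol[attr])
        for attr, spec in schema["properties"].items()
        if spec.get("collated") and attr in seqcol
    }
    if len(set(sizes.values())) > 1:
        problems.append(f"collated attributes differ in length: {sizes}")

    return problems


if __name__ == "__main__":
    print(check_seqcol({"names": ["chr1", "chr2"], "lengths": [1000, 2000]}))
    print(check_seqcol({"names": ["chr1", "chr2"], "lengths": [1000]}))
```

This does not replace JSON Schema validation of types; it only covers the two seqcol-specific ideas that a generic validator would not know about.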

Extended seqcol schema

$ref: "/schemas/seqcol_base"
$id: "/schemas/seqcol_extended"
properties:
  masks:
    type: array
    collated: true
    description: "Digests of subsequence masks indicating subsequences to be excluded from an analysis, such as repeats"
    items:
      type: string
  priorities:
    type: array
    collated: true
    description: "Annotation of whether each sequence is a primary or secondary component in the collection."
    items:
      type: boolean
  topologies:
    type: array
    collated: true
    description: "Annotation of whether each sequence represents a linear or other topology."
    items:
      type: string
      enum: ["circular", "linear"]
      default: "linear"
  molecule_types:
    type: array
    collated: true
    description: "Designation of the type of molecule for each sequence, such as RNA, DNA, or protein."
    items:
      type: string
  alphabets:
    type: array
    collated: true
    description: "The set of characters actually present in each sequence"
    items:
      type: string
  alphabet_domains:
    type: array
    collated: true
    description: "The set of characters that could be included in each sequence"
    items:
      type: string
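
To make the difference between `alphabets` (characters actually present) and `alphabet_domains` (characters that could be included) concrete, here is a small Python sketch. The draft does not say how a character set is encoded in each string, so this assumes each entry is simply a string of its characters. `alphabet_of` and `alphabets_within_domains` are hypothetical helpers, not part of any seqcol implementation.

```python
def alphabet_of(sequence):
    """Characters actually present in a sequence, as a sorted string."""
    return "".join(sorted(set(sequence)))


def alphabets_within_domains(alphabets, alphabet_domains):
    """For each sequence, check that its alphabet is a subset of its domain.

    Both arrays are collated in the draft, so they are expected to be the
    same length and aligned element by element.
    """
    if len(alphabets) != len(alphabet_domains):
        raise ValueError("alphabets and alphabet_domains must have equal length")
    return [set(a) <= set(d) for a, d in zip(alphabets, alphabet_domains)]


if __name__ == "__main__":
    observed = [alphabet_of("ACGTNNAC"), alphabet_of("ACGU")]
    allowed = ["ACGTN", "ACGT"]
    print(observed)
    print(alphabets_within_domains(observed, allowed))
```

Under this assumed encoding, an entry in `alphabets` should never contain a character missing from the matching entry in `alphabet_domains`.
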
@nsheff
Member Author

nsheff commented Nov 20, 2024

The latest minimal schema has been updated to this:

description: "A collection of biological sequences."
type: object
properties:
  lengths:
    type: array
    collated: true
    description: "Number of elements, such as nucleotides or amino acids, in each sequence."
    items:
      type: integer
  names:
    type: array
    collated: true
    description: "Human-readable labels of each sequence (chromosome names)."
    items:
      type: string
  sequences:
    type: array
    collated: true
    items:
      type: string
      description: "Refget sequences v2 identifiers for sequences."
  accessions:
    type: array
    collated: true
    items:
      type: string
      description: "Unique external accessions for the sequences"
required:
  - names
  - lengths
  - sequences
ga4gh:
  inherent:
    - names
    - sequences
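
For readers unfamiliar with the `inherent` list, the sketch below shows one way it might be used. It assumes that inherent attributes are the ones that contribute to a collection's digest, and it picks them out of a collection. It also includes the GA4GH `sha512t24u` digest mentioned in the first draft (SHA-512, truncated to 24 bytes, base64url-encoded). `inherent_view` is a hypothetical helper, and the full seqcol digest procedure is deliberately not reproduced here.

```python
import base64
import hashlib

# Mirrors the `ga4gh: inherent:` list from the updated schema above.
INHERENT = ["names", "sequences"]


def sha512t24u(data: bytes) -> str:
    """GA4GH digest: SHA-512, first 24 bytes, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def inherent_view(seqcol, inherent=INHERENT):
    """Keep only the attributes assumed to contribute to the digest."""
    return {attr: seqcol[attr] for attr in inherent if attr in seqcol}


if __name__ == "__main__":
    collection = {
        "names": ["chr1", "chr2"],
        "lengths": [4, 4],
        # Placeholder values for illustration, not real refget identifiers.
        "sequences": ["SQ." + sha512t24u(b"ACGT"), "SQ." + sha512t24u(b"TTTT")],
        "accessions": ["example:1", "example:2"],
    }
    print(inherent_view(collection))
```

With this list, `lengths` and `accessions` are required or optional metadata but would not change the digest, whereas in the first draft `lengths` was inherent as well.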
