Merge pull request #81 from ga4gh/dev

Add spec for list, filtered list, and attribute endpoints.
ga4gh · Dec 11, 2024 · 7738f40
2 parents 353e232 + b994915
commit 7738f40

Showing 8 changed files with 592 additions and 222 deletions.
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
-# Seqcol Docs
+# Refget Docs
 
-This is the repository for the Seqcol specification. These docs are written using `mkdocs` and hosted on `readthedocs`.
+This is the repository for documentation of the GA4GH Refget specifications, which include both Refget Sequences and Refget Sequence Collections. These docs are written with `mkdocs`, using Material for MkDocs, and hosted on GitHub Pages.
 
 ## Building locally
 

diff --git a/_typos.toml b/_typos.toml
@@ -0,0 +1,4 @@
+[default.extend-words]
+# Don't correct the "fiw", which shows up in some of our digest examples
+fiw = "fiw"
+Ot = "Ot"
diff --git a/docs/README.md b/docs/README.md
@@ -1,25 +1,25 @@
-# Refget
-
-Unique identifiers and lookup service for reference sequences and sequence collections.
-
-<img src="img/seqcol_abstract_simple.svg" alt="Refget abstract" class="img-responsive">
-
+# Refget specifications
 
 ## What is refget?
 
+Refget is a protocol for identifying and distributing reference biological sequences.
+It currently consists of 2 standards:
 
-Refget is a protocol for identifying and distributing biological sequence references. It currently consists of 2 standards:
+1. [Refget sequences](sequences.md): a GA4GH-approved standard for individual sequences
+2. [Refget sequence collections](seqcol.md): a standard for collections of sequences, under review 
+
+<img src="img/seqcol_abstract_simple.svg" alt="Refget abstract" class="img-responsive">
 
-1. Refget sequences: a GA4GH-approved standard for individual sequences
-2. Refget sequence collections: a standard for collections of sequences, under review 
 
 ## What is the refget sequences standard?
 
-The original refget handled sequences only. Refget enables access to reference sequences using an identifier derived from the sequence itself.
+The original refget standard, now called *Refget sequences*, handles sequences only.
+Refget sequences enables access to reference sequences using an identifier derived from the sequence itself.
+
 
 ## What is the refget sequence collections standard?
 
-*Sequence Collections*, or `seqcol` for short, standardizes unique identifiers for collections of sequences. Seqcol identifiers can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences. The seqcol protocol provides:
+*Refget sequence collections*, or `seqcol` for short, standardizes unique identifiers for collections of sequences. Seqcol identifiers can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences. The seqcol protocol provides:
 
 - implementations of an algorithm for computing sequence identifiers;
 - a lookup service to retrieve sequences given a seqcol identifier

diff --git a/docs/contributing.md b/docs/contributing.md
@@ -4,7 +4,7 @@ We welcome more participants! If you are interested in contributing, one of the
 
 ## Maintainers
 
-- <a href="http://databio.org">Nathan Sheffield</a>, Center for Public Health Genomics, University of Virginia
+- <a href="http://databio.org">Nathan Sheffield</a>, Department of Genome Sciences, University of Virginia
 - Andy Yates, EMBL-EBI
 - Timothee Cezard, EMBL-EBI
 

diff --git a/docs/decision_record.md b/docs/decision_record.md
diff --git a/docs/seqcol.md b/docs/seqcol.md
diff --git a/docs/seqcol_rationale.md b/docs/seqcol_rationale.md
@@ -82,3 +82,40 @@ One final important point. Sometimes people seeing the standard for the first ti
 For reasons outlined in the specification, for the actual computing of the identifier, it's important to use the array-based structure -- this is what enables us to use the "level 1" digests for certain comparison questions, and also provides critical performance benefits for extremely large sequence collections. But don't let this dissuade you! My critical point is this: *the way to compute the interoperable identifier does not force you to structure your data in a certain way in your service* -- it's simply the way you structure the data when you compute its identifier. You are, of course, free to store a collection however you want, in whatever structure makes sense for you. You'd just need to structure it according to the standard if you wanted to implement the algorithm for computing the digest. In fact, my implementation provides a way to retrieve the collection information in either structure. 
 
 
+
+
+
+
+
+### Sequence collections without sequences
+
+Typically, we think of a sequence collection as consisting of real sequences, but in fact, sequence collections can also be used to specify collections where the actual sequence content is irrelevant.
+Since this concept can be a bit abstract for those not familiar, we'll try here to explain the rationale and benefit of this.
+First, consider that in a sequence comparison, for some use cases, we may be primarily concerned only with the *length* of the sequence, and not the actual sequence of characters.
+For example, BED files provide start and end coordinates of genomic regions of interest, which are defined on a particular sequence.
+On the surface, it seems that two genomic regions are only comparable if they are defined on the same sequence.
+However, this is not *strictly* true; in fact, as long as the underlying sequences are homologous, and the position in one sequence references an equivalent position in the other, then it makes sense to compare the coordinates.
+In other words, even if the underlying sequences aren't *exactly* the same, as long as they represent something equivalent, then the coordinates can be compared.
+A prerequisite for this is that the *lengths* of the sequences must match. It wouldn't make sense to compare position 5,673 on a sequence of length 8,000 against the same position on a sequence of length 9,000, because those positions don't clearly represent the same thing. But if the sequences have the same length and represent a homology statement, then it may be meaningful to compare the positions.
+
+We realized that we could gain a lot of power from the seqcol comparison function by comparing just the name and length vectors, which typically correspond to a coordinate system.
+Thus, actual sequence content is optional for sequence collections.
+We still think it's correct to refer to a sequence-content-less sequence collection as a "sequence collection" -- because it is still an abstract concept that *is* representing a collection of sequences: we know their names and their lengths; we just don't care about the actual characters in the sequence in this case.
+Thus, we can think of these as a sequence collection without sequence characters.
+
+An example of a canonical representation (level 2) of a sequence collection with unspecified sequences would be:
+
+```
+{
+  "lengths": [
+    "1216",
+    "970",
+    "1788"
+  ],
+  "names": [
+    "A",
+    "B",
+    "C"
+  ]
+}
+```
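
To make the name-and-length comparison described in this section concrete, here is a minimal Python sketch. It is an illustration only, not the seqcol comparison function defined by the specification: the helper names `coordinate_system` and `same_coordinate_system` are hypothetical, the sequence digests are placeholders, and converting lengths with `int()` is an assumption made so that the string lengths in the example above compare equal to integer lengths.

```python
def coordinate_system(seqcol: dict) -> list[tuple[str, int]]:
    """Pair each name with its length, ignoring any sequence content.

    Hypothetical helper: lengths are passed through int() so that "1216"
    and 1216 are treated as the same length (an assumption of this sketch).
    """
    names = seqcol["names"]
    lengths = [int(length) for length in seqcol["lengths"]]
    if len(names) != len(lengths):
        raise ValueError("names and lengths must have the same number of elements")
    return list(zip(names, lengths))


def same_coordinate_system(a: dict, b: dict, ordered: bool = True) -> bool:
    """Return True if two collections agree on names and lengths.

    With ordered=False, the same (name, length) pairs in a different
    order still count as a match.
    """
    pairs_a, pairs_b = coordinate_system(a), coordinate_system(b)
    if ordered:
        return pairs_a == pairs_b
    return sorted(pairs_a) == sorted(pairs_b)


# The collection without sequences from the example above.
without_sequences = {"lengths": ["1216", "970", "1788"], "names": ["A", "B", "C"]}

# A collection that also carries sequence content (placeholder values).
with_sequences = {
    "lengths": [1216, 970, 1788],
    "names": ["A", "B", "C"],
    "sequences": ["placeholder-1", "placeholder-2", "placeholder-3"],
}

# The same names and lengths, listed in a different order.
reordered = {"lengths": [970, 1216, 1788], "names": ["B", "A", "C"]}

print(same_coordinate_system(without_sequences, with_sequences))
print(same_coordinate_system(without_sequences, reordered))
print(same_coordinate_system(without_sequences, reordered, ordered=False))
```

Because only `names` and `lengths` are read, a collection with no sequence content can be compared against one that has it; whether order should matter depends on the use case, which is why the sketch exposes it as a flag.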
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -54,6 +54,7 @@ extra_css:
   - stylesheets/extra.css
 
 markdown_extensions:
+  - admonition
   - pymdownx.highlight:
       use_pygments: true
   - pymdownx.superfences: