diff --git a/docs/seqcol_rationale.md b/docs/seqcol_rationale.md
index 5bcd4a0..f85d072 100644
--- a/docs/seqcol_rationale.md
+++ b/docs/seqcol_rationale.md
@@ -37,7 +37,7 @@ These strategies don’t solve all the problems individually, but taken together
 
 ### 1. Handling divergent needs by splitting the standard into two parts
 
-Our first strategy splits the identifier definition into two parts: the *algorithm* by which the identifier will be computed, and the list of *what content contributes to the digest*. The algorithm means the sorting approach, hash function, delimiters, concatenation process, etc. -- how to construct the string that gets digested to make the identifier. The second part is the *list of content*, which asks what attributes affect the digest; that is, do we include the names, sequences, etc in the string to digest? The division is useful because many of the differences come from different choice of content, not from a different algorithm, meaning the algorithm could be consistent even in situations where the content is not. For example, the second use case requires sequences, but the third use case does not. The algorithm can be kept the same.
+Our first strategy splits the identifier definition into two parts: 1) the *algorithm* by which the identifier will be computed, and 2) the list of *what content contributes to the digest*. The algorithm means the sorting approach, hash function, delimiters, concatenation process, etc. -- how to construct the string that gets digested to make the identifier. The second part is the *list of content*, which asks what attributes affect the digest; that is, do we include the names, sequences, etc in the string to digest? The division is useful because many of the differences come from different choice of content, not from a different algorithm, meaning the algorithm could be consistent even in situations where the content is not. For example, the second use case requires sequences, but the third use case does not. The algorithm can be kept the same.
 
 Separating these two tasks also has a conceptual benefit: it isolates the critical question: *exactly which attributes of a sequence collection should contribute to the digest computation?* The standard abstracts away this question by allowing an implementation to specify which attributes contribute to the digest through the "inherent" property in the schema. This lets us publish a proposal for the algorithm, but leave the final decision open to the community (or even to different versions to be used by different communities). So people can use the sequence collections standard with whatever schema they want, providing flexibility to handle a wide variety of use cases by changing which attributes are listed as inherent in the schema.
 
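
Note on the paragraph changed above: the split between a fixed algorithm and a configurable content list can be illustrated with a small sketch. Everything in it is an assumption made for illustration, not the algorithm the standard proposes: the function name `collection_digest`, the use of SHA-256 over sorted, compact JSON, and the toy attribute values are all hypothetical. Only the idea that an `inherent` list selects which attributes feed the digest comes from the text.

```python
import hashlib
import json


def collection_digest(collection, inherent):
    """Digest only the attributes named in `inherent`, using one fixed procedure."""
    # The "algorithm" half: pick attributes in sorted order, serialize them
    # deterministically, then hash. (Illustrative choices, not the standard's.)
    selected = {name: collection[name] for name in sorted(inherent)}
    canonical = json.dumps(selected, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Toy sequence collection (hypothetical values).
collection = {
    "names": ["chr1", "chr2"],
    "lengths": [8, 4],
    "sequences": ["ACGTACGT", "TTGA"],
}

# The "content" half: two different inherent lists, same algorithm.
with_sequences = collection_digest(collection, ["names", "lengths", "sequences"])
without_sequences = collection_digest(collection, ["names", "lengths"])

# Editing a sequence only affects digests whose content list includes sequences.
edited = dict(collection, sequences=["ACGTACGA", "TTGA"])
print(collection_digest(edited, ["names", "lengths"]) == without_sequences)
print(collection_digest(edited, ["names", "lengths", "sequences"]) == with_sequences)
```

In this sketch, a use case that needs sequences and one that does not call the same function and differ only in the `inherent` argument, which mirrors the point that the algorithm can stay the same while the content varies.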