diff --git a/_posts/2020-12-15-versioning.md b/_posts/2020-12-15-versioning.md
new file mode 100644
index 00000000..e4f2e0b4
--- /dev/null
+++ b/_posts/2020-12-15-versioning.md
@@ -0,0 +1,142 @@
+---
+layout: post
+title: Versioning of Biological Databases
+date: 2020-12-15 00:00:00 -0800
+author: Charles Tapley Hoyt
+---
+What can be versioned?
+
+- Software
+- Databases
+- The Internet
+- Coca-Cola Recipe
+
+This post is about all of the different dimensions
+of versioning, including what versions look like,
+where version information is stored, ...
+
+## Anatomy of a Version
+
+### Semantic Versioning
+
+Resources whose version numbers follow the format
+X.Y.Z are using *semantic versioning*. The
+X refers to a major version, Y to a minor version,
+and Z to a patch. Typically in software, an increase
+in the major version denotes a backwards-incompatible
+change to the API. With data, this is less well defined, but
+one could say that a major version bump is warranted
+whenever the data's schema (shape/format) changes.
+
+See also: https://semver.org/
+
+Examples: BioGRID, DrugBank
+
+### Sort-of Semantic Versioning
+
+Resources whose versions follow the format
+X.Y are also using a form of *semantic versioning*,
+but without a patch number.
+
+Examples: Protein Ontology, MSigDB, miRBase
+
+### Sequential Versioning
+
+Examples: ChEBI, Reactome
+
+### Calendar Versioning
+
+ISO 8601-compliant examples: Gene Ontology, Phenotype And Trait Ontology
+
+Other Examples: WikiPathways, DrugCentral
+
+### Unversioned
+
+#### Daily Build
+
+The Entrez Gene database is an example of a dataset that is built daily and
+doesn't really have version information associated with it.
+
+#### One-off
+
+Many databases are created with the purpose of being published and forgotten
+about. These often don't get a version number assigned to them.
+
+#### Just... Missing
+
+You know what this means.
+
+## Where is the Version Information?
+
+- Inside data
+  - As structured information inside the data
+    - OBO ontologies have the `data-version` tag in the header
+    - Biological Expression Language has the `SET VERSION` header
+    - Unstandardized data like DrugBank has a section with metadata that includes the version
+  - As unstructured information inside the data
+    - WikiPathways GMT files contain version information in unstructured data - the GMT format
+      is not so respectful of metadata
+- In location information of data
+  - Many OBO ontologies (if you're using GitHub as a file hosting system instead of the PURL service)
+    contain a folder of releases. E.g. DOID: https://github.com/DiseaseOntology/HumanDiseaseOntology/tree/main/src/ontology/releases/2020-12-02
+- In the name of the file
+  - BioGRID (e.g., https://downloads.thebiogrid.org/File/BioGRID/Release-Archive/BIOGRID-4.2.192/BIOGRID-ALL-4.2.192.mitab.zip)
+- On the website
+  - Reactome only states the current version on the site and does not include it in
+    the locations, file names, or the data itself :/
+- No version information at all
+  - Many databases maintained by small groups (such as Excel sheets published as a database)
+    are not versioned with much care, though hopefully the end of this post will give
+    inspiration on how even groups working alone can do this well
+
+
+## Where is the Data Hosted?
+
+- GitHub
+  - Disease Ontology and many other OBO Foundry ontologies
+- FTP server
+  - miRBase
+- HTTP / custom
+  - BioGRID
+- Archive Systems
+  - Zenodo
+    - OpenBioLink
+  - FigShare
+  - Mendeley
+    - CKG
+
+## Longevity of Versions
+
+- All old versions available
+  - BioGRID
+- Select number of recent versions available (rolling)
+  - PubChem Compound
+- Only current version available
+  - Reactome
+
+## Automatic Identification of Current Version
+
+- Symlink from "latest" or "current"
+-
+
+## Recommendations
+
+Of all of these different systems, I think BioGRID is the best example.
+It maintains an archive of all of its versions, and it includes the
+version information in both the path and file name.
+
+The one improvement BioGRID could make is to include in each release a file with the same
+name, not including the version number, that just contains the version number.
+This means that if a symlink for the latest release is used, there will always be a file
+with the same URI whose content always reflects the current version.
+
+Even further, I could imagine defining a JSON schema for what a "current version
+metadata file" might include, with the version number, the release date, the location
+of the version file, etc.
+
+A solid second place goes to the OBO PURL system, where
+a file is always at the same location with the same name, but contains
+version information in the content of the file itself. To make this system
+the winner, it would be necessary for the PURL system to also keep track of
+the current version of all OBO ontologies and make that information easy to get
+(spoiler: I'm building that system at https://cthoyt.com/bioversions).
\ No newline at end of file
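
The "current version metadata file" suggested in the Recommendations section could look roughly like the sketch below. It is only an illustration, not part of the post or of any existing tool: the function names `classify_version` and `build_metadata`, the field names in the record, and the style labels are all hypothetical, and the release date used in the demo is a placeholder. The version styles it recognizes are the ones listed under "Anatomy of a Version".

```python
import json
import re
from datetime import date


def classify_version(version: str) -> str:
    """Guess which of the post's version styles a string follows (hypothetical helper)."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", version):
        try:
            date.fromisoformat(version)  # reject impossible dates such as 2020-13-45
        except ValueError:
            return "unknown"
        return "calendar"
    if re.fullmatch(r"\d+\.\d+\.\d+", version):
        return "semantic"  # X.Y.Z
    if re.fullmatch(r"\d+\.\d+", version):
        return "major_minor"  # X.Y, the "sort-of semantic" style
    if re.fullmatch(r"\d+", version):
        return "sequential"
    return "unknown"


def build_metadata(name: str, version: str, released: date, url: str) -> dict:
    """Assemble a record for a hypothetical 'current version metadata file'."""
    return {
        "name": name,
        "version": version,
        "style": classify_version(version),
        "released": released.isoformat(),
        "url": url,
    }


if __name__ == "__main__":
    record = build_metadata(
        name="biogrid",
        version="4.2.192",
        released=date(2020, 1, 1),  # placeholder, NOT the real release date
        url=(
            "https://downloads.thebiogrid.org/File/BioGRID/Release-Archive/"
            "BIOGRID-4.2.192/BIOGRID-ALL-4.2.192.mitab.zip"
        ),
    )
    # A file with this content could sit at a stable, version-free URI.
    print(json.dumps(record, indent=2))
```

One limitation of this sketch is that only ISO 8601 dates are treated as calendar versions. A date written without separators would be classified as sequential, so a real schema would probably need the database to declare its style explicitly rather than have it guessed.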