diff --git a/_posts/2026-literature-and-kgs.md b/_posts/2026-literature-and-kgs.md
new file mode 100644
index 0000000..331113f
--- /dev/null
+++ b/_posts/2026-literature-and-kgs.md
@@ -0,0 +1,71 @@
+---
+layout: post
+title: The Role of Literature in Constructing a Knowledge Graph
+date: 2026-02-10 11:13:00 +0100
+author: Charles Tapley Hoyt
+tags:
+  - SSSOM
+  - semantic mappings
+  - knowledge graphs
+---
+
+[PubMed](https://pubmed.ncbi.nlm.nih.gov) is an index of nearly 40 million
+
+[`pubmed-downloader`](https://github.com/cthoyt/pubmed-downloader)
+
+## Identifying Relevant Literature
+
+1. Search over PubMed
+2. Enrichment of citations
+
+Every knowledge graph needs an aspect of literature enrichment. Here's what
+happened
+
+```mermaid
+flowchart LR
+    articles[peer-reviewed articles] --> source
+    preprints[pre-prints] --> source
+    patents --> source
+    experts[expert text] --> source
+```
+
+1. Define some queries for relevant literature. In Catalaix, this is based on
+   finding papers authored by people in the consortium
+2. Enrich the retrieved literature based on both upstream and downstream
+   citations
+3. Curate papers as being relevant or not. In the RAPTER and Bioregistry projects, we
+   did this very successfully
+4. Run NER and other information extraction workflows on these papers in a
+   semi-automated curation loop. This part is much more agile in the beginning
+   as the data model doesn't need to be set. Though I don't have a taste for it,
+   LLMs show potential for quickly constructing novel information extraction
+   pipelines, e.g., DRAGON-AI reference?, but in practice, I haven't yet seen
+   this used successfully.
+
+The accelerating rate of publication of peer-reviewed papers, patents,
+(electronic) laboratory notebooks (e.g., Chemotion), repositories (e.g.,
+RADAR4Chem), and other expert-driven text creates challenges for catalaix
+consortium members in finding and understanding relevant publications. The CKG
+will aggregate and index all relevant publications and provide catalaix
+consortium members with access, e.g., through a Reaxys-like search interface for
+chemical names and structures. Such interfaces will uniquely leverage a
+combination of public and project-specific ontologies to contextualize searches,
+e.g., to find all zinc-containing catalysts of alcoholysis reactions on PET. The
+aggregation step will enrich publications with bibliometric metadata such as
+publication year, venue, authorships, and citations. The indexing step will
+implement information extraction workflows such as named entity recognition
+(NER), which can identify substrates, products, catalysts, reagents, chemical
+reactions, and other named entities appearing within the text, link them to
+appropriate ontology terms, and enable them to be queried through the CKG. On
+top of NER, relation extraction workflows can capture relationships between
+named entities appearing within the text, such as the classification of a
+chemical as a plasticizer or dye. Such workflows are semi-automated, i.e., have
+a fully automated initial step followed by a human-in-the-loop curation step to
+ensure high-quality results. Importantly, such workflows will be connected to
+the already existing catalaix Wiki, democratizing the ability for domain experts
+within the consortium to contribute to the CKG simply by adding text to the
+Wiki.
+
+## Catalaix Use Case
+
+https://github.com/catalaix/catalaix-kg/pull/6
diff --git a/_posts/2026-oer-mappings.md b/_posts/2026-oer-mappings.md
new file mode 100644
index 0000000..5fb4764
--- /dev/null
+++ b/_posts/2026-oer-mappings.md
@@ -0,0 +1,105 @@
+---
+layout: post
+title:
+  Mapping between Open Educational Resource Data Models and Related Ontologies
+date: 2025-11-07 10:14:00 +0200
+author: Charles Tapley Hoyt
+tags:
+  - open educational resources
+  - learning materials
+  - OERs
+  - SSSOM
+  - SSSOM Curator
+  - Biomappings
+  - semantic mappings
+---
+
+Interest in (open) educational resources (OERs) in the last twenty years has
+led to a highly fragmented landscape of modeling efforts. This post is about
+establishing mappings and crosswalks between these disparate efforts using the
+[Simple Standard for Sharing Ontological Mappings (SSSOM)](https://mapping-commons.github.io/sssom)
+and [SSSOM Curator](https://github.com/cthoyt/sssom-curator).
+
+More concretely, most modeling efforts for (open) educational resources and
+learning materials involve developing a metadata model that captures key
+information such as the title, description, authors, language, discipline, and
+keywords as well as pedagogical metadata like the target audience, required
+proficiency level, and learning objectives. Notably, the Dublin Core Metadata
+Initiative's
+[Learning Resource Metadata Innovation (LRMI)](https://www.dublincore.org/specifications/lrmi)
+and
+[Educational Resource Discovery Index (ERuDIte)](https://www.pagestudy.org/erudite-training-resource-standard/)
+each produced their own OER metadata models, then later consolidated efforts
+with a third OER metadata model in Schema.org. The World Wide Web Consortium
+(W3C) established the
+[Open Educational Resources Schema Community Group](https://www.w3.org/community/oerschema/)
+which developed [OERSchema](https://github.com/open-curriculum/oerschema), but
+this metadata model did not see critical adoption, the working group shut down
+in 2023, and the repository is effectively inactive. There are also numerous
+partially overlapping isolated efforts (surprisingly, many from German groups) with
+heterogeneous reusability (e.g., many are published but not downloadable, many
+are poorly constructed).
+
+Here's a non-exhaustive list of metadata models that follow semantic web
+standards (see Semantic Farm collection [0000018](https://semantic.farm/collection/0000018)):
+
+| Prefix | Name | Homepage |
+| ---------------------------------------------- | ------------------------------------------------------- | -------------------------------------------------------------------- |
+| [`educor`](https://semantic.farm/educor) | Educational and Career-Oriented Recommendation Ontology | https://github.com/tibonto/educor |
+| [`lrmi`](https://semantic.farm/lrmi) | DCMI Learning Resource Metadata Innovation Terms | https://www.dublincore.org/specifications/lrmi/lrmi_terms/2022-06-14 |
+| [`modalia`](https://semantic.farm/modalia) | MoDALIA Ontology | https://git.rwth-aachen.de/dalia/dalia-ontology |
+| [`oerschema`](https://semantic.farm/oerschema) | OER Schema | https://github.com/open-curriculum/oerschema |
+| [`schema`](https://semantic.farm/schema) | Schema.org | https://schema.org |
+| [`vivo`](https://semantic.farm/vivo) | VIVO Ontology | https://github.com/vivo-ontologies/vivo-ontology |
+
+## TL;DR
+
+This post is about predicting mappings between ontologies, data models, and other
+semantic spaces relevant for open educational resources (OERs) and curating them.
+
+
+
+with [SSSOM Curator](https://github.com/cthoyt/sssom-curator),
+a generalization and re-implementation of [Biomappings](https://github.com/biopragmatics/biomappings), a
+semi-automated, human-in-the-loop mapping curation workflow that was originally domain-specific for life sciences.
+
+
+
+```console
+$ uv tool install "sssom-curator[predict-lexical,exports,web]"
+$ sssom-curator init
+$ sssom-curator predict lexical --all-by-all --force kim.hcrt schema vivo
+$ sssom-curator web
+```
+
+1. Surveying the semantic landscape
+2. Ingesting resources
+3. Using the lexical prediction workflow
+4. Curation
+5. Future: assess the amount of uncurated stuff (i.e., islands in the mapping
+   graph)
+
+## Survey Semantic Landscape
+
+## Education Levels
+
+| Prefix | Name | Homepage |
+| ---------------------------------------------------------------- | ---------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
+| [`ans.educationlevel`](https://semantic.farm/ans.educationlevel) | U.S. Education Level Vocabulary | http://purl.org/ASN/scheme/ASNEducationLevel/ |
+| [`isced1997`](https://semantic.farm/isced1997) | International Standard Classification of Education, 1997 Edition | https://ec.europa.eu/eurostat/statistics-explained/index.php?title=International_Standard_Classification_of_Education_(ISCED) |
+| [`isced2011`](https://semantic.farm/isced2011) | International Standard Classification of Education, 2011 Edition | https://ec.europa.eu/eurostat/statistics-explained/index.php?title=International_Standard_Classification_of_Education_(ISCED) |
+| [`isced2013`](https://semantic.farm/isced2013) | International Standard Classification of Education, 2013 Edition | https://ec.europa.eu/eurostat/statistics-explained/index.php?title=International_Standard_Classification_of_Education_(ISCED) |
+| [`kim.educationlevel`](https://semantic.farm/kim.educationlevel) | KIM Education Level | https://github.com/dini-ag-kim/educationalLevel |
+| [`kim.esv`](https://semantic.farm/kim.esv) | Educational Sectors Vocabulary | https://github.com/dini-ag-kim/vocabs-edu |
+| [`kim.hcrt`](https://semantic.farm/kim.hcrt) | Higher Education Resource Types | https://github.com/dini-ag-kim/hcrt |
+| [`oeh.educationlevel`](https://semantic.farm/oeh.educationlevel) | OpenEduHub Education Level | https://github.com/openeduhub/oeh-metadata-vocabs |
+
+## Subjects and Disciplines
+
+| Prefix | Name | Homepage |
+| ---------------------------------------------------------------------------------------- | ------------------------------------------------ | --------------------------------------------------------- |
+| [`ccso`](https://semantic.farm/ccso) | Curriculum Course Syllabus Ontology | https://github.com/Vkreations/CCSO |
+| [`kim.schulfaecher`](https://semantic.farm/kim.schulfaecher) | KIM School Subjects | https://github.com/dini-ag-kim/schulfaecher |
+| [`kim.hochschulfaechersystematik`](https://semantic.farm/kim.hochschulfaechersystematik) | German University Subject Classification System | https://github.com/dini-ag-kim/hochschulfaechersystematik |
+| [`adcad`](https://semantic.farm/adcad) | Arctic Data Center Academic Disciplines Ontology | https://github.com/NCEAS/adc-disciplines |
+| [`edam`](https://semantic.farm/edam) | EDAM Ontology | https://github.com/edamontology/edamontology |
diff --git a/_posts/2026-semantic-farm-for-nfdi.md b/_posts/2026-semantic-farm-for-nfdi.md
new file mode 100644
index 0000000..1ed08e4
--- /dev/null
+++ b/_posts/2026-semantic-farm-for-nfdi.md
@@ -0,0 +1,80 @@
+---
+layout: post
+title: Semantic Farm for NFDI
+date: 2025-11-05 8:32:00 +0100
+author: Charles Tapley Hoyt
+tags:
+  - open educational resources
+  - learning materials
+  - OERs
+  - SSSOM
+  - SSSOM Curator
+  - Biomappings
+---
+
+The [Semantic Farm (https://semantic.farm)](https://semantic.farm) is a data interoperability platform
+that indexes ontologies, databases, and other resources that assign (persistent)
+identifiers.
+
+By collecting metadata about such resources, the Semantic Farm supports
+researchers in finding the appropriate (persistent) identifier schema to annotate
+their (meta)data to be more FAIR (findable, accessible, interoperable,
+reusable).
+
+The NFDI
+[Section (Meta)data, Terminologies, Provenance](https://www.nfdi.de/section-meta/?lang=en)
+proposes the Semantic Farm as a
+[Basic Service for NFDI (Base4NFDI)](https://base4nfdi.de).
+
+Why should it be a Base4NFDI service?
+
+Who are the stakeholders?
+
+1. NFDI Sections
+   - [Section Common Infrastructure](https://www.nfdi.de/section-infra/?lang=en)
+     - Data Integration
+     - Data Management Planning
+     - Data Science and Artificial Intelligence
+     - Electronic Lab Notebooks
+     - Persistent Identifiers (PID)
+   - Section Metadata's charter said to do a survey of consortia ontology
+     usage - this is a place to concretize it and make it actionable
+   - Section EduTrain uses Semantic Farm in the DALIA project to make OERs
+     citable
+2. NFDI Consortia
+   - Chemistry and Cat did a pilot where they consolidated all the ontologies they
+     use. This helps them communicate to all scientists in the consortia
+   - Culture demonstrated with my blog post
+   - Need to reach out to other sections...
+3. Base4NFDI
+   - TS4NFDI technologies use Semantic Farm in the core already (e.g., ontology
+     lookup service) in their implementation for supporting cross-references.
+     There are also several ideas for incubators to more tightly integrate
+     Semantic Farm into TS4NFDI to better support TS4NFDI users. Semantic Farm
+   - DMP4NFDI can use Semantic Farm to support writing better data management
+     plans by 1. helping find appropriate ontologies, controlled vocabularies,
+     and other resources that mint semantic spaces to annotate data in a FAIR
+     way and 2. educating writers to understand some practical aspects of
+     semantics
+   - PID4NFDI - need to show how it's complementary and how it's different
+   - KGI4NFDI
+4. NFDI Central
+   - Reporting on semantic spaces produced by consortia, which includes both
+     ontologies and databases. By construction, anything in Semantic Farm has
+     taken a significant step towards FAIR by documenting its accessibility,
+     improving its findability, and implicitly by making available info necessary for
+     interoperability
+
+Difference from previous Base4NFDI proposals:
+
+1. Semantic Farm already exists, is already running, and is already being
+   demonstrated.
+2. We started with Bioregistry, and the idea is to support the whole NFDI
+3. Doesn't need a ton of funding to continue, already has a detailed governance
+   structure to support community maintenance which leads to sustainability and
+   longevity
+4. Has international partners outside NFDI / Europe that are invested in it.
+
+Complementary tools in NFDI
+
+- comparison to BARTOC
diff --git a/_posts/2026/2026-03-16-semantic-mapping-sources.md b/_posts/2026/2026-03-16-semantic-mapping-sources.md
new file mode 100644
index 0000000..e75a299
--- /dev/null
+++ b/_posts/2026/2026-03-16-semantic-mapping-sources.md
@@ -0,0 +1,72 @@
+---
+layout: post
+title: Where do Semantic Mappings Come From?
+date: 2026-01-20 11:42:00 +0100
+author: Charles Tapley Hoyt
+tags:
+  - SSSOM
+  - semantic mappings
+  - knowledge graphs
+---
+
+The first challenge with semantic mappings is the variety of forms they can
+take. This includes both different data models and serializations of those
+models. This problem is effectively solved, but I think it is worth reviewing for
+historical purposes (please let me know if I missed something):
+
+SKOS logo
+[Simple Knowledge Organization System (SKOS)](https://www.w3.org/TR/skos-reference)
+is a data model for RDF to represent controlled vocabularies, taxonomies,
+dictionaries, thesauri, and other semantic artifacts. It defines several
+semantic mapping predicates including for broad matches, narrow matches, close
+matches, related matches, and exact matches.
+
+[JSKOS (JSON for Knowledge Organization Systems)](https://gbv.github.io/jskos/#mapping)
+is a JSON-based extension of the SKOS data model. I recently wrote a post about
+converting between [SSSOM and JSKOS]({% post_url 2026-01-15-sssom-to-jskos %}).
+
+OWL logo
+[Web Ontology Language (OWL)](https://www.w3.org/TR/owl2-syntax/) is primarily
+used for ontologies. It has first-class language support for encoding
+equivalences between classes, properties, or individuals. Other semantic
+mappings can be encoded as annotation properties on classes, properties, or
+individuals, e.g., using SKOS predicates.
+
+OBO logo
+The
+[OBO Flat File Format](https://owlcollab.github.io/oboformat/doc/GO.format.obo-1_4.html)
+is a simplified version of OWL with macros most useful for curating biomedical
+ontologies. It has the same abilities as OWL, but also the `xref` macro which
+corresponds to `oboInOwl:hasDbXref` relations, which are by nature imprecise and
+therefore used in a variety of ways.
+
+SSSOM logo
+The
+[Simple Standard for Sharing Ontological Mappings (SSSOM)](https://mapping-commons.github.io/sssom/)
+is a fit-for-purpose format for semantic mappings between classes, properties,
+or individuals. SSSOM guides curators towards inputting key metadata that are
+typically missing from other formalisms and is gaining wider community adoption.
+Importantly, SSSOM integrates into ontology curation workflows, especially for
+[Ontology Development Kit (ODK)](https://incatools.github.io/ontology-development-kit)
+users.
+
+The
+[Expressive and Declarative Ontology Alignment Language (EDOAL)](https://moex.gitlabpages.inria.fr/alignapi/edoal.html)
+lives in a similar space to SSSOM, but IMO was much less approachable (cf.
+XML + Java), and has not seen a lot of traction in the biomedical space.
+
+OntoPortal logo
+[OntoPortal](https://ontoportal.org/) has its own data model for semantic
+mappings that has low metadata precision. I recently wrote a post on converting
+[OntoPortal to SSSOM]({% post_url 2025-11-23-sssom-from-bioportal %}). OntoPortal would also like
+to invest more in SSSOM infrastructure if it can organize funding and human resources.
+
+Wikidata logo
+[Wikidata](https://www.wikidata.org) has its own data model for semantic
+mappings that include higher precision metadata. I recently wrote a post on
+mapping between the data models from [SSSOM and
+Wikidata]({% post_url 2026-01-07-sssom-to-wikidata %}).
+
+Finally, there's a long tail of mappings that live in poorly annotated CSV, TSV,
+Excel, and other formats. Similarly, mappings can live in plain RDF files, e.g.,
+encoded with SKOS predicates, but without high precision metadata.
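
As a rough illustration of that long tail, the sketch below lifts a two-column CSV of cross-references into rows carrying the core columns SSSOM expects (`subject_id`, `predicate_id`, `object_id`, `mapping_justification`). The input column names `source` and `target`, the `ex.a:`/`ex.b:` CURIEs, and the choice of `skos:exactMatch` and `semapv:UnspecifiedMatching` as defaults are all assumptions made for illustration. It uses only the Python standard library and is not how SSSOM Curator or any other SSSOM tooling does it.

```python
import csv
import io

# Core SSSOM columns; all other SSSOM metadata slots are left out of this sketch.
SSSOM_COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]


def lift_to_sssom(
    raw_csv: str,
    predicate: str = "skos:exactMatch",
    justification: str = "semapv:UnspecifiedMatching",
) -> str:
    """Convert CSV text with hypothetical `source`/`target` columns into SSSOM-style TSV text."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    output = io.StringIO()
    writer = csv.DictWriter(
        output, fieldnames=SSSOM_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        source = (row.get("source") or "").strip()
        target = (row.get("target") or "").strip()
        if not source or not target:
            continue  # skip incomplete rows rather than guessing a counterpart
        writer.writerow(
            {
                "subject_id": source,
                "predicate_id": predicate,  # assumed default, the raw file does not say
                "object_id": target,
                "mapping_justification": justification,  # flags that provenance is unknown
            }
        )
    return output.getvalue()


# Hypothetical input: one complete row and one row that is missing its target.
RAW = "source,target\nex.a:0001,ex.b:1234\nex.a:0002,\n"
print(lift_to_sssom(RAW), end="")
```

The predicate and justification are explicit parameters because the raw file records neither. Whoever lifts it has to choose them, which is exactly the metadata gap described above.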