71 changes: 71 additions & 0 deletions _posts/2026-literature-and-kgs.md
@@ -0,0 +1,71 @@
---
layout: post
title: The Role of Literature in Constructing a Knowledge Graph
date: 2026-02-10 11:13:00 +0100
author: Charles Tapley Hoyt
tags:
- SSSOM
- semantic mappings
- knowledge graphs
---

[PubMed](https://pubmed.ncbi.nlm.nih.gov) is an index of nearly 40 million

[`pubmed-downloader`](https://github.com/cthoyt/pubmed-downloader)

## Identifying Relevant Literature

1. Search over PubMed
2. Enrichment of citations

Every knowledge graph needs an aspect of literature enrichment. Here's what
happened

```mermaid
flowchart LR
articles[peer-reviewed articles] --> source
preprints[pre-prints] --> source
patents --> source
experts[expert text] --> source
```

1. Define some queries for relevant literature. In Catalaix, this is based on
   finding papers authored by people in the consortium.
2. Enrich the retrieved literature based on both upstream and downstream
   citations.
3. Curate papers as being relevant or not. In the RAPTER and Bioregistry
   projects, we did this very successfully.
4. Run NER and other information extraction workflows on these papers in a
   semi-automated curation loop. This part is much more agile in the beginning
   as the data model doesn't need to be set. Though I don't have a taste for it,
   LLMs show potential for quickly constructing novel information extraction
   pipelines, e.g., DRAGON-AI reference?, but in practice, I haven't yet seen
   this be used successfully.
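
The citation enrichment in step 2 can be sketched as a bounded graph expansion.
This is only an illustration under stated assumptions: the `CITES` dictionary is
a made-up toy citation graph, and `invert` and `enrich` are hypothetical helper
names, not part of `pubmed-downloader` or any other library. It treats both the
papers a seed cites and the papers that cite a seed as candidates for curation.

```python
from collections import deque

# Hypothetical toy citation graph: each key cites the papers in its value.
CITES = {
    "paper:A": ["paper:B", "paper:C"],
    "paper:B": ["paper:D"],
    "paper:E": ["paper:A"],
}


def invert(cites):
    """Build the reverse index: paper -> papers that cite it."""
    cited_by = {}
    for citing, references in cites.items():
        for reference in references:
            cited_by.setdefault(reference, []).append(citing)
    return cited_by


def enrich(seeds, cites, hops=1):
    """Expand seeds along references and citing papers, up to `hops` steps."""
    cited_by = invert(cites)
    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        paper, depth = frontier.popleft()
        if depth == hops:
            continue  # do not expand beyond the requested number of hops
        for neighbor in cites.get(paper, []) + cited_by.get(paper, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen


candidates = enrich({"paper:A"}, CITES, hops=1)
print(sorted(candidates))
```

Keeping `hops` small matters in practice, since each additional hop can grow
the candidate set considerably before the relevance curation in step 3.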

The accelerating rate of publication of peer-reviewed papers, patents,
(electronic) laboratory notebooks (e.g., Chemotion), repositories (e.g.,
RADAR4Chem), and other expert-driven text creates challenges for catalaix
consortium members in finding and understanding relevant publications. The CKG
will aggregate and index all relevant publications and provide catalaix
consortium members with access, e.g., through a Reaxys-like search interface for
chemical names and structures. Such interfaces will uniquely leverage a
combination of public and project-specific ontologies to contextualize searches,
e.g., to find all zinc-containing catalysts of alcoholysis reactions on PET. The
aggregation step will enrich publications with bibliometric metadata such as
publication year, venue, authorships, and citations. The indexing step will
implement information extraction workflows such as named entity recognition
(NER), which can identify substrates, products, catalysts, reagents, chemical
reactions, and other named entities appearing within the text, link them to
appropriate ontology terms, and enable them to be queried through the CKG. On
top of NER, relation extraction workflows can capture relationships between
named entities appearing within the text, such as the classification of a
chemical as a plasticizer or dye. Such workflows are semi-automated, i.e., have
a fully automated initial step followed by a human-in-the-loop curation step to
ensure high-quality results. Importantly, such workflows will be connected to
the already existing catalaix Wiki, democratizing the ability for domain experts
within the consortium to contribute to the CKG simply by adding text to the
Wiki.
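
To make the NER and linking step more concrete, here is a minimal sketch that
uses a plain dictionary lookup as a stand-in for a real NER model. Everything in
it is an assumption for illustration: the `LEXICON` entries, the `EX:` CURIEs,
and the `annotate` function are made up and do not come from the CKG or from any
real ontology.

```python
import re

# Hypothetical lexicon mapping lowercase surface forms to made-up CURIEs.
LEXICON = {
    "zinc acetate": "EX:0001",
    "polyethylene terephthalate": "EX:0002",
    "pet": "EX:0002",
    "alcoholysis": "EX:0003",
}


def annotate(text, lexicon):
    """Return (start, end, matched text, CURIE) for each lexicon hit."""
    # Try longer names first so multi-word names win over their substrings.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sentence = "Zinc acetate catalyzes the alcoholysis of PET."
annotations = annotate(sentence, LEXICON)
for start, end, surface, curie in annotations:
    print(start, end, surface, curie)
```

In a semi-automated workflow, annotations like these would be the fully
automated first pass that a curator then accepts, corrects, or rejects.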

## Catalaix Use Case

https://github.com/catalaix/catalaix-kg/pull/6
105 changes: 105 additions & 0 deletions _posts/2026-oer-mappings.md
@@ -0,0 +1,105 @@
---
layout: post
title:
  Mapping between Open Educational Resource Data Models and Related Ontologies
date: 2025-11-07 10:14:00 +0200
author: Charles Tapley Hoyt
tags:
- open educational resources
- learning materials
- OERs
- SSSOM
- SSSOM Curator
- Biomappings
- semantic mappings
---

Interest in (open) educational resources (OERs) in the last twenty years has
led to a highly fragmented landscape of modeling efforts. This post is about
establishing mappings and crosswalks between these disparate efforts using the
[Simple Standard for Sharing Ontological Mappings (SSSOM)](https://mapping-commons.github.io/sssom)
and [SSSOM Curator](https://github.com/cthoyt/sssom-curator).

More concretely, most modeling efforts for (open) educational resources and
learning materials involve developing a metadata model that captures key
information such as the title, description, authors, language, discipline, and
keywords as well as pedagogical metadata like the target audience, required
proficiency level, and learning objectives. Notably, the Dublin Core Metadata
Initiative's
[Learning Resource Metadata Innovation (LRMI)](https://www.dublincore.org/specifications/lrmi)
and
[Educational Resource Discovery Index (ERuDIte)](https://www.pagestudy.org/erudite-training-resource-standard/)
each produced their own OER metadata models, then later consolidated efforts
with a third OER metadata model in Schema.org. The World Wide Web Consortium
(W3C) established the
[Open Educational Resources Schema Community Group](https://www.w3.org/community/oerschema/)
which developed [OERSchema](https://github.com/open-curriculum/oerschema), but
this metadata model did not see critical adoption, the community group shut down
in 2023, and the repository is effectively inactive. There are also numerous
partially overlapping isolated efforts (surprisingly, many from German groups) with
heterogeneous reusability (e.g., many are published but not downloadable, many
are poorly constructed).

Here's a non-exhaustive list of metadata models that follow semantic web
standards (see Semantic Farm collection [0000018](https://semantic.farm/collection/0000018)):

| Prefix | Name | Homepage |
| ---------------------------------------------- | ------------------------------------------------------- | -------------------------------------------------------------------- |
| [`educor`](https://semantic.farm/educor) | Educational and Career-Oriented Recommendation Ontology | https://github.com/tibonto/educor |
| [`lrmi`](https://semantic.farm/lrmi) | DCMI Learning Resource Metadata Innovation Terms | https://www.dublincore.org/specifications/lrmi/lrmi_terms/2022-06-14 |
| [`modalia`](https://semantic.farm/modalia) | MoDALIA Ontology | https://git.rwth-aachen.de/dalia/dalia-ontology |
| [`oerschema`](https://semantic.farm/oerschema) | OER Schema | https://github.com/open-curriculum/oerschema |
| [`schema`](https://semantic.farm/schema) | Schema.org | https://schema.org |
| [`vivo`](https://semantic.farm/vivo) | VIVO Ontology | https://github.com/vivo-ontologies/vivo-ontology |

## TL;DR

This post is about predicting mappings between ontologies, data models, and other
semantic spaces relevant for open educational resources (OERs) and curating them
with [SSSOM Curator](https://github.com/cthoyt/sssom-curator), a generalization
and re-implementation of
[Biomappings](https://github.com/biopragmatics/biomappings), a semi-automated,
human-in-the-loop mapping curation workflow that was originally domain-specific
for the life sciences.

```console
$ uv tool install "sssom-curator[predict-lexical,exports,web]"
$ sssom-curator init
$ sssom-curator predict lexical --all-by-all --force kim.hcrt schema vivo
$ sssom-curator web
```
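
As a rough illustration of what lexical prediction means, here is a minimal
sketch. It is not SSSOM Curator's implementation: it assumes the simplest
possible strategy (labels that are identical after normalization), and the
vocabularies, identifiers, and function names are all made up. The output rows
use SSSOM column names and the `semapv:LexicalMatching` justification.

```python
import re
from itertools import product

# Hypothetical vocabularies: made-up identifiers mapped to labels.
LEFT = {"ex1:0001": "Lecture Notes", "ex1:0002": "Video"}
RIGHT = {"ex2:A": "lecture-notes", "ex2:B": "Podcast"}


def normalize(label):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", label.lower())


def predict_lexical(left, right):
    """Yield SSSOM-style rows for label pairs that match after normalization."""
    for (s, s_label), (o, o_label) in product(left.items(), right.items()):
        if normalize(s_label) == normalize(o_label):
            yield {
                "subject_id": s,
                "subject_label": s_label,
                "predicate_id": "skos:exactMatch",
                "object_id": o,
                "object_label": o_label,
                "mapping_justification": "semapv:LexicalMatching",
            }


predictions = list(predict_lexical(LEFT, RIGHT))
for row in predictions:
    print(row["subject_id"], row["predicate_id"], row["object_id"])
```

Predictions like these are only candidates. The point of the curation step is
that a person confirms or rejects each one before it counts as a mapping.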

1. Surveying the semantic landscape
2. Ingesting resources
3. Using the lexical prediction workflow
4. Curating the predicted mappings
5. Future: assessing the amount of uncurated content (i.e., islands in the
   mapping graph)
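
One possible reading of "islands" in the last step is connected components of
the graph whose nodes are terms and whose edges are curated mappings. Under that
assumption, a sketch could look like the following, where the term identifiers
and the `components` function are made up for illustration.

```python
from collections import defaultdict


def components(nodes, edges):
    """Return the connected components of an undirected graph as sets."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    result = []
    for node in nodes:
        if node in seen:
            continue
        # Depth-first traversal collecting everything reachable from `node`.
        stack = [node]
        component = set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        result.append(component)
    return result


# Hypothetical terms and curated mappings between them.
terms = ["a:1", "b:1", "c:1", "a:2"]
mappings = [("a:1", "b:1"), ("b:1", "c:1")]

islands = components(terms, mappings)
unmapped = [island for island in islands if len(island) == 1]
print(islands, unmapped)
```

Singleton components are terms that no curated mapping touches yet, which gives
a simple count of how much remains uncurated.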

## Survey Semantic Landscape

## Education Levels

| Prefix | Name | Homepage |
| ---------------------------------------------------------------- | ---------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------- |
| [`ans.educationlevel`](https://semantic.farm/ans.educationlevel) | U.S. Education Level Vocabulary | http://purl.org/ASN/scheme/ASNEducationLevel/ |
| [`isced1997`](https://semantic.farm/isced1997) | International Standard Classification of Education, 1997 Edition | https://ec.europa.eu/eurostat/statistics-explained/index.php?title=International_Standard_Classification_of_Education_(ISCED) |
| [`isced2011`](https://semantic.farm/isced2011) | International Standard Classification of Education, 2011 Edition | https://ec.europa.eu/eurostat/statistics-explained/index.php?title=International_Standard_Classification_of_Education_(ISCED) |
| [`isced2013`](https://semantic.farm/isced2013) | International Standard Classification of Education, 2013 Edition | https://ec.europa.eu/eurostat/statistics-explained/index.php?title=International_Standard_Classification_of_Education_(ISCED) |
| [`kim.educationlevel`](https://semantic.farm/kim.educationlevel) | KIM Education Level | https://github.com/dini-ag-kim/educationalLevel |
| [`kim.esv`](https://semantic.farm/kim.esv) | Educational Sectors Vocabulary | https://github.com/dini-ag-kim/vocabs-edu |
| [`kim.hcrt`](https://semantic.farm/kim.hcrt) | Higher Education Resource Types | https://github.com/dini-ag-kim/hcrt |
| [`oeh.educationlevel`](https://semantic.farm/oeh.educationlevel) | OpenEduHub Education Level | https://github.com/openeduhub/oeh-metadata-vocabs |

## Subjects and Disciplines

| Prefix | Name | Homepage |
| ---------------------------------------------------------------------------------------- | ------------------------------------------------ | --------------------------------------------------------- |
| [`ccso`](https://semantic.farm/ccso) | Curriculum Course Syllabus Ontology | https://github.com/Vkreations/CCSO |
| [`kim.schulfaecher`](https://semantic.farm/kim.schulfaecher) | KIM School Subjects | https://github.com/dini-ag-kim/schulfaecher |
| [`kim.hochschulfaechersystematik`](https://semantic.farm/kim.hochschulfaechersystematik) | German University Subject Classification System | https://github.com/dini-ag-kim/hochschulfaechersystematik |
| [`adcad`](https://semantic.farm/adcad) | Arctic Data Center Academic Disciplines Ontology | https://github.com/NCEAS/adc-disciplines |
| [`edam`](https://semantic.farm/edam) | EDAM Ontology | https://github.com/edamontology/edamontology |
80 changes: 80 additions & 0 deletions _posts/2026-semantic-farm-for-nfdi.md
@@ -0,0 +1,80 @@
---
layout: post
title: Semantic Farm for NFDI
date: 2025-11-05 8:32:00 +0100
author: Charles Tapley Hoyt
tags:
- open educational resources
- learning materials
- OERs
- SSSOM
- SSSOM Curator
- Biomappings
---

The [Semantic Farm (https://semantic.farm)](https://semantic.farm) is a data interoperability platform
that indexes ontologies, databases, and other resources that assign (persistent)
identifiers.

By collecting metadata about such resources, the Semantic Farm helps researchers
find the appropriate (persistent) identifier schema to annotate their (meta)data
and make it more FAIR (findable, accessible, interoperable, reusable).

The NFDI
[Section (Meta)data, Terminologies, Provenance](https://www.nfdi.de/section-meta/?lang=en)
proposes the Semantic Farm as a
[Basic Service for NFDI (Base4NFDI)](https://base4nfdi.de).

Why should it be a Base4NFDI service?

Who are the stakeholders?

1. NFDI Sections
   - [Section Common Infrastructure](https://www.nfdi.de/section-infra/?lang=en)
   - Data Integration
   - Data Management Planning
   - Data Science and Artificial Intelligence
   - Electronic Lab Notebooks
   - Persistent Identifiers (PID)
   - Section Metadata's charter said to do a survey of consortia ontology
     usage - this is a place to concretize it and make it actionable
   - Section EduTrain uses Semantic Farm in the DALIA project to make OERs
     citable
2. NFDI Consortia
   - Chemistry and Cat did a pilot where they consolidated all the ontologies
     they use. This helps them communicate to all scientists in the consortia
   - Culture demonstrated w/ my blog post
   - Need to reach out to other sections...
3. Base4NFDI
   - TS4NFDI technologies use Semantic Farm in the core already (e.g., ontology
     lookup service) in their implementation for supporting cross-references.
     There are also several ideas for incubators to more tightly integrate
     Semantic Farm into TS4NFDI to better support TS4NFDI users. Semantic Farm
   - DMP4NFDI can use Semantic Farm to support writing better data management
     plans by (1) helping find appropriate ontologies, controlled vocabularies,
     and other resources that mint semantic spaces to annotate data in a FAIR
     way and (2) educating writers to understand some practical aspects of
     semantics
   - PID4NFDI - need to show how it's complementary and how it's different
   - KGI4NFDI
4. NFDI Central
   - Reporting on semantic spaces produced by consortia, which includes both
     ontologies and databases. By construction, anything in Semantic Farm has
     taken a significant step towards FAIR by documenting its accessibility,
     improving its findability, and implicitly by making available the info
     necessary for interoperability

Difference from previous Base4NFDI proposals:

1. Semantic Farm already exists, is already running, and is already being
   demonstrated.
2. We started with Bioregistry, and the idea is to support the whole NFDI.
3. It doesn't need a ton of funding to continue and already has a detailed
   governance structure to support community maintenance, which leads to
   sustainability and longevity.
4. It has international partners outside NFDI / Europe that are invested in it.

Complementary tools in NFDI

- comparison to BARTOC
72 changes: 72 additions & 0 deletions _posts/2026/2026-03-16-semantic-mapping-sources.md
@@ -0,0 +1,72 @@
---
layout: post
title: Where do Semantic Mappings Come From?
date: 2026-01-20 11:42:00 +0100
author: Charles Tapley Hoyt
tags:
- SSSOM
- semantic mappings
- knowledge graphs
---

The first challenge with semantic mappings is the variety of forms they can
take. This includes both different data models and different serializations of
those models. This problem is effectively solved, but I think it is worth
reviewing for historical purposes (please let me know if I missed something):

<img src="https://forge.extranet.logilab.fr/uploads/-/system/project/avatar/107/external-content.duckduckgo.com.jpeg" align="left" style="max-height: 3em;" alt="SKOS logo"/>
[Simple Knowledge Organization System (SKOS)](https://www.w3.org/TR/skos-reference)
is a data model for RDF to represent controlled vocabularies, taxonomies,
dictionaries, thesauri, and other semantic artifacts. It defines several
semantic mapping predicates including for broad matches, narrow matches, close
matches, related matches, and exact matches.
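
For reference, the five SKOS mapping properties live in the SKOS namespace and
can be written out as plain triples. The sketch below formats one N-Triples line
by hand. The `https://example.org/...` URIs and the `to_ntriple` helper are
made up for illustration.

```python
# Namespace of the SKOS vocabulary.
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Local names of the five SKOS mapping properties mentioned above.
MAPPING_PROPERTIES = [
    "broadMatch",
    "narrowMatch",
    "closeMatch",
    "relatedMatch",
    "exactMatch",
]


def to_ntriple(subject, mapping_property, obj):
    """Format a single SKOS mapping as one N-Triples line."""
    if mapping_property not in MAPPING_PROPERTIES:
        raise ValueError(f"not a SKOS mapping property: {mapping_property}")
    return f"<{subject}> <{SKOS}{mapping_property}> <{obj}> ."


line = to_ntriple("https://example.org/a/1", "exactMatch", "https://example.org/b/1")
print(line)
```

Note that a bare triple like this carries no information about who made the
mapping, how, or with what confidence.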

[JSKOS (JSON for Knowledge Organization Systems)](https://gbv.github.io/jskos/#mapping)
is a JSON-based extension of the SKOS data model. I recently wrote a post about
converting between [SSSOM and JSKOS]({% post_url 2026-01-15-sssom-to-jskos %}).

<img src="https://www.jean-delahousse.net/wp-content/uploads/2020/09/Owl_logo-258x300.png" align="left" style="max-height: 3em; margin-right: 0.5em;" alt="OWL logo">
[Web Ontology Language (OWL)](https://www.w3.org/TR/owl2-syntax/) is primarily
used for ontologies. It has first-class language support for encoding
equivalences between classes, properties, or individuals. Other semantic
mappings can be encoded as annotation properties on classes, properties, or
individuals, e.g., using SKOS predicates.

<img src="https://obofoundry.org/images/foundrylogo.png" align="left" style="max-height: 3em; margin-right: 0.5em;" alt="OBO logo">
The
[OBO Flat File Format](https://owlcollab.github.io/oboformat/doc/GO.format.obo-1_4.html)
is a simplified version of OWL with macros most useful for curating biomedical
ontologies. It has the same abilities as OWL, but also has the `xref` macro,
which corresponds to `oboInOwl:hasDbXref` relations. These are by nature
imprecise and therefore used in a variety of ways.

<img src="https://avatars.githubusercontent.com/u/77892844?v=4" align="left" style="max-height: 3em; margin-right: 0.5em;" alt="SSSOM logo">
The
[Simple Standard for Sharing Ontological Mappings (SSSOM)](https://mapping-commons.github.io/sssom/)
is a fit-for-purpose format for semantic mappings between classes, properties,
or individuals. SSSOM guides curators towards inputting key metadata that are
typically missing from other formalisms and is gaining wider community adoption.
Importantly, SSSOM integrates into ontology curation workflows, especially for
[Ontology Development Kit (ODK)](https://incatools.github.io/ontology-development-kit)
users.

The
[Expressive and Declarative Ontology Alignment Language (EDOAL)](https://moex.gitlabpages.inria.fr/alignapi/edoal.html)
lives in a similar space to SSSOM, but IMO is much less approachable (cf.
XML + Java) and has not seen a lot of traction in the biomedical space.

<img src="https://ontoportal.org/images/logo.png" align="left" style="max-height: 3em; margin-right: 0.5em;" alt="OntoPortal logo"/>
[OntoPortal](https://ontoportal.org/) has its own data model for semantic
mappings that has low metadata precision. I recently wrote a post on converting
[OntoPortal to SSSOM]({% post_url 2025-11-23-sssom-from-bioportal %}). OntoPortal would also like
to invest more in SSSOM infrastructure if it can organize funding and human resources.

<img src="https://upload.wikimedia.org/wikipedia/commons/6/66/Wikidata-logo-en.svg" align="left" style="max-height: 3em" alt="Wikidata logo">
[Wikidata](https://www.wikidata.org) has its own data model for semantic
mappings that includes higher-precision metadata. I recently wrote a post on
mapping between the data models from [SSSOM and
Wikidata]({% post_url 2026-01-07-sssom-to-wikidata %}).

Finally, there's a long tail of mappings that live in poorly annotated CSV, TSV,
Excel, and other formats. Similarly, mappings can live in plain RDF files, e.g.,
encoded with SKOS predicates, but without high-precision metadata.
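
To illustrate what lifting such a file involves, here is a minimal sketch that
turns a bare two-column CSV into SSSOM-style TSV rows using only the standard
library. The input data, the column names `source` and `target`, and the
`upgrade` function are assumptions for illustration. The defaults reflect that
nothing is known about how the legacy mappings were made: an imprecise
`oboInOwl:hasDbXref` predicate and a `semapv:UnspecifiedMatching` justification.
A real SSSOM file would also carry a metadata header, which is omitted here.

```python
import csv
import io

# Hypothetical legacy file: two columns and no metadata at all.
LEGACY = "source,target\nex1:0001,ex2:A\nex1:0002,ex2:B\n"

COLUMNS = ["subject_id", "predicate_id", "object_id", "mapping_justification"]


def upgrade(
    legacy_csv,
    predicate="oboInOwl:hasDbXref",
    justification="semapv:UnspecifiedMatching",
):
    """Rewrite a source/target CSV as tab-separated rows with SSSOM columns."""
    reader = csv.DictReader(io.StringIO(legacy_csv))
    output = io.StringIO()
    writer = csv.DictWriter(
        output, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(
            {
                "subject_id": row["source"],
                "predicate_id": predicate,
                "object_id": row["target"],
                "mapping_justification": justification,
            }
        )
    return output.getvalue()


tsv = upgrade(LEGACY)
print(tsv)
```

The conversion itself is mechanical. The hard part is recovering the missing
metadata, which usually requires asking whoever produced the file.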