Add guide on choosing entity_id and entity_uri for HERD references #192
Open
bendichter wants to merge 2 commits into NeurodataWithoutBorders:main from bendichter:add-herd-entity-id-uri-guide
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,176 @@ | ||
| .. _external_resources_entity_guide: | ||
|
|
||
| Using Ontologies and Identifiers with NWB | ||
| ========================================= | ||
|
|
||
| Neurophysiology data is full of terms that mean something specific outside of your file: the | ||
| species of a subject, the institution that collected the data, the researchers who ran the | ||
| experiment, or the brain region a probe was implanted in. Writing these as free text (``"mouse"``, | ||
| ``"the Allen Institute"``, ``"V1"``) is easy but hard to compute on: different files spell | |
| the same thing in different ways, and a reader has no authoritative reference for what exactly was | |
| meant. | ||
|
|
||
| **External resources** solve this by linking a term in your file to a standardized entry in an | ||
| external **ontology**, **registry**, or **atlas** — for example linking the species | ||
| ``"Mus musculus"`` to its entry in the NCBI Taxonomy. This makes your annotations unambiguous, | ||
| machine-readable, and interoperable: tools can group, search, and compare data across files and | ||
| labs because everyone points at the same canonical identifier. | ||
|
|
||
| In NWB, these links are stored using HDMF's **HERD** (HDMF External Resources Data) structure, | ||
| which records, for each annotation, the term as it appears in your file together with a compact | ||
| identifier (``entity_id``) and a resolvable URL (``entity_uri``) for the external entry. | ||
|
|
||
| How to add external resources to an NWB file | ||
| -------------------------------------------- | ||
|
|
||
| There are two complementary ways to connect NWB data to external terms, both provided by HDMF: | ||
|
|
||
| * **HERD** lets you attach references to existing values in a file — recording that a given | ||
| attribute or column value corresponds to a specific external term. See the | ||
| :hdmf-docs:`HERD tutorial <tutorials/plot_external_resources.html>` for a walkthrough of | ||
| :py:meth:`HERD.add_ref <hdmf.common.resources.HERD.add_ref>`. | ||
| * **TermSet** lets you validate values *as you write them*, constraining a field to terms drawn | ||
| from a chosen ontology. See the :hdmf-docs:`TermSet tutorial <tutorials/plot_term_set.html>`, | ||
| and the PyNWB | ||
| :pynwb-docs:`How to Configure Term Validations <tutorials/general/plot_configurator.html>` | ||
| tutorial for configuring term validation across a file. | ||
|
|
||
| The rest of this page covers a question that comes up with both approaches: once you have picked | ||
| an external term, what exactly should go in the ``entity_id`` and ``entity_uri`` fields? | ||
|
|
||
| Choosing ``entity_id`` and ``entity_uri`` | ||
| ----------------------------------------- | ||
|
|
||
| When you annotate data with an external resource using | ||
| :py:meth:`HERD.add_ref <hdmf.common.resources.HERD.add_ref>`, each reference records two | ||
| fields that identify the external term: | ||
|
|
||
| ``entity_id`` | ||
| A compact identifier (a `CURIE <https://www.w3.org/TR/curie/>`_) of the form | ||
| ``prefix:identifier`` (e.g. ``NCBITaxon:10090``). The ``prefix`` names the registry or | ||
| ontology and the ``identifier`` is the term's accession within it. | ||
|
|
||
| ``entity_uri`` | ||
| The full URL that the ``entity_id`` resolves to — a persistent, dereferenceable web | ||
| address for that exact term. | ||
|
|
||
| Recommended practice | ||
| ^^^^^^^^^^^^^^^^^^^^^ | ||
|
|
||
| #. **Use a CURIE for** ``entity_id``. Prefer an identifier whose ``prefix`` is registered with | ||
| `bioregistry.io <https://bioregistry.io>`_. The Bioregistry is a comprehensive registry of | ||
| prefixes that maps each CURIE to a canonical, resolvable URL, which avoids the ambiguity of | ||
| the many overlapping identifier schemes (e.g. ``NCBITaxon`` vs. ``taxonomy`` vs. | ||
| ``NCBI_TAXON``). | ||
|
|
||
| #. **Use the resolved URL for** ``entity_uri``. The ``entity_uri`` should be the URL that the | ||
| CURIE resolves to. You can look this up by resolving the CURIE through the Bioregistry: | ||
| visiting ``https://bioregistry.io/<entity_id>`` (for example | ||
| ``https://bioregistry.io/NCBITaxon:10090``) redirects to the canonical provider URL, which is | ||
| the value to store in ``entity_uri``. | ||
|
|
||
| Keeping ``entity_id`` and ``entity_uri`` consistent in this way means a reader can both | ||
| recognize the registry from the compact ``entity_id`` and dereference the ``entity_uri`` to land | ||
| on an authoritative description of the term. | ||
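
The lookup described above can be sketched in a few lines. This is a minimal illustration, not part of HDMF or PyNWB: the helper names `split_curie`, `bioregistry_resolver_url`, and `resolve_via_bioregistry` are hypothetical, and only the `https://bioregistry.io/<entity_id>` pattern comes from the guide. The last helper needs network access and simply reports whatever URL the redirect chain ends at.

```python
import urllib.request
from typing import Tuple

BIOREGISTRY_BASE = "https://bioregistry.io/"


def split_curie(curie: str) -> Tuple[str, str]:
    """Split ``prefix:identifier`` at the first colon."""
    prefix, sep, identifier = curie.partition(":")
    if not sep or not prefix or not identifier:
        raise ValueError(f"not of the form prefix:identifier: {curie!r}")
    return prefix, identifier


def bioregistry_resolver_url(curie: str) -> str:
    """Build the Bioregistry address that redirects to the provider page."""
    prefix, identifier = split_curie(curie)
    return f"{BIOREGISTRY_BASE}{prefix}:{identifier}"


def resolve_via_bioregistry(curie: str, timeout: float = 10.0) -> str:
    """Follow the redirect and return the final URL (requires network access)."""
    url = bioregistry_resolver_url(curie)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()
```

Calling `bioregistry_resolver_url("NCBITaxon:10090")` gives `https://bioregistry.io/NCBITaxon:10090`, the address used in the example above. The URL returned by `resolve_via_bioregistry` is the candidate value for `entity_uri`.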
|
|
||
| Commonly used registries | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
|
||
| All of the prefixes below are registered with the Bioregistry. The ``entity_uri`` column shows | |
| the canonical URL the example ``entity_id`` resolves to. | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 10 16 22 20 32 | ||
|
|
||
| * - Prefix | ||
| - Use for | ||
| - Common NWB field(s) | ||
| - Example ``entity_id`` | ||
| - Example ``entity_uri`` | ||
| * - ``NCBITaxon`` | ||
| - Species | ||
| - ``Subject.species`` | ||
| - ``NCBITaxon:10090`` | ||
| - ``http://purl.obolibrary.org/obo/NCBITaxon_10090`` | ||
| * - ``ROR`` | ||
| - Organizations / institutions | ||
| - ``NWBFile.institution`` | ||
| - ``ROR:013meh722`` | ||
| - ``https://ror.org/013meh722`` | ||
| * - ``ORCID`` | ||
| - People (researchers) | ||
| - ``NWBFile.experimenter`` | ||
| - ``ORCID:0000-0002-1825-0097`` | ||
| - ``https://orcid.org/0000-0002-1825-0097`` | ||
| * - ``UBERON`` | ||
| - Brain regions (cross-species) | ||
| - Brain-region location fields [#loc]_ | ||
| - ``UBERON:0001950`` | ||
| - ``http://purl.obolibrary.org/obo/UBERON_0001950`` | ||
| * - ``MBA`` | ||
| - Brain regions (Allen Mouse Brain Atlas) | ||
| - Brain-region location fields [#loc]_ | ||
| - ``MBA:385`` | ||
| - ``https://purl.brain-bican.org/ontology/mbao/MBA_385`` | ||
| * - ``HBA`` | ||
| - Brain regions (Allen Human Brain Atlas) | ||
| - Brain-region location fields [#loc]_ | ||
| - ``HBA:4005`` | ||
| - ``https://purl.brain-bican.org/ontology/hbao/HBA_4005`` | ||
| * - ``DANDI`` | ||
| - Dandisets | ||
| - (identifies the dataset as a whole) | ||
| - ``DANDI:000015`` | ||
| - ``https://dandiarchive.org/dandiset/000015`` | ||
|
|
||
| .. [#loc] Brain-region annotations commonly apply to ``ElectrodeGroup.location``, | ||
| ``ImagingPlane.location``, and the ``location`` column of the ``electrodes`` table. | ||
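
For the prefixes in the table, the `entity_uri` can be derived from the `entity_id` by substitution. The sketch below assumes that each row's example generalizes to a URL template for the whole prefix, which the guide does not state explicitly; `URI_TEMPLATES` and `entity_uri_for` are hypothetical names used only for illustration.

```python
# URL templates inferred from the single example in each table row (an assumption).
URI_TEMPLATES = {
    "NCBITaxon": "http://purl.obolibrary.org/obo/NCBITaxon_{id}",
    "ROR": "https://ror.org/{id}",
    "ORCID": "https://orcid.org/{id}",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_{id}",
    "MBA": "https://purl.brain-bican.org/ontology/mbao/MBA_{id}",
    "HBA": "https://purl.brain-bican.org/ontology/hbao/HBA_{id}",
    "DANDI": "https://dandiarchive.org/dandiset/{id}",
}


def entity_uri_for(entity_id: str) -> str:
    """Return the entity_uri for an entity_id whose prefix is in URI_TEMPLATES."""
    prefix, sep, identifier = entity_id.partition(":")
    if not sep or not identifier:
        raise ValueError(f"not of the form prefix:identifier: {entity_id!r}")
    if prefix not in URI_TEMPLATES:
        raise KeyError(f"no URL template recorded for prefix {prefix!r}")
    return URI_TEMPLATES[prefix].format(id=identifier)
```

Generating both fields from one source this way keeps `entity_id` and `entity_uri` from drifting apart when many references are added in a loop.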
|
|
||
| Example | ||
| ^^^^^^^ | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| from hdmf.common.resources import HERD | |
| # assumes nwbfile is an existing NWBFile whose subject is already set | |
| herd = HERD() | |
| # the species of the subject, mapped to NCBI Taxonomy | |
| herd.add_ref( | |
| container=nwbfile.subject, | |
| attribute="species", | |
| key="Mus musculus", | |
| entity_id="NCBITaxon:10090", | |
| entity_uri="http://purl.obolibrary.org/obo/NCBITaxon_10090", | |
| ) | |
|
|
||
| Resources without individually resolvable URLs | ||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
|
||
| Some resources do not provide a dereferenceable URL for each individual term. For example, many | ||
| brain atlases (such as the macaque **D99** atlas) publish a single document or download for the | ||
| whole atlas rather than one persistent URL per region. | ||
|
|
||
| In that case: | ||
|
|
||
| * Put the **URL of the resource as a whole** in ``entity_uri`` (e.g. the atlas's landing or | ||
| download page). | ||
| * Put the resource's **identifier for the specific term** — for example, the brain area ID used | ||
| by the atlas — in ``entity_id``. | ||
|
|
||
| This keeps every reference dereferenceable to *something* authoritative (the resource) while | ||
| still recording the precise term identifier, even when a per-term URL does not exist. | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| # a region from an atlas that has no per-region URL: identify the region by its | ||
| # atlas-specific ID and point entity_uri at the atlas itself | ||
| herd.add_ref( | ||
| container=electrodes_table, | ||
| attribute="location", | ||
| key="area_42", | ||
| entity_id="42", | ||
| entity_uri="https://afni.nimh.nih.gov/pub/dist/atlases/macaque/D99_macaque/", | ||
| ) | ||
|
|
||
| .. seealso:: | ||
|
|
||
| :py:class:`HERD <hdmf.common.resources.HERD>` for the full API, and | ||
| :py:meth:`HERD.add_ref <hdmf.common.resources.HERD.add_ref>` for adding references. | ||
This URI doesn't resolve
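
One way to catch this kind of problem before review is to check each `entity_uri` mechanically. The sketch below is an assumption about how such a check could look, not an existing HDMF or PyNWB utility: `uri_resolves` is a hypothetical helper that sends a HEAD request and treats any 2xx response (after redirects) as success. Some servers reject HEAD requests, so a `False` result is a prompt to look by hand rather than proof that the link is dead.

```python
import urllib.error
import urllib.request
from typing import Callable


def uri_resolves(
    uri: str,
    opener: Callable = urllib.request.urlopen,
    timeout: float = 10.0,
) -> bool:
    """Return True if a HEAD request to ``uri`` ends in a 2xx response."""
    try:
        # Request() raises ValueError for strings that are not URLs.
        request = urllib.request.Request(uri, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False
```

The `opener` parameter exists only so the function can be exercised without network access; in normal use the default `urllib.request.urlopen` is enough.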