diff --git a/docs/source/external_resources_entity_guide.rst b/docs/source/external_resources_entity_guide.rst
new file mode 100644
index 0000000..28758fd
--- /dev/null
+++ b/docs/source/external_resources_entity_guide.rst
@@ -0,0 +1,176 @@
+.. _external_resources_entity_guide:
+
+Using Ontologies and Identifiers with NWB
+=========================================
+
+Neurophysiology data is full of terms that mean something specific outside of your file: the
+species of a subject, the institution that collected the data, the researchers who ran the
+experiment, or the brain region a probe was implanted in. Writing these as free text (``"mouse"``,
+``"the Allen Institute"``, ``"V1"``) is easy to do but hard to compute on: different files spell
+the same thing in different ways, and a reader has no authoritative reference for what exactly
+was meant.
+
+**External resources** solve this by linking a term in your file to a standardized entry in an
+external **ontology**, **registry**, or **atlas**, for example linking the species
+``"Mus musculus"`` to its entry in the NCBI Taxonomy. This makes your annotations unambiguous,
+machine-readable, and interoperable: tools can group, search, and compare data across files and
+labs because everyone points at the same canonical identifier.
+
+In NWB, these links are stored using HDMF's **HERD** (HDMF External Resources Data) structure,
+which records, for each annotation, the term as it appears in your file together with a compact
+identifier (``entity_id``) and a resolvable URL (``entity_uri``) for the external entry.
+
+How to add external resources to an NWB file
+--------------------------------------------
+
+There are two complementary ways to connect NWB data to external terms, both provided by HDMF:
+
+* **HERD** lets you attach references to existing values in a file, recording that a given
+  attribute or column value corresponds to a specific external term.
+  See the
+  :hdmf-docs:`HERD tutorial ` for a walkthrough of
+  :py:meth:`HERD.add_ref `.
+* **TermSet** lets you validate values *as you write them*, constraining a field to terms drawn
+  from a chosen ontology. See the :hdmf-docs:`TermSet tutorial `,
+  and the PyNWB
+  :pynwb-docs:`How to Configure Term Validations `
+  tutorial for configuring term validation across a file.
+
+The rest of this page covers a question that comes up with both approaches: once you have picked
+an external term, what exactly should go in the ``entity_id`` and ``entity_uri`` fields?
+
+Choosing ``entity_id`` and ``entity_uri``
+-----------------------------------------
+
+When you annotate data with an external resource using
+:py:meth:`HERD.add_ref `, each reference records two
+fields that identify the external term:
+
+``entity_id``
+   A compact identifier (a `CURIE `_) of the form
+   ``prefix:identifier`` (e.g. ``NCBITaxon:10090``). The ``prefix`` names the registry or
+   ontology and the ``identifier`` is the term's accession within it.
+
+``entity_uri``
+   The full URL that the ``entity_id`` resolves to: a persistent, dereferenceable web
+   address for that exact term.
+
+Recommended practice
+^^^^^^^^^^^^^^^^^^^^^
+
+#. **Use a CURIE for** ``entity_id``. Prefer an identifier whose ``prefix`` is registered with
+   `bioregistry.io `_. The Bioregistry is a comprehensive registry of
+   prefixes that maps each CURIE to a canonical, resolvable URL, which avoids the ambiguity of
+   the many overlapping identifier schemes (e.g. ``NCBITaxon`` vs. ``taxonomy`` vs.
+   ``NCBI_TAXON``).
+
+#. **Use the resolved URL for** ``entity_uri``. The ``entity_uri`` should be the URL that the
+   CURIE resolves to. You can look this up by resolving the CURIE through the Bioregistry:
+   visiting ``https://bioregistry.io/`` (for example
+   ``https://bioregistry.io/NCBITaxon:10090``) redirects to the canonical provider URL, which is
+   the value to store in ``entity_uri``.
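The lookup step in the second recommendation can be sketched in a few lines of Python. This is a minimal illustration rather than part of HDMF or PyNWB: the helper names ``split_curie`` and ``bioregistry_lookup_url`` are hypothetical, and the only URL pattern assumed is the ``https://bioregistry.io/NCBITaxon:10090`` form quoted above. The sketch only builds the address to visit; it does not follow the redirect.

```python
# Hypothetical helpers (not part of HDMF or PyNWB): build the Bioregistry
# lookup URL for a CURIE, using the pattern shown in the text above.
BIOREGISTRY_RESOLVER = "https://bioregistry.io/"


def split_curie(curie: str) -> tuple[str, str]:
    """Split 'prefix:identifier' at the first colon, rejecting malformed input."""
    prefix, sep, identifier = curie.partition(":")
    if not sep or not prefix or not identifier:
        raise ValueError(f"not a CURIE of the form prefix:identifier: {curie!r}")
    return prefix, identifier


def bioregistry_lookup_url(curie: str) -> str:
    """Return the Bioregistry address to visit in order to resolve the CURIE."""
    prefix, identifier = split_curie(curie)
    return f"{BIOREGISTRY_RESOLVER}{prefix}:{identifier}"


lookup_url = bioregistry_lookup_url("NCBITaxon:10090")
print(lookup_url)  # https://bioregistry.io/NCBITaxon:10090
```

Visiting the printed address should redirect to the provider page described above; the address you land on is the candidate value for ``entity_uri``, while the CURIE itself goes in ``entity_id``.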
+
+Keeping ``entity_id`` and ``entity_uri`` consistent in this way means a reader can both
+recognize the registry from the compact ``entity_id`` and dereference the ``entity_uri`` to land
+on an authoritative description of the term.
+
+Commonly used registries
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+All of the registries below are registered with the Bioregistry. The ``entity_uri`` column shows
+the canonical URL the example ``entity_id`` resolves to.
+
+.. list-table::
+   :header-rows: 1
+   :widths: 10 16 22 20 32
+
+   * - Prefix
+     - Use for
+     - Common NWB field(s)
+     - Example ``entity_id``
+     - Example ``entity_uri``
+   * - ``NCBITaxon``
+     - Species
+     - ``Subject.species``
+     - ``NCBITaxon:10090``
+     - ``http://purl.obolibrary.org/obo/NCBITaxon_10090``
+   * - ``ROR``
+     - Organizations / institutions
+     - ``NWBFile.institution``
+     - ``ROR:013meh722``
+     - ``https://ror.org/013meh722``
+   * - ``ORCID``
+     - People (researchers)
+     - ``NWBFile.experimenter``
+     - ``ORCID:0000-0002-1825-0097``
+     - ``https://orcid.org/0000-0002-1825-0097``
+   * - ``UBERON``
+     - Brain regions (cross-species)
+     - Brain-region location fields [#loc]_
+     - ``UBERON:0001950``
+     - ``http://purl.obolibrary.org/obo/UBERON_0001950``
+   * - ``MBA``
+     - Brain regions (Allen Mouse Brain Atlas)
+     - Brain-region location fields [#loc]_
+     - ``MBA:385``
+     - ``https://purl.brain-bican.org/ontology/mbao/MBA_385``
+   * - ``HBA``
+     - Brain regions (Allen Human Brain Atlas)
+     - Brain-region location fields [#loc]_
+     - ``HBA:4005``
+     - ``https://purl.brain-bican.org/ontology/hbao/HBA_4005``
+   * - ``DANDI``
+     - Dandisets
+     - (identifies the dataset as a whole)
+     - ``DANDI:000015``
+     - ``https://dandiarchive.org/dandiset/000015``
+
+.. [#loc] Brain-region annotations commonly apply to ``ElectrodeGroup.location``,
+   ``ImagingPlane.location``, and the ``location`` column of the ``electrodes`` table.
+
+Example
+^^^^^^^
+
+..
code-block:: python
+
+    # assumes ``herd`` is an existing HERD instance and ``nwbfile`` is an NWBFile
+    # whose ``subject`` has been set
+    # the species of the subject, mapped to NCBI Taxonomy
+    herd.add_ref(
+        container=nwbfile.subject,
+        attribute="species",
+        key="Mus musculus",
+        entity_id="NCBITaxon:10090",
+        entity_uri="http://purl.obolibrary.org/obo/NCBITaxon_10090",
+    )
+
+Resources without individually resolvable URLs
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Some resources do not provide a dereferenceable URL for each individual term. For example, many
+brain atlases (such as the macaque **D99** atlas) publish a single document or download for the
+whole atlas rather than one persistent URL per region.
+
+In that case:
+
+* Put the **URL of the resource as a whole** in ``entity_uri`` (e.g. the atlas's landing or
+  download page).
+* Put the resource's **identifier for the specific term** (for example, the brain area ID used
+  by the atlas) in ``entity_id``.
+
+This keeps every reference dereferenceable to *something* authoritative (the resource) while
+still recording the precise term identifier, even when a per-term URL does not exist.
+
+.. code-block:: python
+
+    # a region from an atlas that has no per-region URL: identify the region by its
+    # atlas-specific ID and point entity_uri at the atlas itself
+    # (assumes ``electrodes_table`` is the table that holds the ``location`` column)
+    herd.add_ref(
+        container=electrodes_table,
+        attribute="location",
+        key="area_42",
+        entity_id="42",
+        entity_uri="https://afni.nimh.nih.gov/pub/dist/atlases/macaque/D99_macaque/",
+    )
+
+.. seealso::
+
+   :py:class:`HERD ` for the full API, and
+   :py:meth:`HERD.add_ref ` for adding references.
diff --git a/docs/source/index.rst b/docs/source/index.rst
index cf1eaae..1cfd23d 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -21,6 +21,7 @@ for each of those tasks and point you to the best tools to use for your preferre
     conversion_tutorial/user_guide
     file_read/file_read
     extensions_tutorial/extensions_tutorial_home
+    external_resources_entity_guide
     core_tools/core_tools_home
 
 .. toctree::