
Implement RDFC-1.0#245

Open
mielvds wants to merge 24 commits into master from 220-RDFC10

Conversation

@mielvds
Collaborator

@mielvds mielvds commented Feb 25, 2026

This PR is a rather big one. It fully implements URDNA2015, URGNA2012, and RDFC-1.0, and has complete test-suite coverage for these algorithms (with the exception of the test on poisoned datasets, but more about that later). It 'fixes' #220

Some general remarks before I dive into the details:

  • I went back and forth on how to approach this: move the canon code to a new repo first, or fix the problems first. I sided with the latter, as the test harness is already in place and the code is too tangled not to break everything.
  • This introduces rdflib into PyLD, but limited to the normalization/canonicalization code in canon.py. This made the code much, much cleaner, but ironically didn't fix what I introduced it for: nquads serialization. The relevant methods are copied over from rdflib until I can turn them into PRs for rdflib.
  • I added some functions to transform the legacy RDF.JS dataset structure into an rdflib.Dataset and back. This should ensure backwards compatibility for some methods such as jsonld.normalize() and URDNA2015.main().
  • While switching to rdflib introduces significant changes, I tried to change as little as possible otherwise. Optimizing the algorithm implementations can be done later.
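To make the conversion idea above concrete, here is a minimal, stdlib-only sketch of an RDF.JS-like dataset dict of roughly the shape PyLD uses internally (keyed by graph name, with terms as `{'type', 'value'}` dicts). The helper below is NOT pyld's `from_legacy_dataset()`/`to_legacy_dataset()`; it only illustrates the kind of structure those functions map to and from an `rdflib.Dataset`:

```python
# Illustrative sketch only: the exact term-dict keys ('type', 'value',
# 'datatype', 'language') are assumptions based on PyLD's RDF dataset
# format, not copied from this PR's code.
def quads_to_legacy(quads):
    """Group (subject, predicate, object, graph_name) tuples into a
    graph-keyed dict; the terms here are already {'type': ..., 'value': ...}
    dicts in the RDF.JS-like style."""
    dataset = {}
    for s, p, o, g in quads:
        dataset.setdefault(g, []).append(
            {'subject': s, 'predicate': p, 'object': o})
    return dataset

legacy = quads_to_legacy([
    ({'type': 'blank node', 'value': '_:b0'},
     {'type': 'IRI', 'value': 'http://example.org/p'},
     {'type': 'literal', 'value': 'hello',
      'datatype': 'http://www.w3.org/2001/XMLSchema#string'},
     '@default'),
])
```

The real helpers would additionally build `rdflib` term objects (`BNode`, `URIRef`, `Literal`) in one direction and unpack them in the other.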

Overview of the changes:

  • added rdflib dependency in all relevant config and code files.
  • the change to rdflib mostly affected
    • RDF term type checking (e.g., checking if something is a bnode)
    • looping over triples/quads
    • serialization
    • constructing RDF terms (custom deepcopy is no longer needed)
  • util.py
    • added the functions from_legacy_dataset() and to_legacy_dataset() to easily go from an rdflib.Dataset to an RDFJS-like dict wherever needed.
    • Also added unittests in tests/test_util.py for these functions. We might want to move more general-purpose methods to this file.
  • canon.py:
    • the main logic was moved to URDNA2015._canonicalize(self, dataset: Dataset), while input and output handling remained in URDNA2015.main(). The latter now does the nquads parsing instead of JsonLdProcessor.normalize(), so all parsing and serialization is handled by the same class.
    • the method URDNA2015._canonicalize(self, dataset: Dataset) accepts an rdflib.Dataset and returns a tuple with
      • the canonicalized result as a nquads str and
      • the blank node identifier map as dict.
    • The method URDNA2015.main(self, dataset: str | dict | Dataset, options) now accepts an rdflib.Dataset object in addition to an nquads str or the original RDFJS-like dict. It returns
      • a str: the serialized nquads result, or
      • a dict: the result as RDFJS-like dataset or the blank node identifier map when the new parameter outputMap is True.
    • the hashing algorithm is now a class attribute URDNA2015.hash_algorithm, so it is easily configurable (required for RDFC-1.0)
    • the permutations() function now uses itertools.permutations instead of a custom implementation.
    • replacements for rdflib's _nq_row and _quoteLiteral, which should eventually move to a fix for rdflib's nquads serializer.
  • tests/runtests.py
    • (re-)enabled all skipped URDNA2015 and URGNA2012 tests
    • Added the RDFC10 tests
    • added support for testing blank-node identifier maps. If the result of a test is a dict and the expected value is a string, the expected value is now parsed as JSON.
    • added support for testing with different hashing algorithms
    • !! I did not include test 74c on dataset poisoning because I'm not sure how to handle that yet. We need to discuss possible guardrails first before continuing on this.
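The hash_algorithm class attribute and the itertools.permutations swap described above can be sketched as follows. This is a hedged illustration, not the actual pyld code; the class names below are invented for the example:

```python
import hashlib
from itertools import permutations

# Sketch of the idea from the PR description: making the digest a class
# attribute lets RDFC-1.0 reuse the URDNA2015 machinery with a different
# (or configurable) hash function.
class CanonicalizerSketch:
    hash_algorithm = 'sha256'  # URDNA2015 / RDFC-1.0 default

    def hash(self, data: str) -> str:
        h = hashlib.new(self.hash_algorithm)
        h.update(data.encode('utf-8'))
        return h.hexdigest()

class Sha384CanonicalizerSketch(CanonicalizerSketch):
    # RDFC-1.0 allows parameterizing the digest, e.g. for SHA-384 test cases
    hash_algorithm = 'sha384'

# And in place of a hand-rolled permutation generator over blank nodes:
orders = list(permutations(['_:b0', '_:b1', '_:b2']))
```

Subclassing (or assigning the attribute on an instance) is all that is needed to switch digests, which is what makes the class-attribute approach attractive for RDFC-1.0.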

It's possible that we move this code out before ever merging, but at least it can be easily tested here. That said, when importing this functionality as an external library, we would include rdflib anyway. It therefore makes sense to continue replacing the RDFJS-like data structures elsewhere in the code, essentially making it fully rdflib-compliant and rdflib-dependent.

Since this involves RDFLib: @nicholascar, is there anyone who could have a look at this? Also @davidlehn @BigBlueHat @anatoly-scherbakov

@github-actions

github-actions Bot commented Feb 25, 2026


@WhiteGobo WhiteGobo left a comment


I just looked over the parts where the behaviour of rdflib plays a role, because I don't fully understand what the code itself does. I hope my comments are a little helpful.
I have another, more general comment: since BNode is a subclass of str, if you e.g. check types with mypy, you might get no error when you by mistake compare a str to a BNode, but such a comparison would always fail. This might play a role when you call hash_related_blank_node. But as I said, I don't have a full grasp of the program.

Comment thread lib/pyld/canon.py Outdated
Comment thread lib/pyld/canon.py
encoded = self._quote_encode(l_)

if l_.language:
if l_.datatype:

@WhiteGobo WhiteGobo Apr 25, 2026


I would remove this test if l_.datatype:. In rdflib, if a language is given, the datatype is always forced, so this test for datatype should always return the same result. As far as I remember, the correct datatype for language-tagged literals is rdf:langString, so in the future this might change to return rdf:langString instead of None.
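The redundancy being discussed can be illustrated with a minimal, stdlib-only sketch of N-Quads literal serialization (this is NOT rdflib's actual serializer code; the function name and naive escaping are invented for the example). Once the language branch is taken, the datatype is implicitly rdf:langString and must not be written out, so a nested datatype check inside that branch can never change the output:

```python
def serialize_literal(value, language=None, datatype=None):
    """Hedged sketch of N-Quads literal serialization logic."""
    # naive escaping, sufficient for this illustration only
    encoded = value.replace('\\', '\\\\').replace('"', '\\"')
    if language:
        # language-tagged literal: datatype is implicitly rdf:langString,
        # so no ^^<...> suffix is written and no datatype check is needed
        return f'"{encoded}"@{language}'
    if datatype:
        return f'"{encoded}"^^<{datatype}>'
    return f'"{encoded}"'
```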

Collaborator Author


Agreed! But this is code copied (and cleaned) from rdflib's nquads serializer, so I'll fold it into a PR there.

@mielvds
Collaborator Author

mielvds commented May 5, 2026

since BNode is a subclass of str, if you e.g. check types with mypy, you might get no error when you by mistake compare a str to a BNode, but such a comparison would always fail. This might play a role when you call hash_related_blank_node. But as I said, I don't have a full grasp of the program.

@WhiteGobo do you mean isinstance(component, BNode)? What would be a better way to check if something is a blank node?

@WhiteGobo

WhiteGobo commented May 5, 2026

do you mean isinstance(component, BNode)?

No, I was thinking more of finding type errors with tools like mypy, and of the other way around, so something like isinstance(accidently_some_BNode, str).

This was merely a warning that, when using str instead of directly using a BNode, mypy wouldn't find inconsistent use between the two.
Just to illustrate, take for example canon.py line 55ff. I'll change some lines so that they would produce an error:

        bnode_map: "dict[str, str]"  # declared so that mypy can detect wrong inserts
        normalized, bnode_map = self._canonicalize(rdflib_dataset)
        # mapping old bnode IDs to their new canonical IDs
        for k, v in parser._bnode_ids.items():
            bnode_id = str(v)
            # Imagine you weren't paying attention and used the original BNode
            # by mistake instead of the extracted str.
            if v in bnode_map:  # this line is erroneous, so mypy should warn here
                bnode_map[k] = bnode_map[bnode_id]
                del bnode_map[bnode_id]

For this kind of scenario, I use mypy to warn me that I did something wrong, but mypy would register BNode as a valid value, because it's just a subclass of str. And I don't mean that extracting the id from a BNode is wrong; I just wanted to point it out in case you use mypy to find such errors. As you extract bnode_map from somewhere else, I don't even think you should prefer BNode over str.

@mielvds
Collaborator Author

mielvds commented May 5, 2026

Ah ok, type checking, gotcha. Adding type hints everywhere would generate too many changes (and be too much work). I think this is for another PR.

