PROTOTYPE: Use lexical scope to assign metadata to runs in notebooks #6

@alaric-dotmesh

As a user of the Python library in Jupyter, I'd like to be able to re-run arbitrary cells in my notebook in whatever order I see fit and still have correct run metadata generated, so that I can enjoy the interactivity and explorability of Jupyter notebooks along with the provenance tracking and reproducibility of Dotscience.

Currently, I can't, because of #1. Let's try using a different approach and see if it works.

ACs:

As this is a prototyping effort, all these ACs are to be considered "aspirational"; we'll see what we can achieve in practice then decide, at the end, whether what we have is better than what we ALREADY have.

  • I can run cells in any order a reasonable user would do so, and get the results I'd expect in my run metadata.
  • No extra user effort is required.
  • Unless I do something really/deliberately silly or unlikely, there's no way for two runs to end up merged together in a Dotscience commit.

These ACs should apply for all of these cases:

  • Notebooks with one run.
  • Notebooks with two or more runs in separate cells, e.g. multiple calls to ds.publish().
  • Notebooks with a loop that generates runs, e.g. a single call to ds.publish() inside a loop (for instance, trying the same algorithm with a range of input parameters to see what's best).
  • A combination of the previous two cases.

Implementation plan:

We have a CUNNING PLAN to break this impasse! It's Luke's suggestion:

  • Don't store state in-memory in the Python library, because the history of that in-memory state follows the dynamic flow of execution of Jupyter cells, which may have nothing to do with their order in the notebook, leading to the problems referred to above.
  • Instead, every time you call a metadata-registration function like ds.input(), it should output a machine-readable tag at that very point.
  • ds.publish() outputs an "end of this run" tag.
  • The parser (whether it reads a notebook or command output) reads the tags from top to bottom, building up in-memory state in notebook lexical order; at each "end of this run" tag it outputs a run and clears its in-memory state.
  • Therefore, the assignment of actions to runs is based purely on the lexical structure, not the dynamic structure.
  • For extra niceness, in Jupyter mode, we can output the markers inside "Jupyter widgets" that control their display (rather than plain text) so they're less obtrusive and prettier; but we need to fall back transparently to plain text when not in Jupyter.
  • How does this work with "publish inside a loop"? Unless we come up with a clever trick, we'll only keep the results of the last iteration of the loop. But do users actually call publish inside a loop to try the same algorithm with different input parameters, or do they copy and paste the cell and edit the parameters in each copy?
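
To make the plan above concrete, here is a minimal sketch of the tag-and-parse idea, not the actual Dotscience implementation. Everything in it is an assumption made for illustration: the `[[DS-TAG]]` prefix, the JSON payload, the `emit_tag` and `parse_runs` helpers, and the tag kinds `input` and `end_of_run`. Cell outputs are modelled as plain strings listed in notebook order; the widget display and the "publish inside a loop" question are not addressed.

```python
import json

# Hypothetical marker format; the real tag syntax is not decided in this issue.
TAG_PREFIX = "[[DS-TAG]]"


def emit_tag(kind, **payload):
    """Build the one-line tag that a call like ds.input() or ds.publish()
    would write into its cell's output at the point of the call."""
    return TAG_PREFIX + json.dumps({"kind": kind, **payload}, sort_keys=True)


def parse_runs(cell_outputs):
    """Group tags into runs by reading cell outputs in notebook (lexical)
    order, regardless of the order in which the cells were executed."""
    runs = []
    current = []  # actions seen since the last "end of this run" tag
    for output in cell_outputs:
        for line in output.splitlines():
            if not line.startswith(TAG_PREFIX):
                continue  # ordinary cell output, not a tag
            tag = json.loads(line[len(TAG_PREFIX):])
            if tag["kind"] == "end_of_run":
                runs.append(current)  # output the run...
                current = []          # ...and clear the in-memory state
            else:
                current.append(tag)
    # Tags after the last "end of this run" tag belong to no run and are dropped.
    return runs


# Three cells in notebook order; the execution order does not matter here.
notebook = [
    "loading data\n" + emit_tag("input", path="a.csv"),
    emit_tag("end_of_run"),
    emit_tag("input", path="b.csv") + "\n" + emit_tag("end_of_run"),
]
runs = parse_runs(notebook)
```

In this sketch `runs` holds two runs, the first with the `a.csv` input and the second with the `b.csv` input, because the grouping depends only on where the tags sit in the notebook.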

Labels: task (New feature or request)
