Merged
1 change: 1 addition & 0 deletions book/_toc.yml
@@ -44,6 +44,7 @@ parts:
- file: plugins/how-to-guides/usage-examples
- file: plugins/how-to-guides/format-validation-levels
- file: plugins/how-to-guides/handle-exceptions-in-parallel-pipelines
- file: plugins/how-to-guides/track-the-value-of-auto-params-in-provenance
- file: plugins/how-to-guides/include-R-in-plugins
- file: plugins/how-to-guides/raise-visible-warning
- file: plugins/explanations/intro
4 changes: 4 additions & 0 deletions book/back-matter/glossary.md
@@ -23,6 +23,10 @@ Artifact class
Artifact API
See {term}`Python 3 API`.

CaptureHolder
A class used as the Python type annotation for {term}`Parameters <Parameter>` whose default value is determined algorithmically (e.g., random number generator seeds).
Using this class ensures that the algorithmically set value is tracked in {term}`Provenance`.

Collection
An ordered list of `key: value` pairs. Think of an ordered [Python dictionary](https://docs.python.org/3/tutorial/datastructures.html#dictionaries).
These can be used as {term}`Input`, {term}`Parameter`, and {term}`Output` {term}`Types <Type>`.
81 changes: 62 additions & 19 deletions book/plugins/how-to-guides/create-register-pipeline.md
@@ -7,32 +7,58 @@ This is accomplished by stitching together one or more `Methods` and/or `Visuali

Defining a function that can be registered as a `Pipeline` is very similar to defining one that can be registered as a `Method` with a few distinctions.

First, `Pipelines` do not use function annotations and instead receive `Artifact` objects as input and return `Artifact` and/or `Visualization` objects as output.
First, `Pipelines` are not required to use function annotations unless you are using the {term}`CaptureHolder` API documented [here](howto-track-the-value-of-auto-params-in-provenance). Instead they implicitly receive `Artifact` objects as input and return `Artifact` and/or `Visualization` objects as output.

If you choose to annotate a `Pipeline`, you must annotate all inputs, parameters, and outputs, as well as the special `ctx` argument (described below). Parameters follow the same [mypy](http://mypy-lang.org/) syntax as in `Methods` and `Visualizers`. Inputs and outputs, however, are annotated simply as `Artifact` or `Visualization` for single values, or as `list[Artifact]`, `dict[str, Artifact]`, `list[Visualization]`, or `dict[str, Visualization]` for `Collections`. The `ctx` argument must be annotated as `IContext`.

Second, `Pipelines` must have `ctx` as their first parameter, which provides the following API:
- `ctx.get_action(plugin: str, action: str)`: returns a *sub-action* that can be called like a normal Artifact API call.
- `ctx.make_artifact(type, view, view_type=None)`: this has the same behavior as `Artifact.import_data`. It is wrapped by `ctx` for pipeline book-keeping.

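The calling pattern that `ctx.get_action` enables can be sketched without QIIME 2 installed. This is only an illustration under stated assumptions: `StubContext`, `toy_plugin`, `double`, `total`, and `toy_pipeline` are hypothetical names that are not part of QIIME 2, and plain lists stand in for `Artifact` objects. What it shows is that each sub-action is looked up by plugin and action name, called with keyword arguments, and returns a tuple-like result that the pipeline unpacks or accumulates.

```python
class StubContext:
    """Hypothetical stand-in for the `ctx` object; not part of QIIME 2."""

    def __init__(self, actions):
        self._actions = actions

    def get_action(self, plugin, action):
        # Look up a callable by (plugin, action), mimicking ctx.get_action.
        return self._actions[(plugin, action)]


def double(table):
    # Sub-actions return a tuple of results, even for a single output.
    return ([value * 2 for value in table],)


def total(table):
    return (sum(table),)


def toy_pipeline(ctx, table):
    double_action = ctx.get_action('toy_plugin', 'double')
    total_action = ctx.get_action('toy_plugin', 'total')

    results = []
    doubled, = double_action(table=table)   # unpack the single output
    results.append(doubled)
    results += total_action(table=doubled)  # extend with every output

    return tuple(results)


ctx = StubContext({
    ('toy_plugin', 'double'): double,
    ('toy_plugin', 'total'): total,
})
outputs = toy_pipeline(ctx, [1, 2, 3])
```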
Let's take a look at [`q2_diversity.core_metrics`](https://github.com/qiime2/q2-diversity/blob/99a0ccaaec14838b95845dbfe57f874d092b65c7/q2_diversity/_core_metrics.py#L10) for an example of a function that we can register as a `Pipeline`:
Let's take a look at [`q2_diversity.core_metrics`](https://github.com/qiime2/q2-diversity/blob/3fe491062b8a72939111ff66b2f4aeab8c12b16d/q2_diversity/_core_metrics.py#L14) for an example of a function that we can register as a `Pipeline`:

```python
def core_metrics(ctx, table, sampling_depth, metadata, n_jobs=1):
def core_metrics(ctx: IContext,
table: Artifact,
sampling_depth: int,
metadata: Metadata,
with_replacement: bool = False,
n_jobs: int = 1,
ignore_missing_samples: bool = False,
random_seed: CaptureHolder[int] = None) -> \
tuple[
Artifact, Artifact, Artifact, Artifact, Artifact, Artifact,
Artifact, Artifact, Visualization, Visualization
]:
random_int = CaptureHolder.get_or_set(random_seed, get_np_random_seed)
biom_table = table.view(biom.Table)
if biom_table.length() < 2:
raise ValueError(
'Table must have at least two samples as beta diversity will be'
' applied later.'
)

rarefy = ctx.get_action('feature_table', 'rarefy')
alpha = ctx.get_action('diversity', 'alpha')
beta = ctx.get_action('diversity', 'beta')
observed_features = ctx.get_action('diversity_lib', 'observed_features')
pielou_e = ctx.get_action('diversity_lib', 'pielou_evenness')
shannon = ctx.get_action('diversity_lib', 'shannon_entropy')
braycurtis = ctx.get_action('diversity_lib', 'bray_curtis')
jaccard = ctx.get_action('diversity_lib', 'jaccard')
pcoa = ctx.get_action('diversity', 'pcoa')
emperor_plot = ctx.get_action('emperor', 'plot')

results = []
rarefied_table, = rarefy(table=table, sampling_depth=sampling_depth)
rarefied_table, = rarefy(table=table, sampling_depth=sampling_depth,
with_replacement=with_replacement,
random_seed=random_int)
results.append(rarefied_table)

for metric in 'observed_otus', 'shannon', 'pielou_e':
results += alpha(table=rarefied_table, metric=metric)
for metric in (observed_features, shannon, pielou_e):
results += metric(table=rarefied_table)

dms = []
for metric in 'jaccard', 'braycurtis':
beta_results = beta(table=rarefied_table, metric=metric, n_jobs=n_jobs)
for metric in (jaccard, braycurtis):
beta_results = metric(table=rarefied_table, n_jobs=n_jobs)
results += beta_results
dms += beta_results

@@ -43,7 +69,8 @@ def core_metrics(ctx, table, sampling_depth, metadata, n_jobs=1):
pcoas += pcoa_results

for pcoa in pcoas:
results += emperor_plot(pcoa=pcoa, metadata=metadata)
results += emperor_plot(pcoa=pcoa, metadata=metadata,
ignore_missing_samples=ignore_missing_samples)

return tuple(results)
```
@@ -61,7 +88,7 @@ A description of this output should be included in `output_descriptions`
Citations do not need to be added for the pipeline unless unique citations are required for the pipeline that are not appropriate for the underlying `Methods` and `Visualizers` that it calls.
Citations for these underlying actions are automatically logged in citation provenance for this pipeline.

As an example for registering a `Pipeline`, we can look at `q2_diversity.core_metrics` (find the original source [here](https://github.com/qiime2/q2-diversity/blob/99a0ccaaec14838b95845dbfe57f874d092b65c7/q2_diversity/plugin_setup.py#L494)):
As an example for registering a `Pipeline`, we can look at `q2_diversity.core_metrics` (find the original source [here](https://github.com/qiime2/q2-diversity/blob/3fe491062b8a72939111ff66b2f4aeab8c12b16d/q2_diversity/plugin_setup.py#L496-L565)):

```python
plugin.pipelines.register_function(
@@ -72,11 +99,14 @@ plugin.pipelines.register_function(
parameters={
'sampling_depth': Int % Range(1, None),
'metadata': Metadata,
'n_jobs': Int % Range(0, None),
'with_replacement': Bool,
'n_jobs': Threads,
'ignore_missing_samples': Bool,
'random_seed': Int
},
outputs=[
('rarefied_table', FeatureTable[Frequency]),
('observed_otus_vector', SampleData[AlphaDiversity]),
('observed_features_vector', SampleData[AlphaDiversity]),
('shannon_vector', SampleData[AlphaDiversity]),
('evenness_vector', SampleData[AlphaDiversity]),
('jaccard_distance_matrix', DistanceMatrix),
@@ -88,17 +118,30 @@ plugin.pipelines.register_function(
],
input_descriptions={
'table': 'The feature table containing the samples over which '
'diversity metrics should be computed.',
'diversity metrics should be computed.',
},
parameter_descriptions={
'sampling_depth': 'The total frequency that each sample should be '
'rarefied to prior to computing diversity metrics.',
'rarefied to prior to computing diversity metrics.',
'metadata': 'The sample metadata to use in the emperor plots.',
'n_jobs': '[beta methods only] - %s' % sklearn_n_jobs_description
'with_replacement': with_replacement_description,
'n_jobs': '[beta methods only] - %s' % n_jobs_description,
'ignore_missing_samples': 'If set to `True` samples and features '
'without metadata are included by '
'setting all metadata values to: '
'"This element has no metadata". By '
'default an exception will be raised if '
'missing elements are encountered. Note, '
'this flag only takes effect if there is at '
'least one overlapping element.',
'random_seed': 'Seed for the random number generation used to rarefy '
'your feature table.'

},
output_descriptions={
'rarefied_table': 'The resulting rarefied feature table.',
'observed_otus_vector': 'Vector of Observed OTUs values by sample.',
'observed_features_vector': 'Vector of Observed Features values by '
'sample.',
'shannon_vector': 'Vector of Shannon diversity values by sample.',
'evenness_vector': 'Vector of Pielou\'s evenness values by sample.',
'jaccard_distance_matrix':
@@ -116,7 +159,7 @@ plugin.pipelines.register_function(
},
name='Core diversity metrics (non-phylogenetic)',
description=("Applies a collection of diversity metrics "
"(non-phylogenetic) to a feature table.")
"(non-phylogenetic) to a feature table.")
)
```

@@ -0,0 +1,66 @@
(howto-track-the-value-of-auto-params-in-provenance)=
# Track the value of random seeds and other automated parameter settings in data provenance

There are cases where you might want an {term}`Action` to take a {term}`Parameter` that can be set to an explicit value by the user, or be algorithmically determined by the {term}`Action` itself.
The most common use case for this is a random seed, where you may allow the user to pass a set seed (usually because they are trying to reproduce a previous result) or have the {term}`Action` set a random seed for general use (usually the default approach, achieved by setting the default value to `None`).
This creates a challenge for reproducibility: if `None` is passed into the {term}`Action`, then `None` will be recorded in the output's {term}`Provenance`.
The value that the seed was set to internally is lost, making it impossible to exactly reproduce a prior {term}`Action` execution because the random seed is not known.

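The loss of information described above can be reproduced in plain Python. The following is a minimal sketch, not QIIME 2 code: `shuffle_without_capture` and `recorded_parameters` are hypothetical names, and the dictionary only stands in for what a record of the call's parameters would contain. The record holds `None`, while the seed that was actually used exists only inside the function; replaying the call reproduces the result only if that hidden seed is somehow known.

```python
import random


def shuffle_without_capture(items, random_seed=None):
    # What a naive record of the call's parameters would contain.
    recorded_parameters = {'random_seed': random_seed}

    # The seed actually used is chosen here and never written back.
    seed = random_seed if random_seed is not None else random.randrange(2**32)
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled, recorded_parameters, seed


# The record says None, so the first result cannot be reproduced from it.
first, record, hidden_seed = shuffle_without_capture(range(10))

# Only by passing the hidden seed explicitly is the result reproduced.
replayed, _, _ = shuffle_without_capture(range(10), random_seed=hidden_seed)
```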
This problem is solved by using the {term}`CaptureHolder` object.

```{note}
The {term}`CaptureHolder` only works if the value of the algorithmically set parameter is actually accessible in the Python implementation of the function.
If you pass a sentinel value to an underlying tool (e.g., R code that your {term}`Action` calls under the hood) and that tool chooses the actual value itself, the chosen value cannot be recorded in {term}`Provenance`.
```

The {term}`Action` registration is unchanged:

```python
my_plugin.methods.register_function(
    function=method_with_random_seed,
    inputs={},
    parameters={
        'random_seed': Int
    },
    outputs=[('seed', SingleInt)],
    name='Takes a random seed',
    description='Takes an integer as a random seed and returns that same'
                ' integer. If no integer is provided, it generates one at'
                ' random and captures that randomly generated integer in'
                ' provenance.'
)
```

What changes is the implementation of the underlying Python function:

```python
import random
import sys

from qiime2.plugin.type import CaptureHolder


def method_with_random_seed(random_seed: CaptureHolder[int] = None) -> int:
    # Resolve the seed: if the user passed None, generate a random value and
    # record it in provenance; otherwise use the user-supplied value as-is.
    random_int = CaptureHolder.get_or_set(
        random_seed, lambda: random.randrange(sys.maxsize)
    )

    # `my_function` is a placeholder for your own logic. It receives the
    # resolved integer value, which is guaranteed to never be None here.
    my_value = my_function(random_int)

    return my_value
```

The following rules must be followed to use the {term}`CaptureHolder` object:

1. The type annotation on the {term}`CaptureHolder` {term}`Parameter` must be `CaptureHolder[T]`, where `T` is the Python view type that corresponds to the QIIME 2 {term}`Semantic Type` used for the {term}`Parameter` at registration (e.g., `CaptureHolder[int]` for a parameter registered as `Int`).
2. The default value of the {term}`CaptureHolder` {term}`Parameter` must be `None`.
3. `CaptureHolder.get_or_set(<instance>, <callable>)` must be called exactly once per {term}`CaptureHolder` {term}`Parameter`, before the parameter is used. The return value is the resolved value that should be used in place of the {term}`CaptureHolder` going forward.

`CaptureHolder.get_or_set` takes two arguments: the {term}`CaptureHolder` {term}`Parameter` instance, and a zero-argument callable that generates a value when one is needed.
If the user passed `None`, the callable is invoked and its return value is used.
If the user passed an explicit value, that value is returned directly.
In both cases the resolved value is written back into the {term}`Action`'s {term}`Provenance` as though it had been passed in by the user originally.

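The resolution rules above can be modelled in a few lines of plain Python. This is a toy sketch and not QIIME 2's implementation: `ToyHolder` and `toy_get_or_set` are hypothetical names, and the `recorded` attribute merely stands in for the value that would be written back to provenance.

```python
class ToyHolder:
    """Hypothetical stand-in for a CaptureHolder; not the QIIME 2 class."""

    def __init__(self, value=None):
        self.value = value      # value supplied by the user, or None
        self.recorded = None    # stands in for what provenance would store


def toy_get_or_set(param, factory):
    """Return the resolved value, calling `factory` only when none was given."""
    if isinstance(param, ToyHolder):
        if param.value is None:
            param.value = factory()
        # Record the resolved value whether it was generated or supplied.
        param.recorded = param.value
        return param.value
    # Plain value or None, as when the function is called directly in a test.
    return factory() if param is None else param


generated = ToyHolder()       # the user passed nothing
explicit = ToyHolder(42)      # the user passed an explicit seed
resolved_generated = toy_get_or_set(generated, lambda: 7)
resolved_explicit = toy_get_or_set(explicit, lambda: 7)
```

In both cases the holder ends up recording the value that was actually used, which is the property that makes a later re-run reproducible.
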
```{note}
When calling the underlying function directly during testing (rather than through QIIME 2), `CaptureHolder.get_or_set` behaves correctly whether the parameter is a `CaptureHolder` instance, a plain value, or `None`.
This means you can write unit tests that call the function directly without any special handling.
```