Commit b29188b

GitHub #95: Workflow: Documentation plan (#515)
1 parent 44c357f commit b29188b

22 files changed

Lines changed: 899 additions & 96 deletions

docs/architecture/authoring-definition-boundary.md

Lines changed: 12 additions & 3 deletions
@@ -94,9 +94,18 @@ findings from historical definition drift.
 Workflow code may observe its current history budget without reading storage
 internals. `historyLength()` returns the current history event count,
 `historySize()` returns the serialized history size estimate, and
-`shouldContinueAsNew()` returns the continue-as-new suggestion flag. These are
-advisory authoring signals; workflow code still chooses when to call
-`continueAsNew()`.
+`historyFanOut()` returns the largest parallel-group breadth recorded in this
+run's history. `shouldContinueAsNew()` returns the continue-as-new suggestion
+flag, which is true when any dimension reaches its hard threshold.
+
+The budget is reported as a three-state pressure indicator:
+`historyBudgetPressure()` returns `ok`, `approaching`, or
+`continue_as_new_recommended`. Each dimension (event count, payload size,
+fan-out) has a soft (warning) threshold and a hard (continue-as-new) threshold.
+Reaching any soft threshold flips pressure to `approaching`; reaching any hard
+threshold flips it to `continue_as_new_recommended` and sets
+`shouldContinueAsNew()` to true. These are advisory authoring signals;
+workflow code still chooses when to call `continueAsNew()`.

 ## Activity idempotency surface
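
Since the hunk above stresses that these signals are advisory, here is a minimal sketch of how workflow code might act on them. The `HistoryBudgetSignals` interface, the `FixedSignals` double, and the `planNextStep()` policy are hypothetical names introduced only for illustration; only the two accessor names and the three pressure states come from the diff.

```php
<?php

declare(strict_types=1);

// Hypothetical stand-in for the advisory signals named in the hunk above.
// The real v2 workflow base class is not reproduced here.
interface HistoryBudgetSignals
{
    public function historyBudgetPressure(): string;

    public function shouldContinueAsNew(): bool;
}

// Hypothetical test double that reports a fixed pressure state.
final class FixedSignals implements HistoryBudgetSignals
{
    public function __construct(private string $pressure)
    {
    }

    public function historyBudgetPressure(): string
    {
        return $this->pressure;
    }

    public function shouldContinueAsNew(): bool
    {
        return $this->pressure === 'continue_as_new_recommended';
    }
}

// Assumed policy: roll over only at a batch boundary, and stop starting new
// batches as soon as pressure leaves `ok`. The signal never forces the call.
function planNextStep(HistoryBudgetSignals $signals, bool $atBatchBoundary): string
{
    if ($signals->shouldContinueAsNew() && $atBatchBoundary) {
        return 'continue_as_new';
    }

    if ($signals->historyBudgetPressure() !== 'ok') {
        return 'finish_current_batch';
    }

    return 'start_next_batch';
}

echo planNextStep(new FixedSignals('approaching'), true), PHP_EOL; // finish_current_batch
echo planNextStep(new FixedSignals('continue_as_new_recommended'), true), PHP_EOL; // continue_as_new
```

The point of the sketch is that the decision stays in workflow code: a hard-threshold signal that arrives mid-batch is deferred until a boundary the author considers safe.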

docs/architecture/history-budget.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
+# History Budget
+
+Workflow runs accumulate history events across activities, timers, signals,
+updates, side effects, child workflows, and message cursor advances. Without
+explicit budgets, long-running runs can grow until replay cost or persistence
+limits cause failures that are hard to diagnose and impossible to retroactively
+fix on a single run. The history budget contract gives operators and workflow
+code an inspectable signal long before that boundary is reached.
+
+## Dimensions
+
+The budget is computed across three dimensions, each with a soft (warning)
+threshold and a hard (continue-as-new) threshold:
+
+| Dimension | Counter | Default soft | Default hard | Config root |
+| --- | --- | --- | --- | --- |
+| Event count | `history_event_count` — number of `workflow_history_events` rows | 8 000 | 10 000 | `workflows.v2.history_budget.event_warning_threshold` / `.continue_as_new_event_threshold` |
+| Payload size | `history_size_bytes` — serialized event-type + JSON payload size | 4 MiB | 5 MiB | `.size_bytes_warning_threshold` / `.continue_as_new_size_bytes_threshold` |
+| Fan-out | `history_fan_out` — largest `parallel_group_size` recorded in any parallel group | 160 | 200 | `.fan_out_warning_threshold` / `.continue_as_new_fan_out_threshold` |
+
+Setting any threshold to `0` disables that dimension; warning thresholds clamp
+to the corresponding hard threshold so a misconfigured warning cannot fire
+after continue-as-new is already recommended.
+
+## Pressure indicator
+
+Each run has a derived `history_budget_pressure` value with three states:
+
+- `ok` — every dimension is below its soft threshold.
+- `approaching` — at least one dimension is at or above its soft threshold,
+  but no dimension has crossed its hard threshold.
+- `continue_as_new_recommended` — at least one dimension is at or above its
+  hard threshold. `continue_as_new_recommended=true` is also surfaced as a
+  boolean for backward compatibility.
+
+The pressure value is computed from the same counters that drive
+`continue_as_new_recommended`, so operators see the same authoritative signal
+across waterline, the run detail view, and operator metrics.
+
+## Surfaces
+
+- `Workflow::historyLength()`, `historySize()`, `historyFanOut()`,
+  `historyBudgetPressure()`, and `shouldContinueAsNew()` are advisory
+  authoring signals exposed on the v2 workflow base class.
+- `WorkflowRunSummary` persists `history_event_count`, `history_size_bytes`,
+  `history_fan_out`, `continue_as_new_recommended`, and
+  `history_budget_pressure`. `RunListItemView` and `RunDetailView` project
+  these directly.
+- `RunDetailView` additionally returns the active soft and hard thresholds
+  (`history_event_threshold`, `history_event_warning_threshold`,
+  `history_size_bytes_threshold`, `history_size_bytes_warning_threshold`,
+  `history_fan_out_threshold`, `history_fan_out_warning_threshold`) and the
+  list of dimensions that triggered the current pressure
+  (`history_budget_pressure_dimensions`) so operators can explain *why* a run
+  is approaching the boundary.
+- `OperatorMetrics::history` reports
+  `continue_as_new_recommended_runs`, `approaching_budget_runs`,
+  `max_event_count`, `max_size_bytes`, `max_fan_out`, and the configured
+  thresholds for each dimension.
+
+## Replay-safety
+
+Counters come straight from frozen history-event payloads. Fan-out is derived
+by taking the maximum `parallel_group_size` across distinct
+`parallel_group_id` values recorded in the run's history events; the value is
+deterministic for a given history and re-derives correctly on any replay.
+Workflow code that branches on `historyBudgetPressure()` or
+`shouldContinueAsNew()` therefore reaches the same decision on the original
+attempt and on every subsequent replay.
+
+## What this contract does *not* cover
+
+- Snapshot or history compaction is intentionally out of scope for the first
+  release. The budget contract ships first so archive can land on top of an
+  inspectable correctness signal before any compaction protocol is committed.
+- Reset semantics (truncating history at a chosen sequence) remain a reserved
+  operator command and are tracked separately in the v2 plan.
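
To make the threshold rules in the added document concrete, here is a minimal sketch of how the three-state pressure value could be derived from the counters. The function name `deriveHistoryBudgetPressure()`, the array shapes, and the short dimension keys are assumptions made for illustration; the commit's actual implementation is not shown on this page. Two further assumptions are labeled in the comments: a `0` threshold disables only the check it configures, and the reported dimensions are those at the highest level reached.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, not the package's implementation.
 *
 * @param array<string, int> $counters current value per dimension
 * @param array<string, array{soft: int, hard: int}> $thresholds
 * @return array{pressure: string, dimensions: list<string>}
 */
function deriveHistoryBudgetPressure(array $counters, array $thresholds): array
{
    $atSoft = [];
    $atHard = [];

    foreach ($thresholds as $dimension => $limit) {
        $value = $counters[$dimension] ?? 0;
        $hard = max(0, $limit['hard']);
        $soft = max(0, $limit['soft']);

        // Clamp a misconfigured warning threshold down to the hard threshold.
        // Assumption: the clamp applies only while the hard check is enabled.
        if ($hard > 0 && $soft > $hard) {
            $soft = $hard;
        }

        // Assumption: a threshold of 0 disables the check it configures.
        if ($hard > 0 && $value >= $hard) {
            $atHard[] = $dimension;
        } elseif ($soft > 0 && $value >= $soft) {
            $atSoft[] = $dimension;
        }
    }

    if ($atHard !== []) {
        return ['pressure' => 'continue_as_new_recommended', 'dimensions' => $atHard];
    }

    if ($atSoft !== []) {
        return ['pressure' => 'approaching', 'dimensions' => $atSoft];
    }

    return ['pressure' => 'ok', 'dimensions' => []];
}

// Defaults taken from the Dimensions table above.
$defaults = [
    'event_count' => ['soft' => 8_000, 'hard' => 10_000],
    'size_bytes' => ['soft' => 4 * 1024 * 1024, 'hard' => 5 * 1024 * 1024],
    'fan_out' => ['soft' => 160, 'hard' => 200],
];

$result = deriveHistoryBudgetPressure(
    ['event_count' => 8_500, 'size_bytes' => 1_000, 'fan_out' => 12],
    $defaults,
);

echo $result['pressure'], PHP_EOL; // approaching
```

With the default thresholds, a run at 8 500 events sits between the soft and hard event limits, so the sketch reports `approaching` with `event_count` as the only triggering dimension.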

docs/workflow/plan.md

Lines changed: 4 additions & 0 deletions
@@ -275,6 +275,10 @@ explicitly reserved for a future contract before support is advertised.
   defines first-class deployment lifecycle and rollout blockage.
 - [`docs/architecture/sticky-execution.md`](../architecture/sticky-execution.md)
   defines sticky replay-cache routing and cold-replay fallback.
+- [`docs/architecture/history-budget.md`](../architecture/history-budget.md)
+  defines the soft and hard thresholds for history event count, payload
+  size, and parallel fan-out, and the `pressure` indicator that drives
+  `continue_as_new_recommended`.
 - [`docs/architecture/workflow-service-calls-architecture.md`](../architecture/workflow-service-calls-architecture.md)
   defines cross-namespace service-call lifecycle and outcome semantics.
 - [`docs/architecture/cross-namespace-service-policy.md`](../architecture/cross-namespace-service-policy.md)

src/V2/Models/WorkflowRunSummary.php

Lines changed: 1 addition & 0 deletions
@@ -42,6 +42,7 @@ class WorkflowRunSummary extends Model
         'projection_schema_version' => 'integer',
         'history_event_count' => 'integer',
         'history_size_bytes' => 'integer',
+        'history_fan_out' => 'integer',
         'continue_as_new_recommended' => 'bool',
         'started_at' => 'datetime',
         'sort_timestamp' => 'datetime',

src/V2/Support/HealthCheck.php

Lines changed: 55 additions & 16 deletions
@@ -5,8 +5,6 @@
 namespace Workflow\V2\Support;

 use Carbon\CarbonInterface;
-use Illuminate\Contracts\Cache\Repository as CacheRepository;
-use Illuminate\Support\Facades\App;

 final class HealthCheck
 {
@@ -577,21 +575,27 @@ private static function longPollWakeAccelerationCheck(): array
             'reason' => null,
         ];

-        $cache = self::resolveCacheRepository();
+        $defaultStore = self::configuredDefaultCacheStore();
+        $configuredDriver = self::configuredCacheDriver($defaultStore);
+
+        if ($configuredDriver === null) {
+            $data['backend'] = null;
+            $data['capable'] = false;
+            $data['safe'] = false;
+            $data['reason'] = 'Cache backend is not resolvable; wake acceleration may be disabled. Durable discovery continues via bounded polling.';

-        if ($cache === null) {
             return self::check(
                 'long_poll_wake_acceleration',
                 'warning',
-                'Cache repository is not resolvable; wake acceleration may be disabled. Durable discovery continues via bounded polling.',
+                $data['reason'],
                 self::CATEGORY_ACCELERATION,
                 $data,
             );
         }

         $validator = new LongPollCacheValidator();
-        $capability = $validator->validateMultiNodeCapable($cache);
-        $safety = $validator->checkMultiNodeSafety($cache, $multiNode);
+        $capability = $validator->validateMultiNodeCapableFromDriver($configuredDriver);
+        $safety = $validator->checkMultiNodeSafetyFromDriver($configuredDriver, $multiNode);

         $data['backend'] = is_string($capability['backend'] ?? null) ? $capability['backend'] : null;
         $data['capable'] = (bool) ($capability['capable'] ?? false);
@@ -621,18 +625,53 @@ private static function longPollWakeAccelerationCheck(): array
         );
     }

-    private static function resolveCacheRepository(): ?CacheRepository
+    /**
+     * Read the currently configured default cache store name. The check
+     * deliberately reads `cache.default` (with a fall-through to the older
+     * `cache.driver` alias) every snapshot so operator-visible config is
+     * the source of truth, not a previously-resolved store memoized in the
+     * cache manager.
+     */
+    private static function configuredDefaultCacheStore(): ?string
     {
-        try {
-            // Resolve through the CacheManager so the check reflects the
-            // currently configured default store. The cache.store container
-            // singleton is bound on first access and does not reflect later
-            // changes to cache.default, which drifts from the advertised
-            // backend when operators reconfigure cache at runtime.
-            return App::make('cache')->store();
-        } catch (\Throwable) {
+        $value = config('cache.default') ?? config('cache.driver');
+
+        if (! is_string($value)) {
             return null;
         }
+
+        $normalized = trim($value);
+
+        return $normalized === '' ? null : $normalized;
+    }
+
+    /**
+     * Resolve the driver name configured for the given default cache store.
+     * Falls back to the store name itself when the driver entry is missing
+     * (Laravel's CacheManager does the same — the store key is the driver
+     * name when no explicit `driver` is configured).
+     */
+    private static function configuredCacheDriver(?string $store): ?string
+    {
+        if ($store === null) {
+            return null;
+        }
+
+        $driver = config(sprintf('cache.stores.%s.driver', $store));
+
+        if (is_string($driver)) {
+            $normalized = trim($driver);
+
+            if ($normalized !== '') {
+                return $normalized;
+            }
+        }
+
+        if (config(sprintf('cache.stores.%s', $store)) !== null) {
+            return $store;
+        }
+
+        return null;
     }

     /**
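
The two new helpers in this file resolve a driver name purely from configuration. A minimal sketch of the same lookup order, written against a plain array so it runs without Laravel, may make the fallbacks easier to follow. The function name `resolveCacheDriver()` and the array-based input are assumptions for illustration; the commit itself reads values through `config()`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical, framework-free mirror of the lookup order in the diff above.
 * $cacheConfig stands in for the `cache` config tree.
 */
function resolveCacheDriver(array $cacheConfig): ?string
{
    // Prefer `default`, then fall through to the older `driver` alias.
    $store = $cacheConfig['default'] ?? $cacheConfig['driver'] ?? null;

    if (! is_string($store) || trim($store) === '') {
        return null;
    }

    $store = trim($store);
    $entry = $cacheConfig['stores'][$store] ?? null;
    $driver = is_array($entry) ? ($entry['driver'] ?? null) : null;

    // An explicit, non-empty driver entry wins.
    if (is_string($driver) && trim($driver) !== '') {
        return trim($driver);
    }

    // A store entry without a driver falls back to the store name itself;
    // an unknown store resolves to null.
    return $entry !== null ? $store : null;
}

var_export(resolveCacheDriver([
    'default' => 'sessions',
    'stores' => ['sessions' => ['driver' => 'database']],
])); // 'database'
echo PHP_EOL;

var_export(resolveCacheDriver([
    'default' => 'missing',
    'stores' => [],
])); // NULL
echo PHP_EOL;
```

Reading configuration on every call, rather than asking the container for an already-resolved store, is what lets the check follow a runtime change to `cache.default`, which is the motivation stated in the new docblock.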
