
Reaggregate metrics between failed exports #37

Open

alco wants to merge 1 commit into main from fix/export-metrics-reaggregate

Conversation

@alco (Member) commented Apr 21, 2026

Summary

`OtelMetricExporter.MetricStore.export_metrics/1` guards the OTLP HTTP send with a `Task.yield(task, 20_000) || Task.shutdown(task)` wrapper (a minimal sketch of this pattern follows). Under sustained export failures the request payload grows on every cycle, because each retained generation of data is sent as its own OTLP data point per metric/tag combination. Eventually the send takes longer than 20s and the task times out on every export cycle, which is the behaviour reported in electric-sql/stratovolt#1455.
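A minimal sketch of that guard pattern, assuming an illustrative `payload` and `send_otlp_request/1` helper (not the exporter's actual names):

```elixir
task = Task.async(fn -> send_otlp_request(payload) end)

# Task.yield/2 returns nil if the task has not replied within the timeout;
# Task.shutdown/1 then kills it, returning any reply that raced the shutdown.
case Task.yield(task, 20_000) || Task.shutdown(task) do
  {:ok, result} -> result
  # The task exited before replying; surface the exit reason.
  {:exit, reason} -> {:error, reason}
  # Neither yield nor shutdown produced a reply: treat as a timeout.
  nil -> {:error, :timeout}
end
```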

This PR collapses all pending generations into a single one before building the payload, so the request contains at most one data point per metric/tag combination regardless of how many prior exports failed (see the sketch after this list):

  • counters / sums / distribution bucket counts are summed
  • last_value keeps the newest value (older generations are overwritten by newer ones during the merge)
  • distribution min / max are merged with min / max
  • the carry-over generation's finish timestamp is extended to cover the full span of the collapsed generations
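A minimal sketch of those per-type merge rules, assuming simplified value shapes; `MergeSketch`, `merge_value/3`, and `merge_distribution/2` are illustrative names, not the module's actual API:

```elixir
defmodule MergeSketch do
  # Counters / sums: values from successive generations add up.
  def merge_value(:counter, old, new), do: old + new
  def merge_value(:sum, old, new), do: old + new
  # last_value: the newer generation simply wins.
  def merge_value(:last_value, _old, new), do: new

  # Distributions: bucket counts are summed; min/max are merged with
  # min/max; the finish timestamp extends to cover the collapsed span.
  def merge_distribution(old, new) do
    %{
      buckets: Map.merge(old.buckets, new.buckets, fn _bound, a, b -> a + b end),
      min: min(old.min, new.min),
      max: max(old.max, new.max),
      finish_time: max(old.finish_time, new.finish_time)
    }
  end
end
```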

Payload size (and therefore send duration) stays bounded across repeated failures. The change is scoped purely to `export_metrics/1`'s handling of pending data; the `convert_value/2` changes for stratovolt#1454 are unaffected.

Link: electric-sql/stratovolt#1455

Test plan

  • mix compile --warnings-as-errors
  • mix test (45 tests, 0 failures)
  • Updated the existing "preserves metrics across generations on failed exports" test to assert reaggregation (single data point with combined value) instead of the old accumulation behaviour
  • New test: after 5 consecutive failed exports, the next successful export contains exactly one data point with the cumulative sum, not five (an illustrative sketch follows this list)
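Purely as an illustration of the expected behaviour (this is not the suite's actual test; it exercises the hypothetical `MergeSketch` helper from the sketch above rather than the real store):

```elixir
test "five failed generations collapse to one cumulative data point" do
  # Each of the five failed export cycles left one generation holding a
  # counter increment of 1.
  merged =
    List.duplicate(1, 5)
    |> Enum.reduce(0, &MergeSketch.merge_value(:counter, &2, &1))

  # One data point with the cumulative sum, rather than five separate points.
  assert merged == 5
end
```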

When `export_metrics/1` fails, the `MetricStore` used to leave each
generation's data points separately in its ETS table and include them all
in the next export attempt as independent OTLP data points per tag. Under
sustained export failures (e.g. intermittent receiver errors or rate
limiting) the payload grows linearly with the number of accumulated
generations, which in turn causes the 20s `Task.yield` timeout guarding
the HTTP send to fire constantly.

Collapse all pending generations into a single one before building the
payload. Counters, sums and distribution bucket counts are summed;
last_value keeps the newest value; distribution min/max are merged with
min/max. The resulting request contains at most one data point per
metric/tag combination regardless of how many prior exports failed, so
payload size (and therefore send duration) stays bounded.

Refs: electric-sql/stratovolt#1455

@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8a0edff126


```elixir
    merge_row(state.metrics_table, type, bucket, target_key, value, extra)
  end

  :ets.match_delete(state.metrics_table, {{gen, :_, :_, :_, :_}, :_, :_})
```

P1: Move generation deletion after successful export

Deleting rows in `reaggregate_failed_generations/3` before the send outcome is known can drop metrics under normal concurrent writes. A writer that read the old generation just before `rotate_generation/1` can still insert into `current_gen` during export; those inserts can be removed here without being merged (or orphaned once the generation metadata is removed), and if the HTTP send later fails they are no longer retried. This is a regression from the previous flow, which only deleted pending generations after `:ok`.
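A hedged sketch of that ordering, with illustrative function names (`export_and_prune/2`, `build_payload/2`, and `send_otlp_request/1` are assumptions, not the module's real API): merge the pending generations into the payload first, and only delete their rows once the send returns `:ok`:

```elixir
defp export_and_prune(state, pending_gens) do
  payload = build_payload(state, pending_gens)

  case send_otlp_request(payload) do
    :ok ->
      # Only now is it safe to drop the merged generations' rows.
      for gen <- pending_gens do
        :ets.match_delete(state.metrics_table, {{gen, :_, :_, :_, :_}, :_, :_})
      end

      :ok

    {:error, _reason} = error ->
      # Keep the rows; they will be reaggregated on the next attempt.
      error
  end
end
```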

