Skip to content

perf(plugin): per-batch atomic Postgres write optimisations for bulk-apply#172

Draft
ldrozdz93 wants to merge 4 commits into
developfrom
perf/per-batch-atomic-postgres-write
Draft

perf(plugin): per-batch atomic Postgres write optimisations for bulk-apply#172
ldrozdz93 wants to merge 4 commits into
developfrom
perf/per-batch-atomic-postgres-write

Conversation

@ldrozdz93
Copy link
Copy Markdown
Contributor

@ldrozdz93 ldrozdz93 commented May 12, 2026

[claude]

Summary

Three Postgres-side optimisations stacked on top of the bulk-apply endpoint, plus a configurable retry on Postgres deadlock. All writes happen inside the batch's outer transaction.atomic(), so the batch remains atomic from a concurrent reader's perspective.

Draft PR — not for merge yet.

Status: NOT tested after rebase

This branch has not been exercised end-to-end against develop's code. The optimisations were originally developed and benchmarked on a separate branch (perf/bulk-orm-no-ttl-bump) that pre-dated develop's own /bulk-apply/ endpoint. To produce this PR, the branch was rebased onto the current develop and re-targeted at BulkApplyView instead of the older ApplyChangeSetBatchView.

The rebased branch has only been:

  • Syntax-checked (python ast.parse)
  • Linted (ruff clean on changed files)
  • Verified logically against the pre-rebase branch (each optimisation behaves the same; per-batch reader-atomicity preserved)

It has NOT been:

  • Run against a real NetBox instance
  • Exercised by any unit / integration test suite on this base
  • Benchmarked

Anyone using this branch for local performance testing should run the plugin's test suite as a first sanity check, then run their own end-to-end ingest before drawing conclusions. The bench numbers below were measured on the pre-rebase branch — they should hold (the optimisations are structurally identical) but a fresh sweep is the right thing to confirm.

Single-changeset endpoint also changes — drop or test before merge

In addition to the batch endpoint, this PR also modifies the existing single-changeset endpoint (ApplyChangeSetView at /apply-change-set/). Before this PR, that endpoint just did:

change_set = ChangeSet.from_dict(request.data)
result = _apply_one_changeset(change_set, request)
return Response(result.to_dict(), status=result.get_status_code())

After this PR, it does:

with deferred_changelog() as defc, transaction.atomic():
    result = _apply_one_changeset(change_set, request)
    if result.errors:
        defc.rollback_pending()
    else:
        defc.commit_pending()
    defc.flush()
return Response(result.to_dict(), status=result.get_status_code())

So the single-changeset endpoint also picks up:

  • Bulk audit-log INSERT for that one changeset's ObjectChange rows (was: many small INSERTs as the signal handler fired per-save; now: one bulk_create at the end inside the same atomic).
  • Per-changeset bulk TaggedItem write (applied via apply_tags_bulk from applier.py).
  • Preload of ContentType + Tag IDs (also from applier.py).
  • Atomic data + audit + tags commit — same atomicity story as the batch endpoint, just for a single changeset.

It does NOT pick up the deadlock retry (that's only on BulkApplyView).

These changes were not the explicit target of the original optimisation work — they came along because applier.py is shared and because audit-log buffering had to be wired into both endpoints to be self-consistent. They have a few risks:

  • Production hot path. The single-changeset endpoint is on the regular ingest path for older / non-batched callers. A subtle change in audit-log row ordering or m2m-signal suppression could affect downstream consumers (event subscribers, webhook receivers) in ways the batch path doesn't.
  • No targeted test coverage. The bench work measured the batch endpoint only. The single-changeset endpoint has only been syntax/lint-checked here.
  • Tag-write semantics. apply_tags_bulk issues a DELETE + bulk INSERT (replace-set semantics, matching instance.tags.set([...])). On a single-changeset call this is functionally the same as the prior path, but the SQL shape is different (one transaction wrapping one DELETE + one INSERT, instead of multiple m2m_changed triggers). Anything monitoring extras_taggeditem mutations via signals will see a different pattern.

Recommendation before merge: pick one of

  1. Drop the single-changeset endpoint changes — leave ApplyChangeSetView._post unchanged from develop, scope this PR strictly to the batch endpoint. Lower risk, narrower change surface, matches what was actually benchmarked. The plugin code in applier.py and deferred_changelog.py is shared, but the calling convention from the single-changeset view can stay outside the new wrapper.
  2. Thoroughly test the single-changeset endpoint — run the plugin's existing test suite for ApplyChangeSetView, plus add coverage for the new audit-log / tag-write paths under that endpoint. Check for any downstream consumer (webhooks, event subscribers, observers) that depends on the old per-save audit-row pattern.

Option 1 is the cheaper / lower-risk path if the only goal is the bench win on the batch endpoint. Option 2 is the right path if you want the optimisations to also apply to single-changeset callers (smaller absolute win, but the per-changeset CPU drop should still help).

What changed, in simple words

1. Preload ContentType + tag IDs once per changeset

Before: every entity save did one or more small SELECT queries to look up "what's the integer ID for this model type?" and "what's the integer ID for the diode tag?". With many entities, that's hundreds of small round-trips to Postgres.
After: at the start of each changeset, the plugin runs one SELECT * FROM django_content_type and one SELECT * FROM extras_tag, caches the results in a Python dict on the request, and every subsequent ID lookup inside the changeset is an in-memory dict access instead of a Postgres call.

2. Batch-wide single bulk INSERT of NetBox audit-log rows

Before: every time an entity is saved, NetBox's signal handler inserts one row into core_objectchange (the audit log). For a 100-changeset batch with ~280 entities each, that's ~28,000 individual INSERTs into the audit-log table.
After: a thread-local buffer collects ObjectChange rows in memory across all changesets in the batch. At the very end of the batch, the buffer is flushed with one ObjectChange.objects.bulk_create(...). ~28,000 small INSERTs become 1 batched INSERT.

A failed-changeset accumulator (pending vs committed buffers) makes sure that if changeset N rolls back via its savepoint, its pending audit rows are discarded — they don't leak into the final bulk_create. So audit-log consistency is preserved: a row exists in the audit log only if the data it audits was actually committed.

3. Per-changeset bulk INSERT of TaggedItem rows + suppress redundant signal

Before: every device.tags.add(...) issued one INSERT INTO extras_taggeditem per tag. For ~280 entities × ~3 tags each, that's ~840 small INSERTs per changeset, and NetBox's m2m_changed handler also fires one extra ObjectChange audit row per tag attachment (~840 extra audit rows per changeset on top).
After: tag-link rows are accumulated in memory per changeset and written in one TaggedItem.objects.bulk_create(...) at the end of each changeset. The redundant m2m_changed audit-row fire is suppressed, so a tag attachment no longer doubles the audit-log volume.

Measured impact (pre-rebase numbers)

Workload: meraki-x10.yaml (28,310 entities), single host, RPS=100, APPLY_BATCH_SIZE=100, USE_BATCH_APPLY=true, AUTO_APPLY_CHANGESETS=false. Three runs each cell, mean vs mean:

metric baseline optimised drop
NetBox container CPU-seconds 512.8 388.1 −24.3 %
NetBox-postgres container CPU-seconds 146.2 106.9 −26.9 %
wall-clock (apply window) 44.2 min 32.9 min −25.5 %

Range no-overlap on all three metrics: every optimised run was faster, less CPU-busy, and easier on Postgres than every baseline run. CV around 7–8% on both cells over the three runs.

These numbers were taken on the pre-rebase branch and measured only the batch endpoint. They have not been re-measured after the rebase, and they do not speak to the single-changeset endpoint's behaviour (see Status section).

Rare deadlock risk: concurrent batches modifying the same entities in different order

Because the whole batch runs inside one outer transaction.atomic(), every row this batch writes stays row-locked in Postgres until the outer commit (which can take seconds, depending on batch size). That trade-off is the cost of giving readers a consistent per-batch view, and it opens up a rare but real deadlock scenario.

The scenario. Two different ingest requests are handled by two separate reconciler workers in parallel. The two resulting batch-API calls happen to touch the same entities, but in a different order. For example:

  • batch X (worker 1): modifies entity A first, then entity B.
  • batch Y (worker 2): modifies entity B first, then entity A.

X takes the row lock on A; Y takes the row lock on B. Then X tries to write B (blocked, Y has it) and Y tries to write A (blocked, X has it). Each is waiting for the other to release. This is the classic A-B / B-A deadlock pattern.

Postgres auto-resolves this. When Postgres detects the wait-cycle (default deadlock_timeout = 1 second), it picks one of the two transactions as the "victim" and aborts it with SQLSTATE 40P01 (deadlock_detected). The other transaction proceeds to commit normally. So the system doesn't hang — but this should be tested empirically. The above reasoning was assisted by AI analysis and should not be trusted blindly — please verify the actual behaviour in your local test setup before relying on it for capacity planning.

Mitigation (implemented in this PR). The aborted batch raises Django's OperationalError with pgcode='40P01'. The plugin catches this and retries the whole batch with jittered exponential backoff. Max retries is configurable:

PLUGINS_CONFIG = {
    "netbox_diode_plugin": {
        # ... existing settings ...
        "batch_apply_deadlock_retry_max_count": 3,  # default
    }
}
  • Default 3 retries (so up to 4 attempts total).
  • 0 disables retries entirely.
  • Bigger values raise the retry budget for deadlock-prone deployments; smaller values fail faster.

On success after retry, the response carries an X-Diode-Batch-Retries: <n> header so it shows up in the caller's logs.

The deadlock retry is implemented only on the batch endpoint. The single-changeset endpoint inherits the outer atomic + audit-buffer wrapping but does not retry on deadlock; a deadlock there bubbles up to the caller as a 500 (as it does today on develop).

Alternative: client-side retry on 40P01. The reconciler client could in principle catch the deadlock error code on the gRPC/HTTP response and retry the call from its side. That would work, but it's more complicated than retrying on the server side: the reconciler has to re-encode the gRPC request, re-burn rate-limiter budget, re-traverse the network, and the visibility of "which batch is being retried" gets harder to correlate across logs. Doing it inside the plugin keeps the retry cheap (milliseconds), local, and observable via a single response header.

Effect on concurrent NetBox UI users during a batch

The same row-level locking that enables per-batch atomicity also means that any concurrent NetBox UI user (or any other writer) will block if they try to modify an entity that the in-flight batch has already locked. The block lasts until the outer transaction commits — which, depending on batch size, can be a few seconds for typical workloads. Plain UI reads (list pages, filter views, detail pages) are not blocked — under Postgres's default MVCC, readers see the pre-batch snapshot and never wait on writers.

So during a heavy ingest, a user who happens to be editing a device that the ingest is also touching will see a multi-second hang on "Save". For most workloads this is invisible (no overlap), but it's a real behaviour worth knowing about.

🤖 Generated with Claude Code

PR 2 of BULK-ORM Sprint 1.

Adds `_preload_changeset_cache(change_set, request)` called at the top of
`apply_changeset`. It walks the changeset once, collects every distinct
object_type and every tag slug, then:

- Calls `ContentType.objects.get_for_models(*models)` — a single query
  that warms Django's per-process content-type cache so subsequent
  per-change `get_for_model()` calls are free.
- Issues one `Tag.objects.filter(slug__in=...).values_list('id','slug')`
  query and stashes `tag_ids_by_slug` on `request._diode_preload`.

PR 4 will reuse `tag_ids_by_slug` to bulk-write `TaggedItem` rows.

Test `test_preload_cache.py` asserts:
- After preload, `get_for_model(Site)` runs in 0 queries.
- Tag IDs for slugs referenced by non-NOOP changes are resolved in <= 2
  queries (1 ContentType warm + 1 Tag fetch).
@ldrozdz93 ldrozdz93 force-pushed the perf/per-batch-atomic-postgres-write branch from b7acf92 to c1b47bc Compare May 12, 2026 07:55
@ldrozdz93 ldrozdz93 changed the title perf(plugin): per-batch atomic Postgres write optimisations for apply-change-set-batch perf(plugin): per-batch atomic Postgres write optimisations for bulk-apply May 12, 2026
ldrozdz93 and others added 3 commits May 12, 2026 10:03
NetBox's core.signals.handle_changed_object fires ObjectChange.save()
once per post_save / m2m_changed event. For a multi-thousand-entity
batch the per-instance INSERT becomes the dominant write cost.

Install a module-level monkey-patch on ObjectChange.save that, when a
thread-local flag is active, buffers the instance instead of writing
it. The deferred_changelog() context manager yields a _Deferred
accumulator with commit_pending() / rollback_pending() / flush()
methods:

* On success of a changeset, the caller calls commit_pending(): the
  changeset's pending audit-log rows promote into the batch buffer.
* On failure (ChangeSetException -> savepoint rollback), the caller
  calls rollback_pending(): the failed changeset's audit rows are
  discarded so they cannot leak past the rollback.
* At batch end, flush() drains the batch buffer with one
  ObjectChange.objects.bulk_create(buffer, batch_size=500).

m2m merge dedup is preserved by a per-window in-memory index keyed on
(content_type_id, object_id, request_id): a follow-up m2m_changed fire
collapses onto the prior buffered row instead of producing a duplicate
audit entry.

ApplyChangeSetView and BulkApplyView both wrap their work with
deferred_changelog(), so the batch's audit-log rows commit together
with the batch's data writes (per-batch atomicity to readers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
NetBox's instance.tags.set([...]) triggers m2m_changed per tagged
instance. The signal handler then (a) writes another ObjectChange row
and (b) re-runs serialize_for_event(instance) for the events_queue.
For changesets with thousands of tagged objects this dominates apply
cost on top of the per-object signal overhead.

Add bulk_tags.apply_tags_bulk(instance_tag_pairs, request) which:
- Bypasses the through-table m2m signal by bulk-writing
  extras_taggeditem rows directly via bulk_create(ignore_conflicts=
  True, batch_size=500). _diode_preload['tag_ids_by_slug'] (from the
  preload step) is reused; any missing slugs are resolved via one
  fallback query.
- Mirrors instance.tags.set([...]) replace-set semantics by deleting
  existing TaggedItem rows for the (content_type, object_id) targets in
  one combined DELETE before the bulk INSERT.
- Re-emits a single enqueue_event(OBJECT_UPDATED) per tagged instance
  so the events_queue payload reflects the post-tag state, without the
  duplicate ObjectChange row.

applier._apply_change pops 'tags' from 'data' before serializer.save()
runs and buffers (instance, tag_input) on request._diode_tag_pairs;
apply_changeset flushes via apply_tags_bulk after its main loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The whole batch runs inside an outer transaction.atomic(), so all data
+ tag + audit writes for the batch become visible to readers in one
atomic transition. The trade-off is a longer lock-hold window per
batch, which can cause Postgres deadlocks (SQLSTATE 40P01) when two
concurrent batches touch the same rows in different orders.

Wrap BulkApplyView's outer transaction.atomic() in a bounded retry
loop. On a Postgres deadlock or serialization failure (40001), the
whole batch transaction is rolled back; the retry loop re-enters with
fresh deferred_changelog accumulators (no stale state from the aborted
attempt).

Behaviour:
- New plugin setting: batch_apply_deadlock_retry_max_count, default 3
  (three retries after the initial attempt = up to 4 attempts total).
- 0 disables retries entirely.
- Backoff is jittered exponential: 50ms * 2^attempt * U(0.5, 1.5).
- On success after retries, the response carries X-Diode-Batch-Retries:
  <n> so the reconciler-side log can correlate.
- On exhaustion, the original OperationalError is re-raised; caller
  handles the failure as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ldrozdz93 ldrozdz93 force-pushed the perf/per-batch-atomic-postgres-write branch from c1b47bc to de8aa33 Compare May 12, 2026 08:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant