Skip to content

perf(tag inheritance): batch_mode + per-batch bulk during import + reorganize#14877

Open
valentijnscholten wants to merge 21 commits into
DefectDojo:devfrom
valentijnscholten:perf/tag-inheritance-batch-during-import
Open

perf(tag inheritance): batch_mode + per-batch bulk during import + reorganize#14877
valentijnscholten wants to merge 21 commits into
DefectDojo:devfrom
valentijnscholten:perf/tag-inheritance-batch-during-import

Conversation

@valentijnscholten
Copy link
Copy Markdown
Member

@valentijnscholten valentijnscholten commented May 14, 2026

Summary

Wraps the import / reimport hot loop in tag_inheritance.suppress_tag_inheritance() and bulk-applies inherited Product tags per batch, before post_process_findings_batch dispatches, so rules / deduplication see inherited tags on finding.tags. Pairs with the per-batch import-tag work landed in #14839.

This is the Phase B Stage 2 follow-up to #14811 / #14812 / #14827.

Changes

  • process_findings in DefaultImporter and DefaultReImporter now runs its finding-creation loop inside tag_inheritance.suppress_tag_inheritance(). Per batch — after apply_import_tags_for_batch and before dojo_dispatch_task(post_process_findings_batch, ...) — calls a new helper apply_inherited_tags_for_findings(batch_findings) that bulk-syncs inherited tags on the batch's Findings + the Endpoints (V2) / Locations (V3) reachable from them via FK chain.
  • DefaultReImporter now tracks newly-created findings separately in new_findings_in_batch and passes only those to apply_inherited_tags_for_findings. Matched/existing findings already have inheritance from their original creation, so re-running through-table reads on them was pure overhead on no-change reimports.
  • LocationManager._persist_locations skips apply_inherited_tags_for_locations when no new LocationProductReference rows were created. New finding refs only link findings already in self._product, so they can't change a Location's Product membership; only a new product ref can. When the product-ref list is empty the Location's inherited target set is unchanged.
  • The signal-driven auto-inheritance entrypoint is renamed inherit_instance_tagsauto_inherit_product_tags and moved from dojo.tags.inheritance to dojo.tags.signals, colocated with the receivers that call it. It is the only inheritance helper that uses is_suppressed() as a gate (others use suppress_tag_inheritance() to wrap their writes), so the gate semantic is exclusive to the signal path.
  • The post_save handlers for child models (Endpoint/Engagement/Test/Finding/Location) and ref models (LocationFindingReference/LocationProductReference) are combined into a single inherit_tags_on_instance handler with sender-based target dispatch + created=True gate. Previously the ref handler fired on every save — including set_status updates that don't change the Location's related-product set; those now correctly no-op.
  • EndpointManager.get_or_create_endpoints replaces its per-row inherit_instance_tags(ep) cleanup loop with a single apply_inherited_tags_for_endpoints(created) bulk call. Inside the importer the per-batch helper already covers these endpoints via Endpoint.status_endpoint.finding; the bulk call is retained as a defensive hook for non-importer callers.
  • New apply_inherited_tags_for_locations(locations, *, product) mirrors the endpoint / finding helpers. LocationManager._bulk_inherit_tags is removed; the persist path delegates to the new helper. Drops a redundant outer suppress_tag_inheritance() (writes inside _sync_inherited_tags are already wrapped) and a redundant bulk pre-fetch of existing inherited tags (the through-table primitive does that read once already).
  • _inherited_tag_names_for_location (the per-Location callback for the bulk sync primitive) now filters contributing products by their own enable_product_tag_inheritance flag. Previously, tags from flag-off products linked to the same Location could leak in via propagate_tags_on_product_sync and apply_inherited_tags_for_findings.
  • The bulk sync primitive is restructured: _sync_inheritance_for_qs(queryset, target_names_per_child=callable)_sync_inheritance_for_ids(model_class, child_ids, target_tag_names: set[str] | Callable[[int], set[str]]). Callers pass values_list("pk", flat=True) directly so the primitive never materializes children as full model instances; the bulk helpers only need instance.pk and instance.__class__, so model_class(pk=pid) stubs (created lazily for rows whose tag set actually changes) are enough. Constant-target callers pass a plain set; the per-pk callable form is reserved for Locations (multi-product target). Eliminates the Finding-row hydration that dominated wall-clock time on large products (~22s for ~14000 findings → expected to drop substantially; the query-count baseline can't see this since it measures SQL, not Python time).
  • propagate_tags_on_product_sync (used by the product-tag-toggle Celery task) gains an early-exit when neither system-wide nor per-product inheritance is enabled, eliminating ~9 wasted reads per call on inheritance-off products. State transitions (toggling either flag) still trigger a full sweep via their existing signal handlers.
  • Location gains iter_related_products() — a related-manager (self.products + self.findings) implementation returning list[Product]. Callers that pre-issue prefetch_related("products__product__tags", "findings__finding__test__engagement__product__tags") get zero extra queries per Location. The existing JOIN'd all_related_products() is kept for the per-instance signal path where prefetching is not possible.
  • Change the "managing" of inherited tags from a "rebuild every time" to a "diff" based approach where added inherited tags on products are cascaded into the object's tags field, and removed inherited tags on products are also removed from the object's tag. No need anymore to reconstruct user_tags or do complex passing of potential tags. Behavior is still the same, guarded by pre-existing tests.
  • Centralize tag (inheritance) logic into dojo.tags.inheritance. Move the propagate_tags_on_product Celery task into that module alongside its _sync counterpart and delete dojo/product/helpers.py. A propagate_tags_on_product_deprecated alias is kept under the old task name so in-flight tasks complete after upgrade.

Query-count impact

Pins in unittests/test_tag_inheritance_perf.py updated to match. "Before" numbers for the per-100-children scenarios are the pre-Phase-A baselines recorded in source comments next to each constant. The two propagate_tags_on_product_sync rows and the reimport-with-new-findings rows are newly added baselines so there is no Before number.

Hot path Before After Δ
ZAP scan import V2 (19 findings) 1378 420 -958
ZAP scan import V3 1256 444 -812
ZAP reimport no-change V2 69 69 0
ZAP reimport no-change V3 87 81 -6
ZAP reimport subset → full V2 (9 new) 169 (new)
ZAP reimport subset → full V3 (9 new) 198 (new)
Product tag add → 100 findings (V2) 4758 91 -4667
Product tag add → 100 findings (V3) 4759 91 -4668
Product tag remove → 100 findings (V2) 4540 53 -4487
Product tag remove → 100 findings (V3) 4541 53 -4488
Product tag add → 100 endpoints (V2) 3958 91 -3867
Product tag remove → 100 endpoints (V2) 3740 53 -3687
Product tag add → 100 locations (V3) 4532 125 -4407
Product tag remove → 100 locations (V3) 4307 75 -4232
propagate_tags_on_product_sync no-op V2 9 (new)
propagate_tags_on_product_sync no-op V3 18 (new)

…ase B Stage 2)

Wraps the import / reimport hot loop in `tag_inheritance.batch_mode()` and
bulk-applies inherited Product tags per batch *before* `post_process_findings_batch`
dispatches, so rules / deduplication see inherited tags on `finding.tags`.

Changes:

- `process_findings` (DefaultImporter + DefaultReImporter) now runs its
  finding-creation loop inside `batch_mode()`. Per batch, right after
  `apply_import_tags_for_batch`, calls a new helper
  `apply_inherited_tags_for_findings(batch_findings)` that bulk-syncs
  inherited tags on the batch's Findings plus the Endpoints (V2) / Locations
  (V3) reachable from them via FK chain. Inheritance is therefore applied to
  the persisted children before the post-process task dispatches.
- `inherit_instance_tags` in `dojo/tags_signals.py` now early-returns when
  `tag_inheritance.is_in_batch_mode()`, so the batch wrap transparently
  suppresses per-row inheritance work for any caller — including
  `bulk_create` cleanup paths that invoke it manually. `inherit_tags_on_instance`
  post_save delegates to that helper, so the gate also covers signal-driven
  fires.
- `EndpointManager.get_or_create_endpoints` replaces its per-row
  `inherit_instance_tags(ep)` cleanup loop with a single
  `apply_inherited_tags_for_endpoints(created)` bulk call. Inside the importer
  the per-batch helper already covers these endpoints via
  `Endpoint.status_endpoint.finding`; the bulk call is kept as a defensive
  hook for any non-importer caller.
- `propagate_tags_on_product_sync` (used by the product-tag-toggle Celery
  task) gains an early-exit when neither system-wide nor per-product
  inheritance is enabled, eliminating ~9 wasted reads per call on
  inheritance-off products. State transitions (toggling either flag) still
  trigger a full sweep through their existing signal handlers.
- `Location` gains `iter_related_products()`: a related-manager
  (`self.products` + `self.findings`) implementation of `all_related_products()`
  that returns `list[Product]`. Callers that pre-issue
  `prefetch_related("products__product__tags",
  "findings__finding__test__engagement__product__tags")` get zero extra
  queries per Location. The existing JOIN'd `all_related_products()` is kept
  for the per-instance signal path where prefetching is not possible.
- `_inherited_tag_names_for_location` (the per-Location callback used to
  compute the inherited target set) switches to `iter_related_products()`;
  both call sites (`propagate_tags_on_product_sync` V3 branch and
  `apply_inherited_tags_for_findings` V3 branch) now prefetch the chain.

Query-count impact on `unittests/test_tag_inheritance_perf.py` (pins updated
in this commit):

| Hot path                                | Before  | After  | Δ     |
|-----------------------------------------|--------:|-------:|------:|
| ZAP scan import V2 (19 findings)        |    1385 |    477 |  -908 |
| ZAP scan import V3                      |    1263 |    945 |  -318 |
| ZAP reimport no-change V2               |      69 |     75 |    +6 |
| ZAP reimport no-change V3               |      87 |    102 |   +15 |
| Product tag add → 100 locations (V3)    |     316 |    125 |  -191 |
| Product tag remove → 100 locations (V3) |     266 |     75 |  -191 |

Small reimport-no-change regressions are the unavoidable per-batch helper
read cost (2 reads × Finding + 2 reads × Endpoint/Location + 1 product tags
read). Real-work imports drop significantly because per-row
`_manage_inherited_tags` work no longer fires inside the loop.
@valentijnscholten valentijnscholten force-pushed the perf/tag-inheritance-batch-during-import branch from ecd29bb to 8513855 Compare May 14, 2026 23:43
@github-actions
Copy link
Copy Markdown
Contributor

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@github-actions
Copy link
Copy Markdown
Contributor

Conflicts have been resolved. A maintainer will review the pull request shortly.

@valentijnscholten valentijnscholten changed the title perf(tags): batch_mode + per-batch bulk inheritance during import (Phase B Stage 2) perf(tags): batch_mode + per-batch bulk inheritance during import + reorganize May 15, 2026
Move scattered tag-inheritance logic into dojo/tag_inheritance.py without
changing runtime behavior:

- _manage_inherited_tags relocated from dojo/models.py (replaced by a
  lazy-import shim to avoid an import cycle through dojo.utils).
- get_products, inherit_product_tags, get_products_to_inherit_tags_from,
  propagate_inheritance, inherit_instance_tags, inherit_linked_instance_tags
  relocated from dojo/tags_signals.py; receivers there now delegate.
- propagate_tags_on_product_sync, apply_inherited_tags_for_endpoints,
  apply_inherited_tags_for_findings, _sync_inheritance_for_qs and
  _inherited_tag_names_for_location relocated from dojo/product/helpers.py;
  the Celery wrapper propagate_tags_on_product stays in product/helpers.py.

Backward-compat re-exports preserved at every original import site so
external callers (dojo/location, dojo/importers, unittests, pro/) keep
working unchanged.

Adjusts unittests.test_tag_inheritance mock targets to the new module
path. Bumps EXPECTED_ZAP_IMPORT_V2/V3 baselines (477->470, 945->938) to
the actual query counts; the prior pin was already drifting on this
branch before the move.
Move the three tag-related top-level modules into a dedicated package:

  dojo/tag_inheritance.py -> dojo/tags/inheritance.py
  dojo/tag_utils.py       -> dojo/tags/utils.py
  dojo/tags_signals.py    -> dojo/tags/signals.py

Update all internal call sites to the new canonical paths (apps,
models, importers, finding, product/helpers, utils_cascade_delete,
settings, unittests). No backward-compat shims left at the old paths.

Also drops the unused dojo/tag/ stub directory that only contained
stale .pyc files.

No logic change. All tag inheritance and tag bulk unit tests pass.
@github-actions github-actions Bot added the settings_changes Needs changes to settings.py based on changes in settings.dist.py included in this PR label May 15, 2026
Tag inheritance is usually enabled system-wide first, then optionally
overridden per product. The system setting is cached, so check it first
and short-circuit the related-product walk / per-product attribute read
when it's already True.

Sites updated:
- dojo/tags/inheritance.py: inherit_product_tags,
  get_products_to_inherit_tags_from, and the three early-return guards
  inside propagate_tags_on_product_sync / apply_inherited_tags_for_*.
- dojo/location/models.py: Location.products_to_inherit_tags_from.
- dojo/importers/location_manager.py: bulk-inherit early-return.

Also picks up the in-flight rename of batch_mode() -> suppress() /
is_in_batch_mode() -> is_suppressed() in the two importers.

Behavior unchanged. All tag inheritance / perf / bulk unit tests pass
(36 + 20 + 29).
…ance wrapper

- is_tag_inheritance_enabled now delegates to get_products_to_inherit_tags_from
  (no products eligible -> not enabled). Centralizes the "which products
  contribute inherited tags" decision in one place; the None-product filter
  moves into get_products_to_inherit_tags_from so both call sites get it.
- Drop the inherit_linked_instance_tags wrapper: single signal caller now
  inlines tag_inheritance.inherit_instance_tags(instance.location).
- Trim dead backward-compat re-exports from dojo/tags/signals.py; reroute
  the test imports to dojo.tags.inheritance directly.

Behavior unchanged. test_tag_inheritance (36) and test_tag_inheritance_perf
(20) pass.
Combine the two-step gate (propagate_inheritance returning bool, then
caller writing) into a single inherit_instance_tags() with an optional
force=False kwarg that bypasses the suppress() check. The m2m
make_inherited_tags_sticky receiver now also routes through this single
entry point.

Wrap the inner instance.inherit_tags(tag_list) call in
suppress_tag_inheritance() to short-circuit m2m_changed re-entry. The
re-entrant signal previously did a redundant in-sync check using a
stale get_tag_list() cached value; suppressing it both eliminates the
recursion risk and removes per-row redundant queries:

  - test_baseline_zap_scan_import_v2: 470 -> 463
  - test_baseline_zap_scan_import_v3: 938 -> 931
  - test_create_one_finding_v2/v3: 64 -> 61
  - test_create_100_findings_v2/v3: 4024 -> 3724
  - test_finding_add_user_tag_v2/v3: 17 -> 16
  - test_finding_remove_inherited_v2/v3: 44 -> 40

Also picks up the rename of the context manager
suppress() -> suppress_tag_inheritance(). Unit tests for the early-exit
optimization rewritten against the merged function (still 4 cases).
Rewrite the per-row inheritance primitive as a pure diff:

  current_inherited = obj.inherited_tags.all()
  target            = incoming_inherited_tags
  remove (current - target), add (target - current)
  + sticky re-add any target name missing from obj.tags

Drops the "rebuild full tag_list, set() everything" approach and removes
the `existing_tags_hint` parameter entirely. Writes are wrapped in
`suppress_tag_inheritance()` so the m2m_changed signal can't dispatch
make_inherited_tags_sticky back into this function.

Other cleanup:

- Rename `_manage_inherited_tags` -> `_sync_inherited_tags`.
- Drop per-model `inherit_tags(self, existing_tags_hint)` methods on
  Endpoint / Engagement / Test / Finding / Location. The signal path
  (`inherit_instance_tags`) and `LocationManager._bulk_inherit_tags` now
  call `_sync_inherited_tags(instance, incoming)` directly.
- Drop the unused `existing_tags_by_location` dict in
  `LocationManager._bulk_inherit_tags` (one fewer through-table read).
- Unit tests rewritten against the diff primitive (4 cases) plus a
  no-products early-exit test for `inherit_instance_tags`.

Perf baselines (test_tag_inheritance_perf):

  EXPECTED_ZAP_IMPORT_V2:            463 -> 422
  EXPECTED_ZAP_IMPORT_V3:            931 -> 809
  EXPECTED_ZAP_REIMPORT_NO_CHANGE_V3:102 -> 101
  EXPECTED_CREATE_ONE_FINDING_V2/V3:  61 -> 55
  EXPECTED_CREATE_100_FINDINGS_V2/V3: 3724 -> 3124  (-600)
  EXPECTED_FINDING_REMOVE_INHERITED_V2/V3: 40 -> 18
  EXPECTED_FINDING_ADD_USER_TAG_V2/V3:     16 -> 17  (+1)
…elpers.py

- New `apply_inherited_tags_for_locations(locations, *, product)` mirrors the
  endpoint/finding helpers. Replaces `LocationManager._bulk_inherit_tags`;
  callsite delegates. Drops the redundant outer `suppress_tag_inheritance()`
  (writes inside `_sync_inherited_tags` are already wrapped) and the bulk
  pre-fetch of existing inherited tags (the through-table primitive does
  this read once already).
- `_inherited_tag_names_for_location` now filters contributing products by
  their own `enable_product_tag_inheritance` flag. Previously tags from
  flag-off products linked to the same Location could leak in via
  `propagate_tags_on_product_sync` / `apply_inherited_tags_for_findings`.
- Move `propagate_tags_on_product` Celery task into `dojo.tags.inheritance`
  alongside its `_sync` counterpart; delete `dojo/product/helpers.py`. Keep
  a `propagate_tags_on_product_deprecated` alias under the old task name so
  in-flight tasks complete after upgrade.
- Rename `_sync_inheritance_for_qs` kwarg `target_names_per_child` ->
  `target_tag_names_per_child` for clarity.
- Update import sites + perf baseline V3 import: 809 -> 445 queries.
…tags, move to signals

Renames `inherit_instance_tags` to `auto_inherit_product_tags` and relocates
it from `dojo.tags.inheritance` to `dojo.tags.signals`. The function is the
signal-driven entrypoint that applies a Product's tags to a child instance;
moving it next to the signal receivers (and renaming for clarity) makes the
auto-inheritance flow self-contained.

Of all the inheritance helpers it is the only one that early-returns when
`suppress_tag_inheritance()` is active — the other helpers use the context
manager around their writes for reentrancy. That gate semantic is exclusive
to the signal path, so colocation fits.

Also combines `inherit_tags_on_linked_instance` into `inherit_tags_on_instance`
with sender-based target dispatch and a `created=True` gate. Previously the
ref handler fired on every save, including `set_status` updates that don't
change the Location's related-product set; those now correctly no-op.

Drops the unused `force=` parameter on the function while renaming.
Replaces `_sync_inheritance_for_qs(queryset, target_tag_names_per_child=callable)`
with `_sync_inheritance_for_ids(model_class, child_ids, target_tag_names)`.

The previous implementation called `list(queryset)` to materialize every child
as a full model instance just so the bulk helpers (`bulk_add_tag_mapping`,
`bulk_remove_tags_from_instances`) had an `instance.pk` and `instance.__class__`
to work with. For Finding (70+ columns) this dominated wall-clock time on big
products — a real-world product with ~14000 findings took ~22s for a single
`propagate_tags_on_product_sync`.

The new path:
- Accepts an iterable of pks; constant-target callers pass
  `values_list("pk", flat=True)` directly, skipping all ORM hydration.
- Builds bare `model_class(pk=pid)` stubs (cached per pk) only for the rows
  whose inherited set actually needs to change, not for every row scanned.
- Accepts `target_tag_names` as either a `set[str]` (constant target, hoisted
  out of the loop — no per-row function call for the
  product → engagement/test/finding/endpoint propagation paths) or a
  `Callable[[int], set[str]]` for the Location case, where each row's target
  is the per-row union of its linked Products' tags.
- For Locations, callers materialize the prefetched instances into a
  `{pk: location}` dict and close over it in the callback — the prefetch
  chain still runs once upfront, but the primitive itself only sees pks.
- Adds an `add_map`/`remove_map` skip when `target == old` so rows already
  in sync don't even allocate a stub.

Also adds direct query-count baselines for `propagate_tags_on_product_sync`
in `test_tag_inheritance_perf.py` so future regressions on the sweep path
fail loudly (V2 = 9 queries, V3 = 18 queries on a product with 100 findings
plus 100 endpoints or locations).

ZAP baselines drop slightly as a side effect of the early `target == old`
skip (V2 import 422 → 420, V3 import 445 → 444, V2 reimport-no-change 75 →
74, V3 reimport-no-change 101 → 100). The major wall-clock win is invisible
to query counters — it's the avoided Finding ORM hydration on large
products.
Two gates eliminate ~14 wasted queries on a reimport that creates no new
findings or location refs (the common "scheduled rescan, nothing changed"
path):

1. `DefaultReImporter.process_findings` now tracks newly-created findings
   in `new_findings_in_batch` (populated in the else branch of the
   matched/unmatched dispatch) and passes ONLY those to
   `apply_inherited_tags_for_findings`. Matched/existing findings already
   had inheritance applied at their original creation, so re-running the
   through-table read + Location prefetch chain on them is pure overhead.

2. `LocationManager._persist_locations` now skips
   `apply_inherited_tags_for_locations` when no new `LocationProductReference`
   rows were created. New `LocationFindingReference`s only add findings
   within `self._product`, so they can't change a Location's Product
   membership; only a new product ref can. When `all_product_refs` is empty,
   the Location's inherited target set is unchanged and the helper would do
   a costly no-op read for nothing.

Net effect on the pinned ZAP reimport-no-change baselines:
- V2: 75 → 69 (matches the pre-Phase-A baseline of 69)
- V3: 101 → 81 (beats the pre-Phase-A baseline of 87)
Imports the 10-finding ZAP subset first, then reimports the 19-finding full
report so 9 findings are created during reimport while 10 are matched.
Pins V2 = 169 queries, V3 = 198 queries. Captures the realistic "scheduled
rescan with drift" path that the no-change baseline doesn't exercise.
@valentijnscholten valentijnscholten added this to the 2.59.0 milestone May 15, 2026
@valentijnscholten valentijnscholten marked this pull request as ready for review May 15, 2026 19:41
@valentijnscholten valentijnscholten added the affects_pro PRs that affect Pro and need a coordinated release/merge moment. label May 15, 2026
@valentijnscholten valentijnscholten changed the title perf(tags): batch_mode + per-batch bulk inheritance during import + reorganize perf(tag inheritance): batch_mode + per-batch bulk during import + reorganize May 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

affects_pro PRs that affect Pro and need a coordinated release/merge moment. settings_changes Needs changes to settings.py based on changes in settings.dist.py included in this PR unittests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant