Commit d61da2c

perf: replace per-object async delete with SQL cascade walker (#14566)
* Optimize prepare_duplicates_for_delete and add test coverage
  Replace the per-original O(n×m) loop with a single bulk UPDATE for the
  inside-scope duplicate reset. Outside-scope reconfiguration still runs
  per-original, but now uses .iterator() and .exists() to avoid loading full
  querysets into memory. Also adds WARN-level logging to fix_loop_duplicates
  for visibility into how often duplicate loops occur in production, and a
  comment on removeLoop explaining the optimization opportunity.

* fix: remove unused import and fix docstring lint warning

* perf: eliminate per-original queries with prefetch and bulk reset
  Remove redundant .exclude() and .exists() calls by leveraging the bulk
  UPDATE that already unlinks inside-scope duplicates. Add prefetch_related
  to fetch all reverse relations in a single query.

* add comment

* perf: replace per-object async delete with SQL cascade walker
  Replace the per-object obj.delete() approach in async_delete_crawl_task
  with a recursive SQL cascade walker that compiles QuerySets to raw SQL and
  walks model._meta.related_objects bottom-up. This auto-discovers all FK
  relations at runtime, including those added by plugins.
  Key changes:
  - New dojo/utils_cascade_delete.py: cascade_delete() utility
  - New dojo/signals.py: pre_bulk_delete_findings signal for extensibility
  - New bulk_clear_finding_m2m() in finding/helper.py for M2M cleanup with
    FileUpload disk cleanup and orphaned Notes deletion
  - Rewritten async_delete_crawl_task with chunked cascade deletion
  - Removed async_delete_chunk_task (no longer needed)
  - Product grading recalculated once at the end instead of per-object

* perf: replace mass_model_updater with single UPDATE in reconfigure_duplicate_cluster
  Use QuerySet.update() instead of mass_model_updater to re-point duplicates
  to the new original: a single SQL query instead of loading all findings into
  Python and calling bulk_update.
* cleanup: remove dead code from duplicate handling
  Remove reset_duplicate_before_delete, reset_duplicates_before_delete, and
  set_new_original, all replaced by the bulk UPDATE in
  prepare_duplicates_for_delete and .update() in reconfigure_duplicate_cluster.
  Remove the unused mass_model_updater import.

* fix: delete outside-scope duplicates before main scope to avoid FK violations
  When bulk-deleting findings in chunks, an original in an earlier chunk could
  fail to delete because its duplicate (higher ID) in a later chunk still
  references it via the duplicate_finding FK. Fix by deleting outside-scope
  duplicates first, then the main scope. Also moves the
  pre_bulk_delete_findings signal into bulk_delete_findings so it fires
  automatically.

* fix: use bool type for the DD_DUPLICATE_CLUSTER_CASCADE_DELETE env var

* fix: replace save_no_options with .update() in reconfigure_duplicate_cluster
  Avoids triggering Finding.save() signals (pre_save_changed,
  execute_prioritization_calculations) when reconfiguring duplicate clusters
  during deletion. Adds tests for cross-engagement duplicate reconfiguration
  and product deletion with duplicates.

* refactor: scope prepare_duplicates_for_delete to the full object, not per-engagement
  Adds product= and product_type= parameters so the entire deletion scope is
  handled in one call, avoiding unnecessary reconfiguration of findings that
  are about to be deleted anyway. Uses subqueries instead of materializing ID
  sets, and chunks the originals loop with prefetch to bound memory. Reverts
  finding_delete to use ORM .delete() for single-finding cascade deletes.

* refactor: remove ASYNC_DELETE_MAPPING, use FINDING_SCOPE_FILTERS
  Replace the model_list-based mapping with a simple scope filter dict.
  prepare_duplicates_for_delete now accepts a single object and derives the
  scope via FINDING_SCOPE_FILTERS. Removes the redundant non-Finding model
  deletion loop (cascade_delete on the top-level object handles all remaining
  children) and cleans up the async_delete class.
* fix: resolve ruff lint violations in helper and tests

* remove obsolete test

* perf: add bulk_delete_findings, fix CASCADE_DELETE and scope expansion
  - Add a bulk_delete_findings() wrapper: M2M cleanup + chunked cascade_delete
  - reconfigure_duplicate_cluster: return early when CASCADE_DELETE=True
    instead of calling Django .delete(), which fires signals per finding
  - finding_delete: use bulk_delete_findings when CASCADE_DELETE=True
  - async_delete_crawl_task: expand the scope to include outside-scope
    duplicates, and use bulk_delete_findings instead of manual M2M +
    cascade_delete calls
  - Fix a test to use the async_delete class instead of a direct task import

* fix: handle M2M and tag cleanup in cascade_delete
  Adds generic M2M through-table cleanup to cascade_delete so tags and other
  M2M relations are cleared before row deletion. Introduces
  bulk_remove_all_tags in tag_utils to properly decrement tagulous tag counts
  during bulk deletion. Adds a test for product deletion with tagged objects.

* refactor: auto-discover TagFields in bulk_remove_all_tags
  Instead of hardcoding field names, iterate over all fields on the model and
  select those with tag_options. This avoids unexpected side effects when
  callers pass a specific tag_field_name parameter.

* perf: address PR review feedback for large-scale delete safety
  - Stream finding IDs via iterator() + batching instead of materializing the
    full ID list into memory. Prevents OOM on 4.5M+ finding deletes.
  - Add SET LOCAL statement_timeout (300s) and deadlock error logging to
    cascade_delete SQL execution. Prevents runaway queries from holding locks
    indefinitely and surfaces deadlock errors in logs.
  - Reuse the scope_ids subquery variable and replace .exists() + .count()
    with a single .count() call to avoid evaluating the subquery twice.
  - Add a comment explaining why FileUpload uses per-object ORM delete (its
    custom delete() removes files from disk; file attachments are rare).
  - Scope fix_loop_duplicates to the deletion set instead of scanning the full
    findings table. The double self-join is cheap when filtered to only the
    findings in the scope being deleted.
  - Document that pre_bulk_delete_findings signal receivers must not
    materialize the full queryset (use .filter()/.iterator() instead).
  - Add a skip_m2m_for parameter to cascade_delete so bulk_delete_findings can
    tell it that Finding M2M was already cleaned by bulk_clear_finding_m2m,
    avoiding redundant tag count aggregation queries.

* refactor: rename cascade_delete to cascade_delete_related_objects
  The function now only deletes related objects, not the root record. This
  allows async_delete_task to call obj.delete() on the top-level object via
  the ORM, which fires Django signals (post_delete notifications, pghistory
  audit, Pro signals like product_post_delete). bulk_delete_findings uses
  execute_delete_sql to delete the finding rows themselves after
  cascade_delete_related_objects cleans up the children.

* Update dojo/settings/settings.dist.py
  Co-authored-by: Cody Maffucci <46459665+Maffooch@users.noreply.github.com>

---------

Co-authored-by: Cody Maffucci <46459665+Maffooch@users.noreply.github.com>
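The bottom-up walk over model._meta.related_objects described in the commit message can be illustrated with a minimal pure-Python sketch. The names `related_children` and `walk_delete_order` are hypothetical stand-ins for illustration only; the real walker operates on Django model metadata and compiles QuerySets to raw SQL rather than working with model names.

```python
# Minimal sketch of a bottom-up cascade walk over FK relations.
# `related_children` stands in for Django's model._meta.related_objects:
# it maps a model name to the models holding a FK pointing at it.
related_children = {
    "Product": ["Engagement"],
    "Engagement": ["Test"],
    "Test": ["Finding"],
    "Finding": ["Note", "FileUpload"],
    "Note": [],
    "FileUpload": [],
}

def walk_delete_order(model, seen=None):
    """Return models in child-first order so DELETEs never violate FKs."""
    if seen is None:
        seen = set()
    if model in seen:
        return []
    seen.add(model)
    order = []
    for child in related_children[model]:
        order.extend(walk_delete_order(child, seen))
    order.append(model)  # delete the parent only after all of its children
    return order

print(walk_delete_order("Product"))
# → ['Note', 'FileUpload', 'Finding', 'Test', 'Engagement', 'Product']
```

Because the order is discovered at runtime from the relation graph, any model a plugin adds with a FK into this graph is picked up automatically, which is the property the commit relies on.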
1 parent f0f0f7f commit d61da2c

9 files changed: 975 additions & 245 deletions

File tree

dojo/finding/helper.py

Lines changed: 199 additions & 91 deletions
Large diffs are not rendered by default.
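Since the helper.py diff is not rendered here, the chunked ID streaming it implements ("Stream finding IDs via iterator()+batched" in the commit message) can be sketched standalone. `batched_ids` is a hypothetical name; in the real code the IDs would come from something like `queryset.values_list("id", flat=True).iterator()`.

```python
from itertools import islice

def batched_ids(id_iter, batch_size):
    """Yield lists of at most batch_size IDs from a streaming iterator.

    Only one batch is held in memory at a time, so a 4.5M-row ID stream
    never materializes as a single Python list.
    """
    it = iter(id_iter)
    while batch := list(islice(it, batch_size)):
        yield batch

print(list(batched_ids(range(1, 11), 4)))
# → [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10]]
```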

dojo/models.py

Lines changed: 1 addition & 1 deletion
@@ -1647,7 +1647,7 @@ def is_ci_cd(self):
     def delete(self, *args, **kwargs):
         logger.debug("%d engagement delete", self.id)
         from dojo.finding import helper as finding_helper  # noqa: PLC0415 circular import
-        finding_helper.prepare_duplicates_for_delete(engagement=self)
+        finding_helper.prepare_duplicates_for_delete(self)
         super().delete(*args, **kwargs)
         with suppress(Engagement.DoesNotExist, Product.DoesNotExist):
             # Suppressing a potential issue created from async delete removing
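prepare_duplicates_for_delete exists because findings reference each other through the duplicate_finding FK, and that same FK is why the commit deletes outside-scope duplicates before the main scope. A toy simulation of the constraint (all names and data hypothetical, for illustration only):

```python
# Each finding may point at an "original" via a duplicate_finding FK;
# deleting the original while a duplicate still references it would
# violate the constraint in the database.
findings = {
    1: None,   # original, inside the deletion scope
    2: 1,      # duplicate of 1, inside the scope
    9: 1,      # duplicate of 1, OUTSIDE the scope (e.g. another engagement)
}
scope = {1, 2}

def delete_batch(ids, table):
    """Delete rows, refusing if a surviving row still references a deleted one."""
    survivors = {k: v for k, v in table.items() if k not in ids}
    if any(v in ids for v in survivors.values() if v is not None):
        raise RuntimeError("FK violation: surviving duplicate references deleted original")
    return survivors

# Wrong order: deleting the scope first leaves finding 9 pointing at 1.
try:
    delete_batch(scope, findings)
except RuntimeError:
    pass  # exactly the failure mode the fix avoids

# Fixed order: outside-scope duplicates first, then the main scope.
outside = {fid for fid, orig in findings.items() if orig in scope and fid not in scope}
remaining = delete_batch(outside, findings)
remaining = delete_batch(scope, remaining)
assert remaining == {}
```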

dojo/settings/settings.dist.py

Lines changed: 1 addition & 1 deletion
@@ -302,7 +302,7 @@
     # Initial behaviour in Defect Dojo was to delete all duplicates when an original was deleted
     # New behaviour is to leave the duplicates in place, but set the oldest of duplicates as new original
     # Set to True to revert to the old behaviour where all duplicates are deleted
-    DD_DUPLICATE_CLUSTER_CASCADE_DELETE=(str, False),
+    DD_DUPLICATE_CLUSTER_CASCADE_DELETE=(bool, True),
     # Enable Rate Limiting for the login page
     DD_RATE_LIMITER_ENABLED=(bool, False),
     # Examples include 5/m 100/h and more https://django-ratelimit.readthedocs.io/en/stable/rates.html#simple-rates

dojo/signals.py

Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
+from django.dispatch import Signal
+
+# Sent before bulk-deleting findings via cascade_delete.
+# Receivers can dispatch integrator notifications, collect metrics, etc.
+# Provides: finding_qs (QuerySet of findings about to be deleted)
+#
+# IMPORTANT: The queryset may contain millions of rows. Receivers MUST NOT
+# call list(), len(), or otherwise materialize the full queryset into memory.
+# Use .filter(), .iterator(), or aggregation queries instead.
+pre_bulk_delete_findings = Signal()
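A receiver honoring the no-materialization contract might look like the following sketch. The receiver name is hypothetical, and `FakeQuerySet` is a stand-in so the example runs without Django; a real receiver would be connected with `pre_bulk_delete_findings.connect(on_bulk_delete)` and receive an actual QuerySet.

```python
class FakeQuerySet:
    """Stand-in for a Django QuerySet so the sketch runs standalone."""
    def __init__(self, rows):
        self._rows = rows
    def count(self):
        return len(self._rows)   # SELECT COUNT(*) in the real ORM
    def iterator(self, chunk_size=2000):
        yield from self._rows    # server-side cursor in the real ORM

def on_bulk_delete(sender, finding_qs, **kwargs):
    # OK: aggregate in the database instead of materializing rows.
    total = finding_qs.count()
    # OK: stream a bounded sample if per-row work is unavoidable.
    sample = [f for _, f in zip(range(3), finding_qs.iterator())]
    return total, sample
    # NOT OK: list(finding_qs) or len(finding_qs) -- both load every row.

total, sample = on_bulk_delete(None, FakeQuerySet(list(range(10))))
# → total == 10, sample == [0, 1, 2]
```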

dojo/tag_utils.py

Lines changed: 66 additions & 1 deletion
@@ -1,5 +1,6 @@
 from __future__ import annotations

+import logging
 from collections.abc import Iterable

 from django.conf import settings
@@ -8,6 +9,8 @@

 from dojo.models import Product  # local import to avoid circulars at import time

+logger = logging.getLogger(__name__)
+

 def bulk_add_tags_to_instances(tag_or_tags, instances, tag_field_name: str = "tags", batch_size: int | None = None) -> int:
     """
@@ -161,4 +164,66 @@ def bulk_add_tags_to_instances(tag_or_tags, instances, tag_field_name: str = "ta
     return total_created


-__all__ = ["bulk_add_tags_to_instances"]
+def bulk_remove_all_tags(model_class, instance_ids_qs):
+    """
+    Remove all tags from instances identified by the given ID subquery.
+
+    Auto-discovers all TagFields on the model, decrements tag counts correctly,
+    and deletes through-table rows.
+    Accepts a QuerySet of IDs (as a subquery) to avoid materializing large ID lists.
+
+    Args:
+        model_class: The model class (e.g. Finding, Product).
+        instance_ids_qs: A QuerySet producing instance PKs (used as subquery).
+
+    """
+    tag_fields = [
+        field for field in model_class._meta.get_fields()
+        if hasattr(field, "tag_options")
+    ]
+
+    for tag_field in tag_fields:
+        tag_model = tag_field.related_model
+        through_model = tag_field.remote_field.through
+
+        # Find the FK columns that point to the source model and the tag model
+        source_field_name = None
+        target_field_name = None
+        for field in through_model._meta.get_fields():
+            if hasattr(field, "remote_field") and field.remote_field:
+                if field.remote_field.model == model_class:
+                    source_field_name = field.name
+                elif field.remote_field.model == tag_model:
+                    target_field_name = field.name
+
+        if not source_field_name or not target_field_name:
+            continue
+
+        # Get affected tag IDs and their counts before deletion
+        affected_tags = (
+            through_model.objects.filter(**{f"{source_field_name}__in": instance_ids_qs})
+            .values(target_field_name)
+            .annotate(num=models.Count("id"))
+        )
+
+        # Decrement tag counts. Tag counts are not used in DefectDojo, but we
+        # maintain them to avoid breaking tagulous's internal bookkeeping.
+        for entry in affected_tags:
+            tag_model.objects.filter(pk=entry[target_field_name]).update(
+                count=models.F("count") - entry["num"],
+            )
+
+        # Delete through-table rows
+        count, _ = through_model.objects.filter(
+            **{f"{source_field_name}__in": instance_ids_qs},
+        ).delete()
+
+        if count:
+            logger.debug(
+                "bulk_remove_all_tags: removed %d %s.%s through-table rows",
+                count, model_class.__name__, tag_field.name,
+            )
+
+
+__all__ = ["bulk_add_tags_to_instances", "bulk_remove_all_tags"]
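The count-maintenance step in bulk_remove_all_tags boils down to grouping the through-table rows being deleted by tag and decrementing each tag's count by its group size. A standalone sketch of that bookkeeping, with hypothetical stand-in data in place of the ORM aggregation:

```python
from collections import Counter

# Stand-ins: through-table rows as (instance_id, tag_id) pairs, plus
# tagulous-style usage counts per tag.
through_rows = [(1, "red"), (2, "red"), (3, "blue"), (4, "red"), (5, "blue")]
tag_counts = {"red": 5, "blue": 3}

deleted_instance_ids = {1, 2, 3}

# Group the rows being deleted by tag, like the
# .values(...).annotate(Count("id")) aggregation in bulk_remove_all_tags.
decrements = Counter(tag for inst, tag in through_rows if inst in deleted_instance_ids)

for tag, num in decrements.items():
    tag_counts[tag] -= num   # UPDATE ... SET count = count - num

print(tag_counts)
# → {'red': 3, 'blue': 2}
```

Doing the decrement from an aggregate keeps it at one UPDATE per distinct tag rather than one per through-row, which matters at bulk-delete scale.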
