
Commit 698ece0

valentijnscholten (Valentijn Scholten) authored
reimport: match findings in batches (#13889)

* fix logger NoneType
* reimport: use batch matching
* tests: always show all mismatched counts
* reimport: add test for internal duplicates during matching
* reimport: fix handling of duplicates inside the same report/batch
* add id
* reimport: optimize vulnerability_id processing
* vulnerability_id processing: remove duplicate delete, fix bug, add tests
* vulnerability_id processing: separate import/reimport
* reimport: add management command to reimport sample scans
* reimport: prefetch endpoint status
* reimport: bulk update endpoint statuses
* add script to update performance test counts easily
* add script to update performance test counts easily
* 8a57c38
* add upgrade notes
* add upgrade notes
* remove obsolete method
* remove obsolete method
* reimport: extract method to get candidates
* reimport: remove fallback to non-batch
* reimport: prep for Pro overrides
* add comment
* add comment
* update counts and script
* update counts and script

Co-authored-by: Valentijn Scholten <valentijn.scholten@iodigital.com>

1 parent a1478fb commit 698ece0

17 files changed

Lines changed: 2514 additions & 317 deletions

docs/content/en/open_source/upgrading/2.54.md

Lines changed: 15 additions & 3 deletions
@@ -1,8 +1,8 @@
 ---
 title: 'Upgrading to DefectDojo Version 2.54.x'
 toc_hide: true
-weight: -20251201
-description: Removal of django-auditlog and exclusive use of django-pghistory for audit logging & Dropped support for DD_PARSER_EXCLUDE
+weight: -20250804
+description: Removal of django-auditlog & Dropped support for DD_PARSER_EXCLUDE & Reimport performance improvements
 ---
 
 ## Breaking Change: Removal of django-auditlog
@@ -44,4 +44,16 @@ The backfill migration is not mandatory to succeed. If it fails for some reason,
 To simplify the management of the DefectDojo application, parser exclusions are no longer controlled via the environment variable DD_PARSER_EXCLUDE or application settings. This variable is now unsupported.
 From now on, you should use the active flag in the Test_Type model to enable or disable parsers. Only parsers associated with active Test_Type entries will be available for use.
 
-Check the [Release Notes](https://github.com/DefectDojo/django-DefectDojo/releases/tag/2.54.0) for the contents of the release.
+## Import/reimport performance improvements
+
+DefectDojo 2.54.x includes performance improvements for reimporting scan results, especially for large scans:
+
+- **Faster reimports** due to fewer database queries and more bulk operations.
+- **Reduced database load** during reimport matching and post-processing (helps avoid slowdowns/timeouts under heavy scan volume).
+- **More efficient endpoint status updates** during reimport of dynamic findings.
+- **Less churn when updating vulnerability IDs**, avoiding unnecessary deletes/writes when nothing changed.
+
+No action is required after upgrading. (Optional tuning knobs exist via `DD_IMPORT_REIMPORT_MATCH_BATCH_SIZE` and `DD_IMPORT_REIMPORT_DEDUPE_BATCH_SIZE`.)
+
+There are other instructions for upgrading to 2.54.x. Check the Release Notes for the contents of the release: `https://github.com/DefectDojo/django-DefectDojo/releases/tag/2.54.0`
+Check the [Release Notes](https://github.com/DefectDojo/django-DefectDojo/releases/tag/2.54.0) for the contents of the release.
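The "fewer database queries" claim above comes from matching findings in batches rather than one at a time. A minimal pure-Python sketch of the idea (hypothetical names, not DefectDojo's actual API): instead of one candidate-lookup query per finding, the hashes of a whole batch are sent in a single `IN (...)`-style query.

```python
# Illustrative sketch only: batch matching turns one-query-per-finding
# into one-query-per-batch. FakeDB stands in for the ORM.

def chunked(items, batch_size):
    """Yield successive batches of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

class FakeDB:
    """Dict-backed stand-in for the database; counts issued queries."""
    def __init__(self, rows):
        self.rows = rows          # mapping hash_code -> finding record
        self.queries = 0

    def findings_with_hash_in(self, hash_codes):
        self.queries += 1         # one query per call, regardless of batch size
        return {h: self.rows[h] for h in hash_codes if h in self.rows}

def match_in_batches(db, new_hashes, batch_size=100):
    matched = {}
    for batch in chunked(new_hashes, batch_size):
        matched.update(db.findings_with_hash_in(batch))
    return matched

# 125 existing findings (even-numbered hashes), 250 incoming hashes:
db = FakeDB({f"h{i}": f"finding-{i}" for i in range(0, 250, 2)})
result = match_in_batches(db, [f"h{i}" for i in range(250)], batch_size=100)
print(db.queries)  # 3 batches of up to 100 hashes -> 3 queries instead of 250
```

With per-finding matching this workload would issue 250 queries; batching at size 100 issues 3, which is the shape of saving the batch-size knobs tune.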

dojo/finding/deduplication.py

Lines changed: 148 additions & 32 deletions
@@ -232,59 +232,123 @@ def are_endpoints_duplicates(new_finding, to_duplicate_finding):
     return False
 
 
-def build_dedupe_scope_queryset(test):
-    scope_on_engagement = test.engagement.deduplication_on_engagement
-    if scope_on_engagement:
-        scope_q = Q(test__engagement=test.engagement)
-    else:
-        # Product scope limited to current product, but exclude engagements that opted into engagement-scoped dedupe
-        scope_q = Q(test__engagement__product=test.engagement.product) & (
-            Q(test__engagement=test.engagement)
-            | Q(test__engagement__deduplication_on_engagement=False)
-        )
+def build_candidate_scope_queryset(test, mode="deduplication", service=None):
+    """
+    Build a queryset for candidate finding.
+
+    Args:
+        test: The test to scope from
+        mode: "deduplication" (can match across tests) or "reimport" (same test only)
+        service: Optional service filter (for deduplication mode, not used for reimport since service is in hash)
+
+    """
+    if mode == "reimport":
+        # For reimport, only filter by test. Service filtering is not needed because
+        # service is included in hash_code calculation (HASH_CODE_FIELDS_ALWAYS = ["service"]),
+        # so matching by hash_code automatically ensures correct service match.
+        queryset = Finding.objects.filter(test=test)
+    else:  # deduplication mode
+        scope_on_engagement = test.engagement.deduplication_on_engagement
+        if scope_on_engagement:
+            scope_q = Q(test__engagement=test.engagement)
+        else:
+            # Product scope limited to current product, but exclude engagements that opted into engagement-scoped dedupe
+            scope_q = Q(test__engagement__product=test.engagement.product) & (
+                Q(test__engagement=test.engagement)
+                | Q(test__engagement__deduplication_on_engagement=False)
+            )
+        queryset = Finding.objects.filter(scope_q)
+
+    # Base prefetches for both modes
+    prefetch_list = ["endpoints", "vulnerability_id_set", "found_by"]
+
+    # Additional prefetches for reimport mode
+    if mode == "reimport":
+        prefetch_list.extend([
+            "status_finding",
+            "status_finding__endpoint",
+        ])
 
     return (
-        Finding.objects.filter(scope_q)
+        queryset
         .select_related("test", "test__engagement", "test__test_type")
-        .prefetch_related("endpoints", "found_by")
+        .prefetch_related(*prefetch_list)
     )
 
 
-def find_candidates_for_deduplication_hash(test, findings):
-    base_queryset = build_dedupe_scope_queryset(test)
+def find_candidates_for_deduplication_hash(test, findings, mode="deduplication", service=None):
+    """
+    Find candidates by hash_code. Works for both deduplication and reimport.
+
+    Args:
+        test: The test to scope from
+        findings: List of findings to find candidates for
+        mode: "deduplication" or "reimport"
+        service: Optional service filter (for deduplication mode, not used for reimport since service is in hash)
+
+    """
+    base_queryset = build_candidate_scope_queryset(test, mode=mode, service=service)
     hash_codes = {f.hash_code for f in findings if getattr(f, "hash_code", None) is not None}
     if not hash_codes:
         return {}
-    existing_qs = (
-        base_queryset.filter(hash_code__in=hash_codes)
-        .exclude(hash_code=None)
-        .exclude(duplicate=True)
-        .order_by("id")
-    )
+
+    existing_qs = base_queryset.filter(hash_code__in=hash_codes).exclude(hash_code=None)
+    if mode == "deduplication":
+        existing_qs = existing_qs.exclude(duplicate=True)
+    existing_qs = existing_qs.order_by("id")
+
     existing_by_hash = {}
     for ef in existing_qs:
         existing_by_hash.setdefault(ef.hash_code, []).append(ef)
-    deduplicationLogger.debug(f"Found {len(existing_by_hash)} existing findings by hash codes")
+
+    log_msg = "for reimport" if mode == "reimport" else ""
+    deduplicationLogger.debug(f"Found {len(existing_by_hash)} existing findings by hash codes {log_msg}")
     return existing_by_hash
 
 
-def find_candidates_for_deduplication_unique_id(test, findings):
-    base_queryset = build_dedupe_scope_queryset(test)
+def find_candidates_for_deduplication_unique_id(test, findings, mode="deduplication", service=None):
+    """
+    Find candidates by unique_id_from_tool. Works for both deduplication and reimport.
+
+    Args:
+        test: The test to scope from
+        findings: List of findings to find candidates for
+        mode: "deduplication" or "reimport"
+        service: Optional service filter (for deduplication mode, not used for reimport since service is in hash)
+
+    """
+    base_queryset = build_candidate_scope_queryset(test, mode=mode, service=service)
     unique_ids = {f.unique_id_from_tool for f in findings if getattr(f, "unique_id_from_tool", None) is not None}
     if not unique_ids:
         return {}
-    existing_qs = base_queryset.filter(unique_id_from_tool__in=unique_ids).exclude(unique_id_from_tool=None).exclude(duplicate=True).order_by("id")
+
+    existing_qs = base_queryset.filter(unique_id_from_tool__in=unique_ids).exclude(unique_id_from_tool=None)
+    if mode == "deduplication":
+        existing_qs = existing_qs.exclude(duplicate=True)
     # unique_id_from_tool can only apply to the same test_type because it is parser dependent
-    existing_qs = existing_qs.filter(test__test_type=test.test_type)
+    existing_qs = existing_qs.filter(test__test_type=test.test_type).order_by("id")
+
     existing_by_uid = {}
     for ef in existing_qs:
         existing_by_uid.setdefault(ef.unique_id_from_tool, []).append(ef)
-    deduplicationLogger.debug(f"Found {len(existing_by_uid)} existing findings by unique IDs")
+
+    log_msg = "for reimport" if mode == "reimport" else ""
+    deduplicationLogger.debug(f"Found {len(existing_by_uid)} existing findings by unique IDs {log_msg}")
    return existing_by_uid
 
 
-def find_candidates_for_deduplication_uid_or_hash(test, findings):
-    base_queryset = build_dedupe_scope_queryset(test)
+def find_candidates_for_deduplication_uid_or_hash(test, findings, mode="deduplication", service=None):
+    """
+    Find candidates by unique_id_from_tool or hash_code. Works for both deduplication and reimport.
+
+    Args:
+        test: The test to scope from
+        findings: List of findings to find candidates for
+        mode: "deduplication" or "reimport"
+        service: Optional service filter (for deduplication mode, not used for reimport since service is in hash)
+
+    """
+    base_queryset = build_candidate_scope_queryset(test, mode=mode, service=service)
     hash_codes = {f.hash_code for f in findings if getattr(f, "hash_code", None) is not None}
     unique_ids = {f.unique_id_from_tool for f in findings if getattr(f, "unique_id_from_tool", None) is not None}
     if not hash_codes and not unique_ids:
@@ -298,7 +362,11 @@ def find_candidates_for_deduplication_uid_or_hash(test, findings):
         uid_q = Q(unique_id_from_tool__isnull=False, unique_id_from_tool__in=unique_ids) & Q(test__test_type=test.test_type)
         cond |= uid_q
 
-    existing_qs = base_queryset.filter(cond).exclude(duplicate=True).order_by("id")
+    existing_qs = base_queryset.filter(cond)
+    if mode == "deduplication":
+        # reimport matching will match against duplicates, import/deduplication doesn't.
+        existing_qs = existing_qs.exclude(duplicate=True)
+    existing_qs = existing_qs.order_by("id")
 
     existing_by_hash = {}
     existing_by_uid = {}
@@ -307,13 +375,15 @@ def find_candidates_for_deduplication_uid_or_hash(test, findings):
         existing_by_hash.setdefault(ef.hash_code, []).append(ef)
         if ef.unique_id_from_tool is not None:
             existing_by_uid.setdefault(ef.unique_id_from_tool, []).append(ef)
-    deduplicationLogger.debug(f"Found {len(existing_by_uid)} existing findings by unique IDs")
-    deduplicationLogger.debug(f"Found {len(existing_by_hash)} existing findings by hash codes")
+
+    log_msg = "for reimport" if mode == "reimport" else ""
+    deduplicationLogger.debug(f"Found {len(existing_by_uid)} existing findings by unique IDs {log_msg}")
+    deduplicationLogger.debug(f"Found {len(existing_by_hash)} existing findings by hash codes {log_msg}")
     return existing_by_uid, existing_by_hash
 
 
 def find_candidates_for_deduplication_legacy(test, findings):
-    base_queryset = build_dedupe_scope_queryset(test)
+    base_queryset = build_candidate_scope_queryset(test, mode="deduplication")
     titles = {f.title for f in findings if getattr(f, "title", None)}
     cwes = {f.cwe for f in findings if getattr(f, "cwe", 0)}
     cwes.discard(0)
@@ -335,6 +405,52 @@
     return by_title, by_cwe
 
 
+# TODO: should we align this with deduplication?
+def find_candidates_for_reimport_legacy(test, findings, service=None):
+    """
+    Find all existing findings in the test that match any of the given findings by title and severity.
+    Used for batch reimport to avoid 1+N query problem.
+    Legacy reimport matches by title (case-insensitive), severity, and numerical_severity.
+    Note: This function is kept separate because legacy reimport has fundamentally different matching logic
+    than legacy deduplication (title+severity vs title+CWE).
+    Note: service parameter is kept for backward compatibility but not used since service is in hash_code.
+    """
+    base_queryset = build_candidate_scope_queryset(test, mode="reimport", service=None)
+
+    # Collect all unique title/severity combinations
+    title_severity_pairs = set()
+    for finding in findings:
+        if finding.title:
+            title_severity_pairs.add((
+                finding.title.lower(),  # Case-insensitive matching
+                finding.severity,
+                Finding.get_numerical_severity(finding.severity),
+            ))
+
+    if not title_severity_pairs:
+        return {}
+
+    # Build query to find all matching findings
+    conditions = Q()
+    for title_lower, severity, numerical_severity in title_severity_pairs:
+        conditions |= (
+            Q(title__iexact=title_lower) &
+            Q(severity=severity) &
+            Q(numerical_severity=numerical_severity)
+        )
+
+    existing_qs = base_queryset.filter(conditions).order_by("id")
+
+    # Build dictionary keyed by (title_lower, severity) for quick lookup
+    existing_by_key = {}
+    for ef in existing_qs:
+        key = (ef.title.lower(), ef.severity)
+        existing_by_key.setdefault(key, []).append(ef)
+
+    deduplicationLogger.debug(f"Found {sum(len(v) for v in existing_by_key.values())} existing findings by legacy matching for reimport")
+    return existing_by_key
+
+
 def _is_candidate_older(new_finding, candidate):
     # Ensure the newer finding is marked as duplicate of the older finding
     is_older = candidate.id < new_finding.id
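The candidate lookups above all share one pattern: fetch every potential match in a single query, then bucket the results into a dict of lists keyed by `hash_code` (or `unique_id_from_tool`) so each new finding resolves its candidates with an O(1) lookup. A standalone sketch of that grouping step, using plain dicts in place of `Finding` models:

```python
# Minimal sketch of the setdefault-bucketing pattern from the diff above.
# Records are plain dicts standing in for Finding model instances.

def group_by_hash(existing_findings):
    """Bucket pre-fetched candidate findings by hash_code."""
    existing_by_hash = {}
    for ef in existing_findings:
        existing_by_hash.setdefault(ef["hash_code"], []).append(ef["id"])
    return existing_by_hash

rows = [
    {"id": 1, "hash_code": "abc"},
    {"id": 2, "hash_code": "abc"},   # duplicate hash within the same batch
    {"id": 3, "hash_code": "def"},
]
print(group_by_hash(rows))  # {'abc': [1, 2], 'def': [3]}
```

Because the rows are fetched with `order_by("id")`, each bucket's list is id-ordered, which is what lets later code deterministically pick the oldest candidate when several findings share a hash.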

dojo/finding/helper.py

Lines changed: 6 additions & 3 deletions
@@ -765,12 +765,15 @@ def add_endpoints(new_finding, form):
             endpoint=endpoint, defaults={"date": form.cleaned_data["date"] or timezone.now()})
 
 
-def save_vulnerability_ids(finding, vulnerability_ids):
+def save_vulnerability_ids(finding, vulnerability_ids, *, delete_existing: bool = True):
     # Remove duplicates
     vulnerability_ids = list(dict.fromkeys(vulnerability_ids))
 
-    # Remove old vulnerability ids
-    Vulnerability_Id.objects.filter(finding=finding).delete()
+    # Remove old vulnerability ids if requested
+    # Callers can set delete_existing=False when they know there are no existing IDs
+    # to avoid an unnecessary delete query (e.g., for new findings)
+    if delete_existing:
+        Vulnerability_Id.objects.filter(finding=finding).delete()
 
     # Save new vulnerability ids
     # Using bulk create throws Django 50 warnings about unsaved models...
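The `dict.fromkeys` line in the diff above is the standard Python idiom for removing duplicates while preserving first-seen order, which a plain `set()` would not guarantee. A quick illustration:

```python
# dict.fromkeys keeps insertion order (guaranteed since Python 3.7),
# so repeated vulnerability IDs are dropped without reshuffling the rest.
vulnerability_ids = ["CVE-2024-0001", "CVE-2024-0002", "CVE-2024-0001"]
deduped = list(dict.fromkeys(vulnerability_ids))
print(deduped)  # ['CVE-2024-0001', 'CVE-2024-0002']
```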

dojo/importers/base_importer.py

Lines changed: 12 additions & 10 deletions
@@ -32,7 +32,6 @@
     Test_Import,
     Test_Import_Finding_Action,
     Test_Type,
-    Vulnerability_Id,
 )
 from dojo.notifications.helper import create_notification
 from dojo.tag_utils import bulk_add_tags_to_instances
@@ -278,6 +277,7 @@ def determine_process_method(
     def determine_deduplication_algorithm(self) -> str:
         """
         Determines what dedupe algorithm to use for the Test being processed.
+        Overridden in Pro.
         :return: A string representing the dedupe algorithm to use.
         """
         return self.test.deduplication_algorithm
@@ -793,21 +793,23 @@ def process_cve(
 
         return finding
 
-    def process_vulnerability_ids(
+    def store_vulnerability_ids(
         self,
         finding: Finding,
     ) -> Finding:
         """
-        Parse the `unsaved_vulnerability_ids` field from findings after they are parsed
-        to create `Vulnerability_Id` objects with the finding associated correctly
-        """
-        if finding.unsaved_vulnerability_ids:
-            # Remove old vulnerability ids - keeping this call only because of flake8
-            Vulnerability_Id.objects.filter(finding=finding).delete()
+        Store vulnerability IDs for a finding.
+        Reads from finding.unsaved_vulnerability_ids and saves them overwriting existing ones.
+
+        Args:
+            finding: The finding to store vulnerability IDs for
 
-            # user the helper function
-            finding_helper.save_vulnerability_ids(finding, finding.unsaved_vulnerability_ids)
+        Returns:
+            The finding object
 
+        """
+        vulnerability_ids_to_process = finding.unsaved_vulnerability_ids or []
+        finding_helper.save_vulnerability_ids(finding, vulnerability_ids_to_process, delete_existing=False)
         return finding
 
     def process_files(
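The hunk above wires the importer to the new `delete_existing=False` path: a finding that was just created has no stored vulnerability IDs, so the DELETE query can be skipped. A pure-Python sketch of that idea (hypothetical names; a dict stands in for the `Vulnerability_Id` table):

```python
# Hedged sketch, not DefectDojo code: skipping the delete for brand-new
# findings saves one query per finding without changing the end state.

class FakeVulnerabilityIdTable:
    """Dict-backed stand-in for the Vulnerability_Id table."""
    def __init__(self):
        self.rows = {}            # finding_id -> list of vulnerability ids
        self.delete_queries = 0   # how many DELETEs the ORM would issue

    def save_vulnerability_ids(self, finding_id, ids, *, delete_existing=True):
        ids = list(dict.fromkeys(ids))      # drop repeats, keep order
        if delete_existing:
            self.delete_queries += 1        # would be a DELETE query in the ORM
            self.rows.pop(finding_id, None)
        self.rows[finding_id] = ids

table = FakeVulnerabilityIdTable()
# New finding: nothing stored yet, so the caller opts out of the delete.
table.save_vulnerability_ids(42, ["CVE-2024-1", "CVE-2024-1"], delete_existing=False)
print(table.rows[42], table.delete_queries)  # ['CVE-2024-1'] 0
```

On reimport of an existing finding the default `delete_existing=True` still clears stale IDs first, so only the new-finding path saves the query.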

dojo/importers/default_importer.py

Lines changed: 1 addition & 1 deletion
@@ -234,7 +234,7 @@ def process_findings(
             # Process any files
             self.process_files(finding)
             # Process vulnerability IDs
-            finding = self.process_vulnerability_ids(finding)
+            finding = self.store_vulnerability_ids(finding)
             # Categorize this finding as a new one
             new_findings.append(finding)
             # all data is already saved on the finding, we only need to trigger post processing in batches
