Commit 674ffb6

perf(fp-history): batch false positive history processing (#14449)
* perf(fp-history): batch false positive history processing

  Replaces the N+1 query pattern in false positive history with a single product-scoped DB query per batch, and switches per-finding save() calls to QuerySet.update() to eliminate redundant signal overhead.

  Changes:
  - Extract _fp_candidates_qs() as the single algorithm-dispatch helper shared by both single-finding and batch lookup paths
  - Add do_false_positive_history_batch(), which fetches all FP candidates in one query and marks findings with a single UPDATE
  - do_false_positive_history() now delegates to the batch function
  - post_process_findings_batch (import/reimport) calls the batch function instead of a per-finding loop
  - _bulk_update_finding_status_and_severity (bulk edit) groups findings by (product, dedup_alg) and calls the batch function once per group; retroactive reactivation is batched the same way
  - Fix a dead-code bug in process_false_positive_history: the condition finding.false_p and not finding.false_p was always False because form.save(commit=False) mutates the finding in place; fixed by capturing old_false_p before the form save
  - Replace all per-finding save()/save_no_options() calls in FP history paths with QuerySet.update() (bypasses signals identically to the old calls)
  - Move all FP history helpers from dojo/utils.py to dojo/finding/deduplication.py, alongside the matching dedupe helpers

  All update() calls carry a comment explaining the signal-bypass equivalence with the previous save(skip_validation=True) calls.

  Adds 4 unit tests covering: batch single-query behaviour, retroactive batch FP marking, retroactive reactivation (previously dead code), and the no-reactivation guard.

* perf(fp-history): add .only() to candidate fetch, fix update() comments

  Limit _fetch_fp_candidates_for_batch to only the fields actually read from candidate objects (id, false_p, active, hash_code, unique_id_from_tool, title, severity), avoiding loading unused columns.

  Correct the update() comments to clarify that .only() does not constrain QuerySet.update() — Django generates the UPDATE SQL independently — so the sync requirement applies only to fields *read* from candidate objects.

* test(fp-history): assert exact query count in batch tests

  assertNumQueries(7) on both batch tests covers: System_Settings, the 4-query lazy-load chain (test/engagement/product/test_type from findings[0]), the candidates SELECT with .only(), and the bulk UPDATE — fixed regardless of batch size or the number of retroactively marked findings.

* test(fp-history): assert query count stays flat with N affected findings

  New test creates 5 pre-existing findings and asserts the batch still uses exactly 7 queries — proving the old O(N) per-finding save loop is gone and a single bulk UPDATE covers all affected rows.
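The dead-code fix in the commit message is easiest to see in isolation: Django's `form.save(commit=False)` mutates the bound instance in place, so reading the flag afterwards compares the new value with itself. A minimal, framework-free sketch (toy `Finding`/`FindingForm` classes, not DefectDojo code) of why the capture order matters:

```python
class Finding:
    def __init__(self, false_p):
        self.false_p = false_p


class FindingForm:
    """Toy stand-in for a Django ModelForm bound to `instance`:
    save(commit=False) writes the cleaned data onto the instance in place."""

    def __init__(self, instance, data):
        self.instance = instance
        self.data = data

    def save(self, *, commit=True):
        self.instance.false_p = self.data["false_p"]
        return self.instance


# Buggy order: by the time we compare, the form save has already
# overwritten finding.false_p, so "was unmarked as FP" can never be True.
finding = Finding(false_p=True)
form = FindingForm(finding, {"false_p": False})
new_finding = form.save(commit=False)
buggy_was_unmarked = finding.false_p and not new_finding.false_p  # always False

# Fixed order: capture the old value before the form save.
finding = Finding(false_p=True)
form = FindingForm(finding, {"false_p": False})
old_false_p = finding.false_p
new_finding = form.save(commit=False)
fixed_was_unmarked = old_false_p and not new_finding.false_p
```

With the buggy order the condition reads the already-mutated object on both sides; with `old_false_p` captured first, an FP-to-not-FP transition is detected.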
1 parent a7c6646 commit 674ffb6

5 files changed (518 additions, 189 deletions)


dojo/finding/deduplication.py

Lines changed: 264 additions & 0 deletions
@@ -826,3 +826,267 @@ def dedupe_batch_of_findings(findings, *args, **kwargs):

```python
    else:
        deduplicationLogger.debug("dedupe: skipping dedupe because it's disabled in system settings get()")
    return None


# ---------------------------------------------------------------------------
# False-positive history helpers
# ---------------------------------------------------------------------------


def _fp_candidates_qs(scope_filter, dedup_alg, findings, exclude_ids=None):
    """
    Build and return a lazy QuerySet of existing findings that could be FP matches
    for the given list of findings under the specified algorithm and scope.

    Single source of truth for the algorithm dispatch, shared between
    match_finding_to_existing_findings (returns the QS directly for chaining) and
    _fetch_fp_candidates_for_batch (evaluates it into a keyed dict).

    For the legacy algorithm, exclude_ids is intentionally ignored — this matches
    the original match_finding_to_existing_findings behaviour.
    """
    if dedup_alg == "hash_code":
        hash_codes = {f.hash_code for f in findings if getattr(f, "hash_code", None)}
        if not hash_codes:
            return Finding.objects.none()
        qs = Finding.objects.filter(**scope_filter, hash_code__in=hash_codes).exclude(hash_code=None)
        if exclude_ids:
            qs = qs.exclude(id__in=exclude_ids)
        return qs.order_by("id")

    if dedup_alg == "unique_id_from_tool":
        uids = {f.unique_id_from_tool for f in findings if getattr(f, "unique_id_from_tool", None)}
        if not uids:
            return Finding.objects.none()
        qs = Finding.objects.filter(**scope_filter, unique_id_from_tool__in=uids).exclude(unique_id_from_tool=None)
        if exclude_ids:
            qs = qs.exclude(id__in=exclude_ids)
        return qs.order_by("id")

    if dedup_alg == "unique_id_from_tool_or_hash_code":
        hash_codes = {f.hash_code for f in findings if getattr(f, "hash_code", None)}
        uids = {f.unique_id_from_tool for f in findings if getattr(f, "unique_id_from_tool", None)}
        if not hash_codes and not uids:
            return Finding.objects.none()
        cond = Q()
        if hash_codes:
            cond |= Q(hash_code__isnull=False, hash_code__in=hash_codes)
        if uids:
            cond |= Q(unique_id_from_tool__isnull=False, unique_id_from_tool__in=uids)
        qs = Finding.objects.filter(Q(**scope_filter)).filter(cond)
        if exclude_ids:
            qs = qs.exclude(id__in=exclude_ids)
        return qs.order_by("id")

    if dedup_alg == "legacy":
        pairs = {
            (f.title, f.severity, Finding.get_numerical_severity(f.severity))
            for f in findings
            if getattr(f, "title", None)
        }
        if not pairs:
            return Finding.objects.none()
        cond = Q()
        for title, severity, num_sev in pairs:
            cond |= Q(title__iexact=title, severity=severity, numerical_severity=num_sev)
        # Legacy does not exclude by id — matches the original match_finding_to_existing_findings behaviour.
        return Finding.objects.filter(**scope_filter).filter(cond).order_by("id")

    logger.error(
        "FALSE_POSITIVE_HISTORY: unexpected deduplication_algorithm '%s', returning empty candidates",
        dedup_alg,
    )
    return Finding.objects.none()


def _fetch_fp_candidates_for_batch(findings, product, dedup_alg):
    """
    Fetch all existing findings in the product that could be FP matches for a batch,
    returning a dict keyed by match identifier for in-memory lookup.

    For unique_id_from_tool_or_hash_code the return value is a tuple (by_uid, by_hash).
    For all other algorithms it is a plain dict.
    """
    scope_filter = {"test__engagement__product": product}
    exclude_ids = {f.id for f in findings if f.id}
    qs = _fp_candidates_qs(scope_filter, dedup_alg, findings, exclude_ids).only(
        # Keep this list in sync with every field read from candidate objects in this function.
        # Accessing a field not listed here causes Django to issue an extra SELECT per object,
        # silently negating the .only() optimisation.
        "id", "false_p", "active", "hash_code", "unique_id_from_tool", "title", "severity",
    )

    if dedup_alg == "unique_id_from_tool_or_hash_code":
        by_hash: dict = {}
        by_uid: dict = {}
        for ef in qs:
            if ef.hash_code:
                by_hash.setdefault(ef.hash_code, []).append(ef)
            if ef.unique_id_from_tool:
                by_uid.setdefault(ef.unique_id_from_tool, []).append(ef)
        return by_uid, by_hash

    if dedup_alg == "hash_code":
        result: dict = {}
        for ef in qs:
            result.setdefault(ef.hash_code, []).append(ef)
        return result

    if dedup_alg == "unique_id_from_tool":
        result = {}
        for ef in qs:
            result.setdefault(ef.unique_id_from_tool, []).append(ef)
        return result

    if dedup_alg == "legacy":
        result = {}
        for ef in qs:
            result.setdefault((ef.title.lower(), ef.severity), []).append(ef)
        return result

    return {}


def do_false_positive_history_batch(findings):
    """
    Batch version of do_false_positive_history.

    Processes a list of findings from the same product in a single DB round-trip
    rather than one query per finding. All findings are expected to share the
    same test (i.e. same deduplication_algorithm and same product), which is
    guaranteed by both callers (post_process_findings_batch and bulk-edit).

    Args:
        findings: list of :model:`dojo.Finding` instances

    """
    if not findings:
        return

    system_settings = System_Settings.objects.get()

    product = findings[0].test.engagement.product
    dedup_alg = findings[0].test.deduplication_algorithm

    # Fetch all candidate existing findings with one DB query
    candidates = _fetch_fp_candidates_for_batch(findings, product, dedup_alg)

    to_mark_as_fp_ids: set = set()

    for finding in findings:
        # Resolve candidate list(s) for this finding
        if dedup_alg == "unique_id_from_tool_or_hash_code":
            by_uid, by_hash = candidates  # type: ignore[misc]
            uid_matches = by_uid.get(finding.unique_id_from_tool, []) if finding.unique_id_from_tool else []
            hash_matches = by_hash.get(finding.hash_code, []) if finding.hash_code else []
            # Deduplicate by id while preserving both uid and hash matches
            seen: dict = {}
            for ef in uid_matches + hash_matches:
                seen.setdefault(ef.id, ef)
            existing = list(seen.values())
        elif dedup_alg == "hash_code":
            existing = candidates.get(finding.hash_code, []) if finding.hash_code else []
        elif dedup_alg == "unique_id_from_tool":
            existing = candidates.get(finding.unique_id_from_tool, []) if finding.unique_id_from_tool else []
        elif dedup_alg == "legacy":
            key = (finding.title.lower(), finding.severity) if finding.title else None
            existing = candidates.get(key, []) if key else []
        else:
            existing = []

        existing_fps = [ef for ef in existing if ef.false_p]

        if existing_fps:
            finding.false_p = True
            if finding.id:
                to_mark_as_fp_ids.add(finding.id)

        if system_settings.retroactive_false_positive_history and finding.false_p:
            for ef in existing:
                if ef.active and not ef.false_p:
                    to_mark_as_fp_ids.add(ef.id)

    if to_mark_as_fp_ids:
        deduplicationLogger.debug(
            "FALSE_POSITIVE_HISTORY (batch): marking %i finding(s) as false positive: %s",
            len(to_mark_as_fp_ids),
            sorted(to_mark_as_fp_ids),
        )
        # QuerySet.update() bypasses Django signals — intentional, mimicking the previous
        # super(Finding, find).save(skip_validation=True) calls that also skipped all post-save processing.
        # Note: .only() does not constrain update() — Django generates the UPDATE SQL independently.
        Finding.objects.filter(id__in=to_mark_as_fp_ids).update(false_p=True, active=False, verified=False)


def do_false_positive_history(finding, *args, **kwargs):
    """
    Replicate false positives across a product.

    Mark the finding as false positive if the same finding was previously marked
    as false positive in the same product; beyond that, retroactively mark all
    equal findings in the product as false positive (if they weren't already).
    The retroactive replication is also triggered if the finding passed as an
    argument is already a false positive. This ensures that on each call of this
    method all findings in the product comply with the rule (if one finding is a
    false positive, all equal findings in the same product are too).

    Args:
        finding (:model:`dojo.Finding`): Finding to be replicated

    """
    do_false_positive_history_batch([finding])


def match_finding_to_existing_findings(finding, product=None, engagement=None, test=None):
    """
    Customizable lookup that returns all existing findings equal to a given finding.

    Takes one finding as an argument and returns all findings that are equal to it
    within the same product, engagement or test. For now, only one custom filter
    can be used, so you should choose between product, engagement or test.
    The lookup is based on the deduplication_algorithm of the given finding's test.

    Args:
        finding (:model:`dojo.Finding`): Finding to be matched
        product (:model:`dojo.Product`, optional): Product to filter findings by
        engagement (:model:`dojo.Engagement`, optional): Engagement to filter findings by
        test (:model:`dojo.Test`, optional): Test to filter findings by

    """
    if product:
        custom_filter_type = "product"
        custom_filter = {"test__engagement__product": product}

    elif engagement:
        custom_filter_type = "engagement"
        custom_filter = {"test__engagement": engagement}

    elif test:
        custom_filter_type = "test"
        custom_filter = {"test": test}

    else:
        msg = "No product, engagement or test provided as argument."
        raise ValueError(msg)

    deduplication_algorithm = finding.test.deduplication_algorithm

    deduplicationLogger.debug(
        "Matching finding %i:%s to existing findings in %s %s using %s as deduplication algorithm.",
        finding.id, finding.title, custom_filter_type, list(custom_filter.values())[0], deduplication_algorithm,
    )

    if deduplication_algorithm == "legacy":
        # This is the legacy reimport behavior. Although it's pretty flawed and
        # doesn't match the legacy algorithm for deduplication, it is left as is for simplicity.
        # Re-writing the legacy deduplication here would be complicated and counter-productive.
        # If you have use cases going through this section, you're advised to create a
        # deduplication configuration for your parser.
        logger.debug("Legacy dedupe. In case of issue, you're advised to create a deduplication configuration in order not to go through this section")

    exclude_ids = {finding.id} if finding.id else set()
    qs = _fp_candidates_qs(custom_filter, deduplication_algorithm, [finding], exclude_ids=exclude_ids)

    if deduplication_algorithm == "unique_id_from_tool_or_hash_code":
        deduplicationLogger.debug(qs.query)

    return qs
```
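The only non-obvious in-memory step in the batch path is the `unique_id_from_tool_or_hash_code` merge: a candidate can match a finding on both keys, so the uid and hash match lists are combined and deduplicated by id while keeping first-seen order. A self-contained sketch of that keying and merge logic, using toy stand-ins for `Finding` rows (no Django involved):

```python
# Toy stand-ins for Finding rows; attribute names mirror the fields the batch helper reads.
class Candidate:
    def __init__(self, id, hash_code=None, unique_id_from_tool=None):
        self.id = id
        self.hash_code = hash_code
        self.unique_id_from_tool = unique_id_from_tool


# One candidate matches on both keys, the others on one key each.
a = Candidate(1, hash_code="h1", unique_id_from_tool="u1")
b = Candidate(2, hash_code="h1")
c = Candidate(3, unique_id_from_tool="u1")

# Index all candidates once, like _fetch_fp_candidates_for_batch does after its single SELECT.
by_hash, by_uid = {}, {}
for ef in (a, b, c):
    if ef.hash_code:
        by_hash.setdefault(ef.hash_code, []).append(ef)
    if ef.unique_id_from_tool:
        by_uid.setdefault(ef.unique_id_from_tool, []).append(ef)

# Per-finding lookup: merge uid and hash matches, deduplicating by id while
# preserving first-seen order (Python dicts keep insertion order).
uid_matches = by_uid.get("u1", [])
hash_matches = by_hash.get("h1", [])
seen = {}
for ef in uid_matches + hash_matches:
    seen.setdefault(ef.id, ef)
existing_ids = [ef.id for ef in seen.values()]
```

Candidate `a` matches on both uid and hash but appears only once in the merged result, which is exactly why `setdefault` on id is used rather than plain list concatenation.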

dojo/finding/helper.py

Lines changed: 3 additions & 3 deletions
```diff
@@ -21,6 +21,8 @@
 from dojo.finding.deduplication import (
     dedupe_batch_of_findings,
     do_dedupe_finding_task_internal,
+    do_false_positive_history,
+    do_false_positive_history_batch,
     get_finding_models_for_deduplication,
 )
 from dojo.jira_link.helper import is_keep_in_sync_with_jira
@@ -46,7 +48,6 @@
 from dojo.utils import (
     calculate_grade,
     close_external_issue,
-    do_false_positive_history,
     get_current_user,
     get_object_or_none,
     mass_model_updater,
@@ -501,8 +502,7 @@ def post_process_findings_batch(
     if system_settings.enable_deduplication:
         deduplicationLogger.warning("skipping false positive history because deduplication is also enabled")
     else:
-        for finding in findings:
-            do_false_positive_history(finding, *args, **kwargs)
+        do_false_positive_history_batch(findings)

     # Non-status changing tasks
     if issue_updater_option:
```
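The payoff of replacing the per-finding loop with the batch call is that write queries no longer scale with the number of affected findings. A toy query counter (purely illustrative, not Django) makes the O(N)-vs-O(1) difference concrete:

```python
# Toy query counter standing in for a DB connection (illustrative only, not Django).
class QueryCounter:
    def __init__(self):
        self.count = 0

    def execute(self, sql):
        self.count += 1


def mark_fp_per_finding(db, finding_ids):
    # Old pattern: one UPDATE per affected finding, so queries grow with N.
    for fid in finding_ids:
        db.execute(f"UPDATE finding SET false_p = true WHERE id = {fid}")


def mark_fp_bulk(db, finding_ids):
    # New pattern: a single UPDATE ... WHERE id IN (...), flat regardless of N.
    ids = ", ".join(str(fid) for fid in finding_ids)
    db.execute(f"UPDATE finding SET false_p = true WHERE id IN ({ids})")


old_db, new_db = QueryCounter(), QueryCounter()
affected = [11, 12, 13, 14, 15]
mark_fp_per_finding(old_db, affected)
mark_fp_bulk(new_db, affected)
```

This is the property the `assertNumQueries(7)` tests pin down: the total stays fixed no matter how many findings the bulk UPDATE covers.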
