perf: batch duplicate marking in batch deduplication (#14458)
* perf: batch duplicate marking in batch deduplication
Instead of saving each duplicate finding individually, collect all
modified findings during a batch deduplication run and flush them in
a single bulk_update call. Original (existing) findings are still
saved individually to preserve auto_now timestamp updates and
post_save signal behavior, but are deduplicated by id so each is
saved at most once per batch.
Reduces DB writes from O(2N) individual saves to 1 bulk_update +
O(unique originals) saves for a batch of N duplicates.
A performance test shows 23 fewer queries on a second import with duplicates.
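The write-batching described above can be sketched with plain-Python stand-ins for the Django model. `FakeFinding`, its counters, and `mark_duplicates_batched` are illustrative, not DefectDojo code; `bulk_update` here mirrors the shape of Django's `QuerySet.bulk_update`:

```python
class FakeFinding:
    """Plain-Python stand-in for a Django model; counts DB writes."""

    save_calls = 0
    bulk_update_calls = 0

    def __init__(self, pk):
        self.pk = pk
        self.duplicate = False
        self.duplicate_finding = None

    def save(self):
        # Stands in for an individual UPDATE (where auto_now and
        # post_save signals would fire in the real ORM).
        FakeFinding.save_calls += 1

    @classmethod
    def bulk_update(cls, objs, fields):
        # Stands in for one bulk UPDATE covering all modified rows.
        cls.bulk_update_calls += 1


def mark_duplicates_batched(pairs):
    """pairs: iterable of (new_finding, original_finding) duplicate matches."""
    modified = []
    originals_by_id = {}
    for new, original in pairs:
        new.duplicate = True
        new.duplicate_finding = original
        modified.append(new)                     # collected, flushed once below
        originals_by_id[original.pk] = original  # dedup by id: saved at most once
    if modified:
        FakeFinding.bulk_update(modified, ["duplicate", "duplicate_finding"])
    for original in originals_by_id.values():
        original.save()  # individual save keeps auto_now / post_save behavior
```

For a batch of N duplicates this yields one `bulk_update` call plus one `save` per unique original, matching the O(2N)-to-batched reduction claimed above.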
* perf: restrict SELECT columns for batch deduplication via only()
Add Finding.DEDUPLICATION_FIELDS — the union of all Finding fields
needed across every deduplication algorithm — and apply it as an
only() clause in get_finding_models_for_deduplication.
This avoids fetching large text columns (description, mitigation,
impact, references, steps_to_reproduce, severity_justification, etc.)
when loading findings for the batch deduplication task, reducing
data transferred from the database without affecting query count.
build_candidate_scope_queryset is intentionally excluded: it is also
used for reimport matching (which accesses severity, numerical_severity
and other fields outside this set) and applying only() there would
cause deferred-field extra queries.
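A toy model of the `only()` trade-off this commit relies on: columns in the `only()` set come back with the initial SELECT, while touching any other column later costs an extra query per object. The `Row` class, the shared query counter, and the field names are illustrative, not Django's ORM or the real `DEDUPLICATION_FIELDS` list:

```python
class Row:
    """Toy stand-in for a model instance loaded via .only(...)."""

    def __init__(self, data, loaded_fields, db):
        self._data = dict(data)
        self._loaded = set(loaded_fields)
        self._db = db  # shared counter standing in for the DB connection

    def __getattr__(self, name):
        if name.startswith("_") or name not in self._data:
            raise AttributeError(name)
        if name not in self._loaded:
            self._db["queries"] += 1  # deferred field: one extra round trip
            self._loaded.add(name)
        return self._data[name]


db = {"queries": 0}
DEDUP_FIELDS = ("id", "title", "hash_code")  # illustrative subset
record = {"id": 1, "title": "XSS", "hash_code": "abc",
          "description": "large text"}

db["queries"] += 1            # the initial restricted SELECT
finding = Row(record, DEDUP_FIELDS, db)
_ = finding.hash_code         # in the only() set: no extra query
_ = finding.description       # deferred: triggers an extra query
```

This is exactly why `only()` is unsafe on `build_candidate_scope_queryset`: reimport matching touches fields outside the dedup set, and each such access would add a per-object query.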
* perf(dedup): defer large text fields on candidate queryset
- Add Finding.DEDUPLICATION_DEFERRED_FIELDS constant listing large text
columns (description, mitigation, impact, references, etc.) that are
never read during deduplication or candidate matching.
- Apply .defer(*Finding.DEDUPLICATION_DEFERRED_FIELDS) in
build_candidate_scope_queryset to avoid loading those columns for the
potentially large candidate pool fetched per dedup batch.
Reduces deduplication second-import query count from 213 to 183 (-30).
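One way to see why `defer()` is safe here where `only()` was not: deferring just the large text columns leaves every other column (severity, numerical_severity, and so on) in the SELECT, so reimport matching can read them without extra queries. A set-arithmetic sketch with illustrative field names, not the exact DefectDojo lists:

```python
# Illustrative field universe for a Finding-like model (not the real schema).
ALL_FIELDS = {
    "id", "title", "cwe", "file_path", "line", "hash_code",
    "unique_id_from_tool", "severity", "numerical_severity",
    "description", "mitigation", "impact", "references",
    "steps_to_reproduce", "severity_justification",
}

# Hypothetical stand-in for Finding.DEDUPLICATION_DEFERRED_FIELDS:
# large text columns never read during dedup or candidate matching.
DEDUPLICATION_DEFERRED_FIELDS = {
    "description", "mitigation", "impact", "references",
    "steps_to_reproduce", "severity_justification",
}

def columns_selected_with_defer(all_fields, deferred):
    # defer() keeps every column except the deferred ones in the SELECT,
    # i.e. it is only() applied to the complement of the deferred set.
    return set(all_fields) - set(deferred)

loaded = columns_selected_with_defer(ALL_FIELDS, DEDUPLICATION_DEFERRED_FIELDS)
```

Because `severity` and `numerical_severity` stay in `loaded`, the candidate queryset still serves reimport matching without deferred-field extra queries, while the big text blobs stay off the wire.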
---------
Co-authored-by: Matt Tesauro <mtesauro@gmail.com>