Skip to content

perf(importers): batch Vulnerability_Id inserts#14966

Draft
valentijnscholten wants to merge 4 commits into
DefectDojo:devfrom
valentijnscholten:batch-vulnerability-id-inserts
Draft

perf(importers): batch Vulnerability_Id inserts#14966
valentijnscholten wants to merge 4 commits into
DefectDojo:devfrom
valentijnscholten:batch-vulnerability-id-inserts

Conversation

@valentijnscholten
Copy link
Copy Markdown
Member

@valentijnscholten valentijnscholten commented Jun 7, 2026

Summary

  • Fix sanitize_vulnerability_ids in dojo/finding/helper.py — was a silent no-op (reassigned local variable, return value never used by caller); now returns the filtered list
  • Replace per-row Vulnerability_Id.save() loop in save_vulnerability_ids with bulk_create, reducing N INSERT queries to 1 per finding for all callers (importer and reimporter)
  • Add cross-finding batch accumulation in DefaultImporter: store_vulnerability_ids appends Vulnerability_Id objects to self.pending_vulnerability_ids; flush_vulnerability_ids() bulk-inserts the whole batch at each batch boundary alongside location_handler.persist()
  • Add cross-finding batch accumulation in DefaultReImporter: reconcile_vulnerability_ids appends changed findings to self.pending_vuln_id_deletes and new IDs to self.pending_vulnerability_ids; flush_vulnerability_ids() issues one DELETE WHERE finding_id IN (...) followed by one bulk_create per batch; findings with unchanged IDs hit the early-exit path and never touch either buffer

For a scan with 1000 findings × 5 CVEs each: ~5000 INSERT queries reduced to O(number of batches) bulk_create calls. Same reduction applies on reimport for findings whose vulnerability IDs changed.

Replace per-row Vulnerability_Id saves with bulk_create in two layers:

- fix sanitize_vulnerability_ids to return filtered list (was a no-op
  bug — reassigned local variable, caller never saw the result)
- save_vulnerability_ids now uses bulk_create per finding instead of
  one INSERT per ID; fixes all callers including the reimporter path
- DefaultImporter.store_vulnerability_ids accumulates Vulnerability_Id
  objects across all findings in a batch; flush_vulnerability_ids()
  does a single bulk_create at each batch boundary (alongside
  location_handler.persist())

For a scan with 1000 findings × 5 CVEs each: 5000 INSERT queries
reduced to O(batches) bulk_create calls.
Extend the cross-finding accumulation pattern to DefaultReImporter:

- reconcile_vulnerability_ids now accumulates changed findings into
  pending_vuln_id_deletes / pending_vulnerability_ids instead of
  issuing per-finding DELETE + INSERT immediately
- flush_vulnerability_ids (BaseImporter) runs one bulk DELETE WHERE
  finding_id IN (...) followed by one bulk_create for all new IDs
- flush called at both dedupe batch boundaries (alongside
  location_handler.persist()) and after the mitigation loop

Early-exit path (unchanged IDs) never touches either buffer, so the
common case pays zero extra cost.

Add two unit tests: cross-finding batch (3 findings, 2 changed + 1
unchanged, verify buffer contents before flush and DB state after) and
unchanged-IDs early-exit (verify buffers stay empty).
Remove pending-rebaseline skips from TestDojoImporterPerformanceSmall
and TestDojoImporterPerformanceSmallLocations. Update all expected
query counts to reflect the batch Vulnerability_Id insert optimisation
(counts decrease by 1-20 queries per step depending on the scan size
and code path).
@valentijnscholten valentijnscholten added this to the 2.60.0 milestone Jun 7, 2026
The test mocked Vulnerability_Id.save (individual saves) but
save_vulnerability_ids now uses bulk_create. Django's bulk_create
validates FK references before issuing SQL, raising ValueError when
the finding has no pk. Mock bulk_create instead and assert on the
deduplicated object list passed to it.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant