perf(importers): batch Vulnerability_Id inserts#14966
Draft
valentijnscholten wants to merge 4 commits into
Draft
perf(importers): batch Vulnerability_Id inserts#14966valentijnscholten wants to merge 4 commits into
valentijnscholten wants to merge 4 commits into
Conversation
Replace per-row Vulnerability_Id saves with bulk_create in two layers: - fix sanitize_vulnerability_ids to return filtered list (was a no-op bug — reassigned local variable, caller never saw the result) - save_vulnerability_ids now uses bulk_create per finding instead of one INSERT per ID; fixes all callers including the reimporter path - DefaultImporter.store_vulnerability_ids accumulates Vulnerability_Id objects across all findings in a batch; flush_vulnerability_ids() does a single bulk_create at each batch boundary (alongside location_handler.persist()) For a scan with 1000 findings × 5 CVEs each: 5000 INSERT queries reduced to O(batches) bulk_create calls.
Extend the cross-finding accumulation pattern to DefaultReImporter: - reconcile_vulnerability_ids now accumulates changed findings into pending_vuln_id_deletes / pending_vulnerability_ids instead of issuing per-finding DELETE + INSERT immediately - flush_vulnerability_ids (BaseImporter) runs one bulk DELETE WHERE finding_id IN (...) followed by one bulk_create for all new IDs - flush called at both dedupe batch boundaries (alongside location_handler.persist()) and after the mitigation loop Early-exit path (unchanged IDs) never touches either buffer, so the common case pays zero extra cost. Add two unit tests: cross-finding batch (3 findings, 2 changed + 1 unchanged, verify buffer contents before flush and DB state after) and unchanged-IDs early-exit (verify buffers stay empty).
Remove pending-rebaseline skips from TestDojoImporterPerformanceSmall and TestDojoImporterPerformanceSmallLocations. Update all expected query counts to reflect the batch Vulnerability_Id insert optimisation (counts decrease by 1-20 queries per step depending on the scan size and code path).
The test mocked Vulnerability_Id.save (individual saves) but save_vulnerability_ids now uses bulk_create. Django's bulk_create validates FK references before issuing SQL, raising ValueError when the finding has no pk. Mock bulk_create instead and assert on the deduplicated object list passed to it.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
sanitize_vulnerability_idsindojo/finding/helper.py— was a silent no-op (reassigned local variable, return value never used by caller); now returns the filtered listVulnerability_Id.save()loop insave_vulnerability_idswithbulk_create, reducing N INSERT queries to 1 per finding for all callers (importer and reimporter)DefaultImporter:store_vulnerability_idsappendsVulnerability_Idobjects toself.pending_vulnerability_ids;flush_vulnerability_ids()bulk-inserts the whole batch at each batch boundary alongsidelocation_handler.persist()DefaultReImporter:reconcile_vulnerability_idsappends changed findings toself.pending_vuln_id_deletesand new IDs toself.pending_vulnerability_ids;flush_vulnerability_ids()issues oneDELETE WHERE finding_id IN (...)followed by onebulk_createper batch; findings with unchanged IDs hit the early-exit path and never touch either bufferFor a scan with 1000 findings × 5 CVEs each: ~5000 INSERT queries reduced to O(number of batches)
bulk_createcalls. Same reduction applies on reimport for findings whose vulnerability IDs changed.