Refactor batch size and snapshot interval handling to use configurable environment variables

andrew · andrew · commit c93f85d26a11 · 2026-01-04T21:46:50.000Z
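
A minimal standalone sketch of the fallback pattern this commit introduces: an environment variable is parsed to an integer and a default is used when it is unset or empty. `resolve_setting` and the `env` hash are hypothetical names used only for illustration. The guard against non-positive values is a defensive choice of the sketch, since an interval of 0 would make a `count % interval` check raise `ZeroDivisionError`.

```ruby
# Illustrative sketch only; it does not load or call the git-pkgs gem.
DEFAULT_BATCH_SIZE = 500
DEFAULT_SNAPSHOT_INTERVAL = 50

# Returns a positive integer parsed from a string, or nil when the value is
# nil, empty, non-numeric, or not positive ("abc".to_i and "".to_i are both 0).
def int_presence(value)
  parsed = value.to_s.to_i
  parsed.positive? ? parsed : nil
end

# Hypothetical helper: look up `key` in an ENV-like hash, falling back to `default`.
def resolve_setting(env, key, default)
  int_presence(env[key]) || default
end

# A plain hash stands in for ENV so the example does not depend on the shell.
env = { "GIT_PKGS_BATCH_SIZE" => "1000", "GIT_PKGS_SNAPSHOT_INTERVAL" => "" }

batch_size = resolve_setting(env, "GIT_PKGS_BATCH_SIZE", DEFAULT_BATCH_SIZE)
snapshot_interval = resolve_setting(env, "GIT_PKGS_SNAPSHOT_INTERVAL", DEFAULT_SNAPSHOT_INTERVAL)

puts "batch_size=#{batch_size} snapshot_interval=#{snapshot_interval}"
# prints "batch_size=1000 snapshot_interval=50"
```

With the real tool, the equivalent would be setting `GIT_PKGS_BATCH_SIZE` or `GIT_PKGS_SNAPSHOT_INTERVAL` in the environment before running `git pkgs init`.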
diff --git a/docs/internals.md b/docs/internals.md
@@ -19,7 +19,7 @@ The schema has six main tables:
 - `dependency_changes` records every add, modify, or remove event
 - `dependency_snapshots` stores full dependency state at intervals
 
-Snapshots exist because replaying thousands of change records to answer "what dependencies existed at commit X?" would be slow. Instead, we store the complete dependency set every 20 commits (`SNAPSHOT_INTERVAL`). Point-in-time queries find the nearest snapshot and replay only the changes since then.
+Snapshots exist because replaying thousands of change records to answer "what dependencies existed at commit X?" would be slow. Instead, we store the complete dependency set every 50 dependency-changing commits by default. Point-in-time queries find the nearest snapshot and replay only the changes since then.
 
 ## Git Access
 
@@ -47,12 +47,12 @@ When you run `git pkgs init` (see [`commands/init.rb`](../lib/git/pkgs/commands/
 2. Switches to bulk write mode (WAL, synchronous off, large cache)
 3. Walks commits chronologically
 4. For each commit with manifest changes, calls `analyzer.analyze_commit`
-5. Batches inserts in transactions of 100 commits
-6. Creates dependency snapshots every 20 commits that changed dependencies
+5. Batches inserts in transactions of 500 commits (by default)
+6. Creates dependency snapshots every 50 commits that changed dependencies (by default)
 7. Creates indexes after all data is loaded
 8. Switches back to normal sync mode
 
-Deferring index creation until the end speeds things up considerably. The batch size of 100 is a balance between transaction overhead and memory usage.
+Deferring index creation until the end speeds things up considerably. Both batch size and snapshot interval are configurable via environment variables (see Performance Notes below).
 
 ## Incremental Updates
 
@@ -139,6 +139,9 @@ ActiveRecord models live in [`lib/git/pkgs/models/`](../lib/git/pkgs/models/). T
 
 ## Performance Notes
 
-Typical init speed is around 300 commits per second. The main bottlenecks are git blob reads and bibliothecary parsing. The blob OID cache helps a lot: if a Gemfile hasn't changed in 50 commits, we parse it once and reuse the result. The manifest path regex filter also helps by skipping commits that only touch source files.
+Typical init speed is around 75-300 commits per second depending on the repository. The main bottlenecks are git blob reads and bibliothecary parsing. The blob OID cache helps a lot: if a Gemfile hasn't changed in 50 commits, we parse it once and reuse the result. The manifest path regex filter also helps by skipping commits that only touch source files.
 
-For repositories with long histories, the database file can grow to tens of megabytes. The periodic snapshots trade storage for query speed. You could tune `SNAPSHOT_INTERVAL` if you care more about one than the other.
+For repositories with long histories, the database file can grow to tens of megabytes. The periodic snapshots trade storage for query speed. Two environment variables let you tune this:
+
+- `GIT_PKGS_BATCH_SIZE` - Number of commits per database transaction (default: 500). Larger batches reduce transaction overhead but use more memory.
+- `GIT_PKGS_SNAPSHOT_INTERVAL` - Store full dependency state every N commits with changes (default: 50). Lower values speed up point-in-time queries but increase database size.
diff --git a/lib/git/pkgs.rb b/lib/git/pkgs.rb
@@ -44,19 +44,27 @@ class NotInitializedError < Error; end
     class NotInGitRepoError < Error; end
 
     class << self
-      attr_accessor :quiet, :git_dir, :work_tree, :db_path
+      attr_accessor :quiet, :git_dir, :work_tree, :db_path, :batch_size, :snapshot_interval
 
       def configure_from_env
         @git_dir ||= presence(ENV["GIT_DIR"])
         @work_tree ||= presence(ENV["GIT_WORK_TREE"])
         @db_path ||= presence(ENV["GIT_PKGS_DB"])
+        @batch_size ||= int_presence(ENV["GIT_PKGS_BATCH_SIZE"])
+        @snapshot_interval ||= int_presence(ENV["GIT_PKGS_SNAPSHOT_INTERVAL"])
       end
 
       def reset_config!
         @quiet = false
         @git_dir = nil
         @work_tree = nil
         @db_path = nil
+        @batch_size = nil
+        @snapshot_interval = nil
+      end
+
+      def int_presence(value)
+        value && value.to_i.positive? ? value.to_i : nil
       end
 
       def presence(value)
diff --git a/lib/git/pkgs/commands/branch.rb b/lib/git/pkgs/commands/branch.rb
@@ -6,8 +6,16 @@ module Commands
       class Branch
         include Output
 
-        BATCH_SIZE = 100
-        SNAPSHOT_INTERVAL = 20
+        DEFAULT_BATCH_SIZE = 500
+        DEFAULT_SNAPSHOT_INTERVAL = 50
+
+        def batch_size
+          Git::Pkgs.batch_size || DEFAULT_BATCH_SIZE
+        end
+
+        def snapshot_interval
+          Git::Pkgs.snapshot_interval || DEFAULT_SNAPSHOT_INTERVAL
+        end
 
         def initialize(args)
           @args = args
@@ -247,7 +255,7 @@ def bulk_process_commits(commits, branch, analyzer, total, repo)
 
               snapshot = result[:snapshot]
 
-              if dependency_commit_count % SNAPSHOT_INTERVAL == 0
+              if dependency_commit_count % snapshot_interval == 0
                 snapshot.each do |(manifest_path, name), dep_info|
                   pending_snapshots << {
                     sha: rugged_commit.oid,
@@ -262,7 +270,7 @@ def bulk_process_commits(commits, branch, analyzer, total, repo)
               end
             end
 
-            flush.call if pending_commits.size >= BATCH_SIZE
+            flush.call if pending_commits.size >= batch_size
           end
 
           if snapshot.any?
diff --git a/lib/git/pkgs/commands/init.rb b/lib/git/pkgs/commands/init.rb
@@ -6,8 +6,16 @@ module Commands
       class Init
         include Output
 
-        BATCH_SIZE = 100
-        SNAPSHOT_INTERVAL = 20 # Store snapshot every N dependency-changing commits
+        DEFAULT_BATCH_SIZE = 500
+        DEFAULT_SNAPSHOT_INTERVAL = 50
+
+        def batch_size
+          Git::Pkgs.batch_size || DEFAULT_BATCH_SIZE
+        end
+
+        def snapshot_interval
+          Git::Pkgs.snapshot_interval || DEFAULT_SNAPSHOT_INTERVAL
+        end
 
         def initialize(args)
           @args = args
@@ -35,15 +43,17 @@ def run
 
           info "Analyzing branch: #{branch_name}"
 
+          print "Loading commits..." unless Git::Pkgs.quiet
           walker = repo.walk(branch_name, @options[:since])
           commits = walker.to_a
           total = commits.size
+          print "\r#{' ' * 20}\r" unless Git::Pkgs.quiet
 
           stats = bulk_process_commits(commits, branch, analyzer, total)
 
           branch.update(last_analyzed_sha: repo.branch_target(branch_name))
 
-          print "\rCreating indexes..." unless Git::Pkgs.quiet
+          print "\rCreating indexes...#{' ' * 20}" unless Git::Pkgs.quiet
           Database.create_bulk_indexes
           Database.optimize_for_reads
 
@@ -52,7 +62,7 @@ def run
           info "\rDone!#{' ' * 20}"
           info "Analyzed #{total} commits"
           info "Found #{stats[:dependency_commits]} commits with dependency changes"
-          info "Stored #{stats[:snapshots_stored]} snapshots (every #{SNAPSHOT_INTERVAL} changes)"
+          info "Stored #{stats[:snapshots_stored]} snapshots (every #{snapshot_interval} changes)"
           info "Blob cache: #{cache_stats[:cached_blobs]} unique blobs, #{cache_stats[:blobs_with_hits]} had cache hits"
 
           unless @options[:no_hooks]
@@ -135,9 +145,11 @@ def bulk_process_commits(commits, branch, analyzer, total)
             pending_snapshots.clear
           end
 
+          progress_interval = [total / 100, 10].max
+
           commits.each do |rugged_commit|
             processed += 1
-            print "\rProcessing commit #{processed}/#{total}..." if !Git::Pkgs.quiet && (processed % 50 == 0 || processed == total)
+            print "\rProcessing commit #{processed}/#{total}..." if !Git::Pkgs.quiet && (processed % progress_interval == 0 || processed == total)
 
             next if rugged_commit.parents.length > 1 # skip merge commits
 
@@ -191,7 +203,7 @@ def bulk_process_commits(commits, branch, analyzer, total)
               snapshot = result[:snapshot]
 
               # Store snapshot at intervals
-              if dependency_commit_count % SNAPSHOT_INTERVAL == 0
+              if dependency_commit_count % snapshot_interval == 0
                 snapshot.each do |(manifest_path, name), dep_info|
                   pending_snapshots << {
                     sha: rugged_commit.oid,
@@ -206,7 +218,7 @@ def bulk_process_commits(commits, branch, analyzer, total)
               end
             end
 
-            flush.call if pending_commits.size >= BATCH_SIZE
+            flush.call if pending_commits.size >= batch_size
           end
 
           # Always store final snapshot for the last processed commit