Speed up import_data from ~20mins to 1min #60

Open
jpmckinney wants to merge 2 commits into dev from import-data-optimization

@jpmckinney (Collaborator) commented Apr 16, 2026

COPY FROM bypasses Django's ORM, which means it skips model validation and signals.

This is safe for the Data model because:

  • Model validation: Django's bulk_create (which this replaces) also doesn't call full_clean(), so we weren't getting model validation before either.
  • Database constraints: All constraints (foreign keys, NOT NULL, varchar(100)) are enforced at the database level and apply to COPY just as they did to INSERT.
  • Signals: The only signal on Data is post_save/post_delete for cache invalidation. We now call invalidate_data_caches() only once at the end.
  • auto_now_add/auto_now: These behaviors are implemented in Django's Python code, which COPY doesn't trigger. We set added and modified explicitly in the DataFrame before writing.
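To illustrate the last point, here is a minimal sketch of setting the timestamps explicitly before a COPY. It is not the code in this PR: the table name `data_table` and the `name` and `value` columns are hypothetical, and it assumes a psycopg2-style cursor, whose `copy_expert(sql, file)` method streams a file-like object to `COPY ... FROM STDIN`.

```python
import csv
import io
from datetime import datetime, timezone


def rows_to_csv_buffer(rows, now=None):
    """Serialize row tuples to CSV, appending explicit added/modified values.

    COPY does not run Django's auto_now_add/auto_now logic, so both
    timestamp columns are written out for every row.
    """
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    for row in rows:
        writer.writerow([*row, stamp, stamp])
    buffer.seek(0)  # rewind so the consumer reads from the start
    return buffer


def copy_rows(cursor, rows, now=None):
    """Load rows with a single COPY instead of many INSERTs.

    `cursor` is assumed to be a psycopg2 cursor; the table and the
    `name`/`value` columns are hypothetical placeholders.
    """
    sql = (
        "COPY data_table "
        "(geography_id, data_period, name, value, added, modified) "
        "FROM STDIN WITH (FORMAT csv)"
    )
    cursor.copy_expert(sql, rows_to_csv_buffer(rows, now))
```

Database constraints still apply to the copied rows, so a NOT NULL or foreign key violation aborts the COPY just as it would abort an INSERT.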

- Disconnect post_delete cache invalidation signal during bulk delete so
  Django can issue a direct DELETE instead of SELECTing all rows first to
  fire per-instance signals. Invalidate cache once at the end.
- Use PostgreSQL COPY FROM instead of Django's bulk_create,
  and pd.melt to reshape data in pandas/numpy C code instead of Python loops
- Replace per-geography loop with bulk operations: one SELECT geographies,
  one DELETE data, and one COPY data, for each state
- Add composite index on (geography_id, data_period) to speed up deletions
- Avoid hydrating Geography objects by using values_list
- Remove manual PK generation that caused IntegrityError on interrupted imports
  and ran two unnecessary queries at module import time
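
As a rough sketch of the pd.melt reshape mentioned above (the column names `population`, `households`, `variable`, and `value` are hypothetical, not the actual import schema), a wide frame with one column per variable becomes one row per (geography, variable) pair without a Python-level loop:

```python
import pandas as pd


def to_long(wide):
    """Reshape one-column-per-variable data into one row per value."""
    return pd.melt(
        wide,
        id_vars=["geography_id"],  # kept as-is on every output row
        var_name="variable",       # former column names land here
        value_name="value",        # former cell values land here
    )


# Hypothetical wide input: one row per geography.
wide = pd.DataFrame(
    {
        "geography_id": [1, 2],
        "population": [100, 200],
        "households": [40, 80],
    }
)

long = to_long(wide)
```

The resulting long frame has one row per value, which is the shape a single COPY per state can load directly.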