Skip to content

Optimize re-sync content#7754

Draft
jobselko wants to merge 2 commits into
pulp:mainfrom
jobselko:optimize_replication
Draft

Optimize re-sync content#7754
jobselko wants to merge 2 commits into
pulp:mainfrom
jobselko:optimize_replication

Conversation

@jobselko
Copy link
Copy Markdown
Member

📜 Checklist

  • Commits are cleanly separated with meaningful messages (simple features and bug fixes should be squashed to one commit)
  • A changelog entry or entries has been added for any significant changes
  • Follows the Pulp policy on AI Usage
  • (For new features) - User documentation and test coverage has been added

See: Pull Request Walkthrough

Assisted By: Claude Opus 4.6
def __init__(self, repo_version=None, deferred_fields=None, *args, **kwargs):
super().__init__(*args, **kwargs)
self._repo_version = repo_version
self._deferred_fields = deferred_fields or {}
Copy link
Copy Markdown
Contributor

@dralley dralley Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woudn't this be the opposite of "deferred"? It's the list of fields we want to query up-front for each type, right?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree. The name I chose is misleading, changed to extra_fields

cached = self._content_cache.get(model_type, {}).get(nat_key)
if cached is not None:
d_content.content = cached
cache_hits_by_type[model_type] |= Q(pk=cached.pk)
Copy link
Copy Markdown
Contributor

@dralley dralley Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to do it this way instead of just collecting a list of PKs and passing pk__in=...?

(there could be, probably the constructed SQL is different, I just don't know)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I just copied the logic from content_q_by_type. Changed

cache_hits_by_type = defaultdict(lambda: Q(pk__in=[]))

for d_content in batch:
if d_content.content._state.adding:
Copy link
Copy Markdown
Contributor

@dralley dralley Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Theoretically content could be passed through already saved, in which case I think we're probably not touch ing it. That might be an existing bug, though not a particularly serious one.


for model_type, content_q in content_q_by_type.items():
try:
await sync_to_async(model_type.objects.filter(content_q).touch)()
Copy link
Copy Markdown
Contributor

@dralley dralley Jun 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really an existing issue, but we're executing the content_q query (w/ natural keys) twice (once to touch, and once for the model swap), and also executing a touch query twice (once with cache hits, once with existing packages that were not in the latest version cache).

Is it possible to collect PKs within the loop, combine them with the cache hit PKs, and have one touch block below this loop?

Or (maybe question for @mdellweg), does that touch really need to happen before the swap for a timing-related reason?

@dralley
Copy link
Copy Markdown
Contributor

dralley commented Jun 1, 2026

Looks good! A few things to look at, maybe we can make this even more efficient

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants