Add alternative shard strategy #545

Draft
kmontemayor2-sc wants to merge 15 commits into main from kmonte/update-node-shard-strategy

@kmontemayor2-sc
Collaborator

@kmontemayor2-sc kmontemayor2-sc commented Mar 13, 2026

Making some changes to the way we distribute nodes for graph store mode.

This is one step toward reducing the produce load across the cluster, decreasing cluster spin-up time, and improving overall stability.

I'm also introducing gigl/distributed/graph_store/messages.py for complicated RPC messages, so we don't have to rely on tuples for this.
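
For readers who have not opened the diff, here is a minimal sketch of why a named message type tends to be easier to work with than a positional tuple. The class `FetchNodesRequest`, its fields, and the `describe` helper are hypothetical illustrations and are not taken from `messages.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FetchNodesRequest:
    """Hypothetical RPC message; the field names are illustrative only."""

    rank: int
    world_size: int
    split: Optional[str] = None
    node_type: Optional[str] = None


def describe(request: FetchNodesRequest) -> str:
    # Fields are read by name, so adding a new field later does not
    # shift the meaning of existing positions.
    split = request.split or "all"
    return f"rank {request.rank}/{request.world_size}, split={split}"


# Tuple style: the caller has to remember what each position means.
legacy_request = (0, 4, "train", None)

# Dataclass style: the call site documents itself.
request = FetchNodesRequest(rank=0, world_size=4, split="train")
print(describe(request))
```

A frozen dataclass also rejects accidental mutation after construction, which is usually what you want for a message that crosses a process boundary.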

There's also some minor cleanup in remote_dist_dataset_test.py to help reduce complexity there :)

@kmontemayor2-sc
Collaborator Author

/all_test

@github-actions
Contributor

github-actions bot commented Mar 13, 2026

GiGL Automation

@ 24:19:30UTC : 🔄 Python Unit Test started.

@ 01:26:31UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Mar 13, 2026

GiGL Automation

@ 24:19:30UTC : 🔄 Lint Test started.

@ 24:25:57UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Mar 13, 2026

GiGL Automation

@ 24:19:31UTC : 🔄 Integration Test started.

@ 01:52:42UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Mar 13, 2026

GiGL Automation

@ 24:19:32UTC : 🔄 Scala Unit Test started.

@ 24:28:30UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Mar 13, 2026

GiGL Automation

@ 24:19:32UTC : 🔄 E2E Test started.

@ 01:38:44UTC : ✅ Workflow completed successfully.

@kmontemayor2-sc
Collaborator Author

/all_test

@github-actions
Contributor

github-actions bot commented Mar 23, 2026

GiGL Automation

@ 23:59:14UTC : 🔄 Lint Test started.

@ 24:06:39UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Mar 23, 2026

GiGL Automation

@ 23:59:15UTC : 🔄 E2E Test started.

@ 01:22:05UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Mar 23, 2026

GiGL Automation

@ 23:59:17UTC : 🔄 Python Unit Test started.

@ 01:15:30UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Mar 23, 2026

GiGL Automation

@ 23:59:17UTC : 🔄 Scala Unit Test started.

@ 24:07:40UTC : ✅ Workflow completed successfully.

@github-actions
Contributor

github-actions bot commented Mar 23, 2026

GiGL Automation

@ 23:59:17UTC : 🔄 Integration Test started.

@ 01:19:53UTC : ✅ Workflow completed successfully.

kmonte and others added 9 commits March 24, 2026 16:41
Expand the one-line docstring to include concrete examples showing how
ROUND_ROBIN and CONTIGUOUS strategies distribute node IDs across compute
nodes, including split filtering and fractional server assignment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
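
As a rough illustration of the difference between the two strategy names mentioned in this commit, the sketch below shows one plausible reading of ROUND_ROBIN and CONTIGUOUS over a flat list of node IDs. The `ShardStrategy` enum and `shard_node_ids` function are hypothetical, and the sketch does not model the split filtering or fractional server assignment that the docstring covers; the actual GiGL behavior may differ.

```python
from enum import Enum
from typing import List


class ShardStrategy(Enum):
    # Strategy names come from the commit message; the enum itself is assumed.
    ROUND_ROBIN = "round_robin"
    CONTIGUOUS = "contiguous"


def shard_node_ids(
    node_ids: List[int], rank: int, world_size: int, strategy: ShardStrategy
) -> List[int]:
    """Return the node IDs assigned to `rank` under the given strategy."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside [0, {world_size})")
    if strategy is ShardStrategy.ROUND_ROBIN:
        # Every world_size-th ID, starting at this rank's offset.
        return node_ids[rank::world_size]
    # CONTIGUOUS: one consecutive block per rank, with integer boundaries
    # chosen so that block sizes differ by at most one.
    start = rank * len(node_ids) // world_size
    end = (rank + 1) * len(node_ids) // world_size
    return node_ids[start:end]


node_ids = list(range(10))
for rank in range(3):
    print(
        rank,
        shard_node_ids(node_ids, rank, 3, ShardStrategy.ROUND_ROBIN),
        shard_node_ids(node_ids, rank, 3, ShardStrategy.CONTIGUOUS),
    )
```

Under this reading, both strategies cover every ID exactly once; they differ only in which rank receives which IDs.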
… check

The world_size != num_compute_nodes validation was unnecessarily
restrictive — callers may legitimately pass a different world_size.
Also extract the validator to a module-level function since it no longer
needs self.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The sliced tensor holds a reference to the original, but in the
contiguous flow the original is a local variable that goes out of scope,
so the slice effectively owns the data. Removing clone() avoids an
unnecessary copy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the all_reduce count comparison with all_gather + sorted tensor
comparison to catch cases where counts match but actual node IDs differ
between CONTIGUOUS and ROUND_ROBIN strategies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
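
To show why comparing counts alone can miss a bug, here is a small pure-Python sketch. It stands in for the distributed check by representing each rank's gathered result as a plain list; the helper names `same_counts` and `same_node_ids` are hypothetical, and the real test uses collective operations on tensors rather than lists.

```python
from typing import List


def same_counts(a: List[List[int]], b: List[List[int]]) -> bool:
    # Analogue of summing per-rank counts: only totals are compared.
    return sum(len(shard) for shard in a) == sum(len(shard) for shard in b)


def same_node_ids(a: List[List[int]], b: List[List[int]]) -> bool:
    # Analogue of gathering every rank's IDs and comparing them sorted:
    # the actual membership must match, not just the totals.
    flat_a = sorted(node_id for shard in a for node_id in shard)
    flat_b = sorted(node_id for shard in b for node_id in shard)
    return flat_a == flat_b


round_robin = [[0, 2], [1, 3]]
contiguous = [[0, 1], [2, 3]]
buggy = [[0, 1], [1, 2]]  # same total, but ID 3 is missing and ID 1 repeats

print(same_counts(round_robin, contiguous), same_node_ids(round_robin, contiguous))
print(same_counts(round_robin, buggy), same_node_ids(round_robin, buggy))
```

The `buggy` case passes the count check and fails the sorted comparison, which is the situation the commit message describes.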
- Merge _make_rank_aware_async_mock and _make_rank_aware_ablp_async_mock
  into a single generic helper
- Remove _assert_contiguous_node_ids and _assert_contiguous_ablp_inputs
  helpers, inline assertions directly in tests
- Replace @parameterized.expand with separate named test methods for
  better readability
- Fix stale variable reference in integration test log line

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Annotate _mock_request_server, _mock_async_request_server,
_patch_remote_requests, and _create_server_with_splits kwargs with
proper type hints. Add Callable, Iterator, and Any imports.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>