@Jackmin801
Implements a TCPStore-based micro batch broadcaster, providing an alternative transport mechanism for distributing micro batches using torch.distributed.TCPStore.

This PR introduces TCPStoreMicroBatchSender and TCPStoreMicroBatchReceiver, along with a TCPStoreTransportConfig, to enable efficient micro batch communication between a master and worker ranks. A ready barrier is implemented for synchronization, and the TCPStore transport is explicitly limited to micro batches.
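The sender/receiver split and the ready barrier described above can be sketched roughly as follows. This is an illustrative sketch and not the PR's actual code: the class names `SketchSender` and `SketchReceiver`, the key layout (`ready`, `mb/<step>/<rank>`), and the use of `pickle` are all assumptions. The only store methods it relies on are `set`, `get`, `add`, and `delete_key`, which `torch.distributed.TCPStore` provides. To keep the example runnable without a network or PyTorch, it uses a hypothetical single-process `InMemoryStore` stand-in with the same method names.

```python
import pickle
import time
from typing import Any, Dict, List


class InMemoryStore:
    """Hypothetical single-process stand-in for the TCPStore methods used below."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str) -> bytes:
        # Note: the real TCPStore.get blocks until the key exists; this raises KeyError.
        return self._data[key]

    def add(self, key: str, amount: int) -> int:
        current = int(self._data.get(key, b"0")) + amount
        self._data[key] = str(current).encode()
        return current

    def delete_key(self, key: str) -> bool:
        return self._data.pop(key, None) is not None


class SketchReceiver:
    """Worker side: announces readiness, then fetches its micro batches per step."""

    def __init__(self, store: Any, rank: int) -> None:
        self.store = store
        self.rank = rank
        self.store.add("ready", 1)  # check in at the ready barrier (assumed key name)

    def receive(self, step: int) -> List[Any]:
        key = f"mb/{step}/{self.rank}"  # assumed key layout
        payload = self.store.get(key)
        self.store.delete_key(key)  # free the slot once consumed
        return pickle.loads(payload)


class SketchSender:
    """Master side: waits for all receivers, then publishes one entry per rank."""

    def __init__(self, store: Any, num_receivers: int) -> None:
        self.store = store
        self.num_receivers = num_receivers

    def wait_for_receivers(self, timeout_s: float = 5.0) -> None:
        deadline = time.monotonic() + timeout_s
        # add(key, 0) reads the counter without changing it.
        while self.store.add("ready", 0) < self.num_receivers:
            if time.monotonic() > deadline:
                raise TimeoutError("receivers did not reach the ready barrier")
            time.sleep(0.01)

    def send(self, step: int, batches_per_rank: List[List[Any]]) -> None:
        for rank, batches in enumerate(batches_per_rank):
            self.store.set(f"mb/{step}/{rank}", pickle.dumps(batches))


# Single-process demonstration with two receivers.
store = InMemoryStore()
receivers = [SketchReceiver(store, rank) for rank in range(2)]
sender = SketchSender(store, num_receivers=2)
sender.wait_for_receivers()
sender.send(step=0, batches_per_rank=[[{"ids": [1, 2]}], [{"ids": [3, 4]}]])
received = [r.receive(step=0) for r in receivers]
```

In a multi-process setting, the stand-in would be replaced by a `torch.distributed.TCPStore` created with `is_master=True` on the sender and `is_master=False` on each receiver; the host, port, and timeout values would come from something like the `TCPStoreTransportConfig` mentioned above, whose fields are not shown here.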


GitHub Issue: [Issue ID]
Linear Issue: Resolves [Issue ID]


Co-authored-by: ongjackm <ongjackm@gmail.com>
@cursor
cursor bot commented Dec 27, 2025

This commit introduces a script to benchmark filesystem, ZMQ, and TCPStore transport implementations. It measures latency and throughput for various micro batch sizes.
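A latency and throughput measurement of this kind could look roughly like the sketch below. It is not the benchmark script from this commit: the function name `benchmark_transport`, the result field names, and the loopback queue standing in for a real transport are all hypothetical. The harness only assumes that a transport can be wrapped as a `send(payload)` callable and a `receive()` callable.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List


def benchmark_transport(
    send: Callable[[bytes], None],
    receive: Callable[[], bytes],
    sizes_bytes: List[int],
    repeats: int = 5,
) -> List[Dict[str, float]]:
    """Time send+receive round trips for each payload size."""
    results: List[Dict[str, float]] = []
    for size in sizes_bytes:
        payload = bytes(size)  # zero-filled payload of the requested size
        start = time.perf_counter()
        for _ in range(repeats):
            send(payload)
            received = receive()
            if len(received) != size:
                raise RuntimeError("transport returned a payload of the wrong size")
        elapsed = time.perf_counter() - start
        latency_s = elapsed / repeats  # mean time per round trip
        throughput = size / latency_s / 1e6 if latency_s > 0 else float("inf")
        results.append(
            {
                "size_bytes": float(size),
                "latency_s": latency_s,
                "throughput_mb_s": throughput,
            }
        )
    return results


# Loopback stand-in for a transport; a filesystem, ZMQ, or TCPStore
# sender/receiver pair would be wrapped in the same two callables.
queue: Deque[bytes] = deque()
results = benchmark_transport(queue.append, queue.popleft, [1_024, 65_536], repeats=3)
```

Each transport under comparison would be passed through the same function so that payload sizes and repeat counts stay identical across runs.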

Co-authored-by: ongjackm <ongjackm@gmail.com>