Fix PPR sampler memory and labeled homogeneous ABLP by mkolodner-sc · Pull Request #645 · Snapchat/GiGL

mkolodner-sc · 2026-05-19T01:04:30Z

Summary

Makes the PPR sampler cheaper and fixes the labeled-homogeneous ABLP edge case that surfaced while exercising PPR through Graph Store.

Changes include:

Precompute PPR total-degree tensors by node type through DistDataset.degree_tensor.
Store degree tensors as int32 and share them across sampling workers instead of rebuilding/copying per worker.
Update DistPPRNeighborSampler to consume precomputed degree tensors directly.
Fix labeled homogeneous ABLP PPR sampling by only passing etype=None for true homogeneous graphs.
Fix PPR in Graph Store mode so that the degree tensor's memory is shared.
Attach single-edge-type PPR outputs directly to homogeneous Data batches.
Expand PPR and degree unit coverage.
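As a rough illustration of the dtype choice in the list above (the commit history moves from int16 to int32), the sketch below compares the largest degree each integer width can hold against the bytes it costs per node. The helper names and the node count are hypothetical and are not taken from GiGL.

```python
# Bytes per element for the candidate integer dtypes.
DTYPE_BYTES = {"int16": 2, "int32": 4, "int64": 8}


def max_degree(dtype: str) -> int:
    """Largest value a signed integer of this width can represent."""
    bits = DTYPE_BYTES[dtype] * 8
    return 2 ** (bits - 1) - 1


def degree_tensor_bytes(num_nodes: int, dtype: str) -> int:
    """Size of a dense per-node degree tensor, ignoring any allocator overhead."""
    return num_nodes * DTYPE_BYTES[dtype]


# Hypothetical graph size, chosen only to make the comparison concrete.
num_nodes = 1_000_000_000
for dtype in DTYPE_BYTES:
    gib = degree_tensor_bytes(num_nodes, dtype) / 2**30
    print(f"{dtype}: max degree {max_degree(dtype):,}, {gib:.2f} GiB")
```

Under these assumptions, int16 tops out at a degree of 32,767, which a hub node can exceed, while int32 halves the footprint of int64.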

mkolodner-sc · 2026-05-28T19:15:03Z

/all_tests

github-actions · 2026-05-28T19:17:28Z

GiGL Automation

@ 19:17:28UTC : 🔄 C++ Unit Test started.

@ 19:19:34UTC : ✅ Workflow completed successfully.

github-actions · 2026-05-28T19:17:30Z

GiGL Automation

@ 19:17:29UTC : 🔄 Python Unit Test started.

@ 20:14:17UTC : ✅ Workflow completed successfully.

github-actions · 2026-05-28T19:17:31Z

GiGL Automation

@ 19:17:30UTC : 🔄 E2E Test started.

@ 20:47:24UTC : ❌ Workflow failed.
Please check the logs for more details.

github-actions · 2026-05-28T19:17:31Z

GiGL Automation

@ 19:17:30UTC : 🔄 Lint Test started.

@ 19:26:20UTC : ✅ Workflow completed successfully.

github-actions · 2026-05-28T19:17:31Z

GiGL Automation

@ 19:17:31UTC : 🔄 Integration Test started.

@ 20:30:59UTC : ✅ Workflow completed successfully.

github-actions · 2026-05-28T19:17:31Z

GiGL Automation

@ 19:17:31UTC : 🔄 Scala Unit Test started.

@ 19:28:22UTC : ✅ Workflow completed successfully.

mkolodner-sc · 2026-05-28T20:48:24Z

/e2e_test

github-actions · 2026-05-28T20:48:41Z

GiGL Automation

@ 20:48:41UTC : 🔄 E2E Test started.

@ 22:34:48UTC : ❌ Workflow failed.
Please check the logs for more details.

mkolodner-sc · 2026-05-28T21:46:59Z

/e2e_test

github-actions · 2026-05-28T21:47:12Z

GiGL Automation

@ 21:47:12UTC : 🔄 E2E Test started.

@ 23:07:38UTC : ❌ Workflow failed.
Please check the logs for more details.

mkolodner-sc · 2026-05-28T22:42:13Z

/e2e_test

github-actions · 2026-05-28T22:42:25Z

GiGL Automation

@ 22:42:25UTC : 🔄 E2E Test started.

@ 00:51:17UTC : ❌ Workflow failed.
Please check the logs for more details.

mkolodner-sc · 2026-05-29T00:23:58Z

/e2e_test

github-actions · 2026-05-29T00:24:12Z

GiGL Automation

@ 00:24:11UTC : 🔄 E2E Test started.

@ 01:47:47UTC : ❌ Workflow failed.
Please check the logs for more details.

mkolodner-sc · 2026-05-29T00:37:56Z

/e2e_test

github-actions · 2026-05-29T00:38:06Z

GiGL Automation

@ 00:38:06UTC : 🔄 E2E Test started.

@ 02:04:12UTC : ✅ Workflow completed successfully.

kmontemayor2-sc

Thanks Matt! Did a first pass here. FWIW, I feel like this could have been multiple PRs for the different fixes, but this PR is fine as-is.

kmontemayor2-sc · 2026-05-29T20:52:30Z

+            self._degree_tensor = compute_and_broadcast_degree_tensor(
+                self.graph, self._edge_dir
+            )
+        share_memory(entity=self._degree_tensor)


ugh sorry to go back and forth on this, it may be weird to do this here instead of in share_ipc? Or do we not call degree_tensor until the subprocess launch?

In fact we do already share memory in share_ipc; is the problem here that we don't call degree_tensor until after we're in the subprocesses already?

Can you remind me where we first call degree_tensor and where that's located in the process tree?

> Can you remind me where we first call degree_tensor and where that's located in the process tree?

Currently, the degree tensor is not built when we call build_dataset. Instead, it is first built after we are inside one of the data loaders and know we are doing PPR sampling (specified by the SamplingOptions).

> In fact we do already share memory in share_ipc; is the problem here that we don't call degree_tensor until after we're in the subprocesses already?

Yes exactly, this was causing failures specifically in large-scale graph store cases, since it was creating copies of the tensor before we had an opportunity to call share_ipc (and thereby share the memory of the degree tensor). The original fix I had here handled this by explicitly calling share_memory before handing the degree tensor to the sampling workers in the GS setting, where it was otherwise creating many copies of the degree tensor.

I think the current solution to call share_memory immediately after building the degree tensor makes the most sense, since it doesn't need to rely on share_ipc and provides the most confidence for putting the degree tensor onto shared memory.
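To make the "share immediately after building" idea concrete, here is a minimal sketch using only Python's standard `multiprocessing.shared_memory` module as a stand-in for the shared-memory tensors discussed above. It is not GiGL's implementation: `publish_degrees` and `read_degrees` are hypothetical helpers, and the "worker" attaches from the same process purely to keep the example short.

```python
import struct
from multiprocessing import shared_memory

INT32_BYTES = 4  # each degree is stored as a little-endian int32


def publish_degrees(degrees):
    """Copy freshly built degrees into a new shared-memory block."""
    nbytes = len(degrees) * INT32_BYTES
    shm = shared_memory.SharedMemory(create=True, size=nbytes)
    shm.buf[:nbytes] = struct.pack(f"<{len(degrees)}i", *degrees)
    return shm


def read_degrees(name, count):
    """Attach to the existing block by name instead of rebuilding or copying it."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return list(struct.unpack_from(f"<{count}i", shm.buf, 0))
    finally:
        shm.close()


# Build once and publish before any worker needs the data.
owner = publish_degrees([3, 0, 7, 2])
try:
    # A sampling worker would attach by name; simulated in-process here.
    worker_view = read_degrees(owner.name, 4)
finally:
    owner.close()
    owner.unlink()

print(worker_view)
```

The point of the ordering is that the data lives in shared memory before it is handed to workers, so each worker maps the same block rather than receiving its own copy.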

We'd still have local world size copies of this object in shared memory then right? Unless we want to create the degree tensor in the top-process?

IIRC our goal here was to abstract this away from the user s.t. they don't need to pass in this implementation detail into the loaders right?

And in that vein it's probably weird if we ask them to call dataset.degree_tensor naked in the top-level process? I think ideally we can poke at /dev/shm directly to set up the shared memory properly across the local nodes here.

My concern with calling share_memory inside the property is that it may be surprising to users; shared-memory tensors do have additional restrictions on them, iirc. I guess the DistDataset already does that under the hood, so maybe it's not too bad.

I guess I have a slight preference here for calling share_memory in our loaders after we get this tensor, but that's up to you. WDYT?

> We'd still have local world size copies of this object in shared memory then right? Unless we want to create the degree tensor in the top-process?
> IIRC our goal here was to abstract this away from the user s.t. they don't need to pass in this implementation detail into the loaders right?
> And in that vein it's probably weird if we ask them to call dataset.degree_tensor naked in the top-level process? I think ideally we can poke at /dev/shm directly to set up the shared memory properly across the local nodes here.

Correct, this was a consideration made in the initial design. The tradeoff with the current approach is that there may still be local_world_size copies of the degree tensor, but since the degree tensor isn't the memory bottleneck, we proceeded with this approach. Since this is the intention of the design and isn't blocking training or inference with PPR, I'd prefer to save any further optimization for a follow-up if that becomes a blocker. One note is that I believe the graph_store case doesn't have this constraint, since it makes the share_memory call from the DistServer, so there are not local world size copies in that setting.
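The tradeoff described above can be put into numbers with a small sketch. The mode labels, the single-copy assumption for graph store, and the sizes are all hypothetical; they only restate the reasoning in this thread and are not measured from GiGL.

```python
INT32_BYTES = 4


def degree_tensor_copies(mode: str, local_world_size: int) -> int:
    """Shared-memory copies per machine, following the discussion above."""
    if mode == "colocated":
        # Hypothetical label: each local rank builds and shares its own copy.
        return local_world_size
    if mode == "graph_store":
        # Assumption: the server process shares one copy for its workers.
        return 1
    raise ValueError(f"unknown mode: {mode}")


def footprint_bytes(mode: str, local_world_size: int, num_nodes: int) -> int:
    """Total bytes held by int32 degree tensors on one machine."""
    return degree_tensor_copies(mode, local_world_size) * num_nodes * INT32_BYTES


# Hypothetical machine with 8 local ranks and 100M nodes.
for mode in ("colocated", "graph_store"):
    mib = footprint_bytes(mode, local_world_size=8, num_nodes=100_000_000) / 2**20
    print(f"{mode}: {mib:.0f} MiB")
```

Even with one copy per local rank, the total stays bounded by local_world_size rather than by the number of sampling workers, which is the property the fix is after.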

> My concern with calling share_memory inside the property is that it may be surprising to users; shared-memory tensors do have additional restrictions on them, iirc. I guess the DistDataset already does that under the hood, so maybe it's not too bad.
> I guess I have a slight preference here for calling share_memory in our loaders after we get this tensor, but that's up to you. WDYT?

That makes sense too, I can move the share_memory call back to the loaders in that case.

mkolodner-sc added 8 commits May 19, 2026 01:00

potential fix

ed818c2

Update

abb8e56

Update

a0e84fa

Improvements

088fe1b

Change int16 to int32

5ca621c

Fix degree tensor tests and type checks

ac2ef26

Merge branch 'mkolodner-sc/ppr_gs_memory' of github.com:Snapchat/GiGL…

7ad9faa

… into mkolodner-sc/ppr_gs_memory # Conflicts: # gigl/distributed/dist_ppr_sampler.py

Add E2E PPR graphstore test

d850b37

Update

845704b

Fixes

ebbc318

Fix PPR graph-store sampling worker capacity

65eac99

Fix

97bd538

more fixes

92c9f51

mkolodner-sc added 4 commits May 29, 2026 00:49

change back

7e31417

Avoid cast for heterogeneous inference node ids

d9d2086

Trim branch to PPR sampler fixes

fd1e9ae

Merge remote-tracking branch 'origin/main' into mkolodner-sc/ppr_gs_m…

71e1fa1

…emory # Conflicts: # tests/unit/distributed/utils/degree_test.py

mkolodner-sc changed the title ~~Potential Fix PPR Graph Store memory~~ Fix PPR sampler memory and labeled homogeneous ABLP May 29, 2026

mkolodner-sc mentioned this pull request May 29, 2026

Add graph-store PPR E2E wiring #655

Draft

mkolodner-sc added 7 commits May 29, 2026 17:09

Keep PPR test ty ignores

2ef9548

Remove stale PPR test ty ignore

b08f0e5

Use union shape for PPR degree tensors

a6eedd1

Restore useful degree computation comments

68ab0f2

Remove sampler diagnostic wrapper

e71ccdb

Simplify degree all-reduce helper

f76e548

Document degree tensor assumptions

23ee86f

kmontemayor2-sc reviewed May 29, 2026

Comment thread gigl/distributed/dist_sampling_producer.py

mkolodner-sc added 2 commits May 29, 2026 20:10

Address PPR degree review comments

3b3497d

Address PPR degree memory review comments

5ac1c63

kmontemayor2-sc reviewed May 29, 2026

mkolodner-sc added 3 commits May 29, 2026 21:26

Comments

aa42d7a

Document PPR degree tensor dtype rationale

2641834

Address remaining comments

1ff8635

kmontemayor2-sc approved these changes May 30, 2026

yliu2-sc approved these changes May 30, 2026
