Skip to content

Conversation

@bhearsum
Copy link
Contributor

@bhearsum bhearsum commented Aug 8, 2025

This is a cleaned up and slightly improved version of @ahal's original patch in #717. Most notably, it uses wait to resubmit new kinds as soon as they become available (instead of waiting for all kinds in each round to be completed). This means that if a slow kind gets submitted before all other (non-downstream) kinds have been submitted, that it won't block them. In the case of Gecko, the effect of this is that the test kind begins to process very quickly, and all other kinds are finished processing before that has completed.

Locally, this took ./mach taskgraph tasks from 1m26s to 1m9s (measured from command start to the final "Generated xxx tasks" message.

On try the results were a bit more mixed. The minimum time I observed without this patch was 140s, while the maximum was 291s (which seems to have been caused by bugbug slowness...which I'm willing to throw out). Outside of that outlier, the maximum was 146s and the mean was 143s. The minimum time I observed with this patch was 130s, while the maximum was 144s and the mean was 138s.

I presume the difference in results locally vs. Try is that locally I'm on a 64-core SSD machine, and the decision tasks run on lowered powered machines on Try, so there ends up being some resource contention (I/O, I suspect, because the ProcessPoolExecutor will only run one process per CPU core) when we process kinds in parallel there.

Despite this disappointing result on Try, this may still be worth taking, as ./mach taskgraph runs twice in the critical path of many try pushes (once on a developer's machine, and again in the decision task).

raw data:
Over 5 runs on try I got, without this patch: 291s, 146s, 146s, 140s, 140s

In each of those, there were 241s, 92s, 94s, 90s, 90s between "Loading tasks for kind test" and "Generated xxxxxx tasks for kind test"

Which means we spent the following amount of time doing non-test kind things in the critical path: 50s, 54s, 52s, 50s, 50s

With this patch: 130s, 141s, and 144s, 140s, 135s

In each of those, there were 105s, 114s, 115s, 114s, 109s between "Loading tasks for kind test" and "Generated xxxxxx tasks for kind test"

Which means we spent the following amount of time doing non-test kind things, but it was almost entirely out of the critical path: 25s, 27s, 29s, 26s, 26s

@bhearsum bhearsum force-pushed the push-zwqqrttyptxp branch from 3a2c2fd to b15ccee Compare August 8, 2025 18:59
@bhearsum bhearsum marked this pull request as ready for review August 8, 2025 19:04
@bhearsum bhearsum requested a review from a team as a code owner August 8, 2025 19:04
@bhearsum bhearsum requested a review from jcristau August 8, 2025 19:04
Copy link
Contributor

@Eijebong Eijebong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a graph fail (for example because some dependency is missing in a transform), exhibits 3 issues.

  1. The error is printed when it happens, which might be in the middle of the graph generation which makes it hard to see because in big graphs it'll be hidden in the middle of the rest
  2. When a kind errors out it doesn't stop graph generation. So if you have a kind that fails in 0.5, but a second kind that takes 30s to complete, the command will take 30s. This is easily reproducible by adding a time.sleep(30) in a random transform and having another one in another kind fail
  3. taskgraph full still exits with 0 despite the graph generation having failed. (on main it properly stops immediately and exits with 1).

ahal and others added 2 commits August 11, 2025 10:41
This is a cleaned up and slightly improved version of @ahal's original patch. Most notably, it uses `wait` to resubmit new kinds as soon as they become available (instead of waiting for all kinds in each round to be completed). This means that if a slow kind gets submitted before all other (non-downstream) kinds have been submitted, that it won't block them. In the case of Gecko, the effect of this is that the `test` kind begins to process very quickly, and all other kinds are finished processing before that has completed.

Locally, this took `./mach taskgraph tasks` from 1m26s to 1m9s (measured from command start to the final "Generated xxx tasks" message.

On try the results were a bit more mixed. The minimum time I observed without this patch was 140s, while the maximum was 291s (which seems to have been caused by bugbug slowness...which I'm willing to throw out). Outside of that outlier, the maximum was 146s and the mean was 143s. The minimum time I observed with this patch was 130s, while the maximum was 144s and the mean was 138s.

I presume the difference in results locally vs. Try is that locally I'm on a 64-core SSD machine, and the decision tasks run on lowered powered machines on Try, so there ends up being some resource contention (I/O, I suspect, because the ProcessPoolExecutor will only run one process per CPU core) when we process kinds in parallel there.

Despite this disappointing result on Try, this may still be worth taking, as `./mach taskgraph` runs twice in the critical path of many try pushes (once on a developer's machine, and again in the decision task).

raw data:
Over 5 runs on try I got, without this patch: 291s, 146s, 146s, 140s, 140s

In each of those, there were 241s, 92s, 94s, 90s, 90s between "Loading tasks for kind test" and "Generated xxxxxx tasks for kind test"

Which means we spent the following amount of time doing non-test kind things in the critical path: 50s, 54s, 52s, 50s, 50s

With this patch: 130s, 141s, and 144s, 140s, 135s

In each of those, there were 105s, 114s, 115s, 114s, 109s between "Loading tasks for kind test" and "Generated xxxxxx tasks for kind test"

Which means we spent the following amount of time doing non-test kind things, but it was almost entirely out of the critical path: 25s, 27s, 29s, 26s, 26s
@bhearsum
Copy link
Contributor Author

Having a graph fail (for example because some dependency is missing in a transform), exhibits 3 issues.

1. The error is printed when it happens, which might be in the middle of the graph generation which makes it hard to see because in big graphs it'll be hidden in the middle of the rest

2. When a kind errors out it doesn't stop graph generation. So if you have a kind that fails in 0.5, but a second kind that takes 30s to complete, the command will take 30s. This is easily reproducible by adding a `time.sleep(30)` in a random transform and having another one in another kind fail

3. `taskgraph full` still exits with `0` despite the graph generation having failed. (on main it properly stops immediately and exits with 1).

Thanks for catching these. As it turns out, all three are addressed by actually checking for exceptions from the futures and shutting down the executor and re-raising the exception.

@bhearsum bhearsum requested a review from Eijebong August 11, 2025 15:05
Copy link
Contributor

@Eijebong Eijebong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks fine. Tested on a bunch of mozilla and non mozilla project and didn't see major differences in the output. The only thing that differed was the order of some dependencies in the dependency list but if that has an impact on anything downstream I'd argue that it's not taskgraph's problem...

Perf wise it's slower on my machine for tiny graphs (30% slower for a graph generating 26 tasks from 5 kinds, with one kind generating 21 of those, 0.172s to 0.232s). I had some better results on a graph generating 2000 tasks, 500 per kind, where it went from 0.8s to 0.5s. This makes sense to me and I think is an acceptable drawback because for those tasks, taskgraph generation was never going to be the main bottleneck anyway.

@bhearsum bhearsum merged commit a5fc968 into taskcluster:main Aug 11, 2025
16 of 17 checks passed
bhearsum added a commit to bhearsum/firefox that referenced this pull request Aug 13, 2025
…ers!

This patch moves mochitests out of the `test` kind and into their own kind. Aside from changing the `kind` and `source` metadata in tasks, it has no effect on task definitions.

On its own, this has no advantages. After we pick up [multiprocess kind evaluation](taskcluster/taskgraph#738) from upstream taskgraph this will make a notable improvement to `./mach taskgraph` runtime.

Differential Revision: https://phabricator.services.mozilla.com/D260707
lando-prod-mozilla bot pushed a commit to mozilla-firefox/firefox that referenced this pull request Sep 2, 2025
…rs,ahal

This patch moves mochitests out of the `test` kind and into their own kind. Aside from changing the `kind` and `source` metadata in tasks, it has no effect on task definitions.

On its own, this has no advantages. After we pick up [multiprocess kind evaluation](taskcluster/taskgraph#738) from upstream taskgraph this will make a notable improvement to `./mach taskgraph` runtime.

Differential Revision: https://phabricator.services.mozilla.com/D260707
i3roly pushed a commit to i3roly/firefox-dynasty that referenced this pull request Sep 8, 2025
…rs,ahal

This patch moves mochitests out of the `test` kind and into their own kind. Aside from changing the `kind` and `source` metadata in tasks, it has no effect on task definitions.

On its own, this has no advantages. After we pick up [multiprocess kind evaluation](taskcluster/taskgraph#738) from upstream taskgraph this will make a notable improvement to `./mach taskgraph` runtime.

Differential Revision: https://phabricator.services.mozilla.com/D260707
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this pull request Sep 8, 2025
…rs,ahal

This patch moves mochitests out of the `test` kind and into their own kind. Aside from changing the `kind` and `source` metadata in tasks, it has no effect on task definitions.

On its own, this has no advantages. After we pick up [multiprocess kind evaluation](taskcluster/taskgraph#738) from upstream taskgraph this will make a notable improvement to `./mach taskgraph` runtime.

Differential Revision: https://phabricator.services.mozilla.com/D260707

UltraBlame original commit: 664d33826bb46965770be6166ade11b6e4c43d94
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this pull request Sep 8, 2025
…rs,ahal

This patch moves mochitests out of the `test` kind and into their own kind. Aside from changing the `kind` and `source` metadata in tasks, it has no effect on task definitions.

On its own, this has no advantages. After we pick up [multiprocess kind evaluation](taskcluster/taskgraph#738) from upstream taskgraph this will make a notable improvement to `./mach taskgraph` runtime.

Differential Revision: https://phabricator.services.mozilla.com/D260707

UltraBlame original commit: 664d33826bb46965770be6166ade11b6e4c43d94
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this pull request Sep 8, 2025
…rs,ahal

This patch moves mochitests out of the `test` kind and into their own kind. Aside from changing the `kind` and `source` metadata in tasks, it has no effect on task definitions.

On its own, this has no advantages. After we pick up [multiprocess kind evaluation](taskcluster/taskgraph#738) from upstream taskgraph this will make a notable improvement to `./mach taskgraph` runtime.

Differential Revision: https://phabricator.services.mozilla.com/D260707

UltraBlame original commit: 664d33826bb46965770be6166ade11b6e4c43d94
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants