Skip to content

Conversation

@MadLittleMods
Copy link
Collaborator

@MadLittleMods MadLittleMods commented Dec 15, 2025

Add environment variables to control container resource constraints (COMPLEMENT_CONTAINER_CPU_CORES, COMPLEMENT_CONTAINER_MEMORY). This is useful to mimic a resource-constrained environment, like a CI environment.

This is spawning from some consistent flaky tests I'm seeing when running the Complement test suite with some GitHub runners in a private project.

For reference, the CI runners provided by GitHub for private projects are less than half as powerful as those for public projects.

Standard GitHub-hosted runners for public repositories

Virtual machine Processor (CPU) Memory (RAM) Storage (SSD) Architecture Workflow label
Linux 4 16 GB 14 GB x64 ubuntu-latest, ubuntu-24.04, ubuntu-22.04

-- Standard GitHub-hosted runners for public repositories


Standard GitHub-hosted runners for private repositories

Virtual Machine Processor (CPU) Memory (RAM) Storage (SSD) Architecture Workflow label
Linux 2 7 GB 14 GB x64 ubuntu-latest, ubuntu-24.04, ubuntu-22.04

-- Standard GitHub-hosted runners for private repositories

I'm now able to reproduce the same failures locally when I constrain the CPU to less than a single core COMPLEMENT_CONTAINER_CPU_CORES=0.5

Dev notes

Docker resource constraints: https://docs.docker.com/engine/containers/resource_constraints/


How to generate ENVIRONMENT.md: go run ./cmd/gendoc --config config/config.go > ENVIRONMENT.md

Pull Request Checklist

constraintStrings = append(constraintStrings, fmt.Sprintf("%.1f CPU cores", cfg.ContainerCPUCores))
}
if cfg.ContainerMemoryBytes > 0 {
// TODO: It would be nice to pretty print this in MB/GB etc.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a future TODO (not for this PR)

// Valid units are "B", (decimal: "KB", "MB", "GB, "TB, "PB"), (binary: "KiB", "MiB",
// "GiB", "TiB", "PiB") or no units (bytes). We also support "K", "M", "G" as per
// Docker's CLI.
func parseByteSizeString(inputString string) (int64, error) {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just based off glancing a few implementations and trying to make it as simple as possible for our usage here.

Copy link
Collaborator Author

@MadLittleMods MadLittleMods Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With some hindsight, we could go even simpler and only support the same units that the Docker CLI has:

Most of these options take a positive integer, followed by a suffix of b, k, m, g, to indicate bytes, kilobytes, megabytes, or gigabytes.

-- https://docs.docker.com/engine/containers/resource_constraints/


I built this the other way around though. I wanted to be able to pass in COMPLEMENT_CONTAINER_MEMORY=1GB and wrote this out, then only later added in the Docker CLI unit variants to come to this realization 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'll be nice dev UX to be this flexible.

// These are also supported to match Docker's CLI
"k": 1024,
"m": intPow(1024, 2),
"g": intPow(1024, 3),
Copy link
Collaborator Author

@MadLittleMods MadLittleMods Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See docker/cli -> cli/command/container/opts_test.go#L376-L377 to cross-check decimal vs binary.

Documented as:

Most of these options take a positive integer, followed by a suffix of b, k, m, g, to indicate bytes, kilobytes, megabytes, or gigabytes.

-- https://docs.docker.com/engine/containers/resource_constraints/

@MadLittleMods MadLittleMods marked this pull request as ready for review December 15, 2025 20:10
@MadLittleMods MadLittleMods requested review from a team and kegsay as code owners December 15, 2025 20:10
MadLittleMods added a commit that referenced this pull request Dec 22, 2025
…t` to account for slow servers (de-flake) (#830)

Follow-up to #829

Part of element-hq/synapse#18537

### What does this PR do?

Fix `TestDelayedEvents/delayed_state_events_are_kept_on_server_restart` to account for slow servers (de-flake).

Previously, this was naively using a single delayed event with a 10 second delay. But because we're stopping and starting servers here, it could take up `deployment.GetConfig().SpawnHSTimeout` (defaults to 30 seconds) for the server to start up again so by the time the server is back up, the delayed event may have already been sent, invalidating our assertions below (which expect some delayed events to still be pending and then see one of them be sent after the server is back up).

We could account for this by setting the delayed event delay to be longer than `deployment.GetConfig().SpawnHSTimeout` but that would make the test suite take longer to run in all cases even for homeservers that are quick to restart because we have to wait for that large delay.

We instead account for this by scheduling many delayed events at short intervals (we chose 10 seconds because that's what the test naively chose before). Then whenever the servers comes back, we can just check until it decrements by 1.


### Experiencing the flaky failure

As experienced when running this test against the worker-based Synapse setup we use alongside the Synapse Pro Rust apps, element-hq/synapse-rust-apps#344 (comment). We probably experience this heavily in the private project because GitHub runners are less than half as powerful as those for public projects and that single container with a share of the 2 CPU cores available is just not powerful enough to run all 16 workers effectively.

<details>
<summary>For reference, the CI runners provided by GitHub for private projects are less than half as powerful as those for public projects.</summary>

> #### Standard GitHub-hosted runners for public repositories
>
> Virtual machine | Processor (CPU) | Memory (RAM) | Storage (SSD) | Architecture | Workflow label
> --- | --- | --- | --- | --- | ---
> Linux | 4 | 16 GB | 14 GB | x64 | ubuntu-latest, ubuntu-24.04, ubuntu-22.04
>
> *-- [Standard GitHub-hosted runners for public repositories](https://docs.github.com/en/actions/reference/runners/github-hosted-runners#standard-github-hosted-runners-for-public-repositories)*

---

> #### Standard GitHub-hosted runners for private repositories
>
> Virtual Machine | Processor (CPU) | Memory (RAM) | Storage (SSD) | Architecture | Workflow label
> --- | --- | --- | --- | --- | ---
> Linux | 2 | 7 GB | 14 GB | x64 | ubuntu-latest, ubuntu-24.04, ubuntu-22.04
>
> *-- [Standard GitHub-hosted runners for private repositories](https://docs.github.com/en/actions/reference/runners/github-hosted-runners#standard-github-hosted-runners-for-public-repositories)*

</details>


And for the same slow reasons, why we're also experiencing this as an occasional [flake](element-hq/synapse#18537) with `(workers, postgres)` in the public Synapse CI as well.


### Reproduction instructions

I can easily reproduce this problem if I use #827 to limit the number of CPU's available for the homeserver containers to use: `COMPLEMENT_CONTAINER_CPUS=0.5`
Copy link
Member

@anoadragon453 anoadragon453 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Just a couple small things before it's ready to go.

config/config.go Outdated
// each iteration had a 50ms sleep between tries so the timeout is 50 * iteration ms
cfg.SpawnHSTimeout = time.Duration(50*parseEnvWithDefault("COMPLEMENT_VERSION_CHECK_ITERATIONS", 100)) * time.Millisecond
}
cfg.ContainerCPUCores, _ = strconv.ParseFloat(os.Getenv("COMPLEMENT_CONTAINER_CPUS"), 64)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be:

Suggested change
cfg.ContainerCPUCores, _ = strconv.ParseFloat(os.Getenv("COMPLEMENT_CONTAINER_CPUS"), 64)
cfg.ContainerCPUCores, _ = strconv.ParseFloat(os.Getenv("COMPLEMENT_CONTAINER_CPUS"), 0)

so it is also unlimited?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

64 is the bitsize of the float:

https://pkg.go.dev/strconv#ParseFloat

// ParseFloat converts the string s to a floating-point number
// with the precision specified by bitSize: 32 for float32, or 64 for float64.
// When bitSize=32, the result still has type float64, but it will be
// convertible to float32 without changing its value.

// Valid units are "B", (decimal: "KB", "MB", "GB, "TB, "PB"), (binary: "KiB", "MiB",
// "GiB", "TiB", "PiB") or no units (bytes). We also support "K", "M", "G" as per
// Docker's CLI.
func parseByteSizeString(inputString string) (int64, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'll be nice dev UX to be this flexible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add this to .gitignore?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤷 nahh, we can actually notice this way. I've added some example usage to the top of cmd/gendoc/main.go but ideally, the script should just do the right thing without all of the options and piping.

@MadLittleMods MadLittleMods changed the title Add environment variables to control container resource constraints (COMPLEMENT_CONTAINER_CPUS, COMPLEMENT_CONTAINER_MEMORY) Add environment variables to control container resource constraints (COMPLEMENT_CONTAINER_CPU_CORES, COMPLEMENT_CONTAINER_MEMORY) Dec 24, 2025
@MadLittleMods MadLittleMods merged commit ee6acd9 into main Dec 24, 2025
4 checks passed
@MadLittleMods MadLittleMods deleted the madlittlemods/resource-constraints branch December 24, 2025 17:11
@MadLittleMods
Copy link
Collaborator Author

Thanks for the review @anoadragon453 🐇

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants