So I woke up with jetlag with this crazy idea in my head.
Currently our AMI generation process for generic-worker relies on mechanics provided by EC2 to bootstrap our instances. We have some magic to get logs from this process into taskcluster logs, and it is tricky for the process which snapshots the instance (to produce the AMI) to know whether the installation steps succeeded. The setup code also needs to install the worker itself, so it is possible for this to go wrong and to produce AMIs for workers that don't actually have the worker installed and functioning on them.
This idea I had was pretty vague, but I'm dumping it here initially so that I can iterate on it, and others can join in with the conversation if they wish.
Imagine that instead we would bootstrap base images of a particular OS with generic-worker. When Bug 1439588 ("Add feature to support running Windows tasks in an elevated process (Administrator)") lands, we could introduce a mechanism whereby, given the necessary scopes, a user could submit a task which installs packages / performs environment setup directly as Administrator, and then a cloud-specific mechanism snapshots the instance and produces an image for the worker type. This could be used as a mechanism for people to customise worker types.
The workflow could look something like this:
- Base image worker types are created, running generic-worker, with nothing else installed on them. The worker type name might be something like "win2012r2-base" if it is, say, running Windows Server 2012 R2.
- A task is submitted to win2012r2-base with a scope-protected feature enabled in the task payload (`"features": ["createWorkerType"]`), which requires the scope `generic-worker:create-worker-type:<provisioner>/<workerType>`.
- This task installs tools etc., and either resolves successfully if all went well, or fails if it did not.
- The task could publish some artifacts with metadata about the changes it applied
- Instead of the worker rebooting as normal after task completion, it simply shuts down.
- Something external waits for the shutdown, takes an image, processes the task artifacts with the metadata, and creates (or updates) a worker type definition to use the newly generated image(s).
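For concreteness, the setup task in the workflow above might look something like the sketch below, expressed as a Python dict. The payload schema, provisioner name, and install commands are illustrative assumptions, not a final generic-worker format; only the `createWorkerType` feature name and the scope pattern come from this proposal.

```python
# Hypothetical task definition for customising a base image.
# The payload shape is a sketch; the real schema would be
# defined by generic-worker.

def create_worker_type_task(provisioner, base_worker_type):
    """Build a task that runs setup steps as Administrator and asks
    the worker to shut down (rather than reboot) on completion."""
    return {
        "provisionerId": provisioner,
        "workerType": base_worker_type,       # e.g. "win2012r2-base"
        "scopes": [
            # Grants the task the right to turn this run into an image.
            "generic-worker:create-worker-type:%s/%s"
            % (provisioner, base_worker_type),
        ],
        "payload": {
            "features": ["createWorkerType"],  # scope-protected feature
            "command": [
                # Environment setup, run elevated (Bug 1439588);
                # the concrete commands here are made up.
                ["choco", "install", "-y", "python3"],
            ],
            "artifacts": [
                # Metadata about the changes applied, for the imaging step.
                {"type": "file", "path": "image-metadata.json"},
            ],
        },
    }

task = create_worker_type_task("my-provisioner", "win2012r2-base")
```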
I'm not 100% sure about all of this at the moment; this is very much a brainstorming exercise around the idea. The objectives I was trying to achieve were:
- To have some neutral, cloud-independent mechanism for executing install steps in a worker environment. Currently we rely on passing userdata, which the ec2 config service reads and executes. Under this proposal, we'd have a cloud-agnostic way of applying installation steps
- It never sat well with me that defining a worker environment involved installing generic-worker and configuring it too. This separates the two responsibilities rather well: installation of toolchains etc. is defined/implemented in a different context to the setup and management/configuration of the worker
- It makes it simple to see whether an environment bootstrapping process was successful, since the task can simply resolve with success or failure depending on any checks it makes. Previously we had no easy way for the installation process to communicate whether bootstrapping succeeded, so AMIs could get produced even when the installation steps had failed. This was because the process that created the AMIs ran on a separate machine and had no easy way to get context back from the process running in the ec2 config service.
- It hopefully introduces a nice separation that will make it easier in future for us to move the environment bootstrapping in-tree into gecko.
- It uses existing frameworks for authorizing who/what can create AMIs (taskcluster auth), for tracking logs from the process (task artifacts), and for meeting the demands of running at scale (taskcluster tasks).
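The "something external" from the workflow above could key off the task's resolution, so an image is only ever produced from a successful setup run. A minimal sketch, with the cloud- and provisioner-specific calls injected as stand-in functions (none of these names are real APIs):

```python
# Sketch: only snapshot the instance if the setup task resolved
# successfully. poll_instance_state, take_image and
# update_worker_type are hypothetical stand-ins for cloud- and
# provisioner-specific operations.

import time

def image_if_successful(task_status, poll_instance_state, take_image,
                        update_worker_type, interval=0.0):
    """Return the new image id, or None if the setup task failed."""
    if task_status != "completed":
        # Installation steps failed: never snapshot a broken environment.
        return None
    # The worker shuts down instead of rebooting after the task;
    # wait until the instance has actually stopped before imaging.
    while poll_instance_state() != "stopped":
        time.sleep(interval)
    image_id = take_image()
    # Create (or update) the worker type definition to use the new image.
    update_worker_type(image_id)
    return image_id
```

A failed task short-circuits before any cloud call is made, which is exactly the property the old userdata-based bootstrapping lacked.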