Fleet Validation
After the control-plane successfully requests new EC2 instances from AWS, a critical validation process must confirm that every new instance has successfully bootstrapped, registered with GitHub, and is healthy before it can be used.
This process is governed by a strict, all-or-nothing principle: if even a single instance in the newly created fleet fails its validation checks, the entire fleet is considered compromised and is terminated. This ensures that only fully operational and reliable sets of runners are introduced into the active pool.
The Validation Handshake 🤝
Validation relies on a handshake between the control-plane and the new instances, using DynamoDB as the coordination point. This avoids direct communication and creates a robust, event-driven process.
The sequence is as follows:
-
Control-Plane Initiates & Assigns
runId
: For each new EC2 instance, the control-plane creates a record in DynamoDB. This initial record has its state set tocreated
and, most importantly, includes therunId
of the workflow that requested it. -
Instance Acts on
runId
: The bootstrap script on the new instance (detailed in Instance Initialization) polls its DynamoDB record. Once it detects therunId
, it uses that ID to register itself as a GitHub Actions runner. ThisrunId
becomes the runner's label, ensuring it is targeted by the correct workflow jobs. -
Instance Signals Back: Upon successful registration with GitHub, the instance writes a registration signal (e.g.,
UD_REG_OK
) back to its DynamoDB record. In parallel, aheartbeat.sh
agent sends periodic pings to DynamoDB to signal its liveness. -
Control-Plane Observes: The control-plane's fleet validation routine polls DynamoDB, waiting for the expected registration and heartbeat signals from every instance in the fleet.
Sequence Diagram: Successful Fleet Validation
sequenceDiagram
participant CP as "ControlPlane"
participant DDB as "DynamoDB"
participant Instance1
participant Instance2
Note over CP, Instance2: Fleet Creation successfully creates Instance1 & Instance2
Note over DDB, Instance2: Instance1 & Instance2 periodically write heartbeat soon after startup ♻️
CP->>DDB: Create record for Instance 1 & Instance 2<br>(state: created, runId: 123)
CP->>+DDB: Poll for signals from all instances
Instance1->>+DDB: Poll for runId
DDB-->>-Instance1: Found runId: 123
Instance1->>DDB: Write Registration Signal (UD_REG_OK) 👌
Instance2->>+DDB: Poll for runId
DDB-->>-Instance2: Found runId: 123
Instance2->>DDB: Write Registration Signal (UD_REG_OK) 👌
DDB-->>-CP: All signals and heartbeats are OK<br>(within timeout)
CP->>DDB: Update Instance 1 & Instance 2<br>(state: created -> running)
Note right of CP: Fleet is validated as successful ✅
Validation Logic & Outcomes
The control-plane's validation process waits for two conditions to be met for the entire fleet before a validation timeout expires.
-
Success: If all instances pass both registration and heartbeat checks, the control-plane transitions their state in DynamoDB from
created
torunning
, and the fleet is ready for use. -
Failure: If any instance fails a check or the validation timeout is reached, the control-plane terminates every instance in the fleet. The creation process is marked as failed, preventing a partially-healthy or unreliable fleet from being used.
Sequence Diagram: 1 out of 2 Instances Fail to Register
sequenceDiagram
participant CP as "ControlPlane"
participant DDB as "DynamoDB"
participant Instance1
participant Instance2
participant AWS
Note over CP, AWS: Fleet Creation successfully creates Instance 1 & Instance 2
Note over DDB, Instance2: Instance1 & Instance2 periodically write heartbeat soon after startup ♻️
CP->>DDB: Create record for Instance 1 & 2 (state: created, runId: 123)
CP->>+DDB: Poll for signals from all instances (Validation Timeout Starts)
Instance1->>+DDB: Poll for runId
DDB-->>-Instance1: Found runId: 123
Instance1->>DDB: Write Registration Signal (UD_REG_OK) 👌
Note over Instance2: Fails to boot or register.<br>Never signals `UD_REG_OK`.
loop Until Validation Timeout
DDB-->>CP: Still waiting for signal from Instance 2...
end
DDB-->>-CP: Final response: Polling timed out.
Note right of CP: Instance 2 failed the check.<br>Terminating fleet.
CP->>AWS: TerminateInstances(Instance1, Instance2)
AWS->>Instance1: Terminate
AWS->>Instance2: Terminate
Note right of CP: Fleet is validated as failed ❌