Refresh: System Initialization & Maintenance ⚙️
The Refresh process is a periodic, idempotent background operation, executed by a scheduled GitHub Actions workflow (e.g., via cron
), that acts as the system's central initializer and maintainer. Unlike the event-driven provision
and release
modes which are tied to workflows, refresh
operates on the entire system to ensure the live infrastructure and configuration in AWS align with a centrally defined desired state.
Its core purpose is to initialize foundational infrastructure and to perform routine maintenance.
The Refresh Model: Desired vs. Actual State
The entire Refresh process is built on this model: it compares a Desired State (the configuration you provide) with the Actual State (the resources that exist in AWS) and performs the necessary actions to align them.
- Desired State: The configuration you provide as input. This includes the AMI for runners, resource class definitions, maximum instance lifetimes, and approved subnets. This is your "source of truth."
- Actual State: The resources and records that currently exist, such as EC2 Launch Templates, SQS queues, and instance state records in DynamoDB.
This model makes the process inherently idempotent, it can be run repeatedly with the same desired state, and it will only make changes if a drift is detected.
Core Refresh Tasks
The Refresh process performs two primary architectural functions: initialization and maintenance.
1. System Initialization 🏗️
This task handles the setup and configuration of foundational AWS resources. It ensures the system has the necessary infrastructure and parameters in place before the provision
and release
processes need them. Each row in the table represents an independent initialization task.
Component | Desired State (Input) | Initialization Action | Architectural Purpose |
---|---|---|---|
🚀 EC2 Launch Template | AMI ID, IAM Profile, User Data script, etc. | Creates or updates the EC2 Launch Template. Stores its ID in DynamoDB. | Decouples instance provisioning from the details of instance configuration. |
📥 Resource Pools | A list of resourceClass definitions. |
For each resourceClass , it ensures a corresponding SQS queue exists. Stores the queue URLs in DynamoDB. |
Provides a discoverable, central endpoint for the idle runner pools. |
🔧 Operational Parameters | Values for max-idle-time , subnet-ids , etc. |
Updates the corresponding key-value records in DynamoDB. | Allows for dynamic, centralized tuning of the system's operational behavior. |
🔑 GitHub Auth | A GitHub PAT with repo scope. |
Periodically generates a new, short-lived GitHub Actions registration token and stores it in DynamoDB. | Enhances security by ensuring new instances register with a temporary token. |
2. Instance Lifecycle Maintenance 🗑️
This task acts as the system's garbage collector, enforcing the defined lifecycle rules to prevent orphaned or expired resources.
- Identifying Expired Instances: The primary policy is the
threshold
timestamp on each instance record in DynamoDB. The maintainer scans for anyidle
,claimed
, orrunning
instances that have lived past their expiration time. - Resilient Termination Process: To prevent race conditions, it uses a safe, two-phase commit process:
- Phase 1: Mark for Termination (in DynamoDB): It first attempts to "lock" the instance by changing its state to
terminated
. - Phase 2: Terminate Instance (in EC2): Only after an instance is successfully marked in the database does it issue the
TerminateInstances
command to AWS.
- Phase 1: Mark for Termination (in DynamoDB): It first attempts to "lock" the instance by changing its state to
- Artifact Cleanup: After confirming termination, it removes associated data for the terminated instances (like heartbeat records) from DynamoDB.
Sequence Diagram: Resilient Termination
sequenceDiagram
participant Maintainer as "Refresh Maintainer"
participant DDB as "DynamoDB"
participant EC2
Maintainer->>+DDB: Find instances where state is 'idle'/'running'<br>and threshold is expired
DDB-->>-Maintainer: Return list of expired instances
loop For each expired instance
Maintainer->>+DDB: Conditionally update state to 'terminated'
DDB-->>-Maintainer: Update OK
end
Note right of Maintainer: Now we have a locked list<br>of instances to terminate.
Maintainer->>+EC2: Terminate instances by ID
EC2-->>-Maintainer: Termination initiated
Maintainer->>+DDB: Delete associated artifacts<br>(heartbeats, worker signals)
DDB-->>-Maintainer: Cleanup OK
Centralized Config & Decoupling ✨
This centralized refresh
approach provides advantages to decoupling:
- Centralized Configuration: All major changes to the runner environment (like updating the AMI) can be done by changing a single configuration value and letting the initializer handle the rollout.
- Decoupling of Concerns: The
provision
logic doesn't need to know how to build an instance; it just needs to know which Launch Template to use. Therefresh
process handles the "how," keeping concerns cleanly separated.