
Detailed Design

Detailed Architecture Overview

Below is the detailed architecture diagram showcasing all components:

[Figure: Overall Architecture]

The core components in more detail:

  • Controlplane: Orchestrates instance management via distinct operational modes.
  • Resource Pool (SQS Queue): Manages idle instances ready for reuse.
  • Central State Store (DynamoDB): Holds each instance's current state and acts as an indirect signaling channel between the controlplane and instances.
  • Instances/Runners: EC2 machines executing the actual CI jobs.

Core Concepts Expanded

Controlplane Operational Modes

The controlplane operates in three distinct modes:

  • Provision: Acquires EC2 resources either by claiming from the resource pool or provisioning new instances.
  • Release: Returns instances to the resource pool after workflows complete, resetting instance state.
  • Refresh: Periodically validates instance health and lifecycle thresholds, initiating termination of unhealthy or expired instances.
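
The provision and release modes listed above can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: a `deque` stands in for the SQS-backed resource pool, a `dict` stands in for the DynamoDB state store, and the function names, record fields, and instance ID format are all hypothetical.

```python
from collections import deque
from itertools import count

# Hypothetical in-memory stand-ins: a deque for the SQS-backed resource pool
# and a dict for the DynamoDB state store.
pool = deque()
state_store = {}
_ids = count(1)


def provision(run_id):
    """Claim an idle instance from the pool, or create a new one if none is usable."""
    while pool:
        instance_id = pool.popleft()
        record = state_store.get(instance_id)
        # Skip stale pool entries whose instance is no longer idle.
        if record and record["state"] == "idle":
            record.update(state="claimed", run_id=run_id)
            return instance_id
    # Pool exhausted: fall back to creating a fresh instance (ID format is made up).
    instance_id = f"i-{next(_ids):04d}"
    state_store[instance_id] = {"state": "created", "run_id": run_id}
    return instance_id


def release(instance_id):
    """Reset the instance's state and return it to the pool for reuse."""
    state_store[instance_id] = {"state": "idle", "run_id": None}
    pool.append(instance_id)


first = provision("run-1")   # pool is empty, so a new instance is created
release(first)               # workflow done: instance goes back to the pool
second = provision("run-2")  # the pooled instance is claimed instead of creating one
```

In this sketch the second workflow reuses the instance released by the first, which is the behavior the resource pool exists to enable.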

Controlplane modes visualized

graph TD
    A[Workflow Starts ⚡️] --> B(Provision Mode);
    B --> C[Instance Ready for CI];

    D[Workflow Completes ⚡️] --> E(Release Mode);
    E --> F[Instance Returned to Pool];

    G[Manual 👈/Cron ⏱️] --> H(Refresh Mode);
    H --> I[Initialization/Maintenance];

These modes interact through a shared state mechanism stored in DynamoDB, enabling indirect signaling between runners and the controlplane.
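
As one illustration of how the shared state supports these modes, a refresh pass might read each instance's heartbeat and creation time from the store and mark unhealthy or expired instances as terminated. The sketch below assumes a plain `dict` in place of DynamoDB; the field names (`last_heartbeat`, `created_at`) and both thresholds are hypothetical, and the actual EC2 termination call is omitted.

```python
def refresh(state_store, now, heartbeat_timeout=60.0, max_lifetime=3600.0):
    """Mark unhealthy or expired instances as terminated and return their ids.

    All field names and threshold values here are illustrative assumptions.
    """
    terminated = []
    for instance_id, record in state_store.items():
        if record["state"] == "terminated":
            continue
        # Unhealthy: no heartbeat seen within the timeout window.
        unhealthy = now - record["last_heartbeat"] > heartbeat_timeout
        # Expired: the instance has outlived its allowed lifetime.
        expired = now - record["created_at"] > max_lifetime
        if unhealthy or expired:
            record["state"] = "terminated"
            terminated.append(instance_id)
    return terminated


store = {
    "i-a": {"state": "idle", "created_at": 3000.0, "last_heartbeat": 3990.0},
    "i-b": {"state": "running", "created_at": 3000.0, "last_heartbeat": 3900.0},
    "i-c": {"state": "idle", "created_at": 0.0, "last_heartbeat": 3995.0},
}
removed = refresh(store, now=4000.0)
```

Here `i-b` is removed for a stale heartbeat and `i-c` for exceeding the lifetime limit, while `i-a` stays in the pool.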

Instances and Runners

EC2 instances, acting as self-hosted runners, are dynamically managed resources that execute CI workflows. Each instance runs a minimal agent, installed at startup, that is responsible for:

  • Initializing runner environments via custom scripts (pre-runner-script).
  • Registering and deregistering with GitHub Actions runner APIs.
  • Sending periodic health signals (heartbeats) to the shared state store.
  • Observing state changes (e.g. runId assignments) for timely registration.
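
The heartbeat and state-observation duties above could look something like the following single polling step. This is a sketch under stated assumptions: the store is a plain `dict` rather than DynamoDB, and `agent_tick`, `last_heartbeat`, and `run_id` are hypothetical names chosen for illustration.

```python
import time


def agent_tick(state_store, instance_id, registered_run_id, now=None):
    """One polling iteration: write a heartbeat, then report any new runId assignment.

    Returns the newly assigned run id, or None if there is nothing new to act on.
    """
    now = time.time() if now is None else now
    record = state_store[instance_id]
    record["last_heartbeat"] = now  # health signal read by the controlplane
    assigned = record.get("run_id")
    if assigned is not None and assigned != registered_run_id:
        return assigned  # the caller would register the runner for this run
    return None


store = {"i-1": {"run_id": None}}
nothing_yet = agent_tick(store, "i-1", None, now=10.0)
store["i-1"]["run_id"] = "run-7"  # controlplane assigns a workflow run
new_run = agent_tick(store, "i-1", None, now=20.0)
already_known = agent_tick(store, "i-1", "run-7", now=30.0)
```

A real agent would call a step like this in a loop with a sleep between iterations and trigger GitHub registration whenever a new run id is returned.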

Instance States and Transition Mechanisms

Each instance has a clearly defined lifecycle managed through distinct states stored in DynamoDB:

| State | Technical Definition |
|---|---|
| created | Instance created; initialization pending (waiting on runner registration & scripts). |
| running | Runner initialized and actively accepting CI jobs. |
| idle | Instance healthy and available in the resource pool for reuse. |
| claimed | Instance reserved by a workflow, undergoing final validation checks. |
| terminated | Instance terminated following expiration, unhealthy status, or explicit shutdown. |

These states let the controlplane track where each instance is in its lifecycle and decide which operation applies to it.

Detailed State Transition Diagram

[Figure: State transition diagram]

The transitions between states are triggered by specific events in the system:

| From State | To State | Triggering Event |
|---|---|---|
| created | running | The instance agent successfully initializes and registers. |
| running | idle | The workflow completes, and the release process begins. |
| idle | claimed | The provision process selects the instance from the pool for a new workflow. |
| claimed | idle | The instance fails a health/registration check after being claimed and is returned to the pool. |
| (any) | terminated | The instance exceeds a health or lifecycle threshold and is terminated by the refresh process. |

This state management gives the controlplane the ability to reuse runners effectively and to terminate them automatically when they are no longer viable.
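
The transitions listed above can be encoded as a small lookup that rejects anything not in the table. This is a minimal sketch: `can_transition` is a hypothetical helper, it covers only the transitions documented here, and treating `terminated` as a terminal state is an assumption rather than something the table states.

```python
STATES = {"created", "running", "idle", "claimed", "terminated"}

# Explicit transitions taken from the table; "(any) -> terminated" is handled below.
ALLOWED_TRANSITIONS = {
    ("created", "running"),
    ("running", "idle"),
    ("idle", "claimed"),
    ("claimed", "idle"),
}


def can_transition(current, target):
    """Return True if moving from `current` to `target` is a documented transition."""
    if current not in STATES or target not in STATES:
        raise ValueError(f"unknown state: {current!r} -> {target!r}")
    if current == "terminated":
        return False  # assumption: terminated instances never leave that state
    if target == "terminated":
        return True   # any live state may be terminated
    return (current, target) in ALLOWED_TRANSITIONS
```

A guard like this could run before each state write, for example rejecting an attempt to move an instance from `created` straight to `idle`.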

Lifecycle of an Instance

To understand how these components and concepts work together, read our End-to-End Lifecycle Walkthrough.

Technical Deep Dives

For a detailed exploration of individual components, their challenges, and solutions, see:

  • Provision: Instance reuse/creation, selection logic, resource matching, and AWS API interactions.
  • Release: Safe instance deregistration and resource pool placement.
  • Refresh: Controlplane initialization, maintenance, and periodic checks for safe termination of long-running instances.
  • Resource Pool: SQS-backed resource pool, message structure, and producer/consumer roles.
  • Instance Initialization: Instance agent managing startup, GitHub registration/deregistration loop, heartbeats, and safe self-termination.
