Pickup Manager
The Pickup Manager (PM) is an in-process component instantiated per workflow selection attempt. It is responsible for retrieving and filtering idle instance messages from the SQS-based Resource Pool to find suitable warm runners for the requestors (Claim Workers) to fulfill the current workflow's requirements.
The Pickup Manager's core responsibilities are:
- Dequeueing Candidate Messages: Retrieving messages from the relevant SQS resource pool queue.
- Filtering: Evaluating the dequeued message (whose format is detailed in the Resource Pool documentation) against the requesting workflow’s specific compute requirements (e.g., instance type, usage class, CPU, memory).
- Dispatching or Re-queuing:
- ✅ If a message represents a suitable instance, it's passed to a Claim Worker for an attempt to claim the instance.
- ♻️ If unsuitable for the current request, the message is returned to the SQS queue for other workflows.
- ❌ If the message represents an invalid or malformed entry, it may be discarded.
Sequence Diagram: Valid Message
sequenceDiagram
participant ClaimWorker
participant PM as "Pickup Manager"
participant SQS_Pool as "Resource Pool (SQS)"
Note over PM: Intitialized with workflow's compute requirements <br>(resource class, usage class, etc.)
Note over PM, SQS_Pool: Reference a specific queue from pool. <br> Dedicated queue per resource class
ClaimWorker->>+PM: Request Instance
%% Start of the conceptual "loop" for a single successful attempt
PM->>+SQS_Pool: Dequeue Message
SQS_Pool-->>-PM: instance_message_data (with instance_id)
PM->>PM: Check Pickup Frequency (for instance_id)
%% Happy Path: Frequency is OK
PM->>PM: Compare instance_message_data.payload vs compute requirements
%% Happy Path: Message is OK (Suitable)
PM-->>-ClaimWorker: Return InstanceMessage <br> (instance_message_data.payload)
Note over ClaimWorker: Begin Claiming routine <br> Will re-request from pool if needed
1. Interfacing with SQS: The Pickup Lifecycle
The Pickup Manager interacts with SQS in a tight loop designed for low latency and efficient message handling. This process involves several conceptual SQS operations:
| Action by Pickup Manager | Conceptual SQS Operation | Purpose & Notes |
|---|---|---|
| 1 Request a message from the resource class-specific SQS queue. | ReceiveMessage |
A short poll used for discovering available idle runners. |
| 2 Delete the message from SQS. | DeleteMessage |
As soon as received, delete the message from the queue to minimize contention by multiple concurrent Provision processes or workers |
| 3 Filter the message content in-memory. | N/A (Local operation) | The contents of message checked against the workflow’s provision inputs (e.g., allowedInstanceTypes, usageClass, etc.) |
| 4a |
N/A (Dispatch to Claim Worker) | If check is OK - PM hands to requesting Claim Worker |
| 4b |
SendMessage |
If mismatching current workflow's compute constraints - message is requeued with a short delay to minimize contention |
| 4c |
N/A (Already deleted) | Something is wrong with the message (ie. parsing, unexpected attributes), it is discarded |
2. Message Handling and Classification Logic
The Pickup Manager evaluates each dequeued message to determine its fate based on the workflow's requirements:
-
✅ OK (Pass-On): The message's attributes (instance type, usage class, CPU/memory) are a suitable match.
- Action: The message is passed to a Claim Worker to attempt claiming the instance.
-
♻️ Re-queue: The instance is valid but doesn't meet the current workflow's specific filters (e.g., a different instance type is needed).
- Action: The message is sent back to its SQS queue, making it available for other, potentially more suitable, workflow requests.
-
❌ Discard: The message is malformed, describes an instance with specs that don't match its resource class, or is otherwise invalid.
- Action: The message is logged and effectively dropped, as it was already deleted from the SQS queue upon receipt.
3. Pool Exhaustion Detection
The Pickup Manager can determine that a resource pool is "exhausted" for the current workflow's request through two main heuristics:
- Globally Exhausted (Queue Empty):
- If an attempt to receive a message from the SQS queue indicates that the queue for the target
resourceClassis currently empty. - Outcome: The Pickup Manager signals to the Provision orchestrator that no idle instances are available in this pool at this moment.
- If an attempt to receive a message from the SQS queue indicates that the queue for the target
Sequence Diagram: Global Exhaustion
sequenceDiagram
participant ClaimWorker
participant PM as "Pickup Manager"
participant SQS_Pool as "SQS Resource Pool"
ClaimWorker->>+PM: Attempt Pickup (workflow_requirements)
PM->>+SQS_Pool: Request Message
SQS_Pool-->>-PM: No message 🤷
PM-->>-ClaimWorker: Return Null (Locally Exhausted for this workflow)
- Locally Exhausted (Repetitive Unsuitable Messages):
- To prevent an infinite loop due to the Pickup Manager requeueing behavior with unsuitable messages, it maintains an internal frequency count for each unique
instanceIdit encounters during its current operational cycle. - If the same
instanceIdis dequeued more than a defined tolerance threshold (ie., 5 times), the Pickup Manager considers the pool exhausted for the workflow. - Outcome: The message that triggered this tolerance is still re-queued (so it remains available for other, potentially different, workflow requests). However, the Pickup Manager signals
nullany interfacing claim workers - effectively suspending pickups for this workflow.
- To prevent an infinite loop due to the Pickup Manager requeueing behavior with unsuitable messages, it maintains an internal frequency count for each unique
Sequence Diagram: Local Exhaustion
sequenceDiagram
participant ClaimWorker
participant PM as "Pickup Manager"
participant SQS_Pool as "Resource Pool (SQS)"
ClaimWorker->>+PM: Request Instance
Note over PM: Initialize frequency counter for instances
rect
Note right of PM: Loop: First instance pickup attempt
PM-->SQS_Pool: Dequeue instance_A Message
Note over PM: Register instance_A (freq=1)<br>But unsuitable
PM-->SQS_Pool: Re-queue instance_A
end
Note over PM: Additional cycles (truncated)
rect
Note right of PM: Loop: Later pickup attempts <br> imagine only instance_A is in Pool
PM-->SQS_Pool: Dequeue instance_A Message (again)
Note over PM: Register instance_A (freq=n)<br>Still unsuitable
Note over PM: Frequency exceeds tolerance
PM-->SQS_Pool: Re-queue instance_A
end
PM-->>-ClaimWorker: Return Null (Locally Exhausted)
Note over ClaimWorker: Claim worker informs <br>controlplane of exhaustion