Skip to content
Open
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 14 additions & 2 deletions lambdas/functions/control-plane/src/aws/runners.test.ts
Original file line number Diff line number Diff line change
Expand Up @@ -513,7 +513,7 @@ describe('create runner with errors', () => {

await expect(createRunner(createRunnerConfig(defaultRunnerConfig))).rejects.toMatchObject({
name: 'ScaleError',
failedInstanceCount: 2,
failedInstanceCount: 1, // numberOfRunners when zero instances created
});

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Fixed in e8cdc4e — test now uses numberOfRunners: 3 so the assertion actually validates the value.

expect(mockEC2Client).toHaveReceivedCommandWith(
CreateFleetCommand,
Expand Down Expand Up @@ -543,6 +543,16 @@ describe('create runner with errors', () => {
);
});

it('returns partial instances on recognized scale error instead of throwing', async () => {
createFleetMockWithErrors(['UnfulfillableCapacity'], ['i-partial']);

await expect(createRunner(createRunnerConfig(defaultRunnerConfig))).resolves.toEqual(['i-partial']);
expect(mockEC2Client).toHaveReceivedCommandWith(
CreateFleetCommand,
expectedCreateFleetRequest(defaultExpectedFleetRequestValues),
);

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. Fixed in e8cdc4e — test now requests 3 runners and gets 1 instance back, making it a genuine partial-success scenario.

});

it('test error by create fleet call is thrown.', async () => {
mockEC2Client.on(CreateFleetCommand).rejects(new Error('Some error'));

Expand Down Expand Up @@ -688,12 +698,14 @@ describe('create runner with errors fail over to OnDemand', () => {
// fallback to on demand for UnfulfillableCapacity but InsufficientInstanceCapacity is thrown
createFleetMockWithWithOnDemandFallback(['UnfulfillableCapacity'], instancesIds);

// Partial success: 1 instance created, unrecognized error for the rest.
// Returns partial instances instead of throwing to prevent orphans.
await expect(
createRunner({
...createRunnerConfig(defaultRunnerConfig),
numberOfRunners: 2,
}),
).rejects.toBeInstanceOf(Error);
).resolves.toEqual(['i-123']);

expect(mockEC2Client).toHaveReceivedCommandTimes(CreateFleetCommand, 1);

Expand Down
21 changes: 19 additions & 2 deletions lambdas/functions/control-plane/src/aws/runners.ts
Original file line number Diff line number Diff line change
Expand Up @@ -201,9 +201,26 @@ async function processFleetResult(

const failedCount = countScaleErrors(errors, scaleErrors);
if (failedCount > 0) {
logger.warn('Create fleet failed, ScaleError will be thrown to trigger retry for ephemeral runners.');
if (instances.length > 0) {
logger.warn(
`Partial fleet success: ${instances.length}/${runnerParameters.numberOfRunners} instances created. ` +
`Returning partial results; caller will retry the shortfall via SQS.`,
{ data: fleet.Errors },

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Fixed in e8cdc4e — removed SQS reference, reworded to mechanism-agnostic, and added created/requested to log metadata.

);
return instances;
}
Comment on lines +223 to +230

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The caller (scaleUp in scale-up.ts) already handles instances.length < numberOfRunners — it tracks how many instances were successfully created vs. how many SQS messages it received, and marks the shortfall as batchItemFailures. This is the existing contract; we're not changing it.

Introducing a structured return type would be a larger refactor that touches every callsite of createRunner (scale-up, pool, on-demand fallback recursion) for no behavioral benefit — the length comparison already provides the signal.

logger.warn('Create fleet failed with zero instances, ScaleError will be thrown to trigger retry.');
logger.debug('Create fleet failed.', { data: fleet.Errors });
throw new ScaleError(failedCount);
throw new ScaleError(runnerParameters.numberOfRunners);
}

if (instances.length > 0) {
logger.warn(
`Partial fleet success: ${instances.length}/${runnerParameters.numberOfRunners} instances created. ` +
`Error not recognized as scaling error; returning partial results.`,
{ data: fleet.Errors },
);
return instances;
}

logger.warn('Create fleet failed, error not recognized as scaling error.', { data: fleet.Errors });
Expand Down