Skip to content

fix: properly clean up finished jobs and return failure when max-age exceeded#1128

Merged
eikek merged 8 commits into
mainfrom
eikek/fix-culling-jobs
Jun 9, 2026
Merged

fix: properly clean up finished jobs and return failure when max-age exceeded#1128
eikek merged 8 commits into
mainfrom
eikek/fix-culling-jobs

Conversation

@eikek

@eikek eikek commented Jun 3, 2026

Copy link
Copy Markdown
Member
  • Since finished jobs are actively deleted by amalthea, there is no need to specify ttlSecondsAfterFinished for a job. It will only introduce a race condition
  • Once a job is done (success or failure), amalthea will remove it after maxHibernatedDuration is passed
  • A finished job must not be touched in the reconcile loop, otherwise it will be restarted
  • If a job exceeds its MaxAge time, k8s will terminate the pod and the job goes into failed state. This state is now properly propagated to the amalthea session status
  • in cull.go the jobs are removed from being "hibernated", as this only applies to interactive sessions. Jobs are being removed by k8s automatically due to its activeDeadlineSeconds setting

/deploy

@RenkuBot

RenkuBot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

You can access the deployment of this PR at https://renku-ci-am-1128.dev.renku.ch

@eikek eikek marked this pull request as ready for review June 4, 2026 09:16
@eikek eikek requested review from a team and olevski as code owners June 4, 2026 09:16

@leafty leafty left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks OK, some small comments below.

Comment thread internal/controller/cull.go
Comment thread internal/controller/cull.go Outdated
Comment thread internal/controller/cull.go Outdated
- better test for SessionTypeNonInteractive to work for old existing
  resources that don't have that value

- short-circuit immediately in updateHibernationState
@eikek eikek requested a review from leafty June 5, 2026 06:10
@eikek eikek merged commit c04087e into main Jun 9, 2026
66 of 81 checks passed
@eikek eikek deleted the eikek/fix-culling-jobs branch June 9, 2026 07:46
@RenkuBot

RenkuBot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Tearing down the temporary RenkuLab deplyoment for this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants