Skip to content

Fix controller hang when submission exits early#1695

Open
simonlindholm wants to merge 4 commits into
cms-dev:mainfrom
simonlindholm:fix-interactive-hang
Open

Fix controller hang when submission exits early#1695
simonlindholm wants to merge 4 commits into
cms-dev:mainfrom
simonlindholm:fix-interactive-hang

Conversation

@simonlindholm
Copy link
Copy Markdown

Having the u_to_c_w fd inherit to the controller means that when the submission exits, the controller will never read EOF, because it still itself has the write end of the pipe open. Easy fix is just to make that not inherit: nothing actually needs the inheritance, in fact it's rather a bit of a security issue that each spawned submission get access to extra fd's.

Small additional related fixes:

  • if the interactive keeper hits walltime limit (claude pointed out some obscure circumstances in which this can happen, e.g. if the submission causes lots of isolate warning spam which clogs an stderr pipe which is never read from), we currently crash and leave the child un-waited. Better to generate a proper error.
  • the change to make c_to_u_r not inherit means that now if the submission exits, controller writing will cause SIGPIPE unless the controller ignores that signal. We don't particularly want that to happen (it makes it impossible to detect which process exited first, which we really ought to do to give proper verdicts); better to leave the pipe unclosed.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 13, 2026

Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 53.90%. Comparing base (15c547e) to head (3e4700c).
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
cms/grading/tasktypes/Interactive.py 0.00% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1695      +/-   ##
==========================================
- Coverage   53.90%   53.90%   -0.01%     
==========================================
  Files         341      341              
  Lines       27983    27985       +2     
==========================================
+ Hits        15084    15085       +1     
- Misses      12899    12900       +1     
Flag Coverage Δ
functionaltests 0.00% <0.00%> (ø)
unittests 53.90% <0.00%> (-0.01%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread cms/grading/tasktypes/interactive_keeper.py Outdated
@gollux
Copy link
Copy Markdown
Contributor

gollux commented May 29, 2026

Maybe I am missing something, but not closing the pipe can potentially make the problem much worse. If the solution sends multiple queries and then it exits, the controller will try to print answers for all the queries, filling up the pipe buffer and blocking indefinitely.

I would argue that it is much better to ignore both SIGPIPE and write errors inside the controller.

@simonlindholm
Copy link
Copy Markdown
Author

That issue can occur regardless. A solution which sends a large number of queries without reading responses will cause the checker->solution pipe to fill up, which causes the checker to stop processing queries, and then after a while the solution->checker pipe fills up too and we get a deadlock. In Kattis, this has been mitigated somewhat by increasing pipe buffer size to the Linux maximum of 1 MiB, and it is eventually planned to be fixed for real by introducing a proxy process (see Kattis/problemtools#113; additional upsides of that solution is being able to log the interaction and show it for sample testcases, and being able to impose communication traffic limits to eliminate judge errors from resource exhaustion caused by unbounded-size cin >> str; in checkers; downsides are complexity and increased judging walltime).

I suppose a small difference compared to that scenario is that with the pipe left open the checker could be the one left to take the blame, if the solution exits early and the judging system treats checker walltime limit timeouts as judge errors.

Anyway, this not being a hill I want to die on, I just pushed a commit that closes the pipe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

4 participants