
Reject requests on stale session or sleeping engine #4496

Merged

lvhan028 merged 8 commits into InternLM:main from lvhan028:fix-abort on Apr 9, 2026

Conversation

Collaborator

@lvhan028 lvhan028 commented Apr 4, 2026

No description provided.

Contributor

Copilot AI left a comment


Pull request overview

This PR binds an AsyncEngine “epoch” to each HTTP-bound Session so that /stop_all_session (and related abort flows) can reliably invalidate in-flight work that was bound before the stop/abort, reducing races between request binding and generation.

Changes:

  • Stamp session.epoch in the OpenAI API server when resolving/binding a session for a request.
  • Add epoch to Session objects (with reset behavior) and improve abort logging to include epoch.
  • In AsyncEngine.generate(), drop “stale” sessions when stop_all_session() has bumped the engine epoch since the request bound the session; also adjust some metrics accounting and /sleep behavior.
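The epoch mechanism described in these changes can be sketched in miniature (hypothetical class shapes for illustration; the real AsyncEngine and Session carry much more state):

```python
class Session:
    """Minimal session carrying the engine epoch it was bound under."""

    def __init__(self, session_id: int):
        self.session_id = session_id
        self.epoch = None  # stamped when the session is bound to a request


class Engine:
    """Sketch of epoch-based stale-session rejection."""

    def __init__(self):
        self.epoch = 0

    def bind(self, session: Session):
        # Stamp the current engine epoch when a request binds the session.
        session.epoch = self.epoch

    def stop_all_session(self):
        # Bumping the epoch invalidates every session bound before the stop.
        self.epoch += 1

    def is_stale(self, session: Session) -> bool:
        # A session bound under an older epoch is rejected before inference.
        return session.epoch is not None and session.epoch != self.epoch
```

With this shape, a session bound before `stop_all_session()` compares unequal to the current engine epoch and can be dropped before any generation work starts, which is the race the PR closes.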

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

lmdeploy/serve/openai/api_server.py: Binds AsyncEngine.epoch to sessions on request bind; /sleep now stops all sessions before sleeping.
lmdeploy/serve/managers/session_manager.py: Adds Session.epoch state, resets it, and logs epoch during abort.
lmdeploy/serve/core/async_engine.py: Adds stale-session detection based on epoch, bumps epoch on stop-all, and updates metrics/abort handling in generate().

Comments suppressed due to low confidence (2)

lmdeploy/serve/core/async_engine.py:413

  • This branch yields finish_reason='length' (normal completion due to token limit) but increments increase_failed_requests('error'). That will skew scheduler metrics by counting expected length-limited completions as errors; consider incrementing succeeded requests (or not marking as failed) in this case.
        if gen_config.max_new_tokens == 0:
            logger.info(f'run out of tokens. session={session_id}.')
            metrics_processor.increase_failed_requests('error')
            yield GenOut(response='',
                         history_token_len=session.step,
                         input_token_len=len(input_ids),
                         generate_token_len=0,
                         finish_reason='length',
                         token_ids=[])

lmdeploy/serve/core/async_engine.py:462

  • This pre-inference abort path yields finish_reason='abort' even though GenOut.finish_reason is not typed to allow 'abort'. Align the finish_reason enum/typing across GenOut and response models so metrics/logging and downstream code don't see unexpected values.
            if session.epoch is not None and session.epoch != self.epoch:
                logger.warning(f'[generate] session {session_id} got aborted before starting inference, '
                               f'session.epoch={session.epoch}, epoch={self.epoch}')
                metrics_processor.increase_failed_requests('abort')
                yield GenOut(response='',
                             history_token_len=0,
                             input_token_len=len(input_ids),
                             generate_token_len=0,
                             finish_reason='abort',
                             token_ids=[])
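One way to align the typing, per the comment above, is to widen the finish-reason Literal on GenOut. A sketch follows; the field names come from the snippet above, but the exact set of Literal values is an assumption:

```python
from dataclasses import dataclass, field
from typing import List, Literal, Optional

# Widened to include 'abort' so the pre-inference abort path type-checks.
FinishReason = Literal['stop', 'length', 'abort', 'error']


@dataclass
class GenOut:
    response: str
    history_token_len: int
    input_token_len: int
    generate_token_len: int
    finish_reason: Optional[FinishReason] = None
    token_ids: List[int] = field(default_factory=list)
```

Downstream response models would need the same widened type so an 'abort' value is never an unexpected variant.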


@lvhan028 changed the title from "bind epoch to session" to "Reject requests when session stale or engine sleep" on Apr 7, 2026
@lvhan028 changed the title from "Reject requests when session stale or engine sleep" to "Reject requests on stale session or sleeping engine" on Apr 7, 2026
@lvhan028 lvhan028 requested review from Copilot and lzhangzz April 7, 2026 03:53
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.



Comment on lines +359 to 363

        metrics_processor.increase_total_requests()

        if (messages is not None) ^ (input_ids is None):
            raise ValueError('You must specify exactly one of messages or input_ids')
        if isinstance(session_id, Session):

Copilot AI Apr 7, 2026


metrics_processor.increase_total_requests() is now called before the input validation that can raise ValueError (e.g. the messages/input_ids XOR check). If a caller triggers these errors, total requests will be incremented without a corresponding failed-request metric, skewing metrics. Consider moving increase_total_requests() after validation (or ensuring validation errors are counted as failures).
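The suggested reordering can be sketched with a toy metrics object (an illustration only, not the real metrics_processor):

```python
class ToyMetrics:
    """Minimal stand-in for metrics_processor (illustration only)."""

    def __init__(self):
        self.total = 0


def bind_request(messages, input_ids, metrics):
    # Validate first: a malformed request raises before it is counted,
    # so total-request and failed-request metrics stay consistent.
    if (messages is not None) ^ (input_ids is None):
        raise ValueError('You must specify exactly one of messages or input_ids')
    # Only well-formed requests reach the counter.
    metrics.total += 1
```

With this ordering, a request that fails the messages/input_ids XOR check never inflates the total-request count.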

@lvhan028 lvhan028 requested a review from CUHKSZzxy April 7, 2026 06:44
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

lmdeploy/serve/core/async_engine.py:376

  • metrics_processor.increase_total_requests() is executed before argument validation that can raise (messages vs input_ids XOR, invalid session_id type). If those errors occur, the request is counted but no failed metric is recorded, and the exception may propagate without a proper response. Move the increment after input validation (or wrap validation in try/except to record increase_failed_requests('error')).
        metrics_processor.increase_total_requests()

        if (messages is not None) ^ (input_ids is None):
            raise ValueError('You must specify exactly one of messages or input_ids')
        if isinstance(session_id, Session):
            session = session_id
        elif isinstance(session_id, int):
            session = self.session_mgr.get(session_id, step=step)
        else:
            raise ValueError(f'Invalid session_id: {session_id}. It should be an instance of Session or an integer.')


Comment on lines +1548 to 1555

    def is_engine_sleeping() -> bool:
        eng = VariableInterface.async_engine
        return eng is not None and eng.is_sleeping

    app.add_middleware(EngineSleepingMiddleware, is_sleeping=is_engine_sleeping)

    # set the maximum number of concurrent requests
    if max_concurrent_requests is not None:
        app.add_middleware(ConcurrencyLimitMiddleware, max_concurrent_requests=max_concurrent_requests)

Copilot AI Apr 8, 2026


Middleware ordering: EngineSleepingMiddleware is added before ConcurrencyLimitMiddleware, and Starlette middleware stacking makes the last added middleware outermost. That means sleeping inference requests still acquire a concurrency semaphore slot before being rejected with 503, which can unnecessarily block other endpoints (including /wakeup) under load. Consider adding EngineSleepingMiddleware after the concurrency limiter (or implementing the sleep gate inside the concurrency middleware) so rejections happen before acquiring the semaphore.
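Starlette's last-added-is-outermost stacking, which drives the ordering suggestion above, can be demonstrated with a plain-Python simulation (no FastAPI required; the middleware names here are stand-ins):

```python
def tracing_middleware(name, calls):
    """Build a middleware factory that records when it runs."""

    def wrap(inner):
        def handler(request):
            calls.append(name)  # record entry order
            return inner(request)

        return handler

    return wrap


calls = []
stack = lambda request: 'ok'  # the innermost endpoint

# Mimic add_middleware order: concurrency limiter first, sleep gate second.
# Each later middleware wraps the current stack, so the last one added
# ends up outermost and runs first.
for middleware in [tracing_middleware('concurrency_limit', calls),
                   tracing_middleware('engine_sleep_gate', calls)]:
    stack = middleware(stack)

stack('request')
# calls == ['engine_sleep_gate', 'concurrency_limit']: the sleep gate
# runs first, rejecting requests before the semaphore is touched.
```

Adding EngineSleepingMiddleware after the concurrency limiter, as the reviewer suggests, gives exactly this entry order: sleeping-engine requests are rejected before they acquire a concurrency slot.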

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

lmdeploy/turbomind/turbomind.py:292

  • TurboMind.sleep was changed to async def, but the implementation still performs a blocking wait over a ThreadPoolExecutor (and contains no await). When awaited from FastAPI (e.g., POST /sleep), this will block the event loop until all GPU workers finish sleeping, preventing the server from responding to other endpoints during that time. Consider running the whole blocking section via asyncio.to_thread(...) / loop.run_in_executor(...), or keeping sleep() synchronous and calling it via run_in_executor from higher-level async code.
    async def sleep(self, level: int = 1):
        """Sleep the model."""
        with ThreadPoolExecutor(max_workers=self.gpu_count) as e:
            for _ in e.map(self.model_comm.sleep, range(self.gpu_count), [level] * self.gpu_count):
                pass
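The reviewer's suggested fix can be sketched as follows, under the assumption that `model_comm.sleep` takes a rank and a level as the snippet implies:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def sleep(self, level: int = 1):
    """Sleep the model without blocking the event loop."""

    def _sleep_all_ranks():
        # The blocking fan-out over GPU workers stays intact, but is
        # confined to a worker thread instead of the event loop.
        with ThreadPoolExecutor(max_workers=self.gpu_count) as e:
            for _ in e.map(self.model_comm.sleep, range(self.gpu_count), [level] * self.gpu_count):
                pass

    # Offload the whole blocking section; other endpoints stay responsive.
    await asyncio.to_thread(_sleep_all_ranks)
```

`asyncio.to_thread` (Python 3.9+) runs the blocking section on the default executor, so a slow GPU sleep no longer stalls other HTTP handlers while `/sleep` is in flight.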

lmdeploy/serve/core/async_engine.py:421

  • In generate(), the max_new_tokens == 0 early-exit yields finish_reason='length', but increments failed requests as increase_failed_requests('error'). This will inflate num_errored_reqs for a non-error condition and makes metrics inconsistent with the returned finish_reason. Consider treating this as a succeeded request (or at least not counting it as an error) to keep metrics aligned with behavior.
        if gen_config.max_new_tokens == 0:
            logger.info(f'run out of tokens. session={session_id}.')
            metrics_processor.increase_failed_requests('error')
            yield GenOut(response='',
                         history_token_len=session.step,
                         input_token_len=len(input_ids),
                         generate_token_len=0,
                         finish_reason='length',
                         token_ids=[])


@lvhan028 lvhan028 merged commit f7f7546 into InternLM:main Apr 9, 2026
9 checks passed

4 participants