Reject requests on stale session or sleeping engine #4496

lvhan028 merged 8 commits into InternLM:main
Conversation
Pull request overview
This PR binds an AsyncEngine “epoch” to each HTTP-bound Session so that /stop_all_session (and related abort flows) can reliably invalidate in-flight work that was bound before the stop/abort, reducing races between request binding and generation.
Changes:
- Stamp `session.epoch` in the OpenAI API server when resolving/binding a session for a request.
- Add `epoch` to `Session` objects (with reset behavior) and improve abort logging to include the epoch.
- In `AsyncEngine.generate()`, drop "stale" sessions when `stop_all_session()` has bumped the engine epoch since the request bound the session; also adjust some metrics accounting and `/sleep` behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| lmdeploy/serve/openai/api_server.py | Binds AsyncEngine.epoch to sessions on request bind; /sleep now stops all sessions before sleeping. |
| lmdeploy/serve/managers/session_manager.py | Adds Session.epoch state, resets it, and logs epoch during abort. |
| lmdeploy/serve/core/async_engine.py | Adds stale-session detection based on epoch, bumps epoch on stop-all, and updates metrics/abort handling in generate(). |
Comments suppressed due to low confidence (2)
lmdeploy/serve/core/async_engine.py:413
- This branch yields `finish_reason='length'` (normal completion due to the token limit) but increments `increase_failed_requests('error')`. That will skew scheduler metrics by counting expected length-limited completions as errors; consider incrementing succeeded requests (or not marking them as failed) in this case.
```python
if gen_config.max_new_tokens == 0:
    logger.info(f'run out of tokens. session={session_id}.')
    metrics_processor.increase_failed_requests('error')
    yield GenOut(response='',
                 history_token_len=session.step,
                 input_token_len=len(input_ids),
                 generate_token_len=0,
                 finish_reason='length',
                 token_ids=[])
```
lmdeploy/serve/core/async_engine.py:462
- This pre-inference abort path yields `finish_reason='abort'` even though `GenOut.finish_reason` is not typed to allow `'abort'`. Align the finish_reason enum/typing across `GenOut` and the response models so metrics/logging and downstream code don't see unexpected values.
```python
if session.epoch is not None and session.epoch != self.epoch:
    logger.warning(f'[generate] session {session_id} got aborted before starting inference, '
                   f'session.epoch={session.epoch}, epoch={self.epoch}')
    metrics_processor.increase_failed_requests('abort')
    yield GenOut(response='',
                 history_token_len=0,
                 input_token_len=len(input_ids),
                 generate_token_len=0,
                 finish_reason='abort',
                 token_ids=[])
```
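The stale-session check above can be modeled in isolation. The following is a minimal sketch of the epoch mechanism, assuming illustrative `Engine`/`Session` stand-ins rather than the actual lmdeploy classes:

```python
class Session:
    def __init__(self):
        self.epoch = None  # stamped when the session is bound to a request


class Engine:
    def __init__(self):
        self.epoch = 0

    def bind(self, session):
        # Stamp the current engine epoch at request-bind time.
        session.epoch = self.epoch

    def stop_all_sessions(self):
        # Bumping the epoch invalidates every session bound before the stop.
        self.epoch += 1

    def is_stale(self, session):
        # A session bound under an older epoch must be rejected, not served.
        return session.epoch is not None and session.epoch != self.epoch


engine = Engine()
s = Session()
engine.bind(s)
fresh = engine.is_stale(s)        # False: bound under the current epoch
engine.stop_all_sessions()
stale = engine.is_stale(s)        # True: bound before the stop
```

This avoids a race window: even if `generate()` starts after `stop_all_session()` returns, the request still carries the old epoch and is dropped.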
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
```python
metrics_processor.increase_total_requests()

if (messages is not None) ^ (input_ids is None):
    raise ValueError('You must specify exactly one of messages or input_ids')
if isinstance(session_id, Session):
```
metrics_processor.increase_total_requests() is now called before the input validation that can raise ValueError (e.g. the messages/input_ids XOR check). If a caller triggers these errors, total requests will be incremented without a corresponding failed-request metric, skewing metrics. Consider moving increase_total_requests() after validation (or ensuring validation errors are counted as failures).
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
lmdeploy/serve/core/async_engine.py:376
`metrics_processor.increase_total_requests()` is executed before argument validation that can raise (the `messages` vs `input_ids` XOR check, invalid `session_id` type). If those errors occur, the request is counted but no failed metric is recorded, and the exception may propagate without a proper response. Move the increment after input validation (or wrap validation in try/except to record `increase_failed_requests('error')`).
```python
metrics_processor.increase_total_requests()
if (messages is not None) ^ (input_ids is None):
    raise ValueError('You must specify exactly one of messages or input_ids')
if isinstance(session_id, Session):
    session = session_id
elif isinstance(session_id, int):
    session = self.session_mgr.get(session_id, step=step)
else:
    raise ValueError(f'Invalid session_id: {session_id}. It should be an instance of Session or an integer.')
```
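One way to keep the counters consistent is to validate first and record a failure when validation raises. A sketch, using a stub in place of the real `metrics_processor` (the stub's attributes are illustrative, not lmdeploy's API):

```python
class _Metrics:
    # Stand-in for metrics_processor, so the ordering can be demonstrated.
    def __init__(self):
        self.total = 0
        self.failed = 0

    def increase_total_requests(self):
        self.total += 1

    def increase_failed_requests(self, reason):
        self.failed += 1


metrics_processor = _Metrics()


def validate_and_count(messages, input_ids):
    try:
        if (messages is not None) ^ (input_ids is None):
            raise ValueError('You must specify exactly one of messages or input_ids')
    except ValueError:
        # Count the rejected request so totals and failures stay in sync.
        metrics_processor.increase_total_requests()
        metrics_processor.increase_failed_requests('error')
        raise
    # Only well-formed requests reach this point.
    metrics_processor.increase_total_requests()
```

With this ordering, every increment of the total counter is matched either by a failure record or by a request that actually proceeds.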
```python
def is_engine_sleeping() -> bool:
    eng = VariableInterface.async_engine
    return eng is not None and eng.is_sleeping

app.add_middleware(EngineSleepingMiddleware, is_sleeping=is_engine_sleeping)

# set the maximum number of concurrent requests
if max_concurrent_requests is not None:
    app.add_middleware(ConcurrencyLimitMiddleware, max_concurrent_requests=max_concurrent_requests)
```
Middleware ordering: EngineSleepingMiddleware is added before ConcurrencyLimitMiddleware, and Starlette middleware stacking makes the last added middleware outermost. That means sleeping inference requests still acquire a concurrency semaphore slot before being rejected with 503, which can unnecessarily block other endpoints (including /wakeup) under load. Consider adding EngineSleepingMiddleware after the concurrency limiter (or implementing the sleep gate inside the concurrency middleware) so rejections happen before acquiring the semaphore.
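The ordering effect can be reproduced without Starlette: each `add_middleware` call wraps the existing stack, so the middleware added last runs outermost. A minimal sketch with plain function wrappers standing in for the two middlewares (names are illustrative):

```python
calls = []


def app(request):
    calls.append('app')
    return 'ok'


def sleep_gate(inner):
    def handler(request):
        calls.append('sleep_gate')   # would return 503 if the engine sleeps
        return inner(request)
    return handler


def concurrency_limit(inner):
    def handler(request):
        calls.append('concurrency_limit')  # would acquire a semaphore slot
        return inner(request)
    return handler


# Mimic Starlette stacking: each add_middleware wraps what was added before.
stack = app
for middleware in (sleep_gate, concurrency_limit):  # add order from the diff
    stack = middleware(stack)

result = stack('request')
```

Because `concurrency_limit` was added last, it executes first, so a request that the sleep gate would reject still passes through (and, in the real server, would consume) the concurrency limiter.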
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (2)
lmdeploy/turbomind/turbomind.py:292
`TurboMind.sleep` was changed to `async def`, but the implementation still performs a blocking wait over a `ThreadPoolExecutor` (and contains no `await`). When awaited from FastAPI (e.g., `POST /sleep`), this will block the event loop until all GPU workers finish sleeping, preventing the server from responding to other endpoints during that time. Consider running the whole blocking section via `asyncio.to_thread(...)` / `loop.run_in_executor(...)`, or keeping `sleep()` synchronous and calling it via `run_in_executor` from higher-level async code.
```python
async def sleep(self, level: int = 1):
    """Sleep the model."""
    with ThreadPoolExecutor(max_workers=self.gpu_count) as e:
        for _ in e.map(self.model_comm.sleep, range(self.gpu_count), [level] * self.gpu_count):
            pass
```
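A sketch of the suggested fix, with a stand-in blocking function in place of `model_comm.sleep` (the real per-rank call and its signature are not reproduced here):

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def _blocking_sleep_all(gpu_count, level):
    # Stand-in for TurboMind's blocking ThreadPoolExecutor section.
    def worker(rank):
        time.sleep(0.01)  # placeholder for model_comm.sleep(rank, level)
        return rank

    with ThreadPoolExecutor(max_workers=gpu_count) as e:
        return list(e.map(worker, range(gpu_count)))


async def sleep(level=1, gpu_count=2):
    # Off-load the entire blocking section to a worker thread so the
    # event loop keeps serving other endpoints while GPUs go to sleep.
    return await asyncio.to_thread(_blocking_sleep_all, gpu_count, level)


ranks = asyncio.run(sleep())
```

`asyncio.to_thread` runs the callable in the loop's default executor and awaits its result, which is exactly the "wrap the whole blocking section" approach the comment recommends.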
lmdeploy/serve/core/async_engine.py:421
- In `generate()`, the `max_new_tokens == 0` early-exit yields `finish_reason='length'` but increments failed requests via `increase_failed_requests('error')`. This will inflate `num_errored_reqs` for a non-error condition and makes metrics inconsistent with the returned finish_reason. Consider treating this as a succeeded request (or at least not counting it as an error) to keep metrics aligned with behavior.
```python
if gen_config.max_new_tokens == 0:
    logger.info(f'run out of tokens. session={session_id}.')
    metrics_processor.increase_failed_requests('error')
    yield GenOut(response='',
                 history_token_len=session.step,
                 input_token_len=len(input_ids),
                 generate_token_len=0,
                 finish_reason='length',
                 token_ids=[])
```
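A sketch of the suggested adjustment, using a stub metrics object (the real `metrics_processor` interface may differ): the token-limit early exit is recorded as a normal, length-limited completion rather than an error.

```python
class _Metrics:
    def __init__(self):
        self.succeeded = 0
        self.failed = 0

    def increase_succeeded_requests(self):
        self.succeeded += 1

    def increase_failed_requests(self, reason):
        self.failed += 1


metrics_processor = _Metrics()


def on_max_new_tokens_zero():
    # finish_reason='length' is an expected completion, not an error,
    # so count it as a success to keep metrics aligned with behavior.
    metrics_processor.increase_succeeded_requests()
    return {'response': '', 'generate_token_len': 0, 'finish_reason': 'length'}


out = on_max_new_tokens_zero()
```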
No description provided.