Conversation
Code Review
This pull request implements a centralized metrics system using Ray's utility metrics to monitor HTTP requests, task queues, and server resources across the gateway, model, processor, and sampler components. Feedback focuses on preventing metric cardinality explosion in the middleware by using route templates, adhering to PEP 8 by moving inline imports to the top level, and refining type hints and function signatures for better maintainability and type safety.
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
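The cardinality concern above can be illustrated without Ray: if the middleware labels request counters with the raw URL path, every distinct ID produces a new label combination, while labeling with the matched route template keeps the label set bounded. A minimal sketch, using a plain `collections.Counter` as a stand-in for `ray.util.metrics.Counter` (the helper name and paths are illustrative, not from this PR):

```python
from collections import Counter
from typing import Optional

# Stand-in for a metrics counter keyed by label value.
requests_total = Counter()

def record_request(raw_path: str, route_template: Optional[str]) -> None:
    """Label by the matched route template when available, falling back
    to a fixed bucket so unmatched paths cannot explode cardinality."""
    label = route_template if route_template is not None else "unmatched"
    requests_total[label] += 1

# Raw paths differ per request, but both match the same template.
record_request("/v1/models/123", "/v1/models/{model_id}")
record_request("/v1/models/456", "/v1/models/{model_id}")
record_request("/not/a/route", None)

# One label per template, not per concrete path.
assert requests_total["/v1/models/{model_id}"] == 2
assert requests_total["unmatched"] == 1
```

In a Starlette/FastAPI stack, one common way to obtain the template is `request.scope.get("route")` (set by the router once a route matches), though middleware that runs before routing may need to resolve the template differently.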
Pull request overview
This PR adds Ray Serve–compatible observability (HTTP/task-queue/resource metrics) and extends Tinker/Twinkle training flows to support DPO-style reference logprobs, alongside several operational/cookbook updates.
Changes:
- Introduce a centralized `ray.util.metrics` module and wire it into Gateway/Model/Sampler/Processor apps plus task queue + rate limiter instrumentation.
- Extend Tinker forward/forward_backward plumbing to support DPO (`ref_logps` extraction + `ref_outputs` propagation) and adjust backend collect behavior.
- Update Ray launcher initialization behavior and add/refresh cookbook configs & scripts (including DPO client examples).
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/twinkle/server/utils/task_queue.py | Adds task-queue metrics (queue depth/wait, execution time, task status counts) and passes gauges into the rate limiter. |
| src/twinkle/server/utils/state/server_state.py | Adds a background loop to periodically publish resource gauges (sessions/models/futures). |
| src/twinkle/server/utils/rate_limiter.py | Adds optional metrics gauge updates for active tokens tracked by the limiter. |
| src/twinkle/server/utils/metrics.py | New central metrics definitions + FastAPI middleware for request counters/latency histograms. |
| src/twinkle/server/sampler/app.py | Registers HTTP metrics middleware; sets task queue deployment label to Sampler. |
| src/twinkle/server/processor/app.py | Registers HTTP metrics middleware for Processor. |
| src/twinkle/server/model/app.py | Registers HTTP metrics middleware; sets task queue deployment label to Model. |
| src/twinkle/server/model/tinker_handlers.py | Adjusts template selection for Qwen3.5 and changes forward path to use updated backend return shape. |
| src/twinkle/server/model/backends/transformers_model.py | Refactors Tinker forward paths and updates Twinkle-native collect behavior for forward outputs. |
| src/twinkle/server/model/backends/megatron_model.py | Refactors Tinker forward paths and updates Twinkle-native collect behavior for forward outputs. |
| src/twinkle/server/model/backends/common.py | Adds shared helpers for Tinker loss setup/output building and ref_logps → ref_outputs conversion. |
| src/twinkle/server/common/datum.py | Extracts ref_logps from Datum loss inputs for DPO. |
| src/twinkle/model/megatron/multi_lora_megatron.py | Binds adapter_name into LoRA save converter via functools.partial. |
| src/twinkle/metric/dpo.py | Accepts non-tensor logps by converting to a tensor before alignment. |
| src/twinkle/loss/dpo.py | Accepts non-tensor ref_logps by converting to a tensor before alignment. |
| src/twinkle_client/utils/patch_tinker.py | Extends typing imports and introduces a new patch-state flag variable. |
| src/twinkle_client/common/serialize.py | Adds BaseModel serialization handling for client HTTP parameter serialization. |
| src/twinkle/server/launcher.py | Changes Ray initialization to attempt connecting to an existing cluster via address='auto'. |
| src/twinkle/server/gateway/server.py | Registers HTTP metrics middleware for Gateway. |
| src/twinkle/server/gateway/twinkle_gateway_handlers.py | Adds a /twinkle/status endpoint returning cleanup/resource counts. |
| pyproject.toml | Removes the upper bound on the datasets dependency. |
| cookbook/client/twinkle/self_host/dpo.py | Adds a Twinkle-native self-host DPO training example script. |
| cookbook/client/tinker/self_host/dpo.py | Adds a Tinker-compatible self-host DPO training example script. |
| cookbook/client/server/megatron/server_config.yaml | Minor YAML formatting tweak. |
| cookbook/client/server/megatron/server_config_4b.yaml | Updates sample deployment ports and various sizing/limit parameters. |
| cookbook/client/server/megatron/run.sh | Replaces the minimal launcher with a parameterized Ray+Prometheus+server startup script. |
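The new run.sh itself is not shown in this excerpt; as a rough sketch of the parameterized pattern such a launcher typically follows (all variable names, ports, and defaults below are illustrative assumptions, not taken from the PR):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: startup values overridable from the environment
# instead of being hard-coded in the script.
set -eu

RAY_PORT="${RAY_PORT:-6379}"
METRICS_PORT="${METRICS_PORT:-8080}"
SERVER_CONFIG="${SERVER_CONFIG:-server_config_4b.yaml}"

# Build the launch command from the parameters rather than inlining it.
RAY_CMD="ray start --head --port=${RAY_PORT} --metrics-export-port=${METRICS_PORT}"
echo "$RAY_CMD"
echo "server config: ${SERVER_CONFIG}"
```

A Prometheus process scraping the Ray metrics export port would be started alongside, but its configuration is omitted here since the actual script contents are not part of this excerpt.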
```python
elif loss_fn == 'importance_sampling':
    has_ref_logps = any('ref_logps' in d.loss_fn_inputs for d in inputs)
    if has_ref_logps:
        beta = kwargs.pop('dpo_beta', 0.1)
        loss_type = kwargs.pop('dpo_loss_type', 'sigmoid')
        sft_weight = kwargs.pop('dpo_sft_weight', 0.0)
        self.set_loss(
            'DPOLoss', adapter_name=adapter_name, beta=beta, loss_type=loss_type, sft_weight=sft_weight)
        self.add_metric('DPOMetric', adapter_name=adapter_name, beta=beta)
    else:
```
`_tinker_setup_loss()` calls `add_metric('DPOMetric', ...)` every time `loss_fn == 'importance_sampling'` with `ref_logps`. `add_metric()` appends to `train_status.metrics`, so this will accumulate duplicate `DPOMetric` instances over steps and can skew metric reporting / leak memory. Consider adding the metric only once per adapter (e.g., check existing metric types before appending) or making `add_metric` idempotent for this case.
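One way to make the registration idempotent, sketched against hypothetical stand-in classes (the real `train_status` and metric implementations are not shown in this diff):

```python
class _Metric:
    """Hypothetical stand-in for an entry in train_status.metrics."""
    def __init__(self, name, adapter_name, **kwargs):
        self.name = name
        self.adapter_name = adapter_name
        self.kwargs = kwargs

class TrainStatus:
    def __init__(self):
        self.metrics = []

    def add_metric(self, name, adapter_name, **kwargs):
        # Idempotent: skip if this metric type is already registered for
        # the adapter, so repeated steps cannot accumulate duplicates.
        for m in self.metrics:
            if m.name == name and m.adapter_name == adapter_name:
                return m
        metric = _Metric(name, adapter_name, **kwargs)
        self.metrics.append(metric)
        return metric

status = TrainStatus()
status.add_metric('DPOMetric', adapter_name='lora-a', beta=0.1)
status.add_metric('DPOMetric', adapter_name='lora-a', beta=0.1)  # no-op
assert len(status.metrics) == 1
```

The same guard could live at the call site instead (checking `train_status.metrics` before appending), but making `add_metric` itself idempotent protects every caller.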
PR type
PR information
Write the detailed information that belongs to this PR.
Experiment results
Paste your experiment results here (if needed).