feat(server): add frequency/presence penalty sampling parameters#250
feat(server): add frequency/presence penalty sampling parameters#250howard0su wants to merge 1 commit into
Conversation
Add OpenAI-compatible frequency_penalty and presence_penalty to the C++ server's sampling pipeline. Implementation is model-independent and reusable across all backends (qwen35, qwen3, gemma4, laguna). Changes: - Add freq_pen, pres_pen fields to SamplerCfg with needs_logit_processing() helper - Implement penalty logic in sample_logits() (applied before top-k/softmax) - Parse frequency_penalty, presence_penalty, repetition_penalty, rep_window from HTTP JSON - Fix temp=0 bypass bug: penalties now apply even with greedy decoding - Update all backends to use needs_logit_processing() instead of raw temp check - Spec decode correctly disabled when penalties are active - Add 19 unit tests (1364 assertions total) - Add 14 HTTP integration tests for sampling parameters - Add API.md documenting supported/TODO parameters with priority levels Sampling chain: rep_penalty → freq/pres_penalty → top_k → softmax(temp) → top_p → draw Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
3 issues found across 10 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="dflash/scripts/test_server_integration.py">
<violation number="1" location="dflash/scripts/test_server_integration.py:232">
P2: Flaky probabilistic integration test: `test_temperature_high_produces_variance` asserts `len(results) >= 2` across only 5 generations with different seeds, but legitimate sampling behavior may still produce identical outputs for all seeds — especially for a prompt like "Name a color." where one token may dominate the distribution. This creates a flaky test that can fail on correct behavior.</violation>
<violation number="2" location="dflash/scripts/test_server_integration.py:279">
P2: `test_top_p_restrictive` asserts deterministic output without a fixed seed, which can cause flaky failures</violation>
<violation number="3" location="dflash/scripts/test_server_integration.py:343">
P2: Test assertion for frequency penalty is too weak to detect regressions</violation>
</file>
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
| def test_temperature_high_produces_variance(self): | ||
| """High temperature with different seeds should produce different outputs.""" | ||
| base = { | ||
| "model": MODEL_NAME, | ||
| "messages": [{"role": "user", "content": "Name a color."}], | ||
| "stream": False, | ||
| "max_tokens": 8, | ||
| "temperature": 1.5, | ||
| } | ||
| results = set() | ||
| for seed in [1, 2, 3, 4, 5]: | ||
| r = post_json("/v1/chat/completions", {**base, "seed": seed}) | ||
| assert r.status_code == 200 | ||
| results.add(r.json()["choices"][0]["message"]["content"].strip()) | ||
| # With high temp and different seeds, we expect at least some variance | ||
| assert len(results) >= 2, f"Expected variance, got: {results}" |
There was a problem hiding this comment.
P2: Flaky probabilistic integration test: test_temperature_high_produces_variance asserts len(results) >= 2 across only 5 generations with different seeds, but legitimate sampling behavior may still produce identical outputs for all seeds — especially for a prompt like "Name a color." where one token may dominate the distribution. This creates a flaky test that can fail on correct behavior.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/test_server_integration.py, line 232:
<comment>Flaky probabilistic integration test: `test_temperature_high_produces_variance` asserts `len(results) >= 2` across only 5 generations with different seeds, but legitimate sampling behavior may still produce identical outputs for all seeds — especially for a prompt like "Name a color." where one token may dominate the distribution. This creates a flaky test that can fail on correct behavior.</comment>
<file context>
@@ -209,6 +209,241 @@ def test_temperature_zero_is_deterministic(self):
+ content = r.json()["choices"][0]["message"]["content"]
+ assert len(content) > 0
+
+ def test_temperature_high_produces_variance(self):
+ """High temperature with different seeds should produce different outputs."""
+ base = {
</file context>
| def test_temperature_high_produces_variance(self): | |
| """High temperature with different seeds should produce different outputs.""" | |
| base = { | |
| "model": MODEL_NAME, | |
| "messages": [{"role": "user", "content": "Name a color."}], | |
| "stream": False, | |
| "max_tokens": 8, | |
| "temperature": 1.5, | |
| } | |
| results = set() | |
| for seed in [1, 2, 3, 4, 5]: | |
| r = post_json("/v1/chat/completions", {**base, "seed": seed}) | |
| assert r.status_code == 200 | |
| results.add(r.json()["choices"][0]["message"]["content"].strip()) | |
| # With high temp and different seeds, we expect at least some variance | |
| assert len(results) >= 2, f"Expected variance, got: {results}" | |
| # Removed: testing statistical variance with small sample size is inherently flaky | |
| # Temperature correctness is covered by test_temperature_nonzero_accepted, | |
| # test_temperature_zero_is_deterministic, and test_seed_reproducibility. |
| text_no_pen = r_no_pen.json()["choices"][0]["message"]["content"].lower() | ||
| text_pen = r_pen.json()["choices"][0]["message"]["content"].lower() | ||
| # The penalized version should have fewer repetitions of "hello" | ||
| assert text_no_pen.count("hello") >= text_pen.count("hello") |
There was a problem hiding this comment.
P2: Test assertion for frequency penalty is too weak to detect regressions
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/test_server_integration.py, line 343:
<comment>Test assertion for frequency penalty is too weak to detect regressions</comment>
<file context>
@@ -209,6 +209,241 @@ def test_temperature_zero_is_deterministic(self):
+ text_no_pen = r_no_pen.json()["choices"][0]["message"]["content"].lower()
+ text_pen = r_pen.json()["choices"][0]["message"]["content"].lower()
+ # The penalized version should have fewer repetitions of "hello"
+ assert text_no_pen.count("hello") >= text_pen.count("hello")
+
+ def test_presence_penalty_accepted(self):
</file context>
| assert r.status_code == 200 | ||
| assert len(r.json()["choices"][0]["message"]["content"]) > 0 | ||
|
|
||
| def test_top_p_restrictive(self): |
There was a problem hiding this comment.
P2: test_top_p_restrictive asserts deterministic output without a fixed seed, which can cause flaky failures
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At dflash/scripts/test_server_integration.py, line 279:
<comment>`test_top_p_restrictive` asserts deterministic output without a fixed seed, which can cause flaky failures</comment>
<file context>
@@ -209,6 +209,241 @@ def test_temperature_zero_is_deterministic(self):
+ assert r.status_code == 200
+ assert len(r.json()["choices"][0]["message"]["content"]) > 0
+
+ def test_top_p_restrictive(self):
+ """Very low top_p should make output more deterministic."""
+ body = {
</file context>
Add OpenAI-compatible frequency_penalty and presence_penalty to the C++ server's sampling pipeline. Implementation is model-independent and reusable across all backends (qwen35, qwen3, gemma4, laguna).
Changes:
Sampling chain: rep_penalty → freq/pres_penalty → top_k → softmax(temp) → top_p → draw