dflash_generate: draft sampler ignores temperature; speculative decoding distribution diverges from target for temperature > 0 #74

### Summary

In `dflash_generate`, the draft sampler is invoked without the user's `temperature`, while the target sampler does receive it. For `temperature > 0` this means the draft is deterministic (greedy argmax) while the target samples stochastically, so the two paths sample from different distributions and the speculative-decoding distribution guarantees do not hold.

`dflash/model.py:121` (draft):
```python
block_output_ids[:, 1:] = sample(draft_logits)            # uses default temperature=0.0
```

`dflash/model.py:134` (target):
```python
posterior = sample(output.logits, temperature)
```

### Reproduction (no model required)

Verbatim copy of `sample()` from `model.py:48-54`:

```python
import torch


def sample(logits, temperature=0.0):
    if temperature < 1e-5:
        return torch.argmax(logits, dim=-1)
    bsz, seq_len, vocab_size = logits.shape
    logits = logits.view(-1, vocab_size) / temperature
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).view(bsz, seq_len)

torch.manual_seed(0)
logits = torch.tensor([[[2.0, 1.5, 1.0, 0.5]]])
draft  = sum(int(sample(logits).item() == 0) for _ in range(4000))
target = sum(int(sample(logits, temperature=1.0).item() == 0) for _ in range(4000))
print(draft, target)   # 4000 1894
```

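The draft picks the mode 100% of the time; the target picks it about 47% of the time in this run (the exact softmax probability of token 0 is about 0.455). Acceptance is decided by token equality (`block_output_ids[:, 1:] == posterior[:, :-1]`), so for any `temperature > 0` a deterministic draft token is compared against a stochastic target token: here a draft token is accepted only when the target happens to draw the mode, and the accepted-token distribution does not match `p_target`.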
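As a cross-check on the empirical counts, the per-position match probability under token equality can be computed in closed form. The sketch below is pure Python (no torch) and assumes, for illustration only, that draft and target see the same logits and draw independently. Real draft logits differ from the target's, so these numbers are indicative and not a measurement of `dflash_generate`. The helper `softmax` is a hypothetical stand-in, not a function from the repo.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a flat list of logits (illustrative helper)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 1.0, 0.5]  # same toy logits as the reproduction
p = softmax(logits, temperature=1.0)

# Greedy draft: the proposal is always the argmax, so a match requires
# the target to draw that same token.
argmax_index = max(range(len(logits)), key=logits.__getitem__)
greedy_match = p[argmax_index]

# Sampled draft (assumption: same logits, same temperature, independent draw):
# both sides pick token i with probability p[i], so P(match) = sum_i p[i]^2.
sampled_match = sum(pi * pi for pi in p)

print(f"target probabilities:            {[round(x, 4) for x in p]}")
print(f"greedy draft match probability:  {greedy_match:.4f}")
print(f"sampled draft match probability: {sampled_match:.4f}")
```

Under these assumptions the token-equality match rate depends on the sampling scheme on both sides, so the effect of any change to the draft sampler is best confirmed by measuring acceptance length on real prompts.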

### Suggested fix

The minimal correctness improvement is to pass `temperature` to the draft `sample()`:

```python
block_output_ids[:, 1:] = sample(draft_logits, temperature)
```

This makes draft and target sample under the same scheme. Acceptance is still token-equality (not Leviathan-style rejection), so it would also be helpful to document that `dflash_generate` provides exact-distribution semantics only for `temperature == 0` and approximate semantics otherwise.
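For contrast with token equality, here is a minimal sketch of the rejection-sampling acceptance rule that the speculative decoding literature uses to preserve the target distribution: accept the draft token with probability `min(1, p/q)`, otherwise resample from the normalized residual `max(0, p - q)`. The function `speculative_accept` and the toy distributions are hypothetical and not part of dflash. The sketch handles a single position in pure Python, not a batched block.

```python
import random


def speculative_accept(p_target, q_draft, draft_token, rng):
    """One verification step of rejection-sampling speculative decoding.

    Returns (token, accepted). Hypothetical helper for illustration only.
    """
    p, q = p_target[draft_token], q_draft[draft_token]
    # Accept the draft token with probability min(1, p / q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the residual max(0, p - q);
    # rng.choices normalizes the weights internally.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    if sum(residual) == 0.0:
        # Defensive fallback for degenerate inputs: sample from the target.
        return rng.choices(range(len(p_target)), weights=p_target)[0], False
    return rng.choices(range(len(residual)), weights=residual)[0], False


rng = random.Random(0)
p_target = [0.5, 0.3, 0.2]  # toy target distribution (assumed)
q_draft = [0.2, 0.2, 0.6]   # toy draft distribution (assumed)

n = 20000
counts = [0] * len(p_target)
for _ in range(n):
    draft_token = rng.choices(range(len(q_draft)), weights=q_draft)[0]
    token, _ = speculative_accept(p_target, q_draft, draft_token, rng)
    counts[token] += 1

freqs = [c / n for c in counts]
print(freqs)  # empirical output frequencies, expected to be close to p_target
```

With this rule the draft must be sampled from `q_draft` itself, which is why it needs the draft probabilities and not only the draft token ids.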

Happy to send a PR with the one-line change plus a docstring update if the team agrees.

### Environment

- repo at `HEAD` `6e0c951`
- python 3.12, torch 2.11
- related to, but distinct from, PR #67 (which targeted `sample()`'s output shape)

