VAD causes incorrect token timestamps when audio starts with music


**Description:**

With VAD enabled (`--vad`), token-level timestamps are incorrect when the audio begins with music before any speech.

The token timestamps appear to be reset to zero or misaligned with their segment, most visibly for the first spoken tokens.

---

**Steps to reproduce:**

```bash
./build/bin/whisper-cli -vm ggml-silero-v6.2.0.bin --vad -m ggml-medium.en.bin -f ted-test.wav -ml 60 -ojf
```

(Audio file contains music at the beginning, followed by speech.)

---

**Expected result:**

Token timestamps should align correctly with the segment timestamps.

```json
{
  "timestamps": {
    "from": "00:00:06,940",
    "to": "00:00:09,740"
  },
  "offsets": {
    "from": 6940,
    "to": 9740
  },
  "text": " What do these animals have in common?",
  "tokens": [
    {
      "text": " What",
      "timestamps": {
        "from": "00:00:06,980",
        "to": "00:00:07,270"
      },
      "offsets": {
        "from": 6980,
        "to": 7270
      },
      "id": 1867,
      "p": 0.999192,
      "t_dtw": -1
    }
  ]
}
```

---

**Actual result:**

The first tokens have incorrect timestamps (reset to 0), even though the segment timestamps are correct.

```json
{
  "timestamps": {
    "from": "00:00:06,750",
    "to": "00:00:09,760"
  },
  "offsets": {
    "from": 6750,
    "to": 9760
  },
  "text": " What do these animals have in common?",
  "tokens": [
    {
      "text": "[_BEG_]",
      "timestamps": {
        "from": "00:00:00,000",
        "to": "00:00:00,000"
      },
      "offsets": {
        "from": 0,
        "to": 0
      },
      "id": 50363,
      "p": 0.998205,
      "t_dtw": -1
    },
    {
      "text": " What",
      "timestamps": {
        "from": "00:00:00,000",
        "to": "00:00:00,270"
      },
      "offsets": {
        "from": 0,
        "to": 270
      },
      "id": 1867,
      "p": 0.634643,
      "t_dtw": -1
    }
  ]
}
```

---

**Observation:**

The token `" What"` should start at about `6980 ms`, but it starts at `0 ms`, even though its segment starts at `6750 ms`.

This suggests that VAD segmentation may be breaking the alignment between segment offsets and token timestamps when non-speech audio (e.g., music) appears before speech.
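
To make the mismatch easy to spot across a whole output file, here is a minimal sketch of a check that flags tokens whose offsets fall outside their segment's offsets. It assumes only the segment shape shown in the JSON excerpts above (`offsets.from`, `offsets.to`, `tokens`). The function name `misaligned_tokens` and the `tolerance_ms` parameter are hypothetical helpers, not part of whisper.cpp, and the sample data is abridged from the two excerpts in this report.

```python
def misaligned_tokens(segment, tolerance_ms=0):
    """Return the tokens whose offsets lie outside the segment's offsets.

    `segment` is a dict shaped like the JSON excerpts in this issue.
    `tolerance_ms` (hypothetical knob) allows a small slack at both ends.
    """
    seg_from = segment["offsets"]["from"]
    seg_to = segment["offsets"]["to"]
    bad = []
    for token in segment.get("tokens", []):
        tok_from = token["offsets"]["from"]
        tok_to = token["offsets"]["to"]
        # A token is suspicious if it starts before or ends after its segment.
        if tok_from < seg_from - tolerance_ms or tok_to > seg_to + tolerance_ms:
            bad.append(token)
    return bad


# Abridged from the "Expected result" excerpt.
expected = {
    "offsets": {"from": 6940, "to": 9740},
    "tokens": [
        {"text": " What", "offsets": {"from": 6980, "to": 7270}},
    ],
}

# Abridged from the "Actual result" excerpt.
actual = {
    "offsets": {"from": 6750, "to": 9760},
    "tokens": [
        {"text": "[_BEG_]", "offsets": {"from": 0, "to": 0}},
        {"text": " What", "offsets": {"from": 0, "to": 270}},
    ],
}

print([t["text"] for t in misaligned_tokens(expected)])  # prints []
print([t["text"] for t in misaligned_tokens(actual)])    # prints ['[_BEG_]', ' What']
```

Running the same check over every segment in the `-ojf` output would show whether only the first segment after the music is affected or whether later segments drift as well. How the segments are loaded from the file is left out here, since the top-level layout of the JSON is not shown in this report.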

[ted-test.wav](https://github.com/user-attachments/files/26753484/ted-test.wav)



VAD causes incorrect token timestamps when audio starts with music #3754
