VAD causes incorrect token timestamps when audio starts with music #3754

@craterone

Description:

When VAD is enabled and the audio begins with music (before any speech starts), the token-level timestamps are incorrect.

It appears that the timestamps are reset or misaligned, especially for the first spoken tokens.


Steps to reproduce:

./build/bin/whisper-cli -m ggml-medium.en.bin -vm ggml-silero-v6.2.0.bin --vad -f ted-test.wav -ml 60 -ojf

(Audio file contains music at the beginning, followed by speech.)


Expected result:

Token timestamps should align correctly with the segment timestamps.

{
  "timestamps": {
    "from": "00:00:06,940",
    "to": "00:00:09,740"
  },
  "offsets": {
    "from": 6940,
    "to": 9740
  },
  "text": " What do these animals have in common?",
  "tokens": [
    {
      "text": " What",
      "timestamps": {
        "from": "00:00:06,980",
        "to": "00:00:07,270"
      },
      "offsets": {
        "from": 6980,
        "to": 7270
      },
      "id": 1867,
      "p": 0.999192,
      "t_dtw": -1
    }
  ]
}

Actual result:

The first tokens have incorrect timestamps (reset to 0), even though the segment timestamps are correct.

{
  "timestamps": {
    "from": "00:00:06,750",
    "to": "00:00:09,760"
  },
  "offsets": {
    "from": 6750,
    "to": 9760
  },
  "text": " What do these animals have in common?",
  "tokens": [
    {
      "text": "[_BEG_]",
      "timestamps": {
        "from": "00:00:00,000",
        "to": "00:00:00,000"
      },
      "offsets": {
        "from": 0,
        "to": 0
      },
      "id": 50363,
      "p": 0.998205,
      "t_dtw": -1
    },
    {
      "text": " What",
      "timestamps": {
        "from": "00:00:00,000",
        "to": "00:00:00,270"
      },
      "offsets": {
        "from": 0,
        "to": 270
      },
      "id": 1867,
      "p": 0.634643,
      "t_dtw": -1
    }
  ]
}

Observation:

The token "What" should start at about 6980 ms, but instead starts at 0 ms.

This suggests that VAD segmentation may be breaking the alignment between segment offsets and token timestamps when non-speech audio (e.g., music) appears before speech.
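
The mismatch described above can be detected mechanically. Below is a minimal Python sketch that flags tokens whose offsets fall outside their segment's offsets. It assumes a segment is a dict shaped like the JSON shown in this report; the helper name `misaligned_tokens`, the 500 ms tolerance, and the rule of skipping special tokens that start with `[_` are all illustrative choices, not part of whisper.cpp.

```python
def misaligned_tokens(segment, tolerance_ms=500):
    """Return (text, from_ms, to_ms) for tokens lying outside the segment's offsets.

    tolerance_ms is an arbitrary slack (assumption), since token and segment
    boundaries differ by a few tens of milliseconds even in the expected output.
    """
    seg_from = segment["offsets"]["from"]
    seg_to = segment["offsets"]["to"]
    bad = []
    for tok in segment.get("tokens", []):
        # Assumption: special tokens such as "[_BEG_]" are not checked.
        if tok["text"].startswith("[_"):
            continue
        t_from = tok["offsets"]["from"]
        t_to = tok["offsets"]["to"]
        if t_from < seg_from - tolerance_ms or t_to > seg_to + tolerance_ms:
            bad.append((tok["text"], t_from, t_to))
    return bad


# Segments trimmed from the "Actual result" and "Expected result" JSON above.
actual = {
    "offsets": {"from": 6750, "to": 9760},
    "tokens": [
        {"text": "[_BEG_]", "offsets": {"from": 0, "to": 0}},
        {"text": " What", "offsets": {"from": 0, "to": 270}},
    ],
}
expected = {
    "offsets": {"from": 6940, "to": 9740},
    "tokens": [
        {"text": " What", "offsets": {"from": 6980, "to": 7270}},
    ],
}

print(misaligned_tokens(actual))    # [(' What', 0, 270)]
print(misaligned_tokens(expected))  # []
```

To run this against a full `-ojf` output file, load the JSON and apply the helper to each segment; the key that holds the list of segments is not shown in this report, so check it against the actual file.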

ted-test.wav
