Description:
When enabling VAD, if the audio contains music at the beginning (before speech starts), the token-level timestamps become incorrect.
It appears that the timestamps are reset or misaligned, especially for the first spoken tokens.
Steps to reproduce:
./build/bin/whisper-cli -m ggml-medium.en.bin -f ted-test.wav --vad -vm ggml-silero-v6.2.0.bin -ml 60 -ojf
(Audio file contains music at the beginning, followed by speech.)
Expected result:
Token timestamps should align correctly with the segment timestamps.
{
  "timestamps": {
    "from": "00:00:06,940",
    "to": "00:00:09,740"
  },
  "offsets": {
    "from": 6940,
    "to": 9740
  },
  "text": " What do these animals have in common?",
  "tokens": [
    {
      "text": " What",
      "timestamps": {
        "from": "00:00:06,980",
        "to": "00:00:07,270"
      },
      "offsets": {
        "from": 6980,
        "to": 7270
      },
      "id": 1867,
      "p": 0.999192,
      "t_dtw": -1
    }
  ]
}
Actual result:
The first tokens have incorrect timestamps (reset to 0), even though the segment timestamps are correct.
{
  "timestamps": {
    "from": "00:00:06,750",
    "to": "00:00:09,760"
  },
  "offsets": {
    "from": 6750,
    "to": 9760
  },
  "text": " What do these animals have in common?",
  "tokens": [
    {
      "text": "[_BEG_]",
      "timestamps": {
        "from": "00:00:00,000",
        "to": "00:00:00,000"
      },
      "offsets": {
        "from": 0,
        "to": 0
      },
      "id": 50363,
      "p": 0.998205,
      "t_dtw": -1
    },
    {
      "text": " What",
      "timestamps": {
        "from": "00:00:00,000",
        "to": "00:00:00,270"
      },
      "offsets": {
        "from": 0,
        "to": 270
      },
      "id": 1867,
      "p": 0.634643,
      "t_dtw": -1
    }
  ]
}
Observation:
The token " What" should start at about 6980 ms, inside the segment's 6750 to 9760 ms range, but instead starts at 0 ms.
This suggests that VAD segmentation may be breaking the alignment between segment offsets and token timestamps when non-speech audio (e.g., music) appears before speech.
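A minimal sketch for checking this mismatch programmatically, assuming each segment object in the `-ojf` output has the shape shown above. The function name `misaligned_tokens` and the `tolerance_ms` parameter are hypothetical and only for illustration; they are not part of whisper.cpp. The two sample segments are trimmed copies of the expected and actual results from this report.

```python
def misaligned_tokens(segment, tolerance_ms=500):
    """Return the text of tokens whose offsets fall outside the segment's offsets.

    tolerance_ms is an arbitrary slack (an assumption) so that small
    differences between token and segment boundaries are not flagged.
    """
    seg_from = segment["offsets"]["from"]
    seg_to = segment["offsets"]["to"]
    flagged = []
    for token in segment.get("tokens", []):
        tok_from = token["offsets"]["from"]
        tok_to = token["offsets"]["to"]
        # A token is suspicious if it starts well before or ends well after its segment.
        if tok_from < seg_from - tolerance_ms or tok_to > seg_to + tolerance_ms:
            flagged.append(token["text"])
    return flagged


# Trimmed from "Expected result" in this report.
expected_segment = {
    "offsets": {"from": 6940, "to": 9740},
    "tokens": [
        {"text": " What", "offsets": {"from": 6980, "to": 7270}},
    ],
}

# Trimmed from "Actual result" in this report.
actual_segment = {
    "offsets": {"from": 6750, "to": 9760},
    "tokens": [
        {"text": "[_BEG_]", "offsets": {"from": 0, "to": 0}},
        {"text": " What", "offsets": {"from": 0, "to": 270}},
    ],
}

print(misaligned_tokens(expected_segment))  # no tokens flagged
print(misaligned_tokens(actual_segment))    # both tokens flagged
```

Running the same check over every segment of a full JSON output file would show whether only the first tokens after the music are affected or whether the offset problem persists further into the transcript.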
ted-test.wav