Commit 8e3c93a

linxiaodong committed
whisper : add VAD-mapped token timestamp getters

whisper_full_get_token_data().t0/t1 are reported in VAD "processed" time (silence removed) when VAD is enabled. Only the segment timestamps were mapped back to the original timeline, so callers had no way to get token/word-level times on the original timeline.

Add whisper_full_get_token_t0/t1 (+ _from_state), which apply the same vad_mapping_table that the segment getters use. With VAD off, or when no mapping table exists, they return the raw token times, so existing behavior is unchanged.

1 parent c8ae48a commit 8e3c93a

2 files changed

Lines changed: 38 additions & 0 deletions

include/whisper.h

Lines changed: 9 additions & 0 deletions
@@ -667,6 +667,15 @@ extern "C" {
     WHISPER_API whisper_token_data whisper_full_get_token_data (struct whisper_context * ctx, int i_segment, int i_token);
     WHISPER_API whisper_token_data whisper_full_get_token_data_from_state(struct whisper_state * state, int i_segment, int i_token);
 
+    // Get token-level start/end timestamps mapped back to the original timeline.
+    // Unlike whisper_full_get_token_data().t0/t1 (which are in VAD "processed" time
+    // when VAD is enabled), these apply the same VAD mapping as the segment getters.
+    // Requires token-level timestamps (params.token_timestamps = true).
+    WHISPER_API int64_t whisper_full_get_token_t0 (struct whisper_context * ctx, int i_segment, int i_token);
+    WHISPER_API int64_t whisper_full_get_token_t0_from_state(struct whisper_state * state, int i_segment, int i_token);
+    WHISPER_API int64_t whisper_full_get_token_t1 (struct whisper_context * ctx, int i_segment, int i_token);
+    WHISPER_API int64_t whisper_full_get_token_t1_from_state(struct whisper_state * state, int i_segment, int i_token);
+
     // Get the probability of the specified token in the specified segment
     WHISPER_API float whisper_full_get_token_p (struct whisper_context * ctx, int i_segment, int i_token);
     WHISPER_API float whisper_full_get_token_p_from_state(struct whisper_state * state, int i_segment, int i_token);
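
A caller-side sketch of how the new getters might be consumed. To stay self-contained it does not include whisper.h: `collect_token_spans` takes two callables that, in real use, would wrap `whisper_full_get_token_t0(ctx, i_segment, i)` and `whisper_full_get_token_t1(ctx, i_segment, i)`. The names `token_span_sketch`, `collect_token_spans`, and `format_seconds_sketch` are hypothetical, and the formatter assumes timestamps in units of 10 ms, which this commit does not state.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical container for one token's start/end time.
struct token_span_sketch {
    int64_t t0;
    int64_t t1;
};

// Collect [t0, t1] for every token of one segment.
// get_t0/get_t1 are callables taking a token index and returning int64_t;
// with whisper.cpp they would forward to the getters added in this commit.
template <typename GetT0, typename GetT1>
std::vector<token_span_sketch> collect_token_spans(int n_tokens, GetT0 get_t0, GetT1 get_t1) {
    std::vector<token_span_sketch> spans;
    for (int i = 0; i < n_tokens; ++i) {
        spans.push_back({get_t0(i), get_t1(i)});
    }
    return spans;
}

// Format a non-negative timestamp as seconds with two decimals.
// Assumption: t is expressed in units of 10 ms.
std::string format_seconds_sketch(int64_t t) {
    assert(t >= 0);
    char buf[32];
    std::snprintf(buf, sizeof(buf), "%lld.%02lld",
                  (long long) (t / 100), (long long) (t % 100));
    return buf;
}
```

With a real context, the token count for a segment would come from `whisper_full_n_tokens(ctx, i_segment)`, and `params.token_timestamps` must have been set before running the transcription, as the header comment requires.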

src/whisper.cpp

Lines changed: 29 additions & 0 deletions
@@ -8075,6 +8075,35 @@ struct whisper_token_data whisper_full_get_token_data(struct whisper_context * c
     return ctx->state->result_all[i_segment].tokens[i_token];
 }
 
+// Token-level timestamps mapped back to the original timeline.
+// whisper_full_get_token_data().t0/t1 are in "processed" time when VAD is enabled
+// (silence removed); these helpers apply the same VAD mapping table used by the
+// segment getters so token times line up with the original audio. Requires
+// token-level timestamps to have been computed (params.token_timestamps = true).
+int64_t whisper_full_get_token_t0_from_state(struct whisper_state * state, int i_segment, int i_token) {
+    const int64_t t0 = state->result_all[i_segment].tokens[i_token].t0;
+    if (!state->has_vad_segments || state->vad_mapping_table.empty()) {
+        return t0;
+    }
+    return map_processed_to_original_time(t0, state->vad_mapping_table);
+}
+
+int64_t whisper_full_get_token_t0(struct whisper_context * ctx, int i_segment, int i_token) {
+    return whisper_full_get_token_t0_from_state(ctx->state, i_segment, i_token);
+}
+
+int64_t whisper_full_get_token_t1_from_state(struct whisper_state * state, int i_segment, int i_token) {
+    const int64_t t1 = state->result_all[i_segment].tokens[i_token].t1;
+    if (!state->has_vad_segments || state->vad_mapping_table.empty()) {
+        return t1;
+    }
+    return map_processed_to_original_time(t1, state->vad_mapping_table);
+}
+
+int64_t whisper_full_get_token_t1(struct whisper_context * ctx, int i_segment, int i_token) {
+    return whisper_full_get_token_t1_from_state(ctx->state, i_segment, i_token);
+}
+
 float whisper_full_get_token_p_from_state(struct whisper_state * state, int i_segment, int i_token) {
     return state->result_all[i_segment].tokens[i_token].p;
 }
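
The getters above follow one pattern: read the raw token time, return it unchanged when VAD was not used or the table is empty, otherwise map it. The sketch below illustrates that pattern on a toy table. Everything in it is an assumption for illustration: `time_mapping_sketch`, `map_to_original_sketch`, and `token_time_sketch` are hypothetical names, and the linear interpolation with clamping shown here is not claimed to match what `map_processed_to_original_time` actually does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical table entry: a point on the processed (silence-removed)
// timeline and the matching point on the original timeline.
struct time_mapping_sketch {
    int64_t processed;
    int64_t original;
};

// Hypothetical mapping: interpolate linearly between the two entries that
// bracket t, clamping to the first/last entry outside the table's range.
// Assumes the table is non-empty and sorted by `processed`.
int64_t map_to_original_sketch(int64_t t, const std::vector<time_mapping_sketch> & table) {
    assert(!table.empty());
    if (t <= table.front().processed) return table.front().original;
    if (t >= table.back().processed)  return table.back().original;

    // First entry whose processed time is strictly greater than t.
    // The clamps above guarantee it is neither begin() nor end().
    auto hi = std::upper_bound(table.begin(), table.end(), t,
        [](int64_t v, const time_mapping_sketch & m) { return v < m.processed; });
    auto lo = hi - 1;

    const int64_t d_processed = hi->processed - lo->processed; // > 0 here
    const int64_t d_original  = hi->original  - lo->original;
    return lo->original + (t - lo->processed) * d_original / d_processed;
}

// Same guard as in the diff: without VAD segments or a mapping table,
// the raw token time is returned unchanged.
int64_t token_time_sketch(int64_t raw, bool has_vad_segments,
                          const std::vector<time_mapping_sketch> & table) {
    if (!has_vad_segments || table.empty()) {
        return raw;
    }
    return map_to_original_sketch(raw, table);
}
```

In the toy table used to exercise this sketch, speech at 0..100 is followed by removed silence covering 100..300 on the original timeline, so a processed time of 150 lands at 350 on the original timeline, while the same call with VAD disabled returns 150.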
