Commit c008fa5

linxiaodong committed
whisper : map token timestamps to original time when VAD is enabled
The segment timestamp getters already remap t0/t1 back to the original audio timeline when VAD is used, but the per-token timestamps returned by whisper_full_get_token_data() are left in VAD-processed time. That makes word-level timing unusable with VAD, since tokens end up shifted by however much silence was removed.

Add whisper_full_get_token_t0/t1 (and the _from_state variants) that run the token times through the same vad_mapping_table the segment getters use. With VAD off they just return the stored token times, so existing callers are unaffected.
1 parent 43d78af commit c008fa5

2 files changed

Lines changed: 42 additions & 0 deletions

include/whisper.h

Lines changed: 8 additions & 0 deletions
@@ -667,6 +667,14 @@ extern "C" {
     WHISPER_API whisper_token_data whisper_full_get_token_data (struct whisper_context * ctx, int i_segment, int i_token);
     WHISPER_API whisper_token_data whisper_full_get_token_data_from_state(struct whisper_state * state, int i_segment, int i_token);
 
+    // Get the start/end time of the specified token in the specified segment
+    // When VAD is enabled these are mapped back to the original audio timeline,
+    // unlike whisper_full_get_token_data().t0/t1 which stay in VAD-processed time
+    WHISPER_API int64_t whisper_full_get_token_t0 (struct whisper_context * ctx, int i_segment, int i_token);
+    WHISPER_API int64_t whisper_full_get_token_t0_from_state(struct whisper_state * state, int i_segment, int i_token);
+    WHISPER_API int64_t whisper_full_get_token_t1 (struct whisper_context * ctx, int i_segment, int i_token);
+    WHISPER_API int64_t whisper_full_get_token_t1_from_state(struct whisper_state * state, int i_segment, int i_token);
+
     // Get the probability of the specified token in the specified segment
     WHISPER_API float whisper_full_get_token_p (struct whisper_context * ctx, int i_segment, int i_token);
     WHISPER_API float whisper_full_get_token_p_from_state(struct whisper_state * state, int i_segment, int i_token);

src/whisper.cpp

Lines changed: 34 additions & 0 deletions
@@ -8075,6 +8075,40 @@ struct whisper_token_data whisper_full_get_token_data(struct whisper_context * c
     return ctx->state->result_all[i_segment].tokens[i_token];
 }
 
+// Function to get the starting timestamp of a token
+int64_t whisper_full_get_token_t0_from_state(struct whisper_state * state, int i_segment, int i_token) {
+    const int64_t t0 = state->result_all[i_segment].tokens[i_token].t0;
+
+    // If VAD wasn't used, return the original timestamp
+    if (!state->has_vad_segments || state->vad_mapping_table.empty()) {
+        return t0;
+    }
+
+    // Map to original time using the mapping table
+    return map_processed_to_original_time(t0, state->vad_mapping_table);
+}
+
+int64_t whisper_full_get_token_t0(struct whisper_context * ctx, int i_segment, int i_token) {
+    return whisper_full_get_token_t0_from_state(ctx->state, i_segment, i_token);
+}
+
+// Function to get the ending timestamp of a token
+int64_t whisper_full_get_token_t1_from_state(struct whisper_state * state, int i_segment, int i_token) {
+    const int64_t t1 = state->result_all[i_segment].tokens[i_token].t1;
+
+    // If VAD wasn't used, return the original timestamp
+    if (!state->has_vad_segments || state->vad_mapping_table.empty()) {
+        return t1;
+    }
+
+    // Map to original time using the mapping table
+    return map_processed_to_original_time(t1, state->vad_mapping_table);
+}
+
+int64_t whisper_full_get_token_t1(struct whisper_context * ctx, int i_segment, int i_token) {
+    return whisper_full_get_token_t1_from_state(ctx->state, i_segment, i_token);
+}
+
 float whisper_full_get_token_p_from_state(struct whisper_state * state, int i_segment, int i_token) {
     return state->result_all[i_segment].tokens[i_token].p;
 }
