Description
When using the client.speech_to_text.convert API with diarization enabled, the returned word-level timestamps occasionally become "stuck" - multiple consecutive words are assigned identical start and end timestamps. This breaks the continuity of the transcription timeline.
Example of Problem:
json
{
"text": "休",
"start": 452.3,
"end": 452.3
},
{
"text": "息",
"start": 452.3,
"end": 452.3
}
Impact:
Timeline Accuracy: Renders timestamp information useless for applications requiring precise timing
Speaker Diarization: Prevents accurate speaker attribution when timestamps don't progress
Audio Alignment: Makes it impossible to sync transcription with original audio
Data Processing: Requires complex workarounds to handle corrupted timing data
Steps to Reproduce
Use a long audio file (>10 minutes) with multiple speaker changes
Call the API with parameters:
python
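from elevenlabs.client import ElevenLabs

client = ElevenLabs(api_key="YOUR_API_KEY")  # placeholder key

# audio_data: the audio file opened in binary mode (or its raw bytes)
response = client.speech_to_text.convert(
    file=audio_data,
    model_id="scribe_v1",
    diarize=True,
    language_code="zh",  # Also reproducible with other languages
    tag_audio_events=True,
)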
Inspect word-level timestamps in the response
Observe duplicate timestamps for consecutive words, especially after speaker changes
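For anyone trying to confirm the problem on their own files, the inspection in the last two steps could be automated roughly as follows. This is only a sketch: it assumes the words have already been converted to plain dicts shaped like the JSON in the description, and `find_stuck_runs` is a hypothetical helper name, not part of the SDK. The first sample word is made up for illustration; the other two are the ones from the example above.

```python
from itertools import groupby


def find_stuck_runs(words, min_run=2):
    """Return (first_index, run_length, start, end) for every run of
    consecutive words that share identical start and end timestamps."""
    runs = []
    index = 0
    # groupby only merges adjacent items, so each group is one consecutive run.
    for (start, end), group in groupby(words, key=lambda w: (w["start"], w["end"])):
        length = len(list(group))
        if length >= min_run:
            runs.append((index, length, start, end))
        index += length
    return runs


words = [
    {"text": "我", "start": 451.9, "end": 452.1},  # hypothetical preceding word
    {"text": "休", "start": 452.3, "end": 452.3},  # from the example above
    {"text": "息", "start": 452.3, "end": 452.3},  # from the example above
]

print(find_stuck_runs(words))  # [(1, 2, 452.3, 452.3)]
```

Exact float comparison is intentional here, since the stuck words come back with identical values rather than merely close ones.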
Expected Behavior
Each word should have a unique timestamp range (start < end)
Timestamps should monotonically increase throughout the transcription
Speaker changes should not cause timestamp stagnation
Consecutive words should have increasing end timestamps
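The expectations above can also be written down as a small check, which may be useful as a regression test once this is fixed. Again a sketch under assumptions: words are plain dicts with `start` and `end` keys as in the JSON example, and `check_expected_timestamps` is a hypothetical name chosen for illustration.

```python
def check_expected_timestamps(words):
    """Return a list of (word_index, message) for every word that violates
    the expected behavior: start < end, and strictly increasing end times."""
    problems = []
    previous_end = None
    for i, word in enumerate(words):
        start, end = word["start"], word["end"]
        if not start < end:
            problems.append((i, "start is not before end"))
        if previous_end is not None and end <= previous_end:
            problems.append((i, "end does not increase"))
        previous_end = end
    return problems


# The two words from the example in the description.
stuck_words = [
    {"text": "休", "start": 452.3, "end": 452.3},
    {"text": "息", "start": 452.3, "end": 452.3},
]

for index, message in check_expected_timestamps(stuck_words):
    print(index, message)
```

A response that behaves as expected would produce an empty list.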
Code example
No response
Additional context
No response