Feature: Add streaming support for parakeet #3900
|
Verified test on Ubuntu 24.04: Streaming / Non-streaming #3735
|
I would add that there is a typo in I am happy to include this fix in the current merge request. |
|
I've left the indentation in |
Thanks, I've opened #3906 to fix this. |
|
@justynleung Thanks for your effort on this! I also went down this route in the original parakeet PR, using an implementation similar to yours, but reverted the changes because I could not get it to work correctly. I ran into issues like the one in your streaming output above, for example:
My understanding is that the TDT (Token-and-Duration Transducer) model will not work well for streaming because of the emitted token durations. A dedicated streaming model, such as parakeet_realtime_eou_120m-v1 or nemotron-3.5-asr-streaming-0.6b, should be used instead. Depending on interest from the community, this is something we could look into supporting in the future.
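For readers unfamiliar with TDT decoding, here is a minimal sketch of the mechanic being referred to: each decoding step yields a token together with a duration, and the frame pointer advances by that duration. The sketch only shows how the pointer can land past the end of the chunk being decoded; it is not a diagnosis of the issue seen in this PR. The names `tdt_step` and `overshoot_past_chunk` are hypothetical, and the steps are scripted stand-ins for model output.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A (token, duration) pair, standing in for what a TDT-style joint network emits.
struct tdt_step {
    int         token;
    std::size_t duration; // number of encoder frames to skip after this token
};

// Hypothetical helper: walk a frame pointer through scripted steps, starting at
// chunk_begin, and report how many frames past chunk_end the pointer ends up.
std::size_t overshoot_past_chunk(const std::vector<tdt_step> & steps,
                                 std::size_t chunk_begin,
                                 std::size_t chunk_end) {
    assert(chunk_begin <= chunk_end);
    std::size_t t = chunk_begin;
    for (const tdt_step & s : steps) {
        if (t >= chunk_end) {
            break; // the pointer has left the chunk, so stop decoding it
        }
        t += s.duration; // the duration decides where the next step happens
    }
    return t > chunk_end ? t - chunk_end : 0;
}
```

With steps of duration 2, 3 and 4 and a chunk covering frames 0 to 6, the pointer visits frames 0, 2 and 5 and then jumps to 9, three frames into the next chunk.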
|
Thank you for the feedback. I am glad this had already been taken into consideration. I ran some tests again and can confirm that the same issue arises. I agree that other models will be better suited to streaming transcription. I will close the pull request.
Thanks again @danbev for the amazing parakeet support (#3735).
This merge request aims to add streaming support for Parakeet models, as seen in the Hugging Face and NVIDIA NeMo ASR examples.
(AI disclosure) Streaming mode graphical explanation:
A buffer window slides across the full audio. The full window is encoded, but only the middle chunk is decoded.
The predictor state is reused for the next chunk until the full audio has been processed.
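The windowing described above can be sketched as follows. This is a minimal illustration under assumptions, not the code from this PR: `window_span`, `plan_windows`, and the `chunk` and `context` parameters are hypothetical names, and sizes are counted in samples. It only computes which span is encoded and which middle part is decoded at each step; carrying the predictor state between steps is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One step of the sliding buffer: [win_begin, win_end) is encoded,
// [dec_begin, dec_end) is the middle chunk whose output is kept.
struct window_span {
    std::size_t win_begin;
    std::size_t win_end;
    std::size_t dec_begin;
    std::size_t dec_end;
};

// Hypothetical helper: split n_samples into decode chunks of size `chunk`, each
// surrounded by up to `context` extra samples on both sides (clamped to the audio).
std::vector<window_span> plan_windows(std::size_t n_samples,
                                      std::size_t chunk,
                                      std::size_t context) {
    assert(chunk > 0);
    std::vector<window_span> spans;
    for (std::size_t dec_begin = 0; dec_begin < n_samples; dec_begin += chunk) {
        const std::size_t dec_end   = std::min(dec_begin + chunk, n_samples);
        const std::size_t win_begin = dec_begin > context ? dec_begin - context : 0;
        const std::size_t win_end   = std::min(dec_end + context, n_samples);
        spans.push_back({win_begin, win_end, dec_begin, dec_end});
    }
    return spans;
}
```

For example, `plan_windows(10, 4, 2)` gives three steps: the first encodes samples 0 to 6 and decodes 0 to 4, the second encodes 2 to 10 and decodes 4 to 8, and the last encodes 6 to 10 and decodes 8 to 10. A real loop would run the encoder on each window and pass the predictor state from one step to the next.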
Disclosure: AI was used for C++11 syntax assistance and for researching the related issue #3735. All logic has been manually verified and tested. Test results are in the second comment for readability.