Implement whisper_token_to_bytes #148

Merged
absadiki merged 1 commit into absadiki:main from githubnemo:feature/token-bytes
Dec 20, 2025

@githubnemo

Not every token seems to be valid Unicode on its own, but `pw.whisper_token_to_str` interprets every token as such. While the resulting error can be caught with an exception handler, it might be worthwhile to have a way of getting the token bytes instead and decoding them with `.decode`, e.g.:

str(pw.whisper_token_to_bytes(ctx, tid), 'utf8', 'ignore')
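
To illustrate why raw bytes help, here is a minimal sketch that runs without a model. It assumes `whisper_token_to_bytes` returns a plain `bytes` object per token. The byte values below are made up for illustration and are not taken from an actual Whisper vocabulary, and the helper names are hypothetical. The idea is that a multi-byte character can be split across tokens, so each piece fails strict decoding on its own while the joined bytes decode cleanly.

```python
# Hypothetical token payloads: "é" is b"\xc3\xa9" in UTF-8, and a byte-level
# tokenizer may split those two bytes across separate tokens.
token_bytes = [b"caf", b"\xc3", b"\xa9"]


def strict_failures(tokens):
    """Return the indices of tokens that are not valid UTF-8 on their own."""
    bad = []
    for i, t in enumerate(tokens):
        try:
            t.decode("utf8")
        except UnicodeDecodeError:
            bad.append(i)
    return bad


def decode_each(tokens):
    """Decode every token separately, silently dropping undecodable bytes."""
    return [str(t, "utf8", "ignore") for t in tokens]


def decode_joined(tokens):
    """Concatenate the raw bytes first, then decode the whole sequence once."""
    return b"".join(tokens).decode("utf8", "ignore")


print(strict_failures(token_bytes))
print(decode_each(token_bytes))
print(decode_joined(token_bytes))
```

Decoding per token with `'ignore'` avoids the exception but loses the split character, whereas joining the bytes before decoding preserves it. Which approach fits depends on whether the caller needs per-token text or only the final string.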

@absadiki
Owner

Thanks for pointing this out! Yes, we can definitely expose the raw token bytes, similar to how full_segment already works.
However, it looks like CI is failing. Can you please take a look?

@githubnemo
Author

I'm not sure this is related to the PR. Can you restart the CI for these runs (it's only Windows)?

@absadiki
Owner

I reran the CI and this time all jobs passed, including Windows. Not sure what caused the earlier failures, but it looks good now.
Thanks @githubnemo for the contribution!

@absadiki merged commit 1223b5b into absadiki:main on Dec 20, 2025
103 of 114 checks passed