[Inference] Together: add feature-extraction, text-to-speech, automatic-speech-recognition #2130
nbroad1881 wants to merge 7 commits into
Conversation
…ch, automatic-speech-recognition

Together exposes more modalities than the three currently wired up (conversational, text-generation, text-to-image). This adds three new task helpers, all hitting Together's existing public endpoints:

- `TogetherFeatureExtractionTask` → `POST /v1/embeddings`
- `TogetherTextToSpeechTask` → `POST /v1/audio/speech`
- `TogetherAutomaticSpeechRecognitionTask` → `POST /v1/audio/transcriptions`

ASR is the only one that doesn't follow the existing JSON-body pattern: Together (and OpenAI's Whisper-compatible API) requires `multipart/form-data`. The new task overrides `makeBody` to construct a real `FormData` (audio under `file`, the rest as form fields), and overrides `prepareHeaders` to leave `Content-Type` unset so `fetch` populates the multipart boundary itself. `verbose_json` segments are mapped to the existing `AutomaticSpeechRecognitionOutput.chunks` shape.

Also fixes a pre-existing bug in `utils/request.ts::bodyToJson` that crashed with "Cannot read properties of null (reading 'accessToken')" whenever the body wasn't a Blob/ArrayBuffer/string -- which is now hit by any FormData body during error reporting.

Verified live against api.together.xyz with a Together API key:

- Embeddings: `intfloat/multilingual-e5-large-instruct`, dim=1024 ✓
- TTS: `hexgrad/Kokoro-82M`, voice `af_alloy` → 152 KB WAV ✓
- ASR: `openai/whisper-large-v3` on test/sample2.wav → "He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca." ✓

Note: For these tasks to be usable through HF tokens (not just direct Together keys), Together still needs to register at least one model per task in the partner mapping at https://huggingface.co/api/partners/together/models, which currently only lists `conversational` and `text-to-image` models. The comment at the top of `together.ts` already describes that workflow.

Made-with: Cursor
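The multipart construction described above can be sketched roughly as follows. This is a minimal, hedged sketch, not the PR's actual code: the standalone function names are hypothetical; only the `file` field, the pass-through of the remaining args as form fields, and the deliberately unset `Content-Type` come from the description.

```typescript
// Hypothetical standalone version of the makeBody override for ASR.
// Together's /v1/audio/transcriptions (Whisper-compatible) expects the
// audio bytes under "file" and every other argument as a plain form field.
function makeAsrFormData(audio: Blob, args: Record<string, string>): FormData {
  const form = new FormData();
  form.append("file", audio, "audio.wav");
  for (const [key, value] of Object.entries(args)) {
    form.append(key, value);
  }
  return form;
}

// Hypothetical prepareHeaders counterpart: Content-Type is intentionally
// absent so fetch() can fill in "multipart/form-data; boundary=..." itself.
function prepareAsrHeaders(token: string): Record<string, string> {
  return { Authorization: `Bearer ${token}` };
}
```

Setting `Content-Type: multipart/form-data` manually would omit the boundary parameter and break the upload, which is why the header is left for `fetch` to populate.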
hanouticelina
left a comment
made a first pass, thanks! looks good overall
I have resolved your comments
hanouticelina
left a comment
@nbroad1881 thanks! looks good to me! I've tested text-to-speech and asr and it works as expected. /v1/embeddings returns service unavailable on your side, is it expected?
also we need to allow the new routes (v1/audio/speech, v1/audio/transcriptions, v1/embeddings) server-side first, as soon as it's done and https://api.together.ai/v1/embeddings is available again, I will merge the PR!
@hanouticelina, my tests show it working. I use
Cursor Bugbot has reviewed your changes using default mode and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 0247505.
```ts
		return result;
	}
	throw new InferenceClientProviderOutputError("Received malformed response from Together image-to-image API");
}
```
Inherited error message names wrong task for image-to-image
Low Severity
TogetherImageToImageTask.getResponse delegates to super.getResponse(...) from TogetherTextToImageTask. When the parent throws (e.g. empty data array or missing b64_json), the propagated error message says "Received malformed response from Together text-to-image API" even though the caller is performing an image-to-image request. This can mislead users debugging failures. The child only catches the non-Blob-return case with its own correctly-labeled error, but never catches the parent's thrown exceptions.
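One way the child task could address this is to catch the parent's exception and rewrite the task name before rethrowing. This is a sketch of that idea, not the PR's actual fix; the error class name comes from the snippet above, and the wrapper function is hypothetical.

```typescript
// Error class name taken from the diff snippet above; shape is an assumption.
class InferenceClientProviderOutputError extends Error {}

// Hypothetical wrapper: delegate to the parent's getResponse, but relabel
// the task name in any provider-output error so users debug the right task.
async function getResponseWithRelabel(
  parentGetResponse: () => Promise<Blob>
): Promise<Blob> {
  try {
    return await parentGetResponse();
  } catch (err) {
    if (err instanceof InferenceClientProviderOutputError) {
      throw new InferenceClientProviderOutputError(
        err.message.replace("text-to-image", "image-to-image")
      );
    }
    throw err; // unrelated errors propagate untouched
  }
}
```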
Summary
The Together provider currently supports `conversational`, `text-generation`, and `text-to-image`. Per the Together docs, Together also serves audio (TTS + STT) and embedding models, so this PR adds three new task helpers in `packages/inference/src/providers/together.ts`:

- `TogetherFeatureExtractionTask` → `POST /v1/embeddings` (OpenAI-compatible: `{ input, model }`, returns `data[].embedding`)
- `TogetherTextToSpeechTask` → `POST /v1/audio/speech` (`{ input, model, voice, response_format, ... }`, returns binary audio as a `Blob`)
- `TogetherAutomaticSpeechRecognitionTask` → `POST /v1/audio/transcriptions`

ASR is the only one that doesn't follow the existing JSON-body pattern: Together (and OpenAI's Whisper-compatible API) requires `multipart/form-data`. The new helper overrides `makeBody` to construct a real `FormData` (audio as `file`, all other args as form fields) and overrides `prepareHeaders` to leave `Content-Type` unset so `fetch` populates the multipart boundary itself. `verbose_json` segments are mapped into the existing `AutomaticSpeechRecognitionOutput.chunks` shape.

The three new helpers are wired into `getProviderHelper.ts` under `together: { ... }`.

Drive-by fix

While testing the ASR path I hit a pre-existing bug in `utils/request.ts::bodyToJson`: it crashed with `Cannot read properties of null (reading 'accessToken')` whenever the body wasn't a Blob/ArrayBuffer/string, which any `FormData` body would now trigger during error reporting. Fixed in the same PR (also handles the case where the parsed body is a non-object like a JSON string).

Note on HF model mapping

These new tasks work today when callers pass a direct Together API key (the SDK then bypasses the HF router and POSTs straight to api.together.xyz). For them to also work with HF tokens routed through `router.huggingface.co/together`, Together needs to register at least one model per task in the partner mapping at https://huggingface.co/api/partners/together/models, which currently only lists `conversational` and `text-to-image` models. The header comment in `together.ts` already describes that workflow.

Test plan

- `pnpm --filter @huggingface/inference run check` (tsc): passes
- `pnpm --filter @huggingface/inference run lint:check` (eslint): passes
- Live against api.together.xyz with a direct Together API key:
  - Embeddings: `intfloat/multilingual-e5-large-instruct` → 2 vectors, `dim=1024`, ~313 ms
  - TTS: `hexgrad/Kokoro-82M`, voice `af_alloy`, `response_format=wav` → 152 KB `audio/wav` blob, ~303 ms
  - ASR: `openai/whisper-large-v3` on `packages/inference/test/sample2.wav` → "He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca.", ~1008 ms
- `hf_…` tokens through `router.huggingface.co/together` should also succeed (no further SDK changes needed)

Made with Cursor
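The drive-by fix can be illustrated with a defensive version of `bodyToJson`. This is a sketch under stated assumptions, not the PR's actual code: the function name comes from the PR text, but the exact control flow (summarizing `FormData` keys, guarding against null/non-object parses, redacting `accessToken`) is a plausible reconstruction of the behavior described.

```typescript
// Hypothetical reconstruction of the fixed utils/request.ts::bodyToJson:
// serialize a request body for error reporting without assuming it parses
// to a non-null object, and without choking on FormData.
function bodyToJson(body: unknown): string {
  let parsed: unknown = body;
  if (typeof body === "string") {
    try {
      parsed = JSON.parse(body);
    } catch {
      return body; // not JSON at all: report the raw string
    }
  } else if (body instanceof FormData) {
    // FormData has no JSON form; summarize its keys instead of crashing.
    parsed = Object.fromEntries([...body.keys()].map((k) => [k, "<form field>"]));
  }
  // Guard: parsed may be null or a non-object (e.g. a bare JSON string).
  if (parsed !== null && typeof parsed === "object" && "accessToken" in parsed) {
    (parsed as Record<string, unknown>).accessToken = "<redacted>";
  }
  return JSON.stringify(parsed);
}
```

The null guard is the essence of the fix: the old code reached for `accessToken` unconditionally, which threw exactly the `Cannot read properties of null` error quoted above whenever the parse result wasn't an object.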
Note
Medium Risk
Adds multiple new Together task integrations (multipart audio upload, async video polling, and image blob handling) plus touches shared request error-reporting, increasing surface area for provider-specific edge cases and regressions.
Overview
Extends the `together` provider beyond chat/text/image generation by adding helpers for embeddings (`feature-extraction`), audio (`text-to-speech`, `automatic-speech-recognition` via `multipart/form-data`), and video generation (`text-to-video`, `image-to-video` with async job polling/download).

Updates Together image payloads to better match Together's parameter names (e.g. mapping `num_inference_steps` → `steps`), adds Blob → data-URL preprocessing for image inputs, and wires all new tasks into `getProviderHelper.ts`.

Fixes `utils/request.ts` error-context serialization to safely handle `FormData` bodies (and avoid null/object assumptions) while still redacting `accessToken`.
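The parameter rename can be sketched as a small mapping pass over the outgoing payload. Only the `num_inference_steps` → `steps` pair is stated in the summary; the helper name and the table-driven shape are assumptions for illustration.

```typescript
// Hypothetical payload translator: rename the SDK's generic argument names
// to the field names Together's image endpoint expects, passing everything
// else through unchanged.
function toTogetherImagePayload(
  args: Record<string, unknown>
): Record<string, unknown> {
  const renames: Record<string, string> = {
    num_inference_steps: "steps", // the one mapping named in the summary
  };
  const payload: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(args)) {
    payload[renames[key] ?? key] = value;
  }
  return payload;
}
```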
utils/request.tserror-context serialization to safely handleFormDatabodies (and avoid null/object assumptions) while still redactingaccessToken.Reviewed by Cursor Bugbot for commit 0247505. Bugbot is set up for automated code reviews on this repo. Configure here.