[Inference] Together: add feature-extraction, text-to-speech, automatic-speech-recognition #2130

Open: nbroad1881 wants to merge 7 commits into huggingface:main from nbroad1881:together-add-embeddings-tts-asr

@nbroad1881 nbroad1881 commented Apr 28, 2026

Summary

The Together provider currently supports conversational, text-generation, and text-to-image. Per the Together docs, Together also serves audio (TTS + STT) and embedding models, so this PR adds three new task helpers in packages/inference/src/providers/together.ts:

  • TogetherFeatureExtractionTask → POST /v1/embeddings (OpenAI-compatible: { input, model }, returns data[].embedding)
  • TogetherTextToSpeechTask → POST /v1/audio/speech ({ input, model, voice, response_format, ... }, returns binary audio as a Blob)
  • TogetherAutomaticSpeechRecognitionTask → POST /v1/audio/transcriptions
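
As a rough illustration of the embeddings shape described in the list above, the following sketch builds an `{ input, model }` body and reads `data[].embedding` back out. The function and type names are hypothetical and are not taken from together.ts; only the field names come from this description.

```typescript
// Hypothetical names for illustration; only `input`, `model`, and
// `data[].embedding` come from the PR description.
interface EmbeddingsRequestSketch {
	input: string | string[];
	model: string;
}

// Build the JSON body for POST /v1/embeddings.
function buildEmbeddingsBody(inputs: string | string[], model: string): EmbeddingsRequestSketch {
	return { input: inputs, model };
}

// Validate the response defensively and return one vector per input.
function extractEmbeddings(response: unknown): number[][] {
	const data = (response as { data?: unknown } | null)?.data;
	if (!Array.isArray(data)) {
		throw new Error("Malformed embeddings response: expected a data array");
	}
	return data.map((row: unknown) => {
		const embedding = (row as { embedding?: unknown } | null)?.embedding;
		if (!Array.isArray(embedding)) {
			throw new Error("Malformed embeddings response: expected data[].embedding");
		}
		return embedding as number[];
	});
}
```

The validation is stricter than a plain property access so that a malformed provider response fails with a clear message instead of a TypeError.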

ASR is the only one that doesn't follow the existing JSON-body pattern: Together (and OpenAI's Whisper-compatible API) requires multipart/form-data. The new helper overrides makeBody to construct a real FormData (audio as file, all other args as form fields) and overrides prepareHeaders to leave Content-Type unset so fetch populates the multipart boundary itself. verbose_json segments are mapped into the existing AutomaticSpeechRecognitionOutput.chunks shape.
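
A minimal sketch of the multipart approach described above, not the actual helper code. The argument shape, the function names, and the `start`/`end`/`text` segment fields are assumptions (the segment fields follow the Whisper-style `verbose_json` layout); only the `file` field name, the unset `Content-Type`, and the `chunks` target come from this description.

```typescript
// Hypothetical argument shape; `audio` is sent under the `file` form field.
interface AsrArgsSketch {
	audio: Blob;
	model: string;
	[extra: string]: unknown;
}

// Build a real FormData: audio as a file part, every other arg as a text field.
function buildTranscriptionForm(args: AsrArgsSketch): FormData {
	const { audio, ...fields } = args;
	const form = new FormData();
	form.append("file", audio, "audio");
	for (const [key, value] of Object.entries(fields)) {
		if (value === undefined || value === null) continue; // skip unset options
		form.append(key, typeof value === "string" ? value : JSON.stringify(value));
	}
	return form;
}

// Deliberately no Content-Type: fetch derives
// "multipart/form-data; boundary=..." from the FormData body itself.
function buildMultipartHeaders(accessToken: string): Record<string, string> {
	return { Authorization: `Bearer ${accessToken}` };
}

// Assumed verbose_json segment fields and assumed chunk shape.
interface VerboseSegmentSketch {
	start: number;
	end: number;
	text: string;
}
interface AsrChunkSketch {
	text: string;
	timestamp: [number, number];
}

// Map each segment to a chunk with a [start, end] timestamp pair.
function segmentsToChunks(segments: VerboseSegmentSketch[]): AsrChunkSketch[] {
	return segments.map((s): AsrChunkSketch => ({ text: s.text, timestamp: [s.start, s.end] }));
}
```

Setting `Content-Type: multipart/form-data` by hand would omit the boundary parameter, which is why the header is left for `fetch` to fill in.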

The three new helpers are wired into getProviderHelper.ts under together: { ... }.

Drive-by fix

While testing the ASR path I hit a pre-existing bug in utils/request.ts::bodyToJson — it crashed with Cannot read properties of null (reading 'accessToken') whenever the body wasn't a Blob/ArrayBuffer/string, which any FormData body would now trigger during error reporting. Fixed in the same PR (also handles the case where the parsed body is a non-object like a JSON string).
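
To make the failure mode concrete, here is a hedged sketch of a null-safe serializer along the lines described above. It is a hypothetical re-implementation for illustration and not the code in utils/request.ts; in particular, the `"[binary]"` placeholder and dropping `accessToken` outright (rather than masking it) are assumptions.

```typescript
// Hypothetical stand-in for bodyToJson; illustrates the null/non-object guards.
function bodyToJsonSketch(body: unknown): unknown {
	let parsed: unknown = undefined;
	if (typeof body === "string") {
		try {
			parsed = JSON.parse(body);
		} catch {
			parsed = undefined; // not JSON: nothing to report
		}
	} else if (body instanceof FormData) {
		// Summarize form fields; binary parts are replaced by a placeholder.
		const fields: Record<string, unknown> = {};
		body.forEach((value, key) => {
			fields[key] = typeof value === "string" ? value : "[binary]";
		});
		parsed = fields;
	}
	// Guard: null, primitives (e.g. a JSON string), and arrays have no accessToken.
	if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
		return parsed;
	}
	const copy: Record<string, unknown> = { ...(parsed as Record<string, unknown>) };
	delete copy["accessToken"]; // redact the credential from error context
	return copy;
}
```

The guard before the redaction step is the part that corresponds to the crash: without it, a `null` or primitive value reaches the `accessToken` access.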

Note on HF model mapping

These new tasks work today when callers pass a direct Together API key (the SDK then bypasses the HF router and POSTs straight to api.together.xyz). For them to also work with HF tokens routed through router.huggingface.co/together, Together needs to register at least one model per task in the partner mapping at https://huggingface.co/api/partners/together/models — which currently only lists conversational and text-to-image models. The header comment in together.ts already describes that workflow.
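
The two paths above can be pictured with a small sketch. It assumes that an `hf_` prefix is what distinguishes an HF token from a direct Together key; the function name is hypothetical and the SDK's real routing logic may differ.

```typescript
// Hypothetical helper: choose the base URL from the kind of credential.
// Assumption: HF tokens start with "hf_"; anything else is a direct provider key.
function resolveTogetherBaseUrlSketch(accessToken: string): string {
	return accessToken.startsWith("hf_")
		? "https://router.huggingface.co/together" // routed: needs a partner-mapping entry
		: "https://api.together.xyz"; // direct: bypasses the HF router
}
```

With a direct key the request never touches the partner mapping, which is why the new tasks already work on that path.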

Test plan

  • pnpm --filter @huggingface/inference run check (tsc) — passes
  • pnpm --filter @huggingface/inference run lint:check (eslint) — passes
  • Live request against api.together.xyz with a direct Together API key:
    • Embeddings: intfloat/multilingual-e5-large-instruct → 2 vectors, dim=1024, ~313 ms
    • TTS: hexgrad/Kokoro-82M, voice af_alloy, response_format=wav → 152 KB audio/wav blob, ~303 ms
    • ASR: openai/whisper-large-v3 on packages/inference/test/sample2.wav → "He has grave doubts whether Sir Frederick Leighton's work is really Greek after all, and can discover in it but little of rocky Ithaca.", ~1008 ms
  • Offline mock-fetch verification of request URL/headers/body shape for all three tasks
  • Once Together registers models for these tasks in the HF partner mapping, end-to-end via hf_… tokens through router.huggingface.co/together should also succeed (no further SDK changes needed)

Made with Cursor


Note

Medium Risk
Adds multiple new Together task integrations (multipart audio upload, async video polling, and image blob handling) plus touches shared request error-reporting, increasing surface area for provider-specific edge cases and regressions.

Overview
Extends the together provider beyond chat/text/image generation by adding helpers for embeddings (feature-extraction), audio (text-to-speech, automatic-speech-recognition via multipart/form-data), and video generation (text-to-video, image-to-video with async job polling/download).

Updates Together image payloads to better match Together’s parameter names (e.g. mapping num_inference_steps → steps), adds Blob→data-URL preprocessing for image inputs, and wires all new tasks into getProviderHelper.ts.
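
A hedged sketch of what the two payload changes in the previous paragraph might look like. The helper names are hypothetical, and only the `num_inference_steps` to `steps` rename and the data-URL idea come from the summary; the real preprocessing starts from a Blob, whereas this sketch starts from raw bytes to stay synchronous.

```typescript
// Hypothetical helper: rename the HF-style parameter to the provider's name.
function toTogetherImageParamsSketch(params: Record<string, unknown>): Record<string, unknown> {
	const { num_inference_steps, ...rest } = params;
	return num_inference_steps === undefined ? rest : { ...rest, steps: num_inference_steps };
}

// Hypothetical helper: encode raw image bytes as a base64 data URL.
function bytesToDataUrlSketch(bytes: Uint8Array, mimeType: string): string {
	let binary = "";
	bytes.forEach((byte) => {
		binary += String.fromCharCode(byte);
	});
	return `data:${mimeType};base64,${btoa(binary)}`;
}
```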

Fixes utils/request.ts error-context serialization to safely handle FormData bodies (and avoid null/object assumptions) while still redacting accessToken.

Reviewed by Cursor Bugbot for commit 0247505. Bugbot is set up for automated code reviews on this repo.

…ch, automatic-speech-recognition

Together exposes more modalities than the three currently wired up
(conversational, text-generation, text-to-image). This adds three new
task helpers, all hitting Together's existing public endpoints:

- TogetherFeatureExtractionTask    -> POST /v1/embeddings
- TogetherTextToSpeechTask         -> POST /v1/audio/speech
- TogetherAutomaticSpeechRecognitionTask -> POST /v1/audio/transcriptions

ASR is the only one that doesn't follow the existing JSON-body pattern:
Together (and OpenAI's Whisper-compatible API) requires
`multipart/form-data`. The new task overrides `makeBody` to construct a
real `FormData` (audio under `file`, the rest as form fields), and
overrides `prepareHeaders` to leave `Content-Type` unset so `fetch`
populates the multipart boundary itself. `verbose_json` segments are
mapped to the existing `AutomaticSpeechRecognitionOutput.chunks` shape.

Also fixes a pre-existing bug in `utils/request.ts::bodyToJson` that
crashed with "Cannot read properties of null (reading 'accessToken')"
whenever the body wasn't a Blob/ArrayBuffer/string -- which is now hit
by any FormData body during error reporting.

Verified live against api.together.xyz with a Together API key:
- Embeddings: intfloat/multilingual-e5-large-instruct, dim=1024 ✓
- TTS: hexgrad/Kokoro-82M, voice af_alloy -> 152 KB WAV ✓
- ASR: openai/whisper-large-v3 on test/sample2.wav ->
  "He has grave doubts whether Sir Frederick Leighton's work is really
  Greek after all, and can discover in it but little of rocky Ithaca." ✓

Note: For these tasks to be usable through HF tokens (not just direct
Together keys), Together still needs to register at least one model per
task in the partner mapping at
https://huggingface.co/api/partners/together/models, which currently
only lists `conversational` and `text-to-image` models. The comment at
the top of `together.ts` already describes that workflow.

Made-with: Cursor

@hanouticelina hanouticelina left a comment

made a first pass, thanks! looks good overall

@nbroad1881

@hanouticelina, I have resolved your comments.

@hanouticelina hanouticelina left a comment

@nbroad1881 thanks! looks good to me! I've tested text-to-speech and asr and it works as expected. /v1/embeddings returns service unavailable on your side, is it expected?
also we need to allow the new routes (v1/audio/speech, v1/audio/transcriptions, v1/embeddings) server-side first, as soon as it's done and https://api.together.ai/v1/embeddings is available again, I will merge the PR!

@nbroad1881

@hanouticelina, my tests show it working. I use intfloat/multilingual-e5-large-instruct, openai/whisper-large-v3, and canopylabs/orpheus-3b-0.1-ft.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes using default mode and found 1 potential issue.

		return result;
	}
	throw new InferenceClientProviderOutputError("Received malformed response from Together image-to-image API");
}

Inherited error message names wrong task for image-to-image

Low Severity

TogetherImageToImageTask.getResponse delegates to super.getResponse(...) from TogetherTextToImageTask. When the parent throws (e.g. empty data array or missing b64_json), the propagated error message says "Received malformed response from Together text-to-image API" even though the caller is performing an image-to-image request. This can mislead users debugging failures. The child only catches the non-Blob-return case with its own correctly-labeled error, but never catches the parent's thrown exceptions.
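
One possible way to address this, sketched under assumptions: the class names, the `taskLabel` field, and the simplified response shape are all hypothetical, and `ProviderOutputErrorSketch` stands in for `InferenceClientProviderOutputError`. The idea is to let the parent build its message from an overridable label, so the child inherits the validation without inheriting the wrong task name.

```typescript
// Stand-in for InferenceClientProviderOutputError (hypothetical).
class ProviderOutputErrorSketch extends Error {}

// Simplified response shape, assumed for illustration.
interface ImageResponseSketch {
	data?: Array<{ b64_json?: string }>;
}

class TextToImageSketch {
	// Label interpolated into error messages; subclasses override it.
	protected readonly taskLabel: string = "text-to-image";

	getResponse(response: ImageResponseSketch): string {
		const b64 = response.data?.[0]?.b64_json;
		if (typeof b64 !== "string") {
			throw new ProviderOutputErrorSketch(
				`Received malformed response from Together ${this.taskLabel} API`
			);
		}
		return b64;
	}
}

class ImageToImageSketch extends TextToImageSketch {
	// Errors thrown by the inherited getResponse now name the right task.
	protected override readonly taskLabel: string = "image-to-image";
}
```

An alternative with the same effect would be to wrap the `super.getResponse(...)` call in a try/catch and rethrow with the image-to-image message.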
