fix(deepgram): STT keyterm prompting is broken for non-Latin languages (Thai, Chinese, etc.) #1318

Open
tomasz-stefaniak wants to merge 1 commit into livekit:main from tomasz-stefaniak:fix/deepgram-stt-double-encoding

@tomasz-stefaniak

Problem

SpeechStream.run() in plugins/deepgram/src/stt.ts (lines 201 & 203) calls encodeURIComponent() on each query parameter value before passing it to URL.searchParams.append(). Since searchParams.append() already percent-encodes values on serialization, non-ASCII characters get double-encoded:

encodeURIComponent("ครับ")  →  "%E0%B8%84%E0%B8%A3%E0%B8%B1%E0%B8%9A"
searchParams.append(...)    →  "%25E0%25B8%2584%25E0%25B8%25A3%25E0%25B8%25B1%25E0%25B8%259A"

This causes two issues:

  1. Keyterm prompting is silently broken for non-Latin languages. Deepgram receives percent-encoded ASCII strings (%E0%B8%84%E0%B8%A3%E0%B8%B1%E0%B8%9A) instead of the actual text (ครับ), so keyterms have no effect on transcription accuracy.

  2. 400 rejections with modest keyterm counts. The double-encoded URL is ~45% larger. In our testing, just 13 Thai keyterms triggered a Deepgram 400 — while 12 succeeded. The same keyterms work fine in Latin scripts.
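To make the size effect concrete, here is a minimal sketch that compares query-string lengths with and without the extra encodeURIComponent() call. The helper name keytermQuery is hypothetical, and repeating one keyterm is a simplification; real keyterm lists vary in length, and the sketch makes no claim about where Deepgram's actual limit sits.

```typescript
// Hypothetical helper (not part of the plugin): builds the query string for
// `count` copies of one keyterm, optionally pre-encoding it as the buggy code does.
function keytermQuery(term: string, count: number, preEncode: boolean): string {
  const url = new URL('wss://api.deepgram.com/v1/listen');
  for (let i = 0; i < count; i++) {
    url.searchParams.append('keyterm', preEncode ? encodeURIComponent(term) : term);
  }
  return url.search;
}

const thai = 'ครับ';
for (const count of [1, 12, 13]) {
  const correct = keytermQuery(thai, count, false).length;
  const buggy = keytermQuery(thai, count, true).length;
  console.log(`${count} keyterm(s): correct=${correct} chars, double-encoded=${buggy} chars`);
}
```

Each UTF-8 byte costs three characters when encoded once (%XX) and five when encoded twice (%25XX), so the overhead grows linearly with the number of non-ASCII bytes in the keyterm list.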

Latin-only keyterms are unaffected because encodeURIComponent is a no-op for ASCII letters and digits — which is why this hasn't been caught until now.
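The "no-op" property applies only to characters that encodeURIComponent() leaves alone. As a rough sketch (the helper decodedAfterPreEncoding and the sample terms are illustrative, not taken from the plugin), the following shows what a receiver would decode for a few inputs when the value is pre-encoded before searchParams.append(). Whether multi-word or accented keyterms occur in practice is an assumption here.

```typescript
// Hypothetical helper: returns the value a receiver would decode if the term
// were passed through encodeURIComponent() before searchParams.append().
function decodedAfterPreEncoding(term: string): string {
  const url = new URL('wss://api.deepgram.com/v1/listen');
  url.searchParams.append('keyterm', encodeURIComponent(term));
  // Re-parse the serialized query string, as a server would.
  return new URLSearchParams(url.search).get('keyterm') ?? '';
}

// Sample terms are illustrative only.
for (const term of ['pierogi', 'LiveKit2024', 'hello world', 'caf\u00e9', 'ครับ']) {
  const received = decodedAfterPreEncoding(term);
  const status = received === term ? 'intact' : 'corrupted';
  console.log(`${JSON.stringify(term)} -> ${JSON.stringify(received)} (${status})`);
}
```

Plain ASCII letters and digits come through intact, while a space or an accented Latin letter is corrupted in the same way as the Thai term, so the bug is not strictly limited to non-Latin scripts.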

Fix

Remove the redundant encodeURIComponent() calls and let searchParams.append() handle encoding on its own.
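The shape of the fix can be sketched as follows. This is not the actual diff: buildListenUrl, the QueryValue type, and the parameter values are hypothetical stand-ins for whatever the plugin's option handling looks like. The only point is that raw values go straight into searchParams.append().

```typescript
// Hypothetical option shape; the plugin's real options type may differ.
type QueryValue = string | number | boolean | Array<string | number | boolean>;

// Hypothetical helper: builds a listen URL from a parameter map.
function buildListenUrl(base: string, params: Record<string, QueryValue | undefined>): string {
  const url = new URL(base);
  for (const [key, value] of Object.entries(params)) {
    if (value === undefined) continue;
    const values = Array.isArray(value) ? value : [value];
    for (const v of values) {
      // Pass the raw value; URLSearchParams percent-encodes it on serialization.
      url.searchParams.append(key, String(v));
    }
  }
  return url.toString();
}

const listenUrl = buildListenUrl('wss://api.deepgram.com/v1/listen', {
  language: 'th', // illustrative value
  keyterm: ['ครับ', 'pierogi'],
});
console.log(listenUrl);
// wss://api.deepgram.com/v1/listen?language=th&keyterm=%E0%B8%84%E0%B8%A3%E0%B8%B1%E0%B8%9A&keyterm=pierogi
```

Because encoding happens in exactly one place, the Thai keyterm is encoded once and decodes back to the original text on the receiving side.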

Reproduction script

Save as repro.ts and run with npx tsx repro.ts (no dependencies needed):

const latinTerm = 'pierogi';
const thaiTerm = 'ครับ';

console.log('=== Double-encoding bug in Deepgram STT plugin ===');
console.log('=== File: plugins/deepgram/src/stt.ts, lines 201 & 203 ===\n');

console.log(`Latin keyterm: "${latinTerm}"`);
console.log(`  encodeURIComponent → "${encodeURIComponent(latinTerm)}"`);
console.log(`  Same as input! So double-encoding is harmless for Latin.\n`);

console.log(`Thai keyterm:  "${thaiTerm}"`);
console.log(`  encodeURIComponent → "${encodeURIComponent(thaiTerm)}"`);
console.log(`  Now searchParams.append re-encodes the % signs → double-encoded.\n`);

const latinBuggy = new URL('wss://api.deepgram.com/v1/listen');
latinBuggy.searchParams.append('keyterm', encodeURIComponent(latinTerm));
const latinCorrect = new URL('wss://api.deepgram.com/v1/listen');
latinCorrect.searchParams.append('keyterm', latinTerm);

const thaiBuggy = new URL('wss://api.deepgram.com/v1/listen');
thaiBuggy.searchParams.append('keyterm', encodeURIComponent(thaiTerm));
const thaiCorrect = new URL('wss://api.deepgram.com/v1/listen');
thaiCorrect.searchParams.append('keyterm', thaiTerm);

console.log('Latin in URL:');
console.log(`  Buggy:   ${latinBuggy.search}`);
console.log(`  Correct: ${latinCorrect.search}`);
console.log(`  Identical! No impact.\n`);

console.log('Thai in URL:');
console.log(`  Buggy:   ${thaiBuggy.search}`);
console.log(`  Correct: ${thaiCorrect.search}`);
console.log(`  Different! Buggy version is ${thaiBuggy.search.length - thaiCorrect.search.length} chars longer.\n`);

console.log('What Deepgram receives after URL-decoding the query string:');
console.log(`  Latin buggy:   "${latinBuggy.searchParams.get('keyterm')}" ✓ correct (no damage)`);
console.log(`  Latin correct: "${latinCorrect.searchParams.get('keyterm')}" ✓ correct`);
console.log(`  Thai buggy:    "${thaiBuggy.searchParams.get('keyterm')}" ✗ not Thai text, just percent-encoded ASCII`);
console.log(`  Thai correct:  "${thaiCorrect.searchParams.get('keyterm')}" ✓ actual Thai text`);

Script output

=== Double-encoding bug in Deepgram STT plugin ===
=== File: plugins/deepgram/src/stt.ts, lines 201 & 203 ===

Latin keyterm: "pierogi"
  encodeURIComponent → "pierogi"
  Same as input! So double-encoding is harmless for Latin.

Thai keyterm:  "ครับ"
  encodeURIComponent → "%E0%B8%84%E0%B8%A3%E0%B8%B1%E0%B8%9A"
  Now searchParams.append re-encodes the % signs → double-encoded.

Latin in URL:
  Buggy:   ?keyterm=pierogi
  Correct: ?keyterm=pierogi
  Identical! No impact.

Thai in URL:
  Buggy:   ?keyterm=%25E0%25B8%2584%25E0%25B8%25A3%25E0%25B8%25B1%25E0%25B8%259A
  Correct: ?keyterm=%E0%B8%84%E0%B8%A3%E0%B8%B1%E0%B8%9A
  Different! Buggy version is 24 chars longer.

What Deepgram receives after URL-decoding the query string:
  Latin buggy:   "pierogi" ✓ correct (no damage)
  Latin correct: "pierogi" ✓ correct
  Thai buggy:    "%E0%B8%84%E0%B8%A3%E0%B8%B1%E0%B8%9A" ✗ not Thai text, just percent-encoded ASCII
  Thai correct:  "ครับ" ✓ actual Thai text
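The "Thai buggy" line above suggests a simple check that could guard against regressions: after decoding, a correctly built query value should not still contain percent-escapes. The sketch below is a heuristic under that assumption, and findSuspectParams is a hypothetical name. It would also flag a keyterm that legitimately contains text like "%41", so treat it as a test aid rather than a validator.

```typescript
// Hypothetical heuristic: lists query parameters whose decoded value still
// looks percent-encoded, which usually indicates double-encoding.
function findSuspectParams(rawUrl: string): string[] {
  const suspects: string[] = [];
  new URL(rawUrl).searchParams.forEach((value, key) => {
    if (/%[0-9A-Fa-f]{2}/.test(value)) suspects.push(key);
  });
  return suspects;
}

const buggyUrl = new URL('wss://api.deepgram.com/v1/listen');
buggyUrl.searchParams.append('keyterm', encodeURIComponent('ครับ'));

const fixedUrl = new URL('wss://api.deepgram.com/v1/listen');
fixedUrl.searchParams.append('keyterm', 'ครับ');

console.log(findSuspectParams(buggyUrl.toString())); // [ 'keyterm' ]
console.log(findSuspectParams(fixedUrl.toString())); // []
```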

`URL.searchParams.append()` already percent-encodes values on
serialization. The extra `encodeURIComponent()` call causes
double-encoding for non-ASCII characters (Thai, Chinese, etc.):

  encodeURIComponent("ครับ") → "%E0%B8%84..."
  searchParams.append re-encodes % → "%25E0%25B8%2584..."

This means:
1. Non-Latin keyterms arrive at Deepgram as percent-encoded strings
   instead of actual text, making keyterm prompting ineffective.
2. The inflated URL length hits Deepgram's limits sooner — in our
   testing, 13 Thai keyterms triggered a 400 rejection.

Latin-only keyterms are unaffected because encodeURIComponent is
a no-op for ASCII letters/digits, so this has been invisible to
English users.

The fix: let searchParams.append handle encoding on its own.

changeset-bot Bot commented Apr 26, 2026

⚠️ No Changeset found

Latest commit: c7d20aa

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types



CLAassistant commented Apr 26, 2026

CLA assistant check
All committers have signed the CLA.


@devin-ai-integration devin-ai-integration Bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 1 additional finding.

