On-device transcription via sherpa-onnx + SAF/MediaStore + Bluetooth #365

Draft
julianjc84 wants to merge 12 commits into FossifyOrg:main from julianjc84:VoiceToTextTranscribe
@julianjc84

Summary

Draft for discussion; not yet ready to merge. Adds on-device speech-to-text
on top of two foundational changes (storage and Bluetooth recording):

  • SAF + MediaStore storage (3dd79248) — recordings via content URIs.
  • Bluetooth recording (66880eb5 + 5d177606) — SCO mic with live detection.
  • Transcription (e977cd1623bb97aa) — sherpa-onnx Whisper engine, model
    download manager, inline transcript indicator, full-screen transcript view
    with multi-segment selection, pipelined decode+transcribe with live ETA.

Based on upstream 1a2f0963.

Acknowledgements

Huge thanks to @FossifyOrg and the Fossify Voice Recorder maintainers and
contributors — this branch sits directly on top of main and reuses the
existing recorder/player/settings architecture throughout.

This branch also incorporates two open upstream PRs that I rebased and
resolved conflicts for:

All transcription work on top is mine.

julian richards and others added 12 commits April 30, 2026 00:39
Squash-merge of upstream PR FossifyOrg#317 (28 commits)
onto current main. Adds a new :store Gradle module that abstracts
recording I/O behind a unified RecordingStore interface, with two
backends:

  - MediaStore (default; no folder picker on first run)
  - Storage Access Framework (lets the user save into any folder or
    document provider, including cloud / sync apps)

Switches Recording from path-based to URI-based and removes the old
DocumentFile/path extension helpers. Includes the upstream test suite
under store/src/androidTest.

Conflict resolved: kept main's commons 6.1.6 over the PR's 6.1.0.
Verified ./gradlew :app:assembleDebug succeeds across Core/Foss/Gplay.

Original PR: FossifyOrg#317

Co-Authored-By: Adam Cigánek <adam.ciganek@proton.me>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

It also improves Android 12+ compatibility and reduces build warnings.

Adds a new :transcribe Gradle module wrapping sherpa-onnx (JitPack v1.12.40)
with a Whisper-tiny multilingual int8 model downloaded on first use. A new
foreground TranscriptionService streams the recording through MediaCodec
+ a linear resampler to 16 kHz mono Float32, runs inference per 30 s chunk,
and writes a JSON sidecar (.transcript.json) next to the audio via
TranscriptStore (SAF + MediaStore parity).
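
For illustration, a linear resampler of the kind mentioned above could be sketched in plain Java as follows. This is not the branch's code: `LinearResampler` and `resample` are hypothetical names, and the sketch assumes the input is already mono Float32 PCM. Note that plain linear interpolation applies no low-pass filter, so downsampling this way can alias.

```java
/** Hypothetical helper: resamples mono float PCM by linear interpolation. */
final class LinearResampler {
    static float[] resample(float[] input, int srcRate, int dstRate) {
        if (input.length == 0 || srcRate == dstRate) return input.clone();
        // Number of output samples covering the same duration as the input.
        int outLen = (int) ((long) input.length * dstRate / srcRate);
        float[] out = new float[outLen];
        double step = (double) srcRate / dstRate; // input samples per output sample
        for (int i = 0; i < outLen; i++) {
            double pos = i * step;                      // fractional position in the input
            int i0 = (int) pos;                         // sample to the left
            int i1 = Math.min(i0 + 1, input.length - 1); // sample to the right (clamped)
            float frac = (float) (pos - i0);
            out[i] = input[i0] + (input[i1] - input[i0]) * frac;
        }
        return out;
    }
}
```

With a 48 kHz source, `LinearResampler.resample(pcm, 48000, 16000)` keeps roughly every third sample position, so a 30 s chunk becomes 480,000 samples at 16 kHz.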

In the player UI a transcript icon next to the title opens TranscriptDialog,
which renders idle/busy/ready states and supports tap-segment-to-seek.
Progress is throttled to ≤4 events/sec so the foreground notification
isn't re-posted thousands of times per second during model download.
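
A minimal sketch of that kind of throttle, in plain Java and under assumptions: `ProgressThrottle` and `tryEmit` are hypothetical names, and the caller is assumed to pass a monotonic timestamp such as `System.nanoTime()`. The actual service may implement the limit differently.

```java
/** Hypothetical rate limiter: lets at most N events per second through. */
final class ProgressThrottle {
    private final long minIntervalNanos;
    private long lastEmitNanos;
    private boolean emittedOnce;

    ProgressThrottle(int maxEventsPerSecond) {
        this.minIntervalNanos = 1_000_000_000L / maxEventsPerSecond;
    }

    /** Returns true if an event at nowNanos may be posted, false if it should be dropped. */
    synchronized boolean tryEmit(long nowNanos) {
        if (emittedOnce && nowNanos - lastEmitNanos < minIntervalNanos) return false;
        emittedOnce = true;
        lastEmitNanos = nowNanos;
        return true;
    }
}
```

A caller would wrap each progress update as `if (throttle.tryEmit(System.nanoTime())) { /* post event */ }`, with `new ProgressThrottle(4)` giving the 4 events/sec ceiling. A terminal "100%" update would need to bypass the check so it is never dropped.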

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…cribe

Adds a Transcription section in Settings with a model picker that lists
all catalog entries with install state, size, and per-row download/delete
actions, plus a language preference. Downloads run through the existing
foreground service via a new ACTION_DOWNLOAD_MODEL path so they survive
backgrounding and reuse the cancellation/notification plumbing.

Transcripts now persist processing wall-clock time (processing_ms in the
sidecar JSON), shown in the transcript dialog ready state. Adds a
Re-transcribe button that overwrites the existing sidecar.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Wrap the model list in a ScrollView so all entries are reachable
  (matches how commons' own dialog_radio_group.xml works).
- Restructure each row so the action button sits on its own line,
  preventing the radio + name from being squeezed by long button labels.
- Pass forceFinished = true from the Completed/Failed/Cancelled
  subscribers so the row no longer races against the service clearing
  TranscriptionService.downloadingModelId in its finally block — fixes
  the row briefly showing "0%" right after a download completes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Detection used to be a one-shot poll on lifecycle events, so plugging
or unplugging a BT headset while sitting on the recorder screen had no
visible effect until something else triggered a refresh.

- Register an AudioDeviceCallback on attach (unregister on destroy)
  and refresh the BT tab on every onAudioDevicesAdded/Removed.
- Keep the mic selector row visible whenever recording is stopped
  (instead of vanishing when no BT device is present) and render the
  Bluetooth tab as dimmed + non-tappable when unavailable, with a
  "Bluetooth · Not connected" label.
- When a BT mic is connected and the BLUETOOTH_CONNECT permission is granted,
  show the device productName ("Bluetooth · AirPods Pro") so the
  user can confirm which headset is in use.
- If BT was selected and the device disconnects, fall back to Device
  Mic automatically.
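
The label and fallback rules in the list above can be summarised as two pure functions. This is a plain-Java sketch, not the fragment's code: `MicSelection`, `resolve`, and `bluetoothLabel` are hypothetical names, and the bare "Bluetooth" label for the connected-but-no-permission case is an assumption, since the commit message does not say what is shown then.

```java
/** Hypothetical summary of the mic-selector rules; no Android APIs involved. */
final class MicSelection {
    enum Mic { DEVICE, BLUETOOTH }

    /** Falls back to the device mic when Bluetooth was selected but is gone. */
    static Mic resolve(Mic selected, boolean btAvailable) {
        return (selected == Mic.BLUETOOTH && !btAvailable) ? Mic.DEVICE : selected;
    }

    /** Builds the Bluetooth tab label; \u00B7 is the middle dot used in the UI strings. */
    static String bluetoothLabel(boolean btAvailable, boolean hasConnectPermission,
                                 String productName) {
        if (!btAvailable) return "Bluetooth \u00B7 Not connected";
        if (hasConnectPermission && productName != null && !productName.isEmpty()) {
            return "Bluetooth \u00B7 " + productName;
        }
        return "Bluetooth"; // assumption: no device name without the permission
    }
}
```

In the real fragment both functions would be re-evaluated from the `AudioDeviceCallback` whenever devices are added or removed.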

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…t toggle

Promotes transcript viewing from a popup dialog to a dedicated activity
and gives the player a dedicated view onto transcribed recordings.

Recordings list (PlayerFragment):
- Add a two-tab segmented control at the top: "Audio" (current behaviour)
  and "Transcripts" (filtered to recordings with a sidecar JSON).
- "Has transcript" set is recomputed on a background thread when entering
  Transcripts mode and on every refresh while in it.
- In Transcripts mode, tap on a row opens the new TranscriptActivity;
  long-press shows transcript-flavoured CAB actions: Share transcript
  (text/plain), Share transcript JSON, Copy transcript, Delete
  transcript. The original Audio long-press menu is unchanged.

Transcript viewer (TranscriptActivity, replaces TranscriptDialog):
- Material toolbar with back arrow, in-toolbar SearchView, and overflow
  (Re-transcribe / Copy / Share JSON / Delete). Menu is inflated directly
  on MaterialToolbar — commons' setupTopAppBar does not register the
  toolbar as the support action bar, so onCreateOptionsMenu does not fire
  for an AppCompat menu attached that way.
- Search highlights all matches across segments via SpannableString
  (active match in primary colour, passive matches in semi-transparent
  primary), with prev/next chevrons and an "X / Y matches" counter.
- Self-contained MediaPlayer with mini play/pause + seekbar; segment tap
  seeks and plays. The fragment's player is paused before launching so
  the two players don't compete.
- A 200ms tick syncs a playhead-segment highlight (tinted background +
  bold/primary timestamp) and auto-scrolls the active segment into view
  only when it has moved off-screen.
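
As an illustration of the search bullet above, the match list and the "X / Y matches" counter could be computed roughly as below. This is a plain-Java sketch with hypothetical names (`TranscriptSearch`, `Match`, `findAll`, `counter`, `next`, `prev`); case-insensitive matching and wrap-around navigation are assumptions, not something the commit message states. The highlight spans themselves would be applied from each match's `start`/`end` offsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical search helper: collects every hit across all transcript segments. */
final class TranscriptSearch {
    /** One hit: the segment index and the character range inside that segment. */
    static final class Match {
        final int segment, start, end;
        Match(int segment, int start, int end) {
            this.segment = segment;
            this.start = start;
            this.end = end;
        }
    }

    static List<Match> findAll(List<String> segments, String query) {
        List<Match> out = new ArrayList<>();
        if (query.isEmpty()) return out;
        // Caveat: lower-casing can change string length for a few characters,
        // which would shift offsets; a production version needs to handle that.
        String needle = query.toLowerCase(Locale.ROOT);
        for (int s = 0; s < segments.size(); s++) {
            String hay = segments.get(s).toLowerCase(Locale.ROOT);
            int from = 0;
            while ((from = hay.indexOf(needle, from)) >= 0) {
                out.add(new Match(s, from, from + needle.length()));
                from += needle.length(); // non-overlapping matches
            }
        }
        return out;
    }

    /** Text for the counter; activeIndex is zero-based. */
    static String counter(int activeIndex, int total) {
        return total == 0 ? "0 / 0 matches" : (activeIndex + 1) + " / " + total + " matches";
    }

    /** Next/previous chevrons, wrapping at the ends (assumed behaviour). */
    static int next(int active, int total) { return total == 0 ? -1 : (active + 1) % total; }
    static int prev(int active, int total) { return total == 0 ? -1 : (active - 1 + total) % total; }
}
```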

Plumbing:
- TranscriptStore.sidecarUri exposes the JSON URI for ACTION_SEND.
- TranscriptShare.kt builds the plain-text body and the share intents.
- New strings for tabs, share / copy / delete labels, and search UI.
- Manifest entry for TranscriptActivity (parented to MainActivity).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

While transcribing, the activity now shows an "Elapsed: 00:42 · ETA: 03:24"
line under the progress bar, ticking once per second. The ETA is a linear
extrapolation from the consumer-side fraction (audio actually transcribed,
not merely decoded), suppressed until fraction > 5% to avoid noise from
chunk-boundary jitter, and only shown during the TRANSCRIBING phase
(the model-download and decode/write phases are too short to extrapolate
from). TranscriptionService exposes a transcriptionStartMs companion so
the activity can reconstruct elapsed time even when opened mid-job.
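
The extrapolation is simple enough to show directly. If a fraction f of the audio has been transcribed after `elapsed` milliseconds, the projected total is `elapsed / f` and the ETA is the remainder. The plain-Java sketch below uses hypothetical names (`EtaEstimator`, `remainingMs`, `mmss`) and returns -1 to mean "do not show an ETA yet".

```java
import java.util.Locale;

/** Hypothetical ETA helper: linear extrapolation from the transcribed fraction. */
final class EtaEstimator {
    static final double MIN_FRACTION = 0.05; // suppress the ETA at or below 5%

    /** Remaining time in ms, or -1 while the fraction is too small to trust. */
    static long remainingMs(long elapsedMs, double fraction) {
        if (fraction <= MIN_FRACTION) return -1;
        double f = Math.min(fraction, 1.0);
        double projectedTotalMs = elapsedMs / f;
        return Math.round(projectedTotalMs - elapsedMs);
    }

    /** Formats milliseconds as mm:ss for the "Elapsed ... ETA ..." line. */
    static String mmss(long ms) {
        long totalSec = ms / 1000;
        return String.format(Locale.ROOT, "%02d:%02d", totalSec / 60, totalSec % 60);
    }
}
```

For example, 60 s elapsed at 25% gives a projected total of 240 s and therefore an ETA of 180 s, which `mmss` renders as `03:00`.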

Pipelined decode+transcribe in TranscriptionService:
- AudioDecoder runs on a worker thread, pushing PcmChunks into a small
  bounded queue (capacity 2 for one in-flight + one waiting).
- The recognizer drains the queue on the existing pipeline coroutine,
  so MediaCodec / extractor wait time on chunk N+1 overlaps with
  inference on chunk N.
- Progress is now emitted from the consumer side (chunk.endMs vs. the
  recording's known duration) so the bar tracks real transcription
  progress instead of running ahead of it.
- Cancellation propagates both ways: the producer polls isCancelled
  while waiting to enqueue, and any consumer error sets isCancelled
  so the producer drops out of decodeChunks cleanly. An EOF sentinel
  always unblocks the consumer's queue.take().
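
The producer/consumer shape described in the list above can be sketched in plain Java as follows. It is a simplified stand-in, not the service's code: `DecodePipeline`, `Chunk`, `run`, and the `failAtChunk` parameter (used only to simulate a consumer error) are hypothetical, and "decoding" and "inference" are reduced to passing chunk indices through the queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical two-stage pipeline: decoder thread feeds a capacity-2 queue. */
final class DecodePipeline {
    static final class Chunk {
        final int index;
        Chunk(int index) { this.index = index; }
    }

    private static final Chunk EOF = new Chunk(-1); // sentinel, compared by identity

    /** Returns the indices of the chunks the consumer finished processing. */
    static List<Integer> run(int chunkCount, AtomicBoolean cancelled, int failAtChunk) {
        BlockingQueue<Chunk> queue = new ArrayBlockingQueue<>(2);
        Thread producer = new Thread(() -> produce(queue, chunkCount, cancelled), "decoder");
        producer.start();

        List<Integer> done = new ArrayList<>();
        try {
            for (Chunk c = queue.take(); c != EOF; c = queue.take()) {
                if (c.index == failAtChunk) throw new IllegalStateException("inference failed");
                done.add(c.index); // stands in for running the recognizer on the chunk
            }
        } catch (RuntimeException | InterruptedException e) {
            cancelled.set(true); // consumer error: tell the producer to stop
            if (e instanceof InterruptedException) Thread.currentThread().interrupt();
            // A real service would also report the failure instead of swallowing it.
        }
        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done;
    }

    private static void produce(BlockingQueue<Chunk> queue, int chunkCount,
                                AtomicBoolean cancelled) {
        try {
            for (int i = 0; i < chunkCount; i++) {
                // new Chunk(i) stands in for decoding and resampling one chunk.
                if (!offerUnlessCancelled(queue, new Chunk(i), cancelled)) return;
            }
            offerUnlessCancelled(queue, EOF, cancelled);
        } catch (InterruptedException e) {
            cancelled.set(true);
        } finally {
            // On cancellation, drop pending work and guarantee the consumer sees EOF.
            if (cancelled.get()) {
                queue.clear();
                queue.offer(EOF);
            }
        }
    }

    /** Waits for queue space, polling the cancel flag; false means cancelled. */
    private static boolean offerUnlessCancelled(BlockingQueue<Chunk> queue, Chunk item,
                                                AtomicBoolean cancelled)
            throws InterruptedException {
        do {
            if (cancelled.get()) return false;
        } while (!queue.offer(item, 20, TimeUnit.MILLISECONDS));
        return true;
    }
}
```

The capacity of 2 is what bounds memory: the decoder can be at most one finished chunk ahead of the one waiting to be consumed, while still overlapping its I/O with inference.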

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The pipelined version posted progress only on each chunk's completion,
which made the bar jump roughly every 5–10 s of wall time instead of
moving continuously like the old decoder-driven progress did.

A 400 ms ticker thread now interpolates inside the current chunk using
a rolling EMA of per-chunk wall time (seeded at 6 s, alpha 0.4) to
estimate where we are between the chunk's start and end fraction. The
notification rebuild is skipped when the rounded % is unchanged so the
foreground notif isn't churned several times per second.
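
A rough plain-Java sketch of that interpolation, using the seed (6 s) and alpha (0.4) quoted above. The names (`ChunkProgressInterpolator`, `onChunkStarted`, `onChunkFinished`, `estimate`) are hypothetical, and clamping the estimate so it never passes the current chunk's end fraction is an assumption.

```java
/** Hypothetical interpolator: estimates progress inside the chunk being transcribed. */
final class ChunkProgressInterpolator {
    private static final double ALPHA = 0.4;
    private double emaChunkMs = 6_000; // seed: assume 6 s of wall time per chunk
    private double startFraction, endFraction;
    private long chunkStartMs;

    synchronized void onChunkStarted(double startFraction, double endFraction, long nowMs) {
        this.startFraction = startFraction;
        this.endFraction = endFraction;
        this.chunkStartMs = nowMs;
    }

    /** Folds the finished chunk's wall time into the rolling average. */
    synchronized void onChunkFinished(long nowMs) {
        long wallMs = nowMs - chunkStartMs;
        emaChunkMs = ALPHA * wallMs + (1 - ALPHA) * emaChunkMs;
    }

    /** Called by the ticker: overall fraction, interpolated within the current chunk. */
    synchronized double estimate(long nowMs) {
        double within = (nowMs - chunkStartMs) / emaChunkMs;
        within = Math.max(0.0, Math.min(1.0, within)); // assumed clamp to the chunk's span
        return startFraction + (endFraction - startFraction) * within;
    }
}
```

With the seed in place, a chunk spanning 0% to 50% reports 25% after 3 s. If that chunk actually takes 10 s, the average moves to 0.4 × 10 s + 0.6 × 6 s = 7.6 s for the next one.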

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Long-press a segment in TranscriptActivity to enter selection mode.
Tap or long-press additional segments to toggle them, then copy or
share the selection as "[mm:ss] text" lines via the toolbar. The
toolbar swaps to an X-nav + Copy / Share / Select all menu, with
hardware back wired through OnBackPressedCallback to exit cleanly.
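
For reference, the "[mm:ss] text" export format could be produced along these lines. This is a plain-Java sketch with hypothetical names (`SelectionFormatter`, `Segment`, `format`); emitting the lines in transcript order rather than selection order is an assumption.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical formatter for the copy/share payload of a multi-segment selection. */
final class SelectionFormatter {
    static final class Segment {
        final long startMs;
        final String text;
        Segment(long startMs, String text) {
            this.startMs = startMs;
            this.text = text;
        }
    }

    /** Joins the selected segments, in transcript order, one "[mm:ss] text" per line. */
    static String format(List<Segment> segments, Set<Integer> selectedIndices) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < segments.size(); i++) {
            if (!selectedIndices.contains(i)) continue;
            Segment s = segments.get(i);
            long totalSec = s.startMs / 1000;
            if (sb.length() > 0) sb.append('\n');
            // Minutes are not wrapped at 60, so a 75-minute mark prints as [75:00].
            sb.append(String.format(Locale.ROOT, "[%02d:%02d] %s",
                    totalSec / 60, totalSec % 60, s.text.trim()));
        }
        return sb.toString();
    }
}
```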

Row styling is unified into a single applyRowStyle helper that
picks selection > playhead > none, so the playhead can pass under
selected rows without disturbing them.
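
The precedence rule reduces to a small pure function. The plain-Java sketch below uses hypothetical names (`RowStyles`, `RowStyle`, `pick`); the real `applyRowStyle` helper presumably also applies the backgrounds and text styles, which are omitted here.

```java
/** Hypothetical reduction of the row-style precedence: selection > playhead > none. */
final class RowStyles {
    enum RowStyle { SELECTION, PLAYHEAD, NONE }

    static RowStyle pick(boolean selected, boolean underPlayhead) {
        if (selected) return RowStyle.SELECTION;    // selection always wins
        if (underPlayhead) return RowStyle.PLAYHEAD; // playhead only on unselected rows
        return RowStyle.NONE;
    }
}
```

Because the style is derived from both flags on every bind, a row that is both selected and under the playhead simply stays in the selection style until it is deselected.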

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…er-row menu

Drop the Audio/Transcripts toggle and the player-bar transcript button in
favour of a single recordings list where each row carries its own transcript
state inline. Rows with a transcript show an italic preview snippet and the
transcript icon; rows without one show a dim "Transcribe" prompt. Tapping
either opens the full transcript view, which already handles both idle and
ready states.

Add a per-row 3-dot overflow menu for single-item operations (rename, open
with, share/delete audio, copy/share/delete transcript) and trim the CAB to
bulk-only operations (share, delete, delete transcript, select all). This
removes the icon doubling that occurred when the CAB tried to host both
audio and transcript actions for one selection.

Subscribe PlayerFragment to TranscriptionCompleted so the affected row's
indicator refreshes in place once a transcription finishes, no longer
requiring an app restart to see the new state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… full-screen view

Deleting a transcript from the row's 3-dot popup already refreshed the list
because the adapter calls refreshListener.refreshRecordings(). Deleting from
the toolbar inside TranscriptActivity skipped that path, so the row's
indicator kept showing the stale preview snippet until the app was reopened.

Add Events.TranscriptDeleted, post it from TranscriptActivity's delete flow,
and subscribe in PlayerFragment to recompute the preview map. The affected
row's indicator now flips back to the "Transcribe" prompt in place.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants