This repo is now a Cog/Replicate wrapper around i4Ds/stt4sg-transcribe, with the pipeline vendored locally and a repo-local model/cache layout so the same assets are available both on the host and inside the built image.
- Replaced the old WhisperX-only predictor with the STT4SG transcription pipeline.
- Added a uv project (pyproject.toml, uv.lock once generated) for local development.
- Added repo-local cache helpers so Whisper, alignment, Silero, and pyannote assets can live under model-store/.
- Exposed the upstream pipeline parameters through Cog inputs instead of hardcoding a narrow set.
Downloaded assets are stored locally in this repository:
model-store/
├── whisper/ # local faster-whisper model directories
├── alignment/ # torchaudio / HF alignment models
├── huggingface/ # pyannote and other HF caches
├── torch/ # torch / torchaudio cache
├── xdg/ # xdg cache used by some backends
├── speechbrain/ # optional speechbrain cached models
└── nemo/ # optional nemo assets
That directory is copied into the Cog build context, so pre-downloaded assets are available inside the container without extra runtime downloads.
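As a rough illustration of how a repo-local cache layout like this can be wired up, the sketch below points the standard cache environment variables at the matching model-store/ subdirectories. The function name `configure_model_store` and the exact variable-to-directory mapping are assumptions made for illustration; the repo's own cache helpers may work differently.

```python
import os
from pathlib import Path

# Subdirectory names follow the model-store/ tree above. The mapping to
# environment variables is an assumption, not the repo's actual helper.
CACHE_ENV = {
    "HF_HOME": "huggingface",   # Hugging Face hub cache root
    "TORCH_HOME": "torch",      # torch / torchaudio cache root
    "XDG_CACHE_HOME": "xdg",    # generic cache root used by some backends
}


def configure_model_store(root: Path) -> dict[str, str]:
    """Create the cache subdirectories and export the matching env vars."""
    applied = {}
    for var, subdir in CACHE_ENV.items():
        target = (root / subdir).resolve()
        target.mkdir(parents=True, exist_ok=True)
        os.environ[var] = str(target)
        applied[var] = str(target)
    return applied


if __name__ == "__main__":
    for var, path in configure_model_store(Path("model-store")).items():
        print(f"{var}={path}")
```

Variables like these generally need to be set before the libraries that read them are imported, which is why a helper of this kind would typically run at the very top of the entry point.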
Create the local environment:
uv venv
source .venv/bin/activate
uv sync

Prefetch the default local assets:
uv run python scripts/download_assets.py

Notes:
- Default Whisper model: i4ds/daily-brook-134
- Default alignment model: VOXPOPULI_ASR_BASE_10K_DE
- Silero is prefetched automatically.
- Pyannote downloads require a Hugging Face token or an existing local HF login. The script will use your saved local login if one exists.
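The token lookup described in the last note could look roughly like the sketch below. It is not the code in scripts/download_assets.py: the function name `resolve_hf_token` and the lookup order (explicit value, then the `HF_TOKEN` environment variable, then the token file that a Hugging Face login saves under `HF_HOME`) are assumptions chosen for illustration.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_hf_token(explicit: Optional[str] = None) -> Optional[str]:
    """Return a Hugging Face token, or None if none can be found."""
    # 1. A token passed in directly wins.
    if explicit:
        return explicit.strip()
    # 2. Fall back to the HF_TOKEN environment variable.
    env_token = os.environ.get("HF_TOKEN")
    if env_token:
        return env_token.strip()
    # 3. Fall back to the token file saved by a previous login
    #    (assumed location: $HF_HOME/token, default ~/.cache/huggingface).
    default_home = Path.home() / ".cache" / "huggingface"
    hf_home = Path(os.environ.get("HF_HOME", default_home))
    token_file = hf_home / "token"
    if token_file.is_file():
        saved = token_file.read_text(encoding="utf-8").strip()
        return saved or None
    return None
```

If this returns None, skipping the pyannote download and continuing with the other assets is one reasonable behaviour.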
You can prefetch additional Whisper models too:
uv run python scripts/download_assets.py \
--whisper-model i4ds/daily-brook-134 \
--whisper-model Systran/faster-whisper-large-v3

Run the vendored pipeline directly:
uv run main.py 97_Brugg.mp3

Example with explicit parameters:
uv run main.py 97_Brugg.mp3 \
--model i4ds/daily-brook-134 \
--vad-method silero \
--batch-size 8 \
--no-alignment

Build the image:
cog build

Run the included test file:
cog predict \
-i audio_file=@97_Brugg.mp3 \
-i model=i4ds/daily-brook-134 \
-i use_vad=true \
-i vad_method=silero \
-i use_alignment=true

The predictor returns:
- srt_file
- srt_content
- detected_language
- duration_seconds
- num_segments
- logs_zip (when log saving is enabled)
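A quick sanity check on a prediction result is to count the cues in srt_content and compare that number with num_segments. The sketch below assumes standard SRT timing lines (`HH:MM:SS,mmm --> HH:MM:SS,mmm`) and assumes that num_segments corresponds to the number of cues; neither is confirmed by the predictor code, and `count_srt_cues` is a hypothetical helper.

```python
import re

# One timing line per cue in standard SRT: 00:00:01,000 --> 00:00:02,500
TIMING_LINE = re.compile(
    r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}",
    re.MULTILINE,
)


def count_srt_cues(srt_content: str) -> int:
    """Count cues by counting SRT timing lines."""
    return len(TIMING_LINE.findall(srt_content))


if __name__ == "__main__":
    sample = (
        "1\n00:00:00,000 --> 00:00:02,000\nGrüezi mitenand.\n\n"
        "2\n00:00:02,000 --> 00:00:04,500\nWillkommen in Brugg.\n"
    )
    print(count_srt_cues(sample))
```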
The Cog predictor surfaces the upstream runtime controls:
- audio_file
- model
- language
- task
- log_progress
- use_vad
- vad_method
- vad_params
- diarization
- diarization_method
- diarization_params
- num_speakers
- min_speakers
- max_speakers
- use_alignment
- alignment_model
- batch_size
- device
- compute_type
- include_speaker_labels
- save_logs
- hf_token
vad_params and diarization_params are JSON strings so backend-specific parameters can be passed through unchanged.
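Decoding such a JSON-string input before handing it to a backend might look like the following minimal sketch. The helper name `parse_params` is hypothetical, and the `threshold` key in the usage example is purely illustrative, not a documented backend parameter.

```python
import json
from typing import Any, Optional


def parse_params(raw: Optional[str], name: str) -> dict[str, Any]:
    """Decode a JSON-object string into a dict; empty input means no overrides."""
    if raw is None or not raw.strip():
        return {}
    try:
        value = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"{name} is not valid JSON: {exc.msg}") from exc
    # Backends expect keyword-style parameters, so only objects are accepted.
    if not isinstance(value, dict):
        raise ValueError(
            f"{name} must be a JSON object, got {type(value).__name__}"
        )
    return value


if __name__ == "__main__":
    # "threshold" is an illustrative key, not a documented parameter.
    print(parse_params('{"threshold": 0.5}', "vad_params"))
```

Rejecting non-object JSON early gives a clear error at the Cog input boundary instead of a less obvious failure deep inside a backend.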
Log in and push the image to Replicate:
cog login
cog push r8.im/your-username/stt4sg-replicate

The intended smoke test for this repo is:
cog predict -i audio_file=@97_Brugg.mp3

If pyannote assets are not available locally, keep diarization=false and use vad_method=silero for the basic transcription path.
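The fallback above can be scripted. The sketch below builds a `cog predict` command line and only enables diarization when model-store/huggingface contains something. Both helper names are hypothetical, and a non-empty directory is only a crude stand-in for "pyannote assets are available", so treat the check as an assumption.

```python
import shlex
from pathlib import Path


def has_cached_assets(cache_dir: Path) -> bool:
    """True if the directory exists and contains at least one entry."""
    return cache_dir.is_dir() and any(cache_dir.iterdir())


def build_predict_command(audio: str, model_store: Path) -> list[str]:
    """Build a `cog predict` argv, disabling diarization without cached assets."""
    # Crude heuristic: any content under huggingface/ counts as "available".
    diarize = has_cached_assets(model_store / "huggingface")
    inputs = {
        "audio_file": f"@{audio}",  # the @ prefix passes a local file to Cog
        "vad_method": "silero",
        "diarization": "true" if diarize else "false",
    }
    cmd = ["cog", "predict"]
    for key, value in inputs.items():
        cmd += ["-i", f"{key}={value}"]
    return cmd


if __name__ == "__main__":
    command = build_predict_command("97_Brugg.mp3", Path("model-store"))
    print(shlex.join(command))
```

Returning the command as a list keeps it safe to pass to `subprocess.run` without shell quoting issues.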