diff --git a/agent_notes.md b/agent_notes.md
new file mode 100644
index 00000000..213088a5
--- /dev/null
+++ b/agent_notes.md
@@ -0,0 +1,52 @@
+# Speaker Segment Persistence & Error Handling Fixes
+
+## Changes Summary
+
+1. **Backend Models**: Added `SpeakerSegment` model to `internal/models/transcription.go` to persist timestamped audio segments for each identified speaker. Added it to GORM auto-migration.
+2. **Database Layer**:
+   - Updated the `JobRepository` interface in `internal/repository/implementations.go` with `SaveSpeakerSegments` and `GetSegmentsBySpeakerID`.
+   - Implemented these methods in `jobRepository`.
+3. **Transcription Pipeline**:
+   - Updated `UnifiedTranscriptionService.saveTranscriptionResults` in `internal/transcription/unified_service.go` to automatically extract and save speaker segments after successful transcription.
+4. **API Layer**:
+   - Added a `GET /api/v1/speakers/:id/segments` endpoint in `internal/api/speaker_handlers.go`.
+   - Registered the new route in `internal/api/router.go`.
+5. **Speaker Management Fixes**:
+   - Corrected a Go-style syntax error (`func` instead of `def`) in `internal/transcription/adapters/py/nvidia/titanet_manage.py`.
+   - Enhanced `TitanetAdapter` to capture and return `stderr` from Python commands for better diagnostics.
+   - Updated API handlers to return these descriptive error messages to the frontend.
+6. **Frontend Enhancements**:
+   - Updated `web/frontend/src/lib/speakersApi.ts` to include the `getSegments` method and improved error parsing from API responses.
+   - Updated `AudioFilesTable.tsx` to display speaker names in the table view.
+7. **Tests**: Updated `MockJobRepository` in the test suites to match the new interface; all `internal/transcription` tests passed.
+
+## Environment Resolution
+- The `uv run` issue in `data/whisperx-env/parakeet/` was resolved by running `uv lock` (performed by the user), fixing dependency resolution for the private registry.
+- The syntax error in `titanet_manage.py` was manually patched in both the source and the active environment.
+
+
+* * *
+# Speaker Persistence Implementation
+
+Implemented global speaker identity tracking with high-dimensional embedding storage.
+
+## Backend Changes
+- Added `SpeakerSegment` and `SpeakerJobCentroid` models in SQLite.
+- Updated `UnifiedTranscriptionService` to save reference segments and job-level centroids.
+- Enhanced `titanet_identify.py` to extract and return segment-level embeddings and the calculated centroid.
+- Added `SaveSpeakerJobCentroids` to `JobRepository`.
+- Updated API routes and handlers for speaker management (rename, list, delete).
+- Fixed build errors in `unified_service.go` related to variable scope and function signatures.
+
+## Frontend Changes
+- Created a "Speakers" tab in the Settings page.
+- Implemented an `AudioChip` component that plays speaker voice samples using browser-side seeking.
+- Added global speaker renaming and deletion capabilities.
+- Optimized API calls to handle large transcript payloads by removing redundant preloads in the segments endpoint.
+
+## Format & Consistency
+- Standardized speaker IDs in the database (supporting multiple prefix formats such as `Speaker-` and `Spk-`).
+- Implemented trailing-slash consistency for Gin routing.
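For quick manual verification, the new segments endpoint can be exercised from Python. This is a minimal sketch: the base URL and port follow the updated compose files, the helper names (`segments_url`, `get_speaker_segments`) are illustrative, and any authentication the API may enforce is omitted.

```python
import json
import urllib.parse
import urllib.request

def segments_url(base: str, speaker_id: str) -> str:
    """Build the URL for GET /api/v1/speakers/{id}/segments."""
    return f"{base}/speakers/{urllib.parse.quote(speaker_id)}/segments"

def get_speaker_segments(base: str, speaker_id: str) -> list:
    """Fetch all persisted segments for one speaker as a decoded JSON array."""
    with urllib.request.urlopen(segments_url(base, speaker_id)) as resp:
        return json.load(resp)

# e.g. get_speaker_segments("http://localhost:5318/api/v1", "Spk-1")
```

The URL builder quotes the speaker ID so that both `Speaker-` and `Spk-` style IDs (and IDs containing spaces) are passed through safely.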
+ + +* * * diff --git a/api-docs/docs.go b/api-docs/docs.go index ff55538a..e00eac62 100644 --- a/api-docs/docs.go +++ b/api-docs/docs.go @@ -1602,6 +1602,163 @@ const docTemplate = `{ } } }, + "/api/v1/speakers": { + "get": { + "description": "Get a list of all identified speakers", + "produces": [ + "application/json" + ], + "tags": [ + "speakers" + ], + "summary": "List speakers", + "responses": { + "200": { + "description": "OK", + "schema": { + "type": "array", + "items": {} + } + }, + "500": { + "description": "Internal Server Error", + "schema": { + "$ref": "#/definitions/api.ErrorResponse" + } + } + } + } + }, + "/api/v1/speakers/{id}": { + "put": { + "description": "Rename an identified speaker and update past transcripts", + "consumes": [ + "application/json" + ], + "produces": [ + "application/json" + ], + "tags": [ + "speakers" + ], + "summary": "Rename speaker", + "parameters": [ + { + "type": "string", + "description": "Speaker ID", + "name": "id", + "in": "path", + "required": true + }, + { + "description": "New Name", + "name": "request", + "in": "body", + "required": true, + "schema": { + "$ref": "#/definitions/api.RenameSpeakerRequest" + } + } + ], + "responses": { + "200": { + "description": "OK", + "schema": { + "type": "object", + "additionalProperties": { + "type": "string" + } + } + }, + "400": { + "description": "Bad Request", + "schema": { + "$ref": "#/definitions/api.ErrorResponse" + } + }, + "500": { + "description": "Internal Server Error", + "schema": { + "$ref": "#/definitions/api.ErrorResponse" + } + } + } + }, + "delete": { + "description": "Delete a speaker identity", + "produces": [ + "application/json" + ], + "tags": [ + "speakers" + ], + "summary": "Delete speaker", + "parameters": [ + { + "type": "string", + "description": "Speaker ID", + "name": "id", + "in": "path", + "required": true + } + ], + "responses": { + "200": { + "description": "OK", + "schema": { + "type": "object", + "additionalProperties": { + "type": "string" + } 
+ } + }, + "500": { + "description": "Internal Server Error", + "schema": { + "$ref": "#/definitions/api.ErrorResponse" + } + } + } + } + }, + "/api/v1/speakers/{id}/segments": { + "get": { + "description": "Get all audio segments and their associated transcription jobs for a speaker", + "produces": [ + "application/json" + ], + "tags": [ + "speakers" + ], + "summary": "Get speaker segments", + "parameters": [ + { + "type": "string", + "description": "Speaker ID", + "name": "id", + "in": "path", + "required": true + } + ], + "responses": { + "200": { + "description": "OK", + "schema": { + "type": "array", + "items": { + "$ref": "#/definitions/models.SpeakerSegment" + } + } + }, + "500": { + "description": "Internal Server Error", + "schema": { + "$ref": "#/definitions/api.ErrorResponse" + } + } + } + } + }, "/api/v1/summaries": { "get": { "security": [ @@ -4256,6 +4413,17 @@ const docTemplate = `{ } } }, + "api.RenameSpeakerRequest": { + "type": "object", + "required": [ + "name" + ], + "properties": { + "name": { + "type": "string" + } + } + }, "api.SetUserDefaultProfileRequest": { "type": "object", "required": [ @@ -4535,6 +4703,81 @@ const docTemplate = `{ } } }, + "models.SpeakerMapping": { + "type": "object", + "properties": { + "created_at": { + "type": "string" + }, + "custom_name": { + "description": "e.g., \"John Doe\"", + "type": "string" + }, + "id": { + "type": "integer" + }, + "original_speaker": { + "description": "e.g., \"speaker_00\"", + "type": "string" + }, + "transcription_job": { + "description": "Relationships", + "allOf": [ + { + "$ref": "#/definitions/models.TranscriptionJob" + } + ] + }, + "transcription_job_id": { + "type": "string" + }, + "updated_at": { + "type": "string" + } + } + }, + "models.SpeakerSegment": { + "type": "object", + "properties": { + "created_at": { + "type": "string" + }, + "embedding": { + "description": "JSON-serialized float32 array", + "type": "array", + "items": { + "type": "integer" + } + }, + "end": { + "type": 
"number" + }, + "id": { + "type": "integer" + }, + "speaker_id": { + "description": "The global speaker ID (UUID) or local name", + "type": "string" + }, + "start": { + "type": "number" + }, + "text": { + "type": "string" + }, + "transcription_job": { + "description": "Relationships", + "allOf": [ + { + "$ref": "#/definitions/models.TranscriptionJob" + } + ] + }, + "transcription_job_id": { + "type": "string" + } + } + }, "models.Summary": { "type": "object", "properties": { @@ -4654,6 +4897,12 @@ const docTemplate = `{ } ] }, + "speaker_mappings": { + "type": "array", + "items": { + "$ref": "#/definitions/models.SpeakerMapping" + } + }, "status": { "$ref": "#/definitions/models.JobStatus" }, diff --git a/api-docs/swagger.json b/api-docs/swagger.json index 6d7d5e80..e456e2fd 100644 --- a/api-docs/swagger.json +++ b/api-docs/swagger.json @@ -1596,6 +1596,163 @@ } } }, + "/api/v1/speakers": { + "get": { + "description": "Get a list of all identified speakers", + "produces": [ + "application/json" + ], + "tags": [ + "speakers" + ], + "summary": "List speakers", + "responses": { + "200": { + "description": "OK", + "schema": { + "type": "array", + "items": {} + } + }, + "500": { + "description": "Internal Server Error", + "schema": { + "$ref": "#/definitions/api.ErrorResponse" + } + } + } + } + }, + "/api/v1/speakers/{id}": { + "put": { + "description": "Rename an identified speaker and update past transcripts", + "consumes": [ + "application/json" + ], + "produces": [ + "application/json" + ], + "tags": [ + "speakers" + ], + "summary": "Rename speaker", + "parameters": [ + { + "type": "string", + "description": "Speaker ID", + "name": "id", + "in": "path", + "required": true + }, + { + "description": "New Name", + "name": "request", + "in": "body", + "required": true, + "schema": { + "$ref": "#/definitions/api.RenameSpeakerRequest" + } + } + ], + "responses": { + "200": { + "description": "OK", + "schema": { + "type": "object", + "additionalProperties": { + "type": 
"string" + } + } + }, + "400": { + "description": "Bad Request", + "schema": { + "$ref": "#/definitions/api.ErrorResponse" + } + }, + "500": { + "description": "Internal Server Error", + "schema": { + "$ref": "#/definitions/api.ErrorResponse" + } + } + } + }, + "delete": { + "description": "Delete a speaker identity", + "produces": [ + "application/json" + ], + "tags": [ + "speakers" + ], + "summary": "Delete speaker", + "parameters": [ + { + "type": "string", + "description": "Speaker ID", + "name": "id", + "in": "path", + "required": true + } + ], + "responses": { + "200": { + "description": "OK", + "schema": { + "type": "object", + "additionalProperties": { + "type": "string" + } + } + }, + "500": { + "description": "Internal Server Error", + "schema": { + "$ref": "#/definitions/api.ErrorResponse" + } + } + } + } + }, + "/api/v1/speakers/{id}/segments": { + "get": { + "description": "Get all audio segments and their associated transcription jobs for a speaker", + "produces": [ + "application/json" + ], + "tags": [ + "speakers" + ], + "summary": "Get speaker segments", + "parameters": [ + { + "type": "string", + "description": "Speaker ID", + "name": "id", + "in": "path", + "required": true + } + ], + "responses": { + "200": { + "description": "OK", + "schema": { + "type": "array", + "items": { + "$ref": "#/definitions/models.SpeakerSegment" + } + } + }, + "500": { + "description": "Internal Server Error", + "schema": { + "$ref": "#/definitions/api.ErrorResponse" + } + } + } + } + }, "/api/v1/summaries": { "get": { "security": [ @@ -4250,6 +4407,17 @@ } } }, + "api.RenameSpeakerRequest": { + "type": "object", + "required": [ + "name" + ], + "properties": { + "name": { + "type": "string" + } + } + }, "api.SetUserDefaultProfileRequest": { "type": "object", "required": [ @@ -4529,6 +4697,81 @@ } } }, + "models.SpeakerMapping": { + "type": "object", + "properties": { + "created_at": { + "type": "string" + }, + "custom_name": { + "description": "e.g., \"John Doe\"", + 
"type": "string" + }, + "id": { + "type": "integer" + }, + "original_speaker": { + "description": "e.g., \"speaker_00\"", + "type": "string" + }, + "transcription_job": { + "description": "Relationships", + "allOf": [ + { + "$ref": "#/definitions/models.TranscriptionJob" + } + ] + }, + "transcription_job_id": { + "type": "string" + }, + "updated_at": { + "type": "string" + } + } + }, + "models.SpeakerSegment": { + "type": "object", + "properties": { + "created_at": { + "type": "string" + }, + "embedding": { + "description": "JSON-serialized float32 array", + "type": "array", + "items": { + "type": "integer" + } + }, + "end": { + "type": "number" + }, + "id": { + "type": "integer" + }, + "speaker_id": { + "description": "The global speaker ID (UUID) or local name", + "type": "string" + }, + "start": { + "type": "number" + }, + "text": { + "type": "string" + }, + "transcription_job": { + "description": "Relationships", + "allOf": [ + { + "$ref": "#/definitions/models.TranscriptionJob" + } + ] + }, + "transcription_job_id": { + "type": "string" + } + } + }, "models.Summary": { "type": "object", "properties": { @@ -4648,6 +4891,12 @@ } ] }, + "speaker_mappings": { + "type": "array", + "items": { + "$ref": "#/definitions/models.SpeakerMapping" + } + }, "status": { "$ref": "#/definitions/models.JobStatus" }, diff --git a/api-docs/swagger.yaml b/api-docs/swagger.yaml index 24051b1f..d9ebfdc0 100644 --- a/api-docs/swagger.yaml +++ b/api-docs/swagger.yaml @@ -291,6 +291,13 @@ definitions: description: Match tests expecting snake_case key type: boolean type: object + api.RenameSpeakerRequest: + properties: + name: + type: string + required: + - name + type: object api.SetUserDefaultProfileRequest: properties: profile_id: @@ -479,6 +486,54 @@ definitions: updated_at: type: string type: object + models.SpeakerMapping: + properties: + created_at: + type: string + custom_name: + description: e.g., "John Doe" + type: string + id: + type: integer + original_speaker: + description: 
e.g., "speaker_00" + type: string + transcription_job: + allOf: + - $ref: '#/definitions/models.TranscriptionJob' + description: Relationships + transcription_job_id: + type: string + updated_at: + type: string + type: object + models.SpeakerSegment: + properties: + created_at: + type: string + embedding: + description: JSON-serialized float32 array + items: + type: integer + type: array + end: + type: number + id: + type: integer + speaker_id: + description: The global speaker ID (UUID) or local name + type: string + start: + type: number + text: + type: string + transcription_job: + allOf: + - $ref: '#/definitions/models.TranscriptionJob' + description: Relationships + transcription_job_id: + type: string + type: object models.Summary: properties: content: @@ -556,6 +611,10 @@ definitions: allOf: - $ref: '#/definitions/models.WhisperXParams' description: WhisperX parameters + speaker_mappings: + items: + $ref: '#/definitions/models.SpeakerMapping' + type: array status: $ref: '#/definitions/models.JobStatus' summary: @@ -1777,6 +1836,111 @@ paths: summary: Set default transcription profile tags: - profiles + /api/v1/speakers: + get: + description: Get a list of all identified speakers + produces: + - application/json + responses: + "200": + description: OK + schema: + items: {} + type: array + "500": + description: Internal Server Error + schema: + $ref: '#/definitions/api.ErrorResponse' + summary: List speakers + tags: + - speakers + /api/v1/speakers/{id}: + delete: + description: Delete a speaker identity + parameters: + - description: Speaker ID + in: path + name: id + required: true + type: string + produces: + - application/json + responses: + "200": + description: OK + schema: + additionalProperties: + type: string + type: object + "500": + description: Internal Server Error + schema: + $ref: '#/definitions/api.ErrorResponse' + summary: Delete speaker + tags: + - speakers + put: + consumes: + - application/json + description: Rename an identified speaker and 
update past transcripts + parameters: + - description: Speaker ID + in: path + name: id + required: true + type: string + - description: New Name + in: body + name: request + required: true + schema: + $ref: '#/definitions/api.RenameSpeakerRequest' + produces: + - application/json + responses: + "200": + description: OK + schema: + additionalProperties: + type: string + type: object + "400": + description: Bad Request + schema: + $ref: '#/definitions/api.ErrorResponse' + "500": + description: Internal Server Error + schema: + $ref: '#/definitions/api.ErrorResponse' + summary: Rename speaker + tags: + - speakers + /api/v1/speakers/{id}/segments: + get: + description: Get all audio segments and their associated transcription jobs + for a speaker + parameters: + - description: Speaker ID + in: path + name: id + required: true + type: string + produces: + - application/json + responses: + "200": + description: OK + schema: + items: + $ref: '#/definitions/models.SpeakerSegment' + type: array + "500": + description: Internal Server Error + schema: + $ref: '#/definitions/api.ErrorResponse' + summary: Get speaker segments + tags: + - speakers /api/v1/summaries: get: description: Get all summarization templates diff --git a/docker-compose.blackwell.yml b/docker-compose.blackwell.yml index 2a170c95..540c10d6 100644 --- a/docker-compose.blackwell.yml +++ b/docker-compose.blackwell.yml @@ -6,7 +6,7 @@ services: scriberr: image: ghcr.io/rishikanthc/scriberr-cuda-blackwell:latest ports: - - "8080:8080" + - "5318:5318" volumes: - scriberr_data:/app/data - env_data:/app/whisperx-env diff --git a/docker-compose.build.blackwell.yml b/docker-compose.build.blackwell.yml index d526c0dd..4ad34de7 100644 --- a/docker-compose.build.blackwell.yml +++ b/docker-compose.build.blackwell.yml @@ -9,7 +9,7 @@ services: image: scriberr:local-blackwell container_name: scriberr-blackwell ports: - - "8080:8080" + - "5318:5318" deploy: resources: reservations: diff --git 
a/docker-compose.build.cuda.yml b/docker-compose.build.cuda.yml index 04896242..300a382d 100644 --- a/docker-compose.build.cuda.yml +++ b/docker-compose.build.cuda.yml @@ -8,7 +8,7 @@ services: image: scriberr:local-cuda container_name: scriberr ports: - - "8080:8080" + - "5318:5318" deploy: resources: reservations: @@ -22,7 +22,7 @@ services: - NVIDIA_DRIVER_CAPABILITIES=compute,utility # environment: # - HOST=0.0.0.0 - # - PORT=8080 + # - PORT=5318 # - DATABASE_PATH=/app/data/scriberr.db # - UPLOAD_DIR=/app/data/uploads - PUID=${PUID:-1000} diff --git a/docker-compose.build.yml b/docker-compose.build.yml index f86d9d85..247642c6 100644 --- a/docker-compose.build.yml +++ b/docker-compose.build.yml @@ -11,10 +11,10 @@ services: # image: ghcr.io/rishikanthc/scriberr:v1.0.0 container_name: scriberr ports: - - "8080:8080" + - "5318:5318" # environment: # - HOST=0.0.0.0 - # - PORT=8080 + # - PORT=5318 # - DATABASE_PATH=/app/data/scriberr.db # - UPLOAD_DIR=/app/data/uploads # - PUID=${PUID:-1000} diff --git a/docker-compose.cuda.yml b/docker-compose.cuda.yml index 19e3ea79..5658b6e2 100644 --- a/docker-compose.cuda.yml +++ b/docker-compose.cuda.yml @@ -3,11 +3,13 @@ services: scriberr: image: ghcr.io/rishikanthc/scriberr:v1.0.4-cuda ports: - - "8080:8080" + - "5318:5318" volumes: - scriberr_data:/app/data - env_data:/app/whisperx-env restart: unless-stopped + depends_on: + - qdrant deploy: resources: reservations: @@ -19,6 +21,7 @@ environment: - NVIDIA_VISIBLE_DEVICES=all - NVIDIA_DRIVER_CAPABILITIES=compute,utility + - QDRANT_HOST=qdrant - PUID=${PUID:-1000} - PGID=${PGID:-1000} # Security: already set in container, but can be overridden @@ -28,4 +31,13 @@ + + qdrant: + image: qdrant/qdrant:latest + ports: + - "6333:6333" + volumes: + - qdrant_storage:/qdrant/storage + restart: unless-stopped + volumes: scriberr_data: {} + qdrant_storage: {} env_data: {} diff --git a/docker-compose.yml b/docker-compose.yml index ebc5b5f1..ecd5eff7 100644 ---
a/docker-compose.yml +++ b/docker-compose.yml @@ -2,7 +2,7 @@ services: scriberr: image: ghcr.io/rishikanthc/scriberr:latest ports: - - "8080:8080" + - "5318:5318" volumes: - scriberr_data:/app/data - env_data:/app/whisperx-env @@ -14,7 +14,18 @@ services: # CORS: comma-separated list of allowed origins for production # - ALLOWED_ORIGINS=https://your-domain.com restart: unless-stopped + depends_on: + - qdrant + + qdrant: + image: qdrant/qdrant:latest + ports: + - "6333:6333" + volumes: + - qdrant_storage:/qdrant/storage + restart: unless-stopped volumes: scriberr_data: {} + qdrant_storage: env_data: {} diff --git a/optimization_report.md b/optimization_report.md new file mode 100644 index 00000000..3a23e02c --- /dev/null +++ b/optimization_report.md @@ -0,0 +1,65 @@ +# Canary Transcription Optimization Report + +## Objective +Investigate and implement optimizations to speed up NVIDIA Canary transcription on Apple Silicon (Mac M1/M2/M3). + +## Findings + +### 1. PyTorch Installation Issue +The existing `pyproject.toml` contained a configuration that explicitly forced the installation of `pytorch-cpu` on macOS (`sys_platform == 'darwin'`). + +```toml +torch = [ + { index = "pytorch-cpu", marker = "sys_platform == 'darwin'" }, + ... +] +``` + +This prevents the installation of the standard PyTorch wheel from PyPI, which includes support for the Metal Performance Shaders (MPS) backend required for GPU acceleration on Apple Silicon. + +**Action Taken:** Modified `internal/transcription/adapters/py/nvidia/pyproject.toml` to remove this constraint, allowing `uv` to install the correct GPU-accelerated version of PyTorch for macOS. + +### 2. Lack of Explicit MPS Device Support +The `canary_transcribe.py` script relied on default device placement, which typically defaults to CPU unless CUDA is available. It did not check for or utilize the `mps` device available on macOS. 
+
+**Action Taken:** Updated `internal/transcription/adapters/py/nvidia/canary_transcribe.py` to:
+* Detect whether `torch.backends.mps.is_available()` returns true.
+* Select the `mps` device when it is available (and CUDA is not).
+* Pass `map_location=device` when loading the model to ensure tensors are allocated on the correct device.
+* Explicitly move the model to the device using `asr_model.to(device)`.
+* Set `PYTORCH_ENABLE_MPS_FALLBACK=1` when MPS is used, as recommended by NVIDIA NeMo documentation, to handle operations not yet implemented on MPS.
+
+### 3. Verification
+A test script, `test_canary_mock.py`, was created to simulate the environment. It verified that:
+* When `torch.backends.mps.is_available()` is mocked to return `True`, the script selects the `mps` device.
+* When it returns `False`, the script falls back to `cpu`.
+
+### 4. Profiling
+Integrated `torch.profiler` to enable detailed performance analysis.
+* **Usage:** Run the script with `--profile` to generate a Chrome trace file (default: `trace.json`).
+* **Visualization:** Open `chrome://tracing` (or `edge://tracing`) in a Chrome/Edge browser and load the JSON file to inspect the CPU/GPU execution timeline, operation durations, and bottlenecks.
+
+## Further Recommendations
+
+### Half-Precision (FP16/BF16)
+While MPS supports Float16, it can sometimes be numerically unstable depending on the model layers, so the current change keeps the default precision (Float32). If further speedup is needed, one could try converting the model to half precision:
+
+```python
+if device_type == "mps":
+    asr_model = asr_model.half()
+```
+
+However, this requires verifying that the Canary model (an encoder-decoder model) still produces accurate results in FP16 on MPS.
+
+### Batch Processing
+The current script processes a single audio file at a time. If the use case involves processing many small files, batching them into a single `transcribe` call (providing a list of paths) would significantly improve throughput.
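The device-selection order described under **Action Taken** above, and the mock-based check in the Verification section, can be condensed into a pure-Python sketch. The availability flags are passed in rather than queried from `torch` (mirroring the mocked test, not the real script), and the function name is illustrative.

```python
import os

def select_device(cuda_available: bool, mps_available: bool) -> str:
    """Selection order used by the script: CUDA first, then MPS, else CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        # Per the NeMo guidance noted above: allow CPU fallback
        # for operations not yet implemented on MPS.
        os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"
        return "mps"
    return "cpu"
```

In the real script the flags come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`, and the returned string is used for both `map_location` and `asr_model.to(device)`.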
+
+### torch.compile
+Using `torch.compile` (introduced in PyTorch 2.0) on Apple Silicon (the MPS backend) is currently experimental and not recommended for this specific use case.
+
+**Detailed Technical Context:**
+* **Backend Limitations:** The default `inductor` backend, which provides the most significant speedups on NVIDIA GPUs (via Triton code generation), does not yet natively support MPS. To use compilation on a Mac, one must use the `aot_eager` backend (`torch.compile(model, backend="aot_eager")`), which offers minimal to no performance gain over standard eager mode.
+* **NeMo Compatibility:** Complex encoder-decoder models like Canary often use dynamic control flow and shapes (e.g., varying audio lengths, beam-search decoding). These are historically challenging for graph-capture mechanisms (TorchDynamo) to optimize effectively without significant "graph breaks," which can negate performance gains or cause crashes.
+* **Stability:** Current community reports and issues indicate frequent compilation failures (`BackendCompilerFailed`) on MPS when attempting to compile complex neural network graphs.
+
+**Recommendation:** Stick to standard eager mode with MPS acceleration (as implemented). Re-evaluate `torch.compile` once PyTorch releases stable MPS support for the `inductor` backend or a specialized Metal backend.

diff --git a/slow_env_startup.md b/slow_env_startup.md
new file mode 100644
index 00000000..caab93ce
--- /dev/null
+++ b/slow_env_startup.md
@@ -0,0 +1,54 @@
+# Analysis: Slow Environment Startup in Scriberr
+
+## Problem Statement
+Application startup is delayed by approximately 8.5 seconds due to environment readiness checks performed by the model adapters. These checks verify that the Python virtual environments are set up correctly and that the necessary libraries can be imported before processing starts.
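The durations in the Measurements section below came from timing `uv run` invocations; a harness of roughly this shape reproduces such measurements (the function name is illustrative, and the example command in the comment assumes the environment layout described in this document):

```python
import subprocess
import sys
import time

def time_cmd(cmd: list[str]) -> float:
    """Wall-clock seconds for one cold run of a readiness-check command."""
    t0 = time.monotonic()
    subprocess.run(cmd, check=True, capture_output=True)
    return time.monotonic() - t0

# e.g. time_cmd(["uv", "run", "python", "-c", "import nemo.collections.asr"])
```

Note that import times are dominated by cold-start effects (filesystem cache, bytecode compilation), so each command should be timed on a fresh process, as above, rather than inside a warm interpreter.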
+
+## Measurements
+Measured on current hardware using `uv run` for the various environment checks:
+
+| Check / Adapter | Command | Duration |
+|---|---|---|
+| **NVIDIA (NeMo)** | `python -c "import nemo.collections.asr"` | **~8.3s** |
+| **WhisperX** | `python -c "import whisperx"` | ~0.9s |
+| **PyAnnote** | `python -c "from pyannote.audio import Pipeline"` | ~2.5s |
+| **Baseline** | `uv run ... python --version` | ~0.6s |
+
+## Root Cause
+The `CheckEnvironmentReady` function in `internal/transcription/adapters/base_adapter.go` launches a full Python interpreter and imports heavy libraries (such as `torch` and `nemo`) to verify the environment.
+
+While these checks are parallelized in `ModelRegistry.InitializeModels`, the total startup delay is governed by the slowest component (the NVIDIA adapters), leading to a mandatory ~8s wait every time the server starts.
+
+## Environment Variable Tuning
+We tested several environment variables to see whether they could suppress slow initialization logic (such as CUDA discovery or network checks):
+
+| Environment Variables | Effect on NeMo Import | Result |
+|---|---|---|
+| `TRANSFORMERS_OFFLINE=1` `HF_HUB_OFFLINE=1` | Skip network checks for models | No significant change (~8.2s) |
+| `CUDA_VISIBLE_DEVICES=""` | Skip CUDA/GPU discovery and initialization | **-1.1s** (~7.1s) |
+| `NEMO_DISABLE_IMPORT_CHECKS=1` | Skip internal NeMo dependency verification | **-1.5s** (~6.7s) |
+
+**Conclusion**: While disabling CUDA and import checks helps, the core overhead remains high due to the sheer size of the `nemo.collections.asr` and `torch` submodules.
+
+## Proposed Optimizations
+
+### 1. Sentinel File (Recommended)
+Instead of running a Python command, the system can create a sentinel file (e.g., `.scriberr_ready`) inside the environment directory after a successful `uv sync` and initial model download.
+- **Benefit**: Reduces check time to nearly 0ms.
+- **Implementation**: Adapters check for the file's existence.
If missing, they run the full `PrepareEnvironment` logic and create the file upon success.
+
+### 2. Lighter Check (Implemented)
+Change the `importStatement` check to a simple top-level package import (e.g., `import nemo` instead of `import nemo.collections.asr`).
+- **Benefit**: Reduces check time from 8.3s to **~0.6s**.
+- **Reasoning**: Top-level packages in these libraries often have very light `__init__.py` files that don't trigger the loading of heavy submodules like Torch or CUDA.
+
+### 3. Asynchronous Initialization (Implemented)
+Ensured `InitializeModels` runs in the background and does not block the main API from becoming ready.
+- **Effect**: Server startup is now effectively instantaneous (from the Go perspective), while models continue to prepare their environments in the background.
+- **Verification**: `registry.InitializeModels` now returns immediately after launching background goroutines.
+
+## Progress Made
+- **Optimized Readiness Checks**: Updated the NVIDIA, Sortformer, and PyAnnote adapters to use top-level package imports, reducing initialization time by ~14x.
+- **Background Initialization**: Refactored `ModelRegistry` to perform environment preparation asynchronously.
+- **Added Timing Logs**: `base_adapter.go` now logs the duration of every environment check.
+- **Added Regression Test**: `TestParakeetPrepareEnvironment` in `internal/transcription/adapters_test.go` can be used to monitor startup performance.
+
diff --git a/web/project-site/public/api/swagger.json b/web/project-site/public/api/swagger.json index 6d7d5e80..1c3ff219 100644 --- a/web/project-site/public/api/swagger.json +++ b/web/project-site/public/api/swagger.json @@ -15,7 +15,7 @@ }, "version": "1.0" }, - "host": "localhost:8080", + "host": "localhost:5318", "basePath": "/api/v1", "paths": { "/api/v1/admin/queue/stats": { @@ -4982,4 +4982,4 @@ "in": "header" } } -} \ No newline at end of file +}