Merged
33 changes: 33 additions & 0 deletions .gitignore
@@ -1,5 +1,10 @@
# System files
.DS_Store
*.png

# IDE / editor settings
.claude/
.vs/

# Python environment
.venv/
@@ -37,3 +42,31 @@ local.properties
node_modules/
__pycache__/
.cursor/

# .NET / C# / Visual Studio
[Bb]in/
[Oo]bj/
[Dd]ebug/
[Rr]elease/
*.dll
*.exe
*.pdb
*.cache
*.baml
*.resources
*.user
*.suo
*.nupkg
*.snupkg
*.g.cs
*.g.resources
*.AssemblyInfo.cs
*.editorconfig
project.assets.json
project.nuget.cache
*.nuget.dgspec.json
*.nuget.g.props
*.nuget.g.targets
publish/
installer-output/
.vs/
60 changes: 60 additions & 0 deletions PR_DESCRIPTION.md
@@ -0,0 +1,60 @@
# Add Voxtral Realtime Windows WPF Application

## Summary

Adds a Windows WPF (.NET 8) desktop application for Voxtral Realtime — a real-time speech-to-text transcription tool powered by ExecuTorch. This is the Windows counterpart to the existing macOS SwiftUI application.

## Key Features

- **Real-time transcription** — Live speech-to-text using the Voxtral model running locally via ExecuTorch
- **Dictation mode** — Speak and auto-paste transcribed text into any target application via clipboard
- **Silence detection** — Peak-based audio level monitoring with configurable silence threshold and timeout for auto-stop
- **Text processing pipeline** — Post-processing with configurable text replacements and snippet expansion
- **Session history** — Persistent transcript storage with search and browsing
- **Global hotkey** — System-wide keyboard shortcut to start/stop transcription
- **MVVM architecture** — Clean separation using `CommunityToolkit.Mvvm` with `ObservableObject` view models
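
The silence-detection feature above can be illustrated with a small, language-neutral sketch in Python. This is not the app's C# code: the `SilenceMonitor` class, the `peak_level` helper, and the default threshold and timeout values are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


def peak_level(samples):
    """Return the largest absolute sample value in a block (0.0 for an empty block)."""
    return max((abs(s) for s in samples), default=0.0)


@dataclass
class SilenceMonitor:
    """Signals auto-stop after `timeout_s` seconds of consecutive quiet audio."""
    threshold: float = 0.02   # hypothetical peak level below which a block counts as silent
    timeout_s: float = 2.0    # hypothetical silence duration that triggers auto-stop
    silent_for: float = 0.0   # length of the current silent stretch, in seconds

    def feed(self, samples, block_duration_s):
        """Process one audio block; return True once the silence timeout is reached."""
        if peak_level(samples) < self.threshold:
            self.silent_for += block_duration_s
        else:
            self.silent_for = 0.0  # any loud block resets the timer
        return self.silent_for >= self.timeout_s
```

Feeding each captured block to `feed` keeps the decision local to the audio callback; the caller only has to stop the session when it returns `True`.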

## Architecture

```
VoxtralRealtime/
├── Models/ # Data models (Session, Snippet, ReplacementEntry, Enums)
├── Converters/ # WPF value converters
├── Services/ # Core services
│ ├── RunnerBridge.cs # ExecuTorch model bridge
│ ├── AudioCaptureService.cs # NAudio-based mic capture
│ ├── ClipboardPasteService.cs # Win32 clipboard + paste
│ ├── GlobalHotkeyService.cs # System-wide hotkey registration
│ ├── TextPipeline.cs # Post-processing pipeline
│ ├── PersistenceService.cs # JSON file storage
│ └── AppLogger.cs # File-based logging
├── ViewModels/ # MVVM view models
│ ├── TranscriptStoreViewModel.cs # Central state management
│ ├── DictationViewModel.cs # Dictation flow + silence monitor
│ ├── SettingsViewModel.cs # App configuration
│ ├── ReplacementStoreViewModel.cs # Text replacements
│ └── SnippetStoreViewModel.cs # Text snippets
├── Views/ # WPF XAML views
│ ├── MainWindow # Shell with sidebar + detail layout
│ ├── WelcomeView # Home/landing page
│ ├── TranscriptView # Live transcript display
│ ├── SidebarView # Navigation sidebar
│ ├── RecordingControlsBar # Toolbar (transcribe/pause/done)
│ ├── DictationWindow # Floating dictation overlay
│ ├── AudioLevelControl # Real-time audio level meter
│ ├── SettingsView # Configuration UI
│ └── *ManagementViews # Replacement & snippet editors
└── Resources/ # Styles and assets
```

## Also Included

- `.gitignore` updated with .NET/C# build artifact rules (`bin/`, `obj/`, `publish/`, etc.)
- `build.bat` for release builds
- `upload_models.py` helper for model distribution
- `README.md` with setup and build instructions

## Test Plan

- Built and tested manually on Windows 10/11 with .NET 8
- Verified real-time transcription, dictation auto-paste, silence detection, and session persistence
73 changes: 73 additions & 0 deletions WINDOWS_APP_SKILLS.md
@@ -0,0 +1,73 @@
# Windows App Developer

## Purpose

Provides expertise in building modern Windows desktop applications using WinUI 3, WPF, and Windows App SDK. Specializes in XAML-based UI development, MVVM architecture, native Windows integration, and modern packaging with MSIX.

## When to Use

- Building Windows desktop applications with WinUI 3 or WPF
- Implementing MVVM architecture for Windows apps
- Creating XAML layouts and custom controls
- Packaging applications with MSIX
- Integrating with Windows features (notifications, taskbar, system tray)
- Migrating WPF applications to WinUI 3
- Implementing Windows-specific features (jump lists, live tiles)
- Building Microsoft Store-ready applications

## Quick Start

Invoke this skill when:

- Building Windows desktop applications with WinUI 3 or WPF
- Implementing MVVM architecture for Windows apps
- Creating XAML layouts and custom controls
- Packaging applications with MSIX
- Integrating with Windows features (notifications, taskbar)

Do NOT invoke when:

- Building cross-platform apps → use mobile-developer or electron-pro
- Console applications → use appropriate language skill
- PowerShell GUI → use powershell-ui-architect
- Web applications → use appropriate web skill

## Decision Framework

    Windows App Task?
    ├── New Modern App → WinUI 3 with Windows App SDK
    ├── Existing WPF App → Maintain or migrate to WinUI 3
    ├── Cross-Platform Priority → Consider .NET MAUI
    ├── Enterprise Internal → WPF with proven patterns
    ├── Store Distribution → MSIX packaging preferred
    └── System Integration → P/Invoke or Windows SDK APIs

## Core Workflows

### 1. WinUI 3 Application Setup

1. Create project using Windows App SDK template
2. Configure Package.appxmanifest for capabilities
3. Set up MVVM infrastructure (CommunityToolkit.Mvvm)
4. Implement navigation and shell structure
5. Create reusable control library
6. Configure MSIX packaging
7. Set up CI/CD for Store or sideload distribution

### 2. MVVM Implementation

1. Define ViewModels with observable properties
2. Implement commands for user actions
3. Create services for data and business logic
4. Set up dependency injection container
5. Bind Views to ViewModels in XAML
6. Implement navigation service
7. Add design-time data for XAML preview

### 3. MSIX Packaging

1. Configure Package.appxmanifest
2. Define application identity and capabilities
3. Set up visual assets (icons, splash)
4. Configure installation behavior
5. Sign package with certificate
6. Test installation and updates
7. Submit to Microsoft Store or deploy internally

## Best Practices

- Use WinUI 3 for new development, WPF for legacy maintenance
- Implement MVVM strictly for testability and separation
- Use x:Bind for compile-time binding validation (WinUI 3 only; WPF supports only `{Binding}`)
- Leverage Community Toolkit for common patterns
- Package with MSIX for modern installation experience
- Follow Fluent Design System for consistent UX

## Anti-Patterns

- Code-behind logic → Move to ViewModels
- Synchronous UI operations → Use async/await for I/O
- Direct service calls from Views → Go through ViewModels
- Ignoring DPI awareness → Test at multiple scale factors
- Missing capabilities → Declare required capabilities in manifest
124 changes: 124 additions & 0 deletions voxtral_realtime/windows/README.md
@@ -0,0 +1,124 @@
# Voxtral Realtime - Windows CUDA

Real-time speech transcription desktop app powered by ExecuTorch with CUDA acceleration.

This is the Windows equivalent of the [macOS Voxtral Realtime app](../macos/).

## Quick Start

1. Download `VoxtralRealtime-Setup.exe` from the [Releases](https://github.com/meta-pytorch/executorch-examples/releases) page and run the installer
2. Launch from the Start Menu or desktop shortcut
3. Click **"Load Model"** — the app will automatically download the required model files (~5.2 GB) from HuggingFace on first launch
4. Once loaded, click **"Start Transcription"**

### Requirements

- Windows 10/11 with NVIDIA GPU (CUDA-capable)
- CUDA Toolkit installed (auto-detected from `C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\`)
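
The README does not show how the app detects the toolkit. As a rough sketch only, one plausible approach is to scan the install root for versioned directories (the toolkit installs into subdirectories such as `v12.4`) and pick the highest version. The function name `find_latest_cuda` and the selection rule are assumptions, written in Python for brevity rather than the app's C#.

```python
import re
from pathlib import Path

# Default install root quoted in the requirements above.
DEFAULT_CUDA_ROOT = Path(r"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA")


def find_latest_cuda(root=DEFAULT_CUDA_ROOT):
    """Return the highest-versioned `vMAJOR.MINOR` directory under `root`, or None."""
    root = Path(root)
    if not root.is_dir():
        return None
    candidates = []
    for child in root.iterdir():
        match = re.fullmatch(r"v(\d+)\.(\d+)", child.name)
        if match and child.is_dir():
            # Compare versions numerically so that v12.10 ranks above v12.4.
            version = (int(match.group(1)), int(match.group(2)))
            candidates.append((version, child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]
```

Comparing `(major, minor)` tuples instead of directory names avoids the string-ordering trap where `v12.10` would sort below `v12.4`.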

## Features

- **Live Transcription** - Start/Pause/Resume with streaming text output
- **Session Management** - Save, rename, pin, delete, and export sessions (TXT/JSON/SRT)
- **Text Replacements** - Auto-correct transcription (e.g., "executorch" -> "ExecuTorch")
- **Text Snippets** - Voice-triggered templates for common text blocks
- **Dictation Mode** - Ctrl+Space global hotkey, floating overlay, auto-paste to any app, auto-stop on 2s silence
- **Audio Level Visualization** - Real-time waveform display
- **Auto Model Download** - Model weights are downloaded from HuggingFace on first launch with progress tracking
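
To make the text replacement and snippet features concrete, here is a minimal sketch in Python (the app itself is C#, and its `TextPipeline` source is not shown here). The function names, the whole-word case-insensitive matching rule, and the order of the two passes are assumptions.

```python
import re


def apply_entries(text, entries):
    """Replace each trigger phrase (whole words, case-insensitive) with its value."""
    for trigger, value in entries.items():
        pattern = r"\b" + re.escape(trigger) + r"\b"
        # A function replacement keeps backslashes in `value` literal.
        text = re.sub(pattern, lambda _match, v=value: v, text, flags=re.IGNORECASE)
    return text


def process(text, replacements, snippets):
    """Run replacements first, then expand snippet triggers (assumed order)."""
    return apply_entries(apply_entries(text, replacements), snippets)
```

For example, `process("i built executorch today", {"executorch": "ExecuTorch"}, {})` yields `"i built ExecuTorch today"`. Word-boundary matching works for triggers that start and end with letters or digits; triggers ending in punctuation would need a different boundary rule.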

## Keyboard Shortcuts

| Shortcut | Action |
|----------|--------|
| Ctrl+Shift+R | Start / Resume transcription |
| Ctrl+. | Pause transcription |
| Ctrl+Enter | End session |
| Ctrl+Space | Toggle dictation mode |

## Build from Source

For developers who want to build the app themselves.

### Prerequisites

- [.NET 8.0 SDK](https://dotnet.microsoft.com/download/dotnet/8.0)
- Pre-built `voxtral_realtime_runner.exe` (see [Building the Runner](#building-the-runner))

### Building the Runner

Build the `voxtral_realtime_runner.exe` from the [ExecuTorch](https://github.com/pytorch/executorch) repo:

```bash
cd executorch
cmake --preset voxtral-realtime-cuda
cmake --build --preset voxtral-realtime-cuda
```

The runner will be at `cmake-out/examples/models/voxtral_realtime/Release/voxtral_realtime_runner.exe`.

### Build and Run

```powershell
cd VoxtralRealtime
dotnet restore
dotnet build --configuration Release
dotnet run --project VoxtralRealtime --configuration Release
```

Model files will be auto-downloaded on first launch to the `models/` directory next to the executable. To download them manually instead:

```powershell
pip install huggingface_hub
huggingface-cli download younghan-meta/Voxtral-Mini-4B-Realtime-2602-ExecuTorch-CUDA-Windows --local-dir models
huggingface-cli download mistralai/Voxtral-Mini-4B-Realtime-2602 tekken.json --local-dir models
```

### Publish Standalone Executable

Create a single self-contained exe (no .NET runtime required on target machine):

```powershell
cd VoxtralRealtime
dotnet publish VoxtralRealtime --configuration Release --runtime win-x64 --self-contained true /p:PublishSingleFile=true /p:IncludeNativeLibrariesForSelfExtract=true /p:DebugType=none -o publish
```

### Create Installer

Builds an installer that bundles the app and runner. Model weights are downloaded automatically on first launch.

```powershell
# 1. Install Inno Setup (one-time)
winget install JRSoftware.InnoSetup

# 2. Publish the app
cd VoxtralRealtime
dotnet publish VoxtralRealtime --configuration Release --runtime win-x64 --self-contained true /p:PublishSingleFile=true /p:IncludeNativeLibrariesForSelfExtract=true /p:DebugType=none -o publish

# 3. Build the installer (set EXECUTORCH_ROOT to your ExecuTorch repo path)
cd ..
$env:EXECUTORCH_ROOT = "C:\path\to\executorch"
& "$env:LOCALAPPDATA\Programs\Inno Setup 6\ISCC.exe" installer.iss
```

The output `installer-output\VoxtralRealtime-Setup.exe` includes:
- App executable (self-contained, no .NET runtime needed)
- `voxtral_realtime_runner.exe` + `aoti_cuda_shims.dll`
- Start Menu and optional desktop shortcuts
- Clean uninstall via Windows Settings

Model weights (`model.pte`, `preprocessor.pte`, `aoti_cuda_blob.ptd`, `tekken.json`) are **not bundled** — they are downloaded from HuggingFace on first launch, keeping the installer small (~49 MB).

## Architecture

The app wraps the `voxtral_realtime_runner.exe` C++ binary via stdin/stdout pipes:

```
Microphone (WASAPI 48kHz) -> Resample to 16kHz mono f32le -> stdin pipe -> runner.exe -> stdout tokens -> WPF UI
```
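
The conversion stage of this pipeline (downmix, resample to 16 kHz, pack as little-endian 32-bit floats) can be sketched as follows. This is an illustrative Python version, not the app's C# audio code: the function names are hypothetical, and plain linear interpolation stands in for whatever resampler the app uses. A production resampler would low-pass filter before reducing the rate to avoid aliasing.

```python
import struct


def downmix_to_mono(frames):
    """Average the channels of each frame, e.g. [(L, R), ...] -> [m, ...]."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples, src_rate, dst_rate):
    """Resample by linear interpolation (no anti-aliasing filter; sketch only)."""
    if not samples:
        return []
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate  # input samples advanced per output sample
    out = []
    for i in range(n_out):
        pos = i * step
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def to_f32le(samples):
    """Pack samples as little-endian 32-bit floats (the f32le format named above)."""
    return struct.pack("<%df" % len(samples), *samples)
```

A capture callback would chain the three steps, for example `to_f32le(resample_linear(downmix_to_mono(frames), 48000, 16000))`, and write the resulting bytes to the runner's stdin.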

Platform-specific adaptations from the macOS app:
- **Audio**: WASAPI (shared mode) replaces AVAudioEngine, with software resampling from native rate to 16kHz
- **Hotkey**: Win32 `RegisterHotKey` replaces Carbon `RegisterEventHotKey`
- **UI**: WPF (.NET 8) with MVVM replaces SwiftUI
- **Backend**: `--data_path` for CUDA blob (macOS uses Metal backend)
- **Stdin**: Binary mode (`_setmode`) required on Windows to prevent 0x1A byte (Ctrl+Z) from being interpreted as EOF
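
The stdin note matters because raw f32le audio can legitimately contain the byte 0x1A, which a text-mode C runtime stream on Windows treats as end of file. The sketch below, in Python rather than the runner's C++, builds such a payload and pipes it to a stand-in child process that reads stdin as raw bytes. The child script and helper names are hypothetical; the real runner is `voxtral_realtime_runner.exe`.

```python
import struct
import subprocess
import sys

# Stand-in child (hypothetical): reads all of stdin as raw bytes via the
# binary buffer and prints how many bytes arrived.
CHILD_CODE = "import sys; data = sys.stdin.buffer.read(); print(len(data))"


def send_audio(payload):
    """Write `payload` to a child's stdin pipe and return the byte count it reports."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        input=payload,
        stdout=subprocess.PIPE,
        check=True,
    )
    return int(result.stdout.decode().strip())


def contains_ctrl_z(payload):
    """True if the raw bytes include 0x1A (Ctrl+Z)."""
    return b"\x1a" in payload


# A valid float32 sample whose low byte is 0x1A (bit pattern 0x3E00001A).
SAMPLE_WITH_CTRL_Z = struct.unpack("<f", bytes([0x1A, 0x00, 0x00, 0x3E]))[0]
```

Sending `struct.pack("<3f", 0.0, SAMPLE_WITH_CTRL_Z, 1.0)` through `send_audio` should report all 12 bytes, because the child reads in binary mode. A reader left in text mode on Windows could stop at the 0x1A byte instead, which is the failure the `_setmode` call guards against.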
19 changes: 19 additions & 0 deletions voxtral_realtime/windows/VoxtralRealtime/VoxtralRealtime.sln
@@ -0,0 +1,19 @@

Microsoft Visual Studio Solution File, Format Version 12.00
# Visual Studio Version 17
VisualStudioVersion = 17.0.31903.59
MinimumVisualStudioVersion = 10.0.40219.1
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "VoxtralRealtime", "VoxtralRealtime\VoxtralRealtime.csproj", "{A1B2C3D4-E5F6-7890-ABCD-EF1234567890}"
EndProject
Global
	GlobalSection(SolutionConfigurationPlatforms) = preSolution
		Debug|Any CPU = Debug|Any CPU
		Release|Any CPU = Release|Any CPU
	EndGlobalSection
	GlobalSection(ProjectConfigurationPlatforms) = postSolution
		{A1B2C3D4-E5F6-7890-ABCD-EF1234567890}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
		{A1B2C3D4-E5F6-7890-ABCD-EF1234567890}.Debug|Any CPU.Build.0 = Debug|Any CPU
		{A1B2C3D4-E5F6-7890-ABCD-EF1234567890}.Release|Any CPU.ActiveCfg = Release|Any CPU
		{A1B2C3D4-E5F6-7890-ABCD-EF1234567890}.Release|Any CPU.Build.0 = Release|Any CPU
	EndGlobalSection
EndGlobal
17 changes: 17 additions & 0 deletions voxtral_realtime/windows/VoxtralRealtime/VoxtralRealtime/App.xaml
@@ -0,0 +1,17 @@
<!-- Copyright (c) Meta Platforms, Inc. and affiliates. -->
<!-- All rights reserved. -->
<!-- This source code is licensed under the BSD-style license found in the -->
<!-- LICENSE file in the root directory of this source tree. -->

<Application x:Class="VoxtralRealtime.App"
             xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
             xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
             StartupUri="Views/MainWindow.xaml">
    <Application.Resources>
        <ResourceDictionary>
            <ResourceDictionary.MergedDictionaries>
                <ResourceDictionary Source="Resources/Styles.xaml"/>
            </ResourceDictionary.MergedDictionaries>
        </ResourceDictionary>
    </Application.Resources>
</Application>
@@ -0,0 +1,49 @@
// Copyright (c) Meta Platforms, Inc. and affiliates.
// All rights reserved.
// This source code is licensed under the BSD-style license found in the
// LICENSE file in the root directory of this source tree.

using System.Runtime.InteropServices;
using System.Windows;
using VoxtralRealtime.Services;
using VoxtralRealtime.ViewModels;

namespace VoxtralRealtime;

public partial class App : Application
{
    // Attaches this GUI process to the console of the process that launched it.
    [DllImport("kernel32.dll")]
    private static extern bool AttachConsole(int dwProcessId);
    private const int ATTACH_PARENT_PROCESS = -1;

    // Application-wide singletons, created once in OnStartup.
    public static SettingsViewModel Settings { get; private set; } = null!;
    public static ReplacementStoreViewModel ReplacementStore { get; private set; } = null!;
    public static SnippetStoreViewModel SnippetStore { get; private set; } = null!;
    public static TextPipeline TextPipeline { get; private set; } = null!;
    public static TranscriptStoreViewModel Store { get; private set; } = null!;
    public static DictationViewModel Dictation { get; private set; } = null!;
    public static GlobalHotkeyService HotkeyService { get; private set; } = null!;

    protected override void OnStartup(StartupEventArgs e)
    {
        base.OnStartup(e);

        // Attach to the parent console so Console.WriteLine shows in the terminal.
        // The return value is ignored: the call simply fails when there is no parent console.
        AttachConsole(ATTACH_PARENT_PROCESS);
        AppLogger.Log("App", "Voxtral Realtime starting up");

        // Construct in dependency order: each object only receives instances created above it.
        Settings = new SettingsViewModel();
        ReplacementStore = new ReplacementStoreViewModel();
        SnippetStore = new SnippetStoreViewModel();
        TextPipeline = new TextPipeline(ReplacementStore, SnippetStore);
        Store = new TranscriptStoreViewModel(Settings, TextPipeline);
        HotkeyService = new GlobalHotkeyService();
        Dictation = new DictationViewModel(Store, Settings, HotkeyService);
    }

    protected override void OnExit(ExitEventArgs e)
    {
        // Stop dictation before shutting down the transcript store it depends on.
        Dictation.Cleanup();
        Store.Shutdown();
        base.OnExit(e);
    }
}