Skip to content

Commit ffa46ef

Browse files
Cmochanceclaude
andauthored
fix(switch): unstick "switch already in progress" toast (stale lock + double-click) (#24)
User-reported: clicking Switch on a card sometimes immediately surfaces "A profile switch is already in progress" even when no switch is visibly running. Two contributing causes, both fixed: 1. The shared `.switch.lock` had no stale-cleanup. If a previous switch / login was force-quit, hung in OAuth, or otherwise terminated without dropping its guard, the lock file persisted and bricked every subsequent switch / login until the user manually deleted it. This was the v1.6.x stale-cleanup feature that the 1.5.x rollback never brought back. 2. `handleSwitchProfile` in the front-end didn't guard on `state.loading`. The disabled attribute on the Switch button doesn't take effect until the next browser paint, so a fast second click fired a second IPC before the first switch's `runBlockingAction` flipped state. The other card actions (Refresh, Login, Delete) already had this check; Switch missed it. Refactor (per the share-don't-duplicate rule): switch_core's `acquire_switch_lock` and login_runtime's `acquire_login_lock` were near-mirrored — same path, same `O_EXCL` pattern, just different error codes. Move the body to a new `shared/runtime/process_lock.rs` that takes the error code / message as parameters; both call sites shrink to a one-liner. The new module also owns the `STALE_LOCK_AGE = 5 min` threshold + the reclaim path. Threshold tuning: 60s (the v1.6 default) was too short for the new per-card login flow which legitimately holds the lock for the duration of an OAuth round-trip. 5 min is long enough that a slow manual login won't get its lock stolen, short enough that a wedged holder unblocks the user without an app restart. If a real login is still in-flight at minute 5+ and the user presses Switch, the reclaim races against an eventual auth.json write — but at that point the user has clearly chosen "I want to switch *now*", and a second click that succeeds is more useful than a stuck lock. Tests: 51 → 55 (4 new in process_lock — first-acquire / contention / stale-predicate / integration stale-reclaim using `touch -t` to back-date mtime). Existing switch_core and login_runtime tests still pass on the shared lock path. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
1 parent 0a8e086 commit ffa46ef

5 files changed

Lines changed: 251 additions & 80 deletions

File tree

src-tauri/shared/front/actions.ts

Lines changed: 11 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -232,6 +232,17 @@ function refreshProfileErrorMessage(error: unknown): string {
232232
}
233233

234234
async function handleSwitchProfile(profile: string): Promise<void> {
235+
// Guard against double-click / racing rerender: the disabled
236+
// attribute on the button doesn't take effect until the next
237+
// browser paint, so a fast second click can fire before the first
238+
// switch's `runBlockingAction` flips `state.loading`. Without this
239+
// guard the second IPC hits the backend lock and the user sees the
240+
// SWITCH_IN_PROGRESS toast instead of a no-op. The other card
241+
// actions (Refresh, Login, Delete) already had this check; Switch
242+
// missed it.
243+
if (state.loading || state.loginActiveProfile !== null || isRefreshPending(profile)) {
244+
return;
245+
}
235246
try {
236247
await runBlockingAction(async () => {
237248
await switchProfile(profile);

src-tauri/shared/runtime/login_runtime.rs

Lines changed: 11 additions & 42 deletions
Original file line numberDiff line numberDiff line change
@@ -16,58 +16,27 @@
1616
//! hook that simulates the OAuth handshake by writing an arbitrary
1717
//! `auth.json` into the sandboxed CODEX_HOME.
1818
19-
use std::path::{Path, PathBuf};
19+
use std::path::Path;
2020

2121
use crate::errors::{AppError, AppResult};
2222
use crate::platform::hooks::PlatformHooks;
2323

2424
use super::metadata::sync_profile_metadata_from_auth;
25-
use super::paths::{get_backup_root, get_codex_home, get_switch_lock_path, validate_profile_name};
25+
use super::paths::{get_backup_root, get_codex_home, validate_profile_name};
26+
use super::process_lock::{acquire_process_lock, ProcessLockGuard};
2627
use super::profiles::resolve_current_profile;
2728
use super::profiles_index::load_profiles_index;
2829
use super::runtime_isolation::{
2930
clear_runtime_auth_state, prune_runtime_extra_features, seed_runtime_shared_assets,
3031
RUNTIME_AUTH_FILENAME,
3132
};
3233

33-
/// File-based mutex to serialize login / switch on the same backup root.
34-
/// Reuses the existing `.switch.lock` because login + switch both mutate
35-
/// `auth.json`s and must not interleave. Errors translate to `LOGIN_BUSY`
36-
/// from the login entry point so the UI can show a distinct message.
37-
struct LoginGuard {
38-
lock_path: PathBuf,
39-
}
40-
41-
impl Drop for LoginGuard {
42-
fn drop(&mut self) {
43-
let _ = std::fs::remove_file(&self.lock_path);
44-
}
45-
}
46-
47-
fn acquire_login_lock(codex_home: &Path) -> AppResult<LoginGuard> {
48-
let lock_path = get_switch_lock_path(Some(codex_home));
49-
if let Some(parent) = lock_path.parent() {
50-
std::fs::create_dir_all(parent).map_err(|error| {
51-
AppError::new(
52-
"FS_CREATE_FAILED",
53-
format!(
54-
"Failed to create lock directory {}: {error}",
55-
parent.display()
56-
),
57-
)
58-
})?;
59-
}
60-
std::fs::OpenOptions::new()
61-
.write(true)
62-
.create_new(true)
63-
.open(&lock_path)
64-
.map_err(|_| {
65-
AppError::new(
66-
"LOGIN_BUSY",
67-
"A profile login or switch is already in progress.",
68-
)
69-
})?;
70-
Ok(LoginGuard { lock_path })
34+
fn acquire_login_lock(codex_home: &Path) -> AppResult<ProcessLockGuard> {
35+
acquire_process_lock(
36+
Some(codex_home),
37+
"LOGIN_BUSY",
38+
"A profile login or switch is already in progress.",
39+
)
7140
}
7241

7342
/// Build the sandboxed CODEX_HOME for codex login. Idempotent; safe to
@@ -419,8 +388,8 @@ mod tests {
419388
#[test]
420389
fn login_profile_rejects_concurrent_attempts_via_switch_lock() {
421390
let (codex_home, _profile_dir, runtime_home) = setup("concurrent");
422-
// Pre-acquire the switch lock to simulate an in-flight switch.
423-
let lock_path = get_switch_lock_path(Some(&codex_home));
391+
// Pre-acquire the shared lock to simulate an in-flight switch.
392+
let lock_path = super::super::paths::get_switch_lock_path(Some(&codex_home));
424393
std::fs::create_dir_all(lock_path.parent().unwrap()).unwrap();
425394
let _staged = std::fs::OpenOptions::new()
426395
.write(true)

src-tauri/shared/runtime/mod.rs

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,7 @@ pub mod login_runtime;
55
pub mod metadata;
66
pub mod paths;
77
pub mod profiles;
8+
pub mod process_lock;
89
pub mod profiles_index;
910
pub mod quota_routing;
1011
pub mod runtime_isolation;
Lines changed: 220 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,220 @@
1+
//! Shared file-based lock used to serialize the three operations that
2+
//! mutate per-profile auth state: switch, refresh's auth-rotation
3+
//! tail, and the per-card login flow. They cannot interleave safely —
4+
//! a switch mid-login (or vice versa) would race two writers against
5+
//! `account_backup/<profile>/auth.json` and possibly `~/.codex/auth.json`.
6+
//!
7+
//! The lock is a single file (`get_switch_lock_path` — kept named for
8+
//! historical compatibility) that all three holders create with
9+
//! `O_EXCL`. The first holder wins; everyone else gets a
10+
//! caller-supplied error code so the UI can render context-appropriate
11+
//! copy ("switch in progress" vs. "login in progress") even though
12+
//! the underlying contention is the same.
13+
//!
14+
//! ## Stale cleanup
15+
//!
16+
//! Without cleanup, any holder that crashes (force-quit, OS logout,
17+
//! browser-cancelled OAuth that left the parent process spinning)
18+
//! permanently bricks every future operation until the user manually
19+
//! deletes the file. The 1.6.x line had this fix; rolling back to
20+
//! 1.5.x dropped it. We restore it here with a single threshold —
21+
//! locks older than `STALE_LOCK_AGE` are reclaimed by the next caller.
22+
//!
23+
//! Threshold = 5 minutes. Long enough for a slow OAuth login (browser
24+
//! opened, user reads docs, eventually clicks Authorize). Short enough
25+
//! that a wedged login won't strand the user across an app restart.
26+
//! If a real login is still legitimately in flight at minute 5+, the
27+
//! reclaim races against its eventual auth.json write — but at that
28+
//! point the user has clearly chosen "I want to switch *now*", and a
29+
//! second click that succeeds is more useful than a 12-hour-stuck
30+
//! lock. (Future work: store the holder identity in the lock body so
31+
//! we can distinguish "stuck" from "still working.")
32+
33+
use std::fs::OpenOptions;
34+
use std::path::{Path, PathBuf};
35+
use std::time::{Duration, SystemTime};
36+
37+
use crate::errors::{AppError, AppResult};
38+
39+
use super::paths::get_switch_lock_path;
40+
41+
const STALE_LOCK_AGE: Duration = Duration::from_secs(5 * 60);
42+
43+
#[derive(Debug)]
44+
pub struct ProcessLockGuard {
45+
lock_path: PathBuf,
46+
}
47+
48+
impl Drop for ProcessLockGuard {
49+
fn drop(&mut self) {
50+
let _ = std::fs::remove_file(&self.lock_path);
51+
}
52+
}
53+
54+
fn try_create_lock(lock_path: &Path) -> std::io::Result<()> {
55+
OpenOptions::new()
56+
.write(true)
57+
.create_new(true)
58+
.open(lock_path)
59+
.map(|_| ())
60+
}
61+
62+
fn lock_is_stale(lock_path: &Path) -> bool {
63+
let metadata = match std::fs::metadata(lock_path) {
64+
Ok(value) => value,
65+
Err(_) => return false,
66+
};
67+
let modified = match metadata.modified() {
68+
Ok(value) => value,
69+
Err(_) => return false,
70+
};
71+
SystemTime::now()
72+
.duration_since(modified)
73+
.map(|age| age >= STALE_LOCK_AGE)
74+
.unwrap_or(false)
75+
}
76+
77+
/// Acquire the shared switch / login lock. The returned guard releases
78+
/// the lock on drop. `busy_error_code` is the error code returned to
79+
/// the caller (and ultimately the front-end) when contention is real
80+
/// — `SWITCH_IN_PROGRESS` for switch, `LOGIN_BUSY` for login.
81+
/// `busy_message` is the human-readable companion.
82+
pub fn acquire_process_lock(
83+
codex_home: Option<&Path>,
84+
busy_error_code: &'static str,
85+
busy_message: &'static str,
86+
) -> AppResult<ProcessLockGuard> {
87+
let lock_path = get_switch_lock_path(codex_home);
88+
if let Some(parent) = lock_path.parent() {
89+
std::fs::create_dir_all(parent).map_err(|error| {
90+
AppError::new(
91+
"FS_CREATE_FAILED",
92+
format!(
93+
"Failed to create lock directory {}: {error}",
94+
parent.display()
95+
),
96+
)
97+
})?;
98+
}
99+
100+
if let Err(error) = try_create_lock(&lock_path) {
101+
// The cheap branch is "real concurrent contention" — that's
102+
// what AlreadyExists with a fresh lock looks like. The other
103+
// branch we observed in the wild is the GUI dying mid-switch
104+
// (force quit / OS logout / hung OAuth) and leaving the lock
105+
// behind, which then permanently blocks every future caller.
106+
// Detect that by mtime and reclaim.
107+
if error.kind() == std::io::ErrorKind::AlreadyExists && lock_is_stale(&lock_path) {
108+
let _ = std::fs::remove_file(&lock_path);
109+
try_create_lock(&lock_path).map_err(|retry_error| {
110+
AppError::new(
111+
busy_error_code,
112+
format!(
113+
"Stale {busy_error_code} lock cleanup failed: {retry_error}. \
114+
Another operation may have started in the meantime."
115+
),
116+
)
117+
})?;
118+
} else {
119+
return Err(AppError::new(busy_error_code, busy_message));
120+
}
121+
}
122+
123+
Ok(ProcessLockGuard { lock_path })
124+
}
125+
126+
#[cfg(test)]
127+
mod tests {
128+
use super::*;
129+
use std::fs;
130+
use std::time::{SystemTime, UNIX_EPOCH};
131+
132+
fn temp_codex_home(name: &str) -> PathBuf {
133+
let unique = SystemTime::now()
134+
.duration_since(UNIX_EPOCH)
135+
.unwrap()
136+
.as_nanos();
137+
let path = std::env::temp_dir()
138+
.join(format!("codex-switch-process-lock-{name}-{unique}"));
139+
fs::create_dir_all(path.join("account_backup")).unwrap();
140+
path
141+
}
142+
143+
#[test]
144+
fn first_acquire_succeeds_and_drop_releases() {
145+
let codex_home = temp_codex_home("first-acquire");
146+
let lock_path = get_switch_lock_path(Some(&codex_home));
147+
148+
{
149+
let _guard = acquire_process_lock(Some(&codex_home), "SWITCH_IN_PROGRESS", "busy")
150+
.expect("first acquire should succeed");
151+
assert!(lock_path.is_file(), "lock file must exist while held");
152+
}
153+
assert!(!lock_path.exists(), "lock file must be removed on drop");
154+
}
155+
156+
#[test]
157+
fn second_concurrent_acquire_returns_busy_with_caller_supplied_code() {
158+
let codex_home = temp_codex_home("concurrent");
159+
let _first = acquire_process_lock(Some(&codex_home), "SWITCH_IN_PROGRESS", "busy switch")
160+
.expect("first acquire");
161+
let err = acquire_process_lock(Some(&codex_home), "LOGIN_BUSY", "busy login")
162+
.expect_err("second acquire should fail");
163+
assert_eq!(err.error_code, "LOGIN_BUSY");
164+
assert_eq!(err.message, "busy login");
165+
}
166+
167+
#[test]
168+
fn stale_lock_is_reclaimed_on_next_acquire() {
169+
// Walk the predicate directly: write a fresh file, confirm
170+
// it's NOT stale; then prove the logic by checking that a
171+
// sufficiently-aged mtime would be flagged. We can't reliably
172+
// back-date a real file across all CI platforms without an
173+
// extra crate, so this test pins the threshold predicate
174+
// shape rather than the end-to-end "delete + reacquire" flow,
175+
// which is exercised by the integration test below.
176+
let codex_home = temp_codex_home("stale-predicate");
177+
let lock_path = get_switch_lock_path(Some(&codex_home));
178+
std::fs::write(&lock_path, b"").unwrap();
179+
assert!(
180+
!lock_is_stale(&lock_path),
181+
"freshly created lock must not be flagged as stale"
182+
);
183+
}
184+
185+
#[test]
186+
fn integration_stale_lock_is_swept_when_mtime_old_enough() {
187+
// Skip if we can't back-date the file. On macOS / Linux we
188+
// have `nix::sys::stat::utimes`; on Windows we have
189+
// `SetFileTime`. Rather than pulling in those deps for a
190+
// single test, drop a real lock file with mtime now-6min via
191+
// `touch -t` shell out (only on POSIX) and skip on platforms
192+
// where it's not available.
193+
if cfg!(target_os = "windows") {
194+
return;
195+
}
196+
let codex_home = temp_codex_home("stale-integration");
197+
let lock_path = get_switch_lock_path(Some(&codex_home));
198+
std::fs::write(&lock_path, b"").unwrap();
199+
200+
// Six minutes ago. `touch -t YYYYMMDDhhmm` is universally
201+
// supported on POSIX `touch`.
202+
let six_min_ago = chrono::Local::now() - chrono::Duration::minutes(6);
203+
let ts = six_min_ago.format("%Y%m%d%H%M").to_string();
204+
let status = std::process::Command::new("touch")
205+
.args(["-t", &ts, lock_path.to_str().unwrap()])
206+
.status();
207+
if status.map(|s| !s.success()).unwrap_or(true) {
208+
// touch unavailable; skip.
209+
return;
210+
}
211+
assert!(
212+
lock_is_stale(&lock_path),
213+
"lock with 6-minute-old mtime must be flagged stale"
214+
);
215+
216+
let _guard = acquire_process_lock(Some(&codex_home), "SWITCH_IN_PROGRESS", "busy")
217+
.expect("stale lock must be reclaimable");
218+
assert!(lock_path.is_file());
219+
}
220+
}

src-tauri/shared/runtime/switch_core.rs

Lines changed: 8 additions & 38 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,3 @@
1-
use std::fs::OpenOptions;
21
use std::path::{Path, PathBuf};
32

43
use crate::errors::{AppError, AppResult};
@@ -8,46 +7,17 @@ use crate::platform::hooks::PlatformHooks;
87
use super::fs_ops::{
98
autosave_auth, backup_root_state_to_profile, overlay_directory_contents, set_active_marker,
109
};
11-
use super::paths::{get_backup_root, get_codex_home, get_switch_lock_path, validate_profile_name};
10+
use super::paths::{get_backup_root, get_codex_home, validate_profile_name};
11+
use super::process_lock::{acquire_process_lock, ProcessLockGuard};
1212
use super::profiles::resolve_current_profile;
1313
use super::profiles_index::load_profiles_index;
1414

15-
struct SwitchGuard {
16-
lock_path: PathBuf,
17-
}
18-
19-
impl Drop for SwitchGuard {
20-
fn drop(&mut self) {
21-
let _ = std::fs::remove_file(&self.lock_path);
22-
}
23-
}
24-
25-
fn acquire_switch_lock(codex_home: Option<&Path>) -> AppResult<SwitchGuard> {
26-
let lock_path = get_switch_lock_path(codex_home);
27-
if let Some(parent) = lock_path.parent() {
28-
std::fs::create_dir_all(parent).map_err(|error| {
29-
AppError::new(
30-
"FS_CREATE_FAILED",
31-
format!(
32-
"Failed to create lock directory {}: {error}",
33-
parent.display()
34-
),
35-
)
36-
})?;
37-
}
38-
39-
OpenOptions::new()
40-
.write(true)
41-
.create_new(true)
42-
.open(&lock_path)
43-
.map_err(|_| {
44-
AppError::new(
45-
"SWITCH_IN_PROGRESS",
46-
"A profile switch is already in progress.",
47-
)
48-
})?;
49-
50-
Ok(SwitchGuard { lock_path })
15+
fn acquire_switch_lock(codex_home: Option<&Path>) -> AppResult<ProcessLockGuard> {
16+
acquire_process_lock(
17+
codex_home,
18+
"SWITCH_IN_PROGRESS",
19+
"A profile switch is already in progress.",
20+
)
5121
}
5222

5323
pub fn switch_profile_with_home<H: PlatformHooks + ?Sized>(

0 commit comments

Comments
 (0)