Skip to content

Commit eb2c844

Browse files
committed
Don't canonicalise file keys
Only canonicalise when comparing against paths from external sources (e.g. normalised by R). Canonicalisation may corrupt the identity of paths sent by frontends, and may prevent matching files at all when they are deleted (at which point a non-canonicalised path sent by the frontend can't be matched against canonicalised paths we've stored as keys)
1 parent b18ee25 commit eb2c844

10 files changed

Lines changed: 171 additions & 138 deletions

File tree

crates/aether_url/src/lib.rs

Lines changed: 60 additions & 97 deletions
Original file line numberDiff line numberDiff line change
@@ -10,125 +10,83 @@ use std::fmt;
1010
use stdext::result::ResultExt;
1111
use url::Url;
1212

13-
/// Canonicalised file URL identity.
13+
/// Lexically normalised file URL identity.
1414
///
15-
/// # The multi-source URI reconciliation problem
15+
/// Internal identity key for files received from any source (LSP, DAP,
16+
/// scanner, R runtime). Constructed via the same lexical normalisation
17+
/// at every entry point so that two paths the editor considers "the
18+
/// same file" produce the same [`UrlId`].
1619
///
17-
/// File URIs for the same file can arrive from independent sources, each
18-
/// with its own representation:
20+
/// What we normalise: drive-letter casing on Windows; percent-encoding
21+
/// of `:` (decoded via `Url -> PathBuf -> Url` round-trip). No I/O:
22+
/// `std::fs::canonicalize()`, no symlink resolution. The same input URI
23+
/// produces the same [`UrlId`] whether or not the file exists on disk.
1924
///
20-
/// - **DAP (SetBreakpoints)**: Receives raw file paths from the frontend,
21-
/// converted to URIs via `UrlId::from_file_path`. These are stored as
22-
/// HashMap keys for breakpoint lookup.
25+
/// # Bridging across symlinks
2326
///
24-
/// - **LSP (didChange, etc.)**: Receives URIs directly from the editor
25-
/// client, which may use non-canonical forms (e.g. percent-encoded
26-
/// colons on Windows, or symlinked paths on macOS).
27+
/// R's `normalizePath()` resolves symlinks on its own. A srcref URI
28+
/// from the R runtime may name `/private/tmp/foo.R` while the editor
29+
/// sent us `/tmp/foo.R`. The two don't compare equal, so a `HashMap`
30+
/// keyed on `UrlId` treats them as separate files. Code that needs to
31+
/// match a srcref URI back to an open document or a breakpoint should
32+
/// maintain a secondary index of `fs::canonicalize`d paths and fall
33+
/// back to it on a primary miss.
34+
/// [`crate::dap::dap_state::BreakpointMap`] in `ark` does this for
35+
/// breakpoints.
2736
///
28-
/// - **Execute requests**: A frontend may attach a `code_location` URI
29-
/// that comes straight from the editor's document model, again
30-
/// potentially non-canonical.
37+
/// # Important: don't leak normalised URIs back out
3138
///
32-
/// - **R runtime**: When R evaluates `source()` or annotates code, it
33-
/// passes URIs that went through R's `normalizePath()`, which resolves
34-
/// symlinks to their canonical target (e.g. `/tmp` resolves to `/private/tmp`
35-
/// on macOS), producing a path the editor never sent. More generally,
36-
/// arbitrary R code can create source references that we may end up
37-
/// consuming for breakpoint or debug purposes, and the paths in those
38-
/// references may or may not be canonical.
39-
///
40-
/// All four sources must agree on file identity. For instance breakpoints set
41-
/// via DAP are looked up in a HashMap keyed by URI when code is executed or
42-
/// sourced, and invalidated when documents change via LSP.
43-
///
44-
/// # Design decision
45-
///
46-
/// We solve this by canonicalizing URIs into [`UrlId`] at every entry
47-
/// point, rather than interning paths into opaque IDs (as rust-analyzer
48-
/// does with its VFS `FileId` approach). Interning would be a larger
49-
/// architectural change and is not warranted here since we only need
50-
/// canonical keys at a handful of call sites.
51-
///
52-
/// Canonicalization uses `std::fs::canonicalize()` to resolve symlinks
53-
/// (e.g. `/tmp` to `/private/tmp` on macOS), round-trips through the
54-
/// filesystem path to normalize encoding variants (e.g. `%3A` to `:` on
55-
/// Windows), and uppercases drive letters on Windows. When the file does
56-
/// not exist on disk, we fall back to the original URI.
57-
///
58-
/// # Important: canonical URIs must not leak
59-
///
60-
/// [`UrlId`] is strictly for internal identity. When a URI flows back
61-
/// to R (e.g. in `#line` directives or injected breakpoint calls) or to
62-
/// the frontend (e.g. in DAP stack frames), always use the original raw
63-
/// URI. The frontend (and possibly R code) expects their own URI
64-
/// representation, and a canonical URI (e.g. `/private/tmp/...` instead of
65-
/// `/tmp/...`) could be treated as a different file (e.g. open a new editor in
66-
/// the frontend instead of an existing one).
67-
///
68-
/// A canonicalized file URI for use as a stable identity key.
69-
///
70-
/// Wraps a [`Url`] that has been canonicalized to resolve symlinks,
71-
/// normalize encoding variants, and uppercase drive letters on Windows.
72-
/// Use this type in HashMaps and anywhere file identity matters.
73-
///
74-
/// Construct via [`UrlId::from_url`], [`UrlId::from_file_path`], or
75-
/// [`UrlId::parse`].
76-
///
77-
/// On Windows, `std::fs::canonicalize()` returns extended-length paths
78-
/// prefixed with `\\?\` (e.g. `\\?\C:\Users\...`). Projects like Ruff
79-
/// use the `dunce` crate to strip this prefix, but we don't need it
80-
/// because `Url::from_file_path` already handles
81-
/// `Prefix::VerbatimDisk` and produces a clean `file:///C:/...` URI.
39+
/// Even though [`UrlId`] no longer fs-canonicalises, it still
40+
/// uppercases the Windows drive letter and decodes the percent-encoded
41+
/// colon. When sending URIs back to the editor or to R, prefer the
42+
/// original bytes the frontend sent. The frontend treats a URI as the
43+
/// editor's identity for the file; a normalised form may look like a
44+
/// different file to it (e.g. open a new editor pane).
8245
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
8346
pub struct UrlId(Url);
8447

8548
impl UrlId {
86-
/// Canonicalize a [`Url`] into a [`UrlId`].
49+
/// Lexically normalise a [`Url`] into a [`UrlId`].
8750
///
88-
/// Resolves symlinks via `std::fs::canonicalize()` and normalizes
89-
/// encoding variants (e.g. `%3A` to `:` on Windows). On Windows, also
90-
/// uppercases the drive letter. Falls back to the original URI for
91-
/// non-file schemes or when the path can't be resolved.
51+
/// Decodes encoding variants (e.g. `%3A` to `:` on Windows) and
52+
/// uppercases the Windows drive letter. Does no filesystem I/O.
53+
/// Non-`file:` URLs (`ark://`, `untitled:`, ...) pass through
54+
/// untouched.
9255
pub fn from_url(uri: Url) -> Self {
9356
if uri.scheme() != "file" {
9457
return Self(uri);
9558
}
9659

97-
let Some(path) = uri.to_file_path().warn_on_err() else {
98-
return Self(uri);
60+
// Round-trip through `PathBuf` so the URI form matches what
61+
// `Url::from_file_path` produces (decoded `%3A`, etc.). Skip
62+
// on error, we let pathological URIs flow through unchanged.
63+
let uri = match uri.to_file_path().warn_on_err() {
64+
Some(path) => Url::from_file_path(&path)
65+
.map_err(|()| anyhow::anyhow!("Failed to convert path to URI: {path:?}"))
66+
.warn_on_err()
67+
.unwrap_or(uri),
68+
None => uri,
9969
};
10070

101-
let path = std::fs::canonicalize(&path).trace_on_err().unwrap_or(path);
102-
let uri = Url::from_file_path(&path)
103-
.map_err(|()| anyhow::anyhow!("Failed to convert path to URI: {path:?}"))
104-
.warn_on_err()
105-
.unwrap_or(uri);
106-
10771
#[cfg(windows)]
10872
let uri = uppercase_windows_drive_in_uri(uri);
10973

11074
Self(uri)
11175
}
11276

113-
/// Wrap a [`Url`] that the caller asserts is already canonical.
114-
pub fn from_canonical(uri: Url) -> Self {
115-
Self(uri)
116-
}
117-
118-
/// Convert a file path to a canonical [`UrlId`].
77+
/// Build a [`UrlId`] from a filesystem path.
11978
///
120-
/// Canonicalizes the path to resolve symlinks (e.g. `/var/folders` to
121-
/// `/private/var/folders` on macOS) so the URI matches what R's
122-
/// `normalizePath()` produces. Falls back to the original path if
123-
/// canonicalization fails.
79+
/// Same lexical normalisation as [`Self::from_url`], no filesystem
80+
/// I/O. Errors only if `path` can't be expressed as a URL (e.g.
81+
/// not absolute on platforms that require it).
12482
pub fn from_file_path(path: impl AsRef<std::path::Path>) -> anyhow::Result<Self> {
12583
let path = path.as_ref();
12684
let url = Url::from_file_path(path)
12785
.map_err(|()| anyhow::anyhow!("Failed to convert path to URL: {}", path.display()))?;
12886
Ok(Self::from_url(url))
12987
}
13088

131-
/// Parse a URI string into a canonical [`UrlId`].
89+
/// Parse a URI string into a [`UrlId`].
13290
pub fn parse(s: &str) -> Result<Self, url::ParseError> {
13391
let url = Url::parse(s)?;
13492
Ok(Self::from_url(url))
@@ -223,28 +181,33 @@ mod tests {
223181

224182
#[test]
225183
#[cfg(not(windows))]
226-
fn test_fallback_for_nonexistent_path() {
227-
// For paths that don't exist, canonicalization falls back to the
228-
// original path so the URI is unchanged.
184+
fn test_nonexistent_path_unchanged() {
185+
// Construction is lexical-only, so nonexistent paths flow
186+
// through unchanged.
229187
let uri = Url::parse("file:///nonexistent/path/test.R").unwrap();
230188
let id = UrlId::from_url(uri.clone());
231189
assert_eq!(*id.as_url(), uri);
232190
}
233191

234192
#[test]
235193
#[cfg(target_os = "macos")]
236-
fn test_resolves_tmp_symlink() {
237-
// On macOS, `/tmp` is a symlink to `/private/tmp`. `UrlId` should
238-
// resolve it so that URIs from different sources match.
194+
fn test_does_not_resolve_symlinks() {
195+
// On macOS, `/tmp` is a symlink to `/private/tmp`. `UrlId` does
196+
// *not* resolve it; same input bytes produce the same output
197+
// regardless of the symlink graph on disk. Bridging across
198+
// symlinked names is the job of secondary canonical indexes at
199+
// specific seams (e.g. the DAP breakpoint store), not of
200+
// construction.
239201
let dir = tempfile::tempdir_in("/tmp").unwrap();
240202
let file = dir.path().join("test.R");
241203
std::fs::write(&file, "").unwrap();
242204

243-
let non_canonical = Url::from_file_path(&file).unwrap();
244-
assert!(non_canonical.path().starts_with("/tmp/"));
205+
let original = Url::from_file_path(&file).unwrap();
206+
assert!(original.path().starts_with("/tmp/"));
245207

246-
let id = UrlId::from_url(non_canonical);
247-
assert!(id.as_url().path().starts_with("/private/tmp/"));
208+
let id = UrlId::from_url(original.clone());
209+
assert_eq!(*id.as_url(), original);
210+
assert!(id.as_url().path().starts_with("/tmp/"));
248211
}
249212

250213
#[test]

crates/ark/src/dap/dap_state.rs

Lines changed: 92 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -6,6 +6,7 @@
66
//
77

88
use std::collections::HashMap;
9+
use std::path::PathBuf;
910
use std::sync::Arc;
1011
use std::sync::Mutex;
1112

@@ -220,6 +221,92 @@ impl DapBackendEvent {
220221
}
221222
}
222223

224+
/// Breakpoint storage keyed on the verbatim `UrlId` the frontend sent in
225+
/// `setBreakpoints`. A secondary index maps `fs::canonicalize`d paths
226+
/// back to the primary key, bridging R-runtime srcref URIs (which go
227+
/// through `normalizePath()` and therefore arrive symlink-resolved)
228+
/// against the editor's pre-resolution form.
229+
///
230+
/// `fs::canonicalize` runs at insert time and at lookup-fallback time
231+
/// only. Lookups try the primary first (cheap, no fs touch), then
232+
/// canonicalise the incoming URI and consult the secondary. DAP wire
233+
/// output (`debugInfo` source paths, breakpoint events) round-trips the
234+
/// primary key unchanged, so the frontend always sees the URI it sent.
235+
#[derive(Debug, Default)]
236+
pub struct BreakpointMap {
237+
/// Primary index, keyed on the URI the frontend gave us.
238+
by_url: HashMap<UrlId, (blake3::Hash, Vec<Breakpoint>)>,
239+
/// `fs::canonicalize`d path -> primary key. Populated at insert
240+
/// only when the URL is a `file:` and `canonicalize` succeeds (i.e.
241+
/// the file exists, which it does at `setBreakpoints` time).
242+
by_canonical: HashMap<PathBuf, UrlId>,
243+
}
244+
245+
impl BreakpointMap {
246+
pub fn insert(&mut self, url: UrlId, value: (blake3::Hash, Vec<Breakpoint>)) {
247+
if let Some(canonical) = canonical_path(&url) {
248+
self.by_canonical.insert(canonical, url.clone());
249+
}
250+
self.by_url.insert(url, value);
251+
}
252+
253+
pub fn remove(&mut self, url: &UrlId) -> Option<(blake3::Hash, Vec<Breakpoint>)> {
254+
let primary = self.resolve_primary(url)?.clone();
255+
self.by_canonical.retain(|_, p| p != &primary);
256+
self.by_url.remove(&primary)
257+
}
258+
259+
pub fn get(&self, url: &UrlId) -> Option<&(blake3::Hash, Vec<Breakpoint>)> {
260+
let primary = self.resolve_primary(url)?;
261+
self.by_url.get(primary)
262+
}
263+
264+
pub fn get_mut(&mut self, url: &UrlId) -> Option<&mut (blake3::Hash, Vec<Breakpoint>)> {
265+
// Cloning because the mut accessors can't hold a `&self.by_canonical`
266+
// borrow across `&mut self.by_url`.
267+
let primary = self.resolve_primary(url)?.clone();
268+
self.by_url.get_mut(&primary)
269+
}
270+
271+
pub fn contains_key(&self, url: &UrlId) -> bool {
272+
self.resolve_primary(url).is_some()
273+
}
274+
275+
pub fn iter(&self) -> impl Iterator<Item = (&UrlId, &(blake3::Hash, Vec<Breakpoint>))> {
276+
self.by_url.iter()
277+
}
278+
279+
pub fn iter_mut(
280+
&mut self,
281+
) -> impl Iterator<Item = (&UrlId, &mut (blake3::Hash, Vec<Breakpoint>))> {
282+
self.by_url.iter_mut()
283+
}
284+
285+
/// Resolve to the primary key. Tries bare `url` first. On miss,
286+
/// canonicalises `url` and consults the secondary index. The
287+
/// secondary catches cases like an R srcref `/private/tmp/foo.R`
288+
/// (resolved by `normalizePath()`) looking up an editor URI
289+
/// `/tmp/foo.R` (not resolved).
290+
fn resolve_primary<'a>(&'a self, url: &'a UrlId) -> Option<&'a UrlId> {
291+
if self.by_url.contains_key(url) {
292+
return Some(url);
293+
}
294+
let canonical = canonical_path(url)?;
295+
self.by_canonical.get(&canonical)
296+
}
297+
}
298+
299+
/// `fs::canonicalize` of the path component of a `file:` URL, if any.
300+
/// Returns `None` for non-`file:` URLs (`ark://`, `untitled:`, ...) and
301+
/// when the file isn't on disk.
302+
fn canonical_path(url: &UrlId) -> Option<PathBuf> {
303+
if !url.is_file() {
304+
return None;
305+
}
306+
let path = url.to_file_path().ok()?;
307+
std::fs::canonicalize(path).ok()
308+
}
309+
223310
pub struct Dap {
224311
/// Whether the REPL is stopped with a browser prompt.
225312
pub is_debugging: bool,
@@ -241,8 +328,8 @@ pub struct Dap {
241328
/// Current call stack
242329
pub stack: Option<Vec<FrameInfo>>,
243330

244-
/// Known breakpoints keyed by canonical URI, with document hash
245-
pub breakpoints: HashMap<UrlId, (blake3::Hash, Vec<Breakpoint>)>,
331+
/// Known breakpoints, with document hash.
332+
pub breakpoints: BreakpointMap,
246333

247334
/// Filters for enabled condition breakpoints
248335
pub exception_breakpoint_filters: Vec<String>,
@@ -306,7 +393,7 @@ impl Dap {
306393
is_connected: false,
307394
backend_events_tx: None,
308395
stack: None,
309-
breakpoints: HashMap::new(),
396+
breakpoints: BreakpointMap::default(),
310397
exception_breakpoint_filters: Vec::new(),
311398
fallback_sources: HashMap::new(),
312399
frame_id_to_variables_reference: HashMap::new(),
@@ -864,7 +951,7 @@ mod tests {
864951
is_connected: true,
865952
backend_events_tx: Some(backend_events_tx),
866953
stack: None,
867-
breakpoints: HashMap::new(),
954+
breakpoints: BreakpointMap::default(),
868955
exception_breakpoint_filters: Vec::new(),
869956
fallback_sources: HashMap::new(),
870957
frame_id_to_variables_reference: HashMap::new(),
@@ -978,7 +1065,7 @@ mod tests {
9781065
is_connected: false,
9791066
backend_events_tx: None,
9801067
stack: None,
981-
breakpoints: HashMap::new(),
1068+
breakpoints: BreakpointMap::default(),
9821069
exception_breakpoint_filters: Vec::new(),
9831070
fallback_sources: HashMap::new(),
9841071
frame_id_to_variables_reference: HashMap::new(),

crates/ark/src/lsp/tests/state_handlers.rs

Lines changed: 0 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -307,25 +307,13 @@ fn test_r_file_changed_for_unopened_file_updates_contents() {
307307
}
308308

309309
#[test]
310-
#[ignore = "blocked by UrlId canonicalization gap for missing paths"]
311310
fn test_r_file_deleted_for_editor_open_file_is_skipped() {
312311
// Mirror of `test_r_file_changed_for_editor_open_file_is_skipped`
313312
// for the Deleted kind. The skip check in `apply_watcher_events`
314313
// sits before the kind match, so editor-owned URLs should be
315314
// protected from deletion too: the buffer stays visible to the
316315
// user and queries keep resolving against the editor's
317316
// last-pushed content.
318-
//
319-
// The assertion passes today, but only incidentally: the same
320-
// canonicalization gap that blocks the DESCRIPTION test below also
321-
// breaks the skip lookup here. The event URL canonicalizes to the
322-
// raw `/var/...` path (the file is gone, canonicalize fails),
323-
// so the skip set (built from open documents while the file
324-
// existed, i.e. `/private/var/...`) does not contain it. The event
325-
// therefore reaches `remove_watched_file`, which also fails to find
326-
// the file by URL and bails out. Net: file survives, but the skip
327-
// mechanism is never actually exercised. Marked ignore until the
328-
// canonicalization gap is fixed.
329317
let tmp = tempfile::tempdir().unwrap();
330318
let path = tmp.path().join("a.R");
331319
fs::write(&path, "disk_v1\n").unwrap();
@@ -351,18 +339,10 @@ fn test_r_file_deleted_for_editor_open_file_is_skipped() {
351339
}
352340

353341
#[test]
354-
#[ignore = "blocked by UrlId canonicalization gap for missing paths; see dev-notes/notes/oak/2026-05-22-0846-watcher-test-gaps.md"]
355342
fn test_description_deleted_demotes_package_to_scripts() {
356343
// Inverse of `test_description_created_triggers_root_rescan`: a
357344
// DESCRIPTION removed mid-session triggers a root rescan and the
358345
// former package's R/ files surface as workspace scripts.
359-
//
360-
// Currently fails because `UrlId::from_url` canonicalizes via
361-
// `std::fs::canonicalize`, which fails for paths that no longer
362-
// exist and falls back to the raw path. On macOS this means the
363-
// Deleted event's URL keeps the `/var/...` prefix while the
364-
// workspace root URL uses `/private/var/...`, and the routing in
365-
// `apply_watcher_events` misses.
366346
let tmp = tempfile::tempdir().unwrap();
367347
write_package(&tmp.path().join("pkg"), "pkg", &[("a.R", "x <- 1\n")]);
368348
let mut state = workspace_state(tmp.path());

0 commit comments

Comments
 (0)