Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
228 changes: 132 additions & 96 deletions crates/aether_url/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -6,129 +6,101 @@
//

use std::fmt;
use std::path::Component;

use stdext::result::ResultExt;
use url::Url;

/// Canonicalised file URL identity.
/// Lexically normalised file URL identity.
///
/// # The multi-source URI reconciliation problem
/// Internal identity key for files received from any source (LSP, DAP,
/// scanner, R runtime). Constructed via the same lexical normalisation
/// at every entry point so that two paths the editor considers "the
/// same file" produce the same [`UrlId`].
///
/// File URIs for the same file can arrive from independent sources, each
/// with its own representation:
/// What we normalise: dot segments (`.` and `..`, collapsed lexically);
/// drive-letter casing on Windows; percent-encoding of `:` (decoded via
/// `Url -> PathBuf -> Url` round-trip). No I/O: `std::fs::canonicalize()`,
/// no symlink resolution. The same input URI produces the same [`UrlId`]
/// whether or not the file exists on disk.
///
/// - **DAP (SetBreakpoints)**: Receives raw file paths from the frontend,
/// converted to URIs via `UrlId::from_file_path`. These are stored as
/// HashMap keys for breakpoint lookup.
/// "Lexical" means we resolve `..` by dropping the previous segment, not
/// by asking the filesystem. So `/a/b/../c.R` becomes `/a/c.R` even if
/// `/a/b` is a symlink that points elsewhere. That's the same trade-off
/// the editor makes, so the two agree.
///
/// - **LSP (didChange, etc.)**: Receives URIs directly from the editor
/// client, which may use non-canonical forms (e.g. percent-encoded
/// colons on Windows, or symlinked paths on macOS).
/// # Bridging across symlinks
///
/// - **Execute requests**: A frontend may attach a `code_location` URI
/// that comes straight from the editor's document model, again
/// potentially non-canonical.
/// R's `normalizePath()` resolves symlinks on its own. A srcref URI
/// from the R runtime may name `/private/tmp/foo.R` while the editor
/// sent us `/tmp/foo.R`. The two don't compare equal, so a `HashMap`
/// keyed on `UrlId` treats them as separate files. Code that needs to
/// match a srcref URI back to an open document or a breakpoint should
/// maintain a secondary index of `fs::canonicalize`d paths and fall
/// back to it on a primary miss.
/// [`crate::dap::dap_state::BreakpointMap`] in `ark` does this for
/// breakpoints.
///
/// - **R runtime**: When R evaluates `source()` or annotates code, it
/// passes URIs that went through R's `normalizePath()`, which resolves
/// symlinks to their canonical target (e.g. `/tmp` resolves to `/private/tmp`
/// on macOS), producing a path the editor never sent. More generally,
/// arbitrary R code can create source references that we may end up
/// consuming for breakpoint or debug purposes, and the paths in those
/// references may or may not be canonical.
/// # Important: don't leak normalised URIs back out
///
/// All four sources must agree on file identity. For instance breakpoints set
/// via DAP are looked up in a HashMap keyed by URI when code is executed or
/// sourced, and invalidated when documents change via LSP.
///
/// # Design decision
///
/// We solve this by canonicalizing URIs into [`UrlId`] at every entry
/// point, rather than interning paths into opaque IDs (as rust-analyzer
/// does with its VFS `FileId` approach). Interning would be a larger
/// architectural change and is not warranted here since we only need
/// canonical keys at a handful of call sites.
///
/// Canonicalization uses `std::fs::canonicalize()` to resolve symlinks
/// (e.g. `/tmp` to `/private/tmp` on macOS), round-trips through the
/// filesystem path to normalize encoding variants (e.g. `%3A` to `:` on
/// Windows), and uppercases drive letters on Windows. When the file does
/// not exist on disk, we fall back to the original URI.
///
/// # Important: canonical URIs must not leak
///
/// [`UrlId`] is strictly for internal identity. When a URI flows back
/// to R (e.g. in `#line` directives or injected breakpoint calls) or to
/// the frontend (e.g. in DAP stack frames), always use the original raw
/// URI. The frontend (and possibly R code) expects their own URI
/// representation, and a canonical URI (e.g. `/private/tmp/...` instead of
/// `/tmp/...`) could be treated as a different file (e.g. open a new editor in
/// the frontend instead of an existing one).
///
/// A canonicalized file URI for use as a stable identity key.
///
/// Wraps a [`Url`] that has been canonicalized to resolve symlinks,
/// normalize encoding variants, and uppercase drive letters on Windows.
/// Use this type in HashMaps and anywhere file identity matters.
///
/// Construct via [`UrlId::from_url`], [`UrlId::from_file_path`], or
/// [`UrlId::parse`].
///
/// On Windows, `std::fs::canonicalize()` returns extended-length paths
/// prefixed with `\\?\` (e.g. `\\?\C:\Users\...`). Projects like Ruff
/// use the `dunce` crate to strip this prefix, but we don't need it
/// because `Url::from_file_path` already handles
/// `Prefix::VerbatimDisk` and produces a clean `file:///C:/...` URI.
/// When sending URIs back to the editor or to R, prefer the original
/// bytes the frontend sent rather than the lexically normalized form.
/// The frontend treats a URI as the editor's identity for the file; a
/// normalised form may look like a different file to it (e.g. open a
/// new editor pane).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct UrlId(Url);

impl UrlId {
/// Canonicalize a [`Url`] into a [`UrlId`].
/// Lexically normalise a [`Url`] into a [`UrlId`].
///
/// Resolves symlinks via `std::fs::canonicalize()` and normalizes
/// encoding variants (e.g. `%3A` to `:` on Windows). On Windows, also
/// uppercases the drive letter. Falls back to the original URI for
/// non-file schemes or when the path can't be resolved.
/// Decodes encoding variants (e.g. `%3A` to `:` on Windows),
/// uppercases the Windows drive letter, and collapses `.` and `..`
/// segments lexically. Does no filesystem I/O. Non-`file:` URLs
/// (`ark://`, `untitled:`, ...) pass through untouched.
pub fn from_url(uri: Url) -> Self {
if uri.scheme() != "file" {
return Self(uri);
}

let Some(path) = uri.to_file_path().warn_on_err() else {
return Self(uri);
// Round-trip through `PathBuf` so the URI form matches what
// `Url::from_file_path` produces (decoded `%3A`, etc.). Skip
// on error, we let pathological URIs flow through unchanged.
//
// `Url::parse` already collapses dot segments for `file:` URLs,
// but `Url::from_file_path` does not, so `from_file_path` entry
// points would keep a stray `..`. Normalise the path here so all
// three constructors land on the same `UrlId`.
let uri = match uri.to_file_path().warn_on_err() {
Some(path) => {
let path = lexically_normalize(&path);
Url::from_file_path(&path)
.map_err(|()| anyhow::anyhow!("Failed to convert path to URI: {path:?}"))
.warn_on_err()
.unwrap_or(uri)
},
None => uri,
};

let path = std::fs::canonicalize(&path).trace_on_err().unwrap_or(path);
let uri = Url::from_file_path(&path)
.map_err(|()| anyhow::anyhow!("Failed to convert path to URI: {path:?}"))
.warn_on_err()
.unwrap_or(uri);

#[cfg(windows)]
let uri = uppercase_windows_drive_in_uri(uri);

Self(uri)
}

/// Wrap a [`Url`] that the caller asserts is already canonical.
pub fn from_canonical(uri: Url) -> Self {
Self(uri)
}

/// Convert a file path to a canonical [`UrlId`].
/// Build a [`UrlId`] from a filesystem path.
///
/// Canonicalizes the path to resolve symlinks (e.g. `/var/folders` to
/// `/private/var/folders` on macOS) so the URI matches what R's
/// `normalizePath()` produces. Falls back to the original path if
/// canonicalization fails.
/// Same lexical normalisation as [`Self::from_url`], no filesystem
/// I/O. Errors only if `path` can't be expressed as a URL (e.g.
/// not absolute on platforms that require it).
pub fn from_file_path(path: impl AsRef<std::path::Path>) -> anyhow::Result<Self> {
let path = path.as_ref();
let url = Url::from_file_path(path)
.map_err(|()| anyhow::anyhow!("Failed to convert path to URL: {}", path.display()))?;
Ok(Self::from_url(url))
}

/// Parse a URI string into a canonical [`UrlId`].
/// Parse a URI string into a [`UrlId`].
pub fn parse(s: &str) -> Result<Self, url::ParseError> {
let url = Url::parse(s)?;
Ok(Self::from_url(url))
Expand Down Expand Up @@ -166,6 +138,33 @@ impl fmt::Display for UrlId {
}
}

/// Collapse `.` and `..` segments without touching the filesystem.
///
/// `Path::components` already drops `.` and redundant separators, so all
/// we add is popping the previous segment when we hit a `..`. A `..` that
/// would climb above the root is dropped, matching how the URL parser
/// clamps `file:` paths. Inputs here are always absolute (a `file:` URL
/// can't be built from a relative path), so the leading root component is
/// preserved and a `..` never escapes it.
fn lexically_normalize(path: &std::path::Path) -> std::path::PathBuf {
let mut out = std::path::PathBuf::new();

for component in path.components() {
match component {
Component::ParentDir => match out.components().next_back() {
Some(Component::Normal(_)) => {
out.pop();
},
Some(Component::RootDir | Component::Prefix(_)) => {},
_ => out.push(component),
},
other => out.push(other),
}
}

out
}

/// Uppercase the drive letter in a Windows file URI for consistent hashing.
#[cfg(windows)]
fn uppercase_windows_drive_in_uri(mut uri: Url) -> Url {
Expand Down Expand Up @@ -223,28 +222,33 @@ mod tests {

#[test]
#[cfg(not(windows))]
fn test_fallback_for_nonexistent_path() {
// For paths that don't exist, canonicalization falls back to the
// original path so the URI is unchanged.
fn test_nonexistent_path_unchanged() {
// Construction is lexical-only, so nonexistent paths flow
// through unchanged.
let uri = Url::parse("file:///nonexistent/path/test.R").unwrap();
let id = UrlId::from_url(uri.clone());
assert_eq!(*id.as_url(), uri);
}

#[test]
#[cfg(target_os = "macos")]
fn test_resolves_tmp_symlink() {
// On macOS, `/tmp` is a symlink to `/private/tmp`. `UrlId` should
// resolve it so that URIs from different sources match.
fn test_does_not_resolve_symlinks() {
// On macOS, `/tmp` is a symlink to `/private/tmp`. `UrlId` does
// *not* resolve it; same input bytes produce the same output
// regardless of the symlink graph on disk. Bridging across
// symlinked names is the job of secondary canonical indexes at
// specific seams (e.g. the DAP breakpoint store), not of
// construction.
let dir = tempfile::tempdir_in("/tmp").unwrap();
let file = dir.path().join("test.R");
std::fs::write(&file, "").unwrap();

let non_canonical = Url::from_file_path(&file).unwrap();
assert!(non_canonical.path().starts_with("/tmp/"));
let original = Url::from_file_path(&file).unwrap();
assert!(original.path().starts_with("/tmp/"));

let id = UrlId::from_url(non_canonical);
assert!(id.as_url().path().starts_with("/private/tmp/"));
let id = UrlId::from_url(original.clone());
assert_eq!(*id.as_url(), original);
assert!(id.as_url().path().starts_with("/tmp/"));
}

#[test]
Expand All @@ -254,6 +258,38 @@ mod tests {
assert_eq!(id.as_url().as_str(), "file:///home/user/test.R");
}

#[test]
#[cfg(not(windows))]
fn test_resolves_dot_segments_on_parse() {
let id = UrlId::parse("file:///a/b/../c/./d.R").unwrap();
assert_eq!(id.as_url().as_str(), "file:///a/c/d.R");
}

#[test]
#[cfg(not(windows))]
fn test_resolves_dot_segments_from_file_path() {
// `Url::from_file_path` keeps `..`, unlike `Url::parse`. The
// explicit lexical pass in `from_url` is what makes this entry
// point agree with the others.
let id = UrlId::from_file_path("/a/b/../c/./d.R").unwrap();
assert_eq!(id.as_url().as_str(), "file:///a/c/d.R");
}

#[test]
#[cfg(not(windows))]
fn test_dot_segments_consistent_across_constructors() {
let parsed = UrlId::parse("file:///a/b/../c.R").unwrap();
let from_path = UrlId::from_file_path("/a/b/../c.R").unwrap();
assert_eq!(parsed, from_path);
}

#[test]
#[cfg(not(windows))]
fn test_parent_dir_clamped_at_root() {
let id = UrlId::from_file_path("/../../a.R").unwrap();
assert_eq!(id.as_url().as_str(), "file:///a.R");
}

// Windows-specific tests

#[test]
Expand Down
9 changes: 6 additions & 3 deletions crates/ark/src/console/console_annotate.rs
Original file line number Diff line number Diff line change
Expand Up @@ -940,9 +940,10 @@ pub unsafe extern "C-unwind" fn ps_annotate_source(
// If there are no breakpoints for this file, return NULL to signal no
// annotation needed. Scope the mutable borrow so we can re-borrow after.
let annotated = {
let Some((_, breakpoints)) = dap_guard.breakpoints.get_mut(&uri_id) else {
let Some(entry) = dap_guard.breakpoints.get_mut(&uri_id) else {
return Ok(harp::r_null());
};
let breakpoints = &mut entry.breakpoints;
if breakpoints.is_empty() {
return Ok(harp::r_null());
}
Expand All @@ -960,8 +961,10 @@ pub unsafe extern "C-unwind" fn ps_annotate_source(
// Remove disabled breakpoints. Their verification state is now stale since
// they weren't injected during this annotation. If the user re-enables
// them, they'll be treated as new unverified breakpoints.
if let Some((_, breakpoints)) = dap_guard.breakpoints.get_mut(&uri_id) {
breakpoints.retain(|bp| !matches!(bp.state, BreakpointState::Disabled));
if let Some(entry) = dap_guard.breakpoints.get_mut(&uri_id) {
entry
.breakpoints
.retain(|bp| !matches!(bp.state, BreakpointState::Disabled));
}

Ok(RObject::from(annotated).sexp)
Expand Down
2 changes: 1 addition & 1 deletion crates/ark/src/console/console_repl.rs
Original file line number Diff line number Diff line change
Expand Up @@ -1534,7 +1534,7 @@ impl Console {
let breakpoints = uri_id
.as_ref()
.and_then(|uri_id| dap_guard.breakpoints.get_mut(uri_id))
.map(|(_, v)| v.as_mut_slice());
.map(|entry| entry.breakpoints.as_mut_slice());

match PendingInputs::read(&code, loc, breakpoints) {
Ok(ParseResult::Success(inputs)) => {
Expand Down
13 changes: 7 additions & 6 deletions crates/ark/src/dap/dap_jupyter_handler.rs
Original file line number Diff line number Diff line change
Expand Up @@ -186,12 +186,13 @@ impl DapJupyterHandler {
let breakpoints: Vec<serde_json::Value> = state
.breakpoints
.iter()
.map(|(uri, (_, bps))| {
let source = uri
.as_url()
.to_file_path()
.map_or_else(|_| uri.to_string(), |p| p.to_string_lossy().into_owned());
let source_breakpoints: Vec<serde_json::Value> = bps
.map(|(_, entry)| {
// Echo the verbatim path the frontend sent rather than the
// normalized key, so it sees back its own bytes (e.g. its
// Windows drive-letter casing).
let source = entry.verbatim_path.clone();
let source_breakpoints: Vec<serde_json::Value> = entry
.breakpoints
.iter()
.filter(|bp| !matches!(bp.state, BreakpointState::Disabled))
.map(|bp| {
Expand Down
Loading
Loading