Reliable JavaScript/SourceMap processing via DebugId

Swatinem · Swatinem · commit 0a6d0caa2bfa · 2023-03-23T11:42:00.000+01:00
We want to make processing / SourceMap-ing of JavaScript stack traces more reliable.
To achieve this, we want to uniquely identify a (minified / deployed) JavaScript file using a DebugId.
The same DebugId also uniquely identifies the corresponding SourceMap.
That way it should be possible to _reliably_ look up the SourceMap corresponding to
a JavaScript file, which is necessary to have reliable SourceMap processing.
diff --git a/README.md b/README.md
@@ -36,3 +36,4 @@ This repository contains RFCs and DACIs. Lost?
 - [0071-continue-trace-over-process-boundaries](text/0071-continue-trace-over-process-boundaries.md): Continue trace over process boundaries
 - [0072-kafka-schema-registry](text/0072-kafka-schema-registry.md): Kafka Schema Registry
 - [0078-escalating-issues](text/0078-escalating-issues.md): Escalating Issues
+- [0081-sourcemap-debugid](text/0081-sourcemap-debugid.md): Reliable JavaScript/SourceMap processing via `DebugId`
diff --git a/text/0081-sourcemap-debugid.md b/text/0081-sourcemap-debugid.md
@@ -0,0 +1,362 @@
+- Start Date: 2023-03-21
+- RFC Type: initiative
+- RFC PR: https://github.com/getsentry/rfcs/pull/81
+- RFC Status: draft
+
+# Summary / Motivation
+
+We want to make processing / SourceMap-ing of JavaScript stack traces more reliable.
+To achieve this, we want to uniquely identify a (minified / deployed) JavaScript file using a `DebugId`.
+The same `DebugId` also uniquely identifies the corresponding SourceMap.
+That way it should be possible to _reliably_ look up the SourceMap corresponding to
+a JavaScript file.
+
+# Background
+
+It is currently not possible to _reliably_ find the associated SourceMap for a
+JavaScript file.
+
+A JavaScript stack trace only points to the (minified / transformed) source file
+by its URL, such as `https://example.com/file.min.js`, or `/path/to/local/file.min.js`.
+
+The corresponding SourceMap is often referenced using a `sourceMappingURL` comment
+at the end of that file. It is also possible to have a "hidden" SourceMap that is
+not referenced in such a way, but is typically found by its filename `{js_filename}.map`.
+
+However, it is not guaranteed that the SourceMap found in such a way actually
+corresponds to the JavaScript file in which the error happened.
+
+A classic example is caching:
+
+1. An end user loads version `1` of `https://example.com/file.min.js`.
+2. A new app version `2` is deployed.
+3. The user experiences an error.
+4. The SourceMap at `https://example.com/file.min.js.map` (version `2`) at this point in time does not correspond to
+   the code the user was running.
+
+This problem is even worse at Sentry scale, as at any point in time, errors can come in that happened with arbitrary
+versions of the deployed code, sometimes even involving multiple files which might be out-of-sync with each other.
+
+To work around this problem, Sentry has used the combination of `release` and optional `dist` to better associate
+JavaScript files from one release with SourceMaps uploaded to Sentry.
+
+However, this solution is still not reliable. As mentioned above, even two files loaded in the end user's browser can
+belong to different releases, due to caching or other reasons.
+
+Using a `DebugId`, which uniquely associates the JavaScript file and its corresponding SourceMap, should make source-mapping
+a lot more reliable.
+
+# Supporting Data
+
+TODO: please fill in the gaps here!
+
+Sentry has used the `release + dist` solution for quite some time and found it inadequate.
+A lot of events are not being resolved correctly due to these mismatches, and problems with source-mapping are very
+common in customer-support interactions.
+
+On the other hand, using a `DebugId` for symbolication of Native crashes and stack traces is working reliably both in
+Sentry and in the wider native ecosystem. The Native and C# communities have the concept of _Symbol Servers_, which can
+serve any debug file based on its `DebugId`, which allows reliable symbolication for any release, at any point in time.
+
+# Options Considered
+
+To make `DebugId` work, we need to generate one, and associate it to both the JavaScript file, and its corresponding
+SourceMap.
+
+## The `DebugId` format
+
+The `DebugId` should have the same format as a standard UUID. Specifically,
+it should be a 128-bit (16-byte) value, formatted to a string using base-16 hex encoding like so:
+
+`XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX`
+
+## How to generate `DebugId`s?
+
+There are two options for choosing a `DebugId`: making it completely random, or making it reproducible by deriving it
+from a content hash.
+
+### Based on JavaScript Content-hash
+
+This creates a new `DebugId` by hashing the contents of the JavaScript file.
+
+**pros**
+
+- Is fully reproducible. The same JavaScript file will always have the same `DebugId`.
+- Works well with existing caching solutions.
+
+**cons**
+
+- Increases overhead in server-side SourceMap processing, as one file can potentially be included in multiple _bundles_.
+  See [_What is an `ArtifactBundle`_](#what-is-an-artifactbundle) below.
+- A difference in a source file might not be reflected in the JavaScript file. An example of this might be changes to
+  whitespace, comments, or code that was dead-code-eliminated by bundlers.
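+
+As a minimal sketch of the content-hash approach, the following Node.js helper derives a UUID-shaped `DebugId` from
+file contents. The name `debugIdFromContent` and the choice of SHA-256 truncated to 128 bits are assumptions made for
+illustration only. This RFC does not prescribe a hash function, and the sketch does not set UUID version or variant bits.
+
+```js
+const { createHash } = require("node:crypto");
+
+// Hypothetical helper, not part of any Sentry tool: derives a UUID-shaped
+// DebugId from file contents. Assumption: SHA-256 truncated to 128 bits.
+function debugIdFromContent(content) {
+  // 32 hex characters encode 16 bytes (128 bits).
+  const hex = createHash("sha256").update(content).digest("hex").slice(0, 32);
+  // Group as 8-4-4-4-12 to match XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX.
+  return [
+    hex.slice(0, 8),
+    hex.slice(8, 12),
+    hex.slice(12, 16),
+    hex.slice(16, 20),
+    hex.slice(20, 32),
+  ].join("-");
+}
+```
+
+Hashing the same contents twice yields the same `DebugId`, which is what makes this variant reproducible. The same
+helper could be fed the SourceMap contents instead of the JavaScript contents.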
+
+### Based on SourceMap Content-hash
+
+This creates a new `DebugId` by hashing the contents of the SourceMap file.
+
+**pros**
+
+- Generates a new `DebugId` for changes to source files that would otherwise not lead to changes in the JavaScript file.
+
+**cons**
+
+- Leads to slightly more cache invalidation.
+
+### Random `DebugId`
+
+This option would create a new random `DebugId` for each file, on each build.
+
+**pros**
+
+- Simpler server-side SourceMap processing, as one `DebugId` is only included in a single _bundle_, and that one bundle
+  can serve multiple stack frames for multiple files of the same build.
+
+**cons**
+
+- Completely breaks the concept of _caching_, as every file is unique for every build.
+
+## How to inject the `DebugId` into the JavaScript file?
+
+### `//# debugId` comment
+
+We propose to add a new magic comment to the end of JavaScript files similar to the existing `//# sourceMappingURL`
+comment. It should be at the end of the file, preferably as the line right before the `sourceMappingURL` comment,
+making it the second line from the bottom.
+
+It should look like this:
+
+```js
+someRandomJSCode();
+//# debugId=XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
+//# sourceMappingURL=file.min.js.map
+```
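+
+The placement rule above could be sketched as follows. `injectDebugIdComment` is a hypothetical helper, not an
+existing `sentry-cli` or bundler API, and it assumes `\n` line endings and at most one trailing `sourceMappingURL`
+comment.
+
+```js
+// Hypothetical helper: inserts a `//# debugId` comment at the end of a
+// JavaScript source, right before a trailing `sourceMappingURL` comment if present.
+function injectDebugIdComment(source, debugId) {
+  const comment = `//# debugId=${debugId}`;
+  // Drop trailing whitespace so the last array entry is the last real line.
+  const lines = source.replace(/\s+$/, "").split("\n");
+  const last = lines[lines.length - 1];
+  if (last.startsWith("//# sourceMappingURL=")) {
+    // Keep `sourceMappingURL` as the very last line.
+    lines.splice(lines.length - 1, 0, comment);
+  } else {
+    lines.push(comment);
+  }
+  return lines.join("\n") + "\n";
+}
+```
+
+Note that modifying the file changes its content hash, so a step like this has to run before any fingerprinting.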
+
+### Runtime Detection / Resolution of `DebugId`
+
+In a shiny utopian future, browsers would directly expose builtin APIs to programmatically access each frame of an `Error`'s stack.
+This might include the absolute path, the line and column number, and the `DebugId`.
+Though the reality of today is that each browser has its own text-based `Error.stack` format, which might even give
+completely different line and column numbers across the different browsers.
+No programmatic API exists today, and might never exist. At the very least, widespread support for this is years away.
+
+It is therefore necessary to extract this `DebugId` through other means.
+
+#### Reading the `//# debugId` comment when capturing Errors
+
+Current JavaScript stack traces include the absolute path (called `abs_path`) of each stack frame. It should be possible
+to load and inspect that file at runtime whenever an error happens.
+
+**pros**
+
+- Does not require injecting any _code_ into the JavaScript files.
+
+**cons**
+
+- Might incur some async fetching / IO when capturing an `Error`, though any `abs_path` in the stack trace should already be cached.
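+
+Once the file contents are available, extracting the `DebugId` is a small text search. The following is a sketch
+only: `readDebugIdComment` is a hypothetical name, and how the contents are obtained at runtime (for example by
+fetching the frame's `abs_path`) is left out.
+
+```js
+// Hypothetical helper: returns the DebugId from the last `//# debugId=` comment
+// in a JavaScript source, or null if there is none.
+function readDebugIdComment(source) {
+  const pattern =
+    /^\/\/# debugId=([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})\s*$/gm;
+  let debugId = null;
+  for (const match of source.matchAll(pattern)) {
+    // Keep overwriting so the occurrence closest to the end of the file wins.
+    debugId = match[1];
+  }
+  return debugId;
+}
+```
+
+Only comments that start at the beginning of a line and carry a UUID-shaped value are accepted, so a similar-looking
+string inside the code is less likely to be picked up by accident.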
+
+#### Add the `DebugId` to a global at load time
+
+One solution here is to inject a small snippet of JS which will be executed when the JavaScript file is loaded, and adds
+the `DebugId` to a global map.
+
+An example snippet is here:
+
+```js
+!function(){try{var e="undefined"!=typeof window?window:"undefined"!=typeof global?global:"undefined"!=typeof self?self:{},n=(new Error).stack;n&&(e._sentryDebugIds=e._sentryDebugIds||{},e._sentryDebugIds[n]="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX")}catch(e){}}()
+```
+
+This snippet uses the complete `Error.stack` as a key in a global map called `_sentryDebugIds`, with the `DebugId` as the value.
+Further post-processing at time of capturing an `Error` is required to extract the `abs_path` from that captured stack.
+
+**pros**
+
+- Does not require any async fetching at time of capturing an `Error`.
+
+**cons**
+
+- It does however require parsing of the `Error.stack` at time of capturing the `Error`.
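+
+The post-processing step could look roughly like the sketch below. Both function names are hypothetical, and the
+regular expression is a deliberately simplified assumption that only covers the common `at fn (path:line:col)`,
+`at path:line:col` and `fn@path:line:col` frame shapes. Real-world stack formats have more variations.
+
+```js
+// Hypothetical, simplified parser: returns the file path / URL of the first
+// frame found in an `Error.stack` string, or null if no frame is recognized.
+function firstFramePath(stack) {
+  for (const line of stack.split("\n")) {
+    const match = line.match(/(?:\(|@|at )([^()@\s]+?):\d+:\d+\)?\s*$/);
+    if (match) return match[1];
+  }
+  return null;
+}
+
+// Converts a `{ [stack]: debugId }` map (as filled by the injected snippet)
+// into a `{ [abs_path]: debugId }` map.
+function debugIdsByPath(debugIdsByStack) {
+  const result = {};
+  for (const [stack, debugId] of Object.entries(debugIdsByStack)) {
+    const path = firstFramePath(stack);
+    if (path) result[path] = debugId;
+  }
+  return result;
+}
+```
+
+The first frame of the stored stack belongs to the file containing the injected snippet, so its path can be matched
+against the `abs_path` of the frames in a captured `Error`.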
+
+An alternative implementation might use the [`import.meta.url`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Operators/import.meta)
+property. This would avoid capturing and post-processing an `Error.stack`, but does require usage of ECMAScript Modules.
+
+```js
+((globalThis._sentryDebugIds=globalThis._sentryDebugIds||{})[import.meta.url]="XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX");
+```
+
+**pros**
+
+- More compact snippet.
+- No post-processing required.
+
+**cons**
+
+- Depends on usage of ECMAScript Modules.
+
+## When to inject the `DebugId` into the JavaScript file?
+
+Deploying JavaScript applications can range from a simple _copy files via ftp_
+to a complex workflow like the following:
+
+```mermaid
+graph TD
+    transpile[Transpile source files] --> bundle[Bundle source files]
+    bundle --> minify[Minify bundled chunk]
+    minify --> fingerprint[Fingerprint minified chunks]
+    minify --> sentry[Upload release to Sentry]
+    fingerprint --> upload[Upload assets to CDN]
+    upload --> propagate[Wait for CDN assets to propagate]
+    fingerprint --> deploy[Deploy updated asset references]
+    propagate --> deploy
+```
+
+In this example, assets are _fingerprinted_, and only after they have fully propagated
+through a global CDN does the backend service start referencing them
+via HTML.
+
+Such a pipeline may use unique content-hash based filenames, and may even combine _fingerprinting_ with
+[Subresource Integrity (SRI)](https://developer.mozilla.org/en-US/docs/Web/Security/Subresource_Integrity).
+
+An example may look like this, for a CDN-deployed and fingerprinted reference
+to [katex](https://katex.org/docs/browser.html#starter-template):
+
+```html
+<script
+  defer
+  src="https://cdn.jsdelivr.net/npm/katex@0.16.4/dist/katex.min.js"
+  integrity="sha384-PwRUT/YqbnEjkZO0zZxNqcxACrXe+j766U2amXcgMg5457rve2Y7I6ZJSm2A0mS4"
+  crossorigin="anonymous"
+></script>
+```
+
+Not only is the deployment pipeline very complex, it can also involve a variety of tools with varying degrees of
+integration between them.
+The example `<script>` tag shown above might be generated as part of one integrated JS bundler tool, or it might be
+generated by a Rust or Python backend, based on a supplied JSON file.
+
+The checksums themselves might be directly output by a JS bundler tool, or they might be generated by a completely
+different tool at another stage of the build pipeline.
+
+Each application and build pipeline is unique, and there is an ever-growing multitude of tools.
+_Insert joke about a new JS bundler being created each week here._
+
+It is therefore important that whatever comments and/or code we end up injecting into the final JavaScript assets are
+injected at the right point in this pipeline. Ideally, injection would happen **before** fingerprinting happens, and
+**before** any content-hash based naming happens.
+
+As most JavaScript bundlers support automatic bundle-splitting, and will insert dynamic `import` or `require` statements
+referencing those chunks by (fingerprinted) filename, a deep integration into those various bundlers might be needed.
+
+### Injection via `sentry-cli inject`
+
+With this, injection would happen with a new command, `sentry-cli inject`. It will be the responsibility of the developer
+to call this at the appropriate time depending on their unique build pipeline.
+
+**pros**
+
+- Gives full control for build pipelines that involve a heterogeneous set of tools and stages.
+
+**cons**
+
+- Requires manually using this command.
+- Does not work with bundlers that integrate fingerprinting.
+
+### Injection at `sentry-cli upload` time
+
+In this scenario, injection happens at the time of `sentry-cli upload`, and will also modify the files at that time.
+
+**pros**
+
+- Makes sure that assets uploaded to Sentry have a `DebugId`.
+- No additional command and invocation needed.
+
+**cons**
+
+- Does not work with bundlers that integrate fingerprinting.
+- Does not work in build pipelines where `sentry-cli upload` is not in the main deployment path.
+
+### Injection via bundler plugins
+
+Here, we would build `DebugId` injection right into the various JavaScript bundlers. This can happen with a third-party
+plugin at first, and might move into the core bundler packages once there is enough community buy-in for `DebugId`s.
+
+Each bundler is unique though, and has different hooks at different stages of its internal pipeline. Some bundlers
+might not have the necessary hooks at the necessary stage at all.
+
+#### Rollup
+
+Rollup has a very comprehensive plugin system, with good documentation about the various hooks and the internal pipeline:
+https://rollupjs.org/plugin-development/#output-generation-hooks
+
+According to the above diagram, the appropriate plugin hook to use might be the
+[`renderChunk`](https://rollupjs.org/plugin-development/#renderchunk) hook, which allows
+access and modification of a chunk's `code` and `map` (SourceMap) output.
+This hook runs before the `augmentChunkHash` and `generateBundle` hooks which are responsible for fingerprinting and
+generating the _final_ output for each chunk.
+
+TODO: further investigation and experimentation for this is needed
+
+#### Webpack
+
+Webpack documentation for plugin hooks is not as extensive, and there is no broad overview of the internal pipeline and
+phases. There is a general overview of all the `Compilation` hooks though:
+https://webpack.js.org/api/compilation-hooks/
+
+It might be possible to use the [`processAssets`](https://webpack.js.org/api/compilation-hooks/#processassets) hook
+for this purpose. The documentation mentions the `PROCESS_ASSETS_STAGE_DEV_TOOLING` phase, which is responsible for
+extracting SourceMaps, and the `PROCESS_ASSETS_STAGE_OPTIMIZE_HASH` phase, which looks to be responsible for generating
+the final fingerprint of an asset.
+
+TODO: further investigation and experimentation for this is needed
+
+#### TODO: other popular bundlers and build tools
+
+## Injecting the `DebugId` into the SourceMap
+
+This is a less controversial part, as SourceMaps are in general not distributed to production, and are less likely to
+be fingerprinted or integrity-checked. They are also plain JSON, making it trivial to inject additional fields.
+We propose to add a new JSON field to the root of the SourceMap object called `debugId`.
+This new field should encode the `DebugId` as a plain string.
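+
+A minimal sketch of this injection, assuming the SourceMap is available as JSON text (`injectDebugIdIntoSourceMap` is
+a hypothetical helper name, not an existing API):
+
+```js
+// Hypothetical helper: adds a root-level `debugId` string field to a SourceMap
+// given as JSON text, and returns the updated JSON text.
+function injectDebugIdIntoSourceMap(sourceMapJson, debugId) {
+  const sourceMap = JSON.parse(sourceMapJson);
+  // All other fields (`version`, `sources`, `mappings`, ...) stay untouched.
+  sourceMap.debugId = debugId;
+  return JSON.stringify(sourceMap);
+}
+```
+
+Tools that do not know about the new field would be expected to simply ignore it, though this is an assumption that
+still needs to be verified against existing SourceMap consumers.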
+
+# Drawbacks
+
+The main drawback is that this might feel like an invasive change to the JavaScript ecosystem. It is a huge implementation
+burden, and might not be received positively by either customers or the wider JS tools ecosystem.
+
+In particular, injecting a piece of JavaScript into every production asset might alienate some users.
+
+The effectiveness and success of this initiative need to be proven first, and are not guaranteed.
+
+# Unresolved questions
+
+- ~~Why do we call the new SourceMap field `debug_id` and not `debugId`?
+  All existing fields in SourceMaps are camelCase, and so is the general convention in the JS ecosystem.~~
+
+# Implementation
+
+- TODO: link to some implementation breadcrumbs and PRs
+- TODO: change the existing SourceMap implementation to use a camelCased `debugId` instead of the snake_cased `debug_id` field.
+
+---
+
+# Appendix
+
+## What is an `ArtifactBundle`
+
+Sentry bundles up all the assets of one release / build into a so-called `ArtifactBundle` (also called `SourceBundle`, or `ReleaseBundle`).
+
+This is a special ZIP file which includes all the minified / production JavaScript files, their corresponding SourceMaps,
+and the original source files as referenced by the SourceMaps in whatever format (TypeScript or other).
+
+It also has a `manifest.json`, which has more metadata per file, like the type of a file, its `DebugId`, and an optional
+`SourceMap` reference from minified files to their SourceMap.
+
+**pros**
+
+- Customers naturally think in _releases_, so having one archive per release is good.
+- Only needing to download / cache / process a single file for one release can be more efficient.
+
+**cons**
+
+- Does not work well with content-hash based `DebugId`s, as one `DebugId` can appear in a multitude of archives.
+- Feels like a workaround for inefficiencies in other parts of the processing pipeline when dealing with many smaller files.