diff --git a/docs/component-identity-merging.md b/docs/component-identity-merging.md
Lines changed: 293 additions & 0 deletions
diff --git a/docs/component-identity-reconciliation-design.md b/docs/component-identity-reconciliation-design.md
Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
+# Component Identity Reconciliation — Design
+
+## Data Flow Overview
+
+```
+Detector A (e.g., NpmComponentDetector)
+  └─ ComponentRecorder A
+       ├─ SingleFileComponentRecorder (package.json)
+       │    ├─ detectedComponentsInternal: { componentId → DetectedComponent }
+       │    └─ DependencyGraph: nodes keyed by componentId
+       └─ GetDetectedComponents(): merges across SingleFileComponentRecorders, groups by component.Id
+
+Detector B (e.g., NpmComponentDetectorWithRoots)
+  └─ ComponentRecorder B
+       ├─ SingleFileComponentRecorder (package-lock.json)
+       │    ├─ detectedComponentsInternal: { componentId → DetectedComponent }
+       │    └─ DependencyGraph: nodes keyed by componentId
+       └─ GetDetectedComponents(): merges across SingleFileComponentRecorders, groups by component.Id
+
+DefaultGraphTranslationService.GenerateScanResultFromProcessingResult()
+  ├─ GatherSetOfDetectedComponentsUnmerged(): iterates each detector's recorder,
+  │    calls componentRecorder.GetDetectedComponents(), enriches with graph data
+  ├─ FlattenAndMergeComponents(): groups by component.Id + detector.Id, merges within groups
+  │    → produces ComponentsFound
+  └─ AccumulateAndConvertToContract(): iterates all detectors' graphs by file location
+       → produces DependencyGraphs
+```
+
+## Three Reconciliation Points
+
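+Identity conflicts arise when the same package is recorded under both a bare Id (equal to its `BaseId`) and a rich Id (the `BaseId` extended with `DownloadUrl` or `SourceUrl`). They can occur at three levels, and we fix them bottom-up.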
+
+### Point 1: SingleFileComponentRecorder::RegisterUsage
+
+**What it reconciles:** Same package within a single file producing different Ids.
+
+**When this happens:** Unlikely for npm (a single lockfile either has all resolved URLs or none), but possible in theory if a file format contains asymmetric information about the same package in different sections.
+
+**Current behavior:** `RegisterUsage` stores each component with `detectedComponentsInternal.GetOrAdd(componentId, detectedComponent)`, keyed by `Id`. Two entries that share a `BaseId` but have different Ids are therefore stored separately, and graph edges reference whichever Id was used at registration time.
+
+**Fix:** Before storing, check if an entry with the same `BaseId` exists. Only bare and rich merge — rich entries never merge with each other. Specifically:
+- **Incoming is bare, existing is rich:** Redirect the bare registration to use the existing rich entry's Id. Add graph edges under the rich Id.
+- **Incoming is rich, existing is bare:** Re-key the existing bare entry to the rich Id. Update the graph node key and parents' children sets. Enrich the stored component with URL data.
+- **Incoming is rich, existing is rich (different Ids):** Keep both as separate entries. No merge.
+
+This requires a `BaseId → Id` lookup (e.g., a secondary dictionary) to find existing entries that share the same `BaseId`.
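+
+The three cases above can be sketched with a secondary `BaseId → Ids` dictionary. This is a minimal illustration, not the real `SingleFileComponentRecorder`: the `BaseIdAwareRegistry` type and its members are hypothetical, the dependency graph re-keying is left out, and choosing the first registered rich Id when several exist is an assumption the design does not specify.
+
+```csharp
+using System.Collections.Generic;
+
+// Hypothetical, simplified registry; graph edges are not modelled here.
+public sealed class BaseIdAwareRegistry
+{
+    // Primary store: component Id -> BaseId (stand-in for the stored component).
+    private readonly Dictionary<string, string> componentsById = new();
+
+    // Secondary lookup: BaseId -> Ids currently stored, in registration order.
+    private readonly Dictionary<string, List<string>> idsByBaseId = new();
+
+    public IReadOnlyCollection<string> StoredIds => componentsById.Keys;
+
+    // Returns the Id under which the registration ends up being stored.
+    public string Register(string id, string baseId)
+    {
+        if (!idsByBaseId.TryGetValue(baseId, out var known))
+        {
+            known = new List<string>();
+            idsByBaseId[baseId] = known;
+        }
+
+        if (id == baseId)
+        {
+            // Incoming is bare, existing is rich: redirect to a rich Id.
+            // Assumption: with several rich Ids, the first registered one wins.
+            foreach (var existing in known)
+            {
+                if (existing != baseId)
+                {
+                    return existing;
+                }
+            }
+        }
+        else if (known.Remove(baseId))
+        {
+            // Incoming is rich, existing is bare: re-key the bare entry.
+            componentsById.Remove(baseId);
+        }
+
+        // Rich alongside a different rich Id: both are kept as separate entries.
+        if (!known.Contains(id))
+        {
+            known.Add(id);
+        }
+
+        componentsById[id] = baseId;
+        return id;
+    }
+}
+```
+
+Registering a bare Id and then a rich Id for the same `BaseId` leaves one stored entry under the rich Id. A later bare registration returns that rich Id, and a second, different rich Id is stored next to the first.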
+
+**Scope:** Defensive. Low priority since single-file conflicts are rare.
+
+### Point 2: ComponentRecorder::GetDetectedComponents + GatherSetOfDetectedComponentsUnmerged
+
+**What it reconciles:** Same package across different files read by one detector.
+
+**When this happens:** A detector reads multiple manifest files. For example, `NpmComponentDetectorWithRoots` processes multiple lockfiles in a monorepo — one lockfile might have a resolved URL for `lodash@4.17.23` and another might not (e.g., an older lockfile format).
+
+#### What already works
+
+The existing code already handles cross-file and cross-graph reconciliation — **under the assumption that the same package always has the same Id:**
+
+- **`ComponentRecorder::GetDetectedComponents()`** merges component-level metadata (LicensesConcluded, Suppliers, ContainerDetailIds) across `SingleFileComponentRecorder`s by grouping on `component.Id`.
+- **`GatherSetOfDetectedComponentsUnmerged()`** enriches each component with graph-level metadata (roots, ancestors, locations, devDep, scope) from every graph that contains the component by matching on `component.Component.Id`.
+
+Both of these work correctly when `Id` is the same across files/graphs.
+
+#### What we are fixing
+
+When `DownloadUrl` or `SourceUrl` is set, `TypedComponent.Id` includes these values (via `GetExtendedIdProperties()`). If one file produces a bare Id and another produces a rich Id for the same package, the existing merge logic treats them as different packages because their Ids differ.
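+
+To make the bare/rich distinction concrete, here is a small sketch of how the two Ids relate. `PackageIdentity` is a hypothetical stand-in for `TypedComponent`, and the bracketed suffix format is an assumption made for illustration; the real `GetExtendedIdProperties()` output may differ.
+
+```csharp
+// Hypothetical stand-in for TypedComponent; only the identity logic is modelled.
+public sealed class PackageIdentity
+{
+    public PackageIdentity(string name, string version, string downloadUrl = null)
+    {
+        Name = name;
+        Version = version;
+        DownloadUrl = downloadUrl;
+    }
+
+    public string Name { get; }
+
+    public string Version { get; }
+
+    public string DownloadUrl { get; }
+
+    // Identity built from name and version only.
+    public string BaseId => $"{Name}@{Version}";
+
+    // Assumed format: BaseId plus a bracketed suffix when a URL is set.
+    public string Id => DownloadUrl == null
+        ? BaseId
+        : $"{BaseId} [DownloadUrl:{DownloadUrl}]";
+
+    // Bare means no URL data contributed to the Id.
+    public bool IsBare => Id == BaseId;
+}
+```
+
+Two instances for `lodash` `4.17.23`, one with a download URL and one without, share a `BaseId` but compare unequal on `Id`, which is exactly why grouping on `Id` splits them.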
+
+**The fix is not new reconciliation logic — it is extending the existing reconciliation to also recognize bare and rich entries as the same package when they share a `BaseId`.**
+
+#### Key semantics of the scan result
+
+`DefaultGraphScanResult.ComponentsFound` is produced by `GatherSetOfDetectedComponentsUnmerged` + `FlattenAndMergeComponents`.
+
+`FlattenAndMergeComponents` groups by `component.Id + detector.Id`. This means the existing design **intentionally** keeps same-package-from-different-detectors as separate entries in `ComponentsFound`. We preserve this semantic — cross-detector reconciliation is **not** needed for `ComponentsFound`.
+
+The reconciliation scope for `ComponentsFound` is **within a single detector** — i.e., within one `ComponentRecorder`. This is handled by `GetDetectedComponents()` and `GatherSetOfDetectedComponentsUnmerged`.
+
+In production code, `ComponentRecorder.GetDetectedComponents()` is called only by `GatherSetOfDetectedComponentsUnmerged`, so it is safe to change.
+
+#### Plan for ComponentsFound (two-step merge)
+
+**Step 1 — `ComponentRecorder::GetDetectedComponents()`: Component-level metadata merge**
+
+Current behavior groups by `component.Id`. Bare and rich entries for the same package end up in different groups.
+
+New behavior:
+
+1. Group by `component.Component.BaseId` (instead of `component.Id`).
+2. Within each BaseId group, separate entries into **rich** (Id != BaseId) and **bare** (Id == BaseId).
+3. Rich entries stay separate from each other (they have different Ids — e.g., different DownloadUrls).
+4. If rich entries exist, merge bare's component-level metadata into **each** rich entry:
+   - `LicensesConcluded` (union)
+   - `Suppliers` (union)
+   - `ContainerDetailIds` (union)
+5. Drop the bare entry from the output.
+6. If no rich entries exist, keep the bare entry as-is.
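+
+The six steps above could look roughly like the following. This is a sketch under stated assumptions: `ComponentSketch` and `BaseIdMerge` are hypothetical names, only `LicensesConcluded` is modelled (the other two sets would be unioned the same way), and the input is assumed to be already merged by `Id`, so each `BaseId` group holds at most one bare entry.
+
+```csharp
+using System.Collections.Generic;
+using System.Linq;
+
+// Hypothetical, simplified component with one metadata set.
+public sealed class ComponentSketch
+{
+    public ComponentSketch(string id, string baseId, params string[] licenses)
+    {
+        Id = id;
+        BaseId = baseId;
+        LicensesConcluded = new HashSet<string>(licenses);
+    }
+
+    public string Id { get; }
+
+    public string BaseId { get; }
+
+    public HashSet<string> LicensesConcluded { get; }
+
+    public bool IsBare => Id == BaseId;
+}
+
+public static class BaseIdMerge
+{
+    // Note: this sketch mutates the rich entries it is given.
+    public static List<ComponentSketch> Merge(IEnumerable<ComponentSketch> components)
+    {
+        var result = new List<ComponentSketch>();
+
+        foreach (var group in components.GroupBy(c => c.BaseId))
+        {
+            var rich = group.Where(c => !c.IsBare).ToList();
+            var bare = group.Where(c => c.IsBare).ToList();
+
+            if (rich.Count == 0)
+            {
+                // No rich entry exists: keep the bare entry as-is.
+                result.AddRange(bare);
+                continue;
+            }
+
+            foreach (var richEntry in rich)
+            {
+                // Union the bare metadata into each rich entry.
+                foreach (var bareEntry in bare)
+                {
+                    richEntry.LicensesConcluded.UnionWith(bareEntry.LicensesConcluded);
+                }
+
+                // Rich entries stay separate; the bare entry is dropped.
+                result.Add(richEntry);
+            }
+        }
+
+        return result;
+    }
+}
+```
+
+Given one bare entry and two rich entries for the same `BaseId`, `Merge` returns the two rich entries, each carrying the bare entry's licenses in addition to its own.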
+
+**Step 2 — `GatherSetOfDetectedComponentsUnmerged()`: Graph-level metadata merge**
+
+After Step 1, the bare entry is gone and each rich component carries the bare entry's component-level metadata. However, the graph for the file that originally registered the bare entry still stores that node under the bare Id. The enrichment loop:
+
+```csharp
+foreach (var graphKvp in dependencyGraphsByLocation.Where(x => x.Value.Contains(component.Component.Id)))
+```
+
+...won't match graphs that contain the bare Id for components that are now keyed by a rich Id.
+
+Fix: extend the graph lookup to also match on `BaseId`:
+
+```csharp
+foreach (var graphKvp in dependencyGraphsByLocation.Where(x =>
+    x.Value.Contains(component.Component.Id) || x.Value.Contains(component.Component.BaseId)))
+```
+
+This way each rich component picks up graph-level metadata (roots, ancestors, locations, dev dependency, dependency scope) from the bare entry's graphs. Since all rich entries for the same BaseId run through this loop independently, the bare entry's graph data is absorbed into **all** of them — "merge into all" happens naturally.
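+
+A small, self-contained demonstration of the "merge into all" effect, assuming each graph is reduced to the set of Ids it contains. `GraphLookupSketch` and `LocationsFor` are hypothetical names; the real loop works on dependency graph objects and gathers more than locations.
+
+```csharp
+using System;
+using System.Collections.Generic;
+using System.Linq;
+
+public static class GraphLookupSketch
+{
+    // graphsByLocation: file location -> Ids present in that file's graph.
+    // Returns the locations whose graph contains the Id or the BaseId.
+    public static List<string> LocationsFor(
+        string id,
+        string baseId,
+        IReadOnlyDictionary<string, HashSet<string>> graphsByLocation)
+    {
+        return graphsByLocation
+            .Where(kvp => kvp.Value.Contains(id) || kvp.Value.Contains(baseId))
+            .Select(kvp => kvp.Key)
+            .OrderBy(location => location, StringComparer.Ordinal)
+            .ToList();
+    }
+}
+```
+
+With three graphs that contain the bare Id, rich Id A, and rich Id B respectively, a lookup for rich Id A returns the bare graph's location and its own, and a lookup for rich Id B does the same with its own graph. Neither rich entry picks up the other's location.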
+
+#### Summary
+
+| Step | Where | What merges | Into what |
+|---|---|---|---|
+| 1 | `GetDetectedComponents()` | Bare's LicensesConcluded, Suppliers, ContainerDetailIds | Each rich entry with same BaseId |
+| 2 | `GatherSetOfDetectedComponentsUnmerged()` | Bare's graph data (roots, ancestors, locations, devDep, scope) | Each rich entry with same BaseId |
+
+Rich entries never merge with each other. Bare merges into all rich. If no rich exists, bare stays.
+
+---
+
+## DependencyGraphs in Scan Result
+
+### Conclusion: No changes needed
+
+After analysis, the `DependencyGraphs` field requires **no identity reconciliation**. Bare Ids remain in the graph output as-is. Downstream consumers must use `ComponentsFound` as the authoritative source for component identity and metadata.
+
+### What DependencyGraphs contains
+
+`DefaultGraphScanResult.DependencyGraphs` is produced by `GraphTranslationUtility.AccumulateAndConvertToContract()`. It iterates each detector's `ComponentRecorder`, walks every graph by file location, and merges them into a dictionary of `DependencyGraphWithMetadata` keyed by file location.
+
+Each `DependencyGraphWithMetadata` contains:
+
+| Field | Type | Content |
+|---|---|---|
+| `Graph` | `DependencyGraph` | Adjacency list: maps each component Id (string) to the set of Ids it depends on |
+| `ExplicitlyReferencedComponentIds` | `HashSet<string>` | Component Ids that are direct/explicit dependencies |
+| `DevelopmentDependencies` | `HashSet<string>` | Component Ids classified as dev-only |
+| `Dependencies` | `HashSet<string>` | Component Ids classified as production dependencies |
+
+**Every field is purely structural** — string-based Ids and edges between them. There is no component metadata (no name, version, download URL, license, etc.) in the graph output. Component metadata lives exclusively in `ComponentsFound`.
+
+### Why no reconciliation is needed
+
+1. **No component metadata to enrich.** The graph output has no `TypedComponent` objects, no `DetectedComponent` wrappers — just string Ids forming an adjacency structure. There is nothing to "merge" in the metadata sense.
+
+2. **Bare Ids cannot be rewritten to rich Ids.** A bare entry doesn't know which rich Id it should become. Consider: if `lodash@4.17.23` appears as a bare node in one graph and two different lockfiles produce two different rich Ids (with different `DownloadUrl`s), the bare Id has no way to choose which rich Id to rewrite to. Any rewriting would be ambiguous.
+
+3. **Graph structure is still correct.** A bare Id in the graph still correctly represents the dependency relationships. The edges (who depends on whom) and classifications (dev vs prod, explicit vs transitive) are all accurate — they don't depend on whether the Id includes a URL suffix.
+
+4. **Existing merge logic works.** `AccumulateAndConvertToContract` merges graphs across detectors by file location. Within a file location, it unions the graph edges and dependency sets. If one detector registered `lodash@4.17.23` (bare) and another registered `lodash@4.17.23 [DownloadUrl:...]` (rich), both appear in the merged graph. This is correct — they came from different detectors with different levels of information.
+
+### Downstream contract
+
+Consumers of the scan result that need to resolve a component Id from `DependencyGraphs` to its full metadata (name, version, download URL, license, etc.) must look it up in `ComponentsFound`. This was already implicitly the contract — `DependencyGraphs` never carried component metadata. The bare/rich split makes this explicit:
+
+- A bare Id in `DependencyGraphs` can be matched via `BaseId` to the rich entry or entries in `ComponentsFound`; when several rich entries share that `BaseId`, the match is one-to-many.
+- A rich Id in `DependencyGraphs` can be matched directly by `Id` in `ComponentsFound`.
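+
+A consumer-side lookup following these two rules might look like the sketch below. `FoundComponent` and `GraphIdResolver` are hypothetical, and the sketch assumes the consumer can read or derive a `BaseId` for each `ComponentsFound` entry, which the scan result contract may not expose directly.
+
+```csharp
+using System.Collections.Generic;
+using System.Linq;
+
+// Hypothetical view of a ComponentsFound entry.
+public sealed class FoundComponent
+{
+    public FoundComponent(string id, string baseId)
+    {
+        Id = id;
+        BaseId = baseId;
+    }
+
+    public string Id { get; }
+
+    public string BaseId { get; }
+}
+
+public static class GraphIdResolver
+{
+    // Resolves a graph Id to matching entries: exact Id first, then BaseId.
+    // A list is returned because a bare Id may map to several rich entries.
+    public static List<FoundComponent> Resolve(
+        string graphId,
+        IEnumerable<FoundComponent> componentsFound)
+    {
+        var all = componentsFound.ToList();
+
+        var exact = all.Where(c => c.Id == graphId).ToList();
+        if (exact.Count > 0)
+        {
+            return exact;
+        }
+
+        return all.Where(c => c.BaseId == graphId).ToList();
+    }
+}
+```
+
+Trying the exact `Id` first means a rich graph Id never fans out to its siblings; only a bare graph Id falls back to the `BaseId` match.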
+
+This is a **documentation concern**, not a code change. Downstream tools (e.g., CRA, SBOM generators) should be aware that graph Ids may be bare even when richer component data exists in `ComponentsFound`.
+
+## Open Questions
+
+1. **Performance of BaseId lookup in graph search (Step 2):** The extended graph lookup checks both `Id` and `BaseId` for every component × every graph, so it performs at most two `Contains` checks per pair instead of one. This constant-factor increase is likely acceptable because the existing lookup already scans every graph for every component.
+2. **Are there ecosystems where a single file legitimately produces both bare and rich entries for the same package?** If so, Point 1 (RegisterUsage) becomes more than defensive.
+3. **Downstream consumer awareness:** Tools consuming `DependencyGraphs` (CRA, SBOM generators) need to know that graph Ids may be bare and must cross-reference `ComponentsFound` for full metadata. This may require documentation or a schema note in the scan result contract.