feat: add Confluence metadata extractor #515
Extract page metadata and relationships from Confluence spaces via the REST API v2. Emits space and document entities with belongs_to, child_of, owned_by, and documented_by edges. Scans page content for URN references to auto-link documentation to data assets.
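The URN-scanning step described above can be sketched roughly as follows. This is a minimal illustration, not the extractor's actual code: the regular expression assumes URNs shaped like `urn:<service>:<scope>:<type>:<id>`, and `urnPattern` and `findURNs` are hypothetical names introduced here.

```go
package main

import (
	"fmt"
	"regexp"
)

// urnPattern is an ASSUMED shape for asset URNs (urn:service:scope:type:id).
// The pattern used by the real extractor may differ.
var urnPattern = regexp.MustCompile(`urn:[A-Za-z0-9_-]+:[A-Za-z0-9_-]+:[A-Za-z0-9_-]+:[A-Za-z0-9_.-]+`)

// findURNs (hypothetical helper) returns the distinct URNs found in body,
// in order of first appearance.
func findURNs(body string) []string {
	seen := make(map[string]struct{})
	var urns []string
	for _, m := range urnPattern.FindAllString(body, -1) {
		if _, ok := seen[m]; ok {
			continue // already recorded; emit one edge per distinct URN
		}
		seen[m] = struct{}{}
		urns = append(urns, m)
	}
	return urns
}

func main() {
	// Sample page body in storage format; the URNs are made up for illustration.
	body := `<p>Reads urn:bigquery:prod:table:sales.orders and
urn:kafka:prod:topic:orders; see urn:bigquery:prod:table:sales.orders again.</p>`
	for _, u := range findURNs(body) {
		fmt.Println(u)
	}
}
```

Each distinct match would then become the target of one `documented_by` edge. De-duplicating before emitting avoids repeated edges when a page mentions the same asset several times.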
📝 Walkthrough
A new Confluence extractor plugin was added to extract metadata and relationships from Confluence spaces and pages via the REST API v2. The implementation includes a REST client (
Sequence Diagram
sequenceDiagram
participant E as Extract Flow
participant C as Confluence Client
participant API as Confluence REST API v2
participant Emit as Record Emitter
E->>C: GetSpaces(ctx, keys)
C->>API: GET /spaces (with pagination via cursor)
API-->>C: Spaces list
C-->>E: []Space
loop For each space (not excluded)
E->>Emit: Emit space record
E->>C: GetPages(ctx, spaceID)
C->>API: GET /spaces/{id}/pages (cursor pagination, storage format)
API-->>C: Pages list
C-->>E: []Page
loop For each page
E->>C: GetPageLabels(ctx, pageID)
C->>API: GET /pages/{id}/labels
API-->>C: Labels
C-->>E: []Label
E->>E: Extract metadata, scan body for URNs
E->>Emit: Emit document record
E->>Emit: Emit belongs_to edge (space)
E->>Emit: Emit child_of edge (parent page if exists)
E->>Emit: Emit owned_by edge (author)
E->>Emit: Emit documented_by edges (per detected URN)
end
end
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugins/extractors/confluence/client.go`:
- Around line 152-159: GetPageLabels currently only fetches the first page;
implement cursor-based pagination like GetSpaces/GetPages by looping until there
is no next cursor. Change GetPageLabels to call c.get repeatedly with query
params limit and cursor (or follow the returned _links.next), accumulate
resp.Results into a single slice, and update the local resp struct to include
the pagination metadata (e.g., a Links or _links.next field) so you can extract
the next cursor; ensure errors from c.get are wrapped as before and return the
full aggregated []Label when done.
In `@plugins/extractors/confluence/confluence_test.go`:
- Around line 149-155: The test currently loops over records := emitter.Get()
and only asserts each found "space_key" is not "ARCHIVE", which silently passes
if no records are emitted; update the test to first assert that records is not
empty (e.g., assert.NotEmpty or assert.Greater(len(records), 0) on records
returned by emitter.Get()), then iterate the records from emitter.Get() and
assert that at least one record's props["space_key"] is present and not equal to
"ARCHIVE" (set a found flag while inspecting r.Entity().GetProperties().AsMap()
and assert the flag is true). Ensure you reference the same symbols
(emitter.Get(), records, r.Entity().GetProperties().AsMap(), "space_key") when
making these assertions so the test fails if extraction or filtering removes all
spaces.
In `@plugins/extractors/confluence/confluence.go`:
- Around line 90-95: The current code ignores errors from the emit callback and
downgrades e.extractPages failures to warnings; update the logic so emit
failures and page extraction errors are propagated up instead of suppressed:
check the return value from emit(e.buildSpaceRecord(space)) and return that
error if non-nil, and if e.extractPages(ctx, emit, space) returns an error
return it (don’t just log a warning). Apply the same change where pages are
emitted (the other occurrence noted around line 114) so both emit calls and all
e.extractPages failures bubble up to the caller.
- Around line 148-155: The timestamp formatting in the props map uses the
literal layout "2006-01-02T15:04:05Z" for page.CreatedAt and
page.Version.CreatedAt which forces a literal 'Z' instead of emitting proper
timezone offsets; update those calls to use time.RFC3339 (e.g.,
page.CreatedAt.Format(time.RFC3339) and
page.Version.CreatedAt.Format(time.RFC3339)) and add the missing import "time".
Ensure the changes are applied where props is constructed (referencing
page.CreatedAt and page.Version.CreatedAt) so timestamps include correct
timezone information.
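To illustrate the timestamp finding, the sketch below formats one non-UTC time with both layouts. The `+05:30` zone, the sample date, and the helper names `formatBoth` and `sampleTime` are assumptions chosen for the example; they are not taken from the PR.

```go
package main

import (
	"fmt"
	"time"
)

// literalZLayout is the layout flagged in the review: the trailing "Z" is
// copied verbatim into the output regardless of the time's zone.
const literalZLayout = "2006-01-02T15:04:05Z"

// formatBoth (hypothetical helper) renders t with the literal-Z layout and
// with time.RFC3339 so the two outputs can be compared.
func formatBoth(t time.Time) (literalZ, rfc3339 string) {
	return t.Format(literalZLayout), t.Format(time.RFC3339)
}

// sampleTime returns a fixed instant in a +05:30 zone, for illustration only.
func sampleTime() time.Time {
	zone := time.FixedZone("IST", 5*3600+30*60)
	return time.Date(2026, time.April, 18, 22, 46, 54, 0, zone)
}

func main() {
	literalZ, rfc3339 := formatBoth(sampleTime())
	fmt.Println(literalZ) // wall-clock time followed by a misleading "Z"
	fmt.Println(rfc3339)  // same wall-clock time with its real offset
}
```

For times already in UTC the two layouts produce the same string, so the difference only shows up when the API returns or the code constructs a time with a non-zero offset.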
    emit(e.buildSpaceRecord(space))

    if err := e.extractPages(ctx, emit, space); err != nil {
        e.logger.Warn("failed to extract pages from space, skipping",
            "space", space.Key, "error", err)
    }
Propagate emit and page extraction failures.
Line 90 and Line 114 ignore emitter failures, and Lines 92-95 downgrade page extraction failures to a warning. That can make a run succeed with missing records or failed downstream writes.
Proposed error propagation
- emit(e.buildSpaceRecord(space))
+ if err := emit(e.buildSpaceRecord(space)); err != nil {
+ return fmt.Errorf("emit space %s: %w", space.Key, err)
+ }
if err := e.extractPages(ctx, emit, space); err != nil {
- e.logger.Warn("failed to extract pages from space, skipping",
- "space", space.Key, "error", err)
+ return fmt.Errorf("extract pages from space %s: %w", space.Key, err)
}
- emit(e.buildPageRecord(page, space, spaceURN, labels))
+ if err := emit(e.buildPageRecord(page, space, spaceURN, labels)); err != nil {
+ return fmt.Errorf("emit page %s: %w", page.ID, err)
+ }
Also applies to: 114-114
Not applicable — plugins.Emit is defined as func(models.Record) with no return value, so there's no error to propagate. The warn-and-skip pattern for page extraction matches other extractors (e.g. GitHub).
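The shape of that argument can be sketched in a few lines. This is a self-contained illustration under stated assumptions: `Record`, `extractAll`, and `fakeListPages` are hypothetical stand-ins, the `space:`/`page:` strings are placeholders rather than real URNs, and only the no-return `Emit` signature and the warn-and-skip behaviour come from the thread.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// Record stands in for models.Record; its single field is hypothetical.
type Record struct{ URN string }

// Emit mirrors the shape described in the thread: a callback with no return
// value, so the caller has no error to check.
type Emit func(Record)

// extractAll emits one record per space and skips any space whose page
// listing fails, logging a warning instead of aborting the run. It returns
// the number of skipped spaces so the example is easy to inspect.
func extractAll(spaces []string, listPages func(string) ([]string, error), emit Emit) int {
	skipped := 0
	for _, space := range spaces {
		emit(Record{URN: "space:" + space})
		pages, err := listPages(space)
		if err != nil {
			log.Printf("warn: failed to extract pages from space %s, skipping: %v", space, err)
			skipped++
			continue
		}
		for _, p := range pages {
			emit(Record{URN: "page:" + p})
		}
	}
	return skipped
}

// fakeListPages is a stub that fails for the space "BROKEN".
func fakeListPages(space string) ([]string, error) {
	if space == "BROKEN" {
		return nil, errors.New("simulated API failure")
	}
	return []string{space + "-1", space + "-2"}, nil
}

func main() {
	var got []Record
	collect := func(r Record) { got = append(got, r) }
	skipped := extractAll([]string{"ENG", "BROKEN", "OPS"}, fakeListPages, collect)
	fmt.Println(len(got), skipped)
}
```

The trade-off is the one the review raised: one failing space does not abort the run, but a run can finish with records missing, so the warning log is the only signal.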
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify the Emit type definition
rg -n "type Emit" --type go
# Check GitHub extractor for warn-and-skip pattern
rg -n "Warn.*extract.*skip\|Warn.*pages\|extractPages" plugins/extractors/github/ -A2 2>/dev/null || true
rg -n "logger.Warn.*skip" plugins/extractors/ --type go -A2 | head -40
Length of output: 2553
@ravisuhag, you're absolutely right — confirmed on both points:
- `plugins.Emit` signature: `plugins/plugin.go` Line 19 defines it as `type Emit func(models.Record)`, with no return value, so there's nothing to propagate.
- Warn-and-skip pattern: The GitHub extractor uses the same pattern in multiple places (e.g., Lines 151, 256, 331, 345, 386, 391, 451 of `plugins/extractors/github/github.go`), making the Confluence extractor's approach consistent with the established convention.
My review comment was incorrect. Sorry for the noise!
✏️ Learnings added
Learnt from: ravisuhag
Repo: raystack/meteor PR: 515
File: plugins/extractors/confluence/confluence.go:91-96
Timestamp: 2026-04-18T22:46:54.415Z
Learning: In the raystack/meteor codebase, `plugins.Emit` is defined as `func(models.Record)` (no return value) in `plugins/plugin.go`. There is no error to propagate from emit calls. The warn-and-skip pattern (`e.logger.Warn(...)` + `continue`) for non-fatal per-item errors in extractors is the established convention, as used in `plugins/extractors/github/github.go` and others.
🧠 Learnings used
Learnt from: CR
Repo: raystack/meteor PR: 0
File: CLAUDE.md:0-0
Timestamp: 2026-04-18T03:36:16.367Z
Learning: Applies to plugins/extractors/**/*.go : Extractors should emit Records containing Entity (urn, type, name, description, source, properties) and Edges (source_urn, target_urn, type, source, properties) for relationships
- Paginate GetPageLabels to capture all labels (not just first page)
- Use time.RFC3339 for proper timezone handling in timestamps
- Tighten exclusion test to assert ENG space exists (not just absence of ARCHIVE)
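The pagination change in this commit can be pictured with the sketch below. It is not the PR's `GetPageLabels`: `labelsPage`, `fetchFunc`, `allLabels`, and `fakeFetch` are hypothetical names, the next-link path is illustrative, and the only behaviour assumed from the review is that each response carries a `_links.next` value holding the next cursor.

```go
package main

import (
	"fmt"
	"net/url"
)

// Label and labelsPage are simplified stand-ins for the client's types.
type Label struct{ Name string }

type labelsPage struct {
	Results []Label
	Next    string // value of _links.next; empty on the last page
}

// fetchFunc retrieves one page of labels for a cursor ("" means first page).
type fetchFunc func(cursor string) (labelsPage, error)

// allLabels follows next links until none is returned, accumulating results.
func allLabels(fetch fetchFunc) ([]Label, error) {
	var all []Label
	cursor := ""
	for {
		page, err := fetch(cursor)
		if err != nil {
			return nil, fmt.Errorf("fetch labels: %w", err)
		}
		all = append(all, page.Results...)
		if page.Next == "" {
			return all, nil
		}
		next, err := url.Parse(page.Next)
		if err != nil {
			return nil, fmt.Errorf("parse next link: %w", err)
		}
		cursor = next.Query().Get("cursor")
		if cursor == "" {
			// A next link without a cursor: stop rather than loop forever.
			return all, nil
		}
	}
}

// fakeFetch serves three labels across two pages (illustrative data only).
func fakeFetch(cursor string) (labelsPage, error) {
	switch cursor {
	case "":
		return labelsPage{
			Results: []Label{{"pii"}, {"finance"}},
			Next:    "/wiki/api/v2/pages/1/labels?cursor=abc&limit=2",
		}, nil
	case "abc":
		return labelsPage{Results: []Label{{"owner-data"}}}, nil
	}
	return labelsPage{}, fmt.Errorf("unknown cursor %q", cursor)
}

func main() {
	labels, err := allLabels(fakeFetch)
	if err != nil {
		panic(err)
	}
	fmt.Println(labels)
}
```

Passing the fetch step in as a function keeps the loop testable without an HTTP server; in the real client that step would be the existing request helper.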
Summary
- Emits `space` and `document` entities with `belongs_to`, `child_of`, `owned_by`, and `documented_by` edges
Details
New files:
- `plugins/extractors/confluence/confluence.go`: Main extractor with Config, Init, Extract
- `plugins/extractors/confluence/client.go`: HTTP client for Confluence REST API v2 (spaces, pages, labels, cursor-based pagination)
- `plugins/extractors/confluence/confluence_test.go`: 6 unit tests covering config validation, extraction, edges, URN detection, exclusion
- `plugins/extractors/confluence/README.md`: Documentation
- `test/e2e/confluence_file/confluence_file_test.go`: End-to-end test with mock server through full pipeline
Entities emitted:
- `space`
- `document`
Edges emitted:
- `belongs_to`
- `child_of`
- `owned_by`
- `documented_by`
Closes #503 (Confluence portion)
Test plan
- `go test -tags plugins ./plugins/extractors/confluence/`
- `go test -tags integration ./test/e2e/confluence_file/`
- `go build ./...` succeeds