Skip to content

Feature RBAC maintenance#152

Draft
bakhterets wants to merge 50 commits into
mainfrom
feature/rbac
Draft

Feature RBAC maintenance#152
bakhterets wants to merge 50 commits into
mainfrom
feature/rbac

Conversation

@bakhterets
Copy link
Copy Markdown
Contributor

@bakhterets bakhterets commented Feb 5, 2026

Implements Role-Based Access Control (RBAC) for maintenance event management as specified in specs/001-maintenance-rbac/spec.md.

Changes:

RBAC Implementation:

  • Three roles with hierarchy: Admin > Operator > Creator
  • Role resolution from JWT groups claim via environment variables (SD_RBAC_GROUP_ADMINS, SD_RBAC_GROUP_OPERATORS, SD_RBAC_GROUP_CREATORS)
  • Middleware for authentication and role-based authorization

Maintenance Workflow:

  • Operator: Create maintenance → pending_review status, can modify/cancel own events while pending
  • Operator: Approve events (pending_reviewreviewed), cancel any pending event, create with planned status
  • Admin: Unrestricted access, bypass all status restrictions
  • Checker auto-transitions reviewedplanned

Field Visibility:

  • creator, contact_email, version fields visible only to authenticated users
  • Events with pending_review or reviewed status hidden from unauthenticated users

Optimistic Locking:

  • Version-based conflict detection for concurrent updates
  • Returns 409 Conflict on version mismatch

Documentation:

  • New docs/auth/rbac.md with roles, permissions, and workflow
  • Updated openapi.yaml with security schemes, RBAC description, 401/403/409 responses
  • Updated spec FR-022-1 to include reviewed status visibility

Testing:

  • Unit tests for RBAC functions, middleware, validation
  • Integration tests with real HMAC JWT authentication
  • Coverage for role permissions, ownership, version conflicts, status transitions

Updates:

  • Go version updated to 1.25
  • golangci-lint updated to v2.8.0

Testing:

# Unit tests:
go test ./internal/... -count=1

# Integration tests (requires Docker):
go test ./tests/... -count=1
Related
- Spec: specs/001-maintenance-rbac/spec.md

@bakhterets bakhterets self-assigned this Feb 10, 2026
@bakhterets bakhterets requested a review from sgmv February 17, 2026 16:23
@bakhterets bakhterets marked this pull request as ready for review February 17, 2026 16:24
@bakhterets
Copy link
Copy Markdown
Contributor Author

make version optionally for incidents and info

@bakhterets bakhterets requested a review from sgmv February 25, 2026 15:42
role operator has been raised to allow to perform all CRUD actions; spec.md revised.
rbac tests redesigned, main tests fixed.
Comment thread internal/api/v2/v2.go
Comment thread internal/api/v2/v2.go Outdated
Comment thread internal/conf/conf.go Outdated
Comment thread internal/api/api.go Outdated
Comment thread internal/api/middleware.go
Comment thread internal/checker/maintenance.go Outdated
@bakhterets
Copy link
Copy Markdown
Contributor Author

bakhterets commented Mar 2, 2026

Multi-IdP Authentication & RBAC Security Hardening (refactoring)

Summary

Implements a "secure by default" architecture by removing all authentication/RBAC bypass toggles and adding dual-IdP support (Keycloak RSA + Local HMAC). Introduces structured audit logging, Keycloak resilience with retry/fallback, JWT audience validation, and comprehensive test coverage.

Breaking Changes

Removed env var Replacement
SD_AUTH_DISABLED Removed — auth is always enforced
SD_RBAC_DISABLED Removed — RBAC is always enforced
SD_PROVIDER_DISABLED Removed — provider is always active

SD_RBAC_GROUP_ADMINS is now mandatory. Application fails to start without it.
SD_SECRET_KEY must be ≥ 32 characters if HMAC provider is used.

Changes by Category

Security

  • Removed bypass toggles: AuthenticationDisabled, RBAC.Disabled, Provider.Disabled — no path exists to skip auth/RBAC
  • Dual-IdP dispatch: parseToken() routes by JWT alg header — RS256 → Keycloak public key, HS* → HMAC secret
  • Minimum secret key length: SD_SECRET_KEY requires ≥ 32 characters, enforced in validateProviders()
  • Config validation: Validate() ensures at least one provider (Keycloak or HMAC) is configured; Admins group is mandatory

Observability

  • Structured audit logging: authAudit() emits SIEM-ready events with fields: event=auth_audit, action, result, idp_type (keycloak | local_hmac), username, reason
  • IdP detection: idpTypeFromMethod() maps JWT signing method to human-readable provider name
  • Auth handler audit: All 5 OAuth handlers (login, callback, token_retrieve, logout, refresh) log structured events with action/result

Middleware Refactoring

  • DRY helper: validateAndSetClaims() extracts shared token parsing, claims extraction, and context population — used by both AuthenticationMW (write) and SetJWTClaims (read)
  • Conditional routes: Keycloak OAuth routes (/auth/*) only registered when Keycloak is configured

Tests (new files)

  • internal/api/auth/auth_test.go — 24 tests: handler coverage, Keycloak JWKS fetch, retry/cache behavior
  • internal/api/auth/storage_test.go — 6 tests: CRUD, overwrite, concurrent access
  • tests/rbac_admin_only_test.go — 10 admin-only RBAC integration subtests

Documentation

  • docs/auth/authentication.md — Dual-IdP architecture, middleware chain, audit logging
  • docs/auth/rbac.md — Removed SD_RBAC_DISABLED references, added resilience/audit sections
  • docs/testing/integration-tests.md — Admin-only tests, unit test coverage section
  • openapi.yaml — Updated security scheme, dual-IdP description, mandatory ADMINS group
  • specs/001-maintenance-rbac/spec.md — Removed RBAC_DISABLED, updated FR-002a and FR-026

Test Coverage

Package Before After
internal/conf 16.7% 74.5%
internal/api ~40% 53.5%
internal/api/auth 0% 78.4%
internal/api/rbac 100% 100%

@bakhterets bakhterets requested a review from sgmv March 3, 2026 20:38
@sgmv sgmv marked this pull request as draft March 5, 2026 13:19
@bakhterets
Copy link
Copy Markdown
Contributor Author

fix(checker): handle optimistic locking race condition in maintenance processing

Problem

The background checker goroutine (runs every 2 minutes) reads maintenance incidents from the database and updates their status based on time progression (plannedin_progresscompleted).

When ModifyIncident uses optimistic locking (WHERE version = ?), a race condition occurs:

  1. Checker reads incident with version=N
  2. A concurrent API request (user action) updates the same incident, incrementing version to N+1
  3. Checker attempts UPDATE ... WHERE version=N → 0 rows affected → ErrVersionConflict
  4. Previously: the entire checker cycle aborted, leaving all remaining maintenances unprocessed

Solution

Introduced retry-with-refetch pattern in a new processMaintenance() method:

  • On ErrVersionConflict, re-reads the incident from DB (fresh version) and retries (up to 3 attempts)
  • Errors on individual maintenances no longer abort the entire cycle — logged and skipped
  • No changes to ModifyIncident, RBAC, or the optimistic locking mechanism itself

Behavior Change

Scenario Before After
Version conflict Entire checker cycle fails Retry up to 3x with fresh data
One maintenance fails All remaining skipped Others still processed
No conflict 1 SELECT + N UPDATEs Same (no overhead)
Conflict occurs 1 SELECT + 1 failed UPDATE +1 SELECT per retry attempt

Code Changes

internal/checker/maintenance.go

  • Extracted processMaintenance() — retry loop with re-fetch on ErrVersionConflict
  • Extracted evaluateAndFixMntStatus() — calculates actual status and backfills missing status entries
  • Extracted trackActiveMaintenance() — appends to active maintenance slice for watermark tracking
  • Fixed linter issues: variable shadowing (govet), integer range loop (intrange), cognitive complexity (gocognit)

Execution Flow

CheckMaintenance()
 ├── db.GetMaintenances(lastMntID)              [SQL SELECT + Preload]
 └── for each maintenance:
      └── processMaintenance(mn)
           ├── attempt=0:
           │    ├── evaluateAndFixMntStatus(mn)  [in-memory]
           │    └── db.ModifyIncident(mn)        [SQL TX: UPDATE WHERE version=N]
           │         └── ErrVersionConflict? → continue
           └── attempt=1:
                ├── db.GetIncident(mn.ID)        [SQL SELECT — fresh version]
                ├── evaluateAndFixMntStatus(mn)  [recalculate]
                └── db.ModifyIncident(mn)        [SQL TX: UPDATE WHERE version=N+1] ✓

@sgmv
Copy link
Copy Markdown
Contributor

sgmv commented May 22, 2026

fix(checker): handle optimistic locking race condition in maintenance processing

You can do it simpler. Because we will change this table, I won't use any db schemas changes. And you can just get the maintenance right before the patch and compare.

  The diff (against the pre-fix file bf9e9c7^:internal/checker/maintenance.go)

   --- a/internal/checker/maintenance.go
   +++ b/internal/checker/maintenance.go
   @@
          var activeMaintenances []uint
          for _, mn := range maintenances {
                  // Draft maintenances are not processed by the checker — they await
                  // manual approval (reviewed) or rejection (cancelled) via the API.
                  if mn.Status == event.MaintenancePendingReview {
                          continue
                  }

   +              // Refetch immediately before the read-modify-write. The bulk
   +              // GetMaintenances above is N items old by the time we reach item N;
   +              // using its preloaded state for the version check races concurrent
   +              // API edits. A single fresh read shrinks the race window from
   +              // "duration of the whole tick" to "one DB round-trip", which makes
   +              // ErrVersionConflict effectively unreachable without a retry loop.
   +              mn, err = ch.db.GetIncident(int(mn.ID))
   +              if err != nil {
   +                      ch.log.Error("failed to refetch maintenance, skipping",
   +                              zap.Uint("mntID", mn.ID), zap.Error(err))
   +                      continue
   +              }
   +
                     if mn.Status != actualStatus {
                          mn.Status = actualStatus
                          err = ch.db.ModifyIncident(mn)
                          if err != nil {
   -                              return err
   +                              ch.log.Error("failed to update maintenance, skipping",
   +                                      zap.Uint("mntID", mn.ID), zap.Error(err))
   +                              continue
                          }
                  }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants