Improvement/s3 utils 233/script to manually repair md primary backup from secondary #386

Closed
fredmnl wants to merge 3 commits into development/1 from improvement/S3UTILS-233/script-to-manually-repair-md-primary-backup-from-secondary

Conversation

@fredmnl

@fredmnl fredmnl commented Apr 28, 2026

No description provided.

@bert-e
Contributor

bert-e commented Apr 28, 2026

Hello fredmnl,

My role is to assist you with the merge of this
pull request. Please type @bert-e help to get information
on this process, or consult the user documentation.

Available options
name description privileged authored
/after_pull_request Wait for the given pull request id to be merged before continuing with the current one.
/bypass_author_approval Bypass the pull request author's approval
/bypass_build_status Bypass the build and test status
/bypass_commit_size Bypass the check on the size of the changeset TBA
/bypass_incompatible_branch Bypass the check on the source branch prefix
/bypass_jira_check Bypass the Jira issue check
/bypass_peer_approval Bypass the pull request peers' approval
/bypass_leader_approval Bypass the pull request leaders' approval
/approve Instruct Bert-E that the author has approved the pull request. ✍️
/create_pull_requests Allow the creation of integration pull requests.
/create_integration_branches Allow the creation of integration branches.
/no_octopus Prevent Wall-E from doing any octopus merge and use multiple consecutive merge instead
/unanimity Change review acceptance criteria from one reviewer at least to all reviewers
/wait Instruct Bert-E not to run until further notice.
Available commands
name description privileged
/help Print Bert-E's manual in the pull request.
/status Print Bert-E's current status in the pull request TBA
/clear Remove all comments from Bert-E from the history TBA
/retry Re-start a fresh build TBA
/build Re-start a fresh build TBA
/force_reset Delete integration branches & pull requests, and restart merge process from the beginning.
/reset Try to remove integration branches unless there are commits on them which do not appear on the source branch.

Status report is not available.

@bert-e
Contributor

bert-e commented Apr 28, 2026

Incorrect fix version

The Fix Version/s in issue S3UTILS-233 contains:

  • 1.17.8

Considering where you are trying to merge, I ignored possible hotfix versions and I expected to find:

  • 1.18.0

Please check the Fix Version/s of S3UTILS-233, or the target
branch of this pull request.

Comment thread BackupRepair/repair.go
expectedBseq = bseq + 1
}

log.Printf(" scanned copy 0: %d entries (max bseq %d)", primary.count, primary.maxBseq)

If copy 0 is completely empty (e.g. a full primary failure), the iterator returns nothing, expectedBseq stays at minBseq, and the function reports zero inconsistencies. This silently masks the worst-case scenario.

Consider adding a check after the loop: if primary.count == 0, query at least one secondary to see if it has entries, and warn or return an error.
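
A minimal sketch of such a check, assuming a per-copy scan result that carries `count` and `maxBseq` as in the log line above. The `copyStats` type, the `checkEmptyPrimary` helper, and the way secondary results are gathered are hypothetical and not taken from the PR:

```go
package main

import (
	"errors"
	"fmt"
)

// copyStats is a hypothetical stand-in for the per-copy scan result; only the
// count and maxBseq field names come from the quoted log line.
type copyStats struct {
	count   int
	maxBseq int
}

var errEmptyPrimary = errors.New("copy 0 is empty but a secondary copy has entries")

// checkEmptyPrimary returns an error when the primary scan found nothing
// while at least one secondary copy still holds entries.
func checkEmptyPrimary(primary copyStats, secondaries []copyStats) error {
	if primary.count > 0 {
		return nil
	}
	for i, s := range secondaries {
		if s.count > 0 {
			return fmt.Errorf("%w: copy %d has %d entries (max bseq %d)",
				errEmptyPrimary, i+1, s.count, s.maxBseq)
		}
	}
	return nil // every copy is empty: nothing to repair from
}

func main() {
	// Simulated worst case: primary lost, first secondary intact.
	err := checkEmptyPrimary(copyStats{}, []copyStats{{count: 3, maxBseq: 12}})
	fmt.Println(err)
}
```

Returning an error rather than only logging keeps the caller from reporting "zero inconsistencies" on this path.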

— Claude Code

Comment thread BackupRepair/sproxyd.go
endpoint: endpoint,
path: path,
client: &http.Client{
Timeout: 5 * time.Minute,

http.Client.Timeout covers the entire request lifecycle including body streaming. Since Get() returns resp.Body and the caller streams it into Put(), the 5-minute clock from the GET is still ticking during the copy. For large backup segments this could fire mid-transfer, leaving a partial object at the destination key.

Consider using per-request context timeouts for the initial response only (e.g. via http.NewRequestWithContext) and letting body streaming run without a hard cap, or at least bumping this significantly and documenting the maximum supported segment size.
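
One way this could look, as a sketch only: a cancellable context is armed with a timer while waiting for headers, and the timer is disarmed once `Do` returns so that body streaming has no deadline. `getWithHeaderTimeout`, `cancelOnClose`, and `fetch` are hypothetical names, and the local `httptest` server exists only to exercise the helper:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

var errHeaderTimeout = errors.New("timed out waiting for response headers")

// cancelOnClose releases the request context once the caller closes the body.
type cancelOnClose struct {
	io.ReadCloser
	cancel context.CancelFunc
}

func (c *cancelOnClose) Close() error {
	err := c.ReadCloser.Close()
	c.cancel()
	return err
}

// getWithHeaderTimeout (hypothetical) gives up if the response headers do not
// arrive within d, but sets no deadline on reading the body afterwards.
// The client passed in should not set http.Client.Timeout.
func getWithHeaderTimeout(client *http.Client, url string, d time.Duration) (*http.Response, error) {
	ctx, cancel := context.WithCancel(context.Background())
	timer := time.AfterFunc(d, cancel) // cancels the request if headers are late
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		timer.Stop()
		cancel()
		return nil, err
	}
	resp, err := client.Do(req)
	if err != nil {
		fired := !timer.Stop()
		cancel()
		if fired {
			return nil, fmt.Errorf("GET %s: %w", url, errHeaderTimeout)
		}
		return nil, err
	}
	if !timer.Stop() {
		// The timer fired just as the headers arrived; the context is cancelled.
		resp.Body.Close()
		cancel()
		return nil, fmt.Errorf("GET %s: %w", url, errHeaderTimeout)
	}
	resp.Body = &cancelOnClose{ReadCloser: resp.Body, cancel: cancel}
	return resp, nil
}

// fetch runs one request against a throwaway local server that waits
// headerDelay before sending headers and bodyDelay before sending the body.
func fetch(headerDelay, bodyDelay, timeout time.Duration) (int, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		time.Sleep(headerDelay)
		w.WriteHeader(http.StatusOK)
		w.(http.Flusher).Flush()
		time.Sleep(bodyDelay)
		io.WriteString(w, "segment")
	}))
	defer srv.Close()
	resp, err := getWithHeaderTimeout(srv.Client(), srv.URL, timeout)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return len(body), err
}

func main() {
	n, err := fetch(0, 300*time.Millisecond, 100*time.Millisecond)
	fmt.Println("slow body:", n, err)
	_, err = fetch(300*time.Millisecond, 0, 100*time.Millisecond)
	fmt.Println("slow headers:", err)
}
```

The standard library's `http.Transport.ResponseHeaderTimeout` field covers the same header-wait window and may be simpler if the client's transport can be configured directly.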

— Claude Code

@claude

claude Bot commented Apr 28, 2026

  • findInconsistentBseqs does not detect a completely empty primary (copy 0). If the primary index has zero entries, the tool reports no inconsistencies even if secondaries have data, silently masking the worst-case failure.
    • Add a post-scan check: if primary.count == 0, probe secondaries and warn or error.
  • http.Client{Timeout: 5 * time.Minute} on the sproxyd client covers body streaming, not just the initial response. The GET body is streamed into the PUT, so large backup segments risk timing out mid-transfer and leaving a partial object.
    • Use per-request context timeouts for the response headers and let body streaming run uncapped, or increase the timeout and document the size limit.

Review by Claude Code

@codecov

codecov Bot commented Apr 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 44.12%. Comparing base (8528e62) to head (4015265).

Additional details and impacted files
@@              Coverage Diff               @@
##           development/1     #386   +/-   ##
==============================================
  Coverage          44.12%   44.12%           
==============================================
  Files                 88       88           
  Lines               6448     6448           
  Branches            1348     1348           
==============================================
  Hits                2845     2845           
  Misses              3555     3555           
  Partials              48       48           


@fredmnl fredmnl force-pushed the improvement/S3UTILS-233/script-to-manually-repair-md-primary-backup-from-secondary branch from 7f04f2c to 4015265 on April 28, 2026, 13:42
Comment thread BackupRepair/main.go Outdated
stdinReader.ReadByte()
}
fmt.Fprint(os.Stderr, "press enter to continue (or ctrl-c to abort)... ")
stdinReader.ReadString('\n')

confirmStep silently proceeds when stdin is closed (e.g. cron job, Docker without TTY). ReadString returns io.EOF which is discarded, so the tool runs all destructive repair steps without real user confirmation. Check the error and abort on EOF.

Suggested change
- stdinReader.ReadString('\n')
+ if _, err := stdinReader.ReadString('\n'); err != nil {
+ 	log.Fatalf("stdin closed unexpectedly (non-interactive environment?): %v — use -y to skip prompts", err)
+ }

— Claude Code

Comment thread BackupRepair/repair.go
return nil, err
}
if !ok {
break

When maxBseq is set and the primary iterator is exhausted before reaching maxBseq, bseqs between the primary's last entry and maxBseq are silently skipped. The existing MaxBseq and MinAndMaxBseq tests only pass because the primary fixture includes bseq=8 beyond maxBseq, acting as a sentinel. If the primary's last entry were below maxBseq with no entries beyond it, those trailing gaps would be missed.

Consider scanning expectedBseq..maxBseq after the loop exits when maxBseq > 0.
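
A simplified sketch of the trailing scan. It assumes the primary index can be reduced to a sorted slice of bseqs and that maxBseq == 0 means "no upper bound"; `findMissingBseqs` is a hypothetical stand-in, not the PR's findInconsistentBseqs:

```go
package main

import "fmt"

// findMissingBseqs is a hypothetical, simplified stand-in for the primary
// scan. primary holds the bseqs present in copy 0, sorted ascending.
// maxBseq == 0 is assumed to mean "no upper bound".
func findMissingBseqs(primary []int, minBseq, maxBseq int) []int {
	var missing []int
	expectedBseq := minBseq
	for _, bseq := range primary {
		if bseq < expectedBseq {
			continue // below the requested range, or already accounted for
		}
		if maxBseq > 0 && bseq > maxBseq {
			break
		}
		for b := expectedBseq; b < bseq; b++ {
			missing = append(missing, b)
		}
		expectedBseq = bseq + 1
	}
	// Trailing scan: covers the case where the iterator is exhausted (or has
	// jumped past maxBseq) before every bseq up to maxBseq was checked.
	if maxBseq > 0 {
		for b := expectedBseq; b <= maxBseq; b++ {
			missing = append(missing, b)
		}
	}
	return missing
}

func main() {
	// No sentinel entry beyond maxBseq: bseq 6 is still reported.
	fmt.Println(findMissingBseqs([]int{1, 2, 5}, 1, 6)) // [3 4 6]
}
```

With the trailing scan in place, the result no longer depends on whether the fixture happens to contain an entry beyond maxBseq.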

— Claude Code

Comment thread BackupRepair/sproxyd.go
req.ContentLength = info.ContentLength
if info.UserMD != "" {
req.Header.Set("X-Scal-Usermd", info.UserMD)
}

Unlike Get and getBackupIndexPage, Put discards the response body before checking the status code, losing diagnostic info from the server on failure.

Suggested change
- }
+ }
+ body, _ := io.ReadAll(resp.Body)
+ if resp.StatusCode != http.StatusOK {
+ 	return fmt.Errorf("PUT %s returned %d: %s", url, resp.StatusCode, string(body))
+ }

— Claude Code

@claude

claude Bot commented Apr 28, 2026

  • confirmStep silently proceeds when stdin is closed (EOF), bypassing confirmation prompts in non-interactive environments without -y
    • Check the error from ReadString and abort on EOF
  • findInconsistentBseqs misses trailing gaps when maxBseq is set and the primary iterator is exhausted before reaching maxBseq
    • Scan expectedBseq..maxBseq after the loop exits when maxBseq > 0
    • Add a test where primary's last entry is below maxBseq with no sentinel beyond it
  • SproxydClient.Put discards the response body before checking the status code, unlike Get and getBackupIndexPage
    • Read the body and include it in the error message for consistency

Review by Claude Code

@fredmnl fredmnl force-pushed the improvement/S3UTILS-233/script-to-manually-repair-md-primary-backup-from-secondary branch from 4015265 to bd328ba on April 28, 2026, 15:17
@fredmnl fredmnl closed this Apr 28, 2026
Comment thread BackupRepair/admin.go

const maxStallPolls = 10

func (a *AdminClient) WaitForReindex(pollInterval time.Duration) error {

WaitForReindex has no overall wall-clock timeout. The stall detector only fires when ProcessingBseq stops advancing, so a reindex that advances by 1 bseq every 30 seconds will keep this loop alive indefinitely. Consider adding a maxDuration parameter (e.g. time.After) so the script cannot block forever in production.
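
A rough sketch of how the cap could sit alongside the stall detector. The `poll` callback, the error values, and the exact stall-counting rule are assumptions made for illustration; only maxStallPolls, pollInterval, and ProcessingBseq come from the code under review:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const maxStallPolls = 10

var (
	errStalled  = errors.New("reindex stalled")
	errTimedOut = errors.New("reindex exceeded maximum duration")
)

// waitForReindex is a simplified stand-in for AdminClient.WaitForReindex.
// poll is a hypothetical callback returning the current ProcessingBseq and
// whether the reindex has finished.
func waitForReindex(poll func() (bseq int, done bool, err error), pollInterval, maxDuration time.Duration) error {
	deadline := time.After(maxDuration) // overall wall-clock cap
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()

	lastBseq, stalls := -1, 0
	for {
		bseq, done, err := poll()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if bseq == lastBseq {
			stalls++
			if stalls >= maxStallPolls {
				return errStalled
			}
		} else {
			lastBseq, stalls = bseq, 0
		}
		select {
		case <-deadline:
			return errTimedOut
		case <-ticker.C:
		}
	}
}

func main() {
	bseq := 0
	slowButSteady := func() (int, bool, error) {
		bseq++ // always advances, so the stall detector never fires
		return bseq, false, nil
	}
	err := waitForReindex(slowButSteady, 5*time.Millisecond, 100*time.Millisecond)
	fmt.Println(err)
}
```

A `context.Context` with a deadline would serve equally well and would also let the caller cancel the wait early.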

— Claude Code

Comment thread BackupRepair/admin.go
return fmt.Errorf("POST %s: %w", url, err)
}
defer resp.Body.Close()
body, _ := io.ReadAll(resp.Body)

io.ReadAll error is silently discarded. If the read fails, the error message on line 172 will show an empty/partial body, hiding the actual server response.
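
For illustration, a sketch that surfaces the read failure separately from a non-200 status. `checkResponse`, `partialReader`, and the URL are hypothetical; the real TriggerReindex works on an *http.Response:

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
)

// partialReader is a test double: it yields data, then either fails (as a
// dropped connection would) or ends cleanly with io.EOF.
type partialReader struct {
	data string
	fail bool
	off  int
}

func (r *partialReader) Read(p []byte) (int, error) {
	if r.off < len(r.data) {
		n := copy(p, r.data[r.off:])
		r.off += n
		return n, nil
	}
	if r.fail {
		return 0, errors.New("connection reset by peer")
	}
	return 0, io.EOF
}

// checkResponse (hypothetical) reports a body read failure as its own error
// instead of formatting a truncated body into the status error.
func checkResponse(method, url string, status int, body io.Reader) error {
	data, err := io.ReadAll(body)
	if err != nil {
		return fmt.Errorf("%s %s: reading response body (status %d): %w", method, url, status, err)
	}
	if status != http.StatusOK {
		return fmt.Errorf("%s %s returned %d: %s", method, url, status, data)
	}
	return nil
}

func main() {
	const url = "http://admin.example/reindex" // placeholder URL
	fmt.Println(checkResponse("POST", url, 500, &partialReader{data: "boom"}))
	fmt.Println(checkResponse("POST", url, 500, &partialReader{data: "bo", fail: true}))
}
```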

— Claude Code


// referenceKey reimplements the Node.js computeKey logic as a direct translation,
// to cross-check against the KeyGenerator implementation.
func referenceKey(raftSessionID string, installID int, bseq int, copyId int) string {

The test helper parameter is named raftSessionID but TestGenerateKeyWithBackupID passes a full backupID (bucket/7). Since the production code hashes cluster/raftSessionId via Config.BackupID(), the parameter name here is misleading and could mask a mismatch if someone modifies the hash input. Rename to backupID to match NewKeyGenerator.

— Claude Code

@claude

claude Bot commented Apr 28, 2026

  • WaitForReindex (admin.go:201) can loop indefinitely if the reindex makes slow but steady progress, since the stall detector only fires when ProcessingBseq stops advancing. Add an overall wall-clock timeout to prevent the script from blocking forever in production.
    • Use time.After or context.WithTimeout to cap total wait time
  • TriggerReindex (admin.go:169) silently discards the io.ReadAll error, which can produce a misleading empty body in the error message on the next line.
    • Check and return the error from io.ReadAll
  • referenceKey test helper (keygen_test.go:13) parameter is named raftSessionID but receives a full backupID in TestGenerateKeyWithBackupID. Rename to backupID to match NewKeyGenerator and avoid confusion about what gets hashed.

Review by Claude Code


2 participants