Improve leaderless tablet triage and unsafe config change tooling #202

Open
pgyogesh wants to merge 3 commits into main from tablet-report-parser-leaderless-views
@pgyogesh pgyogesh commented May 26, 2026

Contributor

Summary

  • Bump tablet-report-parser.py to v0.60 and make the script executable.
  • Replace legacy delete_leaderless_be_careful and UNSAFE_Leader_create views with two structured views for leaderless tablet triage:
    • unsafe_config_change_commands: One row per leaderless tablet with table FQN, replica counts, max-CTID (OpID) peer selection, all peer UUIDs, and a cmd_to_run using /home/yugabyte/tserver/bin/yb-ts-cli.
    • leaderless_peer_ctid_rank: One row per leaderless tablet with per-peer CTID ranking, sync indicators (ctid_in_sync, spreads), lease status, and an aggregated all_peer_ctids column for quick review.
  • Add generate-unsafe-config-change.py: Interactive helper that reads unsafe_config_change_commands from a TabletReport SQLite DB and writes a reviewable bash script (unsafe_config_change-run.sh) with readiness prompts, optional TLS certs, and sleep between commands.
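For reviewers who want to inspect the new view directly, here is a minimal sketch of reading `cmd_to_run` from `unsafe_config_change_commands` with Python's `sqlite3` module. Only the view name and the `cmd_to_run` column come from this PR; the helper name `fetch_commands` and the in-memory stand-in table are assumptions for illustration, not the parser's actual schema.

```python
import sqlite3


def fetch_commands(conn):
    """Return the cmd_to_run value from every row of the view."""
    cur = conn.execute("SELECT cmd_to_run FROM unsafe_config_change_commands")
    return [row[0] for row in cur.fetchall()]


if __name__ == "__main__":
    # Stand-in database (assumption): a real run would open the .sqlite file
    # produced by tablet-report-parser.py instead of an in-memory table.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE unsafe_config_change_commands (cmd_to_run TEXT)")
    conn.execute(
        "INSERT INTO unsafe_config_change_commands VALUES (?)",
        ("echo placeholder-command",),
    )
    for cmd in fetch_commands(conn):
        print(cmd)
```

Against a real report, `sqlite3.connect("TabletReport.sqlite")` would replace the in-memory connection.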

Test plan

  • Run python3 tablet-report-parser/tablet-report-parser.py <support-bundle/YBA/TabletReport> against a known support bundle.
  • Open the generated .sqlite DB and query unsafe_config_change_commands and leaderless_peer_ctid_rank for leaderless tablets.
  • Confirm cmd_to_run includes all RUNNING peer UUIDs (not just one peer) and uses the max-CTID tserver.
  • Confirm leaderless_peer_ctid_rank ranks peers by (cterm, cidx) and flags out-of-sync CTIDs.
  • Run python3 tablet-report-parser/generate-unsafe-config-change.py TabletReport.sqlite, review the generated script, and verify commands match the DB view.
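The ranking check in the test plan can be illustrated with a small sketch. It assumes a CTID is a (cterm, cidx) pair compared term first, then index, as the test plan states. The `Peer` type and both function names are hypothetical, and the view's real SQL may compute `ctid_in_sync` differently.

```python
from typing import NamedTuple


class Peer(NamedTuple):
    uuid: str
    cterm: int  # term of the peer's latest OpID
    cidx: int   # index of the peer's latest OpID


def rank_peers(peers):
    """Order peers from highest to lowest CTID.

    Tuple comparison checks cterm first and only falls back to cidx on a tie,
    so a higher term always outranks a higher index in a lower term.
    """
    return sorted(peers, key=lambda p: (p.cterm, p.cidx), reverse=True)


def ctid_in_sync(peers):
    """True when every peer reports the same (cterm, cidx) pair."""
    return len({(p.cterm, p.cidx) for p in peers}) == 1


peers = [Peer("a", 3, 100), Peer("b", 3, 120), Peer("c", 2, 500)]
ranked = rank_peers(peers)
print([p.uuid for p in ranked])  # ['b', 'a', 'c']
print(ctid_in_sync(peers))       # False
```

Peer "c" ranks last despite having the largest index because its term is lower, which is the case worth checking when validating the view.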

Replace legacy delete/UNSAFE_Leader views with structured SQL views that
rank peers by CTID, include all peer UUIDs in generated yb-ts-cli commands,
and expose CTID sync diagnostics for leaderless tablets.

Co-authored-by: Cursor <cursoragent@cursor.com>
@pgyogesh pgyogesh changed the title from "Improve leaderless tablet views in tablet-report-parser" to "Improve leaderless tablet views in tablet-report-parser based on Max OpID" May 26, 2026
@pgyogesh pgyogesh requested a review from eugeneckim May 26, 2026 06:22
Bump parser to v0.60 with a standard yb-ts-cli path in cmd_to_run, and add
generate-unsafe-config-change.py to build a reviewable bash script from the DB.

Co-authored-by: Cursor <cursoragent@cursor.com>
@pgyogesh pgyogesh changed the title from "Improve leaderless tablet views in tablet-report-parser based on Max OpID" to "Improve leaderless tablet triage and unsafe config change tooling" May 26, 2026
sleep_secs = int(sleep_s)

default_out = db_path.parent / "unsafe_config_change-run.sh"
out_path = Path(t.ask("Output script path", str(default_out))).expanduser().resolve()

Contributor

I tested this and answered "." at this prompt, and it errored out saying that it's a directory, not a file. As a user, I saw "script path" and thought it meant the directory path. Could we make this clearer, perhaps by asking for the full path and script name?

Contributor Author

@eugeneckim At which step did you get this error?

Contributor

At this precise prompt. I just entered "." as the path. The tool expects a file when it says "path", but I assumed it meant a directory location.
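One possible way to handle the directory case raised in this thread is sketched below. It is not what the PR implements: the helper name `resolve_output_path` is hypothetical, and the fallback behavior (appending the default file name when the answer is an existing directory) is an assumption about what a user entering "." would expect.

```python
from pathlib import Path

# Default file name taken from the PR description.
DEFAULT_NAME = "unsafe_config_change-run.sh"


def resolve_output_path(answer: str, default_name: str = DEFAULT_NAME) -> Path:
    """Turn the prompt answer into a file path.

    If the answer names an existing directory (for example "."), place the
    default script name inside it instead of failing later on write.
    """
    path = Path(answer).expanduser().resolve()
    if path.is_dir():
        path = path / default_name
    return path


print(resolve_output_path(".").name)  # unsafe_config_change-run.sh
```

Rewording the prompt to ask for the "output script file (full path and name)" would complement this fallback.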

@eugeneckim eugeneckim left a comment

Contributor

A couple things:

  • Curious: Did we go with a separate script for generate-unsafe-config-change.py because it's a risky operation and we don't want to have it baked into the report parser?
  • Could we provide documentation in the README, or perhaps separately on an end-to-end run of this? Similar to the test plan you had outlined.
  • Possibly consider providing example files for someone if they want to do a quick test. However, this might need to be internal (e.g. shastra) because it may have internal data.

Co-authored-by: Cursor <cursoragent@cursor.com>
pgyogesh commented Jun 2, 2026

Contributor Author

A couple things:

  • Curious: Did we go with a separate script for generate-unsafe-config-change.py because it's a risky operation and we don't want to have it baked into the report parser?

generate-unsafe-config-change.py is a separate, interactive tool because every deployment differs:

  • yb-ts-cli path: the install directory varies (/home/yugabyte/..., custom images, non-standard layouts).
  • TLS: -certs_dir_name is optional, and the cert directory location differs per customer.
  • Timing: the sleep between commands is configurable at generation time.
  • Safety: the tool shows readiness prompts before writing a script, and you review the file before running it with bash on a live cluster.

  • Could we provide documentation in the README, or perhaps separately, on an end-to-end run of this? Similar to the test plan you had outlined.

Updated README.

  • Possibly consider providing example files for someone if they want to do a quick test. However, this might need to be internal (e.g. shastra) because it may have internal data.

Will do it in shastra.
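To make the "reviewable script with a sleep between commands" idea concrete, here is a rough sketch of how such a file could be assembled from a list of command strings. It is an illustration under assumptions, not the code in generate-unsafe-config-change.py: `build_script` is a hypothetical name, the commands are treated as opaque strings (as if read from `cmd_to_run`), and the `set -e` line is an added choice so the script stops at the first failing command.

```python
def build_script(commands, sleep_secs=5):
    """Return bash script text that runs each command with a pause in between."""
    lines = [
        "#!/usr/bin/env bash",
        "# Review every command before running this on a live cluster.",
        "set -e",  # assumption: abort on the first failing command
    ]
    for i, cmd in enumerate(commands):
        lines.append(cmd)
        # Pause between commands, but not after the last one.
        if i < len(commands) - 1:
            lines.append(f"sleep {sleep_secs}")
    return "\n".join(lines) + "\n"


script = build_script(["echo one", "echo two"], sleep_secs=3)
print(script, end="")
```

Writing the result to a file rather than executing it keeps a human review step between generation and any change on the cluster.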

kapil-yb commented Jun 2, 2026

@pgyogesh

Are multiple peer UUIDs being passed now? The old view passed only one node_uuid to unsafe_config_change. If so, why is passing all RUNNING peers safer, and when would we actually need this?

@eugeneckim eugeneckim self-requested a review June 2, 2026 13:40