Improve leaderless tablet triage and unsafe config change tooling #202
pgyogesh wants to merge 3 commits into
Conversation
Replace legacy delete/UNSAFE_Leader views with structured SQL views that rank peers by CTID, include all peer UUIDs in generated yb-ts-cli commands, and expose CTID sync diagnostics for leaderless tablets. Co-authored-by: Cursor <cursoragent@cursor.com>
Bump parser to v0.60 with a standard yb-ts-cli path in cmd_to_run, and add generate-unsafe-config-change.py to build a reviewable bash script from the DB. Co-authored-by: Cursor <cursoragent@cursor.com>
    sleep_secs = int(sleep_s)

    default_out = db_path.parent / "unsafe_config_change-run.sh"
    out_path = Path(t.ask("Output script path", str(default_out))).expanduser().resolve()
I tested this and answered "." at the prompt, and it errored out saying that's a directory, not a file. As a user, I saw "script path" and took it to mean the directory. Could we make this clearer, perhaps by asking for the full path including the script name?
@eugeneckim At which step did you get this error?
At this exact prompt. I entered "." as the path. The tool expects a file, but when it said "path" I assumed it wanted a directory location.
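One way to handle the ambiguity raised in this thread, sketched under assumptions: treat a directory answer as "write the default file name here" instead of failing. `resolve_output_path` is a hypothetical helper and not code from this PR; only the default name `unsafe_config_change-run.sh` comes from the diff above.

```python
from pathlib import Path

# Default script name, taken from the diff under review.
DEFAULT_SCRIPT_NAME = "unsafe_config_change-run.sh"


def resolve_output_path(answer: str, default_name: str = DEFAULT_SCRIPT_NAME) -> Path:
    """Hypothetical helper: accept either a file path or a directory.

    If the answer names an existing directory (for example "."), or ends
    with a path separator, append the default script name rather than
    failing later when the file is opened for writing.
    """
    raw = answer.strip()
    path = Path(raw).expanduser().resolve()
    if path.is_dir() or raw.endswith(("/", "\\")):
        return path / default_name
    return path
```

With this, answering "." would resolve to the default script name inside the current directory, while a full file path is used unchanged. Rewording the prompt itself (for example "Output script file (full path)") would still help.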
eugeneckim left a comment
A couple things:
- Curious: did we go with a separate script for generate-unsafe-config-change.py because it's a risky operation and we don't want it baked into the report parser?
- Could we document an end-to-end run of this in the README, or perhaps separately? Something similar to the test plan you outlined.
- Consider providing example files for anyone who wants to run a quick test. These might need to stay internal (e.g. shastra), though, since they may contain internal data.
Co-authored-by: Cursor <cursoragent@cursor.com>
generate-unsafe-config-change.py is a separate, interactive tool because every deployment differs:
Updated README.
Will do it in shastra.
Are multiple peer UUIDs being passed now? The old view passed only one node_uuid to unsafe_config_change. If so, why is passing all RUNNING peers safer, and when would we actually need this?
Summary
- Bump tablet-report-parser.py to v0.60 and make the script executable.
- Replace the delete_leaderless_be_careful and UNSAFE_Leader_create views with two structured views for leaderless tablet triage:
  - unsafe_config_change_commands: One row per leaderless tablet with table FQN, replica counts, max-CTID (OpID) peer selection, all peer UUIDs, and a cmd_to_run using /home/yugabyte/tserver/bin/yb-ts-cli.
  - leaderless_peer_ctid_rank: One row per leaderless tablet with per-peer CTID ranking, sync indicators (ctid_in_sync, spreads), lease status, and an aggregated all_peer_ctids column for quick review.
- Add generate-unsafe-config-change.py: Interactive helper that reads unsafe_config_change_commands from a TabletReport SQLite DB and writes a reviewable bash script (unsafe_config_change-run.sh) with readiness prompts, optional TLS certs, and sleep between commands.

Test plan
- Run python3 tablet-report-parser/tablet-report-parser.py <support-bundle/YBA/TabletReport> against a known support bundle.
- Open the generated .sqlite DB and query unsafe_config_change_commands and leaderless_peer_ctid_rank for leaderless tablets.
- Confirm cmd_to_run includes all RUNNING peer UUIDs (not just one peer) and uses the max-CTID tserver.
- Confirm leaderless_peer_ctid_rank ranks peers by (cterm, cidx) and flags out-of-sync CTIDs.
- Run python3 tablet-report-parser/generate-unsafe-config-change.py TabletReport.sqlite, review the generated script, and verify commands match the DB view.
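For reviewers checking the ranking by hand, the selection logic described above can be sketched in Python. This is an illustration under assumptions, not the SQL in the views: the `Peer` fields, the `host_port` column, the "highest CTID first" ordering of UUIDs, and the exact argument layout of the generated command are all hypothetical and should be checked against the actual view output and the yb-ts-cli help text.

```python
from dataclasses import dataclass

# yb-ts-cli path named in the PR summary.
YB_TS_CLI = "/home/yugabyte/tserver/bin/yb-ts-cli"


@dataclass(frozen=True)
class Peer:
    """Hypothetical row shape; the real view columns may differ."""
    node_uuid: str
    host_port: str  # tserver RPC address (assumed column)
    state: str      # replica state, e.g. "RUNNING"
    cterm: int      # term part of the CTID (OpID)
    cidx: int       # index part of the CTID (OpID)


def rank_peers(peers):
    """Return RUNNING peers ordered by (cterm, cidx), highest first."""
    running = [p for p in peers if p.state == "RUNNING"]
    return sorted(running, key=lambda p: (p.cterm, p.cidx), reverse=True)


def ctid_in_sync(peers):
    """One plausible reading of ctid_in_sync: all RUNNING peers share a CTID."""
    return len({(p.cterm, p.cidx) for p in rank_peers(peers)}) <= 1


def build_cmd(tablet_id, peers):
    """Build a command aimed at the max-CTID tserver, listing all RUNNING peers."""
    ranked = rank_peers(peers)
    if not ranked:
        raise ValueError(f"no RUNNING peers for tablet {tablet_id}")
    target = ranked[0]
    uuids = " ".join(p.node_uuid for p in ranked)
    return (
        f"{YB_TS_CLI} --server_address={target.host_port} "
        f"unsafe_config_change {tablet_id} {uuids}"
    )
```

Comparing the output of a sketch like this with cmd_to_run for a few tablets is a quick way to confirm that the max-CTID peer and the full RUNNING peer list are what the view actually emits.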