--- Day 1 (Monday, 2026-05-12) ---

marblesoup 09:14 ok story time. friday night, prod deploy, trout pushes a bad config bundle, half our services start crashlooping. took us 40 min to manually unwind it because nobody had a clean snapshot of the previous good state. just configmaps and a helm release pinned to the wrong chart version. brutal.

marblesoup 09:15 so. hot take. trout should ship a snapshot + rollback story. one command before deploy, one command to undo. why is this not a thing yet

👍 7 🔥 3

dr_taco 09:22 oh i love this. been wanting it forever. trout deploy already knows everything that's about to change, would be cheap to dump the prior state to disk first

kbnetes 09:31 hmm. love the idea in spirit. nervous about scope. "snapshot cluster state" can mean anything from "yaml of the resources i'm touching" to "full etcd dump + every PVC". we need to pin this down before we sign up for it

🤔 5

marblesoup 09:33 fair. for my use case it's configs and helm releases. i don't need PVCs snapshotted, that's a backup product

kbnetes 09:35 right but someone in this discord WILL ask for PVCs the moment we ship it. just calling it now

chrissyOPS 09:40 +1 to scoping tight. configs and helm seem like the 80% case. anything beyond that is velero territory

dr_taco 09:44 fwiw i think we could do this in like 2 weeks if we keep it to "what trout was about to change, snapshotted, written to a local dir, indexed by deploy id". rollback = re-apply the snapshot

verysleepyperson 10:02 y'all i am SO into this. lost 6 hours last quarter to a bad deploy. would have paid money for an undo button

kbnetes 10:11 ok let me push back gently though. is this a v3 feature or v2.x. because if it's v2.x we need to be REALLY sure it doesn't change any existing command behavior. semver compatibility is a thing i actually care about lol

marblesoup 10:14 new flag, new subcommand, no behavior change to existing flows. should be v2.x in my head

dr_taco 10:15 agree, additive only

kbnetes 10:16 ok i'll buy that. for now.

--- Day 2 (Tuesday, 2026-05-13) ---

chrissyOPS 08:47 slept on this. one thing i want to flag: helm rollback is already a thing (helm rollback <release> <rev>). do we just shell out to helm for the helm parts, or reimplement?

dr_taco 08:55 shell out. 100%. we are not in the business of reimplementing helm's revision history

kbnetes 08:57 agree. trout's job is orchestration. helm does helm.

chrissyOPS 09:01 ok cool. so the "snapshot" for helm bits is really just "remember the release revision number that was live before we deployed". on rollback we call helm rollback with that number. that's it

marblesoup 09:04 yes exactly. and for raw kubectl-style configs we keep a yaml dump
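
A minimal sketch of the snapshot model described above: the helm side is only a remembered revision number per release, the config side is a YAML dump, and everything is written to a local directory keyed by deploy id. The `Snapshot` class, the `helm.json` file name, and the directory layout are assumptions made for illustration, not trout's actual format. The only real CLI calls referenced are `helm rollback <release> <rev>` and `kubectl apply -f <file>`, and the sketch builds those commands without running them.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class Snapshot:
    """State recorded before a deploy (hypothetical layout, not trout's format)."""
    deploy_id: str
    # Helm release name -> revision that was live before the deploy.
    helm_revisions: Dict[str, int] = field(default_factory=dict)
    # File name -> YAML text of each raw config that was about to change.
    manifests: Dict[str, str] = field(default_factory=dict)


def save_snapshot(snap: Snapshot, root: Path) -> Path:
    """Write the snapshot under root/<deploy_id>/ and return that directory."""
    target = root / snap.deploy_id
    target.mkdir(parents=True, exist_ok=True)
    (target / "helm.json").write_text(json.dumps(snap.helm_revisions, indent=2))
    for name, text in snap.manifests.items():
        (target / name).write_text(text)
    return target


def load_snapshot(deploy_id: str, root: Path) -> Snapshot:
    """Read back a snapshot written by save_snapshot."""
    target = root / deploy_id
    revisions = json.loads((target / "helm.json").read_text())
    manifests = {p.name: p.read_text() for p in sorted(target.glob("*.yaml"))}
    return Snapshot(deploy_id, revisions, manifests)


def rollback_commands(snap: Snapshot, root: Path) -> List[List[str]]:
    """Build (but do not run) the commands an undo would shell out to."""
    cmds = []
    # Helm parts: delegate to helm with the remembered revision number.
    for release, revision in sorted(snap.helm_revisions.items()):
        cmds.append(["helm", "rollback", release, str(revision)])
    # Raw configs: re-apply the dumped YAML files.
    for name in sorted(snap.manifests):
        cmds.append(["kubectl", "apply", "-f", str(root / snap.deploy_id / name)])
    return cmds
```

Returning the commands as argument lists keeps the sketch testable without a cluster. A real implementation would pass each list to something like `subprocess.run` and decide what to do when one of them fails partway through.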

verysleepyperson 11:30 random aside. has anyone looked at how kubectl-timetravel does this? they apparently keep a ring buffer of last N applied manifests on disk and let you kubectl-timetravel undo. seems close to what we're describing

dr_taco 11:34 oh interesting, link?

verysleepyperson 11:36 i'll dig it up. think it's a side project from someone at one of the cloud providers. point being, prior art exists, we don't need to invent the model

kbnetes 11:41 this is reassuring actually. if there's prior art for the local-disk-ring-buffer pattern we can crib from that and not overthink storage
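
For reference, the "ring buffer of last N on disk" pattern can be sketched in a few lines. This is not kubectl-timetravel's implementation (the thread only has a secondhand description of it). It assumes one directory per snapshot, with names that sort oldest-first, and `prune_snapshots` is a hypothetical helper.

```python
import shutil
from pathlib import Path
from typing import List


def prune_snapshots(root: Path, keep: int) -> List[str]:
    """Delete all but the newest `keep` snapshot directories.

    Assumes directory names sort oldest-first, e.g. zero-padded or
    timestamp-prefixed deploy ids. Returns the names that were removed.
    """
    entries = sorted(p for p in root.iterdir() if p.is_dir())
    # Number of entries beyond the retention limit (never negative).
    excess = max(len(entries) - keep, 0)
    stale = entries[:excess]
    for path in stale:
        shutil.rmtree(path)
    return [p.name for p in stale]
```

Calling this after every successful snapshot write keeps disk usage bounded at roughly N snapshots without needing a separate index file.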

chrissyOPS 13:02 storage question though. local disk works for solo devs but ops teams will want a shared backing store. s3, gcs, whatever. should that be in v1?

marblesoup 13:08 local default, s3 optional via config. i don't think we need every blob store on day 1

dr_taco 13:09 +1. s3 is fine as the "remote" option for v1. gcs/azure can come later if anyone screams loud enough

kbnetes 13:14 i could live with that

👍 4
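
A rough sketch of "local default, s3 optional via config", assuming the config has already been parsed into a dict. The key names (`snapshot.storage.backend`, `dir`, `bucket`, `prefix`) and the default directory are made up for illustration. Nothing in the thread fixes a config schema.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class StorageConfig:
    backend: str                      # "local" or "s3"
    local_dir: Optional[Path] = None  # set when backend == "local"
    bucket: Optional[str] = None      # set when backend == "s3"
    prefix: str = ""


def resolve_storage(config: dict) -> StorageConfig:
    """Pick the snapshot backend from parsed config (hypothetical key names)."""
    section = config.get("snapshot", {}).get("storage", {})
    backend = section.get("backend", "local")  # local disk unless told otherwise
    if backend == "local":
        return StorageConfig("local", local_dir=Path(section.get("dir", ".trout/snapshots")))
    if backend == "s3":
        bucket = section.get("bucket")
        if not bucket:
            raise ValueError("s3 snapshot storage requires a bucket name")
        return StorageConfig("s3", bucket=bucket, prefix=section.get("prefix", ""))
    # gcs/azure and friends are deliberately rejected in this sketch.
    raise ValueError(f"unsupported snapshot storage backend: {backend!r}")
```

Rejecting unknown backends with a clear error, instead of silently falling back to local disk, means a typo in the config cannot quietly send snapshots somewhere the team is not looking.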

chrissyOPS 13:20 also: deploy latency. if snapshotting adds 30 seconds to every deploy, people will turn it off. it has to be fast. like, "you don't notice it" fast

dr_taco 13:22 yeah this is non-negotiable for me too. snapshot has to be near-zero overhead or it's dead on arrival

--- Day 3 (Wednesday, 2026-05-14) ---

marblesoup 07:55 ok i wrote up a rough rfc last night. very rough. https://gist.github.com/marblesoup/snapshot-rollback-trout-rfc-draft have a look, tear it apart

dr_taco 08:30 reading now

dr_taco 08:42 overall yes. one thing: you have PVCs listed under "stretch goals". i'd move them to non-goals for v1, full stop. "stretch" invites scope creep

marblesoup 08:44 fair, will edit

kbnetes 09:10 read it. broadly agree with direction. the PVC thing though, i want to argue for. SOME workloads are stateful and rolling back the config without rolling back the volume state leaves you in a worse place than before

chrissyOPS 09:14 counter: rolling back PVCs is a backup/restore problem, not a deploy problem. if your stateful workload needs volume rollback you need velero or your csi driver's snapshot story. trout shouldn't be in that game

kbnetes 09:18 i hear you but the user experience is going to be "i ran trout undo and my database is still broken"

dr_taco 09:22 that's a docs problem imo. clearly say "rolls back configs and helm releases, does not touch persistent volumes, use velero for that". if we try to do PVCs we will be doing it for the next 18 months

verysleepyperson 09:30 strong +1 to docs/explicit-non-goal. one of the things i like about trout is it doesn't try to be everything

kbnetes 09:45 mmm. ok. i'll come around. PVCs out for v1. but i want the rfc to have an explicit "future work" section that calls out the stateful workload gap, so we're honest with users

marblesoup 09:46 deal. adding that section now

chrissyOPS 09:48 this is good. feels like we're actually converging

👍 6 ✨ 2

dr_taco 10:30 on storage: rfc says "local default, s3 optional, bring your own bucket". i think that's the right call. anything more than s3 for v1 is overreach

marblesoup 10:32 that's where i landed too

verysleepyperson 14:11 re: success criteria. how do we know this feature is actually working out there? like, what's the proof it solved something

marblesoup 14:14 good q. couple ideas: (1) we see actual adoption among power users, like 3+ teams using it within first couple months. (2) we get fewer "i broke prod and had to manually unwind" tickets

chrissyOPS 14:17 both feel measurable. i like it.

--- Day 4 (Thursday, 2026-05-15) ---

dr_taco 08:01 ok bikeshed time. the command. options on the table:

  1. trout undo
  2. trout rollback
  3. trout snapshot restore

dr_taco 08:02 i vote trout undo. it's the most obvious to a panicked user at 2am

marblesoup 08:04 strong agree. nobody at 2am types "snapshot restore"

kbnetes 08:09 rollback is more accurate to what it does. undo implies it works on any action, which it doesn't

chrissyOPS 08:11 counter-counter: the user doesn't care about technical accuracy at 2am. they care about "make it stop". undo wins on that axis

dr_taco 08:13 also undo is short. people will type it more

kbnetes 08:20 ok fine. undo. but document it as "undoes the last trout deploy" so nobody thinks it undoes arbitrary cluster actions

marblesoup 08:21 yes, that's the right framing

verysleepyperson 08:30

👍 5

verysleepyperson 08:31 agreed, trout undo is the move

chrissyOPS 11:02 ok next question that's been nagging me. storage. if we offer s3 as a managed thing, does this become a paid cloud feature eventually? like "trout cloud stores your snapshots for you, $9/mo"?

dr_taco 11:08 oh hmm. interesting. i don't hate it but i don't want to design v1 around monetization. bring-your-own-bucket for now, see if anyone wants managed later

marblesoup 11:10 that's how i'd play it too. ship the open thing, see what people ask for

kbnetes 11:14 +1. don't put monetization on the critical path of a v1 ux decision

chrissyOPS 11:15 ok yeah, fair. just wanted to flag it before we built ourselves into a corner

verysleepyperson 16:40 random thought: should trout deploy auto-snapshot by default, or opt-in via flag?

marblesoup 16:45 default on. the whole value prop is "i didn't have to remember to snapshot"

dr_taco 16:46 default on, with a --no-snapshot escape hatch for the people who really don't want it

kbnetes 16:48 agree. defaults matter
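
The "default on, `--no-snapshot` to opt out" behavior is easy to sketch with a stock argument parser. This uses Python's `argparse` purely as an illustration. It says nothing about how trout's real CLI is built, and only the `deploy` and `undo` commands and the `--no-snapshot` flag come from the discussion.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Toy CLI showing snapshot-by-default with an opt-out flag."""
    parser = argparse.ArgumentParser(prog="trout")
    sub = parser.add_subparsers(dest="command", required=True)

    deploy = sub.add_parser("deploy", help="deploy, snapshotting prior state first")
    # store_false makes args.snapshot default to True; the flag turns it off.
    deploy.add_argument(
        "--no-snapshot",
        dest="snapshot",
        action="store_false",
        help="skip the pre-deploy snapshot",
    )

    sub.add_parser("undo", help="undo the last trout deploy")
    return parser
```

Because the flag is additive and the existing `deploy` invocation keeps working without it, this shape matches the "no behavior change to existing flows" constraint. The caveat is that deploys now write a snapshot unless told not to.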

--- Day 5 (Friday, 2026-05-16) ---

marblesoup 09:00 ok i think we've actually landed somewhere. let me try to summarize where we are:

marblesoup 09:01

  • trout adds snapshot + rollback. snapshot captures configs and helm release revisions before each deploy. rollback (trout undo) reapplies the snapshot in one command
  • snapshot is on by default, --no-snapshot opts out
  • storage: local disk by default, optional s3 via config (bring your own bucket)
  • PVCs are explicitly out of scope for v1. documented as future work. velero is the answer for stateful workloads today
  • helm rollback is delegated to helm itself (we just remember the revision number)
  • ships as v2.x because it's additive, no behavior change to existing commands
  • has to be near-zero added latency on deploy, this is a hard constraint
  • success looks like: a handful of power-user teams (3+) adopting within ~60 days, and fewer "i broke prod and had to manually unwind" support pings

marblesoup 09:03 filed an issue tracking this: https://github.com/trout-cli/trout/issues/847

dr_taco 09:10 this is a great summary. lgtm

chrissyOPS 09:14 +1, ready to start prototyping

👍 6 🚀 3

verysleepyperson 09:20 hyped. happy to test once there's a branch

kbnetes 10:30 i'm going to register one ongoing dissent for the record. i still think v1 without ANY stateful workload story is going to confuse users badly. i don't have the votes to block it and i'm not trying to, but i'm going to write a separate issue documenting my concerns so we have a paper trail when people complain. just want it on the record

marblesoup 10:33 totally fair, that's healthy. please do file it, would rather have your concerns documented than not

kbnetes 10:34 will do today

dr_taco 10:40 appreciate that approach honestly. shipping with eyes open > pretending we solved everything

chrissyOPS 11:00 ok then. snapshot-and-rollback, v2.x, configs + helm only, local-or-s3, trout undo, default on. let's build it

👍 8 :trout: 4

verysleepyperson 11:02 lol who added a :trout: emoji

chrissyOPS 11:03 that was me, several weeks ago, no regrets