azure-diagnostics: Add Inspektor Gadget Reference #1961

Open
mqasimsarfraz wants to merge 1 commit into microsoft:main from mqasimsarfraz:ig

@mqasimsarfraz
Member

This PR introduces Inspektor Gadget to allow deeper troubleshooting of AKS clusters. Key points:

  • All commands use kubectl debug, so no additional installation is required
  • All steps use MCR images
  • Content is kept in the references/ directory to avoid bloating the initial context

Also, gadget commands are added contextually in networking, node-issues, and pod-failures, with the full reference in references/inspektor-gadget.md.

→ npm run tokens check -- plugin/skills/azure-diagnostics/aks-troubleshooting/references/inspektor-gadget.md

> @github-copilot-for-azure/scripts@1.0.0 tokens
> node --import tsx src/tokens/cli.ts check plugin/skills/azure-diagnostics/aks-troubleshooting/references/inspektor-gadget.md


📊 Token Limit Check
════════════════════════════════════════════════════════════
Files Checked: 1
Files Exceeded: 0

✅ All files within token limits!

cc: @julia-yin

Contributor

Copilot AI left a comment

Pull request overview

This PR adds Inspektor Gadget guidance to the azure-diagnostics AKS troubleshooting skill to enable deeper node/pod-level diagnostics when standard Azure/Kubernetes evidence is inconclusive.

Changes:

  • Added a new Inspektor Gadget reference document with command patterns, filters, and a symptom-to-gadget map.
  • Introduced a “Deep Diagnostics Flow” reference and added contextual IG command snippets to AKS networking, node-issues, and pod-failures playbooks.
  • Updated the main AKS troubleshooting guide to include Inspektor Gadget as the third step in the evidence order.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| plugin/skills/azure-diagnostics/aks-troubleshooting/references/inspektor-gadget.md | New IG reference: base command pattern, gadget catalog, and symptom mapping. |
| plugin/skills/azure-diagnostics/aks-troubleshooting/references/command-flows.md | Adds an IG “Deep Diagnostics Flow” and link to the IG reference. |
| plugin/skills/azure-diagnostics/aks-troubleshooting/pod-failures.md | Adds IG deep-diagnostics commands for CrashLoopBackOff/OOM-style investigations. |
| plugin/skills/azure-diagnostics/aks-troubleshooting/node-issues.md | Adds IG guidance for PID pressure / unknown process load. |
| plugin/skills/azure-diagnostics/aks-troubleshooting/networking.md | Adds IG networking/DNS tracing commands for deeper connectivity troubleshooting. |
| plugin/skills/azure-diagnostics/aks-troubleshooting/aks-troubleshooting.md | Updates evidence order and guidance to incorporate IG as a third-step diagnostic tool. |

Comment thread plugin/skills/azure-diagnostics/aks-troubleshooting/networking.md Outdated
Comment thread plugin/skills/azure-diagnostics/aks-troubleshooting/networking.md Outdated
Comment thread plugin/skills/azure-diagnostics/aks-troubleshooting/networking.md Outdated
Collaborator

@jongio jongio left a comment

A few observations on top of the existing bot review.

This doc-only PR adds Inspektor Gadget commands across four AKS troubleshooting pages plus a new reference. Structure and placement match the existing skill conventions. Three things are worth a look below; nothing blocking.

Comment thread plugin/skills/azure-diagnostics/aks-troubleshooting/aks-troubleshooting.md Outdated
Comment thread plugin/skills/azure-diagnostics/aks-troubleshooting/pod-failures.md Outdated
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comment thread plugin/skills/azure-diagnostics/aks-troubleshooting/references/command-flows.md Outdated
Comment thread plugin/skills/azure-diagnostics/aks-troubleshooting/networking.md Outdated
Comment thread plugin/skills/azure-diagnostics/aks-troubleshooting/references/command-flows.md Outdated
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

Collaborator

@jongio jongio left a comment

One small consistency nit on the new networking deep-diagnostics block. Otherwise the latest pass addresses everything I flagged earlier.

Comment thread plugin/skills/azure-diagnostics/aks-troubleshooting/networking.md Outdated
Member Author

@mqasimsarfraz mqasimsarfraz left a comment

Thanks @jongio for the quick reviews! :)

I have addressed the outstanding comment; please feel free to let me know if you have any other thoughts!

Comment thread plugin/skills/azure-diagnostics/aks-troubleshooting/networking.md Outdated
jongio previously approved these changes Apr 21, 2026
Collaborator

@jongio jongio left a comment

Confirmed - the trace_tcpdrop reference is out of networking.md and the catalog stays the single source of truth. Nothing else from my side. Thanks for the quick turnarounds.

@mqasimsarfraz
Member Author

@jongio Thanks for the approval, but the changes had a conflict because of #1968, so I had to push again! I assume it will require approval again!

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Comment thread plugin/skills/azure-diagnostics/aks-troubleshooting/node-issues.md Outdated
Comment thread plugin/skills/azure-diagnostics/aks-troubleshooting/node-issues.md Outdated
Comment thread plugin/skills/azure-diagnostics/troubleshooting/aks/pod-failures.md
jongio previously approved these changes Apr 22, 2026
Collaborator

@jongio jongio left a comment

Walked through all six files end-to-end. The reference is well-organized - single version pin, clear base command pattern, and the symptom-to-gadget map makes the right gadget easy to find. Contextual IG sections in networking, node-issues, and pod-failures point back to the reference without duplicating it.

No new issues since my last pass. Ship it.

tmeschter previously approved these changes Apr 22, 2026
Signed-off-by: Qasim Sarfraz <qasimsarfraz@microsoft.com>
@mqasimsarfraz
Member Author

@jongio @tmeschter there were conflicts again :(, so I pushed the latest changes! The PR is ready to go from my side, so please feel free to have a look (hopefully for the last time)!

Collaborator

@jongio jongio left a comment

Re-reviewed after rebase. All feedback from earlier rounds addressed - version pin centralized, trace_tcpdrop removed from inline references, evidence order corrected. Structure is clean: single reference file for the catalog, contextual sections in networking/node-issues/pod-failures don't duplicate it. CI green.


4 participants