Autonomous Kubernetes Troubleshooting with k8x

This article was co-authored with Shankar Ganeshan.

Why Kubernetes troubleshooting still hurts and why we built `k8x`.

Even veteran SREs know the familiar pain cycle:

| Symptom | Usual manual steps |
| --- | --- |
| Pods in `CrashLoopBackOff` | • `kubectl get pods -A` → copy failing names • `kubectl logs …` (repeat per pod & container) • grep for hints, cross-check ConfigMaps |
| 503s on an Ingress | • Check Service endpoints → kube-proxy rules → NetworkPolicy • Compare readiness probes & resource pressure |
| “`myservice` works on most nodes” | • Inspect taints/labels, daemonsets, CNI logs • Describe nodes for kernel versions & allocatable |

Each scenario is multi-step, context-heavy, and a single typo can derail the process. General-purpose LLM helpers (ChatGPT, Claude, Copilot Chat) can suggest commands, but you still copy-paste, re-run, and stitch results together.
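To make the first row of that table concrete, here is a minimal Python sketch of the "copy failing names" step. It assumes you have the JSON from `kubectl get pods -A -o json`; the function name `crashlooping_containers` is hypothetical and is not part of k8x.

```python
def crashlooping_containers(pod_list: dict) -> list[tuple[str, str, str]]:
    """Return (namespace, pod, container) for containers waiting in CrashLoopBackOff.

    `pod_list` is the parsed output of `kubectl get pods -A -o json`.
    """
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        # Init containers and regular containers report status in separate lists.
        statuses = status.get("initContainerStatuses", []) + status.get("containerStatuses", [])
        for cs in statuses:
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                hits.append((meta.get("namespace", ""), meta.get("name", ""), cs.get("name", "")))
    return hits


def log_commands(pod_list: dict) -> list[str]:
    """Build the follow-up `kubectl logs` command for each crash-looping container."""
    return [
        f"kubectl logs -n {ns} {pod} -c {container} --previous"
        for ns, pod, container in crashlooping_containers(pod_list)
    ]
```

Feeding it `json.loads(...)` of the real command output gives you the per-container `kubectl logs` lines you would otherwise assemble by hand. That is still only the first of several steps in the cycle above.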

The agentic leap: from suggestion to orchestrated review

Recent agents like GitHub Copilot Chat (in VS Code with terminal access), Claude Code (terminal-native edits), and Goose showed a new pattern: the LLM drives an interactive loop, executing safe commands autonomously and then narrating the findings. k8x applies the same agentic idea to Kubernetes:

Natural-language prompts → e.g. “Find pods that aren’t ready and tell me why.”
The agent plans a sequence: kubectl get …, kubectl describe …, maybe kubectl top ….
It executes those read-only commands, parses output, and reasons about root causes.
Results appear as an explanation first, with raw command logs one keystroke away.

Unlike code-centric tools, k8x is infra-native. It understands resource kinds, status fields, events, and failure taxonomies (image pull, scheduling, OOMKilled).
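As a rough illustration of what a failure taxonomy over status fields could look like, the sketch below maps reason strings that Kubernetes reports on a Pod to coarse categories. The `TAXONOMY` grouping and the `classify_pod` name are assumptions made for this example; the article does not describe how k8x does this internally.

```python
# Reason strings as reported by Kubernetes; the grouping into categories is illustrative.
TAXONOMY = {
    "ImagePullBackOff": "image pull",
    "ErrImagePull": "image pull",
    "Unschedulable": "scheduling",
    "OOMKilled": "out of memory",
    "CrashLoopBackOff": "crash loop",
}


def classify_pod(pod: dict) -> set[str]:
    """Return the failure categories found in one Pod object (parsed JSON)."""
    reasons = set()
    status = pod.get("status", {})
    # Scheduling failures show up as a PodScheduled condition with status "False".
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            reasons.add(cond.get("reason", ""))
    # Container failures show up under state (current) and lastState (previous run).
    for cs in status.get("initContainerStatuses", []) + status.get("containerStatuses", []):
        for phase in ("state", "lastState"):
            for detail in (cs.get(phase) or {}).values():
                reason = (detail or {}).get("reason")
                if reason:
                    reasons.add(reason)
    return {TAXONOMY[r] for r in reasons if r in TAXONOMY}
```

Checking `lastState` as well as `state` matters in practice: a container that was OOM-killed usually shows `CrashLoopBackOff` as its current waiting reason and `OOMKilled` only on the previous termination.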

Design choices that matter to operators

k8x runs in your console with your current kubectl configuration and credentials, performing autonomous, multi-step workflows to detect and troubleshoot Kubernetes issues.

| Principle | Experience | Why it builds trust |
| --- | --- | --- |
| Read-only by default (v0.1) | Zero risk of deletions; mirrors commands you could type | Safely trial AI before granting mutate rights |
| Plain `kubectl` under the hood | Familiar audit trail; works anywhere your kubeconfig does | No proprietary sidecars or admission webhooks |
| Multi-LLM back-end | Select OpenAI, Claude, or Gemini at `k8x configure` | Avoid vendor lock-in; keep traffic in-house |
| Command history & undo | `k8x history list` shows past sessions | Auditors see exactly what ran; SREs replay in staging |
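
One way to picture the read-only principle is an allowlist check that runs before any command is executed. The sketch below is an assumption about how such a guard could be written, not k8x's actual policy; `READ_ONLY_VERBS` and `is_read_only` are hypothetical names, and the check is deliberately conservative (the verb must come right after `kubectl`).

```python
import shlex

# Assumed allowlist of kubectl verbs that only read cluster state.
READ_ONLY_VERBS = {"get", "describe", "logs", "top", "explain", "version", "api-resources"}

# Characters that would only matter to a shell; reject them outright.
SHELL_META = set(";|&><`$")


def is_read_only(command: str) -> bool:
    """Return True only for a single kubectl invocation whose verb is on the allowlist."""
    try:
        argv = shlex.split(command)
    except ValueError:
        # Unbalanced quotes or similar: refuse rather than guess.
        return False
    if len(argv) < 2 or argv[0] != "kubectl":
        return False
    if any(ch in SHELL_META for token in argv for ch in token):
        return False
    return argv[1] in READ_ONLY_VERBS
```

A guard like this keeps the audit story simple: anything it rejects never runs, and anything it accepts is a command you could have typed yourself. Note that read-only does not mean harmless to disclose, since `kubectl get` can still return Secret contents.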

There’s more to come, including write permissions and parallelism. Contributions are welcome.

A day in the life with k8x

# 1️⃣ Something is off
default$ k8x -c "my checkout service is returning 502s"

# 2️⃣ Agent plan (condensed)
• Check Ingress status
• Verify Service endpoints
• Scan pod readiness & logs
• Examine recent HPA events

# 3️⃣ Summary
❌ 2/5 endpoints unhealthy
↳ Pods stuck in Init:CrashLoopBackOff (db-migrations)
↳ Migration container fails on `ALTER TABLE …` (lock timeout)
Suggested fix: run `kubectl exec` into db-migration-pod or scale replicas to 0/1 to release lock

In ~30 s, you get an actionable story instead of fifteen manual commands.
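
A summary line like "2/5 endpoints unhealthy" can be derived from a single Endpoints object. The sketch below assumes the JSON from `kubectl get endpoints <service> -o json`; the function names are hypothetical and only illustrate the counting, not how k8x produces its summary.

```python
def endpoint_health(endpoints: dict) -> tuple[int, int]:
    """Return (unhealthy, total) address counts for one Endpoints object."""
    ready = not_ready = 0
    for subset in endpoints.get("subsets") or []:
        # Ready backends are listed under `addresses`, failing ones under `notReadyAddresses`.
        ready += len(subset.get("addresses") or [])
        not_ready += len(subset.get("notReadyAddresses") or [])
    return not_ready, ready + not_ready


def summarize(endpoints: dict) -> str:
    """Render the counts as a one-line summary."""
    unhealthy, total = endpoint_health(endpoints)
    if total == 0:
        return "no endpoints registered"
    return f"{unhealthy}/{total} endpoints unhealthy"
```

An empty result is worth reporting separately, because a Service with no endpoints at all usually points to a selector mismatch rather than to failing pods.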

How a multi-step review actually works

Intent parsing – Translates English prompts into an internal diagnostic goal.
Planning – LLM selects a safe chain of read-only kubectl queries.
Adaptive execution – After each command, it decides if deeper queries are needed.
Reasoning & templated explanations – Maps results to known issue patterns for a deterministic, auditable summary.
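
The planning and adaptive-execution steps can be sketched as a small loop in which a planner picks the next command from the transcript so far. Everything here is a stand-in: `investigate`, `scripted_planner`, and the `shop`/`checkout-1` names are hypothetical, and in a real agent the planner would be an LLM call and `run` would shell out to `kubectl`.

```python
from typing import Callable, Optional

Transcript = list[tuple[str, str]]            # (command, output) pairs, in order
Planner = Callable[[str, Transcript], Optional[str]]
Runner = Callable[[str], str]


def investigate(goal: str, plan_next: Planner, run: Runner, max_steps: int = 8) -> Transcript:
    """Run commands one at a time, letting the planner react to each result."""
    transcript: Transcript = []
    for _ in range(max_steps):                # hard cap so the loop always terminates
        command = plan_next(goal, transcript)
        if command is None:                   # planner decided it has seen enough
            break
        transcript.append((command, run(command)))
    return transcript


def scripted_planner(goal: str, transcript: Transcript) -> Optional[str]:
    """Toy planner standing in for the LLM: dig deeper only if a crash loop shows up."""
    if not transcript:
        return "kubectl get pods -n shop"
    last_command, last_output = transcript[-1]
    if last_command.startswith("kubectl get pods") and "CrashLoopBackOff" in last_output:
        return "kubectl describe pod checkout-1 -n shop"
    return None
```

Passing the runner in as a parameter keeps the loop testable with canned output, and it gives one obvious place to enforce a read-only check before anything touches the cluster.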

Only redacted command output reaches the LLM, never full application logs, which is another trust measure.
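
The article does not say how redaction is done, so the following is only a sketch of one plausible approach: mask values that follow secret-looking keys, and mask bearer tokens, before any text leaves the machine. The patterns and the `redact` function are assumptions for illustration.

```python
import re

# Assumed patterns; a real redactor would need a broader, tested rule set.
_PATTERNS = [
    # Bearer tokens in headers or log lines.
    (re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/-]+=*"), "Bearer [REDACTED]"),
    # key=value or key: value pairs whose key looks like a credential.
    (
        re.compile(r"(?i)\b([\w.-]*(?:password|passwd|secret|token|api[_-]?key)[\w.-]*)(\s*[=:]\s*)(\S+)"),
        r"\1\2[REDACTED]",
    ),
]


def redact(text: str) -> str:
    """Mask likely credentials in command output; everything else passes through."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Pattern-based masking like this is best-effort. It catches common shapes but cannot guarantee that nothing sensitive slips through, which is one reason to keep full application logs out of the prompt entirely.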

Where k8x stands in the AI-ops landscape

| Tool | Domain | Autonomy | Local-first? | Write access |
| --- | --- | --- | --- | --- |
| GitHub Copilot Chat | Code / CI | Suggests fixes, runs queries in UI | No | Optional PR commits |
| Claude Code | Code & CLI automation | Plans & edits files | Yes | Yes (file edits) |
| Goose | Multi-agent dev tasks | Runs terminal commands | Yes | Yes |
| k8x (v0.1) | Kubernetes operations | Plans & executes `kubectl` reads | Yes | No (read-only) |

k8x fills the gap for platform and DevOps engineers looking for Copilot-level assistance after deployment, not just in CI/CD.

Getting started in 60 seconds

brew tap aihero/k8x
brew install k8x         # installs v0.1.1
k8x configure            # choose LLM & set API key
k8x -c "Are all pods running?"

You’ll never look at a 3-screen tmux layout the same way again.

Roadmap sneak-peek

v0.2 – Declarative fixes
- Generate a patch plan (kubectl diff) and let humans --approve.
Terraform & cloud-CLI mode
- Run terraform plan or aws eks update-kubeconfig as sub-steps.
Cluster runbooks as code
- Store successful sessions as YAML recipes to auto-trigger on alerts.

Final thoughts

Generative AI agents are moving from IDEs into production infrastructure. By combining LLM planning, policy-guarded execution, and domain-specific reasoning, k8x transforms Kubernetes troubleshooting from a scavenger hunt into a guided review. Start with read-only diagnostics today; when you’re ready, the agent will apply fixes, one audited pull request at a time.

Have thoughts? I'd love to chat!

More about Rahul and Elevate.do

Follow Rahul on LinkedIn

Follow Rahul on Twitter/X
