This article was co-authored with Shankar Ganeshan. You can find him on LinkedIn here.

Why Kubernetes troubleshooting still hurts and why we built k8x.

Even veteran SREs know the familiar pain cycle:
| Symptom | Usual manual steps |
|---|---|
| Pods in CrashLoopBackOff | kubectl get pods -A → copy failing names; kubectl logs … (repeat per pod & container); grep for hints, cross-check ConfigMaps |
| 503s on an Ingress | Check Service endpoints → kube-proxy rules → NetworkPolicy; compare readiness probes & resource pressure |
| “myservice works on most nodes” | Inspect taints/labels, daemonsets, CNI logs; describe nodes for kernel versions & allocatable |
Each scenario is multi-step, context-heavy, and a single typo can derail the process. General-purpose LLM helpers (ChatGPT, Claude, Copilot Chat) can suggest commands, but you still copy-paste, re-run, and stitch results together.
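To make the first row of that table concrete, here is a minimal Python sketch of the "copy failing names, then fetch logs" step. It assumes you have already captured the output of `kubectl get pods -A -o json` as a parsed dictionary; the function names `crashlooping_containers` and `log_commands` are illustrative and are not part of k8x.

```python
from typing import List, Tuple


def crashlooping_containers(pod_list: dict) -> List[Tuple[str, str, str]]:
    """Return (namespace, pod, container) for containers waiting in CrashLoopBackOff.

    `pod_list` is the parsed JSON of `kubectl get pods -A -o json`.
    """
    hits = []
    for pod in pod_list.get("items") or []:
        meta = pod.get("metadata") or {}
        status = pod.get("status") or {}
        # Init containers can crash-loop too, so check both status lists.
        statuses = (status.get("initContainerStatuses") or []) + (
            status.get("containerStatuses") or []
        )
        for cs in statuses:
            waiting = (cs.get("state") or {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                hits.append(
                    (meta.get("namespace", ""), meta.get("name", ""), cs.get("name", ""))
                )
    return hits


def log_commands(hits: List[Tuple[str, str, str]]) -> List[str]:
    """Build one `kubectl logs` command per failing container.

    --previous asks for the logs of the last terminated instance,
    which is usually the one that crashed.
    """
    return [
        f"kubectl logs -n {ns} {pod} -c {container} --previous"
        for ns, pod, container in hits
    ]
```

Even this small helper replaces the copy-paste step, but it still leaves the reading and cross-checking to a human.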

The agentic leap: from suggestion to orchestrated review

Recent agents such as GitHub Copilot Chat (in VS Code with terminal access), Claude Code (terminal-native edits), and Goose introduced a new pattern: the LLM drives an interactive loop, executing safe commands autonomously and then narrating the findings. k8x applies the same agentic idea to Kubernetes:
  • Natural-language prompts → e.g. “Find pods that aren’t ready and tell me why.”
  • The agent plans a sequence: kubectl get …, kubectl describe …, maybe kubectl top ….
  • It executes those read-only commands, parses output, and reasons about root causes.
  • Results appear as an explanation first, with raw command logs one keystroke away.
Unlike code-centric tools, k8x is infra-native. It understands resource kinds, status fields, events, and failure taxonomies (image pull, scheduling, OOMKilled).
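As an illustration of what a failure taxonomy can look like, the following Python sketch maps a pod's status fields to the three categories named above. It is an assumption about how such a classifier might work, not k8x's actual logic; `classify_pod` and its return labels are hypothetical, while the status fields it reads (`conditions`, `containerStatuses`, `state.waiting.reason`, `lastState.terminated.reason`) are standard Pod status fields.

```python
def classify_pod(pod: dict) -> str:
    """Map a pod object (parsed JSON) to a coarse failure category."""
    status = pod.get("status") or {}

    # A pod the scheduler could not place has PodScheduled == "False".
    for cond in status.get("conditions") or []:
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            return "scheduling"

    statuses = (status.get("initContainerStatuses") or []) + (
        status.get("containerStatuses") or []
    )
    for cs in statuses:
        state = cs.get("state") or {}
        waiting = state.get("waiting") or {}
        if waiting.get("reason") in {"ErrImagePull", "ImagePullBackOff"}:
            return "image-pull"

        # OOM kills show up as the termination reason of the current
        # or the previous container instance.
        current = state.get("terminated") or {}
        last = (cs.get("lastState") or {}).get("terminated") or {}
        if "OOMKilled" in (current.get("reason"), last.get("reason")):
            return "oom-killed"

    return "unknown"
```

A real taxonomy would cover many more cases (probe failures, evictions, config errors); the point is only that the categories are derived from structured status fields rather than from free-text log searches.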

Design choices that matter to operators

k8x runs in your console with your current kubectl configuration and credentials, performing autonomous, multi-step workflows to detect and troubleshoot Kubernetes issues.
| Principle | Experience | Why it builds trust |
|---|---|---|
| Read-only by default (v0.1) | Zero risk of deletions; mirrors commands you could type | Safely trial AI before granting mutate rights ([GitHub][3]) |
| Plain kubectl under the hood | Familiar audit trail; works anywhere your kubeconfig does | No proprietary sidecars or admission webhooks |
| Multi-LLM back-end | Select OpenAI, Claude, or Gemini at k8x configure | Avoid vendor lock-in; keep traffic in-house |
| Command history & undo | k8x history list shows past sessions | Auditors see exactly what ran; SREs replay in staging |
There’s more to come, including write permissions and parallelism. Contributions are welcome.
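One way to picture the read-only principle is an allowlist check that runs before any command is executed. The sketch below is an assumption for illustration, not k8x's real guard: the verb list and the `is_read_only` name are hypothetical, and the check is deliberately conservative (it rejects commands that put flags before the verb).

```python
import shlex

# kubectl verbs that only read cluster state (illustrative, not exhaustive).
READ_ONLY_VERBS = {"get", "describe", "logs", "top", "explain", "version", "api-resources"}

# Characters that would let a command chain or redirect if it ever reached a shell.
SHELL_METACHARACTERS = set(";|&<>`$")


def is_read_only(command: str) -> bool:
    """Return True only for `kubectl <read-only verb> ...` with no shell tricks."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: refuse rather than guess.
        return False
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return False
    if any(SHELL_METACHARACTERS & set(token) for token in tokens):
        return False
    # Conservative: the verb must come directly after `kubectl`.
    return tokens[1] in READ_ONLY_VERBS
```

For example, `is_read_only("kubectl get pods -A")` is accepted, while `is_read_only("kubectl delete pod web-0")` and `is_read_only("kubectl get pods; kubectl delete pod web-0")` are refused. Refusing by default means a planner mistake results in a skipped step, not a changed cluster.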

A day in the life with k8x

# 1️⃣ Something is off
default$ k8x -c "my checkout service is returning 502s"

# 2️⃣ Agent plan (condensed)
 Check Ingress status
 Verify Service endpoints
 Scan pod readiness & logs
 Examine recent HPA events

# 3️⃣ Summary
 2/5 endpoints unhealthy
 Pods stuck in Init:CrashLoopBackOff (db-migrations)
 Migration container fails on `ALTER TABLE …` (lock timeout)
Suggested fix: run `kubectl exec` into db-migration-pod or scale replicas to 0/1 to release lock
In ~30 s, you get an actionable story instead of fifteen manual commands.
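A summary line such as "2/5 endpoints unhealthy" can be derived from structured output with no guessing. As a hedged sketch (the source does not say how k8x computes it), the following Python function counts ready and not-ready addresses in a core/v1 Endpoints object, for example the parsed result of `kubectl get endpoints <service> -o json`. The name `endpoint_health` is hypothetical.

```python
from typing import Tuple


def endpoint_health(endpoints: dict) -> Tuple[int, int]:
    """Return (unhealthy, total) address counts for an Endpoints object."""
    ready = 0
    not_ready = 0
    for subset in endpoints.get("subsets") or []:
        # `addresses` holds ready backends, `notReadyAddresses` the rest.
        ready += len(subset.get("addresses") or [])
        not_ready += len(subset.get("notReadyAddresses") or [])
    return not_ready, ready + not_ready


def summarize(endpoints: dict) -> str:
    """Format the counts in the style of the session summary."""
    unhealthy, total = endpoint_health(endpoints)
    return f"{unhealthy}/{total} endpoints unhealthy"
```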

How a multi-step review actually works

  1. Intent parsing – Translates English prompts into an internal diagnostic goal.
  2. Planning – LLM selects a safe chain of read-only kubectl queries.
  3. Adaptive execution – After each command, it decides if deeper queries are needed.
  4. Reasoning & templated explanations – Maps results to known issue patterns for a deterministic, auditable summary.
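The four steps above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not k8x's implementation: `diagnose`, `plan_next`, and `run` are hypothetical names, the planner callback stands in for the LLM call, and the prefix check is a simplified stand-in for whatever read-only policy the real tool applies.

```python
from typing import Callable, List, Optional, Tuple

Transcript = List[Tuple[str, str]]                   # (command, output) pairs
Planner = Callable[[str, Transcript], Optional[str]]  # next command, or None when done
Runner = Callable[[str], str]                         # executes a command, returns output

READ_ONLY_PREFIXES = ("kubectl get ", "kubectl describe ", "kubectl logs ", "kubectl top ")


def diagnose(goal: str, plan_next: Planner, run: Runner, max_steps: int = 8) -> Transcript:
    """Run read-only commands until the planner stops or the step budget runs out."""
    transcript: Transcript = []
    for _ in range(max_steps):
        # Adaptive execution: the planner sees everything gathered so far.
        command = plan_next(goal, transcript)
        if command is None:
            break
        if not command.startswith(READ_ONLY_PREFIXES):
            raise ValueError(f"refusing non-read-only command: {command}")
        transcript.append((command, run(command)))
    return transcript
```

In use, `plan_next` would wrap an LLM call and `run` would wrap a subprocess invocation; keeping both as plain callables makes the loop easy to test with canned outputs, and the returned transcript doubles as the audit trail.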
Only redacted command output reaches the LLM, never full application logs, which is another trust measure.
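The source does not describe the redaction rules, so the following Python sketch is only an assumption about what such a filter could look like: it masks values that follow common secret-like keys, masks bearer tokens, and keeps only the tail of long output. The patterns, the `redact` name, and the `max_lines` default are all hypothetical.

```python
import re

_PATTERNS = [
    # Bearer tokens first, so a later key/value rule cannot leave the token behind.
    (re.compile(r"(?i)\bBearer[ \t]+[A-Za-z0-9._~+/-]+=*"), "Bearer [REDACTED]"),
    # key=value or key: value where the key looks secret-like.
    (
        re.compile(r"(?i)(password|passwd|token|secret|api[_-]?key)([ \t]*[:=][ \t]*)\S+"),
        r"\1\2[REDACTED]",
    ),
]


def redact(text: str, max_lines: int = 50) -> str:
    """Mask secret-looking values and keep at most the last `max_lines` lines."""
    lines = text.splitlines()
    omitted = max(0, len(lines) - max_lines)
    kept = lines[-max_lines:] if max_lines > 0 else []
    out = "\n".join(kept)
    for pattern, replacement in _PATTERNS:
        out = pattern.sub(replacement, out)
    if omitted:
        out = f"[{omitted} earlier lines omitted]\n" + out
    return out
```

Pattern-based masking will miss secrets that do not follow a recognizable key, so a filter like this reduces exposure without eliminating it.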

Where k8x stands in the AI-ops landscape

| Tool | Domain | Autonomy | Local-first? | Write access |
|---|---|---|---|---|
| GitHub Copilot Chat | Code / CI | Suggests fixes, runs queries in UI | No | Optional PR commits |
| Claude Code | Code & CLI automation | Plans & edits files | Yes | Yes (file edits) |
| Goose | Multi-agent dev tasks | Runs terminal commands | Yes | Yes |
| k8x (v0.1) | Kubernetes operations | Plans & executes kubectl reads | Yes | No (read-only) |
k8x fills the gap for platform and DevOps engineers looking for Copilot-level assistance after deployment, not just in CI/CD.

Getting started in 60 seconds

brew tap aihero/k8x
brew install k8x         # installs v0.1.1
k8x configure            # choose LLM & set API key
k8x -c "Are all pods running?"
You’ll never look at a 3-screen tmux layout the same way again.

Roadmap sneak-peek

  • v0.2 – Declarative fixes
    • Generate a patch plan (kubectl diff) and let humans --approve.
  • Terraform & cloud-CLI mode
    • Run terraform plan or aws eks update-kubeconfig as sub-steps.
  • Cluster runbooks as code
    • Store successful sessions as YAML recipes to auto-trigger on alerts.

Final thoughts

Generative-AI agents are moving from IDEs into production infrastructure. By combining LLM planning, policy-guarded execution, and domain-specific reasoning, k8x transforms Kubernetes troubleshooting from a scavenger hunt into a guided review. Start with read-only diagnostics today; when you’re ready, the agent will apply fixes—one audited pull request at a time.
