Troubleshooting Kubernetes Autonomously with k8x

This article was co-authored with Shankar Ganeshan.

Setting up and running a Kubernetes cluster can make DevOps and platform engineers feel like they’re navigating a maze. Take a simple service deployment: you might have to check the deployment (`kubectl get deployment`), the service (`kubectl get service`), the events (`kubectl describe pod`), the logs (`kubectl logs pod-name`), and even the Ingress rules (`kubectl get ingress`). Each step requires context and means copy-pasting names and IDs from one place into another. A single typo can leave you scratching your head. Worse, incomplete information can lead you down the wrong path and waste hours. Multi-step diagnostics, especially when things go wrong, can be overwhelming.

Why Kubernetes troubleshooting still hurts and why we built `k8x`.

Even veteran SREs know the familiar pain cycle:

| Symptom | Usual manual steps |
| --- | --- |
| Pods in `CrashLoopBackOff` | • `kubectl get pods -A` → copy failing names • `kubectl logs …` (repeat per pod & container) • grep for hints, cross-check ConfigMaps |
| 503s on an Ingress | • Check Service endpoints → kube-proxy rules → NetworkPolicy • Compare readiness probes & resource pressure |
| “`myservice` works on most nodes” | • Inspect taints/labels, daemonsets, CNI logs • Describe nodes for kernel versions & allocatable |

Each scenario is multi-step and context-heavy: you have to carry state across multiple Kubernetes commands, noting resource names and IDs from one command’s output so you can feed them into the next describe or logs call.
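
To make that bookkeeping concrete, here is a minimal Python sketch of just the first step in the first row of the table: picking out the containers stuck in `CrashLoopBackOff` from the JSON that `kubectl get pods -A -o json` returns. The function name `crash_looping` and the sample data are illustrative assumptions, not part of k8x.

```python
import json


def crash_looping(pods_json: str) -> list[tuple[str, str, str]]:
    """Return (namespace, pod, container) for each container waiting in CrashLoopBackOff."""
    found = []
    for pod in json.loads(pods_json).get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        # Init containers can crash-loop too, so look at both status lists.
        statuses = status.get("initContainerStatuses", []) + status.get("containerStatuses", [])
        for cs in statuses:
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                found.append((meta.get("namespace", ""), meta.get("name", ""), cs.get("name", "")))
    return found


# Hand-written sample in the shape of `kubectl get pods -A -o json` (hypothetical names).
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"namespace": "shop", "name": "checkout-7d9f"},
            "status": {"containerStatuses": [
                {"name": "api", "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            ]},
        },
        {
            "metadata": {"namespace": "shop", "name": "cart-5c4b"},
            "status": {"containerStatuses": [
                {"name": "api", "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}}},
            ]},
        },
    ]
})

failing = crash_looping(SAMPLE)
```

Each tuple in `failing` is exactly the context you would otherwise copy by hand into the next `kubectl logs` or `kubectl describe` call.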

The agentic leap: from suggestion to orchestrated review

Recent agents like GitHub Copilot Chat (in VS Code with terminal access), Claude Code (terminal-native edits), and Goose showed a new pattern: the LLM drives an interactive loop, executes safe commands autonomously, then narrates the findings. General-purpose LLM helpers (ChatGPT, Claude, Copilot Chat) can go beyond suggesting commands: they can copy-paste, re-run, and stitch results together. k8x applies this agentic idea to Kubernetes:

Natural-language prompts → e.g. “Find pods that aren’t ready and tell me why.”
The agent plans a sequence: kubectl get …, kubectl describe …, maybe kubectl top ….
It executes those read-only commands, parses output, and reasons about root causes.
Results appear as an explanation first, with raw command logs one keystroke away.
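
The plan, execute, and reason steps above can be sketched as a small loop. This is a simplified illustration, not k8x's actual implementation: `diagnose`, `follow_up`, and the canned outputs are hypothetical, and the rule-based `follow_up` stands in for the decision an LLM would make. In a real setup, `run` could wrap `subprocess.run(["kubectl", *cmd], capture_output=True, text=True)`.

```python
from typing import Callable

Command = list[str]                      # kubectl arguments, e.g. ["get", "pods"]
Runner = Callable[[Command], str]        # executes a command and returns its output
Planner = Callable[[Command, str], list[Command]]


def diagnose(plan: list[Command], run: Runner, follow_up: Planner,
             max_steps: int = 10) -> list[tuple[Command, str]]:
    """Run the planned commands, letting follow_up queue deeper queries."""
    queue = list(plan)
    transcript = []
    while queue and len(transcript) < max_steps:
        cmd = queue.pop(0)
        output = run(cmd)
        transcript.append((cmd, output))
        queue.extend(follow_up(cmd, output))
    return transcript


def follow_up(cmd: Command, output: str) -> list[Command]:
    """Stand-in for the LLM: describe every pod whose READY column is not n/n."""
    if cmd[:2] != ["get", "pods"]:
        return []
    extra = []
    for line in output.splitlines()[1:]:          # skip the header row
        cols = line.split()
        if len(cols) < 2:
            continue
        ready, _, total = cols[1].partition("/")
        if ready != total:
            extra.append(["describe", "pod", cols[0]] + cmd[2:])
    return extra


# Canned outputs so the sketch runs without a cluster (hypothetical data).
CANNED = {
    ("get", "pods", "-n", "shop"): (
        "NAME READY STATUS RESTARTS AGE\n"
        "checkout-7d9f 0/1 CrashLoopBackOff 12 30m\n"
        "cart-5c4b 1/1 Running 0 2d\n"
    ),
    ("describe", "pod", "checkout-7d9f", "-n", "shop"): "Last State: Terminated (Reason: Error)\n",
}

transcript = diagnose([["get", "pods", "-n", "shop"]], lambda cmd: CANNED[tuple(cmd)], follow_up)
```

The `max_steps` cap is one simple way to keep an adaptive loop from running away; the transcript doubles as the raw command log behind the explanation.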

Unlike code-centric tools, k8x is infra-native. It understands resource kinds, status fields, events, and failure taxonomies (image pull, scheduling, OOMKilled).

Design choices that matter to operators

k8x runs in your console with your current kubectl configuration and credentials, performing autonomous, multi-step workflows to detect and troubleshoot Kubernetes issues.

| Principle | Experience | Why it builds trust |
| --- | --- | --- |
| Read-only by default (v0.1) | Zero risk of deletions; mirrors commands you could type | Safely trial AI before granting mutate rights |
| Plain `kubectl` under the hood | Familiar audit trail; works anywhere your kubeconfig does | No proprietary sidecars or admission webhooks |
| Multi-LLM back-end | Select OpenAI, Claude, or Gemini at `k8x configure` | Avoid vendor lock-in; keep traffic in-house |
| Command history & undo | `k8x history list` shows past sessions | Auditors see exactly what ran; SREs replay in staging |
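
One way to picture the read-only guarantee is a verb allowlist checked before anything is executed. The sketch below is an assumption about how such a guard could look, not k8x's actual policy: `READ_ONLY_VERBS` and `is_read_only` are hypothetical names, and the check deliberately fails closed on anything it does not recognize.

```python
import shlex

# kubectl subcommands that only read cluster state (an illustrative, conservative set).
READ_ONLY_VERBS = {
    "get", "describe", "logs", "top", "explain",
    "events", "version", "api-resources", "api-versions",
}

# Characters that could chain or redirect commands if a shell were involved.
SHELL_METACHARACTERS = ";|&<>`$"


def is_read_only(command: str) -> bool:
    """Return True only for `kubectl <read-only verb> ...` with no shell tricks."""
    if any(ch in command for ch in SHELL_METACHARACTERS):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:            # unbalanced quotes and similar
        return False
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return False
    # The verb must come directly after `kubectl`; leading global flags are
    # rejected rather than parsed, so unknown shapes fail closed.
    return tokens[1] in READ_ONLY_VERBS
```

Rejecting `kubectl -n shop get pods` is a false negative, but for a trust boundary a false negative costs a re-plan while a false positive could cost a deletion.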

There’s more to come, including write permissions and parallelism. Contributions are welcome.

A day in the life with k8x

# 1️⃣ Something is off
default$ k8x -c "my checkout service is returning 502s"

# 2️⃣ Agent plan (condensed)
• Check Ingress status
• Verify Service endpoints
• Scan pod readiness & logs
• Examine recent HPA events

# 3️⃣ Summary
❌ 2/5 endpoints unhealthy
↳ Pods stuck in Init:CrashLoopBackOff (db-migrations)
↳ Migration container fails on `ALTER TABLE …` (lock timeout)
Suggested fix: run `kubectl exec` into db-migration-pod or scale replicas to 0/1 to release lock

In ~30 seconds, you get an actionable story instead of fifteen manual commands.

How a multi-step review actually works

Intent parsing - Translates English prompts into an internal diagnostic goal.
Planning - LLM selects a safe chain of read-only kubectl queries.
Adaptive execution - After each command, it decides if deeper queries are needed.
Reasoning & templated explanations - Maps results to known issue patterns for a deterministic, auditable summary.
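
The last step, mapping results to known issue patterns, can be illustrated with a small lookup table. This is a sketch under assumptions: the `Finding` type, the `TEMPLATES` wording, and `explain` are hypothetical, while the reason strings themselves (`ImagePullBackOff`, `OOMKilled`, `FailedScheduling`, and so on) are the ones Kubernetes reports in pod status and events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    pod: str
    reason: str   # a Kubernetes status or event reason


IMAGE_PULL = "{pod}: image could not be pulled; check the image name, tag, and registry credentials."

TEMPLATES = {
    "ImagePullBackOff": IMAGE_PULL,
    "ErrImagePull": IMAGE_PULL,
    "OOMKilled": "{pod}: container exceeded its memory limit; compare usage with resources.limits.memory.",
    "FailedScheduling": "{pod}: no suitable node found; check resource requests, taints, and node selectors.",
    "CrashLoopBackOff": "{pod}: container keeps exiting after start; inspect the previous container's logs.",
}

FALLBACK = "{pod}: unrecognized reason '{reason}'; needs manual review."


def explain(findings: list[Finding]) -> list[str]:
    """Render findings through fixed templates, in a stable order."""
    ordered = sorted(findings, key=lambda f: (f.pod, f.reason))
    return [TEMPLATES.get(f.reason, FALLBACK).format(pod=f.pod, reason=f.reason) for f in ordered]


summary = explain([Finding("web-2", "OOMKilled"), Finding("web-1", "ErrImagePull")])
```

Because the wording comes from fixed templates and the order is sorted, the same findings always produce the same summary, which is what makes it auditable.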

Only redacted command output reaches the LLM, never full application logs, which is another trust measure.
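
The article does not describe how the redaction is done, so the following is only a rough sketch of the idea: mask values that look like credentials before any text leaves the machine. The patterns and the `redact` function are assumptions for illustration and would need to be far more thorough in practice.

```python
import re

# (pattern, replacement) pairs; illustrative only, not an exhaustive secret scanner.
PATTERNS = [
    # KEY=value or key: value where the key name mentions a credential.
    (re.compile(r"(?i)([\w.-]*(?:password|passwd|secret|token|api[_-]?key)[\w.-]*)(\s*[=:]\s*)\S+"),
     r"\1\2[REDACTED]"),
    # HTTP bearer tokens.
    (re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
]


def redact(text: str) -> str:
    """Mask credential-looking values in command output."""
    for pattern, replacement in PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

For example, `redact("DB_PASSWORD=hunter2 replicas=3")` keeps the key and the unrelated `replicas=3` field but replaces the password value with `[REDACTED]`.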

Where k8x stands in the AI-ops landscape

| Tool | Domain | Autonomy | Local-first? | Write access |
| --- | --- | --- | --- | --- |
| GitHub Copilot Chat | Code / CI | Suggests fixes, runs queries in UI | No | Optional PR commits |
| Claude Code | Code & CLI automation | Plans & edits files | Yes | Yes (file edits) |
| Goose | Multi-agent dev tasks | Runs terminal commands | Yes | Yes |
| k8x (v0.1) | Kubernetes operations | Plans & executes `kubectl` reads | Yes | No (read-only) |

k8x fills the gap for platform and DevOps engineers looking for Copilot-level assistance after deployment, not just in CI/CD.

Getting started in 60 seconds

brew tap aihero/k8x
brew install k8x         # installs v0.1.1
k8x configure            # choose LLM & set API key
k8x -c "Are all pods running?"

You’ll never look at a 3-screen tmux layout the same way again.

Open Source

k8x is Apache 2.0-licensed and available on GitHub. We’re looking for contributors to help build out the next features, including:

v0.2 - Declarative fixes
- Generate a patch plan (kubectl diff) and let humans --approve.
Support ArgoCD and other k8s tools
- Integrate with ArgoCD for GitOps workflows.
- Use kubectl apply to update resources based on agent suggestions.
Terraform & cloud-CLI mode
- Run terraform plan or aws eks update-kubeconfig as sub-steps.
Cluster runbooks as code
- Store successful sessions as YAML recipes to auto-trigger on alerts.

Final thoughts

Generative-AI agents are moving from IDEs into production infrastructure. By combining LLM planning, policy-guarded execution, and domain-specific reasoning, k8x transforms Kubernetes troubleshooting from a scavenger hunt into a guided review. Start with read-only diagnostics today; when you’re ready, the agent will apply fixes—one audited pull request at a time.
