Last modified June 21, 2026
Troubleshoot AI agent access
This guide covers the failure modes platform teams hit when operating Muster. For the basic IDE setup checks (context, login, editor configuration), start with the AI agent setup troubleshooting section.
The agent keeps asking to authenticate
If the authenticate_muster tool keeps reappearing, or your assistant loops on login, the session has expired or the token can’t be refreshed.
- Run muster auth login in a terminal, then restart the MCP server in your IDE.
- Check muster auth status. If the aggregator shows as authenticated but a downstream server shows Auth Required, that server needs its own login or its single sign-on is set up incorrectly.
- For a forwarding server, confirm the downstream trusts Muster’s client and the requiredAudiences match what the server expects. A token without the right audience is rejected, which looks like an endless re-auth.
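To check the audience point by hand, you can inspect a token's claims directly. The following is a minimal sketch, assuming the token is a JWT that carries a standard aud claim and that every required audience must be present. The helper names and the audience values (muster, mcp-kubernetes) are hypothetical and only illustrate the comparison. The signature is not verified, so use this for debugging only.

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT without verifying the signature."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    return json.loads(base64.urlsafe_b64decode(payload))


def missing_audiences(token: str, required: list[str]) -> list[str]:
    """Return the required audiences that the token's aud claim lacks."""
    aud = jwt_claims(token).get("aud", [])
    if isinstance(aud, str):  # aud may be a single string or a list
        aud = [aud]
    return [a for a in required if a not in aud]


def fake_token(claims: dict) -> str:
    """Build an unsigned stand-in token for the demo (hypothetical helper)."""
    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    return f"e30.{body.rstrip('=')}.sig"


# Hypothetical audience values, for illustration only.
demo = fake_token({"sub": "user@example.com", "aud": "muster"})
print(missing_audiences(demo, ["muster", "mcp-kubernetes"]))
```

A non-empty result means the downstream server would see a token that lacks an audience it expects, which matches the endless re-auth symptom.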
A cluster shows as disconnected
In muster auth status, a single cluster reporting Disconnected or Failed while others are fine usually means that cluster’s mcp-kubernetes is unreachable, not a Muster problem.
- Check the MCPServer state: kubectl get mcpservers -n muster. A Failed state means the endpoint is unreachable. Verify the URL, DNS, and any network policy between Muster and that cluster.
- Auth Required isn’t a failure: it’s the normal state for a protected server until a user logs in.
- Other clusters keep working independently, so users can still operate the rest of the fleet while you fix the one.
Tools are missing from the agent
If expected tools don’t appear in the agent’s discovery results:
- Confirm the user is authenticated to the server those tools come from. Per-user visibility means a server a user hasn’t authenticated with contributes no tools to their list, even though it’s healthy.
- Check the server’s state is Connected or Running and that autoStart is set if it should always be available.
- Remember discovery is dynamic: no IDE restart is needed. If a server only just started, the agent’s next list_tools or filter_tools call reflects it.
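The visibility rules above can be modelled in a few lines. This is a simplified sketch, not Muster's implementation: the Server class, the visible_tools function, and the server and tool names are all hypothetical, and the model assumes a server contributes tools only when its state is Connected or Running and the user has authenticated to it.

```python
from dataclasses import dataclass, field

# States in which a server is assumed to serve tools (taken from the checklist above).
VISIBLE_STATES = {"Connected", "Running"}


@dataclass
class Server:
    name: str
    state: str
    tools: list[str] = field(default_factory=list)


def visible_tools(servers: list[Server], authenticated: set[str]) -> list[str]:
    """Tools a user sees: only from healthy servers the user has logged in to."""
    return [
        tool
        for server in servers
        if server.state in VISIBLE_STATES and server.name in authenticated
        for tool in server.tools
    ]


# Hypothetical fleet, for illustration only.
fleet = [
    Server("cluster-a", "Connected", ["list_pods", "get_logs"]),
    Server("cluster-b", "Connected", ["list_nodes"]),
    Server("cluster-c", "Failed", ["list_events"]),
]

# The user has logged in to cluster-a and cluster-c, but cluster-c is unreachable.
print(visible_tools(fleet, {"cluster-a", "cluster-c"}))
```

In this model, cluster-b is healthy but contributes nothing because the user never authenticated to it, which is the case that most often looks like missing tools.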
A workflow runs manually but the agent never uses it
This is the most common workflow-authoring trap, and it’s about discoverability, not correctness. The workflow is live: kubectl get workflow shows it valid, and you can call it by name. But the agent falls back to raw tools.
The cause is almost always the description. An agent finds a workflow through filter_tools, whose description filter is a case-insensitive substring match over spec.description. If the agent searches for pod health and the description only says check the pods in <mc>, there’s no substring match, and the workflow is invisible to that search.
Fix it by leading spec.description with both the topic keyword and the natural phrasing. See authoring workflows for the full rule.
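The matching rule is simple enough to reproduce when you review a description. A minimal sketch, assuming the filter behaves exactly as described above (a case-insensitive substring match); description_matches is a hypothetical helper, and the second description is an illustrative rewrite rather than a required wording.

```python
def description_matches(description: str, query: str) -> bool:
    """Case-insensitive substring match, as described for the description filter."""
    return query.lower() in description.lower()


# The original description from the example above.
weak = "check the pods in <mc>"
# An illustrative rewrite that leads with the topic keyword.
strong = "pod health: check the pods in <mc> and report any that are failing"

query = "pod health"
print(description_matches(weak, query))    # False
print(description_matches(strong, query))  # True
```

Running the searches you expect agents to make against a draft description is a quick way to catch an invisible workflow before it ships.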
A related symptom: an agent told to “discover workflows” with a broad glob can pull every workflow’s description into context at once, spiking token cost. Steer agents to a narrow filter—a specific topic term or name pattern—rather than a catch-all.
A workflow runs but the agent over-explores
If a workflow returns a clean result and the agent keeps investigating anyway, the description is missing a stop-when-healthy rule. Without an explicit rule to write a one-line summary and stop when the result is empty, an agent treats a clean digest as a starting point. Add the rule to spec.description. This single change has swung a probe by an order of magnitude in token cost.
A long chain drops with a network error
A single agent turn that runs many tool calls back to back can drop with a streaming network error before producing an answer. The HTTP response succeeds, but the stream aborts partway. Fewer, cheaper calls reduce this exposure. That’s one more reason to encapsulate multi-step investigations as a workflow rather than letting the agent run the whole discover-query-correlate loop itself.
Authentication fails for users in many groups
If authentication succeeds for one account but fails for an account that belongs to many identity-provider groups, suspect token size limits. The usual causes are ingress header buffers and the group-count limit. See RBAC and SSO.
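To see why group membership matters, estimate how a token grows with the number of groups. This is a rough sketch under stated assumptions: the token is assumed to carry group names in a groups claim, the group names are invented, and the 8 KiB limit is a placeholder, not a documented Muster or ingress default. Check your own ingress configuration for the real value.

```python
import base64
import json

# Hypothetical ingress header buffer size in bytes; replace with your real limit.
ASSUMED_HEADER_LIMIT = 8 * 1024


def bearer_header_size(group_count: int) -> int:
    """Approximate Authorization header size for a token listing group_count groups."""
    claims = {
        "sub": "user@example.com",
        "groups": [f"team-{i:04d}-example-group" for i in range(group_count)],
    }
    payload = base64.urlsafe_b64encode(json.dumps(claims).encode())
    # The JWT header and signature segments are ignored, so this is a lower bound.
    return len("Authorization: Bearer ") + len(payload)


for count in (10, 100, 400):
    size = bearer_header_size(count)
    verdict = "over limit" if size > ASSUMED_HEADER_LIMIT else "ok"
    print(f"{count} groups: about {size} bytes ({verdict})")
```

The size grows linearly with the group count, so an account in a few hundred groups can exceed a buffer that an ordinary account fits into comfortably.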
Related
- Author a workflow: discoverability and stop-when-healthy rules.
- Manage MCP servers: server states and autoStart.
- Security model: per-user visibility and sign-out operations.