AI Agents

  • What’s Changed

    Full Changelog: https://github.com/giantswarm/klausctl/compare/v0.0.24...v0.0.25

  • What’s Changed

    Full Changelog: https://github.com/giantswarm/klausctl/compare/v0.0.23...v0.0.24

  • What’s Changed

    Full Changelog: https://github.com/giantswarm/muster/compare/v0.1.2...v0.1.3

  • What’s Changed

    Full Changelog: https://github.com/giantswarm/muster/compare/v0.1.1...v0.1.2

  • What’s Changed

    Full Changelog: https://github.com/giantswarm/klaus-oci/compare/v0.0.7...v0.0.8

  • What’s Changed

    Full Changelog: https://github.com/giantswarm/muster/compare/v0.1.0...v0.1.1

  • What’s Changed

    Full Changelog: https://github.com/giantswarm/muster/compare/v0.0.237...v0.0.238

  • Changed

    • Session duration reduced from 90 days to 30 days. The refresh token TTL now matches Dex’s absoluteLifetime (720h). Previously, muster’s 90-day refresh token outlived Dex’s 30-day session, causing confusing failures when auto-refresh silently stopped working after day 30. Users who were logging in once every ~2 months will now need to re-authenticate every 30 days.
    • muster auth status now shows session expiry. Instead of Refresh: Available, the output now shows Session: ~29 days remaining (auto-refresh), giving users a concrete estimate of when re-authentication will be required.
    • Access token TTL is now explicitly set to 30 minutes (matching Dex’s idTokens expiry) instead of relying on the library default of 1 hour.
    • Session duration is now configurable via oauth.server.sessionDuration in config.yaml (default: 720h / 30 days); see the config sketch after this list.
    • Kubernetes event emission is now disabled by default (alpha feature). Use --enable-events flag on muster serve or set events: true in config.yaml to opt in.
    • Switch CI to push-to-registries-multiarch (architect-orb@6.14.0) with amd64-only on branches for faster PR feedback and full multi-arch on release tags. Chart tests now run before publishing to the app catalog.
    • Update Dockerfile to multi-stage build with native cross-compilation support for multi-architecture images.
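
    A minimal config.yaml sketch combining the options above; the sessionDuration and events keys are named in this changelog, while the surrounding structure is assumed:

    oauth:
      server:
        sessionDuration: 720h   # refresh-token TTL, matching Dex's absoluteLifetime (30 days)
    events: true                # opt in to Kubernetes event emission (alpha; disabled by default)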

    Note: The Server-Side Meta-Tools Migration below is a breaking change that will be released as part of the next major version. External integrations should prepare for this change.

    Breaking Changes

    Server-Side Meta-Tools Migration

    Meta-tools (list_tools, call_tool, describe_tool, etc.) have moved from the agent to the aggregator server. This is a fundamental architectural change.

    What Changed:

    | Component   | Before                                        | After                                             |
    |-------------|-----------------------------------------------|---------------------------------------------------|
    | Agent       | Exposed 11 meta-tools + bridged to aggregator | Transport bridge only (OAuth shim + stdio↔HTTP)   |
    | Aggregator  | Exposed 36+ core tools directly               | Exposes ONLY meta-tools - no direct tool access   |
    | Tool Access | Direct tool calls to aggregator               | All tool calls go through the call_tool meta-tool |
    What Continues Working (Transparent Migration):

    • CLI commands (muster list, muster get, etc.) - client wraps calls automatically
    • Agent REPL (muster agent --repl) - uses the same client with transparent wrapping
    • BDD test scenarios - test client wraps calls automatically
    • MCP native protocol methods (tools/list, resources/list) - not affected

    What Breaks (Requires Update):

    • External integrations calling tools directly via HTTP
    • Custom clients connecting directly to the aggregator

    Migration for External Clients:

    // Before: Direct tool call
    {"method": "tools/call", "params": {"name": "core_service_list", "arguments": {}}}
    // After: Wrap through call_tool
    {"method": "tools/call", "params": {
      "name": "call_tool",
      "arguments": {"name": "core_service_list", "arguments": {}}
    }}
    

    Benefits:

    • OAuth-capable clients can connect directly to the server without the agent
    • Simpler agent architecture (~200 lines vs ~700 lines)
    • Consistent tool visibility across all clients
    • Centralized meta-tool logic

    See ADR-010 for design details.

    Known External Integrations Affected:

    • Any HTTP clients calling the aggregator directly
    • Custom MCP clients not using muster agent
    • CI/CD pipelines with direct tool calls

    Recommended Migration Timeline:

    1. Review your integration code for direct tool calls
    2. Update to wrap calls through the call_tool meta-tool
    3. Test with the new Muster version before deploying

    Changed

    • MCPServer CRD State Exposes Auth Required - The MCPServer CRD now shows Auth Required state when a remote server returns 401 Unauthorized (#337)
      • Before: 401 response mapped to Connected (hiding auth requirement)
      • After: 401 response shows as Auth Required in CRD state
      • This gives operators clear visibility into which servers need authentication
      • CLI output updated: muster list mcpserver now shows Auth Required state
      • SESSION column values updated: OK → Authenticated, Required → Pending Auth
      • Column header renamed: AUTH → SESSION to match muster auth status output
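
      An illustrative status excerpt (the field name is assumed; the changelog specifies only the new Auth Required state):

        status:
          state: Auth Required   # a 401 response previously mapped to Connected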

    Added

    • Reconciliation Framework - Automatic synchronization between resource definitions (CRDs/YAML) and running services
      • Supports both Kubernetes mode (using controller-runtime informers) and filesystem mode (using fsnotify)
      • Auto-detects operating mode based on environment
      • Configurable per-resource-type enable/disable (see the hypothetical sketch after this list)
      • Work queue with deduplication and exponential backoff
      • Status tracking and API for observability
      • See ADR 007 for design details
    • StateChangeBridge - Real-time sync of runtime state changes to CRD status subresources
      • Subscribes to orchestrator service state changes
      • Triggers reconciliation to update CRD status when services start/stop/crash
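
    A hypothetical config.yaml excerpt for the reconciliation toggles; none of these key names are confirmed by the changelog, they only illustrate the per-resource-type enable/disable idea:

    reconciliation:
      mcpServers: true        # reconcile MCPServer definitions
      serviceClasses: true
      workflows: false        # opt a single resource type out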

    Changed

    • BREAKING: Consolidated OAuth Configuration Naming - OAuth configuration structure has been reorganized for clarity (#324)
      • Before: aggregator.oauth (client/proxy) + aggregator.oauthServer (server protection)
      • After: aggregator.oauth.mcpClient (MCP client/proxy) + aggregator.oauth.server (server protection)
      • Both OAuth roles now live under a single oauth section with explicit mcpClient/server sub-sections
      • The mcpClient name makes it clear this is for authenticating TO remote MCP servers
      • CLI flags renamed: --oauth → --oauth-mcp-client, --oauth-public-url → --oauth-mcp-client-public-url
      • Helm values updated: muster.oauth.* → muster.oauth.mcpClient.*, muster.oauthServer.* → muster.oauth.server.*
      • CIMD configuration moved to nested structure: cimdPath/cimdScopes → cimd.path/cimd.scopes
      • Migration: Update configuration files and Helm values to use the new structure (see the before/after sketch following this list)
    • BREAKING: CRD Status Field Changes - Status fields have been redesigned for session-aware tool availability
      • MCPServerStatus: Removed availableTools (session-dependent), added lastConnected and restartCount
      • ServiceClassStatus: Replaced available/requiredTools/missingTools/toolAvailability with valid/validationErrors/referencedTools
      • WorkflowStatus: Replaced available/requiredTools/missingTools/stepValidation with valid/validationErrors/referencedTools/stepCount
      • Tool availability is now computed per-session at runtime, not stored in CRs
      • Existing CRs will have stale status fields that will be updated on first reconciliation (an illustrative status sketch follows this list)
    • Added Chart annotations to support OCI repositories.
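
    A before/after sketch of the relocated OAuth sections in config.yaml; the section names follow this changelog, while the nested fields are placeholders:

    # Before
    aggregator:
      oauth:         # MCP client/proxy settings lived here
        # ...
      oauthServer:   # server-protection settings lived here
        # ...

    # After
    aggregator:
      oauth:
        mcpClient:   # settings for authenticating TO remote MCP servers
          cimd:
            path: /path/to/cimd   # was cimdPath
            scopes: ["openid"]    # was cimdScopes
        server:      # settings protecting the aggregator itself
          # ...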
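
    And an illustrative sketch of the redesigned status fields; only the field names come from this changelog, the values are examples:

    # MCPServer
    status:
      lastConnected: "2025-01-01T12:00:00Z"
      restartCount: 2
    ---
    # ServiceClass / Workflow
    status:
      valid: true
      validationErrors: []
      referencedTools: ["core_service_list"]
      stepCount: 4   # Workflow only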

    Fixed

    • Helm CiliumNetworkPolicy: Fixed incorrect values path for OAuth storage check (now uses .Values.muster.oauth.server.storage)

    Added

    • Remote MCP Server Support for Kubernetes Environments
      • Added comprehensive support for stdio, streamable-http and sse transport protocols
      • Enhanced CRD Schema: Updated MCPServerSpec to support all MCP server types
        • Added new config for streamable-http and sse: url, headers and timeout fields
        • Added mutual exclusion validation and required field validation using kubebuilder annotations
      • New CLI Commands: Added subcommands to use new type system
        • muster create mcpserver <name> --type stdio for local MCP servers
        • muster create mcpserver <name> --type streamable-http for HTTP remote servers
        • muster create mcpserver <name> --type sse for SSE remote servers
      • Updated Examples: Enhanced example files to demonstrate both local and remote configurations
      • Kubernetes Deployment Ready: Enables deployment patterns where Muster aggregator runs in cluster and connects to MCP servers deployed as separate Kubernetes services
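
      A sketch of a remote server definition using the new fields; type, url, headers, and timeout come from this changelog, while the apiVersion and the exact schema are assumptions:

        # e.g. created via: muster create mcpserver grafana --type streamable-http
        apiVersion: muster.giantswarm.io/v1alpha1   # hypothetical group/version
        kind: MCPServer
        metadata:
          name: grafana
        spec:
          type: streamable-http
          url: http://grafana-mcp.tools.svc.cluster.local:8080/mcp
          headers:
            Authorization: "Bearer <token>"
          timeout: 30s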
    • Systemd Socket Activation Support
      • Added muster.socket unit file for socket-activated systemd deployment
      • Modified muster.service to use socket activation on localhost:8090
      • Updated scripts/setup-systemd.sh and scripts/dev-restart.sh to handle socket activation
      • Added new dependency github.com/coreos/go-systemd to handle socket activation
    • Service Health Monitoring
      • Added health checks for MCP servers using the tools/list JSON-RPC method
      • Added health checks for port forwards by testing TCP connectivity
      • Health checks run every 30 seconds for all running services
      • Health status is reported through the StateStore and displayed in the TUI
      • Created ServiceHealthChecker interface for extensible health checking
    • Improved State Reconciliation
      • Implemented proper ReconcileState() method that syncs TUI state with StateStore
      • Updates service statuses, ports, PIDs, and error states from centralized store
      • Synchronizes cluster health information from K8sStateManager
      • Ensures UI consistency after startup and state changes
    • K8s Connections as Services
      • Kubernetes connections are now modeled as services in the dependency graph
      • K8s connection health monitoring is now handled by dedicated K8s connection services
      • Unified service management architecture - all services (K8s, port forwards, MCPs) follow the same lifecycle
      • K8s connections can be stopped/restarted like other services with proper cascade handling
    • Cascading stop functionality: stopping a service automatically stops all dependent services
    • K8s connection health monitoring with automatic service lifecycle management
    • Port forwards now depend on their kubernetes context being authenticated and healthy
    • The kubernetes MCP server depends on the management cluster connection
    • When k8s connections become unhealthy, dependent services are automatically stopped
    • Manual stop (x key) now uses cascading stop to cleanly shut down dependent services
    • New StartServicesDependingOn method in ServiceManager to restart services when dependencies recover
    • New orchestrator package that manages application state and service lifecycle for both TUI and non-TUI modes
    • New HealthStatusUpdate and ReportHealth for proper health status reporting
    • Health-aware startup: Services now wait for their K8s dependencies to be healthy before starting
    • Add comprehensive dependency management system for services
      • Services now track why they were stopped (manual vs dependency cascade)
      • Automatically restart services when their dependencies recover
      • Ensure correct startup order based on dependency graph
      • Prevent manually stopped services from auto-restarting
    • Phase 1 of Issue #45: Message Handling Architecture Improvements
      • Added correlation ID support to ManagedServiceUpdate for tracing related messages and cascading effects
      • Implemented configurable buffer strategies for TUI message channels:
        • BufferActionDrop: Drop messages when buffer is full
        • BufferActionBlock: Block until space is available
        • BufferActionEvictOldest: Remove oldest message to make room for new ones
      • Added priority-based buffer strategies to handle different message types differently
      • Introduced BufferedChannel with metrics tracking (messages sent, dropped, blocked, evicted)
      • Enhanced orchestrator with correlation tracking for health checks and cascading operations
      • Updated service manager to use new correlation ID system for better debugging
      • Added comprehensive test coverage for buffer strategies and correlation tracking
    • Phase 2 of Issue #45: State Consolidation
      • Implemented centralized StateStore as single source of truth for all service states
      • Added ServiceStateSnapshot for complete state information with correlation tracking
      • Introduced state change subscriptions with StateSubscription for reactive updates
      • Enhanced ServiceReporter interface with GetStateStore() method for direct state access
      • Updated TUIReporter and ConsoleReporter to use centralized state management
      • Migrated ServiceManager from local state tracking to centralized StateStore
      • Added comprehensive metrics tracking for state changes and subscription performance
      • Implemented state change event system with old/new state tracking
      • Added support for filtering services by type and state
      • Maintained full backwards compatibility while eliminating state duplication
    • Phase 3 of Issue #45: Structured Event System
      • Implemented comprehensive event hierarchy with semantic event types:
        • ServiceStateEvent for service lifecycle changes with old/new state tracking
        • HealthEvent for cluster health status updates
        • DependencyEvent for cascade start/stop operations
        • UserActionEvent for user-initiated actions
        • SystemEvent for system-level operations
      • Added EventBus interface with publish/subscribe functionality
      • Implemented flexible event filtering system with composable filters:
        • Filter by event type, source, severity, correlation ID
        • Combine filters with AND/OR logic for complex subscriptions
      • Created EventBusAdapter for backwards compatibility with existing ServiceReporter interface
      • Added comprehensive event metrics tracking (published, delivered, dropped events)
      • Implemented both handler-based and channel-based event subscriptions
      • Added event severity levels (trace, debug, info, warn, error, fatal) for better categorization
      • Enhanced correlation tracking with event metadata support
      • Provided thread-safe concurrent event publishing and subscription management
      • Added extensive test coverage for all event types and bus functionality
    • Phase 4 of Issue #45: Testing and Polish
      • Added comprehensive integration tests covering end-to-end event flows
      • Implemented performance monitoring utilities with PerformanceMonitor and metrics tracking
      • Created event batching system with EventBatchProcessor for high-volume scenarios
      • Built OptimizedEventBus with configurable performance optimizations
      • Added object pooling system with EventPoolManager to reduce GC pressure
      • Implemented extensive error recovery testing including panic handling
      • Added memory usage monitoring and subscription cleanup verification
      • Created comprehensive documentation covering architecture, usage, and best practices
      • Fixed race conditions in event bus concurrent access patterns
      • Enhanced thread safety across all components with proper synchronization
      • Provided migration guides and troubleshooting documentation
      • Achieved high test coverage with robust integration and unit tests
    • Improved Dependency Management for Service Restarts
      • When restarting a service, its dependencies are now automatically restarted if they’re not active
      • This ensures services always have their requirements satisfied (e.g., restarting Grafana MCP will also restart its port forward if needed)
      • Dependencies are restarted regardless of their stop reason to guarantee service requirements
      • Clear manual stop reason when restarting a service to allow proper dependency management
    • Implemented Issue #46: Improved State Management Between TUI and Orchestrator
      • Phase 1: Unified State Management
        • Added helper methods to TUI Model to use StateStore as single source of truth
        • Implemented state reconciliation on TUI startup to ensure consistency
        • Updated TUI controller to use StateStore instead of directly updating model maps
        • Eliminated state duplication between TUI Model and StateStore
      • Phase 2: Message Sequencing
        • Added sequence numbers to ManagedServiceUpdate for proper message ordering
        • Implemented MessageBuffer for handling out-of-order messages
        • Added global sequence counter with atomic operations for thread safety
      • Phase 3: Enhanced Correlation Tracking
        • Added CascadeInfo type for tracking cascade relationships between services
        • Added StateTransition type for tracking state changes with full context
        • Enhanced StateStore to record state transitions and cascade operations automatically
        • Updated orchestrator to record cascade operations for better observability
      • Phase 4: Improved Error Handling
        • Added retry logic for critical updates that are dropped due to buffer overflow
        • Implemented BackpressureNotificationMsg for user notifications about dropped messages
        • Added configurable retry attempts with exponential backoff
        • Enhanced TUIReporter with retry queue processing and user feedback
    • Comprehensive Documentation Suite
      • Added Architecture Overview documenting system design, components, and principles
      • Created Quick Start Guide for new users to get up and running quickly
      • Added Troubleshooting Guide with common issues and solutions
      • Enhanced development documentation with recent architectural improvements
      • Documented dependency management, state management, and message flow in detail
    • Configurable Namespace for CR Discovery
      • Added namespace configuration option to config.yaml for Kubernetes CR discovery
      • Allows specifying which namespace to use for MCPServer, ServiceClass, and Workflow resources
      • Defaults to "default" when not specified
      • Enables muster to work properly in multi-namespace Kubernetes environments
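
      A config.yaml excerpt for the namespace option; the top-level key name is an assumption, the behavior described is from this changelog:

        # namespace used to discover MCPServer, ServiceClass, and Workflow CRs
        namespace: tooling   # defaults to "default" when unset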

    Changed

    • Aggregator Config
      • Drop the “Enabled” field (always enabled in modes where it’s used)
    • Service Manager Refactoring
      • ServiceManager now accepts an optional KubeManager parameter for K8s connection services
      • Added support for K8s connection services in the service lifecycle management
      • Improved service stop handling to report “Stopping” state before closing channels
    • Orchestrator Improvements
      • Removed old health monitoring methods in favor of K8s connection services
      • Updated dependency graph to use service labels for K8s connections (e.g., “k8s-mc-mymc” instead of “k8s:context-name”)
      • Improved service restart logic to properly handle dependencies
    • Dependency graph now includes K8sConnection nodes as fundamental dependencies
    • Service manager’s StopServiceWithDependents method handles cascading stops
    • Health check failures trigger automatic cleanup of dependent services
    • Non-TUI mode now uses the orchestrator for health monitoring and dependency management
    • TUI mode no longer performs its own health checks - the orchestrator handles all health monitoring and the TUI only displays results
    • Proper separation of concerns: orchestrator manages health checks and service lifecycle, TUI only displays status
    • Orchestrator now performs initial health check before starting services
    • Refactored TUI message handling system
      • Introduced specialized controller/dispatcher for better separation of concerns
      • Controllers now focus on single responsibilities
      • Better error handling and logging throughout the message flow
    • Improved startup behavior - the UI now shows loading state until all clusters are fully loaded
    • Port forwards no longer start before K8s health checks pass - orchestrator now checks K8s health before starting dependent services
    • ManagedServiceUpdate now includes CorrelationID, CausedBy, and ParentID fields for tracing
    • TUIReporter now uses configurable buffered channels instead of simple channels
    • Service state updates now include correlation information in logs
    • Orchestrator operations (stop/restart) now generate and track correlation IDs
    • Removed unused DependsOnServices field from MCPServerDefinition - MCP servers never depend on other MCP servers
    • Enhanced RestartService to use the new startServiceWithDependencies method for dependency-aware restarts
    • Updated handleServiceStateUpdate to properly restart services with their dependencies
    • Improved Service Monitoring
      • Fixed monitorAndStartServices to respect StopReasonDependency - services stopped due to dependency failure won’t be restarted until their dependencies are restored
      • Added automatic restart of dependent services when a dependency becomes healthy again
      • Added 1-second delay before restarting services to ensure ports are properly released

    Fixed

    • Exit CLI on standalone server failure
      • When the mcp-aggregator service (server) fails, the CLI now terminates gracefully
    • Port Forwarding State Issue
      • Fixed issue where port forwarding services would get stuck in “Stopping” state
      • ServiceManager now properly reports the “Stopping” state before closing the stop channel
      • Port forwarding processes correctly transition to “Stopped” state
    • Code Cleanup
      • Removed commented-out mcpServerProcess struct that was marked for deletion
      • Removed duplicate updatePortForwardFromSnapshot and updateMcpServerFromSnapshot methods
      • Cleaned up unused code and improved code organization
    • Dependency-Related Fixes
      • Fixed issue where MCP servers would restart even when their port forward dependencies were stopped
      • Services with StopReasonDependency now properly wait for their dependencies to be restored
      • When a service becomes healthy, its dependent services that were stopped due to dependency failure are automatically restarted
      • Fixed “address already in use” errors by adding proper restart delay
    • Fixed spurious error logs when stopping MCP servers
      • Suppressed expected “file already closed” errors that occurred when stopping MCP server processes
      • Added proper error handling for both stdout and stderr pipe closures during shutdown
      • These were harmless errors but created unnecessary noise in the logs
    • Fixed cascade stops not triggering when K8s connections fail
      • When a K8s connection transitions to Failed state (e.g., due to network issues), all dependent services (port forwards and MCP servers) are now properly stopped
      • This prevents orphaned services from continuing to run when their underlying K8s connection is no longer healthy
      • Services will automatically restart when the K8s connection recovers
    • Set the config directory early to avoid bugs when handling an empty string (this change should fix those as well)

    Documentation

    • Added comprehensive documentation about dependency graph implementation
    • Enhanced dependency management documentation with detailed examples
    • Added explanation of dependency rules and startup/restart behavior
    • Documented the relationship between stop reasons and automatic recovery
    • Created comprehensive architecture documentation covering all major components
    • Added troubleshooting guide with detailed debugging techniques
    • Created quick start guide for new users
    • Updated development guide with recent architectural improvements
    • Documented the entire dependency management system with visual diagrams
    • Updated outdated documentation sections
      • Removed obsolete “Package Design for Shared Core Logic” section from development.md
      • Updated development.md to reference the unified service architecture
      • Fixed test examples in development.md to match current implementation
      • Updated README.md prerequisites to remove mcp-proxy requirement
      • Clarified non-TUI mode behavior in README.md
      • Rewrote the MCP Integration Notes in README.md to reflect the YAML configuration system

    Technical Details

    • New helper functions: NewManagedServiceUpdate(), WithCause(), WithError(), WithServiceData()
    • New types: BufferStrategy, BufferedChannel, ChannelMetrics, ChannelStats
    • Backwards compatibility maintained for existing interfaces
    • All existing tests updated and new comprehensive test suite added
  • What’s Changed

    Full Changelog: https://github.com/giantswarm/muster/compare/v0.0.236...v0.0.237

  • What’s Changed

    Full Changelog: https://github.com/giantswarm/muster/compare/v0.0.235...v0.0.236