muster release v0.1.0

Changed

  • Session duration reduced from 90 days to 30 days. The refresh token TTL now matches Dex’s absoluteLifetime (720h). Previously, muster’s 90-day refresh token outlived Dex’s 30-day session, causing confusing failures when auto-refresh silently stopped working after day 30. Users who were logging in once every ~2 months will now need to re-authenticate every 30 days.
  • muster auth status now shows session expiry. Instead of Refresh: Available, the output now shows Session: ~29 days remaining (auto-refresh), giving users a concrete estimate of when re-authentication will be required.
  • Access token TTL is now explicitly set to 30 minutes (matching Dex’s idTokens expiry) instead of relying on the library default of 1 hour.
  • Session duration is now configurable via oauth.server.sessionDuration in config.yaml (default: 720h / 30 days).
  • Kubernetes event emission is now disabled by default (alpha feature). To opt in, pass the --enable-events flag to muster serve or set events: true in config.yaml.
  • Switched CI to push-to-registries-multiarch (architect-orb@6.14.0): branch builds are amd64-only for faster PR feedback, and release tags build the full multi-arch set. Chart tests now run before publishing to the app catalog.
  • Updated the Dockerfile to a multi-stage build with native cross-compilation support for multi-architecture images.
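
To make the session-expiry arithmetic above concrete, here is a minimal Go sketch of how a Session line could be derived from the configured session duration. The function name sessionSummary, the rule of rounding down to whole days, and the wording of the expired message are assumptions for this example, not muster's actual implementation. Only the 720h default and the output format come from these notes.

```go
package main

import (
	"fmt"
	"time"
)

// sessionSummary renders a status line in the style of the new
// `muster auth status` output. age is the time since login and
// sessionDuration is the configured oauth.server.sessionDuration.
// The name, the round-down rule, and the "expired" text are assumptions.
func sessionSummary(age, sessionDuration time.Duration) string {
	remaining := sessionDuration - age
	if remaining <= 0 {
		return "Session: expired (re-authentication required)"
	}
	days := int(remaining.Hours() / 24) // whole days, rounded down
	return fmt.Sprintf("Session: ~%d days remaining (auto-refresh)", days)
}

func main() {
	// 720h is the documented default (30 days). time.ParseDuration has no
	// "d" unit, so hours are the largest unit that can be written.
	ttl, err := time.ParseDuration("720h")
	if err != nil {
		panic(err)
	}
	// Twelve hours after login: 708h remain, which rounds down to 29 days.
	fmt.Println(sessionSummary(12*time.Hour, ttl))
	// prints: Session: ~29 days remaining (auto-refresh)
}
```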

Note: The Server-Side Meta-Tools Migration below is a breaking change that will be released as part of the next major version. External integrations should prepare for this change.

Breaking Changes

Server-Side Meta-Tools Migration

Meta-tools (list_tools, call_tool, describe_tool, etc.) have moved from the agent to the aggregator server. This is a fundamental architectural change.

What Changed:

Component   | Before                                        | After
------------+-----------------------------------------------+------------------------------------------------
Agent       | Exposed 11 meta-tools + bridged to aggregator | Transport bridge only (OAuth shim + stdio↔HTTP)
Aggregator  | Exposed 36+ core tools directly               | Exposes ONLY meta-tools - no direct tool access
Tool Access | Direct tool calls to aggregator               | All tool calls go through call_tool meta-tool

What Continues Working (Transparent Migration):

  • CLI commands (muster list, muster get, etc.) - client wraps calls automatically
  • Agent REPL (muster agent --repl) - uses same client with transparent wrapping
  • BDD test scenarios - test client wraps calls automatically
  • MCP native protocol methods (tools/list, resources/list) - not affected

What Breaks (Requires Update):

  • External integrations calling tools directly via HTTP
  • Custom clients connecting directly to aggregator

Migration for External Clients:

// Before: Direct tool call
{"method": "tools/call", "params": {"name": "core_service_list", "arguments": {}}}
// After: Wrap through call_tool
{"method": "tools/call", "params": {
  "name": "call_tool",
  "arguments": {"name": "core_service_list", "arguments": {}}
}}
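
For clients written in Go, the wrapping shown above could be done with a small helper along these lines. This is a sketch only: toolCallParams and wrapInCallTool are hypothetical names introduced for illustration and are not part of muster's client API. The request shape is taken from the before/after JSON above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// toolCallParams mirrors the "params" object of a tools/call request
// as shown in the migration example. The type name is an assumption.
type toolCallParams struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments"`
}

// wrapInCallTool (hypothetical helper) nests a direct tool call inside
// the call_tool meta-tool, so the original name and arguments become
// the arguments of call_tool.
func wrapInCallTool(direct toolCallParams) toolCallParams {
	args := direct.Arguments
	if args == nil {
		args = map[string]any{} // serialize as {} rather than null
	}
	return toolCallParams{
		Name: "call_tool",
		Arguments: map[string]any{
			"name":      direct.Name,
			"arguments": args,
		},
	}
}

func main() {
	wrapped := wrapInCallTool(toolCallParams{Name: "core_service_list"})
	body, err := json.Marshal(map[string]any{
		"method": "tools/call",
		"params": wrapped,
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Keeping the wrapping in one function means an integration can switch every call site at once when it moves to the version that contains this change.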

Benefits:

  • OAuth-capable clients can connect directly to server without agent
  • Simpler agent architecture (~200 lines vs ~700 lines)
  • Consistent tool visibility across all clients
  • Centralized meta-tool logic

See ADR-010 for design details.

Known External Integrations Affected:

  • Any HTTP clients calling the aggregator directly
  • Custom MCP clients not using muster agent
  • CI/CD pipelines with direct tool calls

Recommended Migration Timeline:

  1. Review your integration code for direct tool calls
  2. Update to wrap calls through call_tool meta-tool
  3. Test with the new Muster version before deploying

Changed

  • MCPServer CRD State Exposes Auth Required - The MCPServer CRD now shows Auth Required state when a remote server returns 401 Unauthorized (#337)
    • Before: 401 response mapped to Connected (hiding auth requirement)
    • After: 401 response shows as Auth Required in CRD state
    • This gives operators clear visibility into which servers need authentication
    • CLI output updated: muster list mcpserver now shows Auth Required state
    • SESSION column values updated: OK → Authenticated, Required → Pending Auth
    • Column header renamed: AUTH → SESSION to match muster auth status output
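
The state change described above can be pictured as a small mapping from the HTTP status of a remote server to the state string stored in the CRD. The Go sketch below is illustrative only: stateForStatus is a hypothetical function, and the "Failed" fallback for other status codes is an assumption, since these notes only describe the 401 and Connected cases.

```go
package main

import (
	"fmt"
	"net/http"
)

// stateForStatus (hypothetical) maps the HTTP status returned by a remote
// MCP server to a CRD state string. Before this change a 401 was reported
// as "Connected"; it is now reported as "Auth Required".
func stateForStatus(statusCode int) string {
	switch {
	case statusCode == http.StatusUnauthorized:
		return "Auth Required"
	case statusCode >= 200 && statusCode < 300:
		return "Connected"
	default:
		return "Failed" // assumed fallback, not described in these notes
	}
}

func main() {
	for _, code := range []int{http.StatusOK, http.StatusUnauthorized} {
		fmt.Printf("%d -> %s\n", code, stateForStatus(code))
	}
}
```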

Added

  • Reconciliation Framework - Automatic synchronization between resource definitions (CRDs/YAML) and running services
    • Supports both Kubernetes mode (using controller-runtime informers) and filesystem mode (using fsnotify)
    • Auto-detects operating mode based on environment
    • Configurable per-resource-type enable/disable
    • Work queue with deduplication and exponential backoff
    • Status tracking and API for observability
    • See ADR 007 for design details
  • StateChangeBridge - Real-time sync of runtime state changes to CRD status subresources
    • Subscribes to orchestrator service state changes
    • Triggers reconciliation to update CRD status when services start/stop/crash
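
To illustrate the "work queue with deduplication and exponential backoff" mentioned under the Reconciliation Framework, here is a deliberately small, single-goroutine Go sketch. It is not muster's implementation, and in Kubernetes mode a production queue would more likely come from an existing library. The type workQueue, its methods, and the resource key format are all assumptions for this example.

```go
package main

import (
	"fmt"
	"time"
)

// workQueue is a toy reconcile queue: a key is queued at most once at a
// time, and repeated failures double the retry delay up to a cap.
// It is not safe for concurrent use.
type workQueue struct {
	pending  []string
	queued   map[string]bool
	failures map[string]int
	base     time.Duration
	max      time.Duration
}

func newWorkQueue(base, max time.Duration) *workQueue {
	return &workQueue{
		queued:   map[string]bool{},
		failures: map[string]int{},
		base:     base,
		max:      max,
	}
}

// Add enqueues key unless it is already waiting (deduplication).
// It reports whether the key was actually added.
func (q *workQueue) Add(key string) bool {
	if q.queued[key] {
		return false
	}
	q.queued[key] = true
	q.pending = append(q.pending, key)
	return true
}

// Next pops the oldest waiting key, if any.
func (q *workQueue) Next() (string, bool) {
	if len(q.pending) == 0 {
		return "", false
	}
	key := q.pending[0]
	q.pending = q.pending[1:]
	delete(q.queued, key)
	return key, true
}

// Backoff records a failed attempt and returns the retry delay:
// base, 2*base, 4*base, ... capped at max.
func (q *workQueue) Backoff(key string) time.Duration {
	q.failures[key]++
	delay := q.base
	for i := 1; i < q.failures[key] && delay < q.max; i++ {
		delay *= 2
	}
	if delay > q.max {
		delay = q.max
	}
	return delay
}

// Forget clears the failure count after a successful reconcile.
func (q *workQueue) Forget(key string) { delete(q.failures, key) }

func main() {
	q := newWorkQueue(time.Second, 30*time.Second)
	fmt.Println(q.Add("MCPServer/default/example")) // true
	fmt.Println(q.Add("MCPServer/default/example")) // false: already queued
	key, _ := q.Next()
	for i := 0; i < 3; i++ {
		fmt.Println(key, "retry in", q.Backoff(key)) // 1s, 2s, 4s
	}
}
```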

Changed

  • BREAKING: Consolidated OAuth Configuration Naming - OAuth configuration structure has been reorganized for clarity (#324)
    • Before: aggregator.oauth (client/proxy) + aggregator.oauthServer (server protection)
    • After: aggregator.oauth.mcpClient (MCP client/proxy) + aggregator.oauth.server (server protection)
    • Both OAuth roles now live under a single oauth section with explicit mcpClient/server sub-sections
    • The mcpClient name makes it clear this is for authenticating TO remote MCP servers
    • CLI flags renamed: --oauth → --oauth-mcp-client, --oauth-public-url → --oauth-mcp-client-public-url
    • Helm values updated: muster.oauth.* → muster.oauth.mcpClient.*, muster.oauthServer.* → muster.oauth.server.*
    • CIMD configuration moved to nested structure: cimdPath/cimdScopes → cimd.path/cimd.scopes
    • Migration: Update configuration files and Helm values to use the new structure
  • BREAKING: CRD Status Field Changes - Status fields have been redesigned for session-aware tool availability
    • MCPServerStatus: Removed availableTools (session-dependent), added lastConnected and restartCount
    • ServiceClassStatus: Replaced available/requiredTools/missingTools/toolAvailability with valid/validationErrors/referencedTools
    • WorkflowStatus: Replaced available/requiredTools/missingTools/stepValidation with valid/validationErrors/referencedTools/stepCount
    • Tool availability is now computed per-session at runtime, not stored in CRs
    • Existing CRs will have stale status fields that will be updated on first reconciliation
  • Added Chart annotations to support OCI repositories.
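
When migrating Helm values for the OAuth rename above, the order in which the two prefixes are rewritten matters, because the new server prefix (muster.oauth.server.) itself begins with the old client prefix (muster.oauth.). The Go sketch below shows one way to handle that. migrateHelmKey is a hypothetical helper, the example key publicUrl is made up for illustration, and the CIMD key moves are not covered here.

```go
package main

import (
	"fmt"
	"strings"
)

// migrateHelmKey (hypothetical helper) rewrites a dotted Helm value path
// from the old OAuth layout to the new one. Keys that already use the new
// layout are returned unchanged, so the function is safe to run twice.
// Limitation: an old key literally named muster.oauth.server.* or
// muster.oauth.mcpClient.* would be treated as already migrated.
func migrateHelmKey(key string) string {
	const (
		oldServer = "muster.oauthServer."
		oldClient = "muster.oauth."
		newServer = "muster.oauth.server."
		newClient = "muster.oauth.mcpClient."
	)
	switch {
	case strings.HasPrefix(key, oldServer):
		return newServer + strings.TrimPrefix(key, oldServer)
	case strings.HasPrefix(key, newServer), strings.HasPrefix(key, newClient):
		return key // already in the new layout
	case strings.HasPrefix(key, oldClient):
		return newClient + strings.TrimPrefix(key, oldClient)
	default:
		return key // unrelated value
	}
}

func main() {
	for _, key := range []string{
		"muster.oauthServer.storage",
		"muster.oauth.publicUrl", // made-up key name, for illustration only
		"muster.oauth.server.storage",
	} {
		fmt.Println(key, "=>", migrateHelmKey(key))
	}
}
```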

Fixed

  • Helm CiliumNetworkPolicy: Fixed incorrect values path for OAuth storage check (now uses .Values.muster.oauth.server.storage)

Added

  • Remote MCP Server Support for Kubernetes Environments
    • Added comprehensive support for stdio, streamable-http and sse transport protocols
    • Enhanced CRD Schema: Updated MCPServerSpec to support all MCP server types
      • Added new config for streamable-http and sse: url, headers and timeout fields
      • Added mutual exclusion validation and required field validation using kubebuilder annotations
    • New CLI Commands: Added subcommands to use new type system
      • muster create mcpserver <name> --type stdio for local MCP servers
      • muster create mcpserver <name> --type streamable-http for HTTP remote servers
      • muster create mcpserver <name> --type sse for SSE remote servers
    • Updated Examples: Enhanced example files to demonstrate both local and remote configurations
    • Kubernetes Deployment Ready: Enables deployment patterns where Muster aggregator runs in cluster and connects to MCP servers deployed as separate Kubernetes services
  • Systemd Socket Activation Support
    • Added muster.socket unit file for socket-activated systemd deployment
    • Modified muster.service to use socket activation on localhost:8090
    • Updated scripts/setup-systemd.sh and scripts/dev-restart.sh to handle socket activation
    • Added the github.com/coreos/go-systemd dependency to handle socket activation
  • Service Health Monitoring
    • Added health checks for MCP servers using the tools/list JSON-RPC method
    • Added health checks for port forwards by testing TCP connectivity
    • Health checks run every 30 seconds for all running services
    • Health status is reported through the StateStore and displayed in the TUI
    • Created ServiceHealthChecker interface for extensible health checking
  • Improved State Reconciliation
    • Implemented proper ReconcileState() method that syncs TUI state with StateStore
    • Updates service statuses, ports, PIDs, and error states from centralized store
    • Synchronizes cluster health information from K8sStateManager
    • Ensures UI consistency after startup and state changes
  • K8s Connections as Services
    • Kubernetes connections are now modeled as services in the dependency graph
    • K8s connection health monitoring is now handled by dedicated K8s connection services
    • Unified service management architecture - all services (K8s, port forwards, MCPs) follow the same lifecycle
    • K8s connections can be stopped/restarted like other services with proper cascade handling
  • Cascading stop functionality: stopping a service automatically stops all dependent services
  • K8s connection health monitoring with automatic service lifecycle management
  • Port forwards now depend on their kubernetes context being authenticated and healthy
  • The kubernetes MCP server depends on the management cluster connection
  • When k8s connections become unhealthy, dependent services are automatically stopped
  • Manual stop (x key) now uses cascading stop to cleanly shut down dependent services
  • New StartServicesDependingOn method in ServiceManager to restart services when dependencies recover
  • New orchestrator package that manages application state and service lifecycle for both TUI and non-TUI modes
  • New HealthStatusUpdate and ReportHealth for proper health status reporting
  • Health-aware startup: Services now wait for their K8s dependencies to be healthy before starting
  • Added a comprehensive dependency management system for services
    • Services now track why they were stopped (manual vs dependency cascade)
    • Automatically restart services when their dependencies recover
    • Ensure correct startup order based on dependency graph
    • Prevent manually stopped services from auto-restarting
  • Phase 1 of Issue #45: Message Handling Architecture Improvements
    • Added correlation ID support to ManagedServiceUpdate for tracing related messages and cascading effects
    • Implemented configurable buffer strategies for TUI message channels:
      • BufferActionDrop: Drop messages when buffer is full
      • BufferActionBlock: Block until space is available
      • BufferActionEvictOldest: Remove oldest message to make room for new ones
    • Added priority-based buffer strategies to handle different message types differently
    • Introduced BufferedChannel with metrics tracking (messages sent, dropped, blocked, evicted)
    • Enhanced orchestrator with correlation tracking for health checks and cascading operations
    • Updated service manager to use new correlation ID system for better debugging
    • Added comprehensive test coverage for buffer strategies and correlation tracking
  • Phase 2 of Issue #45: State Consolidation
    • Implemented centralized StateStore as single source of truth for all service states
    • Added ServiceStateSnapshot for complete state information with correlation tracking
    • Introduced state change subscriptions with StateSubscription for reactive updates
    • Enhanced ServiceReporter interface with GetStateStore() method for direct state access
    • Updated TUIReporter and ConsoleReporter to use centralized state management
    • Migrated ServiceManager from local state tracking to centralized StateStore
    • Added comprehensive metrics tracking for state changes and subscription performance
    • Implemented state change event system with old/new state tracking
    • Added support for filtering services by type and state
    • Maintained full backwards compatibility while eliminating state duplication
  • Phase 3 of Issue #45: Structured Event System
    • Implemented comprehensive event hierarchy with semantic event types:
      • ServiceStateEvent for service lifecycle changes with old/new state tracking
      • HealthEvent for cluster health status updates
      • DependencyEvent for cascade start/stop operations
      • UserActionEvent for user-initiated actions
      • SystemEvent for system-level operations
    • Added EventBus interface with publish/subscribe functionality
    • Implemented flexible event filtering system with composable filters:
      • Filter by event type, source, severity, correlation ID
      • Combine filters with AND/OR logic for complex subscriptions
    • Created EventBusAdapter for backwards compatibility with existing ServiceReporter interface
    • Added comprehensive event metrics tracking (published, delivered, dropped events)
    • Implemented both handler-based and channel-based event subscriptions
    • Added event severity levels (trace, debug, info, warn, error, fatal) for better categorization
    • Enhanced correlation tracking with event metadata support
    • Provided thread-safe concurrent event publishing and subscription management
    • Added extensive test coverage for all event types and bus functionality
  • Phase 4 of Issue #45: Testing and Polish
    • Added comprehensive integration tests covering end-to-end event flows
    • Implemented performance monitoring utilities with PerformanceMonitor and metrics tracking
    • Created event batching system with EventBatchProcessor for high-volume scenarios
    • Built OptimizedEventBus with configurable performance optimizations
    • Added object pooling system with EventPoolManager to reduce GC pressure
    • Implemented extensive error recovery testing including panic handling
    • Added memory usage monitoring and subscription cleanup verification
    • Created comprehensive documentation covering architecture, usage, and best practices
    • Fixed race conditions in event bus concurrent access patterns
    • Enhanced thread safety across all components with proper synchronization
    • Provided migration guides and troubleshooting documentation
    • Achieved high test coverage with robust integration and unit tests
  • Improved Dependency Management for Service Restarts
    • When restarting a service, its dependencies are now automatically restarted if they’re not active
    • This ensures services always have their requirements satisfied (e.g., restarting Grafana MCP will also restart its port forward if needed)
    • Dependencies are restarted regardless of their stop reason to guarantee service requirements
    • Clear manual stop reason when restarting a service to allow proper dependency management
  • Implemented Issue #46: Improved State Management Between TUI and Orchestrator
    • Phase 1: Unified State Management
      • Added helper methods to TUI Model to use StateStore as single source of truth
      • Implemented state reconciliation on TUI startup to ensure consistency
      • Updated TUI controller to use StateStore instead of directly updating model maps
      • Eliminated state duplication between TUI Model and StateStore
    • Phase 2: Message Sequencing
      • Added sequence numbers to ManagedServiceUpdate for proper message ordering
      • Implemented MessageBuffer for handling out-of-order messages
      • Added global sequence counter with atomic operations for thread safety
    • Phase 3: Enhanced Correlation Tracking
      • Added CascadeInfo type for tracking cascade relationships between services
      • Added StateTransition type for tracking state changes with full context
      • Enhanced StateStore to record state transitions and cascade operations automatically
      • Updated orchestrator to record cascade operations for better observability
    • Phase 4: Improved Error Handling
      • Added retry logic for critical updates that are dropped due to buffer overflow
      • Implemented BackpressureNotificationMsg for user notifications about dropped messages
      • Added configurable retry attempts with exponential backoff
      • Enhanced TUIReporter with retry queue processing and user feedback
  • Comprehensive Documentation Suite
    • Added Architecture Overview documenting system design, components, and principles
    • Created Quick Start Guide for new users to get up and running quickly
    • Added Troubleshooting Guide with common issues and solutions
    • Enhanced development documentation with recent architectural improvements
    • Documented dependency management, state management, and message flow in detail
  • Configurable Namespace for CR Discovery
    • Added namespace configuration option to config.yaml for Kubernetes CR discovery
    • Allows specifying which namespace to use for MCPServer, ServiceClass, and Workflow resources
    • Defaults to "default" when not specified
    • Enables muster to work properly in multi-namespace Kubernetes environments
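
The three buffer strategies listed under Phase 1 of Issue #45 above (BufferActionDrop, BufferActionBlock, BufferActionEvictOldest) differ only in what happens when the channel is full. The following Go sketch shows one possible reading of each. The constant names come from these notes, but the type name BufferAction, the send helper, and the single-sender assumption are introduced here for illustration and may not match the real BufferedChannel.

```go
package main

import "fmt"

// BufferAction selects the behavior when the channel buffer is full.
// The type name is an assumption; the constant names are from the notes.
type BufferAction int

const (
	BufferActionDrop        BufferAction = iota // discard the new message
	BufferActionBlock                           // wait until space is available
	BufferActionEvictOldest                     // discard the oldest message
)

// send delivers msg to a buffered channel according to action and reports
// whether msg was enqueued. It assumes a single sender; with concurrent
// senders the evict path would need a lock.
func send(ch chan string, msg string, action BufferAction) bool {
	switch action {
	case BufferActionBlock:
		ch <- msg
		return true
	case BufferActionEvictOldest:
		for {
			select {
			case ch <- msg:
				return true
			default:
			}
			select {
			case <-ch: // make room by dropping the oldest message
			default:
			}
		}
	default: // BufferActionDrop
		select {
		case ch <- msg:
			return true
		default:
			return false
		}
	}
}

func main() {
	ch := make(chan string, 2)
	send(ch, "a", BufferActionDrop)
	send(ch, "b", BufferActionDrop)
	fmt.Println(send(ch, "c", BufferActionDrop)) // false: buffer full, "c" dropped
	send(ch, "d", BufferActionEvictOldest)       // evicts "a"
	fmt.Println(<-ch, <-ch)                      // b d
}
```

Drop suits high-frequency, low-value updates; evicting the oldest message keeps the most recent state visible; blocking preserves every message at the cost of stalling the sender.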

Changed

  • Aggregator Config
    • Dropped the “Enabled” field (always enabled in modes where it’s used)
  • Service Manager Refactoring
    • ServiceManager now accepts an optional KubeManager parameter for K8s connection services
    • Added support for K8s connection services in the service lifecycle management
    • Improved service stop handling to report “Stopping” state before closing channels
  • Orchestrator Improvements
    • Removed old health monitoring methods in favor of K8s connection services
    • Updated dependency graph to use service labels for K8s connections (e.g., “k8s-mc-mymc” instead of “k8s:context-name”)
    • Improved service restart logic to properly handle dependencies
  • Dependency graph now includes K8sConnection nodes as fundamental dependencies
  • Service manager’s StopServiceWithDependents method handles cascading stops
  • Health check failures trigger automatic cleanup of dependent services
  • Non-TUI mode now uses the orchestrator for health monitoring and dependency management
  • TUI mode no longer performs its own health checks - the orchestrator handles all health monitoring and the TUI only displays results
  • Proper separation of concerns: orchestrator manages health checks and service lifecycle, TUI only displays status
  • Orchestrator now performs initial health check before starting services
  • Refactored TUI message handling system
    • Introduced specialized controller/dispatcher for better separation of concerns
    • Controllers now focus on single responsibilities
    • Better error handling and logging throughout the message flow
  • Improved startup behavior - the UI now shows loading state until all clusters are fully loaded
  • Port forwards no longer start before K8s health checks pass - orchestrator now checks K8s health before starting dependent services
  • ManagedServiceUpdate now includes CorrelationID, CausedBy, and ParentID fields for tracing
  • TUIReporter now uses configurable buffered channels instead of simple channels
  • Service state updates now include correlation information in logs
  • Orchestrator operations (stop/restart) now generate and track correlation IDs
  • Removed unused DependsOnServices field from MCPServerDefinition - MCP servers never depend on other MCP servers
  • Enhanced RestartService to use the new startServiceWithDependencies method for dependency-aware restarts
  • Updated handleServiceStateUpdate to properly restart services with their dependencies
  • Improved Service Monitoring
    • Fixed monitorAndStartServices to respect StopReasonDependency - services stopped due to dependency failure won’t be restarted until their dependencies are restored
    • Added automatic restart of dependent services when a dependency becomes healthy again
    • Added 1-second delay before restarting services to ensure ports are properly released
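
The monitoring rules above (services stopped because of a dependency wait for that dependency to recover, and manually stopped services are never auto-restarted) can be summarized as a single decision function. The Go sketch below is an assumption-laden illustration: StopReasonDependency is named in these notes, but StopReasonNone, StopReasonManual, the Service struct, shouldAutoStart, and the service names other than k8s-mc-mymc are hypothetical. It models automatic restarts only, not an explicit user-triggered restart.

```go
package main

import "fmt"

// StopReason records why a service is not running.
type StopReason int

const (
	StopReasonNone       StopReason = iota // never stopped (assumed name)
	StopReasonManual                       // stopped by the user (assumed name)
	StopReasonDependency                   // stopped because a dependency failed
)

// Service is a simplified view of a managed service (hypothetical type).
type Service struct {
	Running   bool
	Healthy   bool
	Reason    StopReason
	DependsOn []string
}

// shouldAutoStart reports whether the monitor may start the named service:
// it must be stopped, not manually stopped, and every dependency must be
// running and healthy.
func shouldAutoStart(name string, services map[string]Service) bool {
	svc, ok := services[name]
	if !ok || svc.Running || svc.Reason == StopReasonManual {
		return false
	}
	for _, dep := range svc.DependsOn {
		d, found := services[dep]
		if !found || !d.Running || !d.Healthy {
			return false
		}
	}
	return true
}

func main() {
	services := map[string]Service{
		"k8s-mc-mymc": {Running: true, Healthy: false},
		"pf-grafana":  {Reason: StopReasonDependency, DependsOn: []string{"k8s-mc-mymc"}},
		"mcp-manual":  {Reason: StopReasonManual},
	}
	fmt.Println(shouldAutoStart("pf-grafana", services)) // false: dependency unhealthy

	k8s := services["k8s-mc-mymc"]
	k8s.Healthy = true
	services["k8s-mc-mymc"] = k8s
	fmt.Println(shouldAutoStart("pf-grafana", services)) // true: dependency recovered
	fmt.Println(shouldAutoStart("mcp-manual", services)) // false: manual stop respected
}
```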

Fixed

  • Exit CLI on standalone server failure
    • When the mcp-aggregator service (server) fails, the CLI now terminates gracefully
  • Port Forwarding State Issue
    • Fixed issue where port forwarding services would get stuck in “Stopping” state
    • ServiceManager now properly reports the “Stopping” state before closing the stop channel
    • Port forwarding processes correctly transition to “Stopped” state
  • Code Cleanup
    • Removed commented-out mcpServerProcess struct that was marked for deletion
    • Removed duplicate updatePortForwardFromSnapshot and updateMcpServerFromSnapshot methods
    • Cleaned up unused code and improved code organization
  • Dependency-Related Fixes
    • Fixed issue where MCP servers would restart even when their port forward dependencies were stopped
    • Services with StopReasonDependency now properly wait for their dependencies to be restored
    • When a service becomes healthy, its dependent services that were stopped due to dependency failure are automatically restarted
    • Fixed “address already in use” errors by adding proper restart delay
  • Fixed spurious error logs when stopping MCP servers
    • Suppressed expected “file already closed” errors that occurred when stopping MCP server processes
    • Added proper error handling for both stdout and stderr pipe closures during shutdown
    • These were harmless errors but created unnecessary noise in the logs
  • Fixed cascade stops not triggering when K8s connections fail
    • When a K8s connection transitions to Failed state (e.g., due to network issues), all dependent services (port forwards and MCP servers) are now properly stopped
    • This prevents orphaned services from continuing to run when their underlying K8s connection is no longer healthy
    • Services will automatically restart when the K8s connection recovers
  • Set the config directory early to avoid bugs when handling the empty string (this change should fix those bugs as well)
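
The cascade-stop fix above relies on finding every service that depends, directly or transitively, on a failed K8s connection and stopping them in a safe order. A minimal Go sketch of that traversal follows. cascadeStopOrder is a hypothetical function and the service names other than k8s-mc-mymc are invented for the example; muster's dependency graph may be structured differently.

```go
package main

import (
	"fmt"
	"sort"
)

// cascadeStopOrder returns every service that transitively depends on
// failed, ordered so that a service always appears before the services it
// depends on (dependents are stopped first). deps maps each service to
// its direct dependencies and is assumed to be acyclic.
func cascadeStopOrder(deps map[string][]string, failed string) []string {
	// Invert the edges: dependency -> services that depend on it.
	dependents := map[string][]string{}
	for svc, ds := range deps {
		for _, d := range ds {
			dependents[d] = append(dependents[d], svc)
		}
	}
	for _, list := range dependents {
		sort.Strings(list) // deterministic output despite map iteration order
	}

	var order []string
	seen := map[string]bool{failed: true}
	var visit func(node string)
	visit = func(node string) {
		for _, child := range dependents[node] {
			if seen[child] {
				continue
			}
			seen[child] = true
			visit(child)                 // stop the child's own dependents first
			order = append(order, child) // post-order: dependents before dependencies
		}
	}
	visit(failed)
	return order
}

func main() {
	deps := map[string][]string{
		"pf-prometheus":  {"k8s-mc-mymc"},
		"mcp-prometheus": {"pf-prometheus"},
		"mcp-kubernetes": {"k8s-mc-mymc"},
	}
	fmt.Println(cascadeStopOrder(deps, "k8s-mc-mymc"))
	// prints: [mcp-kubernetes mcp-prometheus pf-prometheus]
}
```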

Documentation

  • Added comprehensive documentation about dependency graph implementation
  • Enhanced dependency management documentation with detailed examples
  • Added explanation of dependency rules and startup/restart behavior
  • Documented the relationship between stop reasons and automatic recovery
  • Created comprehensive architecture documentation covering all major components
  • Added troubleshooting guide with detailed debugging techniques
  • Created quick start guide for new users
  • Updated development guide with recent architectural improvements
  • Documented the entire dependency management system with visual diagrams
  • Updated outdated documentation sections
    • Removed obsolete “Package Design for Shared Core Logic” section from development.md
    • Updated development.md to reference the unified service architecture
    • Fixed test examples in development.md to match current implementation
    • Updated README.md prerequisites to remove mcp-proxy requirement
    • Clarified non-TUI mode behavior in README.md
    • Rewrote the MCP Integration Notes in README.md to reflect the YAML configuration system

Technical Details

  • New helper functions: NewManagedServiceUpdate(), WithCause(), WithError(), WithServiceData()
  • New types: BufferStrategy, BufferedChannel, ChannelMetrics, ChannelStats
  • Backwards compatibility maintained for existing interfaces
  • All existing tests updated and new comprehensive test suite added