Changed
- Session duration reduced from 90 days to 30 days. The refresh token TTL now matches Dex's `absoluteLifetime` (720h). Previously, muster's 90-day refresh token outlived Dex's 30-day session, causing confusing failures when auto-refresh silently stopped working after day 30. Users who were logging in once every ~2 months will now need to re-authenticate every 30 days.
- `muster auth status` now shows session expiry. Instead of `Refresh: Available`, the output now shows `Session: ~29 days remaining (auto-refresh)`, giving users a concrete estimate of when re-authentication will be required.
- Access token TTL is now explicitly set to 30 minutes (matching Dex's `idTokens` expiry) instead of relying on the library default of 1 hour.
- Session duration is now configurable via `oauth.server.sessionDuration` in `config.yaml` (default: 720h / 30 days).
- Kubernetes event emission is now disabled by default (alpha feature). Use the `--enable-events` flag on `muster serve` or set `events: true` in `config.yaml` to opt in.
- Switch CI to `push-to-registries-multiarch` (architect-orb@6.14.0) with amd64-only builds on branches for faster PR feedback and full multi-arch builds on release tags. Chart tests now run before publishing to the app catalog.
- Update Dockerfile to a multi-stage build with native cross-compilation support for multi-architecture images.
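Assuming the keys named above, the two new settings might look like this in `config.yaml` (the top-level placement of `events` is as described; all surrounding keys are omitted, and the comments are illustrative):

```yaml
oauth:
  server:
    sessionDuration: 720h  # default: 30 days; should not exceed Dex's absoluteLifetime
events: true               # opt in to Kubernetes event emission (alpha feature)
```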
Note: The Server-Side Meta-Tools Migration below is a breaking change that will be released as part of the next major version. External integrations should prepare for this change.
Breaking Changes
Meta-tools (`list_tools`, `call_tool`, `describe_tool`, etc.) have moved from the agent to the aggregator server. This is a fundamental architectural change.
What Changed:

| Component | Before | After |
|---|---|---|
| Agent | Exposed 11 meta-tools + bridged to aggregator | Transport bridge only (OAuth shim + stdio↔HTTP) |
| Aggregator | Exposed 36+ core tools directly | Exposes ONLY meta-tools - no direct tool access |
| Tool Access | Direct tool calls to aggregator | All tool calls go through `call_tool` meta-tool |

What Continues Working (Transparent Migration):
- CLI commands (`muster list`, `muster get`, etc.) - client wraps calls automatically
- Agent REPL (`muster agent --repl`) - uses same client with transparent wrapping
- BDD test scenarios - test client wraps calls automatically
- MCP native protocol methods (`tools/list`, `resources/list`) - not affected

What Breaks (Requires Update):
- External integrations calling tools directly via HTTP
- Custom clients connecting directly to aggregator
Migration for External Clients:

```jsonc
// Before: Direct tool call
{"method": "tools/call", "params": {"name": "core_service_list", "arguments": {}}}

// After: Wrap through call_tool
{"method": "tools/call", "params": {
  "name": "call_tool",
  "arguments": {"name": "core_service_list", "arguments": {}}
}}
```
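For an external client, the wrapping can be done mechanically. The sketch below is not muster's client code; it simply rewrites a direct request into the `call_tool` form shown above, using plain maps rather than any real MCP client types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// wrapToolCall rewrites a direct tools/call request into the new
// call_tool form. Field names follow the migration example above; the
// request representation (nested maps) is an illustrative choice.
func wrapToolCall(tool string, args map[string]any) map[string]any {
	return map[string]any{
		"method": "tools/call",
		"params": map[string]any{
			"name": "call_tool",
			"arguments": map[string]any{
				"name":      tool,
				"arguments": args,
			},
		},
	}
}

func main() {
	req := wrapToolCall("core_service_list", map[string]any{})
	out, _ := json.Marshal(req)
	fmt.Println(string(out))
}
```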
Benefits:
- OAuth-capable clients can connect directly to server without agent
- Simpler agent architecture (~200 lines vs ~700 lines)
- Consistent tool visibility across all clients
- Centralized meta-tool logic
See ADR-010 for design details.
Known External Integrations Affected:
- Any HTTP clients calling the aggregator directly
- Custom MCP clients not using `muster agent`
- CI/CD pipelines with direct tool calls
Recommended Migration Timeline:
- Review your integration code for direct tool calls
- Update to wrap calls through the `call_tool` meta-tool
- Test with the new Muster version before deploying
Changed
- MCPServer CRD State Exposes Auth Required - The MCPServer CRD now shows the `Auth Required` state when a remote server returns 401 Unauthorized (#337)
  - Before: 401 response mapped to `Connected` (hiding the auth requirement)
  - After: 401 response shows as `Auth Required` in CRD state
  - This gives operators clear visibility into which servers need authentication
  - CLI output updated: `muster list mcpserver` now shows the `Auth Required` state
  - SESSION column values updated: `OK` → `Authenticated`, `Required` → `Pending Auth`
  - Column header renamed: `AUTH` → `SESSION` to match `muster auth status` output
Added
- Reconciliation Framework - Automatic synchronization between resource definitions (CRDs/YAML) and running services
- Supports both Kubernetes mode (using controller-runtime informers) and filesystem mode (using fsnotify)
- Auto-detects operating mode based on environment
- Configurable per-resource-type enable/disable
- Work queue with deduplication and exponential backoff
- Status tracking and API for observability
- See ADR 007 for design details
- StateChangeBridge - Real-time sync of runtime state changes to CRD status subresources
- Subscribes to orchestrator service state changes
- Triggers reconciliation to update CRD status when services start/stop/crash
Changed
- BREAKING: Consolidated OAuth Configuration Naming - OAuth configuration structure has been reorganized for clarity (#324)
  - Before: `aggregator.oauth` (client/proxy) + `aggregator.oauthServer` (server protection)
  - After: `aggregator.oauth.mcpClient` (MCP client/proxy) + `aggregator.oauth.server` (server protection)
  - Both OAuth roles now live under a single `oauth` section with explicit `mcpClient`/`server` sub-sections
  - The `mcpClient` name makes it clear this is for authenticating TO remote MCP servers
  - CLI flags renamed: `--oauth` → `--oauth-mcp-client`, `--oauth-public-url` → `--oauth-mcp-client-public-url`
  - Helm values updated: `muster.oauth.*` → `muster.oauth.mcpClient.*`, `muster.oauthServer.*` → `muster.oauth.server.*`
  - CIMD configuration moved to a nested structure: `cimdPath`/`cimdScopes` → `cimd.path`/`cimd.scopes`
  - Migration: Update configuration files and Helm values to use the new structure
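As a sketch of the rename in config terms, the old and new layouts compare as follows; the `{...}` bodies stand for the unchanged role-specific settings and are not the full schema:

```yaml
# Before
aggregator:
  oauth: {...}        # MCP client/proxy settings
  oauthServer: {...}  # server protection

# After
aggregator:
  oauth:
    mcpClient:        # was aggregator.oauth
      cimd:
        path: {...}   # was cimdPath
        scopes: {...} # was cimdScopes
    server: {...}     # was aggregator.oauthServer
```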
- BREAKING: CRD Status Field Changes - Status fields have been redesigned for session-aware tool availability
  - MCPServerStatus: Removed `availableTools` (session-dependent), added `lastConnected` and `restartCount`
  - ServiceClassStatus: Replaced `available`/`requiredTools`/`missingTools`/`toolAvailability` with `valid`/`validationErrors`/`referencedTools`
  - WorkflowStatus: Replaced `available`/`requiredTools`/`missingTools`/`stepValidation` with `valid`/`validationErrors`/`referencedTools`/`stepCount`
  - Tool availability is now computed per-session at runtime, not stored in CRs
  - Existing CRs will have stale status fields that will be updated on first reconciliation
- Added Chart annotations to support OCI repositories.
Fixed
- Helm CiliumNetworkPolicy: Fixed incorrect values path for the OAuth storage check (now uses `.Values.muster.oauth.server.storage`)
Added
- Remote MCP Server Support for Kubernetes Environments
- Added comprehensive support for the `stdio`, `streamable-http` and `sse` transport protocols
- Enhanced CRD Schema: Updated `MCPServerSpec` to support all MCP server types
  - Added new configuration for `streamable-http` and `sse`: `url`, `headers` and `timeout` fields
  - Added mutual exclusion validation and required field validation using kubebuilder annotations
- New CLI Commands: Added subcommands to use the new type system
  - `muster create mcpserver <name> --type stdio` for local MCP servers
  - `muster create mcpserver <name> --type streamable-http` for HTTP remote servers
  - `muster create mcpserver <name> --type sse` for SSE remote servers
- Updated Examples: Enhanced example files to demonstrate both local and remote configurations
- Kubernetes Deployment Ready: Enables deployment patterns where Muster aggregator runs in cluster and connects to MCP servers deployed as separate Kubernetes services
- Systemd Socket Activation Support
- Added a `muster.socket` unit file for socket-activated systemd deployment
- Modified `muster.service` to use socket activation on localhost:8090
- Updated `scripts/setup-systemd.sh` and `scripts/dev-restart.sh` to handle socket activation
- Added the `github.com/coreos/go-systemd` dependency to handle socket activation
- Service Health Monitoring
- Added health checks for MCP servers using the `tools/list` JSON-RPC method
- Added health checks for port forwards by testing TCP connectivity
- Health checks run every 30 seconds for all running services
- Health status is reported through the StateStore and displayed in the TUI
- Created a `ServiceHealthChecker` interface for extensible health checking
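The `tools/list` probe can be sketched as follows. This is the idea only, not muster's `ServiceHealthChecker` implementation; the endpoint path in `main` is hypothetical:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// checkMCPHealth probes an MCP server with a tools/list JSON-RPC request
// and treats an HTTP 200 response as healthy. A sketch of the health
// check described above; the timeout value is an assumption.
func checkMCPHealth(endpoint string) error {
	body, err := json.Marshal(map[string]any{
		"jsonrpc": "2.0",
		"id":      1,
		"method":  "tools/list",
	})
	if err != nil {
		return err
	}
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Post(endpoint, "application/json", bytes.NewReader(body))
	if err != nil {
		return fmt.Errorf("unhealthy: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unhealthy: status %d", resp.StatusCode)
	}
	return nil
}

func main() {
	// Hypothetical local endpoint; adjust to your deployment.
	if err := checkMCPHealth("http://localhost:8090/mcp"); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("healthy")
}
```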
- Improved State Reconciliation
- Implemented a proper `ReconcileState()` method that syncs TUI state with the StateStore
- Updates service statuses, ports, PIDs, and error states from the centralized store
- Synchronizes cluster health information from K8sStateManager
- Ensures UI consistency after startup and state changes
- K8s Connections as Services
- Kubernetes connections are now modeled as services in the dependency graph
- K8s connection health monitoring is now handled by dedicated K8s connection services
- Unified service management architecture - all services (K8s, port forwards, MCPs) follow the same lifecycle
- K8s connections can be stopped/restarted like other services with proper cascade handling
- Cascading stop functionality: stopping a service automatically stops all dependent services
- K8s connection health monitoring with automatic service lifecycle management
- Port forwards now depend on their kubernetes context being authenticated and healthy
- The kubernetes MCP server depends on the management cluster connection
- When k8s connections become unhealthy, dependent services are automatically stopped
- Manual stop (x key) now uses cascading stop to cleanly shut down dependent services
- New `StartServicesDependingOn` method in ServiceManager to restart services when dependencies recover
- New `orchestrator` package that manages application state and service lifecycle for both TUI and non-TUI modes
- New `HealthStatusUpdate` and `ReportHealth` for proper health status reporting
- Health-aware startup: Services now wait for their K8s dependencies to be healthy before starting
- Add comprehensive dependency management system for services
- Services now track why they were stopped (manual vs dependency cascade)
- Automatically restart services when their dependencies recover
- Ensure correct startup order based on dependency graph
- Prevent manually stopped services from auto-restarting
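The cascade and stop-reason bookkeeping above can be sketched as a recursive walk over the dependency graph. This is a simplified illustration, not muster's ServiceManager; the graph shape and service names in `main` are hypothetical:

```go
package main

import "fmt"

// StopReason records why a service stopped, so the orchestrator can tell
// manual stops (never auto-restart) from dependency cascades (restart
// when the dependency recovers).
type StopReason int

const (
	StopReasonManual StopReason = iota + 1
	StopReasonDependency
)

// Graph maps each service to the services that depend on it.
type Graph map[string][]string

// cascadeStop stops svc and, recursively, every service that depends on
// it, recording why each one stopped.
func cascadeStop(g Graph, svc string, reason StopReason, stopped map[string]StopReason) {
	if _, done := stopped[svc]; done {
		return
	}
	stopped[svc] = reason
	for _, dependent := range g[svc] {
		// Dependents stop because their dependency went away, not manually.
		cascadeStop(g, dependent, StopReasonDependency, stopped)
	}
}

func main() {
	// Hypothetical chain: K8s connection -> port forward -> MCP server.
	g := Graph{
		"k8s-mc-mymc": {"pf-grafana"},
		"pf-grafana":  {"mcp-grafana"},
	}
	stopped := map[string]StopReason{}
	cascadeStop(g, "k8s-mc-mymc", StopReasonManual, stopped)
	fmt.Println(stopped["mcp-grafana"] == StopReasonDependency) // true
}
```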
- Phase 1 of Issue #45: Message Handling Architecture Improvements
- Added correlation ID support to `ManagedServiceUpdate` for tracing related messages and cascading effects
- Implemented configurable buffer strategies for TUI message channels:
  - `BufferActionDrop`: Drop messages when the buffer is full
  - `BufferActionBlock`: Block until space is available
  - `BufferActionEvictOldest`: Remove the oldest message to make room for new ones
- Added priority-based buffer strategies to handle different message types differently
- Introduced `BufferedChannel` with metrics tracking (messages sent, dropped, blocked, evicted)
- Enhanced orchestrator with correlation tracking for health checks and cascading operations
- Updated service manager to use new correlation ID system for better debugging
- Added comprehensive test coverage for buffer strategies and correlation tracking
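The drop and evict-oldest strategies can be illustrated with a simplified sketch. This is not muster's `BufferedChannel` (which also tracks sent/blocked counts); `BufferActionBlock` is omitted here because it needs a live receiver to demonstrate:

```go
package main

import "fmt"

// BufferAction selects what to do when the channel buffer is full.
type BufferAction int

const (
	BufferActionDrop BufferAction = iota
	BufferActionEvictOldest
)

// BufferedChannel is a simplified sketch of a channel with a full-buffer
// policy and basic metrics.
type BufferedChannel struct {
	ch      chan string
	action  BufferAction
	dropped int
	evicted int
}

func NewBufferedChannel(size int, action BufferAction) *BufferedChannel {
	return &BufferedChannel{ch: make(chan string, size), action: action}
}

// Send delivers msg, applying the configured strategy when full.
func (b *BufferedChannel) Send(msg string) {
	select {
	case b.ch <- msg:
	default: // buffer full
		switch b.action {
		case BufferActionDrop:
			b.dropped++ // discard the new message
		case BufferActionEvictOldest:
			<-b.ch // make room by discarding the oldest message
			b.ch <- msg
			b.evicted++
		}
	}
}

func main() {
	b := NewBufferedChannel(1, BufferActionEvictOldest)
	b.Send("old")
	b.Send("new") // evicts "old"
	fmt.Println(<-b.ch, b.evicted) // prints "new 1"
}
```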
- Phase 2 of Issue #45: State Consolidation
- Implemented a centralized `StateStore` as single source of truth for all service states
- Added `ServiceStateSnapshot` for complete state information with correlation tracking
- Introduced state change subscriptions with `StateSubscription` for reactive updates
- Enhanced the `ServiceReporter` interface with a `GetStateStore()` method for direct state access
- Updated `TUIReporter` and `ConsoleReporter` to use centralized state management
- Migrated `ServiceManager` from local state tracking to the centralized `StateStore`
- Added comprehensive metrics tracking for state changes and subscription performance
- Implemented state change event system with old/new state tracking
- Added support for filtering services by type and state
- Maintained full backwards compatibility while eliminating state duplication
- Phase 3 of Issue #45: Structured Event System
- Implemented a comprehensive event hierarchy with semantic event types:
  - `ServiceStateEvent` for service lifecycle changes with old/new state tracking
  - `HealthEvent` for cluster health status updates
  - `DependencyEvent` for cascade start/stop operations
  - `UserActionEvent` for user-initiated actions
  - `SystemEvent` for system-level operations
- Added an `EventBus` interface with publish/subscribe functionality
- Implemented a flexible event filtering system with composable filters:
  - Filter by event type, source, severity, correlation ID
  - Combine filters with AND/OR logic for complex subscriptions
- Created an `EventBusAdapter` for backwards compatibility with the existing `ServiceReporter` interface
- Added comprehensive event metrics tracking (published, delivered, dropped events)
- Implemented both handler-based and channel-based event subscriptions
- Added event severity levels (trace, debug, info, warn, error, fatal) for better categorization
- Enhanced correlation tracking with event metadata support
- Provided thread-safe concurrent event publishing and subscription management
- Added extensive test coverage for all event types and bus functionality
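A minimal sketch of the publish/subscribe and composable-filter ideas; the `Event` fields and severity strings are illustrative, and muster's real `EventBus` interface is not reproduced here:

```go
package main

import (
	"fmt"
	"sync"
)

// Event is an illustrative stand-in for the semantic event types above.
type Event struct {
	Type          string
	Source        string
	Severity      string
	CorrelationID string
}

// Filter decides whether a subscriber receives an event.
type Filter func(Event) bool

// And composes two filters with AND logic; Or would work analogously.
func And(a, b Filter) Filter { return func(e Event) bool { return a(e) && b(e) } }

type subscriber struct {
	filter Filter
	ch     chan Event
}

// EventBus fans events out to filtered, buffered subscriptions.
type EventBus struct {
	mu   sync.Mutex
	subs []subscriber
}

// Subscribe registers a filtered subscription with the given buffer size.
func (b *EventBus) Subscribe(f Filter, buf int) <-chan Event {
	b.mu.Lock()
	defer b.mu.Unlock()
	s := subscriber{filter: f, ch: make(chan Event, buf)}
	b.subs = append(b.subs, s)
	return s.ch
}

// Publish delivers the event to every matching subscriber, dropping it
// for subscribers whose buffer is full.
func (b *EventBus) Publish(e Event) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for _, s := range b.subs {
		if s.filter(e) {
			select {
			case s.ch <- e:
			default:
			}
		}
	}
}

func main() {
	bus := &EventBus{}
	errs := bus.Subscribe(And(
		func(e Event) bool { return e.Type == "ServiceStateEvent" },
		func(e Event) bool { return e.Severity == "error" },
	), 8)
	bus.Publish(Event{Type: "ServiceStateEvent", Source: "orchestrator", Severity: "error"})
	bus.Publish(Event{Type: "HealthEvent", Source: "k8s", Severity: "info"})
	fmt.Println(len(errs)) // 1: only the matching event was delivered
}
```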
- Phase 4 of Issue #45: Testing and Polish
- Added comprehensive integration tests covering end-to-end event flows
- Implemented performance monitoring utilities with `PerformanceMonitor` and metrics tracking
- Created an event batching system with `EventBatchProcessor` for high-volume scenarios
- Built `OptimizedEventBus` with configurable performance optimizations
- Added an object pooling system with `EventPoolManager` to reduce GC pressure
- Implemented extensive error recovery testing including panic handling
- Added memory usage monitoring and subscription cleanup verification
- Created comprehensive documentation covering architecture, usage, and best practices
- Fixed race conditions in event bus concurrent access patterns
- Enhanced thread safety across all components with proper synchronization
- Provided migration guides and troubleshooting documentation
- Achieved high test coverage with robust integration and unit tests
- Improved Dependency Management for Service Restarts
- When restarting a service, its dependencies are now automatically restarted if they’re not active
- This ensures services always have their requirements satisfied (e.g., restarting Grafana MCP will also restart its port forward if needed)
- Dependencies are restarted regardless of their stop reason to guarantee service requirements
- Clear manual stop reason when restarting a service to allow proper dependency management
- Implemented Issue #46: Improved State Management Between TUI and Orchestrator
- Phase 1: Unified State Management
- Added helper methods to TUI Model to use StateStore as single source of truth
- Implemented state reconciliation on TUI startup to ensure consistency
- Updated TUI controller to use StateStore instead of directly updating model maps
- Eliminated state duplication between TUI Model and StateStore
- Phase 2: Message Sequencing
- Added sequence numbers to `ManagedServiceUpdate` for proper message ordering
- Implemented a `MessageBuffer` for handling out-of-order messages
- Added a global sequence counter with atomic operations for thread safety
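The reordering idea can be sketched as follows; muster's `MessageBuffer` operates on `ManagedServiceUpdate` values, while this simplified version holds strings until the sequence gap fills:

```go
package main

import "fmt"

// MessageBuffer releases messages in sequence order, holding
// out-of-order arrivals until the gap fills. A sketch of the Phase 2
// idea, not muster's implementation.
type MessageBuffer struct {
	next    uint64
	pending map[uint64]string
}

func NewMessageBuffer() *MessageBuffer {
	return &MessageBuffer{next: 1, pending: map[uint64]string{}}
}

// Add accepts a sequenced message and returns every message that is now
// deliverable in order.
func (b *MessageBuffer) Add(seq uint64, msg string) []string {
	b.pending[seq] = msg
	var out []string
	for {
		m, ok := b.pending[b.next]
		if !ok {
			break
		}
		delete(b.pending, b.next)
		out = append(out, m)
		b.next++
	}
	return out
}

func main() {
	buf := NewMessageBuffer()
	fmt.Println(buf.Add(2, "second")) // [] - held until seq 1 arrives
	fmt.Println(buf.Add(1, "first"))  // [first second]
}
```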
- Phase 3: Enhanced Correlation Tracking
- Added a `CascadeInfo` type for tracking cascade relationships between services
- Added a `StateTransition` type for tracking state changes with full context
- Enhanced StateStore to record state transitions and cascade operations automatically
- Updated orchestrator to record cascade operations for better observability
- Phase 4: Improved Error Handling
- Added retry logic for critical updates that are dropped due to buffer overflow
- Implemented `BackpressureNotificationMsg` for user notifications about dropped messages
- Added configurable retry attempts with exponential backoff
- Enhanced TUIReporter with retry queue processing and user feedback
- Comprehensive Documentation Suite
- Added Architecture Overview documenting system design, components, and principles
- Created Quick Start Guide for new users to get up and running quickly
- Added Troubleshooting Guide with common issues and solutions
- Enhanced development documentation with recent architectural improvements
- Documented dependency management, state management, and message flow in detail
- Configurable Namespace for CR Discovery
- Added a `namespace` configuration option to `config.yaml` for Kubernetes CR discovery
- Allows specifying which namespace to use for MCPServer, ServiceClass, and Workflow resources
- Defaults to `"default"` when not specified
- Enables muster to work properly in multi-namespace Kubernetes environments
Changed
- Aggregator Config
- Drop the “Enabled” field (always enabled in modes where it’s used)
- Service Manager Refactoring
- ServiceManager now accepts an optional KubeManager parameter for K8s connection services
- Added support for K8s connection services in the service lifecycle management
- Improved service stop handling to report “Stopping” state before closing channels
- Orchestrator Improvements
- Removed old health monitoring methods in favor of K8s connection services
- Updated dependency graph to use service labels for K8s connections (e.g., “k8s-mc-mymc” instead of “k8s:context-name”)
- Improved service restart logic to properly handle dependencies
- Dependency graph now includes K8sConnection nodes as fundamental dependencies
- Service manager’s StopServiceWithDependents method handles cascading stops
- Health check failures trigger automatic cleanup of dependent services
- Non-TUI mode now uses the orchestrator for health monitoring and dependency management
- TUI mode no longer performs its own health checks - the orchestrator handles all health monitoring and the TUI only displays results
- Proper separation of concerns: orchestrator manages health checks and service lifecycle, TUI only displays status
- Orchestrator now performs initial health check before starting services
- Refactored TUI message handling system
- Introduced specialized controller/dispatcher for better separation of concerns
- Controllers now focus on single responsibilities
- Better error handling and logging throughout the message flow
- Improved startup behavior - the UI now shows loading state until all clusters are fully loaded
- Port forwards no longer start before K8s health checks pass - orchestrator now checks K8s health before starting dependent services
- `ManagedServiceUpdate` now includes `CorrelationID`, `CausedBy`, and `ParentID` fields for tracing
- `TUIReporter` now uses configurable buffered channels instead of simple channels
- Service state updates now include correlation information in logs
- Orchestrator operations (stop/restart) now generate and track correlation IDs
- Removed the unused `DependsOnServices` field from `MCPServerDefinition` - MCP servers never depend on other MCP servers
- Enhanced `RestartService` to use the new `startServiceWithDependencies` method for dependency-aware restarts
- Updated `handleServiceStateUpdate` to properly restart services with their dependencies
- Improved Service Monitoring
- Fixed `monitorAndStartServices` to respect `StopReasonDependency` - services stopped due to dependency failure won't be restarted until their dependencies are restored
- Added automatic restart of dependent services when a dependency becomes healthy again
- Added 1-second delay before restarting services to ensure ports are properly released
Fixed
- Exit CLI on standalone server failure
- When the mcp-aggregator service (server) fails, the CLI now terminates gracefully
- Port Forwarding State Issue
- Fixed issue where port forwarding services would get stuck in “Stopping” state
- ServiceManager now properly reports the “Stopping” state before closing the stop channel
- Port forwarding processes correctly transition to “Stopped” state
- Code Cleanup
- Removed the commented-out `mcpServerProcess` struct that was marked for deletion
- Removed duplicate `updatePortForwardFromSnapshot` and `updateMcpServerFromSnapshot` methods
- Cleaned up unused code and improved code organization
- Dependency-Related Fixes
- Fixed issue where MCP servers would restart even when their port forward dependencies were stopped
- Services with `StopReasonDependency` now properly wait for their dependencies to be restored
- When a service becomes healthy, its dependent services that were stopped due to dependency failure are automatically restarted
- Fixed “address already in use” errors by adding proper restart delay
- Fixed spurious error logs when stopping MCP servers
- Suppressed expected “file already closed” errors that occurred when stopping MCP server processes
- Added proper error handling for both stdout and stderr pipe closures during shutdown
- These were harmless errors but created unnecessary noise in the logs
- Fixed cascade stops not triggering when K8s connections fail
- When a K8s connection transitions to Failed state (e.g., due to network issues), all dependent services (port forwards and MCP servers) are now properly stopped
- This prevents orphaned services from continuing to run when their underlying K8s connection is no longer healthy
- Services will automatically restart when the K8s connection recovers
- Set the config directory early to avoid bugs in handling the empty string (those should be fixed by this change as well)
Documentation
- Added comprehensive documentation about dependency graph implementation
- Enhanced dependency management documentation with detailed examples
- Added explanation of dependency rules and startup/restart behavior
- Documented the relationship between stop reasons and automatic recovery
- Created comprehensive architecture documentation covering all major components
- Added troubleshooting guide with detailed debugging techniques
- Created quick start guide for new users
- Updated development guide with recent architectural improvements
- Documented the entire dependency management system with visual diagrams
- Updated outdated documentation sections
- Removed obsolete “Package Design for Shared Core Logic” section from development.md
- Updated development.md to reference the unified service architecture
- Fixed test examples in development.md to match current implementation
- Updated README.md prerequisites to remove mcp-proxy requirement
- Clarified non-TUI mode behavior in README.md
- Rewrote the MCP Integration Notes in README.md to reflect the YAML configuration system
Technical Details
- New helper functions: `NewManagedServiceUpdate()`, `WithCause()`, `WithError()`, `WithServiceData()`
- New types: `BufferStrategy`, `BufferedChannel`, `ChannelMetrics`, `ChannelStats`
- Backwards compatibility maintained for existing interfaces
- All existing tests updated and new comprehensive test suite added