AgentEnsemble | Blog

A Control Plane for Long-Running Agent Services

Thu, 04 Jun 2026 00:00:00 GMT

An earlier post in this series covered running agent ensembles as long-running services — always-on processes that accept work over WebSocket, HTTP, queues, or topics instead of running once and exiting. Once an ensemble is a service, a new category of problem appears: how do external systems interact with it?

The existing WebSocket dashboard streams execution events and handles review decisions. That covers observability and human review. What it doesn’t cover is run submission. There’s no way for a CI pipeline, orchestrator, or custom UI to kick off a run, pass runtime parameters, query what’s currently executing, or cancel something that’s gone wrong — without a WebSocket connection and custom client code.

The Ensemble Control API fills that gap.

The Control Plane vs. the Data Plane

Before getting into the API itself, a design distinction worth stating explicitly.

The v3 network module handles ensemble-to-ensemble communication: tasks delegating work to remote peers, capability registries, federation across namespaces. That’s the data plane — ensemble-internal traffic, designed for ensemble peers.

The Control API is the control plane: CI pipelines, orchestrators, and custom UIs talking to an ensemble service. Different audience, different semantics. External systems shouldn’t need a WebSocket client, shouldn’t need to understand the ensemble networking protocol, and shouldn’t be treated as ensemble peers. The REST-first design reflects that distinction.

Phase 1: Core REST Endpoints

Four endpoints on the same Javalin server as the WebSocket dashboard — no new port, no new process:

POST /api/runs          Submit a run with input variables
GET  /api/runs          List recent runs (filterable by status, tag)
GET  /api/runs/{runId}  Get full run detail (status, task outputs, metrics)
GET  /api/capabilities  List registered tools, models, and preconfigured tasks

Setup

The API is activated by adding catalogs to WebDashboard.builder():

ToolCatalog tools = ToolCatalog.builder()
    .tool("web_search", webSearchTool)
    .tool("calculator", calculatorTool)
    .build();

ModelCatalog models = ModelCatalog.builder()
    .model("sonnet", claudeSonnetModel)
    .model("haiku", claudeHaikuModel)
    .build();

WebDashboard dashboard = WebDashboard.builder()
    .port(7329)
    .toolCatalog(tools)
    .modelCatalog(models)
    .maxConcurrentRuns(5)
    .maxRetainedCompletedRuns(100)
    .build();

The ensemble wires in the dashboard:

Ensemble.builder()
    .chatLanguageModel(claudeSonnetModel)
    .webDashboard(dashboard)
    .task(Task.builder()
        .description("Research {topic} focusing on recent developments in {year}")
        .tools(webSearchTool)
        .build())
    .task(Task.builder()
        .description("Write a concise executive summary of the research")
        .build())
    .build()
    .start(7329);

ToolCatalog and ModelCatalog serve two purposes. They make the API transport-agnostic (JSON refers to tools and models by name, not class). And they act as allowlists — only registered tools and models can be used. Dynamic task creation in later phases cannot instantiate arbitrary code.

Submitting a run

POST /api/runs submits the pre-configured ensemble tasks with variable substitution:

{
  "inputs": {
    "topic": "AI safety",
    "year": "2025"
  },
  "tags": {
    "triggeredBy": "ci-pipeline",
    "environment": "staging"
  }
}

Response (202 Accepted):

{
  "runId": "run-7f3a2b",
  "status": "ACCEPTED",
  "tasks": 2,
  "workflow": "SEQUENTIAL"
}

The run executes asynchronously — the response is immediate. Poll GET /api/runs/{runId} for completion. Tags are arbitrary metadata for filtering and auditing. An empty body submits the template ensemble with no substitution. If maxConcurrentRuns is reached, the response is 429 with a retryAfterMs hint.
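From a JVM client, submission and throttling can be handled with the JDK's built-in java.net.http types. The sketch below is illustrative rather than part of AgentEnsemble: the class name RunSubmission is hypothetical, and it assumes the 429 response carries retryAfterMs as a numeric field in a JSON body, which the post does not spell out.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Duration;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RunSubmission {

    // Assumption: the 429 hint is a numeric "retryAfterMs" field in a JSON body.
    private static final Pattern RETRY_AFTER =
        Pattern.compile("\"retryAfterMs\"\\s*:\\s*(\\d{1,18})");

    /** Builds the POST /api/runs request; send it with java.net.http.HttpClient. */
    static HttpRequest submitRequest(URI baseUri, String jsonBody) {
        return HttpRequest.newBuilder(baseUri.resolve("/api/runs"))
            .header("Content-Type", "application/json")
            .timeout(Duration.ofSeconds(10))
            .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
            .build();
    }

    /** Returns how long to wait before resubmitting, or empty if the request was not throttled. */
    static OptionalLong retryDelayMs(int statusCode, String responseBody) {
        if (statusCode != 429) {
            return OptionalLong.empty();
        }
        Matcher m = RETRY_AFTER.matcher(responseBody);
        // Fall back to one second when the hint is missing or unparseable.
        return OptionalLong.of(m.find() ? Long.parseLong(m.group(1)) : 1_000L);
    }
}
```

A caller would send the request with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), then pass the status code and body to retryDelayMs before deciding whether to resubmit.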

Querying capabilities

GET /api/capabilities exposes what’s registered:

{
  "tools": [
    { "name": "web_search", "description": "Search the web using Google" },
    { "name": "calculator", "description": "Evaluate mathematical expressions" }
  ],
  "models": [
    { "alias": "sonnet", "provider": "anthropic" },
    { "alias": "haiku", "provider": "anthropic" }
  ],
  "preconfiguredTasks": [
    { "description": "Research {topic} focusing on recent developments in {year}" },
    { "description": "Write a concise executive summary of the research" }
  ]
}

GET /api/runs/{runId} returns full run detail including task outputs and metrics. GET /api/runs lists recent runs filterable by ?status=RUNNING, ?status=COMPLETED, or ?tag=triggeredBy:ci-pipeline.
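Because submission is asynchronous, most callers end up writing a small polling loop around GET /api/runs/{runId}. The following is a minimal sketch with the HTTP call abstracted behind a Supplier so the loop itself is easy to test. RunPoller is a hypothetical name, and only COMPLETED appears in this post; FAILED and CANCELLED are assumed names for the other terminal statuses.

```java
import java.time.Duration;
import java.util.Set;
import java.util.function.Supplier;

final class RunPoller {

    // COMPLETED is named in the post; FAILED and CANCELLED are assumptions.
    static final Set<String> TERMINAL = Set.of("COMPLETED", "FAILED", "CANCELLED");

    /**
     * Calls fetchStatus until it reports a terminal status or maxPolls is exhausted.
     * Returns the last status seen, which may be non-terminal if polling gave up.
     */
    static String pollUntilTerminal(Supplier<String> fetchStatus, int maxPolls, Duration interval) {
        if (maxPolls < 1) {
            throw new IllegalArgumentException("maxPolls must be >= 1");
        }
        String status = fetchStatus.get();
        for (int polls = 1; polls < maxPolls && !TERMINAL.contains(status); polls++) {
            try {
                Thread.sleep(interval.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return status;
            }
            status = fetchStatus.get();
        }
        return status;
    }
}
```

In a real client the supplier would issue the GET and extract the status field from the response; bounding the number of polls keeps a stuck run from blocking a CI job indefinitely.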

Phase 2: The Three-Level Run Submission Model

The most interesting design decision in the Control API is the graduated run submission model. There are three levels, each more dynamic than the last.

Level 1 (covered above): substitute template variables into the pre-configured ensemble. The simplest and most constrained option — the Java code defines what runs.

Level 2: override specific fields of individual tasks at runtime.

Level 3: define a new task list entirely in the POST body, without changing any Java code.

This graduated approach keeps the simple case simple while making the more dynamic cases possible without abandoning the safety properties of the catalog model.

Task naming

To use Levels 2 and 3 effectively, tasks can be given logical names:

Task.builder()
    .name("researcher")
    .description("Research {topic} focusing on recent developments in {year}")
    .tools(webSearchTool)
    .build()

GET /api/capabilities returns task names alongside descriptions. Level 2 override keys match by exact name first, then by description prefix (first 50 characters, case-insensitive) as a fallback.
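The two-pass matching rule can be modeled in a few lines. This sketch takes one plausible reading of the fallback: the key and the description are both truncated to 50 characters and compared ignoring case. The actual implementation may differ (for example, it could treat the key as a prefix of the description), and OverrideKeyMatcher and TaskRef are hypothetical names.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;

final class OverrideKeyMatcher {

    /** Minimal stand-in for a template task: an optional name plus a description. */
    record TaskRef(String name, String description) {}

    private static final int PREFIX_LENGTH = 50;

    static Optional<TaskRef> resolve(String key, List<TaskRef> tasks) {
        // Pass 1: exact match on the logical task name.
        for (TaskRef task : tasks) {
            if (key.equals(task.name())) {
                return Optional.of(task);
            }
        }
        // Pass 2 (fallback): compare the first 50 characters, ignoring case.
        String wanted = prefix(key);
        for (TaskRef task : tasks) {
            if (prefix(task.description()).equals(wanted)) {
                return Optional.of(task);
            }
        }
        return Optional.empty();
    }

    private static String prefix(String text) {
        String head = text.length() > PREFIX_LENGTH ? text.substring(0, PREFIX_LENGTH) : text;
        return head.toLowerCase(Locale.ROOT);
    }
}
```

Running the exact-name pass to completion before trying the fallback means a named task always wins over an unnamed task whose description happens to match the key.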

Level 2: Per-task overrides

taskOverrides lets a caller change a specific task’s description, expected output, model, iteration limit, tool set, or additional context without recompilation:

{
  "inputs": { "topic": "AI safety" },
  "taskOverrides": {
    "researcher": {
      "description": "Research {topic} focusing on EU AI Act compliance",
      "expectedOutput": "A regulatory analysis report with citations",
      "model": "sonnet",
      "maxIterations": 15,
      "additionalContext": "The EU AI Act was formally adopted in March 2024.",
      "tools": {
        "add": ["web_search"],
        "remove": ["calculator"]
      }
    }
  }
}

The override key ("researcher") is matched against the template ensemble’s task names. If no matching task exists, the request is rejected with 400. The original task objects are never mutated — Task.toBuilder() creates modified copies.

All tool references are resolved against the ToolCatalog and all model references against the ModelCatalog. A caller cannot inject a tool or model that was not pre-registered.
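The allowlist check for a tools override can be sketched as a pure function over tool names. This is an illustration only, not the library's code: ToolOverrideResolver is a hypothetical name, and both the decision to validate removed names against the catalog and the add-then-remove ordering are assumptions.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class ToolOverrideResolver {

    /**
     * Applies an add/remove override to a task's tool names.
     * Every referenced name must be in the catalog, so a caller cannot
     * introduce a tool that was not registered up front.
     */
    static Set<String> apply(Set<String> catalog, Set<String> current,
                             List<String> add, List<String> remove) {
        for (String name : add) {
            requireRegistered(catalog, name);
        }
        for (String name : remove) {
            requireRegistered(catalog, name); // assumption: removals are validated too
        }
        Set<String> result = new LinkedHashSet<>(current); // copy; the template task is untouched
        result.addAll(add);
        remove.forEach(result::remove);                    // assumption: removals apply after additions
        return result;
    }

    private static void requireRegistered(Set<String> catalog, String name) {
        if (!catalog.contains(name)) {
            throw new IllegalArgumentException("Tool not registered in catalog: " + name);
        }
    }
}
```

An HTTP handler would presumably translate the IllegalArgumentException into a 400 response, in the same way an unknown override key is rejected.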

Level 3: Dynamic task creation

When tasks is provided in the request body, the template ensemble’s task list is replaced entirely. The template’s model, catalogs, and configuration are preserved — only the task list changes:

{
  "tasks": [
    {
      "name": "researcher",
      "description": "Research the competitive landscape for {product}",
      "expectedOutput": "A competitive analysis identifying 5 key competitors",
      "tools": ["web_search"],
      "model": "sonnet",
      "maxIterations": 20
    },
    {
      "name": "writer",
      "description": "Write an executive brief based on the research",
      "expectedOutput": "A 1-page executive summary suitable for C-suite",
      "context": ["$researcher"],
      "model": "sonnet"
    }
  ],
  "inputs": { "product": "AgentEnsemble" }
}

The context field declares dependencies between tasks. $researcher references the task named "researcher"; $0 references the task at index 0. The scheduler infers the workflow type from these dependencies — if context references exist and no workflow is explicitly set, PARALLEL (DAG-based) is used. Circular dependencies and unknown references are rejected at submission time.
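Submission-time validation of context references amounts to resolving each reference to a task index and then checking the resulting graph for cycles. A minimal sketch follows, using a depth-first search with three visit states. ContextGraph and TaskSpec are hypothetical names, and treating an all-digit reference as an index before trying it as a name is an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ContextGraph {

    /** Hypothetical task shape: a name plus raw references such as "$researcher" or "$0". */
    record TaskSpec(String name, List<String> context) {}

    /** Resolves every reference to a task index; throws on unknown references or cycles. */
    static List<List<Integer>> resolve(List<TaskSpec> tasks) {
        Map<String, Integer> byName = new HashMap<>();
        for (int i = 0; i < tasks.size(); i++) {
            byName.put(tasks.get(i).name(), i);
        }
        List<List<Integer>> deps = new ArrayList<>();
        for (TaskSpec task : tasks) {
            List<Integer> resolved = new ArrayList<>();
            for (String ref : task.context()) {
                resolved.add(lookup(ref, byName, tasks.size()));
            }
            deps.add(resolved);
        }
        int[] state = new int[tasks.size()]; // 0 = unvisited, 1 = on the DFS stack, 2 = done
        for (int i = 0; i < tasks.size(); i++) {
            visit(i, deps, state);
        }
        return deps;
    }

    private static int lookup(String ref, Map<String, Integer> byName, int taskCount) {
        if (!ref.startsWith("$") || ref.length() < 2) {
            throw new IllegalArgumentException("Malformed context reference: " + ref);
        }
        String target = ref.substring(1);
        if (target.chars().allMatch(Character::isDigit)) {
            int index = Integer.parseInt(target);
            if (index < taskCount) {
                return index;
            }
        } else if (byName.containsKey(target)) {
            return byName.get(target);
        }
        throw new IllegalArgumentException("Unknown context reference: " + ref);
    }

    private static void visit(int node, List<List<Integer>> deps, int[] state) {
        if (state[node] == 2) {
            return;
        }
        if (state[node] == 1) { // reached a task that is still on the stack: a cycle
            throw new IllegalArgumentException("Circular dependency at task index " + node);
        }
        state[node] = 1;
        for (int dep : deps.get(node)) {
            visit(dep, deps, state);
        }
        state[node] = 2;
    }
}
```

The resolved index lists are the edges a DAG scheduler needs: a task becomes runnable once every index in its list has completed.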

WebSocket run submission

REST isn’t the only submission channel. WebSocket clients can submit runs using the run_request message — useful for browser-based UIs that already have a dashboard connection:

{
  "type": "run_request",
  "requestId": "req-1",
  "inputs": { "topic": "AI safety" },
  "tags": { "env": "staging" }
}

The server acknowledges immediately with run_ack. On completion it sends run_result to the originating session only — the existing ensemble_completed broadcast continues to go to all connected clients unchanged.

Phase 3: Run Control

Two operations that apply to in-flight runs.

Cancellation

POST /api/runs/{runId}/cancel cancels a running or accepted run. This is cooperative cancellation — the current in-flight task completes normally; cancellation takes effect before the next task starts.

{ "runId": "run-abc", "status": "CANCELLING" }

The same operation is available over WebSocket: { "type": "run_control", "runId": "run-abc", "action": "cancel" }.

The cooperative model is intentional. A task in mid-execution is typically in the middle of an LLM call, and interrupting it immediately would leave the ensemble in an undefined state. Completing the current task and stopping cleanly at the boundary gives deterministic behavior without losing progress already made.
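The task-boundary behavior can be illustrated with a flag that the run loop consults only between tasks. This is a simplified model of cooperative cancellation, not the framework's scheduler; CooperativeRun is a hypothetical name and tasks are reduced to Supplier<String>.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

final class CooperativeRun {

    private final AtomicBoolean cancelRequested = new AtomicBoolean(false);

    /** Called from the control plane, for example by a cancel request handler. */
    void cancel() {
        cancelRequested.set(true);
    }

    /** Runs tasks in order; the flag is consulted only at task boundaries. */
    List<String> run(List<Supplier<String>> tasks) {
        List<String> outputs = new ArrayList<>();
        for (Supplier<String> task : tasks) {
            if (cancelRequested.get()) {
                break;                   // stop before starting the next task
            }
            outputs.add(task.get());     // a task that has started always runs to completion
        }
        return outputs;
    }
}
```

If cancel() arrives while the first task is executing, that task's output is still recorded and the second task never starts, which matches the "deterministic stop at the boundary" behavior described above.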

Mid-run model switching

POST /api/runs/{runId}/model switches which LLM subsequent tasks will use:

{ "model": "haiku" }

The switch takes effect on the next LLM call; the in-flight call completes with the previous model. The model alias must be registered in the ModelCatalog. This is useful when a long-running ensemble is partway through and you want subsequent tasks to use a cheaper or faster model.

Phase 4: Event Streaming

The existing WebSocket dashboard broadcasts all execution events to all connected sessions. Phase 4 adds filtering and an HTTP-native alternative.

Subscription filtering

WebSocket clients can subscribe to a specific subset of events:

{ "type": "subscribe", "events": ["task_started", "task_completed", "run_result"] }

Or filter to a specific run:

{ "type": "subscribe", "events": ["run_result"], "runId": "run-abc" }

Reset to all events with "events": ["*"]. The server responds with a subscribe_ack confirming the effective subscription.
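On the server side, a subscription like this reduces to a predicate evaluated per session before each event is sent. The sketch below shows that predicate under the assumption that an absent runId means "all runs"; EventSubscription is a hypothetical name, not a class from the library.

```java
import java.util.Set;

final class EventSubscription {

    private final Set<String> events;
    private final String runId; // null means "events from any run"

    EventSubscription(Set<String> events, String runId) {
        this.events = Set.copyOf(events);
        this.runId = runId;
    }

    /** True when an event of this type, emitted by this run, should reach the session. */
    boolean accepts(String eventType, String eventRunId) {
        boolean typeMatches = events.contains("*") || events.contains(eventType);
        boolean runMatches = runId == null || runId.equals(eventRunId);
        return typeMatches && runMatches;
    }
}
```

Under this model, a session that has not sent a subscribe message behaves like new EventSubscription(Set.of("*"), null), which preserves the existing broadcast-everything behavior.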

SSE streaming

For HTTP-only clients — curl scripts, serverless functions, server-side integrations — a WebSocket connection is awkward. The SSE endpoint offers the same event stream over a regular HTTP connection:

GET /api/runs/{runId}/events
Accept: text/event-stream

For completed runs, stored events replay immediately and the connection closes. For in-progress runs, events stream until the run completes. A from parameter supports reconnection by resuming from a specific position in the stored event sequence.
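Consuming the stream from Java needs nothing beyond a line-oriented reader and a small parser for the text/event-stream framing. The sketch below follows the standard SSE rules (blank line dispatches an event, lines starting with a colon are comments, multiple data lines are joined with newlines). SseParser is a hypothetical name, and the event names and JSON payloads used with it are assumptions about what the endpoint emits.

```java
import java.util.ArrayList;
import java.util.List;

final class SseParser {

    record Event(String type, String data) {}

    /** Parses text/event-stream lines into events; an unterminated trailing event is dropped. */
    static List<Event> parse(Iterable<String> lines) {
        List<Event> events = new ArrayList<>();
        String type = "message";                 // SSE default when no "event:" field is sent
        List<String> data = new ArrayList<>();
        for (String line : lines) {
            if (line.isEmpty()) {                // a blank line dispatches the buffered event
                if (!data.isEmpty()) {
                    events.add(new Event(type, String.join("\n", data)));
                }
                type = "message";
                data.clear();
            } else if (line.startsWith(":")) {
                // comment or keep-alive line: ignored
            } else if (line.startsWith("event:")) {
                type = value(line, "event:");
            } else if (line.startsWith("data:")) {
                data.add(value(line, "data:"));
            }
            // other fields (id:, retry:) are ignored in this sketch
        }
        return events;
    }

    private static String value(String line, String field) {
        String raw = line.substring(field.length());
        return raw.startsWith(" ") ? raw.substring(1) : raw; // one leading space is stripped
    }
}
```

With the JDK client, HttpResponse.BodyHandlers.ofLines() yields a Stream<String> whose iterator can be passed in as response.body()::iterator. Counting the events received gives a client something to send back in the from parameter after a dropped connection, assuming positions are event offsets.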

Phase 5: Completing the Control Loop

Phase 5 rounds out the API with three operations that were previously only available through the WebSocket dashboard or by interacting with a running Java process directly.

REST review decisions

The human-in-the-loop system generates review gates where a reviewer approves, edits, or rejects task output before the ensemble proceeds. Phase 5 exposes this over REST, so server-side systems (Slack bots, CI pipelines) can automate or route review decisions:

POST /api/reviews/{reviewId}
{ "decision": "CONTINUE" }

For edits:

{ "decision": "EDIT", "revisedOutput": "Updated output..." }

Discover pending reviews:

GET /api/reviews
GET /api/reviews?runId=run-abc

Context injection

Inject a directive into a running ensemble’s DirectiveStore. The directive is picked up on the next LLM iteration of any agent in the ensemble:

POST /api/runs/{runId}/inject
{ "content": "Focus on EU AI Act compliance", "target": "researcher" }

This is the REST equivalent of what the dashboard allows through the live run view — useful for server-side automation that needs to steer a run mid-execution.

Direct tool invocation

Execute a registered tool from the ToolCatalog without running a full ensemble:

POST /api/tools/calculator/invoke
{ "input": "What is 42 * 17?" }

Response:

{ "tool": "calculator", "status": "SUCCESS", "output": "714", "durationMs": 2 }

This is useful for integration testing, for validating tool configuration, and for pipeline steps that need a single tool call without the overhead of an ensemble run.

The Design Tension

The interesting question in a feature like this is where the boundary sits between the control plane and the data plane.

The v3 network module already has capability queries (CapabilityQueryMessage), task delegation (NetworkTask/NetworkTool), and directives (DirectiveMessage). The Control API exposes similar operations — but over HTTP, for a different audience, with different security and access semantics.

The key distinction is the audience. CI pipelines, orchestrators, and custom UIs are operators, not ensemble peers: they should not need a WebSocket client or an understanding of the ensemble networking protocol. The REST-first design, catalog-enforced allowlists, and graduated Level 1/2/3 submission model reflect that distinction throughout.

The Ensemble Control API is documented in the control API guide. The underlying design doc is design/28. Source is on GitHub.

I’d be interested in where the three-level submission model feels right or falls short. The boundary between Level 2 (override existing tasks) and Level 3 (define new tasks) is where the most design tension sits — curious whether that separation is useful or whether most real use cases collapse to one or the other.

Running Agent Tasks as Temporal Activities

Tue, 02 Jun 2026 00:00:00 GMT

If you’re running Temporal in production, you’ve already solved the hard parts of long-running workflow orchestration: durable execution, activity retries, heartbeating, workflow history, and cross-service coordination. The question is how agent tasks fit into that model.

The obvious answer — run AgentEnsemble as a separate service and call it over HTTP from Temporal activities — introduces latency, network failure modes, and another process to operate. A less obvious answer is that the two systems don’t need to be separated at all.

The agentensemble-executor module lets you call AgentEnsemble tasks directly in-process from any Temporal activity. No HTTP server. No Temporal SDK dependency inside the library. Just a Java call.

Two Execution Modes

The module provides two executors with different granularity:

Class	Granularity	When to use
`TaskExecutor`	One task = one external activity	Per-task Temporal retry, timeout, and heartbeat
`EnsembleExecutor`	One ensemble = one external activity	Simpler pipelines; AgentEnsemble handles internal orchestration inside a single activity

TaskExecutor is the recommended pattern when you want Temporal to own the retry and timeout semantics for individual AI steps. EnsembleExecutor is simpler when the pipeline is short and internal retry is not a concern.

Heartbeats Work

A common concern when embedding long-running work inside a Temporal activity is heartbeating. If the activity doesn’t heartbeat frequently enough, Temporal marks it as timed out.

HeartbeatEnsembleListener bridges EnsembleListener lifecycle events to any Consumer<Object>. Passing Temporal’s heartbeat method as the consumer is one line:

return executor.execute(request, Activity.getExecutionContext()::heartbeat);

The consumer fires on task_started, task_completed, tool_call, and llm_iteration_started — frequently enough that a 2-minute heartbeat window is generous for typical agent workloads. The heartbeat payload is a HeartbeatDetail record serializable by Temporal’s default Jackson DataConverter, so it’s visible in the Temporal UI and accessible via Activity.getLastHeartbeatDetails().

A Full Temporal Integration

The recommended pattern wraps each AgentEnsemble task as a separate @ActivityMethod:

@ActivityInterface
public interface ResearchPipelineActivity {
    @ActivityMethod TaskResult research(TaskRequest request);
    @ActivityMethod TaskResult write(TaskRequest request);
}

public class ResearchPipelineActivityImpl implements ResearchPipelineActivity {

    private final TaskExecutor executor;

    /** Production constructor. */
    public ResearchPipelineActivityImpl() {
        this(new TaskExecutor(
            SimpleModelProvider.of(
                OpenAiChatModel.builder()
                    .apiKey(System.getenv("OPENAI_API_KEY"))
                    .modelName("gpt-4o-mini")
                    .build()),
            SimpleToolProvider.builder()
                .tool("web-search", new WebSearchTool(System.getenv("SEARCH_API_KEY")))
                .build()));
    }

    /** Package-private constructor for testing -- accepts FakeTaskExecutor. */
    ResearchPipelineActivityImpl(TaskExecutor executor) {
        this.executor = executor;
    }

    @Override
    public TaskResult research(TaskRequest request) {
        return executor.execute(request, Activity.getExecutionContext()::heartbeat);
    }

    @Override
    public TaskResult write(TaskRequest request) {
        return executor.execute(request, Activity.getExecutionContext()::heartbeat);
    }
}

The workflow sequences activities and passes upstream outputs as context entries:

public class ResearchWorkflowImpl implements ResearchWorkflow {

    private final ResearchPipelineActivity activity =
        Workflow.newActivityStub(ResearchPipelineActivity.class,
            ActivityOptions.newBuilder()
                .setScheduleToCloseTimeout(Duration.ofMinutes(30))
                .setHeartbeatTimeout(Duration.ofMinutes(2))
                .setRetryOptions(RetryOptions.newBuilder()
                    .setMaximumAttempts(3)
                    .build())
                .build());

    @Override
    public String run(String topic) {

        TaskResult research = activity.research(
            TaskRequest.builder()
                .description("Research the latest developments in {topic}")
                .expectedOutput("A comprehensive, accurate research summary")
                .agent(AgentSpec.builder()
                    .role("Research Analyst")
                    .goal("Find accurate, up-to-date information on any topic")
                    .toolNames(List.of("web-search"))
                    .build())
                .inputs(Map.of("topic", topic))
                .build());

        TaskResult article = activity.write(
            TaskRequest.builder()
                .description("Write a blog post about {topic} using this research: {research}")
                .expectedOutput("A well-structured, engaging 500-word blog post")
                .agent(AgentSpec.of("Technical Writer", "Write clear, compelling content"))
                .context(Map.of("research", research.output()))
                .inputs(Map.of("topic", topic))
                .build());

        return article.output();
    }
}

Temporal handles sequencing, retry, and timeout. AgentEnsemble handles LLM calls, tool execution, and the ReAct loop. Each concern stays in the system designed for it.

Testing Without LLM Calls

Both executors ship with test doubles — FakeTaskExecutor and FakeEnsembleExecutor — that can be injected without any LLM calls:

FakeTaskExecutor fake = FakeTaskExecutor.builder()
    .whenDescriptionContains("Research", "AI is advancing rapidly in 2026.")
    .whenDescriptionContains("Write", "Article: AI reshapes every industry.")
    .build();

ResearchPipelineActivityImpl activity = new ResearchPipelineActivityImpl(fake);

Combined with Temporal’s TestWorkflowEnvironment, this lets you run the full workflow in fast, deterministic tests:

@BeforeEach
void setUp() {
    testEnv = TestWorkflowEnvironment.newInstance();
    FakeTaskExecutor fake = FakeTaskExecutor.builder()
        .whenDescriptionContains("Research", "Research done: AI grows 40% YoY.")
        .whenDescriptionContains("Write", "Article: AI is reshaping every industry.")
        .build();

    Worker worker = testEnv.newWorker(TASK_QUEUE);
    worker.registerWorkflowImplementationTypes(ResearchWorkflowImpl.class);
    worker.registerActivitiesImplementations(new ResearchPipelineActivityImpl(fake));
    testEnv.start();
}

@Test
void run_sequencesResearchThenWrite_returnsArticleOutput() {
    ResearchWorkflow workflow = testEnv.newWorkflowStub(ResearchWorkflow.class,
        WorkflowOptions.newBuilder().setTaskQueue(TASK_QUEUE).build());

    String result = workflow.run("Artificial Intelligence");

    assertThat(result).isEqualTo("Article: AI is reshaping every industry.");
}

Model Selection at Request Time

Models and tools are configured on the worker side and never serialized into workflow history. A modelName in a TaskRequest selects a specific model at request time:

ModelProvider models = SimpleModelProvider.builder()
    .model("gpt-4o-mini", cheapModel)
    .model("gpt-4o", premiumModel)
    .defaultModel(cheapModel)
    .build();

// In the workflow:
TaskRequest.builder()
    .description("Synthesize the final executive summary")
    .modelName("gpt-4o")        // resolved by the worker's ModelProvider at run time
    .agent(AgentSpec.of("Executive Synthesizer", "Produce board-level summaries"))
    .build();

Not Temporal-Specific

The heartbeat consumer is a plain Consumer<Object>. The agentensemble-executor module has no Temporal SDK dependency. The same executors work with any external orchestrator:

AWS Step Functions — pass a heartbeat callback to a state machine activity poller
Kafka Streams — call execute() inside a Processor
Spring Batch — wrap in a Tasklet
Plain threads — pass null for no heartbeating

The Design Tradeoff

Task-per-activity gives you more operational visibility — each task is a separate entry in the Temporal UI, with its own retry history and timeout. Ensemble-per-activity is simpler to write but treats the entire pipeline as a black box from Temporal’s perspective.

The deeper tradeoff is about where you want the orchestration intelligence to live. If your Temporal workflows are already sophisticated — routing between task types, branching on outcomes, passing context between many steps — then task-per-activity is the natural fit. If AgentEnsemble’s phase grouping, DAG parallelism, or phase review gates are doing the interesting coordination work, then ensemble-per-activity keeps that logic inside the framework and Temporal handles only the outer lifecycle.

The executor module is documented in the integration guide. Source is on GitHub.

I’d be interested in whether the two-mode design maps cleanly to your Temporal workflows, or whether there are integration patterns that don’t fit either executor.

Error Handling in Agent Systems: Exception Hierarchies, Partial Results, and Exit Reasons

Sun, 31 May 2026 00:00:00 GMT

Agent systems fail in ways that traditional software does not. An LLM might return an unparseable response. A tool call might time out. An agent might enter an infinite ReAct loop. A human reviewer might walk away from an approval gate. A task might succeed but produce output that a downstream task cannot use.

The interesting problem is not preventing these failures — some are inherent to non-deterministic systems. The interesting problem is giving operators enough information to handle them gracefully: what failed, what succeeded before the failure, and what the system’s terminal state actually is.

The Exception Hierarchy

AgentEnsemble uses a hierarchy of unchecked exceptions rooted at AgentEnsembleException. Every exception the framework throws extends this base, so you can catch everything with a single catch block or handle specific cases individually.

AgentEnsembleException (base)
  ValidationException             -- invalid configuration at build/run time
  TaskExecutionException          -- a task failed during execution
  AgentExecutionException         -- an LLM call failed
  MaxIterationsExceededException  -- agent exceeded its tool-call limit
  PromptTemplateException         -- unresolved template variables
  ToolExecutionException          -- a tool call failed
  ConstraintViolationException    -- required workers were not called
  GuardrailViolationException     -- a guardrail blocked execution

The hierarchy matters because different failure types require different responses. A ValidationException means your configuration is wrong — no LLM was ever called, and the fix is in the code. A TaskExecutionException means the pipeline started but a task failed — partial results may be available. A MaxIterationsExceededException means an agent got stuck in a tool-calling loop — the fix might be fewer tools or a higher iteration limit.

Partial Results on Failure

When a multi-task pipeline fails partway through, the work completed before the failure is not discarded. TaskExecutionException carries a list of TaskOutput objects for tasks that completed before the failure:

try {
    EnsembleOutput output = ensemble.run(inputs);
    saveResults(output);
} catch (TaskExecutionException e) {
    // Save whatever was completed before the failure
    for (TaskOutput partial : e.getCompletedTaskOutputs()) {
        savePartialResult(partial);
    }
    alertOnFailure(e.getTaskDescription(), e.getAgentRole());
}

This is operationally significant. In a five-task pipeline where task four fails, you still have the outputs of tasks one through three. You can save them, display them to a user, or use them to resume the pipeline from where it left off.

Exit Reasons

Not every non-completion is an error. EnsembleOutput.getExitReason() distinguishes between four terminal states:

Exit Reason	Meaning
`COMPLETED`	All tasks ran to completion normally
`USER_EXIT_EARLY`	A human reviewer chose to stop the pipeline
`TIMEOUT`	A review gate timeout expired
`ERROR`	An unrecoverable exception terminated the pipeline

EnsembleOutput output = ensemble.run();

switch (output.getExitReason()) {
    case COMPLETED:
        System.out.println("All done: " + output.getRaw());
        break;
    case USER_EXIT_EARLY:
        System.out.println("User stopped after "
            + output.completedTasks().size() + " task(s)");
        break;
    case TIMEOUT:
        System.out.println("Review gate timed out");
        break;
    case ERROR:
        // Typically handled via exception
        break;
}

The distinction between USER_EXIT_EARLY and TIMEOUT matters for operational dashboards. A user exit is intentional — the pipeline did its job and the human made a decision. A timeout might indicate a process problem (reviewer was not available) and may need escalation.

Specific Exception Types

ValidationException

Thrown before any LLM calls when the ensemble or its components are configured incorrectly. Common causes include missing required fields, tasks referencing unregistered agents, circular context dependencies, or invalid iteration limits.

This exception is your build-time safety net. If you see it, the fix is always in the configuration code.

AgentExecutionException

Thrown when the LLM call itself fails — network errors, API errors, rate limiting, timeouts. Contains the agent role and task description so you can route the failure to the right team.

MaxIterationsExceededException

Thrown when an agent exceeds its maxIterations limit during the ReAct loop. Contains both the configured limit and the actual iteration count.

This is often a sign that the agent has too many tools and is cycling between them without making progress. The fix is usually to reduce the tool set, make tool descriptions more specific, or increase the iteration limit if the task genuinely requires many tool calls.

PromptTemplateException

Thrown when a task description contains {variable} placeholders that were not resolved. The exception lists the missing variable names, making it straightforward to fix.
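The "collect every missing name, then fail once" behavior is easy to picture with a small resolver. This is an illustrative sketch rather than the framework's template engine: TemplateResolver and MissingVariablesException are hypothetical stand-ins, and the placeholder grammar is assumed to be a word inside braces.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemplateResolver {

    // Assumption: placeholders are word characters wrapped in braces, e.g. {topic}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    /** Hypothetical stand-in for PromptTemplateException; carries every missing name. */
    static final class MissingVariablesException extends RuntimeException {
        final Set<String> missing;

        MissingVariablesException(Set<String> missing) {
            super("Unresolved template variables: " + missing);
            this.missing = missing;
        }
    }

    static String resolve(String template, Map<String, String> inputs) {
        Set<String> missing = new LinkedHashSet<>();
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String value = inputs.get(m.group(1));
            if (value == null) {
                missing.add(m.group(1));         // keep scanning so all gaps are reported together
            }
            String replacement = value == null ? m.group() : value;
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        if (!missing.isEmpty()) {
            throw new MissingVariablesException(missing);
        }
        return out.toString();
    }
}
```

Reporting all missing names in one exception saves the caller from a fix-one, rerun, fix-the-next cycle when several inputs were forgotten.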

GuardrailViolationException

Thrown when an input or output guardrail blocks execution. Contains the guardrail type (INPUT or OUTPUT), the violation message, the task description, and the agent role. This integrates with the guardrail system covered in the previous post.

The Retry Question

AgentEnsemble does not include built-in retry logic. This is a deliberate design choice.

The reasoning is that retry policies are highly context-dependent. A rate-limited API call might benefit from exponential backoff. A malformed LLM response might benefit from a retry with the same prompt. A task that failed because the model cannot perform the requested work should not be retried at all.

For transient failures, implement retry at the call site:

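int maxAttempts = 3;
EnsembleOutput output = null;

for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
        output = ensemble.run(inputs);
        break;
    } catch (AgentExecutionException e) {
        // Out of attempts: surface the last failure to the caller
        if (attempt == maxAttempts) throw e;
        try {
            // Linear backoff: 1s after the first failure, 2s after the second
            Thread.sleep(1000L * attempt);
        } catch (InterruptedException interrupted) {
            // Restore the interrupt flag and stop retrying
            Thread.currentThread().interrupt();
            throw e;
        }
    }
}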

For production use, consider integrating a resilience library such as Resilience4j, which provides circuit breakers, rate limiters, and retry policies that compose well with the exception hierarchy.
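Short of adopting a library, the "which failures are worth retrying" decision can be isolated in one helper that takes the policy as a predicate. The sketch below is plain JDK code and does not use Resilience4j; Retrier is a hypothetical name, and the doubling backoff is one reasonable choice rather than a recommendation from the framework.

```java
import java.time.Duration;
import java.util.function.Predicate;
import java.util.function.Supplier;

final class Retrier {

    /** Retries action while the thrown exception is retryable and attempts remain. */
    static <T> T retry(Supplier<T> action, int maxAttempts, Duration baseDelay,
                       Predicate<RuntimeException> retryable) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e;             // non-retryable, or out of attempts
                }
                // Exponential backoff: 1x, 2x, 4x, ... of baseDelay (shift capped to avoid overflow)
                sleep(baseDelay.multipliedBy(1L << Math.min(attempt - 1, 20)));
            }
        }
    }

    private static void sleep(Duration delay) {
        try {
            Thread.sleep(delay.toMillis());
        } catch (InterruptedException interrupted) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while backing off", interrupted);
        }
    }
}
```

With the exception hierarchy described earlier, a caller might pass e -> e instanceof AgentExecutionException as the predicate, so that a ValidationException or MaxIterationsExceededException fails immediately instead of burning retries.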

The Operational Model

The error handling design reflects a particular view of how agent systems should be operated: failures are expected, partial results are valuable, and the framework should give you structured information rather than opaque error strings.

The exception hierarchy makes it possible to build monitoring and alerting that distinguishes between configuration errors (fix the code), transient failures (retry or escalate), agent loops (tune the workflow), and intentional stops (human decision). The partial result preservation makes it possible to build resumable pipelines. The exit reasons make it possible to build dashboards that accurately represent pipeline outcomes.

None of this prevents failures. It gives you the handles to respond to them systematically.

The full error handling guide is in the documentation.

I’d be interested in whether you have found the exception hierarchy granularity to be sufficient, or whether there are failure modes in your agent systems that do not map cleanly to these categories.

Scoped Memory for Agent Systems: Cross-Run Persistence Without Global State

Fri, 29 May 2026 00:00:00 GMT

Most agent frameworks treat each run as stateless. The agent starts fresh, does its work, and the output is consumed by whatever called it. If you run the same workflow again next week, the agent has no memory of what it produced last time.

For some use cases that is fine. For others — recurring research tasks, iterative drafting, accumulated domain knowledge — you want the agent to remember what it learned in previous runs and build on it.

The question is how to add cross-run memory without introducing global shared state that makes the system hard to reason about.

Named Scopes as the Isolation Mechanism

AgentEnsemble uses named memory scopes. Each task declares which scopes it reads from and writes to. A task can only see memory from scopes it explicitly declares.

MemoryStore store = MemoryStore.inMemory();

Task researchTask = Task.builder()
    .description("Research current AI trends")
    .expectedOutput("A research report")
    .agent(researcher)
    .memory("ai-research")
    .build();

Ensemble.builder()
    .agent(researcher)
    .task(researchTask)
    .memoryStore(store)
    .build()
    .run();

After the run, the task’s output is stored in the "ai-research" scope. On a second run with the same store, the agent’s prompt automatically includes entries from the first run under a ## Memory: ai-research section.

The scope name is the isolation boundary. Task A storing into "research" and task B declaring only "drafts" means task B never sees task A’s output. This is not a security mechanism — it is an attention mechanism. It controls what context an agent receives, keeping prompts focused on relevant history rather than everything that ever happened.

How It Works at the Prompt Level

The mechanics are straightforward:

At task startup, the framework retrieves entries from every declared scope and injects them into the agent’s prompt.
At task completion, the framework stores the task output into every declared scope.
Because entries persist in the MemoryStore across runs, agents in later runs automatically see outputs from earlier runs.

The prompt injection looks like this:

## Memory: ai-project
The following information from scope "ai-project" may be relevant:

---
Research findings from previous run: AI is accelerating in healthcare...
---

## Task
Analyse the research findings

There is no magic retrieval. The framework puts the memory content into the prompt, and the LLM uses it (or ignores it) during reasoning.
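
To make the three steps concrete, here is a minimal sketch of the mechanism in plain Java. It is not the library's implementation: ScopedMemorySketch, store, and buildPrompt are hypothetical names, and the rendered layout only approximates the prompt shown above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a scoped store; not the library's MemoryStore.
final class ScopedMemorySketch {
    private final Map<String, List<String>> entriesByScope = new LinkedHashMap<>();

    // At task completion: store the output into every scope the task declared.
    void store(List<String> declaredScopes, String taskOutput) {
        for (String scope : declaredScopes) {
            entriesByScope.computeIfAbsent(scope, s -> new ArrayList<>()).add(taskOutput);
        }
    }

    // At task startup: render only the declared scopes into the prompt.
    String buildPrompt(List<String> declaredScopes, String taskDescription) {
        StringBuilder prompt = new StringBuilder();
        for (String scope : declaredScopes) {
            List<String> entries = entriesByScope.getOrDefault(scope, List.of());
            if (entries.isEmpty()) {
                continue; // nothing stored yet, so no memory section for this scope
            }
            prompt.append("## Memory: ").append(scope).append('\n')
                  .append("The following information from scope \"").append(scope)
                  .append("\" may be relevant:\n\n");
            for (String entry : entries) {
                prompt.append("---\n").append(entry).append('\n');
            }
            prompt.append("---\n\n");
        }
        return prompt.append("## Task\n").append(taskDescription).toString();
    }
}
```

A task that declares only "drafts" never receives entries stored under "research", because buildPrompt only reads the scopes passed to it.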

Pluggable Storage

MemoryStore has two built-in implementations:

The in-memory implementation stores entries in insertion order per scope. Retrieval returns the most recent entries without semantic search. It is suitable for development, testing, and single-JVM runs, but entries do not survive JVM restarts.

MemoryStore store = MemoryStore.inMemory();

The embedding-based implementation embeds each entry with an embedding model and retrieves entries by semantic similarity search. The backing EmbeddingStore controls durability: Chroma, Qdrant, Pinecone, pgvector, or any other LangChain4j-compatible store.

EmbeddingModel embeddingModel = OpenAiEmbeddingModel.builder()
    .apiKey(System.getenv("OPENAI_API_KEY"))
    .modelName("text-embedding-3-small")
    .build();

EmbeddingStore<TextSegment> embeddingStore = ChromaEmbeddingStore.builder()
    .baseUrl("http://localhost:8000")
    .collectionName("agentensemble-memory")
    .build();

MemoryStore store = MemoryStore.embeddings(embeddingModel, embeddingStore);

The design tradeoff is explicit. In-memory is fast and simple but loses data on restart and does not do semantic retrieval. Embedding-based is semantically aware and as durable as its backing store, but it requires an embedding model and a vector store. You choose based on your operational requirements.

Eviction Policies

Unbounded memory is a prompt-size problem. Every stored entry adds tokens to the next run’s prompt. Scopes support optional eviction to keep sizes bounded:

// Retain only the 5 most recent entries
MemoryScope.builder()
    .name("research")
    .keepLastEntries(5)
    .build()

// Retain only entries from the past 7 days
MemoryScope.builder()
    .name("research")
    .keepEntriesWithin(Duration.ofDays(7))
    .build()

Eviction is applied after each task stores its output. For embedding-based stores, eviction is a no-op since most embedding stores do not support deletion of individual entries.
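
The two policies are simple to state in code. The sketch below shows count-based and time-based eviction over a list of timestamped entries. TimedEntry and EvictionSketch are hypothetical names used for illustration, and the library's internals may differ.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical entry type; not the library's memory entry class.
record TimedEntry(String content, Instant storedAt) {}

final class EvictionSketch {
    // Count-based: keep only the n most recent entries (list is in insertion order).
    static List<TimedEntry> keepLast(List<TimedEntry> entries, int n) {
        int from = Math.max(0, entries.size() - n);
        return new ArrayList<>(entries.subList(from, entries.size()));
    }

    // Time-based: keep only entries stored within the window that ends at 'now'.
    static List<TimedEntry> keepWithin(List<TimedEntry> entries, Duration window, Instant now) {
        Instant cutoff = now.minus(window);
        List<TimedEntry> kept = new ArrayList<>();
        for (TimedEntry entry : entries) {
            if (!entry.storedAt().isBefore(cutoff)) {
                kept.add(entry);
            }
        }
        return kept;
    }
}
```

Passing the current time in as a parameter, rather than calling Instant.now() inside the method, keeps the time-based policy deterministic in tests.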

MemoryTool: Agent-Driven Memory Access

In addition to the automatic scope-based mechanism, agents can interact with memory directly during their ReAct loop using MemoryTool:

Agent researcher = Agent.builder()
    .role("Researcher")
    .goal("Research and remember important facts")
    .tools(MemoryTool.of("research", store))
    .build();

MemoryTool provides two tool methods the LLM can call: storeMemory(key, value) to store an arbitrary fact, and retrieveMemory(query) to retrieve relevant memories by query.

When the same MemoryStore instance is used for both MemoryTool and Ensemble.builder().memoryStore(...), explicit tool access and automatic scope-based access share the same backing store. This means an agent can both receive automatic context from previous runs and actively query or store additional facts during execution.

Multiple tasks can declare the same scope name. Each task writes its output to the scope after it completes, so later tasks in a sequential workflow see earlier tasks’ outputs:

Task research = Task.builder()
    .description("Research AI trends")
    .memory("ai-project")
    .build();

Task analysis = Task.builder()
    .description("Analyse the research findings")
    .memory("ai-project")
    .build();

Ensemble.builder()
    .task(research)
    .task(analysis)
    .memoryStore(store)
    .build()
    .run();

This is within-run memory sharing. The analysis task sees the research task’s output because they share the "ai-project" scope. On the next run, both tasks see outputs from the previous run’s research and analysis.

The Design Principle

The key design decision is that memory is opt-in and scoped, not global and automatic. An agent does not remember everything by default. Each task explicitly declares what it wants to remember and what it wants to recall.

This makes the system easier to reason about. You can look at a task definition and know exactly what memory context it will receive. You can test a task with a pre-populated store and verify that it uses the memory correctly. You can clear a scope without affecting other scopes.

The tradeoff is that you have to think about memory design upfront. Which tasks share scopes? How many entries should be retained? Should you use semantic search or recency-based retrieval? These are design decisions that the framework surfaces explicitly rather than hiding behind defaults.

The full memory guide is in the documentation.

I’d be interested in how you handle the prompt-size tension — whether bounded eviction is sufficient, or whether you have needed more sophisticated retrieval strategies for production memory systems.

Tool Pipelines: Eliminating LLM Round-Trips for Deterministic Tool Chains

Wed, 27 May 2026 00:00:00 GMT

In a standard ReAct loop, every tool call requires an LLM round-trip. The agent calls a search tool, receives results, reasons about them, calls a filter tool, receives filtered output, reasons again, calls a format tool, and so on. Each step costs tokens, adds latency, and requires the LLM to make a decision that is often trivial — the next step in the chain is predetermined.

For deterministic data transformation chains, the LLM adds no reasoning value between steps. It just passes the output of one tool as input to the next. The interesting question is whether you can collapse that chain into a single tool call.

The ToolPipeline Abstraction

AgentEnsemble provides ToolPipeline, which chains multiple tools into a single compound tool. The LLM calls it once; all steps execute sequentially without LLM round-trips between them.

// Standard ReAct loop (3 LLM round-trips for tool mediation):
LLM -> search_tool -> LLM -> filter_tool -> LLM -> format_tool -> LLM

// With ToolPipeline (0 extra round-trips):
LLM -> search_then_filter_then_format -> LLM

The simplest way to create one:

ToolPipeline pipeline = ToolPipeline.of(
    new WebSearchTool(provider),
    new JsonParserTool(),
    FileWriteTool.of(outputPath)
);
// name: "web_search_then_json_parser_then_file_write"

var task = Task.builder()
    .description("Research AI trends and save the top result to disk")
    .expectedOutput("Confirmation that the result was saved")
    .tools(List.of(pipeline))
    .build();

Data Flow and Adapters

By default, ToolResult.getOutput() from step N is passed as the input to step N+1. This works when tool outputs are directly consumable by the next tool.

When you need to reshape data between steps, attach an adapter:

ToolPipeline pipeline = ToolPipeline.builder()
    .name("extract_and_calculate")
    .description("Extract a numeric field from JSON and apply a formula")
    .step(new JsonParserTool())
    .adapter(result -> result.getOutput() + " * 1.1")
    .step(new CalculatorTool())
    .build();

The adapter transforms the JsonParserTool output (e.g., "149.99") into a calculator expression ("149.99 * 1.1") before passing it to CalculatorTool. Adapters have full access to ToolResult, including getStructuredOutput() for typed payloads.

This is the key design decision: adapters are plain Java functions, not LLM calls. They handle the deterministic reshaping that the LLM would otherwise do at full inference cost.
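
To illustrate why this is cheap, here is a minimal sketch of the data flow, assuming tools and adapters can both be modelled as string-to-string functions. PipelineSketch and its toy parser and calculator are hypothetical stand-ins, not the library's ToolPipeline, JsonParserTool, or CalculatorTool.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of step chaining with adapters; not the library's ToolPipeline.
final class PipelineSketch {
    private final List<UnaryOperator<String>> stages = new ArrayList<>();

    // A step stands in for a tool: text in, text out.
    PipelineSketch step(UnaryOperator<String> tool) {
        stages.add(tool);
        return this;
    }

    // An adapter is just another function applied to the previous output.
    PipelineSketch adapter(UnaryOperator<String> reshape) {
        stages.add(reshape);
        return this;
    }

    // Output of stage N becomes the input of stage N+1; no model call in between.
    String execute(String input) {
        String current = input;
        for (UnaryOperator<String> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    // Usage mirroring the extract-and-calculate example, with toy stand-in tools.
    static PipelineSketch extractAndCalculate() {
        return new PipelineSketch()
            .step(json -> json.replace("{\"price\":", "").replace("}", "")) // toy parser
            .adapter(value -> value + " * 1.1")                               // reshape
            .step(expr -> {                                                   // toy calculator
                String[] operands = expr.split(" \\* ");
                return new BigDecimal(operands[0])
                    .multiply(new BigDecimal(operands[1]))
                    .toPlainString();
            });
    }
}
```

Calling extractAndCalculate().execute("{\"price\":149.99}") runs all three stages in one pass and returns "164.989".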

Error Strategies

Pipelines support two error strategies:

FAIL_FAST (default) stops the pipeline on the first failed step and returns that failure to the LLM immediately. Subsequent steps are never executed.

CONTINUE_ON_FAILURE continues executing subsequent steps even when an intermediate step fails. The failed step’s error message is forwarded as input to the next step.

ToolPipeline pipeline = ToolPipeline.builder()
    .name("resilient_pipeline")
    .description("Continues even when a step fails")
    .step(stepA)
    .step(stepB)
    .step(stepC)
    .errorStrategy(PipelineErrorStrategy.CONTINUE_ON_FAILURE)
    .build();

The choice between them depends on whether downstream steps can recover from upstream failures. For a search-then-save pipeline, FAIL_FAST makes sense — there is nothing to save if the search failed. For a multi-source aggregation, CONTINUE_ON_FAILURE lets the pipeline produce partial results.
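
The difference between the two strategies comes down to one branch in the execution loop. The sketch below assumes a simplified result type. StepResult, ErrorStrategySketch, and ErrorStrategyRunner are hypothetical names, and returning the last step's result under CONTINUE_ON_FAILURE is an assumption of this sketch rather than documented library behavior.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical result type; only the two strategy names come from the post.
record StepResult(boolean success, String output) {
    static StepResult ok(String output) { return new StepResult(true, output); }
    static StepResult failed(String message) { return new StepResult(false, message); }
}

enum ErrorStrategySketch { FAIL_FAST, CONTINUE_ON_FAILURE }

final class ErrorStrategyRunner {
    static StepResult run(List<Function<String, StepResult>> steps,
                          ErrorStrategySketch strategy,
                          String input) {
        StepResult last = StepResult.ok(input);
        for (Function<String, StepResult> step : steps) {
            // The next step receives the previous output, or the previous
            // error message if that step failed.
            last = step.apply(last.output());
            if (!last.success() && strategy == ErrorStrategySketch.FAIL_FAST) {
                return last; // later steps never run
            }
        }
        return last;
    }
}
```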

Approval Gates Within Pipelines

Steps inside a pipeline that require human approval will pause mid-pipeline, exactly as if they were standalone tools. The pipeline propagates the ensemble’s ReviewHandler to all nested steps automatically.

ToolPipeline pipeline = ToolPipeline.of(
    new JsonParserTool(),
    FileWriteTool.builder(outputPath)
        .requireApproval(true)
        .build()
);

This means you can build pipelines that include a human checkpoint before a destructive operation (like writing to disk or calling an external API) without losing the token savings for the deterministic steps before the checkpoint.

Nesting and Composition

A ToolPipeline implements AgentTool, so it can be used as a step inside another pipeline:

ToolPipeline inner = ToolPipeline.of("step_a", "desc", toolA, toolB);
ToolPipeline outer = ToolPipeline.of("outer", "desc", inner, toolC);

This lets you build reusable pipeline fragments and compose them into larger chains. Each pipeline records its own aggregate metrics (timing, success/failure counts) in addition to the per-step metrics from individual tools.
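
Nesting works because the pipeline and its steps share one interface, which is the classic composite pattern. A minimal sketch, using hypothetical ToolSketch, SimpleTool, and CompositeTool types rather than the library's AgentTool:

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical common interface for single tools and pipelines alike.
interface ToolSketch {
    String name();
    String execute(String input);
}

// A leaf tool wraps one function.
record SimpleTool(String name, UnaryOperator<String> body) implements ToolSketch {
    @Override
    public String execute(String input) {
        return body.apply(input);
    }
}

// A composite runs its steps in order; each step may itself be a composite.
record CompositeTool(String name, List<ToolSketch> steps) implements ToolSketch {
    @Override
    public String execute(String input) {
        String current = input;
        for (ToolSketch step : steps) {
            current = step.execute(current);
        }
        return current;
    }
}
```

Because CompositeTool is itself a ToolSketch, an inner composite can be passed anywhere a single tool is accepted, and the outer composite does not need to know the difference.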

When to Use Pipelines vs. Separate Tools

The decision boundary is whether the LLM needs to reason between steps.

Use ToolPipeline when steps are deterministic and order-locked — the LLM should not skip or reorder them, and the data transformations between steps are mechanical. The full chain appears as one operation to the LLM.

Use separate tools when the LLM needs to decide which tool to call next based on intermediate results, or when intermediate results are useful for the LLM to see and reason about.

In practice, this means pipelines work well for data retrieval and transformation chains (search, parse, filter, write), while separate tools work better for exploratory workflows where the agent needs to adapt its approach based on what it finds.

The Broader Pattern

ToolPipeline is one instance of a broader design principle in AgentEnsemble: when something is deterministic, do not pay LLM inference costs for it. This same principle appears in deterministic-only orchestration (tasks that never call an LLM), typed tool inputs (schema validation without LLM intervention), and phase-level workflow grouping (execution order declared in code, not negotiated by the LLM).

The common thread is that the framework should handle mechanical work mechanically, and reserve LLM inference for decisions that actually require reasoning.

The full tool pipeline guide is in the documentation.

Curious whether you have seen tool chains where the boundary between “deterministic” and “needs reasoning” is ambiguous, and how you would draw that line.

Guardrails for Agent Output: Pluggable Validation Before and After LLM Calls

Mon, 25 May 2026 00:00:00 GMT

One of the harder problems in agent systems is constraining output quality without turning every prompt into a wall of instructions. You can ask the LLM to stay under 3000 characters, or to always include a conclusion section, or to never mention competitor products. But prompt-based constraints are probabilistic. The LLM might follow them. It might not.

Guardrails are the deterministic layer. They run as Java code before and after the LLM call, and they enforce rules that prompts cannot guarantee.

The Model

AgentEnsemble implements guardrails as two functional interfaces: InputGuardrail and OutputGuardrail. Both return a GuardrailResult — either success or failure with a reason.

Input guardrails run before the LLM is contacted. If any fails, execution stops immediately and the agent’s LLM is never called. Output guardrails run after the agent produces a response (and after structured output parsing, if configured).

InputGuardrail piiGuardrail = input -> {
    String desc = input.taskDescription().toLowerCase();
    if (desc.contains("ssn") || desc.contains("credit card")) {
        return GuardrailResult.failure(
            "Task description may contain personally identifiable information");
    }
    return GuardrailResult.success();
};

OutputGuardrail lengthGuardrail = output -> {
    if (output.rawResponse().length() > 3000) {
        return GuardrailResult.failure(
            "Response is " + output.rawResponse().length()
            + " chars, exceeds limit of 3000");
    }
    return GuardrailResult.success();
};

Both are configured per-task:

var task = Task.builder()
    .description("Write an executive summary")
    .expectedOutput("A concise summary")
    .agent(writer)
    .inputGuardrails(List.of(piiGuardrail))
    .outputGuardrails(List.of(lengthGuardrail))
    .build();

Why Functional Interfaces

The choice to make guardrails functional interfaces rather than annotation-based or configuration-driven has a few practical consequences.

First, guardrails are composable. You can build them from lambdas, combine them, or wrap them in utility methods. A guardrail that checks for PII can be reused across every task in the ensemble without any framework-specific wiring.

Second, they are testable in isolation. A guardrail is a pure function from input to result. You can unit test it without standing up an ensemble or mocking an LLM.

Third, they are stateless by default. Since guardrails may run concurrently (in parallel workflows), stateless lambdas are inherently thread-safe. If you need stateful validation, thread safety is your responsibility.
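
The composability and testability points can be shown without the library at all. The sketch below defines a guardrail-shaped functional interface over plain text, with one combinator and one factory method. CheckResult, TextGuardrail, and, and rejectIf are hypothetical names that mirror the shape described above, not the library's InputGuardrail or OutputGuardrail.

```java
import java.util.function.Predicate;

// Hypothetical result type mirroring success-or-failure-with-reason.
record CheckResult(boolean success, String message) {
    static CheckResult pass() { return new CheckResult(true, ""); }
    static CheckResult fail(String message) { return new CheckResult(false, message); }
}

@FunctionalInterface
interface TextGuardrail {
    CheckResult validate(String text);

    // Composition: run this guardrail, then the next one only if this one passed.
    default TextGuardrail and(TextGuardrail next) {
        return text -> {
            CheckResult first = validate(text);
            return first.success() ? next.validate(text) : first;
        };
    }

    // Utility factory: reject any text that matches the predicate.
    static TextGuardrail rejectIf(Predicate<String> violation, String message) {
        return text -> violation.test(text) ? CheckResult.fail(message) : CheckResult.pass();
    }
}
```

A unit test then needs nothing more than a string and an assertion: build a guardrail with rejectIf, call validate, and check the result. No ensemble and no mocked model are involved.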

What Input Guardrails See

The GuardrailInput record carries everything you need to make a pre-execution decision:

taskDescription() — the task description text
expectedOutput() — the expected output specification
contextOutputs() — outputs from prior context tasks (immutable)
agentRole() — the role of the agent about to execute

This means you can write guardrails that check not just the current task, but the outputs of upstream tasks. For example, a guardrail that rejects a writing task if the research task upstream produced no findings:

InputGuardrail requireResearch = input -> {
    boolean hasResearch = input.contextOutputs().stream()
        .anyMatch(o -> o.getRaw().length() > 100);
    if (!hasResearch) {
        return GuardrailResult.failure("No substantive research output found");
    }
    return GuardrailResult.success();
};

Output Guardrails and Typed Output

When a task uses outputType for structured output, the execution order is:

Input guardrails run (before LLM)
LLM executes and produces raw text
Structured output parsing (JSON extraction + deserialization)
Output guardrails run (with both rawResponse() and parsedOutput() available)

This means output guardrails can inspect the typed Java object directly:

record ResearchReport(String title, List<String> findings, String conclusion) {}

OutputGuardrail findingsGuardrail = output -> {
    if (output.parsedOutput() instanceof ResearchReport report) {
        if (report.findings() == null || report.findings().isEmpty()) {
            return GuardrailResult.failure(
                "Report must include at least one finding");
        }
    }
    return GuardrailResult.success();
};

This is where guardrails and typed outputs reinforce each other. The type system gives you a parsed object; the guardrail gives you a place to enforce business rules on that object.

Multiple Guardrails and Evaluation Order

Multiple guardrails per task are evaluated in order. The first failure stops evaluation — subsequent guardrails are not called.

var task = Task.builder()
    .description("Write an article")
    .expectedOutput("An article")
    .agent(writer)
    .inputGuardrails(List.of(piiGuardrail, roleGuardrail, domainGuardrail))
    .outputGuardrails(List.of(lengthGuardrail, conclusionGuardrail))
    .build();

If you want to collect all failures rather than short-circuit, compose them into a single guardrail:

InputGuardrail compositeGuardrail = input -> {
    List<String> failures = new ArrayList<>();
    for (InputGuardrail g : List.of(piiGuardrail, roleGuardrail)) {
        GuardrailResult r = g.validate(input);
        if (!r.isSuccess()) failures.add(r.getMessage());
    }
    return failures.isEmpty()
        ? GuardrailResult.success()
        : GuardrailResult.failure(String.join("; ", failures));
};

Exception Propagation

When a guardrail fails, GuardrailViolationException is thrown. It propagates through the workflow executor and is wrapped in TaskExecutionException, following the same pattern as other task failures.

The exception carries structured information — guardrail type (INPUT or OUTPUT), violation message, task description, and agent role — so you can route failures to metrics or alerting without parsing error strings.

try {
    ensemble.run();
} catch (TaskExecutionException ex) {
    if (ex.getCause() instanceof GuardrailViolationException gve) {
        metrics.increment("guardrail.violation." + gve.getGuardrailType());
        log.warn("Guardrail blocked task '{}': {}",
            gve.getTaskDescription(), gve.getViolationMessage());
    }
}

The Tradeoff

Guardrails are deterministic checks, not semantic analysis. A length limit is easy to enforce. A toxicity check is harder — you would need to call an external classifier inside the guardrail, which adds latency and its own failure modes.

The design intentionally keeps guardrails as simple synchronous functions. If you need async validation, external API calls, or retry logic, you implement that inside the guardrail function. The framework does not impose an opinion on how complex your validation should be.

This means guardrails are most useful for structural and policy checks — length limits, required sections, PII filters, role-based access, schema validation on typed outputs. For semantic quality checks, the phase review and task reflection mechanisms (covered in earlier posts) are a better fit.

The full guardrails guide is in the documentation.

I’d be interested in whether the input/output split feels like the right abstraction, or whether you have seen validation needs that do not fit cleanly into either category.

Wiring Agent Ensembles into Spring Boot, Micronaut, and Quarkus

Sat, 23 May 2026 00:00:00 GMT

One question that comes up early when evaluating an agent orchestration library is how it fits into an existing backend stack. If your services run on Spring Boot, Micronaut, or Quarkus, you want agents to live inside the same dependency injection container, use the same configuration system, and expose metrics through the same actuator endpoints.

The interesting design decision in AgentEnsemble is that it has no framework dependencies at all. It is a plain Java 21+ library with a builder API. Framework integration is just a matter of wrapping those builder calls in whatever DI mechanism your framework uses. Nothing in the library changes.

This keeps the core small and testable, but it also means the integration patterns are worth spelling out explicitly.

The Builder API as the Integration Surface

Every AgentEnsemble component (agents, tasks, ensembles, memory stores, listeners) is created through builders. The library never scans for annotations, never registers beans automatically, and never assumes a particular lifecycle model.

This is deliberate. The builder API is the integration surface. In a DI container, you turn builder calls into bean definitions. In a plain main() method, you call the same builders directly.

Agent researcher = Agent.builder()
        .role("Research Analyst")
        .goal("Find accurate, up-to-date information")
        .backstory("You are a meticulous researcher.")
        .build();

That same code works identically inside a Spring @Configuration, a Micronaut @Factory, a Quarkus CDI producer, or a static main method.

Spring Boot

Spring Boot is the most common case. The LangChain4j Spring Boot starters handle ChatLanguageModel bean creation from application.properties automatically — AgentEnsemble does not duplicate that responsibility.

Dependencies

dependencies {
    implementation("net.agentensemble:agentensemble-core:2.10.0")
    implementation("dev.langchain4j:langchain4j-spring-boot-starter:1.11.0")
    implementation("dev.langchain4j:langchain4j-open-ai-spring-boot-starter:1.11.0")
    // Optional: metrics via Spring Boot Actuator
    implementation("net.agentensemble:agentensemble-metrics-micrometer:2.10.0")
}

Configuration Class

Spring injects the ChatLanguageModel bean (created by the LangChain4j starter) and any EnsembleListener beans you have declared elsewhere.

@Configuration
public class AgentEnsembleConfig {

    @Bean
    public Agent researcher() {
        return Agent.builder()
                .role("Research Analyst")
                .goal("Find accurate, up-to-date information on the given topic")
                .backstory("You are a meticulous researcher with a talent for "
                        + "finding relevant information quickly.")
                .build();
    }

    @Bean
    public Ensemble ensemble(
            ChatLanguageModel chatModel,
            Agent researcher,
            List<EnsembleListener> listeners,
            Optional<ToolMetrics> toolMetrics) {

        Ensemble.Builder builder = Ensemble.builder()
                .chatModel(chatModel)
                .agents(researcher);

        listeners.forEach(builder::listener);
        toolMetrics.ifPresent(builder::toolMetrics);

        return builder.build();
    }
}

The pattern here is standard Spring: declare beans, let Spring wire them. Any @Component implementing EnsembleListener is automatically collected via the List<EnsembleListener> injection.

Metrics via Actuator

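If you use Micrometer with Spring Boot Actuator, declare a ToolMetrics bean and agent metrics are registered with the application’s MeterRegistry. They appear at /actuator/metrics once that endpoint is exposed, which Spring Boot does not do over HTTP by default: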

@Bean
public ToolMetrics toolMetrics(MeterRegistry registry) {
    return new MicrometerToolMetrics(registry);
}

Using the Ensemble

Inject the Ensemble bean wherever you need it. Build tasks at the call site where you have the runtime inputs:

@Service
public class ResearchService {
    private final Ensemble ensemble;
    private final Agent researcher;

    public ResearchService(Ensemble ensemble, Agent researcher) {
        this.ensemble = ensemble;
        this.researcher = researcher;
    }

    public String research(String topic) {
        Task task = Task.builder()
                .description("Research and summarise: " + topic)
                .expectedOutput("A concise summary with key findings")
                .agent(researcher)
                .build();
        return ensemble.run(task).finalOutput();
    }
}

Micronaut

This example does not rely on a LangChain4j integration module for Micronaut, so you create the ChatLanguageModel bean directly. The rest of the pattern is the same: a @Factory class with @Singleton methods.

@Factory
public class AgentEnsembleFactory {

    @Singleton
    public ChatLanguageModel chatModel(
            @Value("${agentensemble.openai.api-key}") String apiKey,
            @Value("${agentensemble.openai.model-name}") String modelName) {
        return OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName(modelName)
                .build();
    }

    @Singleton
    public Ensemble ensemble(
            ChatLanguageModel chatModel,
            Agent researcher,
            List<EnsembleListener> listeners) {
        Ensemble.Builder builder = Ensemble.builder()
                .chatModel(chatModel)
                .agents(researcher);
        listeners.forEach(builder::listener);
        return builder.build();
    }
}

Micronaut injects all EnsembleListener beans automatically via the List<EnsembleListener> parameter. Micrometer metrics plug in the same way through Micronaut’s Micrometer module, which supplies the MeterRegistry.

Quarkus

Quarkus has its own quarkus-langchain4j extension with a different programming model. The example below uses the standard LangChain4j library directly with Quarkus CDI:

@ApplicationScoped
public class AgentEnsembleProducer {

    @ConfigProperty(name = "agentensemble.openai.api-key")
    String apiKey;

    @Produces @ApplicationScoped
    public ChatLanguageModel chatModel() {
        return OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName("gpt-4o")
                .build();
    }

    @Produces @ApplicationScoped
    public Ensemble ensemble(
            ChatLanguageModel chatModel,
            Agent researcher,
            Instance<EnsembleListener> listeners) {
        Ensemble.Builder builder = Ensemble.builder()
                .chatModel(chatModel)
                .agents(researcher);
        listeners.forEach(builder::listener);
        return builder.build();
    }
}

The only Quarkus-specific detail is Instance<EnsembleListener> instead of List<EnsembleListener>. Instance is CDI’s programmatic lookup type: it is iterable and resolves the matching beans lazily.

The Design Tradeoff

The choice to keep AgentEnsemble framework-agnostic means there is no auto-configuration, no classpath scanning, and no starter module that wires everything with a single dependency. You write the configuration class yourself.

The upside is that the integration is completely transparent. There is no hidden magic, no classpath-sensitive behavior, and no risk of version conflicts between the library’s framework assumptions and your application’s framework version. The builder API is the same everywhere, so moving between frameworks (or running without one) requires changing only the DI wiring.

For teams that already have a preferred framework and know how to write configuration classes, this is usually the right tradeoff. The wiring code is small, readable, and lives in one place.

What Crosses the DI Boundary

A few integration points are worth calling out:

Listeners integrate naturally as DI beans. Declare any EnsembleListener implementation as a bean, and the ensemble configuration collects them.
Memory components (MemoryStore, EnsembleMemory) are created via builders and passed to the ensemble. In a DI framework, declare them as beans.
Tools are configured per-agent. Declare tool instances as beans and inject them into agent factory methods.
Metrics via MicrometerToolMetrics plug into whatever MeterRegistry your framework provides.

The general rule: if a component is created via a builder, it can be a bean. If it is passed to the ensemble builder, it can be injected.

The framework integration guide and full code examples are in the documentation.

I’d be interested in whether this level of framework-agnosticism feels right, or whether starter modules that auto-configure common setups would be more useful for your team.

Operating Agent Networks: Visual Topology, Drill-Down, and Runtime Visibility

Thu, 21 May 2026 00:00:00 GMT

Building an agent network is one problem. Operating it is a different one. When you have ten ensembles communicating over WebSockets, sharing capabilities via discovery, routing requests across federation boundaries, and managing capacity with priority queues, you need to see what is happening.

The visibility gap

Individual ensemble dashboards show what one ensemble is doing. They do not show the network — how ensembles relate to each other, where requests flow, and where bottlenecks form.

The network dashboard

AgentEnsemble’s network dashboard provides a topology view of the entire ensemble network:

Ensembles as interactive nodes with lifecycle state, queue depth, and progress
Shared capabilities as animated edges between nodes
Click to open ensemble detail sidebar (capabilities, metrics, connection status)
Drill-down to live execution dashboard for any ensemble

Architecturally, the dashboard opens an independent WebSocket connection to each ensemble; there is no central aggregator. It keeps no persistent state, so a refresh simply reconnects and rebuilds the view.

Audit trail

The audit trail is the historical record of network events: work requests, capacity changes, discovery events, and federation routing decisions. It is append-only and backed by the same transport infrastructure (in-memory for development, Kafka for production).

Three levels of visibility

Network level — topology, connections, capacity distribution, routing patterns
Ensemble level — queue depth, active tasks, shared capabilities, health
Execution level — individual task traces, agent iterations, tool calls

Each level answers different questions. The network dashboard covers the network and ensemble levels, with drill-down to the execution level. The audit trail adds the historical dimension.

The network dashboard is part of AgentEnsemble. The network dashboard guide covers setup, and the audit trail guide covers the historical event log.

Testing Distributed Agent Systems: Stubs, Recordings, and Isolation

Tue, 19 May 2026 00:00:00 GMT

Testing a single agent ensemble is already harder than testing most software: the output is non-deterministic, the execution path depends on LLM responses, and the number of iterations is unpredictable.

Testing a network of agent ensembles adds distributed system concerns on top of that: WebSocket connections between services, shared state across ensembles, capability discovery, and cross-ensemble delegation.

The testing problem

An ensemble that delegates work via NetworkTask or NetworkTool has external dependencies. In tests, you need control over what those dependencies return without running real ensembles.

Stubs for predictable behavior

NetworkTask.stub() returns canned responses without connecting to any real ensemble:

StubNetworkTask mealStub = NetworkTask.stub("kitchen", "prepare-meal",
    "Meal prepared: wagyu steak, medium-rare. Estimated 25 minutes.");

Ensemble roomService = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.builder()
        .description("Handle room service request")
        .tools(mealStub)
        .build())
    .build();

The result is deterministic network behavior, while the ensemble’s own LLM interactions remain non-deterministic.

Recordings for assertion

NetworkTask.recording() captures every request for later assertion:

RecordingNetworkTask recorder = NetworkTask.recording("kitchen", "prepare-meal");
// roomService must be built with recorder among its task's tools, as mealStub is above
roomService.run();

assertThat(recorder.callCount()).isEqualTo(1);
assertThat(recorder.lastRequest()).contains("wagyu");
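
Both test doubles are small enough to sketch in plain Java, which helps show what they do and do not cover. The types below (RemoteCall, StubCall, RecordingCall) are hypothetical and only illustrate the pattern; the library's StubNetworkTask and RecordingNetworkTask may be structured differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface for a call to a remote ensemble.
interface RemoteCall {
    String send(String request);
}

// Stub: always answers with the same canned response and never touches the network.
final class StubCall implements RemoteCall {
    private final String cannedResponse;

    StubCall(String cannedResponse) {
        this.cannedResponse = cannedResponse;
    }

    @Override
    public String send(String request) {
        return cannedResponse;
    }
}

// Recording: remembers every request so a test can assert on it afterwards.
final class RecordingCall implements RemoteCall {
    private final List<String> requests = new ArrayList<>();
    private final RemoteCall delegate;

    RecordingCall(RemoteCall delegate) {
        this.delegate = delegate;
    }

    @Override
    public String send(String request) {
        requests.add(request);
        return delegate.send(request);
    }

    int callCount() {
        return requests.size();
    }

    String lastRequest() {
        return requests.get(requests.size() - 1);
    }
}
```

In this sketch the recorder wraps a stub, so a single test can control the response and assert on the request.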

Testing patterns summary

What to test                               Tool                      Approach
Ensemble uses network response correctly   NetworkTask.stub()        Canned response, deterministic
Ensemble sends correct request             NetworkTask.recording()   Capture and assert
Two ensembles work together                In-process transport      Real interaction, no network
End-to-end                                 WebSocket transport       Full integration test

The design principle

Network behavior and business logic are separable concerns. Test doubles let you test business logic without infrastructure. In-process transport lets you test interaction without the network. Full integration tests verify everything works together.

Network testing tools are part of AgentEnsemble. The network testing guide covers the full API.

Capacity Management in Agent Networks: Rate Limiting, Priority Queues, and Backpressure

Sun, 17 May 2026 00:00:00 GMT

Agent ensembles that run as long-lived services on a network will, at some point, receive more work than they can handle. The question is what happens next.

Without capacity management, the answer is usually one of: unbounded queue growth (OOM), random request dropping, or cascade failures where an overloaded ensemble backs up its callers.

The capacity problem in agent networks

Agent workloads have properties that make capacity management harder than in traditional request/response systems:

Variable execution time. A simple analysis task might take 5 seconds. A complex coding task might take 5 minutes.
Variable cost. Each agent iteration consumes LLM tokens. An overloaded system burns money faster.
Fan-out amplification. One incoming request to a coordinator might fan out to 5 different ensembles.

Three layers of capacity management

1. Reactive: Rate limits and backpressure

Concurrency limits protect ensembles from overload. When the limit is reached, requests queue. When the queue is full, backpressure signals propagate upstream.
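
The run, queue, or reject decision described above can be sketched in a few lines. This is an illustrative model only: `BoundedIntake` and its `Admission` enum are hypothetical names, not part of the AgentEnsemble API.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical intake gate: run now, queue, or reject (the backpressure signal). */
final class BoundedIntake {
    enum Admission { RUNNING, QUEUED, REJECTED }

    private final int maxConcurrent;
    private final int queueCapacity;
    private final Deque<String> waiting = new ArrayDeque<>();
    private int running;

    BoundedIntake(int maxConcurrent, int queueCapacity) {
        this.maxConcurrent = maxConcurrent;
        this.queueCapacity = queueCapacity;
    }

    /** Non-blocking admission decision for one request. */
    synchronized Admission submit(String requestId) {
        if (running < maxConcurrent) {
            running++;
            return Admission.RUNNING;   // a concurrency slot was free
        }
        if (waiting.size() < queueCapacity) {
            waiting.addLast(requestId);
            return Admission.QUEUED;    // limit reached, but the queue has room
        }
        return Admission.REJECTED;      // queue full: the caller should back off
    }

    /** Called when a running request finishes; returns the next queued id, or null. */
    synchronized String complete() {
        String next = waiting.pollFirst();
        if (next == null && running > 0) {
            running--;                  // nothing waiting: free the slot
        }
        return next;                    // otherwise the slot passes to the queued request
    }
}
```

A `REJECTED` result is what an upstream caller would translate into its own slowdown, retry-later response, or rerouting decision.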

2. Priority: Queues with aging

PriorityRequestQueue adds priority levels with aging to prevent starvation. Requests waiting beyond the aging interval get promoted, guaranteeing every request is eventually processed.
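
To make the aging rule concrete, here is a small sketch under stated assumptions: priority 0 is the most urgent level, a request is promoted one level per elapsed aging interval, and ties go to the older request. `AgingQueue` is a hypothetical class, not the `PriorityRequestQueue` implementation.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical priority queue with aging; time is passed in to keep it deterministic. */
final class AgingQueue {
    private record Entry(String id, int priority, long enqueuedAtMillis, long seq) {}

    private final long agingIntervalMillis;
    private final List<Entry> entries = new ArrayList<>();
    private long nextSeq;

    AgingQueue(long agingIntervalMillis) {
        if (agingIntervalMillis <= 0) {
            throw new IllegalArgumentException("aging interval must be positive");
        }
        this.agingIntervalMillis = agingIntervalMillis;
    }

    /** Lower number means more urgent; 0 is the highest level. */
    synchronized void enqueue(String id, int priority, long nowMillis) {
        entries.add(new Entry(id, priority, nowMillis, nextSeq++));
    }

    /** One promotion per full aging interval waited, never above level 0. */
    private int effectivePriority(Entry e, long nowMillis) {
        long promotions = (nowMillis - e.enqueuedAtMillis()) / agingIntervalMillis;
        return (int) Math.max(0, e.priority() - promotions);
    }

    /** Removes and returns the most urgent request, oldest first on ties. */
    synchronized Optional<String> poll(long nowMillis) {
        Optional<Entry> best = entries.stream().min(
                Comparator.comparingInt((Entry e) -> effectivePriority(e, nowMillis))
                        .thenComparingLong(Entry::seq));
        best.ifPresent(entries::remove);
        return best.map(Entry::id);
    }
}
```

Because a waiting request eventually reaches level 0 and ties favor the older entry, a steady stream of urgent requests cannot hold a low-priority request back indefinitely.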

3. Proactive: Operational profiles

NetworkProfile bundles per-ensemble capacity targets and shared memory pre-load directives into deployable units. Profiles can be applied on a schedule, through the directive system, or by manual trigger.

NetworkProfile weekendProfile = NetworkProfile.builder()
    .name("sporting-event-weekend")
    .ensemble("front-desk", Capacity.replicas(4).maxConcurrent(50))
    .ensemble("kitchen", Capacity.replicas(3).maxConcurrent(100))
    .preload("kitchen", "inventory", "Extra beer and ice stocked")
    .build();

The design principle

Each layer addresses a different time horizon: seconds (rate limits), minutes (priority queues), and hours/days (operational profiles). Together, they give operators the tools to keep an agent network running under variable load.

Capacity management is part of AgentEnsemble. The rate limiting guide, operational profiles guide, and scheduled tasks guide cover the full APIs.

Shared Memory Across Agent Ensembles: Consistency Models for Distributed State

Fri, 15 May 2026 00:00:00 GMT

When agent ensembles operate as independent services on a network, they occasionally need to share state. The question is not whether to share state — it is how to share it without creating the coordination problems that shared mutable state always creates in distributed systems.

The consistency spectrum

Not all shared state needs the same consistency guarantees:

Inventory notes are advisory. Eventual consistency is fine.
Room assignments are exclusive. This needs distributed locks or optimistic locking.
Configuration preferences are rarely updated. Eventual consistency with version tracking works well.

SharedMemory with configurable consistency

AgentEnsemble v3.0.0 introduces SharedMemory with per-scope consistency selection:

Model	Behavior	Use case
`EVENTUAL`	Last-write-wins, no coordination	Context, preferences, notes
`OPTIMISTIC`	Version-checked writes, retry on conflict	Counters, shared documents
`LOCKED`	Distributed lock before each read/write	Room assignments, exclusive resources

Different scopes can use different models:

SharedMemory inventory = SharedMemory.builder()
    .store(MemoryStore.inMemory())
    .consistency(Consistency.EVENTUAL)
    .build();

SharedMemory rooms = SharedMemory.builder()
    .store(MemoryStore.inMemory())
    .consistency(Consistency.LOCKED)
    .build();
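
To illustrate what "version-checked writes, retry on conflict" means for the OPTIMISTIC model, here is a minimal in-memory sketch. `VersionedStore` is a hypothetical class written for this example and says nothing about how SharedMemory is implemented.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

/** Hypothetical store showing optimistic, version-checked writes. */
final class VersionedStore {
    record Versioned(String value, long version) {}

    private final ConcurrentHashMap<String, Versioned> data = new ConcurrentHashMap<>();

    /** Version 0 with a null value stands for "never written". */
    Versioned read(String key) {
        return data.getOrDefault(key, new Versioned(null, 0));
    }

    /** Succeeds only if nobody has written since the expected snapshot was read. */
    boolean compareAndWrite(String key, Versioned expected, String newValue) {
        Versioned next = new Versioned(newValue, expected.version() + 1);
        if (expected.version() == 0) {
            return data.putIfAbsent(key, next) == null;
        }
        return data.replace(key, expected, next);
    }

    /** Read, modify, write; on conflict re-read and try again. */
    String update(String key, UnaryOperator<String> change, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            Versioned current = read(key);
            String updated = change.apply(current.value());
            if (compareAndWrite(key, current, updated)) {
                return updated;
            }
        }
        throw new IllegalStateException("too many conflicts on " + key);
    }
}
```

A stale writer loses the race without blocking anyone, which is why this model suits data that is contended only occasionally.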

The design principle

The useful insight is that shared state in agent networks is not monolithic. Different categories of state need different consistency guarantees, and forcing a single model is either too expensive or too weak.

The consistency model is a property of the data, not a property of the system. Choose it based on what happens when two ensembles access the same state concurrently.

Shared memory is part of AgentEnsemble. The shared memory guide covers the full API including consistency models and network configuration.

Federation for Agent Networks: Cross-Namespace Capability Sharing via Realms

Wed, 13 May 2026 00:00:00 GMT

Discovery lets ensembles find capabilities within a network. But in a real deployment, not every ensemble lives in the same namespace or even the same cluster. A hotel chain might run separate ensemble networks at each property, each in its own Kubernetes namespace, but want them to share spare capacity when one property is overloaded.

This is the federation problem: how do you extend capability discovery across trust and network boundaries without collapsing everything into one flat namespace?

Realms as trust boundaries

AgentEnsemble v3.0.0 introduces realms as the organizational unit for federation. A realm is a namespace-level discovery and trust boundary — typically mapping to a Kubernetes namespace in production deployments.

FederationConfig federation = FederationConfig.builder()
    .localRealm("hotel-downtown")
    .federationName("Hotel Chain")
    .realm("hotel-airport", "hotel-airport-ns")
    .realm("hotel-beach", "hotel-beach-ns")
    .build();

Within a realm, ensembles discover each other freely. Cross-realm discovery requires explicit opt-in: an ensemble must advertise its capacity as shareable for other realms to use it.

Capacity advertisement

Ensembles periodically broadcast their current load and availability. The shareable flag is the federation gate — when true, spare capacity is available to other realms.

The routing hierarchy

Priority	Scope	Condition
1 (highest)	Local realm	Provider is in the same realm
2	Same realm (unregistered)	Provider has no realm info (assumed local)
3 (lowest)	Cross-realm	Provider is in a different realm and `shareable = true`

Within each level, the least-loaded provider is preferred. The hierarchy encodes a simple principle: prefer local providers, fall back to cross-realm when local capacity is insufficient.
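
The hierarchy can be expressed as a filter followed by a two-key comparison. In this sketch, `RealmRouter` and `Provider` are hypothetical names, and treating a load of 1.0 or more as "no spare capacity" is an assumption made for the example.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Hypothetical provider selection following the three-level hierarchy. */
final class RealmRouter {
    /** realm == null means the provider advertised no realm info. */
    record Provider(String name, String realm, boolean shareable, double load) {}

    static int tier(Provider p, String localRealm) {
        if (localRealm.equals(p.realm())) return 1; // same realm
        if (p.realm() == null) return 2;            // unregistered, assumed local
        return 3;                                   // cross-realm
    }

    static Optional<Provider> select(List<Provider> candidates, String localRealm) {
        return candidates.stream()
                .filter(p -> p.load() < 1.0)                            // assumed: 1.0 means full
                .filter(p -> tier(p, localRealm) < 3 || p.shareable())  // federation gate
                .min(Comparator.comparingInt((Provider p) -> tier(p, localRealm))
                        .thenComparingDouble(Provider::load));          // least loaded within a tier
    }
}
```

Cross-realm providers are only reached when every local candidate has been filtered out, which is the fallback behavior the table describes.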

The design principle

Federation is a capacity-sharing problem, not a networking problem. The networking already works across boundaries. What federation adds is a policy layer: who can use whose spare capacity, and in what order.

Realms provide the organizational unit. Capacity advertisement provides the data. The routing hierarchy provides the policy. Together, they turn independent agent networks into a cooperative federation that shares spare capacity while maintaining operational independence.

Federation is part of AgentEnsemble. The federation guide covers the full API including capacity advertisement and realm configuration.

Dynamic Discovery in Agent Networks: From Hardcoded Routes to Capability Catalogs

Mon, 11 May 2026 00:00:00 GMT

The simplest way to connect two agent ensembles is a direct reference: ensemble A knows ensemble B’s address and calls it. This works when you have two or three ensembles with stable relationships.

It stops working when you have ten ensembles, or when ensembles come and go, or when the same capability is provided by multiple ensembles and you want the caller to use whichever one is available. At that point, you need discovery — a way for ensembles to find capabilities without knowing in advance who provides them.

The static wiring problem

In a statically wired agent network, every cross-ensemble call requires knowing the provider’s identity and address. This creates coupling. If the provider moves, every caller needs updating. If you add a second provider for capacity, callers need load-balancing logic.

The fundamental issue is that callers should care about what they need, not who provides it or where it runs.

Capability advertisement with tags

AgentEnsemble v3.0.0 introduces capability discovery. Ensembles advertise their shared tasks and tools with optional tags, and other ensembles discover providers at runtime:

Ensemble kitchen = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .shareTool("check-inventory", inventoryTool, "food", "stock")
    .shareTask("prepare-meal", mealTask, "food", "cooking")
    .build();

kitchen.start(7329);

// Another ensemble discovers capabilities dynamically
NetworkTool inventoryCheck = NetworkTool.discover("check-inventory", registry);

Tags classify capabilities for filtered discovery. Query for categories rather than specific names:

List<CapabilityInfo> food = registry.findByTag("food");
List<CapabilityInfo> stockChecks = registry.findByTags("food", "stock");
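
For intuition, tag-based lookup can be modeled as a filter over registered capabilities. The sketch below assumes that querying several tags means "has all of them", which is what the `stockChecks` example suggests; `TagCatalog` is a hypothetical stand-in, not the real registry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory capability catalog with tag filtering. */
final class TagCatalog {
    record Capability(String name, String provider, Set<String> tags) {}

    private final List<Capability> capabilities = new ArrayList<>();

    synchronized void register(String name, String provider, String... tags) {
        capabilities.add(new Capability(name, provider, Set.copyOf(List.of(tags))));
    }

    /** Returns capabilities that carry every requested tag (AND semantics, assumed). */
    synchronized List<Capability> findByTags(String... tags) {
        Set<String> wanted = Set.copyOf(List.of(tags));
        return capabilities.stream()
                .filter(c -> c.tags().containsAll(wanted))
                .toList();
    }
}
```

Since tags are plain strings with no schema, a typo such as "foods" silently matches nothing, which is the convention risk listed under the tradeoffs.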

Dynamic vs. static wiring

The two approaches coexist. Use static wiring for well-known, stable relationships. Use dynamic discovery for capabilities that may be provided by different ensembles depending on deployment, capacity, or availability.

The agent using the task or tool does not know which approach was used to create it.

Tradeoffs

Discovery adds a lookup step. Initial resolution queries the registry; subsequent uses are cached.

Tag semantics are convention-based. No schema enforcement — tag conventions need to be agreed upon across teams.

Multiple providers create ambiguity. The registry needs a selection strategy (least-loaded, round-robin, affinity-based).

Registry availability is a dependency. For critical paths, consider falling back to static wiring when discovery is unavailable.

The design principle

The useful abstraction is separating what from who. An ensemble that needs a capability should express that need without specifying the provider. This separation enables the network to evolve — new providers can come online, existing providers can be replaced, capacity can be redistributed — without callers needing to change.

Capability discovery is part of AgentEnsemble. The discovery guide covers the full API including tag-based filtering.

Durable Transport for Agent Networks: Moving from In-Process Queues to Kafka

Sat, 09 May 2026 00:00:00 GMT

In-process queues are fine for development. They are fast, deterministic, and require zero infrastructure. But they have a property that becomes a liability in production: when the process dies, the queue contents disappear.

For agent networks that run as long-lived services — handling work requests over hours or days — losing queued requests on restart is not acceptable. The transport layer needs durability, and that means moving from in-process data structures to something that survives process failures.

What durability means for agent networks

An agent ensemble network has three communication patterns that need durable backing:

Work request delivery — a request from one ensemble to another should not be lost if the receiving ensemble is temporarily unavailable
Response routing — when an ensemble completes a request, the response needs to reach the original caller even if the caller restarted
Capability advertisement — shared tasks and tools should remain discoverable across process restarts

Kafka as the transport backing

The agentensemble-transport-kafka module implements the transport SPIs against Apache Kafka:

KafkaTransportConfig config = KafkaTransportConfig.builder()
    .bootstrapServers("kafka:9092")
    .consumerGroupId("kitchen-ensemble")
    .topicPrefix("agentensemble.")
    .build();

Request queues

KafkaRequestQueue produces work requests to a Kafka topic and consumes them with manual offset commits. If the ensemble crashes mid-processing, the request will be redelivered on restart.

Priority queues with aging

For workloads where some requests are more urgent than others, PriorityRequestQueue adds priority levels with aging to prevent starvation. Requests that wait longer than the aging interval are promoted to the next higher priority level.

What changes operationally

Moving from in-process to Kafka transport changes the operational profile:

Startup behavior — with Kafka, ensembles may start with a backlog of unprocessed requests
Failure modes — infrastructure-level errors (broker unavailable) rather than process-fatal
Monitoring — consumer lag, partition health, broker connectivity
Ordering — per-partition ordering, not strict FIFO

The configuration boundary

The ensemble does not know it is using Kafka — it interacts with transport SPIs. The Kafka-specific configuration lives in the infrastructure layer:

// Infrastructure layer
KafkaRequestQueue queue = KafkaRequestQueue.builder()
    .config(kafkaConfig)
    .ensembleName("kitchen")
    .build();

// Application layer (transport-agnostic)
Ensemble kitchen = Ensemble.builder()
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))
    .requestQueue(queue)
    .build();

The same ensemble code works in development (in-process queues) and production (Kafka) without changes.

Tradeoffs

At-least-once delivery. A request may be processed twice if the ensemble crashes after completing work but before committing the offset. For most agent workloads, which are non-deterministic anyway, this is acceptable; tasks with external side effects are the exception and need to be idempotent.
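
Where duplicate processing is not acceptable, a common mitigation is to deduplicate by request ID. A minimal sketch, assuming each request carries a stable ID and the work returns a non-null result; `IdempotentHandler` is a hypothetical helper, and a real deployment would need the completed-results map to live in durable storage so it survives the same crash that triggers redelivery.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical wrapper that runs the work at most once per request ID. */
final class IdempotentHandler {
    // In-memory only for illustration; lost on restart.
    private final Map<String, String> completed = new ConcurrentHashMap<>();
    private final Function<String, String> work;

    IdempotentHandler(Function<String, String> work) {
        this.work = work;
    }

    /** A redelivered request returns the stored result instead of re-running the work. */
    String handle(String requestId, String payload) {
        return completed.computeIfAbsent(requestId, id -> work.apply(payload));
    }
}
```

The consumer would call `handle` for every delivered record and commit the offset afterwards, so a replayed record costs a map lookup rather than a second agent run.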

Operational complexity. Kafka needs to be provisioned, monitored, and maintained. For small deployments, the overhead may not be justified.

Latency. Kafka adds millisecond-scale latency. For agent workloads where execution takes seconds or minutes, this is negligible.

The Kafka transport module is part of AgentEnsemble. The durable transport guide covers the full configuration and operational details.

Transport SPI: Making Agent Network Infrastructure Pluggable

Thu, 07 May 2026 00:00:00 GMT

When agent ensembles become long-running services that communicate over a network, the communication layer becomes infrastructure. And infrastructure has a property that application code should not: it varies by deployment environment.

Development uses in-process queues. Staging might use Redis. Production runs Kafka. The application code — the agents, tasks, workflows — should not change between these environments. The question is where to draw the abstraction line.

The transport problem

An ensemble network needs several communication primitives:

Request queues — how work requests arrive at an ensemble
Delivery registries — how responses get routed back to the requester
Capability registries — how ensembles advertise and discover shared tasks and tools
Capacity tracking — how ensembles report their current load

Each of these has a natural in-process implementation (maps, queues, lists) and at least one distributed implementation (Kafka topics, Redis streams, service registries). If these are hardcoded to a specific backing store, every deployment environment change requires code changes.

The SPI design

AgentEnsemble defines transport as a set of Java interfaces — a Service Provider Interface — with pluggable implementations:

Transport transport = Transport.websocket("kitchen");

// Or for production with delivery guarantees
Transport transport = Transport.simple("kitchen", deliveryRegistry);

The Transport interface provides access to the individual primitives:

Primitive	Interface	Purpose
Request queue	`RequestQueue`	Inbound work request buffering
Delivery registry	`DeliveryRegistry`	Response routing back to callers
Capability registry	`CapabilityRegistry`	Shared task/tool advertisement

Each interface has a simple contract. RequestQueue, for example:

public interface RequestQueue {
    void enqueue(WorkRequest request);
    Optional<WorkRequest> poll(Duration timeout);
    int size();
}

The in-process implementation uses a LinkedBlockingQueue. The Kafka implementation produces to a topic and consumes with manual offset commits. Same interface, different backing.
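
As a rough sketch of the in-process side, the interface above can be satisfied with a few lines around `LinkedBlockingQueue`. The `WorkRequest` record here is a placeholder, since the real type is not shown in this post, and `InProcessRequestQueue` is a hypothetical class name.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

/** Placeholder payload type for this sketch. */
record WorkRequest(String id, String body) {}

/** Copied from the contract shown above. */
interface RequestQueue {
    void enqueue(WorkRequest request);
    Optional<WorkRequest> poll(Duration timeout);
    int size();
}

/** Hypothetical in-process implementation backed by an unbounded blocking queue. */
final class InProcessRequestQueue implements RequestQueue {
    private final LinkedBlockingQueue<WorkRequest> queue = new LinkedBlockingQueue<>();

    @Override
    public void enqueue(WorkRequest request) {
        queue.add(request);
    }

    @Override
    public Optional<WorkRequest> poll(Duration timeout) {
        try {
            // Blocks up to the timeout; null from the queue becomes Optional.empty().
            return Optional.ofNullable(queue.poll(timeout.toNanos(), TimeUnit.NANOSECONDS));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return Optional.empty();
        }
    }

    @Override
    public int size() {
        return queue.size();
    }
}
```

A Kafka-backed class would implement the same three methods, so callers that depend only on `RequestQueue` are unaffected by the swap.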

Why this matters for agent systems

The transport SPI is not unusual as an architectural pattern — it is a standard dependency inversion. What makes it interesting in the agent context is what it enables.

Agent networks are inherently non-deterministic. Agents take variable time, produce variable output, and may fail in unpredictable ways. Adding infrastructure variability on top of that makes the system harder to reason about.

By isolating transport from application logic, you can:

Test with in-process transport — no containers, no network, deterministic ordering
Develop locally with WebSocket transport — real network behavior, zero infrastructure setup
Deploy to production with Kafka — durability, horizontal scaling, replay capability
Switch between environments — without touching agent code, task definitions, or workflow configuration

The capability registry

One of the more interesting transport primitives is the capability registry. When an ensemble shares a task or tool on the network, that capability needs to be discoverable by other ensembles.

CapabilityRegistry registry = transport.capabilityRegistry();
registry.register("prepare-meal", CapabilityType.TASK, "kitchen");
registry.register("check-inventory", CapabilityType.TOOL, "kitchen");

Optional<String> provider = registry.findProvider("prepare-meal");

In simple mode, this is an in-memory map. In production, it could be backed by a service registry, a shared database, or Kafka’s consumer group protocol. The application code that registers and discovers capabilities does not change.

Tradeoffs

Abstraction leaks. In-process queues have different ordering and delivery guarantees than Kafka topics. The SPI abstracts the interface but cannot fully abstract the semantics.

Configuration complexity. Each transport implementation has its own configuration. The SPI does not unify configuration — you still need environment-specific setup for each backing store.

Performance characteristics vary. In-process queues are nanosecond-scale. Kafka adds millisecond-scale latency. If your agent workflow is latency-sensitive, the transport choice matters.

The design principle

The useful insight is that agent network communication has a small number of well-defined primitives, and these primitives have natural implementations at every scale. Defining the primitives as interfaces lets the infrastructure decision be made at deployment time rather than at development time.

This is standard dependency inversion. It is not novel. But it is the foundation that makes everything else in the ensemble network possible — durable transport, discovery, federation, and capacity management all build on these same interfaces.

The transport SPI is part of AgentEnsemble. The durable transport guide covers the Kafka implementation in detail.

Bridging MCP into Java Agent Systems: Reusing the Tool Ecosystem Without Leaving the JVM

Tue, 05 May 2026 00:00:00 GMT

The Model Context Protocol has created a growing ecosystem of tool servers — filesystem operations, git integration, database access, API connectors. Most of these servers are written in TypeScript and communicate over stdio or SSE.

If you are building agent systems on the JVM, you face a choice: rewrite every tool in Java, or find a way to use what already exists. The useful answer is usually both — and the bridge between them needs to be clean enough that the rest of your system does not care which approach a particular tool uses.

The integration problem

MCP servers expose tools through a well-defined protocol. LangChain4j (which AgentEnsemble builds on) already has MCP client support via McpClient and McpToolProvider. But there is a gap: LangChain4j’s MCP integration produces tools for its AiServices abstraction, not for AgentEnsemble’s AgentTool interface.

The bridge needs to:

Connect to any MCP server (stdio or SSE transport)
Discover available tools from the server
Adapt each MCP tool to the AgentTool interface
Manage the server subprocess lifecycle
Allow MCP tools and Java-native tools to coexist in the same agent’s tool list

McpToolFactory

The agentensemble-mcp module provides McpToolFactory as the primary entry point. Connect to any MCP-compatible server and get back standard AgentTool instances:

try (StdioMcpTransport transport = new StdioMcpTransport.Builder()
        .command(List.of("npx", "--yes",
            "@modelcontextprotocol/server-filesystem", "/workspace"))
        .build()) {

    List<AgentTool> tools = McpToolFactory.fromServer(transport);

    Agent agent = Agent.builder()
        .role("File analyst")
        .goal("Analyze project structure")
        .tools(tools)
        .llm(model)
        .build();
}

The factory connects to the server, enumerates its tools, and wraps each one as an McpAgentTool. Because MCP tools already have typed parameter schemas, the wrapper passes those schemas through to LangChain4j’s ToolSpecification directly — no intermediate Java record needed.

You can also filter to specific tools:

List<AgentTool> tools = McpToolFactory.fromServer(transport,
    "read_file", "search_files", "directory_tree");

This is useful when a server exposes tools you do not want the agent to have access to — write operations, for instance, when the agent should only read.

Convenience factories for common servers

The two most common MCP servers for coding workflows are the filesystem and git reference servers. McpToolFactory provides convenience methods that handle the subprocess setup:

try (McpServerLifecycle fs = McpToolFactory.filesystem(projectDir);
        McpServerLifecycle git = McpToolFactory.git(projectDir)) {
    fs.start();
    git.start();

    List<AgentTool> allTools = new ArrayList<>();
    allTools.addAll(fs.tools());
    allTools.addAll(git.tools());

    // Use allTools in any agent
}

The filesystem server provides: read_file, write_file, edit_file, search_files, list_directory, directory_tree, get_file_info.

The git server provides: git_status, git_diff_unstaged, git_diff_staged, git_diff, git_commit, git_add, git_log, git_branch, git_create_branch, git_checkout, git_show, git_reset.

Lifecycle management

MCP servers run as subprocesses. If you do not shut them down, you leak processes. McpServerLifecycle implements AutoCloseable so try-with-resources handles cleanup:

try (McpServerLifecycle server = McpToolFactory.filesystem(dir)) {
    server.start();
    // Use server.tools() ...
} // server is shut down here, subprocess is killed

For long-running ensembles, McpServerLifecycle also integrates with the ensemble’s lifecycle listener. When the ensemble stops, any attached MCP servers are shut down automatically.

Mixing MCP and Java-native tools

The most practical pattern is combining MCP tools with Java-native tools in the same agent. MCP provides the filesystem and git operations; Java-native tools handle domain-specific logic, calculations, or API calls:

try (McpServerLifecycle fs = McpToolFactory.filesystem(projectDir)) {
    fs.start();

    Agent agent = Agent.builder()
        .role("Code reviewer")
        .goal("Review code changes and check style compliance")
        .tools(fs.tools())                    // MCP filesystem tools
        .tools(List.of(                       // Java-native tools
            new StyleCheckerTool(),
            new MetricsCalculatorTool()))
        .llm(model)
        .build();
}

Both tool types implement the same AgentTool interface. The agent sees a flat list of tools with names and descriptions. It does not know or care which ones are backed by an MCP subprocess and which are pure Java.

This composability is the point. You can start with MCP servers for rapid capability acquisition, then replace individual tools with Java implementations when you need more control, better performance, or fewer runtime dependencies — without changing the agent configuration.

Connecting to custom MCP servers

Any MCP-compatible server works — not just the reference implementations. If you have a custom server that exposes domain-specific tools, connect it the same way:

try (StdioMcpTransport transport = new StdioMcpTransport.Builder()
        .command(List.of("python", "-m", "my_custom_mcp_server"))
        .build()) {

    List<AgentTool> tools = McpToolFactory.fromServer(transport);
}

SSE transport works for remote servers:

SseMcpTransport transport = new SseMcpTransport.Builder()
    .sseUrl("http://mcp-server:8080/sse")
    .build();

List<AgentTool> tools = McpToolFactory.fromServer(transport);

Tradeoffs

Subprocess overhead. Each MCP server is a separate process. For the reference servers, this means Node.js must be installed. The startup cost is measurable (typically 1-2 seconds). For long-running agents, this is negligible; for one-shot scripts, it adds latency.

Debugging across process boundaries. When an MCP tool fails, the error comes back as a string from the subprocess. You lose Java stack traces and structured exception types.

No automatic restart. If the MCP server crashes, the tools become unavailable. The bridge does not automatically restart servers. For production deployments, you would want health-check and restart logic around the lifecycle objects.
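
The health-check and restart logic mentioned above could look roughly like the following. Everything here is hypothetical: `ManagedServer` is an illustrative interface you would adapt to whatever your lifecycle object exposes, and the restart budget is an assumption meant to stop a crash loop.

```java
/** Hypothetical supervisor: restart an unhealthy server, up to a fixed budget. */
final class ServerSupervisor {
    /** Illustrative abstraction over a managed subprocess. */
    interface ManagedServer {
        boolean isHealthy();
        void restart() throws Exception;
    }

    private final ManagedServer server;
    private final int maxRestarts;
    private int restarts;

    ServerSupervisor(ManagedServer server, int maxRestarts) {
        this.server = server;
        this.maxRestarts = maxRestarts;
    }

    /** Runs one health check; returns true if the server is healthy afterwards. */
    synchronized boolean checkOnce() {
        if (server.isHealthy()) {
            return true;
        }
        if (restarts >= maxRestarts) {
            return false; // budget exhausted: leave it down and let an operator look
        }
        restarts++;
        try {
            server.restart();
        } catch (Exception e) {
            return false; // restart itself failed
        }
        return server.isHealthy();
    }
}
```

In a long-running service, `checkOnce` would typically be driven by a `ScheduledExecutorService` with a fixed delay between checks.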

When to use MCP vs. Java-native tools

Consideration	MCP	Java-native
Ecosystem breadth	Large and growing	You build what you need
Runtime dependency	Node.js (for reference servers)	Pure JVM
Startup latency	1-2s per server	Instant
Debugging	Cross-process	Same-process stack traces
Customization	Limited to server’s API	Full control
Integration with Java types	String-based	Native records, type safety

The practical pattern: start with MCP for rapid capability bootstrapping, move to Java-native tools for anything performance-sensitive or deeply integrated with your domain.

The MCP bridge is part of AgentEnsemble. The MCP bridge guide covers the full API and transport options.

Coding Agents on the JVM: Project Detection, Workspace Isolation, and Tool Composition

Sun, 03 May 2026 00:00:00 GMT

Most agent frameworks treat coding tasks the same as any other task: give the agent a file-read tool and a file-write tool and hope for the best.

In practice, an agent that can read and write files is not the same as an agent that can reliably work on a codebase. The gap between “can modify files” and “can fix a bug in a Gradle project” is significant, and it is mostly about context that the agent needs but does not have.

The missing context

A coding agent needs to know things that a general-purpose agent does not:

What kind of project is this? Is it Java with Gradle, Python with pip, TypeScript with npm? The build command, test command, and source layout all follow from this.
Where is the code? Source roots like src/main/java are conventions, not universal truths. The agent needs to know where to look.
How do I verify my changes? Running ./gradlew test is fundamentally different from running npm test. The agent needs the right command for the project.
How do I avoid breaking things? If the agent edits files directly in the user’s working tree, a failed experiment leaves half-finished code behind.

Without this context, agents make predictable mistakes: they guess at build commands, search in wrong directories, and leave the codebase in a worse state than they found it.

Project detection as a first-class concern

The approach I’ve been working on in AgentEnsemble treats project detection as an explicit step before tool assembly.

ProjectDetector.analyze(Path) scans the project root for build-file markers and returns a ProjectContext that captures language, build system, source roots, and the commands needed to build and test:

Marker file	Language	Build command	Test command
`build.gradle.kts` / `build.gradle`	Java	`./gradlew build`	`./gradlew test`
`pom.xml`	Java	`mvn compile`	`mvn test`
`package.json` + `tsconfig.json`	TypeScript	`npm run build`	`npm test`
`pyproject.toml` / `requirements.txt`	Python	`python -m build`	`python -m pytest`
`go.mod`	Go	`go build ./...`	`go test ./...`
`Cargo.toml`	Rust	`cargo build`	`cargo test`

This is not magic — it is a lookup table backed by file-existence checks. But it means the agent’s system prompt includes the correct build and test commands for the specific project, rather than generic instructions that may or may not apply.
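
A lookup table backed by file-existence checks is short enough to sketch in full. This is an illustration of the idea, not the `ProjectDetector` source: the class and record names are hypothetical, and checking the rules in table order with the first match winning is an assumption.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical marker-file detector mirroring the table above. */
final class MarkerDetector {
    record Detected(String language, String buildCommand, String testCommand) {}

    /** requireAll = true means every marker must exist; otherwise any one suffices. */
    private record Rule(List<String> markers, boolean requireAll, Detected result) {}

    private static final List<Rule> RULES = List.of(
            new Rule(List.of("build.gradle.kts", "build.gradle"), false,
                    new Detected("Java", "./gradlew build", "./gradlew test")),
            new Rule(List.of("pom.xml"), false,
                    new Detected("Java", "mvn compile", "mvn test")),
            new Rule(List.of("package.json", "tsconfig.json"), true,
                    new Detected("TypeScript", "npm run build", "npm test")),
            new Rule(List.of("pyproject.toml", "requirements.txt"), false,
                    new Detected("Python", "python -m build", "python -m pytest")),
            new Rule(List.of("go.mod"), false,
                    new Detected("Go", "go build ./...", "go test ./...")),
            new Rule(List.of("Cargo.toml"), false,
                    new Detected("Rust", "cargo build", "cargo test")));

    /** Core lookup; the predicate answers "does this marker file exist?". */
    static Optional<Detected> detectFrom(Predicate<String> markerExists) {
        for (Rule rule : RULES) {
            boolean hit = rule.requireAll()
                    ? rule.markers().stream().allMatch(markerExists)
                    : rule.markers().stream().anyMatch(markerExists);
            if (hit) {
                return Optional.of(rule.result());
            }
        }
        return Optional.empty();
    }

    /** Convenience overload that checks the real filesystem. */
    static Optional<Detected> detect(Path projectRoot) {
        return detectFrom(name -> Files.exists(projectRoot.resolve(name)));
    }
}
```

Separating the lookup from the filesystem check keeps the table testable without creating temporary directories.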

The detected context is injected into the agent’s instructions automatically. The agent knows it is working on a Java/Gradle project with source at src/main/java and tests at src/test/java, and it knows that ./gradlew test is the verification command.

Workspace isolation via git worktrees

The harder problem is safety. A coding agent that writes directly to the user’s working tree is an agent that can break your build, conflict with your uncommitted work, or leave half-finished refactoring behind if it fails partway through.

Git worktrees solve this cleanly. A worktree is a lightweight, branch-isolated copy of a repository that shares the same object store as the original. Creation is fast and disk-efficient because it does not duplicate the git history.

EnsembleOutput result = CodingEnsemble.runIsolated(model, repoRoot,
    CodingTask.implement("Add user profile endpoint"));

That runIsolated call:

Creates a git worktree from the current branch
Runs the coding agent inside the worktree
On success, preserves the worktree for review (you can inspect the changes, run tests again, then merge)
On failure, cleans up the worktree automatically
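
For reference, the create and clean-up steps map onto two standard git commands. The helper below only builds the argument lists; `WorktreeCommands` is a hypothetical class, and how `runIsolated` actually invokes git is not shown in this post.

```java
import java.nio.file.Path;
import java.util.List;

/** Hypothetical helper that builds git worktree command lines. */
final class WorktreeCommands {
    /** git worktree add -b <branch> <dir>: new branch from the current HEAD. */
    static List<String> add(Path repoRoot, Path worktreeDir, String branch) {
        return List.of("git", "-C", repoRoot.toString(),
                "worktree", "add", "-b", branch, worktreeDir.toString());
    }

    /** git worktree remove --force <dir>: discard the worktree even if it has changes. */
    static List<String> remove(Path repoRoot, Path worktreeDir) {
        return List.of("git", "-C", repoRoot.toString(),
                "worktree", "remove", "--force", worktreeDir.toString());
    }
}
```

Each list can be passed to `new ProcessBuilder(command)` to run it; checking the exit code tells you whether the worktree was created or removed.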

The key interface is Workspace:

public interface Workspace extends AutoCloseable {
    Path path();          // Absolute path to the isolated directory
    void close();         // Clean up (remove worktree)
}

For non-git projects, a DirectoryWorkspace creates a temporary directory and optionally copies source files. But for the common case — a git repository — worktrees provide isolation without the cost of a full clone.

The tradeoff is that worktrees require a git repository. If you are working on a non-git project or a freshly initialized directory, the fallback to temporary directories is less elegant. But for the vast majority of real codebases, worktrees are the right abstraction.

Composable tool backends

Different environments have different constraints. Some teams run pure-JVM deployments where Node.js is not available. Others already use MCP servers and want to reuse them. A coding agent framework should not force one approach.

AgentEnsemble provides three tool backends, plus an AUTO mode that picks among them, all selected via ToolBackend:

Backend	Description	Requires
`AUTO`	Detect best available backend	Nothing
`JAVA`	Java-native coding tools (glob, search, edit, shell, git, build, test)	`agentensemble-tools-coding` on classpath
`MCP`	MCP reference servers for filesystem + git	`agentensemble-mcp` + Node.js
`MINIMAL`	`FileReadTool` only	Always available

AUTO resolves in order: MCP > JAVA > MINIMAL. If neither optional module is on the classpath, the agent works with file-read only — limited, but functional for read-only analysis tasks.
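
The resolution order stated above amounts to a chain of classpath probes. A minimal sketch, assuming presence is detected by looking up one well-known class per module; the marker class names below are placeholders invented for this example, and the sketch ignores the additional Node.js requirement of the MCP backend.

```java
import java.util.function.Predicate;

/** Hypothetical AUTO resolution following the order given in the text. */
final class BackendResolver {
    enum Backend { MCP, JAVA, MINIMAL }

    // Placeholder names; the real module entry points are not given in this post.
    static final String MCP_MARKER = "example.mcp.Marker";
    static final String JAVA_MARKER = "example.coding.Marker";

    /** The predicate answers "is this class available?". */
    static Backend resolve(Predicate<String> classPresent) {
        if (classPresent.test(MCP_MARKER)) return Backend.MCP;
        if (classPresent.test(JAVA_MARKER)) return Backend.JAVA;
        return Backend.MINIMAL; // always available
    }

    /** Classpath probe that does not initialize the class it finds. */
    static boolean onClasspath(String className) {
        try {
            Class.forName(className, false, BackendResolver.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }
}
```

Calling `BackendResolver.resolve(BackendResolver::onClasspath)` performs the real probe; with neither placeholder class present it falls through to `MINIMAL`.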

The Java backend provides purpose-built coding tools:

GlobTool — find files by pattern across the project
GrepTool — search file contents with regex
CodeEditTool — surgical line-range replacement (not full-file overwrite)
ShellTool — execute build/test commands with output capture
GitTool — status, diff, stage, commit

The MCP backend starts the official MCP filesystem and git reference servers as subprocesses and adapts their tools to the AgentTool interface. Both backends produce the same tool interface, so the rest of the framework does not care which one is active.

The one-liner and the builder

For the common case, a single call handles everything:

EnsembleOutput result = CodingEnsemble.run(model, Path.of("/my/project"),
    CodingTask.fix("NullPointerException in UserService.getById()"));

That call detects the project, assembles tools, generates a coding-specific system prompt, and runs the agent with a higher iteration limit (75 vs the default 25 — coding tasks typically need more rounds).

For more control, the builder exposes every knob:

Agent agent = CodingAgent.builder()
    .llm(model)
    .workingDirectory(Path.of("/my/project"))
    .toolBackend(ToolBackend.JAVA)
    .requireApproval(true)
    .maxIterations(75)
    .additionalTools(myCustomTool)
    .build();

The builder returns a standard Agent — no subclassing, no special execution path. You can use it with Task, Ensemble, phases, or any other framework feature. The coding agent is composed from the same primitives as every other agent.

Pre-configured task types

Common coding workflows have predictable shapes. A bug-fix task needs different instructions than a feature implementation or a refactoring:

Task fix       = CodingTask.fix("NullPointerException in handler");
Task implement = CodingTask.implement("Add pagination to /api/users");
Task refactor  = CodingTask.refactor("Extract UserRepository interface");

Each returns a standard Task with appropriate description and expected-output templates. They can be further customized:

Task task = CodingTask.fix("Some bug")
    .toBuilder()
    .expectedOutput("Custom expected output")
    .build();

This is a convenience, not a requirement. You can always construct a Task manually and pass it to a coding agent.

Tradeoffs and limitations

Project detection is heuristic. It works for standard project layouts but will not detect custom build systems or unconventional directory structures. The fallback is explicit configuration via the builder.

Iteration limits are a blunt instrument. A higher limit gives the agent more chances to iterate, but it also means higher token costs if the agent goes in circles. There is no substitute for good prompting and appropriate task scoping.

Workspace isolation adds a step. The agent works in a worktree, but the user still needs to review and merge the changes. This is deliberate — automated merge would undermine the safety guarantee — but it does add friction to the workflow.

Tool backend selection is fixed before the run starts. Which backends are available depends on the dependencies you include at build time, and AUTO picks among them when the agent is assembled. You cannot hot-swap between the Java and MCP backends mid-execution.

The design principle

The useful abstraction is not “an agent that can code” but “a standard agent with the right tools and context for coding tasks.” The coding agent is not a special type — it is a regular agent, assembled with project-aware tools, operating in an isolated workspace, and configured with appropriate iteration limits.

This matters because it means coding agents compose with everything else in the framework: phases, delegation, human review, metrics, traces. There is no separate execution path to maintain.

The coding agent modules are part of AgentEnsemble. The coding agents guide and workspace isolation guide cover the full API.

Humans as Participants, Not Controllers: Designing Agent Systems That Run Without You

Thu, 30 Apr 2026 00:00:00 GMT

Most human-in-the-loop designs treat humans as gatekeepers. The agent pipeline pauses, a notification fires, a human reviews and approves, the pipeline continues. If the human is not there, the system waits. If the human takes too long, the system times out.

This works for simple approval workflows. It does not work for systems that need to run autonomously for hours or days while humans come and go.

The harder design problem is: how do you build agent systems where humans are participants in the system rather than controllers of it? Where the system runs without them, benefits from their presence, and does not break when they leave?

The Controller Model vs the Participant Model

In the controller model, the human is a required step in the pipeline. The system cannot proceed without them. If the human is unavailable, the system blocks. Every approval gate is a potential bottleneck.

In the participant model, the human connects to a running system, observes its current state, provides input where useful, makes decisions that require their authority, and disconnects. The system keeps running.

The distinction is not about removing humans from the loop. It is about changing the default from “blocked, waiting for human” to “running autonomously, human welcome.”

The Interaction Spectrum

Not all human interactions have the same urgency or the same blocking requirement. The design uses a five-level spectrum:

Level	Example	Behavior
Autonomous	Housekeeping cleans rooms after checkout	No human needed
Advisory	Manager says “prioritize VIP guest”	Human input welcomed but not required
Notifiable	“Water leak detected in room 305”	Alert a human, proceed with best-effort response
Approvable	Guest requests late checkout	Ask human if available, auto-approve on timeout
Gated	Opening the hotel safe	Cannot proceed without human authorization

Most interactions in a well-designed system should fall in the first three levels. The system handles them autonomously. Humans are notified of important events but do not need to take action for the system to continue.

The gated level is reserved for decisions that genuinely require human authority — security decisions, compliance gates, large financial commitments. These are intentionally rare and intentionally blocking.

Gated Reviews with Role Requirements

When a task requires human authorization, the review specifies who can approve:

Task openSafe = Task.builder()
    .description("Open the hotel safe for cash reconciliation")
    .review(Review.builder()
        .prompt("Manager authorization required to open the safe")
        .requiredRole("manager")
        .timeout(Duration.ZERO)  // no timeout -- wait until a human decides
        .build())
    .build();

When this review fires and no qualified human is connected:

The review is queued
An out-of-band notification is sent (Slack, email, webhook)
The task waits
When a qualified human connects to the dashboard, they see the pending review immediately
They approve or reject, and the task resumes

The key design choice: timeout(Duration.ZERO) means the system waits indefinitely. This is appropriate for decisions that genuinely cannot be made without human authority. For less critical approvals, a timeout with auto-approve provides the fallback:

Review.builder()
    .prompt("Guest requests late checkout -- approve?")
    .requiredRole("front-desk")
    .timeout(Duration.ofMinutes(10))
    .timeoutDecision(ReviewDecision.APPROVE)
    .build()

If no human responds within 10 minutes, the system auto-approves. The human can still intervene within the window, but the system does not block indefinitely for a non-critical decision.
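
The two timeout behaviours can be sketched with a plain `CompletableFuture`. This is an assumption-laden illustration rather than the framework's review implementation: `ReviewTimeoutSketch`, `awaitDecision`, and the local `ReviewDecision` enum are hypothetical names, and the human's decision is modelled as a future that someone else completes.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: gated wait vs. timeout with a default decision.
class ReviewTimeoutSketch {
    enum ReviewDecision { APPROVE, REJECT }

    // Duration.ZERO means "no timeout": block until a human decides.
    // Any other timeout falls back to timeoutDecision once the window closes.
    static ReviewDecision awaitDecision(CompletableFuture<ReviewDecision> human,
                                        Duration timeout,
                                        ReviewDecision timeoutDecision) {
        if (timeout.isZero()) {
            return human.join();
        }
        // copy() leaves the caller's future untouched if the timeout fires.
        return human.copy()
                .completeOnTimeout(timeoutDecision, timeout.toMillis(), TimeUnit.MILLISECONDS)
                .join();
    }

    public static void main(String[] args) {
        CompletableFuture<ReviewDecision> nobodyAnswers = new CompletableFuture<>();
        // No human responds within the window, so the configured default is returned.
        System.out.println(awaitDecision(nobodyAnswers, Duration.ofMillis(100), ReviewDecision.APPROVE));
    }
}
```

A decision that arrives inside the window always wins over the default, which matches the "human can still intervene" behaviour described above.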

Human Directives

Humans can inject guidance into any ensemble they have access to:

{
  "type": "directive",
  "to": "room-service",
  "from": "manager:human",
  "content": "Guest in 801 is VIP, prioritize all their requests"
}

Directives are non-blocking. They do not pause the system or wait for acknowledgment. They are injected as additional context for future task executions. The next time room service processes a request related to room 801, the directive is included in the prompt context.

This models how human managers actually work. A hotel manager does not approve every room service order. They walk through the hotel, observe what is happening, and give occasional direction: “That table needs attention.” “The VIP in the penthouse gets priority.” Then they move on.
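
One way to picture non-blocking directives is a small store that records each directive and renders the relevant ones into prompt context on the next task execution. The sketch below is illustrative only: `DirectiveSketch`, its `Directive` record, and the "Directive from ..." line format are assumptions made for this example, not the framework's types or prompt layout.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.stream.Collectors;

// Hypothetical sketch: directives are stored, never awaited, and read at prompt-build time.
class DirectiveSketch {
    record Directive(String to, String from, String content) {}

    private final List<Directive> directives = new CopyOnWriteArrayList<>();

    // Non-blocking: record the directive and return immediately.
    void accept(Directive directive) {
        directives.add(directive);
    }

    // Build the extra prompt context for one ensemble's next task execution.
    String contextFor(String ensemble) {
        return directives.stream()
                .filter(d -> d.to().equals(ensemble))
                .map(d -> "Directive from " + d.from() + ": " + d.content())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        DirectiveSketch store = new DirectiveSketch();
        store.accept(new Directive("room-service", "manager:human",
                "Guest in 801 is VIP, prioritize all their requests"));
        System.out.println(store.contextFor("room-service"));
    }
}
```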

Control Plane Directives

Beyond natural language guidance, humans (or automated policies) can send structured control plane directives:

{
  "type": "directive",
  "to": "kitchen",
  "from": "cost-policy:automated",
  "action": "SET_MODEL_TIER",
  "value": "FALLBACK"
}

This switches the kitchen ensemble to a cheaper model without restarting. The ensemble has configurable model tiers:

Ensemble.builder()
    .chatLanguageModel(gpt4)        // primary
    .fallbackModel(gpt4Mini)        // cheaper fallback
    .build();

Other control plane actions include pausing an ensemble, adjusting priority weights, enabling or disabling specific shared tasks, and changing queue depth limits. These are operational controls that affect ensemble behavior at runtime without redeployment.
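
As a rough sketch of how a SET_MODEL_TIER directive could take effect without a restart, the active tier can live in an atomic reference that is read at the start of each LLM call. Everything here is hypothetical: `ModelTierSketch`, the `PRIMARY` tier name, and the use of plain strings as stand-ins for model objects are assumptions for illustration.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: runtime tier switch read on every model lookup.
class ModelTierSketch {
    enum Tier { PRIMARY, FALLBACK }

    private final String primaryModel;   // stand-in for the primary model object
    private final String fallbackModel;  // stand-in for the cheaper fallback
    private final AtomicReference<Tier> tier = new AtomicReference<>(Tier.PRIMARY);

    ModelTierSketch(String primaryModel, String fallbackModel) {
        this.primaryModel = primaryModel;
        this.fallbackModel = fallbackModel;
    }

    // Apply a structured directive; returns false for actions this sketch does not handle.
    boolean apply(String action, String value) {
        if (!"SET_MODEL_TIER".equals(action)) {
            return false;
        }
        tier.set(Tier.valueOf(value)); // throws IllegalArgumentException on an unknown tier
        return true;
    }

    // Read before each LLM call, so a switch takes effect on the next call.
    String currentModel() {
        return tier.get() == Tier.PRIMARY ? primaryModel : fallbackModel;
    }

    public static void main(String[] args) {
        ModelTierSketch kitchen = new ModelTierSketch("gpt4", "gpt4Mini");
        kitchen.apply("SET_MODEL_TIER", "FALLBACK");
        System.out.println(kitchen.currentModel());
    }
}
```

Calls already in flight keep the model they started with; only subsequent calls see the new tier.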

Late-Join State Synchronization

When a human connects to the dashboard — whether it is their first time today or they are reconnecting after a network interruption — they need to see the current state of the system immediately. They should not have to wait for events to stream in before understanding what is happening.

The existing late-join mechanism (from v2.1.0’s agentensemble-web module) extends to the network level. When a human connects:

The dashboard sends a hello message with the human’s identity and roles
Each ensemble the human has access to sends a snapshotTrace — the current state of all active tasks, pending reviews, queue depths, and recent events
Live events start streaming immediately

The human is caught up within seconds of connecting. Pending reviews that match their role are highlighted. They can start making decisions immediately without waiting for context to accumulate.

Operational Resilience

The participant model enables several operational patterns that the controller model cannot support:

Elastic scaling with human oversight. A conference weekend means higher load. The system scales automatically (K8s HPA watching queue depth). The human manager connects, observes the scaled-up state, adjusts priorities if needed, and disconnects. The system handles the load autonomously.

Operational profiles. Predefined configurations for known scenarios:

NetworkProfile sportingEvent = NetworkProfile.builder()
    .name("sporting-event-weekend")
    .ensemble("front-desk", Capacity.replicas(4).maxConcurrent(50))
    .ensemble("kitchen", Capacity.replicas(3).maxConcurrent(100))
    .ensemble("room-service", Capacity.replicas(3).maxConcurrent(80))
    .preload("kitchen", "inventory", "Extra beer and ice stocked")
    .build();

network.applyProfile(sportingEvent);

A human can apply a profile, or profiles can activate on a schedule or via rules.

Simulation and chaos engineering. Before the conference, simulate the expected load: “What happens if kitchen goes down during peak dinner service?” Run a simulation with mock LLMs, time-compressed. Get a capacity report. Then inject a kitchen failure as a chaos test. Assert that room service’s circuit breaker opens within 30 seconds and the fallback activates within 1 minute. These are built into the framework, not bolted on.

Federation. Hotel A is at capacity. Hotel B across town has idle kitchen capacity. Overflow requests route to Hotel B automatically. The human manager sees both hotels on the same dashboard. This is the network-of-networks level — multiple independent agent systems sharing capacity when needed.

Tradeoffs

Autonomy vs oversight. The more autonomous the system, the less opportunity for human correction before a mistake propagates. The mitigation is observability: the system runs autonomously but every decision is traced, logged, and visible. Humans review after the fact and inject directives to adjust future behavior.

Gating cost. Every gated review is a potential bottleneck and a source of latency. The design pressure is to minimize gated interactions — reserve them for decisions that genuinely require human authority. If you find yourself gating routine operations, the system design needs revision, not more human approvals.

Notification fatigue. A system that notifies humans about everything trains them to ignore notifications. The five interaction levels (autonomous, advisory, notifiable, approvable, gated) exist to keep the signal-to-noise ratio high. Most things should be autonomous. Notifications should be reserved for things that actually need attention.

Simulation fidelity. Simulations use mock LLMs and time compression. The behavior will not perfectly match production. The value is in finding structural problems — capacity bottlenecks, missing fallbacks, broken circuit breakers — not in predicting exact outcomes.

This is the third and final post in the Ensemble Network architecture arc. The architecture is planned for AgentEnsemble v3.0.0. The previous posts cover ensembles as services and cross-ensemble delegation.

The design document covers the full architecture including discovery, error handling, versioning, security, testing, and the phased delivery plan.

AgentEnsemble is open-source under the MIT license.

Task Sharing vs Tool Sharing: Cross-Ensemble Delegation in Distributed Agent Systems

Mon, 27 Apr 2026 00:00:00 GMT

MCP (Model Context Protocol) gives agents the ability to call tools hosted by other services. This is useful — it is function-level interoperability. An agent calls a function, gets a result, continues.

But there is a level above function calls that most frameworks have not addressed: what happens when one autonomous agent system needs to delegate a complex, multi-step process to another autonomous agent system?

The distinction matters. Calling a tool is like borrowing a calculator. Delegating a task is like hiring a department.

When agent ensembles run as long-lived services on a network (as described in the previous post), they need to share capabilities with each other. There are two fundamentally different kinds of sharing:

Tool sharing exposes a single function. The calling agent invokes it in its ReAct loop, gets a result, and continues reasoning. The tool executes atomically — there is no multi-step process, no internal agents, no review gates. This is what MCP provides.

Task sharing exposes a complete process. The calling ensemble delegates work to another ensemble, which runs its own agents, tools, memory, and review gates to produce the result. The caller does not know or control the internal process. It hands off work and gets back a result.

// Room service uses both kinds of sharing from kitchen
Ensemble roomService = Ensemble.builder()
    .name("room-service")
    .chatLanguageModel(model)
    .task(Task.builder()
        .description("Handle guest room service request")
        .tools(
            // Task sharing: delegates the full meal preparation process
            NetworkTask.from("kitchen", "prepare-meal"),

            // Tool sharing: calls a single function for inventory check
            NetworkTool.from("kitchen", "check-inventory"),
            NetworkTool.from("kitchen", "dietary-check"),

            // Task sharing: delegates repair work to maintenance
            NetworkTask.from("maintenance", "repair-request"))
        .build())
    .build();

Both NetworkTask and NetworkTool implement the same AgentTool interface. The agent calling them does not know whether a tool is local or remote, or whether it triggers a single function or an entire pipeline. The existing ReAct loop, tool executor, metrics, and tracing all work unchanged.

How Delegation Works

When an agent calls a shared tool, the flow is straightforward:

1. Agent calls check-inventory("wagyu beef")
2. NetworkTool serializes the call into a WorkRequest
3. Request is sent to the kitchen ensemble (WebSocket or queue)
4. Kitchen executes inventoryTool.execute("wagyu beef") locally
5. Result flows back: "Yes, 3 portions available"
6. Agent continues its ReAct loop

When an agent calls a shared task, the flow involves a full pipeline on the other side:

1. Agent calls prepare-meal("Wagyu steak, medium-rare, room 403")
2. NetworkTask serializes a WorkRequest with the full task context
3. Request is sent to kitchen
4. Kitchen runs its complete task pipeline — agent synthesis, tool calls, execution, review gates
5. Result flows back: "Preparing now, estimated 25 minutes, ticket #4071"
6. Agent continues

The critical difference: in step 4 of the task delegation, the kitchen ensemble is running its own agents with its own tools and its own review gates. The room service agent is not involved in any of that. It delegated the work and is waiting for a result — or continuing with other work if the request was async.

The WorkRequest Envelope

Every cross-ensemble message uses a standardized envelope:

public record WorkRequest(
    String requestId,           // Correlation + idempotency key
    String from,                // Requesting ensemble name
    String task,                // Shared task or tool name to execute
    String context,             // Natural language input/context
    Priority priority,          // CRITICAL / HIGH / NORMAL / LOW
    Duration deadline,          // Caller's SLA ("I need this within...")
    DeliverySpec delivery,      // How and where to return the result
    String traceContext,        // W3C traceparent for distributed tracing
    CachePolicy cachePolicy,    // USE_CACHED / FORCE_FRESH
    String cacheKey             // Optional, for result caching
) {}

A few design choices in this envelope are worth noting:

The context field is natural language. When maintenance asks procurement to order parts, the context is: “Order replacement valve for building 2 boiler.” Not a typed JSON schema. Not a protobuf message. Natural language that the receiving ensemble’s LLM interprets.

The deadline belongs to the caller, not the provider. The requester sets the SLA: “I need this within 30 minutes.” The provider responds with an estimated completion time. If the estimate exceeds the deadline, the caller decides: accept the longer wait, try another provider (federation), or continue without.

Delivery is caller-specified. The requester tells the provider how to return the result — WebSocket for real-time, a durable queue for reliability, a webhook for external integration, or a shared store for polling.

Natural Language as Contract

This is the design choice I find most interesting and most debatable.

In traditional microservice architectures, services communicate via typed schemas — protobuf, OpenAPI, GraphQL. Schema versioning is a constant source of friction. A field name change breaks callers. A new required field breaks backwards compatibility. Teams spend significant effort on schema evolution, versioning policies, and migration tooling.

In the Ensemble Network, the contract between services is natural language. When maintenance tells procurement “order replacement parts for the boiler valve,” it does not matter whether procurement’s internal schema changed. The LLM on the receiving side interprets the request. Minor changes in wording do not break callers.

This works because the participants are LLMs, not deterministic parsers. An LLM that receives “order parts for the boiler” and an LLM that receives “purchase replacement components for the heating system” will produce equivalent behavior. The semantic intent is preserved even when the exact phrasing varies.

The tradeoff is real: you lose type safety. A typed schema guarantees that the data conforms to a specific shape. Natural language does not. If the receiving ensemble misinterprets the request, you get a wrong result, not a compile error. The mitigation is the same as elsewhere in agent systems: review gates, guardrails, and observability.

Three Request Modes

The caller decides how to wait for the result:

Mode	Behavior	Use case
Await	Block until result	Critical path: “Can’t continue without this”
Async	Submit and continue; result delivered later	Non-critical: “Order towels when you get to it”
Await with deadline	Wait up to N; then continue with partial/no result	Balanced: “Wait 30 min, then proceed with what I know”

The await-with-deadline mode is the most operationally useful. It lets the caller set a budget for how long to wait before continuing. If the provider delivers within the deadline, the caller uses the result. If not, it makes a decision: retry, use a fallback, or proceed without.
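
From the caller's side, the three modes map fairly naturally onto a future. The following is a minimal sketch under the assumption that a delegated request is represented as a `CompletableFuture<String>`; `RequestModesSketch` and its method names are hypothetical and not part of the framework.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

// Hypothetical sketch of the three request modes on top of a future.
class RequestModesSketch {
    // Await: block until the provider's result arrives.
    static String await(CompletableFuture<String> result) {
        return result.join();
    }

    // Async: register a callback and keep going; the result is delivered later.
    static void async(CompletableFuture<String> result, Consumer<String> onResult) {
        result.thenAccept(onResult);
    }

    // Await with deadline: wait up to the deadline, then continue without a result.
    static Optional<String> awaitWithDeadline(CompletableFuture<String> result, Duration deadline) {
        try {
            return Optional.ofNullable(result.get(deadline.toMillis(), TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            return Optional.empty(); // caller decides: retry, fall back, or proceed without
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        } catch (ExecutionException e) {
            throw new IllegalStateException("provider failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> slowProvider = new CompletableFuture<>();
        // The provider never answers here, so the caller proceeds with an empty result.
        System.out.println(awaitWithDeadline(slowProvider, Duration.ofMillis(100)).isPresent());
    }
}
```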

Capacity Management

The provider’s default response to load is accept and queue, not reject. LLM tasks are not real-time request/response — they take seconds to hours. Everyone expects latency. The provider accepts the work into a priority queue and returns an estimated completion time:

{
  "type": "task_accepted",
  "requestId": "maint-7721",
  "queuePosition": 7,
  "estimatedCompletion": "PT45M"
}

Rejection only happens at hard limits — the queue itself is full. This “bend, don’t break” approach matches the reality of LLM workloads: capacity is elastic, latency is expected, and it is almost always better to queue work than to reject it.

Priority queuing ensures critical requests are processed first (CRITICAL > HIGH > NORMAL > LOW). Within the same priority, FIFO. Low-priority items age over time to prevent starvation.
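
A minimal sketch of priority ordering with FIFO tie-breaking and aging follows. The aging rule used here (one priority level gained per `agingStep` waited, capped at HIGH so that aged work never outranks CRITICAL) is an assumption for illustration; the post does not specify the actual policy. `AgingQueueSketch` is a hypothetical name, and the linear scan in `poll` trades speed for simplicity.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: priority order, FIFO within a level, aging against starvation.
class AgingQueueSketch {
    enum Priority { CRITICAL, HIGH, NORMAL, LOW } // ordinal 0 is the most urgent

    record Item(String id, Priority priority, Instant enqueuedAt, long seq) {}

    private final List<Item> items = new ArrayList<>();
    private final Duration agingStep;
    private long nextSeq = 0;

    AgingQueueSketch(Duration agingStep) {
        if (agingStep.isZero() || agingStep.isNegative()) {
            throw new IllegalArgumentException("agingStep must be positive");
        }
        this.agingStep = agingStep;
    }

    synchronized void offer(String id, Priority priority, Instant now) {
        items.add(new Item(id, priority, now, nextSeq++));
    }

    // Rank improves by one level per agingStep waited, but aged items stop at HIGH.
    int effectiveRank(Item item, Instant now) {
        int base = item.priority().ordinal();
        long steps = Math.max(0, Duration.between(item.enqueuedAt(), now).dividedBy(agingStep));
        int best = Math.min(base, Priority.HIGH.ordinal());
        return (int) Math.max(best, base - steps);
    }

    // Pick the lowest effective rank; ties go to the earliest arrival (FIFO).
    synchronized Optional<Item> poll(Instant now) {
        Optional<Item> next = items.stream()
                .min(Comparator.comparingInt((Item i) -> effectiveRank(i, now))
                        .thenComparingLong(Item::seq));
        next.ifPresent(item -> items.remove(item));
        return next;
    }

    public static void main(String[] args) {
        Instant t0 = Instant.now();
        AgingQueueSketch queue = new AgingQueueSketch(Duration.ofMinutes(10));
        queue.offer("towels", Priority.LOW, t0);
        queue.offer("boiler-leak", Priority.CRITICAL, t0);
        System.out.println(queue.poll(t0).map(Item::id).orElse("empty"));
    }
}
```

Recomputing the rank at poll time avoids the stale-ordering problem a heap would have when keys change as items age.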

Distributed Tracing

Every WorkRequest carries a W3C traceparent header. When maintenance delegates to procurement, which delegates to logistics, the trace context propagates across all three. Open Jaeger (or any W3C-compatible tracing backend) and you see the full chain: which ensemble originated the request, how long each step took, where the bottleneck was.

This is standard distributed tracing, not a custom solution. The same infrastructure teams use for HTTP microservices works here. The difference is that each span may represent an LLM call that takes 30 seconds instead of a database query that takes 3 milliseconds.
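
For readers unfamiliar with the header, a `traceparent` value has four dash-separated fields: a version, a 16-byte trace ID, an 8-byte parent span ID, and trace flags, all lowercase hex. The sketch below shows how a delegating ensemble might derive the header for a downstream request by keeping the trace ID and minting a new span ID. `TraceparentSketch` is a hypothetical helper written for this post; in practice a tracing library such as OpenTelemetry would normally handle this.

```java
import java.security.SecureRandom;
import java.util.HexFormat;

// Hypothetical sketch of W3C traceparent creation and propagation.
class TraceparentSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    private static String randomHex(int bytes) {
        byte[] buffer = new byte[bytes];
        RANDOM.nextBytes(buffer);
        return HexFormat.of().formatHex(buffer); // lowercase hex
    }

    // New trace: version 00, random 16-byte trace ID, random 8-byte span ID, sampled flag 01.
    static String newRoot() {
        return "00-" + randomHex(16) + "-" + randomHex(8) + "-01";
    }

    // Downstream request: same version, trace ID, and flags; fresh span ID.
    static String child(String parent) {
        String[] parts = parent.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("not a traceparent value: " + parent);
        }
        return parts[0] + "-" + parts[1] + "-" + randomHex(8) + "-" + parts[3];
    }

    public static void main(String[] args) {
        String maintenance = newRoot();           // maintenance originates the request
        String procurement = child(maintenance);  // maintenance delegates to procurement
        String logistics = child(procurement);    // procurement delegates to logistics
        System.out.println(maintenance);
        System.out.println(procurement);
        System.out.println(logistics);
    }
}
```

All three printed values share the same trace ID, which is what lets the tracing backend stitch the chain together.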

Tradeoffs

Loose coupling vs type safety. Natural language contracts are resilient to change but do not guarantee correctness. Typed schemas guarantee correctness but are brittle to change. The right choice depends on how stable the interface is. For evolving, exploratory agent systems, natural language is pragmatic. For stable, high-volume interfaces, a typed schema wrapper may be worth the friction.

Latency tolerance. Cross-ensemble delegation adds network hops and queuing delays. A task that takes 10 seconds locally may take 2 minutes when delegated across a network. The architecture assumes latency tolerance — if your use case requires sub-second responses, delegation is the wrong pattern.

Failure modes. When the kitchen ensemble is down, room service’s prepare-meal call fails. The circuit breaker opens. The agent needs a fallback — suggest alternatives, queue the request for later, or inform the guest. Distributed systems fail in distributed ways. The framework provides the circuit breaker and fallback mechanisms, but the failure strategy is application-specific.

Observability cost. Every cross-ensemble request generates trace data, metrics, and log entries. In a busy network with many delegations, the observability overhead is non-trivial. The tracing infrastructure needs to handle the volume, and teams need dashboards that make sense of the flow.

This is the second post in a three-part arc on the Ensemble Network architecture. The next post covers human participation — how humans connect to and interact with a network of autonomous ensembles without becoming bottlenecks.

The design document covers the full architecture.

AgentEnsemble is open-source under the MIT license.

From Run-and-Exit to Always-On: When Agent Ensembles Become Services

Fri, 24 Apr 2026 00:00:00 GMT

Every multi-agent framework works the same way at its core. You define some agents, give them tasks, press go, get output. The agents exist for the duration of the run and then disappear.

This is fine for bounded problems: “research this topic and write a report.” But it does not model how real work gets done in production systems that need to be always-on, multi-domain, and human-augmented.

The question I kept coming back to was: what changes when an ensemble stops being a script and starts being a service?

Scripts vs Services

A script runs and exits. You invoke it, it does work, it returns a result, the process terminates. Most multi-agent frameworks today, including CrewAI, AutoGen, LangGraph, and AgentEnsemble v2.x, default to this mode.

A service runs continuously. It handles work as it arrives, communicates with peers, maintains state between requests, and survives restarts. The difference is not just about uptime — it changes the entire interaction model.

When an ensemble is a script, it is invoked by something external. When an ensemble is a service, it participates in a network of other services. It can accept work from multiple sources, share capabilities with peers, and run proactive tasks on a schedule — all without an external orchestrator telling it what to do.

The Hotel Model

Consider a hotel. It is composed of departments: front desk, housekeeping, kitchen, room service, maintenance, procurement. Each department is autonomous — it has its own staff, processes, and expertise. These departments communicate with each other directly. Room service calls the kitchen to prepare a meal. Maintenance calls procurement to order spare parts.

The hotel runs continuously. The manager comes in at 8am, walks around, checks on things, gives some direction, handles decisions that require authority, and goes home at 6pm. The hotel does not stop when the manager leaves.

This maps directly to a distributed agent architecture:

Hotel concept	Agent system equivalent
A department	An ensemble — long-running, autonomous
Staff within a department	Agents and tasks within the ensemble
The intercom / phone system	WebSocket mesh — the message transport
A work order	A WorkRequest — the standard message envelope
The hotel directory	Service registry — ensembles discover each other
The duty manager	A human who connects via the dashboard to observe and intervene

The key observation: the hotel is not centrally orchestrated. There is no “manager agent” that routes every message. Departments handle their domain and communicate laterally.

Two Execution Modes

The existing one-shot mode remains unchanged:

EnsembleOutput output = Ensemble.run(model,
    Task.of("Research AI trends"),
    Task.of("Write a report"));

Tasks execute, output is returned, the ensemble is done. This is a “gig” — a bounded unit of work.

The new long-running mode turns the ensemble into a service:

Ensemble kitchen = Ensemble.builder()
    .name("kitchen")
    .chatLanguageModel(model)
    .task(Task.of("Manage kitchen operations"))

    // Share capabilities to the network
    .shareTask("prepare-meal", Task.builder()
        .description("Prepare a meal as specified")
        .expectedOutput("Confirmation with preparation details and timing")
        .build())
    .shareTool("check-inventory", inventoryTool)

    // Scheduled proactive task
    .scheduledTask(ScheduledTask.builder()
        .name("inventory-report")
        .task(Task.of("Check current inventory levels and report shortages"))
        .schedule(Schedule.every(Duration.ofHours(1)))
        .broadcastTo("hotel.inventory")
        .build())

    .build();

kitchen.start(7329);  // WebSocket server, K8s Service fronts this

In long-running mode, the ensemble:

Registers shared tasks and tools on the network
Accepts incoming work requests via WebSocket, queue, HTTP, or topic subscription
Processes work through a priority queue
Delivers results via the caller-specified delivery method
Runs scheduled proactive tasks on configured intervals
Continues until explicitly stopped or drained

The start(port) call is the boundary between script and service. Before it, the ensemble is a configuration. After it, the ensemble is an active participant in a network.

Work Ingress

When an ensemble becomes a service, work can arrive from multiple sources simultaneously:

Source	Description
WebSocket	Direct from another ensemble (real-time)
Queue	Pull from durable queue (Kafka, SQS, Redis Streams)
HTTP API	`POST /api/work` (external systems, scripts, CI pipelines)
Topic subscription	React to events from other ensembles
Schedule	Internal cron/interval (proactive tasks)

All sources normalize into the same internal format before entering the ensemble’s priority queue. The ensemble processes work by priority (CRITICAL > HIGH > NORMAL > LOW), with FIFO ordering within the same priority level.

This means an ensemble can simultaneously handle direct requests from peer ensembles, pull batch work from a queue, respond to events, and run scheduled health checks — without any of these mechanisms knowing about each other.

Deployment Model

Each ensemble deploys as a Kubernetes service — one or more pods behind a K8s Service resource. Ensembles discover each other via DNS name. This is standard infrastructure that operations teams already know how to manage.

Namespace: hotel-downtown
  +-- Service: kitchen
  +-- Service: room-service
  +-- Service: maintenance
  +-- Service: front-desk
  +-- Service: dashboard

Scaling is handled by Kubernetes HPA watching queue depth or request latency. Conference weekend with heavy kitchen load? Scale kitchen to 3 replicas. Off-peak Tuesday? Scale back to 1. The ensemble handles replica coordination through broadcast-claim delivery: a work request is offered to all replicas, and the first to claim it processes it.
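
The core of broadcast-claim is an atomic "first writer wins" check. Here is a minimal in-process sketch, assuming an in-memory map stands in for whatever shared store the replicas would actually coordinate through; `BroadcastClaimSketch` and its method names are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: the first replica to claim a request processes it.
class BroadcastClaimSketch {
    // Stand-in for a store shared by all replicas (request ID -> claiming replica).
    private final ConcurrentMap<String, String> claims = new ConcurrentHashMap<>();

    // Atomic claim: returns true only for the first replica to claim this request.
    boolean tryClaim(String requestId, String replicaId) {
        return claims.putIfAbsent(requestId, replicaId) == null;
    }

    String ownerOf(String requestId) {
        return claims.get(requestId);
    }

    public static void main(String[] args) {
        BroadcastClaimSketch store = new BroadcastClaimSketch();
        // The same request is offered to two replicas; only the first claim succeeds.
        System.out.println(store.tryClaim("maint-7721", "kitchen-0"));
        System.out.println(store.tryClaim("maint-7721", "kitchen-1"));
    }
}
```

Because the WorkRequest's `requestId` doubles as an idempotency key, it is a natural key for the claim.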

What Changes

The shift from script to service changes several things:

Lifecycle management matters. A script that crashes simply restarts from scratch. A service needs graceful shutdown, drain logic, and state recovery after a crash. The ensemble supports a drain mode where it stops accepting new work, finishes in-flight tasks, and shuts down cleanly. On restart, it picks up queued work from durable sources.
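
The drain behaviour maps closely onto what `ExecutorService.shutdown()` already does: it refuses new submissions while letting previously accepted tasks finish. The sketch below illustrates that idea only; `DrainSketch`, the pool size, and the grace period are assumptions, and the real ensemble lifecycle presumably involves more (deregistration, durable queues, and so on).

```java
import java.time.Duration;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of drain mode built on a standard executor.
class DrainSketch {
    private final ExecutorService workers = Executors.newFixedThreadPool(4);

    // Returns false once draining has begun, so the caller can leave the work queued elsewhere.
    boolean submit(Runnable work) {
        try {
            workers.execute(work);
            return true;
        } catch (RejectedExecutionException e) {
            return false;
        }
    }

    // Stop accepting new work, let accepted tasks finish, report whether they finished in time.
    boolean drain(Duration grace) {
        workers.shutdown();
        try {
            return workers.awaitTermination(grace.toMillis(), TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        DrainSketch ensemble = new DrainSketch();
        ensemble.submit(() -> System.out.println("finishing in-flight task"));
        System.out.println("drained cleanly: " + ensemble.drain(Duration.ofSeconds(5)));
        System.out.println("accepted after drain: " + ensemble.submit(() -> {}));
    }
}
```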

Proactive work becomes possible. A script only does what you tell it to do. A service can schedule its own work — periodic inventory checks, health assessments, report generation. These scheduled tasks run on internal timers and broadcast results to interested subscribers.

Observability changes. A script that runs for 30 seconds needs a log. A service that runs for months needs a dashboard. The existing web module (WebSocket server, live trace streaming, late-join snapshot) extends naturally to the long-running model.

The human relationship changes. A script blocks on human input and times out. A service has humans who connect and disconnect. They observe the current state, give direction, handle decisions that need authority, and leave. The system keeps running. This is a deep enough topic that the next post in this series will cover it in detail.

Tradeoffs

Complexity vs capability. A script is simple: invoke it, get a result. A service requires infrastructure — Kubernetes, queues, monitoring, lifecycle management. If your workload is “run this pipeline once and give me the output,” the service model is unnecessary overhead.

Always-on cost. A script uses resources only while it runs. A service uses resources continuously, even when idle. For intermittent workloads, the cost calculus favors one-shot execution with on-demand scaling.

State management. Scripts are stateless by nature — they start fresh every time. Services accumulate state: queued work, scheduled tasks, shared memory, connection state. This state needs to be durable, recoverable, and observable.

When to use which. The one-shot mode is right for discrete, bounded problems. The long-running mode is right when the workload is continuous, when multiple domains need to communicate, when humans need to observe and participate without blocking, and when the system needs to be always-on.

Both modes coexist. An ensemble that runs as a long-running service can still execute individual tasks in one-shot mode internally. The architecture does not force a choice — it extends the existing model.

This is the first post in a three-part arc on the Ensemble Network architecture planned for v3.0.0. The next post covers cross-ensemble delegation — how ensembles share tasks and tools across service boundaries, and why the contract between them is natural language, not typed schemas.

The design document covers the full architecture.

AgentEnsemble is open-source under the MIT license.
