Agentic & Event-Driven AI Systems

Designing stateful AI systems that reason, decide, and act over time — reliably and at scale.

Many discussions around “agentic AI” focus on orchestrating LLM calls. In production systems, that framing is insufficient.

Agentic systems that work reliably are long-running, stateful system components. They react to signals, maintain durable state, coordinate with other components, and recover deterministically from failure.

This page describes how we design agentic systems as distributed systems, using event-driven and stateful architectures — with LLMs as reasoning components, not as controllers.

What Organizations Gain

Production-grade agentic systems require distributed-systems thinking, not just prompt engineering.

Fault Tolerance

Agents are designed as long-running processes with durable, externally managed state. They recover from failures deterministically by restoring checkpointed state, not through ad hoc retry loops.

Replayability

Historical inputs can be replayed. State transitions are explicit. Behavior can be audited and reasoned about — enabling safe evolution and correction of logic.

Determinism

Agents react to signals over time. Decisions are emitted as outputs. State is checkpointed and recoverable — not reconstructed from chat history.

Observability

Agent behavior, state changes, and decisions are traceable. Debugging and introspection are first-class concerns, not afterthoughts.

Governance

LLMs are invoked selectively, not as controllers. Context is retrieved explicitly. Reasoning is constrained to valid options. Results are validated deterministically.

Cost Control

Memory is managed deliberately. LLMs are used when they add value, not as a default step. This leads to predictable cost and bounded failure modes.

What “Agentic Systems” Mean in Production

In production, an agent is not a prompt template, a polling loop around an LLM, or a stateless function.

An agent is a long-running process with durable, externally managed state, reacting to signals over time, producing decisions, actions, or new signals.

Agents live inside the system, not at its edges.

This introduces non-negotiable requirements: fault tolerance, replayability, determinism, observability, governance, and cost control.

These are distributed-systems concerns — not prompt-engineering problems.

Signal-Driven vs Prompt-Driven Agents

Many agent implementations are prompt-driven. Signal-driven agents work differently.

Prompt-Driven Approach

A prompt-driven agent receives input, reconstructs context, calls the model, and returns a response. This approach breaks down once decisions span multiple steps, context accumulates over time, workflows branch dynamically, or failures must be recovered cleanly.

Signal-Driven Agents

Agents react to signals (events, batches, replays, scheduled runs). Signals trigger explicit state transitions. Decisions are emitted as outputs. State is checkpointed and recoverable. Whether input arrives as a stream, batch, bulk upload, or replayed dataset is irrelevant.
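
As a minimal sketch of a signal-driven transition, the following plain-Python example applies signals through an explicit transition table. The `Signal` and `OrderState` types, the signal kinds, and the order workflow are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, replace
from typing import Iterable


@dataclass(frozen=True)
class Signal:
    kind: str       # hypothetical kinds: "order_created", "payment_received"
    order_id: str


@dataclass(frozen=True)
class OrderState:
    status: str = "new"


# Explicit transition table: (current status, signal kind) -> next status.
TRANSITIONS = {
    ("new", "order_created"): "awaiting_payment",
    ("awaiting_payment", "payment_received"): "paid",
}


def step(state: OrderState, signal: Signal) -> tuple[OrderState, list[str]]:
    """Apply one signal; return the new state and any emitted decisions."""
    next_status = TRANSITIONS.get((state.status, signal.kind))
    if next_status is None:
        # Unknown or out-of-order signal: state is unchanged, nothing is emitted.
        return state, []
    return replace(state, status=next_status), [f"{signal.order_id}:{next_status}"]


def run(signals: Iterable[Signal], state: OrderState = OrderState()):
    """Fold any iterable of signals (stream, batch, or replay) through step()."""
    decisions = []
    for signal in signals:
        state, emitted = step(state, signal)
        decisions.extend(emitted)
    return state, decisions
```

Because `run` only needs an iterable, the same logic serves a live stream, a batch file, or a replayed dataset, and in each case it yields the same state and decisions for the same signals.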

Stateful Decision Logic

What matters is stateful decision logic over time, not ingestion mode. Agents maintain durable state, enabling deterministic recovery, replayability, and correct behavior under real-world conditions.

Stateful Agents, Determinism & Replay

Agent state is not chat history.

Agent State in Production

In production systems, agent state includes decisions already taken, partial workflow progress, accumulated facts or signals, validation outcomes, and coordination metadata. This state must be durable, replayable, inspectable, and versionable.
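
One way to make such state inspectable and versionable is to model it as an explicit, serializable record. The sketch below is an assumption about shape, not a prescribed schema: the field names, the `SCHEMA_VERSION` constant, and the v1-to-v2 migration are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field

SCHEMA_VERSION = 2  # hypothetical: bumped whenever the state layout changes


@dataclass
class AgentState:
    schema_version: int = SCHEMA_VERSION
    decisions: list = field(default_factory=list)      # decisions already taken
    workflow_step: str = "start"                       # partial workflow progress
    facts: dict = field(default_factory=dict)          # accumulated facts or signals
    validations: dict = field(default_factory=dict)    # validation outcomes
    coordination: dict = field(default_factory=dict)   # coordination metadata


def dump(state: AgentState) -> str:
    """Serialize to stable JSON so snapshots can be stored, diffed, and inspected."""
    return json.dumps(asdict(state), sort_keys=True)


def load(raw: str) -> AgentState:
    """Deserialize, migrating older snapshots to the current layout."""
    data = json.loads(raw)
    if data.get("schema_version", 1) < 2:
        # Hypothetical migration: version 1 snapshots had no "validations" field.
        data.setdefault("validations", {})
        data["schema_version"] = 2
    return AgentState(**data)
```

Keeping the version inside the record lets old snapshots be upgraded on read instead of being discarded.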

Explicit State Transitions

We design agents so that state transitions are explicit, failures recover deterministically, and historical inputs can be replayed. Behavior can be audited and reasoned about at every step.

Safe Evolution & Correction

Explicit transitions and replayable inputs enable safe evolution, backfills, and correction of logic. Systems can be improved without losing historical context or breaking existing workflows.
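
To illustrate correction by replay, the sketch below runs the same historical input log through an original rule and a corrected rule. The log contents, the threshold, and the `flag_v1` and `flag_v2` functions are invented for the example.

```python
# Historical inputs kept in an append-only log (here: transaction amounts).
INPUT_LOG = [120, 950, 1000, 40]


def flag_v1(total_flagged: int, amount: int) -> int:
    # Original rule with a boundary bug: an amount of exactly 1000 is not flagged.
    return total_flagged + (1 if amount > 1000 else 0)


def flag_v2(total_flagged: int, amount: int) -> int:
    # Corrected rule: amounts of 1000 or more are flagged.
    return total_flagged + (1 if amount >= 1000 else 0)


def replay(log, transition, initial=0):
    """Rebuild state from the log; return the final state and every intermediate step."""
    state = initial
    history = []
    for offset, event in enumerate(log):
        state = transition(state, event)
        history.append((offset, event, state))
    return state, history
```

Replaying `INPUT_LOG` with `flag_v2` backfills the corrected result, and the returned history shows the offset at which the two versions diverge.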

Event-Driven Agent Frameworks in Practice

We implement agentic systems using stateful, event-driven runtimes.

Apache Flink Agents

Using Flink, agents are implemented as long-running, stateful operators driven by signals (streams, batches, replays) with exactly-once state guarantees. This enables durable agent state, deterministic recovery, controlled reprocessing, and coordination across multiple agents.
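
The recovery semantics can be illustrated without any framework. The toy operator below is plain Python and does not use Flink's API. It assumes that state and input offset are checkpointed together and that inputs can be re-read from a replayable log. The class name and the in-memory checkpoint are stand-ins for illustration.

```python
import copy


class CheckpointedCounter:
    """Toy stateful operator that counts signals per key."""

    def __init__(self):
        self.state = {}                # live, in-memory state
        self.offset = 0                # position of the next input to process
        self._checkpoint = ({}, 0)     # stand-in for durable checkpoint storage

    def process(self, log, checkpoint_every=2, crash_at=None):
        while self.offset < len(log):
            if crash_at is not None and self.offset == crash_at:
                raise RuntimeError("simulated crash")
            key = log[self.offset]
            self.state[key] = self.state.get(key, 0) + 1
            self.offset += 1
            if self.offset % checkpoint_every == 0:
                # State and input offset are saved together as one snapshot.
                self._checkpoint = (copy.deepcopy(self.state), self.offset)

    def recover(self):
        """Discard in-flight work and resume from the last consistent snapshot."""
        state, offset = self._checkpoint
        self.state, self.offset = copy.deepcopy(state), offset
```

After a crash, `recover()` followed by `process(log)` re-reads inputs from the checkpointed offset, so each input affects state exactly once. Side effects outside the state, such as calls to external services, are not covered by this guarantee and need their own idempotency handling.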

Akka & Event-Sourced Agents

Akka provides a complementary model: isolated agents via actors, explicit supervision and lifecycle management, event sourcing as a first-class concept, and strong modeling of command/event flows. This model is particularly effective for agent hierarchies and complex business logic.
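
The command/event split can be sketched in a few lines. This is a plain-Python illustration of the pattern, not Akka's API. The `Approve` command, the `Approved` event, and the `TicketAgent` class are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approve:        # command: a request that may be rejected
    ticket_id: str


@dataclass(frozen=True)
class Approved:       # event: a fact that has already happened
    ticket_id: str


class TicketAgent:
    def __init__(self, journal=None):
        self.approved = set()
        self.journal = list(journal or [])
        for event in self.journal:
            # Recovery: rebuild state by re-applying the persisted events.
            self._apply(event)

    def handle(self, command):
        """Validate a command, then persist and apply the resulting events."""
        if isinstance(command, Approve) and command.ticket_id not in self.approved:
            events = [Approved(command.ticket_id)]
        else:
            events = []   # duplicate or unknown command: nothing happens
        for event in events:
            self.journal.append(event)
            self._apply(event)
        return events

    def _apply(self, event):
        # State changes only here, and only in response to events.
        if isinstance(event, Approved):
            self.approved.add(event.ticket_id)
```

Because state changes only inside `_apply`, constructing `TicketAgent(journal=agent.journal)` restores the same state without re-running any command validation.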

LLMs as Components

In both cases, LLMs are invoked by agents — not vice versa. Agents control execution flow, manage state, and decide when reasoning is required. This ensures reproducibility and bounded failure modes.

LLMs as Reasoning Components

In production agentic systems, LLMs do not own state or control execution flow.

Agent Decision Cycle

A typical agent decision cycle: receive a signal, evaluate current state, decide whether reasoning is required, retrieve constrained context, invoke the LLM, validate and structure the output, update state, and emit actions or new signals.
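
The cycle above can be sketched as a single function. Everything named here is an assumption for illustration: the `ALLOWED` action set, the amount threshold, the signal fields, and `fake_llm`, which stands in for a real model call.

```python
import json
from dataclasses import dataclass, field

ALLOWED = {"refund", "escalate", "ignore"}   # hypothetical action set


@dataclass
class State:
    handled: dict = field(default_factory=dict)   # signal id -> chosen action


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a JSON string."""
    return json.dumps({"action": "refund"})


def decide(state: State, signal: dict, llm=fake_llm) -> list:
    # Steps 1-2: receive the signal and evaluate current state.
    if signal["id"] in state.handled:
        return []                       # already decided, so the call is idempotent
    # Step 3: decide whether reasoning is required.
    if signal["amount"] < 10:
        action = "refund"               # cheap deterministic rule, no LLM call
    else:
        # Steps 4-5: build a constrained prompt and invoke the LLM.
        prompt = f"Choose one of {sorted(ALLOWED)} for: {signal['text']}"
        raw = llm(prompt)
        # Step 6: validate and structure the output.
        try:
            action = json.loads(raw).get("action")
        except (json.JSONDecodeError, AttributeError, TypeError):
            action = None
        if not isinstance(action, str) or action not in ALLOWED:
            action = "escalate"         # deterministic fallback
    # Steps 7-8: update state and emit the action.
    state.handled[signal["id"]] = action
    return [{"signal": signal["id"], "action": action}]
```

The model is called on one branch only, and whatever it returns is either mapped to an allowed action or replaced by the fallback.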

Reproducibility & Auditability

Because the agent, not the LLM, owns state and control flow, every decision can be traced back to its inputs, state, and reasoning process. Behavior is reproducible, inspectable, and verifiable.

Bounded Failure Modes

The result is bounded failure modes and predictable cost. LLMs are invoked selectively, not as a default step. Context is constrained, and outputs are validated deterministically.

Memory, Context & Token Optimization

Agentic systems must manage memory deliberately.

Layered Memory Model

Agent state: Durable, structured, replayable system state.
Interaction context: Context scoped to a specific workflow or decision step.
Long-term memory: Persisted knowledge such as historical decisions, user profiles, or domain facts.
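
As a rough sketch, the three layers can be kept as separate types and combined only when a decision step needs them. The class names, fields, and the `max_items` cap are hypothetical. A real system would typically budget by tokens, not by item count.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:                  # durable, structured, replayable
    workflow_step: str = "start"
    decisions: list = field(default_factory=list)


@dataclass
class InteractionContext:          # scoped to one workflow or decision step
    step_inputs: dict = field(default_factory=dict)


class LongTermMemory:              # persisted knowledge, looked up on demand
    def __init__(self, facts):
        self._facts = dict(facts)

    def lookup(self, keys):
        """Return only the requested facts that exist."""
        return {k: self._facts[k] for k in keys if k in self._facts}


def build_prompt_context(state, ctx, memory, needed_keys, max_items=5):
    """Assemble only what this decision step needs, with a hard size cap."""
    items = [("step", state.workflow_step)]
    items += sorted(ctx.step_inputs.items())
    items += sorted(memory.lookup(needed_keys).items())
    return dict(items[:max_items])
```

Long-term memory is queried by key instead of being appended wholesale, which keeps the prompt context small and its contents predictable.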

Context Retrieval as Constraint

Agents often reason within boundaries defined by system or user context: schemas, allowed categories, validation rules, processing constraints. Context is retrieved explicitly, reasoning is constrained to valid options, and results are validated deterministically.
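
A minimal sketch of this pattern: the constraints are retrieved as data, rendered into the instruction, and then reused to validate the structured result. The `CONSTRAINTS` values and the output fields `category` and `priority` are hypothetical.

```python
# Hypothetical constraints retrieved for a given user or system context.
CONSTRAINTS = {
    "allowed_categories": ["billing", "shipping", "technical"],
    "max_priority": 3,
}


def render_options(constraints: dict) -> str:
    """Turn retrieved constraints into an explicit instruction for the model."""
    return "Answer with exactly one of: " + ", ".join(constraints["allowed_categories"])


def validate(output: dict, constraints: dict) -> list[str]:
    """Check a structured model output against the same constraints."""
    errors = []
    if output.get("category") not in constraints["allowed_categories"]:
        errors.append("category not allowed")
    priority = output.get("priority")
    is_int = isinstance(priority, int) and not isinstance(priority, bool)
    if not is_int or not 1 <= priority <= constraints["max_priority"]:
        errors.append("priority out of range")
    return errors
```

An empty error list means the output can be applied to state. A non-empty list gives the agent a deterministic reason to retry, fall back, or escalate.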

Token Optimization

Explicit retrieval and constrained reasoning lead to higher reliability, lower token usage, and clearer failure modes. LLMs are used when they add value, not as a default step for every operation.

Multi-Agent Coordination & Dependencies

Real-world systems rarely involve a single agent.

Explicit Dependencies & Ordering

We design multi-agent systems with explicit dependencies, ordering guarantees, backpressure handling, and failure isolation. Coordination is deterministic, not emergent.

Common Patterns

Common patterns include staged pipelines, conditional branching, dependency-aware scheduling, and background rebuilds and reprocessing. These patterns borrow from distributed DAG execution and event-driven coordination.
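
Dependency-aware scheduling can be sketched with Python's standard-library `graphlib`. The agent names and the dependency graph below are hypothetical. A production scheduler would also handle failures and run ready agents concurrently.

```python
from graphlib import TopologicalSorter

# Hypothetical agents; each maps to the set of agents it depends on.
DEPENDENCIES = {
    "extract": set(),
    "classify": {"extract"},
    "enrich": {"extract"},
    "decide": {"classify", "enrich"},
}


def run_pipeline(dependencies, run_agent):
    """Run each agent only after all of its dependencies have completed."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()   # raises graphlib.CycleError if dependencies are cyclic
    completed = []
    while sorter.is_active():
        # get_ready() returns the agents whose dependencies are all done.
        # Sorting the result gives a reproducible order within each stage.
        for name in sorted(sorter.get_ready()):
            run_agent(name)
            completed.append(name)
            sorter.done(name)
    return completed
```

With the graph above, `classify` and `enrich` become ready together once `extract` is done, and `decide` runs last. A cyclic graph is rejected before any agent runs.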

Coordinated Intelligence

The outcome is coordinated intelligence, not uncontrolled autonomy. Agents work together predictably, with clear boundaries and failure isolation.

Operating Agentic Systems in Production

Agentic systems are operational systems.

Observability & Debugging

We design for observability of agent behavior, tracing of decisions and state changes, and debugging and introspection. Every action is traceable to its cause.

Controlled Evolution

We support controlled rollouts and upgrades, rollback and replay strategies, and governance and access control. Agents can be versioned, deployed blue/green, shadow-executed, and migrated gradually.
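
Shadow execution can be illustrated as follows: a candidate version sees the same signals as the live version, but only the live decisions are emitted. The two agent functions and their thresholds are invented for the example.

```python
def shadow_run(signals, live_agent, candidate_agent):
    """Emit only the live agent's decisions; record where the candidate differs."""
    emitted, divergences = [], []
    for signal in signals:
        live = live_agent(signal)
        try:
            shadow = candidate_agent(signal)
        except Exception as exc:
            # A failing candidate must never affect the live output.
            shadow = f"error: {exc}"
        emitted.append(live)
        if shadow != live:
            divergences.append({"signal": signal, "live": live, "shadow": shadow})
    return emitted, divergences


def agent_v1(amount):
    # Hypothetical live rule.
    return "review" if amount > 500 else "approve"


def agent_v2(amount):
    # Hypothetical candidate rule with a stricter threshold.
    return "review" if amount > 300 else "approve"
```

Running `shadow_run([100, 400, 900], agent_v1, agent_v2)` emits the live decisions unchanged and reports a single divergence at `400`, which can be reviewed before the candidate is promoted.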

Long-Term Operation

Versioning, replay, and gradual migration allow systems to evolve safely over long lifecycles. Agents are designed to run for years, not minutes, and to be improved continuously without disruption.

When Agentic Systems Make Sense (and When They Don’t)

Agentic architectures fit some problems and not others.

When Agents Make Sense

Agentic systems are effective when processing involves multiple steps, decisions depend on accumulated context, workflows branch dynamically, reasoning must be combined with deterministic validation, and behavior must be replayable and auditable.

Trigger Modes

Agents can be triggered by continuous streams, scheduled batch executions, bulk uploads, or historical reprocessing. The deciding factor is stateful decision logic, not ingestion mode.

When to Avoid Agents

Agentic systems are not the right choice when logic is purely stateless, processing is a single deterministic transformation, or no coordination or branching is required. We deliberately avoid agents where simpler architectures are sufficient.

Technologies & Frameworks

Production-grade, pluggable by design.

Apache Flink

Stateful operators with exactly-once state consistency for agent execution.

Akka

Event-sourced agents and supervision for hierarchical systems.

PostgreSQL

Structured data storage for agent state and operational data.

MongoDB

Document storage for flexible agent context and configuration.

Apache Iceberg

Historical context and analytical lookups for replayable decision inputs.

Apache Paimon

Streaming table storage for agent state and context.

Pinecone

Vector database for semantic search and long-term memory.

Milvus

Open-source vector database for embedding storage and retrieval.

Weaviate

Vector database with native AI integration for agent memory.

Qdrant

High-performance vector database for fast semantic retrieval.

Neo4j

Graph database for relationship modeling and knowledge graphs.

JanusGraph

Distributed graph database for large-scale relationship tracking.

How This Expertise Is Applied

This expertise is applied to:

decision intelligence systems
multi-step processing pipelines
intelligent automation
AI-assisted operational platforms
coordinated agent-based systems

Frequently Asked Questions

How are agentic systems different from LLM orchestration?

Agentic systems are long-running, stateful processes with durable state and deterministic recovery. LLM orchestration libraries typically focus on chaining prompts without addressing fault tolerance, replay, or multi-step coordination.

Can agentic systems work with batch processing?

Yes. Agents are signal-driven, not stream-only. They can be triggered by scheduled batches, bulk uploads, historical reprocessing, or continuous streams. The architecture is ingestion-mode agnostic.

How do you handle agent failures in production?

Agents checkpoint state explicitly. Failures trigger deterministic recovery from the last consistent checkpoint. Historical inputs can be replayed. State transitions are auditable.

What role do LLMs play in agentic systems?

LLMs are reasoning components, not controllers. Agents decide when to invoke the LLM, retrieve constrained context, validate outputs, and update state. This ensures reproducibility and cost control.

How do you coordinate multiple agents?

Multi-agent systems use explicit dependencies, ordering guarantees, and backpressure handling. Patterns include staged pipelines, conditional branching, and dependency-aware scheduling — borrowed from distributed DAG execution.

When should organizations avoid agentic systems?

When logic is purely stateless, processing is a single deterministic transformation, or no coordination is required. We deliberately avoid agents where simpler architectures are sufficient.

Building agentic systems that must run reliably in production? Let’s talk about your stateful AI architecture.
