Streaming & Event-Driven Systems

Architecting stateful, real-time systems that remain correct, evolvable, and operable over time.

Event-driven systems are easy to prototype — and notoriously hard to operate correctly at scale. As organizations move from batch-oriented processing to continuous, stateful streaming, architectural decisions around state, time, deployment, and evolution become critical.

Many systems fail not because of throughput limits, but because they were not designed to evolve safely under live traffic.

Acosom works with software architects and platform engineers to design and operate streaming and event-driven systems that remain correct under continuous load, evolvable over years, and operable by real teams.

This expertise is about systems, not individual pipelines.

What Architects Gain

When streaming systems are designed for state, failure, and change.

Event-First Architecture

Events represent business facts, not integration artifacts. Event schemas act as long-lived contracts, enabling decoupled evolution, replayability, and clear ownership boundaries.

Stateful Processing as a Core Primitive

State is modeled explicitly, supports deterministic rebuilds, and is decoupled from deployment mechanics. Essential when systems evolve under live traffic.

Controlled State Growth & Cost

State growth is managed through offloading, externalization, and table-based patterns. Predictable cost, faster rebuilds, and explicit lifecycle control.

Change Data Capture as Integration Boundary

CDC is applied deliberately as a bridge from legacy systems, with explicit semantics and controlled schema evolution. Observable, restartable, resilient.

Safe Evolution Under Live Traffic

Coordinated rebuilds, dependency-aware deployments, and blue-green patterns for stateful systems. Systems that can evolve without compromising correctness.

Schema Governance & Observability

Contracts enforced centrally, compatibility rules, and observability around state size, correctness, and recovery behavior. Long-lived platform assets.

Event-Driven Foundations & Domain Modeling

Streaming architectures start with how events are modeled, not with tools.

Events are explicit, durable, and meaningful across system boundaries. Domains, ownership, and bounded contexts are modeled deliberately. Event sourcing is applied where auditability and deterministic rebuilds matter — not as a default for every system.

Event schemas are treated as long-lived contracts. Compatibility, versioning, and ownership are architectural concerns, not afterthoughts.
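
As a minimal sketch of that principle (the event name, fields, and version constant are illustrative, not drawn from any real system), a domain event is named after a business fact and carries an explicit, versioned contract rather than mirroring a table row:

    // A domain event as a long-lived contract: named after a business fact,
    // not after the table it happens to be stored in. All names illustrative.
    public record OrderShipped(
            String eventId,            // stable identity; enables deduplication on replay
            String orderId,            // the aggregate this fact belongs to
            String carrier,
            java.time.Instant shippedAt,
            int schemaVersion          // evolves additively; compatibility enforced centrally
    ) {
        public static final int CURRENT_SCHEMA_VERSION = 2;
    }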

Stateful Stream Processing & Execution Models

Stateful computation is the defining characteristic of real streaming systems.

Stateful Stream Processing as a Core Primitive

State is modeled explicitly and treated as part of the system design — not hidden inside jobs. Time, ordering, and correctness are designed for event time, late data, and determinism under real-world conditions.
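
A minimal Flink sketch of what designing for event time looks like (the Reading type, its fields, and the window sizes are assumptions for illustration): watermarks bound out-of-orderness, and late data gets an explicit, bounded correction window instead of being silently dropped.

    import java.time.Duration;
    import org.apache.flink.api.common.eventtime.WatermarkStrategy;
    import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
    import org.apache.flink.streaming.api.windowing.time.Time;

    // Timestamps come from the events themselves, not from arrival time.
    WatermarkStrategy<Reading> watermarks = WatermarkStrategy
            .<Reading>forBoundedOutOfOrderness(Duration.ofSeconds(30))
            .withTimestampAssigner((reading, recordTs) -> reading.eventTimeMillis());

    env.fromSource(readingSource, watermarks, "readings")
       .keyBy(Reading::deviceId)
       .window(TumblingEventTimeWindows.of(Time.minutes(5)))
       .allowedLateness(Time.minutes(2))    // bounded correction window for late events
       .reduce((a, b) -> a.merge(b));       // associative merge keeps replays deterministic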

Exactly-Once Processing & Deduplication

Exactly-once semantics are treated as an architectural property, supported by design decisions around identity, transactions, and replay — not just configuration flags.
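
In Kafka Streams terms, turning on the transactional guarantee is one line of configuration; the sketch below (application id and bootstrap address are placeholders) also notes the part configuration cannot provide.

    import java.util.Properties;
    import org.apache.kafka.streams.StreamsConfig;

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-projection");   // placeholder
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");      // placeholder
    // Transactional reads, state updates, and writes on the Kafka-to-Kafka path.
    props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
    // What the flag does not cover: side effects against external systems still
    // need idempotency keyed on a stable event identity designed into the events.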

Cluster-Based Stream Processing

We design shared, platform-level stream processing systems using engines such as Apache Flink, focusing on state management, recovery semantics, and long-lived operation.
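
As one concrete shape this takes (checkpoint interval and storage path are illustrative), a Flink job with large state typically pairs the RocksDB state backend with incremental, exactly-once checkpoints on object storage:

    import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStateBackend(new EmbeddedRocksDBStateBackend(true));   // incremental checkpoints
    env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);
    env.getCheckpointConfig().setCheckpointStorage("s3://streaming-checkpoints/orders");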

Managing State Growth & Cost at Scale

One of the most common failure modes in mature streaming systems is unbounded state growth.

The Problem of Growing State

Increasing cardinality, longer retention, and evolving business logic often lead to silent state explosions that drive cost and instability. For long-running systems, state must be treated as a first-class architectural concern.
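
One concrete mitigation, sketched here with Flink's state TTL (the profile type and the 30-day lifetime are illustrative assumptions), is to give every piece of keyed state an explicit lifecycle instead of letting it accumulate by default:

    import org.apache.flink.api.common.state.StateTtlConfig;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.common.time.Time;

    StateTtlConfig ttl = StateTtlConfig
            .newBuilder(Time.days(30))                             // illustrative lifetime
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
            .cleanupInRocksdbCompactFilter(1000)   // amortized cleanup during compaction
            .build();

    ValueStateDescriptor<CustomerProfile> profileState =
            new ValueStateDescriptor<>("customer-profile", CustomerProfile.class);
    profileState.enableTimeToLive(ttl);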

Offloading and Externalizing State

We design systems where large or historical state is offloaded from the streaming engine, rather than accumulating indefinitely. State is separated into “hot” operational and “cold” historical layers.

Table-Based Streaming & Object Storage

State is externalized into table-based storage on S3-compatible object stores, making it inspectable, queryable, and lifecycle-managed. Disaggregated state architectures decouple compute from state using Apache Paimon, Apache Fluss, and recent versions of Apache Flink.
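
A minimal sketch of that pattern with Flink SQL and Paimon (bucket name and table layout are illustrative): state lives in a table on object storage, outside the engine, where it can be inspected and queried directly.

    import org.apache.flink.table.api.EnvironmentSettings;
    import org.apache.flink.table.api.TableEnvironment;

    TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
    // A Paimon catalog backed by S3-compatible object storage.
    tEnv.executeSql(
        "CREATE CATALOG lake WITH ("
      + "  'type' = 'paimon',"
      + "  'warehouse' = 's3://analytics-lake/warehouse'"      // illustrative bucket
      + ")");
    tEnv.executeSql("USE CATALOG lake");
    tEnv.executeSql(
        "CREATE TABLE IF NOT EXISTS order_state ("
      + "  order_id   STRING,"
      + "  status     STRING,"
      + "  updated_at TIMESTAMP(3),"
      + "  PRIMARY KEY (order_id) NOT ENFORCED"
      + ")");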

Change Data Capture & Data Ingestion Patterns

CDC is often the pragmatic entry point into event-driven architectures.

CDC as an Architectural Boundary

Database changes are captured intentionally and modeled as events with clear semantics — not raw table diffs. We avoid “changelog spaghetti” by mapping low-level changes to meaningful domain events.
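
A sketch of that mapping in Kafka Streams (the topic name follows Debezium's server.schema.table convention; the envelope parser and the CustomerMoved event are hypothetical): low-level change records are translated into a published domain contract right at the boundary.

    import org.apache.kafka.streams.StreamsBuilder;

    StreamsBuilder builder = new StreamsBuilder();
    builder.<String, String>stream("pg.public.customer_address") // Debezium-style topic name
        .mapValues(DebeziumEnvelope::parse)                      // hypothetical JSON helper
        .filter((key, change) -> "u".equals(change.op()))        // react to updates only
        .mapValues(change -> new CustomerMoved(                  // hypothetical domain event
                change.after("customer_id"),
                change.after("city")))
        .to("customer-moved-events");                            // the published contract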

Operational Characteristics of CDC

CDC pipelines are designed to be observable, restartable, and resilient under continuous load. Schema evolution is managed deliberately, with compatibility and replay in mind. We commonly work with Kafka Connect, Debezium, and Flink-based CDC.

Projections, Read Models & Real-Time Data Access

Event streams are rarely consumed directly.

Materialized Projections from Streams

We design projections derived exclusively from streams, reproducible from history and isolated per use case. Read models are tailored to individual services, avoiding shared databases and tight coupling.
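
In Kafka Streams form, such a projection is an aggregation into a named, materialized store (event types and the store name are illustrative; serde configuration is omitted for brevity); because it is derived only from the stream, it can be rebuilt from history at any time.

    import org.apache.kafka.common.utils.Bytes;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KTable;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.state.KeyValueStore;

    StreamsBuilder builder = new StreamsBuilder();
    KTable<String, OrderTotals> totalsByCustomer = builder
        .<String, OrderPlaced>stream("order-placed-events")
        .groupBy((orderId, order) -> order.customerId())         // re-key by customer
        .aggregate(
            OrderTotals::empty,
            (customerId, order, totals) -> totals.add(order.amount()),
            Materialized.<String, OrderTotals, KeyValueStore<Bytes, byte[]>>as(
                "order-totals-store"));                          // named, queryable store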

Real-Time Analytical Data Stores

For low-latency queries, we use real-time analytical and time-series databases such as ClickHouse and QuestDB. Streaming-derived data is exposed through analytics and visualization layers like Apache Superset.

Service-Centric Streaming with Kafka Streams

Not all streaming workloads belong on a shared processing cluster.

Kafka Streams vs. Cluster-Based Processing

We deliberately choose between embedded and platform-level stream processing based on ownership, state size, and deployment needs. This work is built on deep experience with Kafka Streams in production.

Topology Design & Performance Optimization

Kafka Streams topologies are optimized by minimizing repartitioning, using GlobalKTables where appropriate, and managing joins and state stores carefully. Topology design is reviewed as part of architecture, not left to chance.
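
As a sketch of one such optimization (topic names and types are illustrative), enriching a stream against reference data through a GlobalKTable avoids the repartition step a co-partitioned KTable join would require, because the table is replicated in full to every instance:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.GlobalKTable;

    StreamsBuilder builder = new StreamsBuilder();
    GlobalKTable<String, Product> products = builder.globalTable("product-catalog");

    builder.<String, OrderLine>stream("order-lines")
        .join(products,
              (orderId, line) -> line.productId(),               // lookup key into the table
              (line, product) -> line.withProductName(product.name()))
        .to("enriched-order-lines");                             // no repartition topic created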

Internal Topics & State Store Management

Internal topics and state stores are explicitly accounted for, monitored, and lifecycle-managed. We design systems with correct transactional boundaries, deduplication strategies, and controlled reprocessing behavior.
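
One small but high-leverage practice, sketched below with illustrative names: give every stateful step an explicit name, so internal repartition and changelog topic names stay stable when the surrounding topology changes.

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.Consumed;
    import org.apache.kafka.streams.kstream.Grouped;
    import org.apache.kafka.streams.kstream.Materialized;
    import org.apache.kafka.streams.kstream.Named;

    StreamsBuilder builder = new StreamsBuilder();
    builder.<String, Payment>stream("payments", Consumed.as("payments-source"))
        .groupByKey(Grouped.as("payments-by-id"))     // names any repartition topic created here
        .count(Materialized.as("payment-counts"))     // names the store and its changelog topic
        .toStream(Named.as("payment-counts-to-stream"))
        .to("payment-counts");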

Operating & Evolving Streaming Platforms

Correct architecture is meaningless without operability.

Coordinated Rebuilds & Dependency Graphs

We design rebuild mechanisms where upstream state is reconstructed first and downstream applications follow dependency-aware orderings. Rebuilds happen without interrupting live traffic.

Blue-Green Deployments for Stateful Systems

Stateful deployments are upgraded safely by ensuring both old and new versions reach consistent state before traffic is switched. This enables evolution under continuous load.

Schema Governance & Compatibility Rules

Contracts are enforced centrally to prevent accidental breakage across teams. Observability covers state size, correctness, and recovery behavior, not just consumer lag.

Technologies

Technologies support architecture — they do not define it.

Apache Flink

Stateful stream processing at scale. Cluster-based execution for complex streaming systems with large state, long-lived operation, and recovery semantics.

Kafka Streams

Embedded, service-centric streaming. Stream processing co-located with application logic for service-owned state and deployment.

Apache Kafka

Distributed event log. Foundation for event-driven architectures, providing durability, ordering, and replayability.

Kafka Connect

Connector framework for streaming data integration. Observable, restartable, and resilient pipelines.

Debezium

CDC platform capturing database changes as events. Intentional modeling of database changes with clear semantics.

Flink CDC

Flink-based change data capture. Direct integration of database changes into streaming pipelines.

Apache Paimon

Table format for streaming. State externalized to object storage, making it inspectable, queryable, and lifecycle-managed.

Apache Fluss

Streaming storage for disaggregated state. Decouple compute from state for predictable cost and faster rebuilds.

ClickHouse

Real-time analytical database. Low-latency queries on streaming-derived data with columnar storage and aggregation.

QuestDB

Time-series database for streaming data. Fast ingestion and queries for time-series workloads.

Apache Superset

Operational analytics and visualization. Expose streaming-derived data to humans through dashboards and exploration.

Who This Expertise Is For

This page is written for software architects, platform engineers, and senior engineers responsible for distributed systems.

If you are accountable for systems that must keep working while they evolve, this is where we typically engage.

Frequently Asked Questions

How do you handle state growth in long-running streaming systems?

State growth is one of the most common failure modes in mature streaming platforms.

Our approach:

  • Treat state as a first-class architectural concern from the start
  • Design for state offloading and externalization early
  • Use table-based storage on S3-compatible object stores
  • Separate “hot” operational state from “cold” historical state
  • Apply disaggregated state patterns with modern streaming engines

This enables predictable cost, faster rebuilds, and long-lived platforms that don’t collapse under their own state.

When should we use Kafka Streams vs. a cluster-based processor like Flink?

The choice depends on ownership, state size, deployment model, and operational maturity.

Kafka Streams makes sense when:

  • Stream processing is embedded directly in services
  • State size is manageable within service instances
  • Teams prefer service-centric deployment models
  • Ownership aligns with service boundaries

Cluster-based processing makes sense when:

  • State is large and needs disaggregation
  • Multiple teams share processing infrastructure
  • Complex windowing and late-data handling are required
  • Platform-level observability and governance are needed

We have deep experience with both and choose deliberately based on constraints.

How do you ensure exactly-once semantics in production?

Exactly-once is an architectural property, not a configuration flag.

Our approach:

  • Design around event identity and idempotency keys
  • Use correct transactional boundaries in Kafka Streams
  • Apply end-to-end exactly-once processing where required
  • Design deduplication based on business semantics
  • Control replay and reprocessing strategies explicitly

This ensures correctness under failure, not just under happy-path conditions.

How do you handle schema evolution in event-driven systems?

Event schemas are long-lived contracts that must evolve safely.

Our approach:

  • Treat schemas as architectural artifacts with explicit ownership
  • Enforce compatibility rules centrally (forward, backward, full)
  • Design for additive changes and avoid breaking modifications
  • Version schemas explicitly and document evolution
  • Test schema changes before deployment

This prevents accidental breakage across teams and enables safe evolution over years.

Can you help with existing streaming systems that have grown problematic?

Yes. Many of our engagements involve improving existing streaming platforms.

Common improvement areas:

  • Addressing unbounded state growth and cost explosions
  • Adding coordinated rebuild mechanisms
  • Improving observability beyond lag metrics
  • Refactoring fragile topologies and dependencies
  • Implementing schema governance retroactively
  • Enabling safe deployments for stateful systems

We assess current architecture, identify failure modes, and evolve systems incrementally without rewriting from scratch.

Do you work with specific streaming technologies only?

We’re technology-agnostic and choose based on constraints.

We commonly work with:

  • Apache Flink and Kafka Streams for stream processing
  • Apache Kafka as the event backbone
  • CDC tools like Debezium, Kafka Connect, and Flink CDC
  • Table formats like Apache Paimon and Apache Fluss
  • Real-time stores like ClickHouse and QuestDB

Technology choices follow operating model, lifecycle, and constraints — not trends or vendor preference.

Building streaming systems that must keep working while they evolve? Let’s talk about your architectural challenges.

Discuss Your Streaming Architecture