CQRS in Go series:
- Part 1: the aggregate, Transition() and Clone()
- Part 2: command handlers without side effects
- Part 3: sagas and event choreography
- Part 4: PostgreSQL as an event store
The problem — two aggregates need to cooperate
Concrete scenario: a payment is validated. The Payment aggregate emits a PaymentConfirmed event. In response, a shipment needs to be created in the Shipping aggregate.
The problem: the two aggregates don't know each other, and that's intentional. An aggregate that directly calls another aggregate is a fundamental violation of the pattern. The aggregate is an isolated unit of consistency. If Payment knows about Shipping, you're recreating tight coupling — exactly what you were trying to avoid by leaving the classic transactional model.
The question is: how do you make Shipping react to something that happened in Payment, without one knowing the other? There are two families of answers: orchestration and choreography.
Orchestration vs choreography
Orchestration sets up a central conductor — a SagaManager — that knows all the steps, all transitions, and all possible rollbacks. When PaymentConfirmed arrives, the SagaManager explicitly calls the Shipping service, then the Email service, then the Stock service. The flow is readable in one place. In case of error, the manager knows exactly what to compensate.
The downside: the SagaManager knows everyone. Adding a new step to the flow forces you to modify this central component. It becomes a tight coupling point between domains that should ignore each other.
Choreography works differently. Each aggregate emits events. Independent handlers listen to these events and trigger commands on other aggregates. Nobody centralizes knowledge of the flow — it emerges from the sum of reactions.
Choreography wins in most cases: it scales better, adding new behavior on an existing event doesn't touch existing code, and each handler can evolve, be redeployed, or even fail without blocking the others. The price to pay: the flow is not readable in one place. We'll come back to that in the section on tracing the flow.
The event handler — the link between aggregates
The central tool of choreography is the event handler. It listens to events from one aggregate and triggers commands on another. Its responsibility is narrow: translate an event into a command. Nothing more.
// PaymentEventHandler reacts to Payment events to trigger Shipping.
type PaymentEventHandler struct {
	shippingCmd ShippingCommandService
}

func (h PaymentEventHandler) Handle(ctx context.Context, event DomainEvent) error {
	switch e := event.(type) {
	case PaymentConfirmed:
		return h.shippingCmd.CreateShipment(ctx, CreateShipmentCmd{
			OrderID:    e.OrderID,
			CustomerID: e.CustomerID,
			Items:      e.Items,
			Address:    e.ShippingAddress,
		})
	}
	return nil
}
The event handler contains no business logic. It doesn't decide whether the shipment should be created — that decision was already made by the fact that PaymentConfirmed was emitted. It doesn't validate data — that's the Shipping command handler's job. It translates, nothing more.
If the logic in an event handler starts to grow — conditions, branches, multiple calls — it's a signal that an aggregate is missing somewhere, or that a responsibility hasn't been properly carved out.
Saga idempotency
An event can be redelivered. This happens whenever a handler successfully processes an event but crashes before acknowledging the message. The broker (Kafka, RabbitMQ, or even a simple queue table in PostgreSQL) doesn't know the processing succeeded — it redelivers. If the saga isn't idempotent, you create two shipments for a single order.
The simplest solution: check before acting. If a shipment already exists for this OrderID, treat it as a duplicate and ignore it.
func (h PaymentEventHandler) Handle(ctx context.Context, event DomainEvent) error {
	switch e := event.(type) {
	case PaymentConfirmed:
		err := h.shippingCmd.CreateShipment(ctx, CreateShipmentCmd{
			OrderID:    e.OrderID,
			CustomerID: e.CustomerID,
			Items:      e.Items,
			Address:    e.ShippingAddress,
		})
		// If the shipment already exists, it's a duplicate — ignore it.
		if errors.Is(err, ErrShipmentAlreadyExists) {
			return nil
		}
		return err
	}
	return nil
}
This ErrShipmentAlreadyExists error must be a sentinel error defined in the Shipping package, and the CreateShipment command handler must check uniqueness on OrderID before inserting — ideally with a unique constraint in the database to make the check atomic.
The other approach is an idempotency key: pass the event's ID as an idempotency key to the command handler, which stores it and silently rejects any duplicate with the same key. This is more generic but requires a dedicated tracking table. Both approaches are valid; the choice often depends on whether the command handler is already instrumented for idempotency keys (see Part 2 of the series).
Partial failures
An event can trigger multiple independent reactions: create the shipment, send the confirmation email, decrement stock. If all these reactions are in a single handler and the email fails, the shipment is created but the stock is never updated. The handler failed midway and the final state is inconsistent.
// Bad: one big handler doing everything sequentially.
func (h BigHandler) Handle(ctx context.Context, event DomainEvent) error {
	switch e := event.(type) {
	case PaymentConfirmed:
		if err := h.shipping.Create(ctx, toShipmentCmd(e)); err != nil {
			return err // email and stock will never be processed
		}
		if err := h.email.Send(ctx, toConfirmationEmail(e)); err != nil {
			return err // stock will never be processed
		}
		if err := h.stock.Update(ctx, toStockCmd(e)); err != nil {
			return err
		}
	}
	return nil
}
The solution is to never put multiple responsibilities in the same handler. One handler, one action.
// One handler per responsibility. Each can fail independently.
type ShippingOnPayment struct {
	shippingCmd ShippingCommandService
}

func (h ShippingOnPayment) Handle(ctx context.Context, event DomainEvent) error {
	if e, ok := event.(PaymentConfirmed); ok {
		return h.shippingCmd.CreateShipment(ctx, CreateShipmentCmd{
			OrderID: e.OrderID,
			Items:   e.Items,
			Address: e.ShippingAddress,
		})
	}
	return nil
}

type EmailOnPayment struct {
	mailer EmailService
}

func (h EmailOnPayment) Handle(ctx context.Context, event DomainEvent) error {
	if e, ok := event.(PaymentConfirmed); ok {
		return h.mailer.SendConfirmation(ctx, e.CustomerEmail, e.OrderID)
	}
	return nil
}

type StockOnPayment struct {
	stockCmd StockCommandService
}

func (h StockOnPayment) Handle(ctx context.Context, event DomainEvent) error {
	if e, ok := event.(PaymentConfirmed); ok {
		return h.stockCmd.DecrementStock(ctx, DecrementStockCmd{
			Items: e.Items,
		})
	}
	return nil
}
These three handlers are registered separately on the PaymentConfirmed event. If the email fails, it will be retried independently — without blocking the shipment or the stock update. A failure in one branch doesn't poison the others.
Retry with exponential backoff is handled by the infrastructure (broker, worker), not the handler. The handler has one responsibility: try to perform its action and return an error if it fails. The system decides how many times to retry.
Eventual consistency is an accepted consequence. After a crash, the stock will be updated on the next retry — in a few seconds or minutes depending on the retry policy. For most e-commerce use cases, this is acceptable. If it's not acceptable, you need a distributed transaction, and you're leaving the scope of choreography.
Tracing the flow — the event/handler table
In choreography, the flow is not readable in any single file. It's distributed across dozens of handlers. Without documentation, it's impossible to know what gets triggered when PaymentConfirmed is emitted. That's the cost of choreography.
The practical answer is a mapping table, maintained manually or generated from handler registrations. For each event, list the handlers and the action they perform. This is the system's map.
| Event | Handler | Action |
|---|---|---|
| PaymentConfirmed | ShippingOnPayment | Creates the shipment |
| PaymentConfirmed | EmailOnPayment | Order confirmation email |
| PaymentConfirmed | StockOnPayment | Decrements stock |
| ShipmentCreated | EmailOnShipment | "Your package has shipped" email |
| ShipmentCreated | TrackingOnShipment | Creates the delivery tracking |
This table should live in the project documentation, not in a comment buried in some file. It answers the question "what happens when this event is emitted?" — a question you ask frequently in production when something goes wrong.
In Go, you can also centralize handler registrations in a wire.go or handlers.go file per domain, which makes the mapping readable in code:
func registerPaymentHandlers(bus EventBus, deps Dependencies) {
	bus.Register(PaymentConfirmed{}, ShippingOnPayment{shippingCmd: deps.ShippingCmd})
	bus.Register(PaymentConfirmed{}, EmailOnPayment{mailer: deps.Mailer})
	bus.Register(PaymentConfirmed{}, StockOnPayment{stockCmd: deps.StockCmd})
}
All handlers tied to PaymentConfirmed are visible in one function. No need to grep the entire codebase to find what reacts to this event.
When choreography isn't enough
Choreography has one limit: flows with compensation, meaning undoing an already-completed step because a subsequent step failed. The classic example: booking a hotel, a flight, and a rental car for a trip. If the flight is unavailable, you need to cancel the hotel reservation that already succeeded.
In pure choreography, this scenario forces you to emit a FlightBookingFailed event and have a handler that cancels the hotel. This works, but the handler needs to know whether the hotel was booked for this trip — which implies shared state. You're rebuilding an orchestrator, but implicitly and scattered across the codebase.
In this case, a process manager is justified. A process manager is an event handler with its own state. It listens to events from different steps, maintains progress state, and decides on compensations if needed.
type BookingProcessManager struct {
	bookingID uuid.UUID
	hotelOK   bool
	flightOK  bool
	carOK     bool

	hotelCmd  HotelCommandService
	flightCmd FlightCommandService
	carCmd    CarCommandService
}

func (pm *BookingProcessManager) Handle(ctx context.Context, event DomainEvent) error {
	switch e := event.(type) {
	case HotelReserved:
		pm.hotelOK = true
		// Trigger the flight booking
		return pm.flightCmd.BookFlight(ctx, BookFlightCmd{
			BookingID: pm.bookingID,
			FlightRef: e.FlightRef,
		})
	case FlightBooked:
		pm.flightOK = true
		// Trigger the car reservation
		return pm.carCmd.ReserveCar(ctx, ReserveCarCmd{
			BookingID: pm.bookingID,
		})
	case FlightBookingFailed:
		// Compensation: cancel the hotel if already booked
		if pm.hotelOK {
			return pm.hotelCmd.CancelReservation(ctx, CancelHotelCmd{
				BookingID: pm.bookingID,
			})
		}
	}
	return nil
}
The difference from a global SagaManager: this process manager only knows the travel booking flow. It's not a centralization point for all system flows. One process manager per saga, not one manager for everything.
Its state must be persisted — in a database or the event store — to survive restarts. This is the only place in a choreographed architecture where you explicitly accept having a component with its own state and centralized flow logic. The rest of the system continues to operate through choreography.
Summary
- An aggregate cannot call another aggregate directly — choreography through events is the standard solution.
- The event handler is a translator: it transforms an event into a command on another aggregate. No business logic inside.
- Sagas must be idempotent: an event can be redelivered, the handler must detect and ignore duplicates.
- One handler per responsibility: never put multiple independent actions in a single handler. A partial failure must not block other branches.
- Eventual consistency is an accepted consequence: a branch that fails will be retried independently, without blocking the others.
- The event/handler table is the essential documentation for a choreographed system. Without it, the flow is unreadable.
- For flows with compensation, a process manager with its own state is justified — but one per saga, not a global manager.
Part 4 covers the PostgreSQL event store: how to persist events, manage aggregate versioning, and implement replay from the database.