Binance announces maintenance at 2am. Your service keeps hammering it at 10 req/sec for 30 minutes. Result: 18,000 lost requests, goroutines piling up as each call waits for its timeout, and memory climbing steadily until the scheduler starts to struggle.
A circuit breaker takes 50 lines to write. Binance maintenance is inevitable. OKEx returns random 503s. Coinbase rate limits without warning. Crypto exchanges have SLAs that would make any ops engineer used to AWS flinch. If your service consumes several of them, resilience is not optional.
The previous article covered the rate limiter: how to avoid exceeding an API's quota. This pattern answers a different question: what do you do when the API is down outright, not just slow?
The three states of a circuit breaker
A circuit breaker is a three-state machine. The analogy with an electrical breaker is accurate: when too many errors occur, the circuit opens and cuts the flow.
Closed — normal state. Requests pass through. Failures are counted. As long as the threshold isn't reached, everything runs normally.
Open — tripped. Requests fail immediately, without calling the exchange. No network connection is established. The upstream service doesn't know the exchange is down — it just receives a fast error. The state stays open for a configurable timeout.
Half-open — recovery attempt. After the timeout, a single "probe" request is allowed. If it succeeds: back to Closed, counters reset. If it fails: back to Open, timeout restarts.
```
CLOSED ──(5 failures in 10s)──► OPEN
   ▲                              │
   │                        (30s timeout)
   │                              ▼
   └────(probe success)──── HALF-OPEN
```
The Half-open state is what distinguishes a circuit breaker from a simple "disable". Recovery is automatic — the service resumes as soon as the exchange comes back, without manual intervention or restarts.
Implementation in Go
The implementation fits in under 60 lines. No external dependency,
thread-safe with a simple sync.Mutex.
```go
package circuit

import (
	"errors"
	"sync"
	"time"
)

var ErrCircuitOpen = errors.New("circuit breaker open")

type State int

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

type CircuitBreaker struct {
	mu          sync.Mutex
	state       State
	failures    int
	lastFailure time.Time

	threshold int           // failures before opening
	timeout   time.Duration // how long to stay open
}

func New(threshold int, timeout time.Duration) *CircuitBreaker {
	return &CircuitBreaker{
		threshold: threshold,
		timeout:   timeout,
	}
}

func (cb *CircuitBreaker) Call(fn func() error) error {
	cb.mu.Lock()
	switch cb.state {
	case StateOpen:
		if time.Since(cb.lastFailure) > cb.timeout {
			cb.state = StateHalfOpen
		} else {
			cb.mu.Unlock()
			return ErrCircuitOpen
		}
	}
	cb.mu.Unlock()

	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		cb.lastFailure = time.Now()
		if cb.failures >= cb.threshold || cb.state == StateHalfOpen {
			cb.state = StateOpen
		}
		return err
	}

	// success: reset
	cb.failures = 0
	cb.state = StateClosed
	return nil
}
```
A few important details. The mutex is released before calling fn() —
if held, all concurrent goroutines would block for the duration of the network request.
That would be worse than the problem we're trying to solve.
The Open → HalfOpen transition happens when the next request arrives after the timeout, not via a background goroutine. Simple, no timer, no goroutine that leaks if the circuit breaker is abandoned.
In HalfOpen, any error immediately pushes back to Open. This isn't the time to be lenient — if the probe fails, the exchange hasn't recovered yet. One caveat: this minimal version doesn't enforce a single in-flight probe. Several goroutines arriving at the same instant can each slip through before the first result lands; gobreaker's MaxRequests setting closes that gap if it matters for your workload.
Retry with exponential backoff
The circuit breaker handles sustained outages. Retry handles transient errors: a dropped packet, a severed TCP connection, a fleeting 429. The two patterns are complementary, not redundant.
```go
func withRetry(ctx context.Context, maxAttempts int, fn func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		err = fn()
		if err == nil {
			return nil
		}
		// Open circuit = structural outage, not a transient error.
		// Retrying immediately is pointless.
		if errors.Is(err, ErrCircuitOpen) {
			return err
		}
		wait := time.Duration(math.Pow(2, float64(i))) * 100 * time.Millisecond
		select {
		case <-time.After(wait):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("after %d attempts: %w", maxAttempts, err)
}
```
The key point is the errors.Is(err, ErrCircuitOpen) check.
When the circuit is open, retrying is pointless: the next attempt will return
the same error within a microsecond. Exponential backoff only applies
to genuine network errors.
The select on ctx.Done() ensures retries stop
if the parent context is cancelled — an HTTP request cancelled by the client,
a server shutdown, a global timeout.
Without this, the goroutine keeps retrying into the void.
Timeout and context: the third line of defense
Without a per-request timeout, a connection to Binance can block indefinitely if the server accepts the TCP connection but never responds. The circuit breaker doesn't trigger — calls don't fail, they just wait. Goroutines accumulate. The symptom is identical to a hard outage, but the protection mechanism never fires.
```go
type ExchangeService struct {
	binance *BinanceClient
	cb      *CircuitBreaker

	cacheMu sync.RWMutex
	cache   map[string]CachedOrderBook // fallback cache, read by getFallbackOrderBook (see below)
}

func (s *ExchangeService) GetOrderBook(ctx context.Context, pair string) (*OrderBook, error) {
	// Per-request timeout, layered on top of the parent context
	callCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	var result *OrderBook
	err := s.cb.Call(func() error {
		var err error
		result, err = s.binance.GetOrderBook(callCtx, pair)
		return err
	})
	if err != nil {
		if errors.Is(err, ErrCircuitOpen) {
			// Circuit open: serve fallback rather than a 500
			return s.getFallbackOrderBook(pair)
		}
		return nil, err
	}
	return result, nil
}
```
The three layers each play a distinct role. The timeout (2s)
ensures a slow call fails fast and increments the circuit breaker's failure counter.
The circuit breaker opens after repeated failures and avoids calling
a clearly down exchange. The fallback serves degraded data when the circuit is open.
callCtx derives from the parent context: if the parent is cancelled
(HTTP request dropped by client), the exchange call stops too.
The 2s timeout is an upper bound, not a guaranteed duration.
Fallback and graceful degradation
When the circuit is open, what do you return? The answer depends on the data type.
Order book, mid price, spread. The last known value with a timestamp is acceptable for a few minutes. A 3-minute-old order book is better than a 503 that crashes the calling service. The cache should expose the data age so consumers can decide for themselves.
```go
type CachedOrderBook struct {
	Data      *OrderBook
	FetchedAt time.Time
}

func (s *ExchangeService) getFallbackOrderBook(pair string) (*OrderBook, error) {
	s.cacheMu.RLock()
	cached, ok := s.cache[pair]
	s.cacheMu.RUnlock()

	if !ok {
		return nil, fmt.Errorf("circuit open and no cached data for %s", pair)
	}

	// Warn if data is too stale
	if time.Since(cached.FetchedAt) > 10*time.Minute {
		slog.Warn("serving stale order book", "pair", pair,
			"age", time.Since(cached.FetchedAt))
	}
	return cached.Data, nil
}
```
Account balances. Stale data is dangerous here. If the service makes trading decisions based on a 10-minute-old balance, it may exceed the real available amount. For this type of data, the right answer is an explicit error — not a silent fallback.
func (s *ExchangeService) GetBalance(ctx context.Context) (*Balance, error) {
callCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
defer cancel()
var result *Balance
err := s.cb.Call(func() error {
var err error
result, err = s.binance.GetBalance(callCtx)
return err
})
if err != nil {
// No fallback for balances — stale data is worse than an error
return nil, fmt.Errorf("balance unavailable: %w", err)
}
return result, nil
}
The distinction "stale data acceptable / stale data dangerous" is a business decision, not a technical one. It needs to be made per domain, per endpoint, and documented explicitly in the code — not left to the judgment of whoever adds a fallback "to make it work".
sony/gobreaker vs rolling your own
The github.com/sony/gobreaker
library is the Go reference for circuit breakers. It's battle-tested,
well-documented, and covers cases the implementation above ignores:
sliding window counting, state-change callbacks,
configurable success/failure conditions.
```go
import "github.com/sony/gobreaker"

cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
	Name:        "binance-orderbook",
	MaxRequests: 1,                // probes allowed in HalfOpen
	Interval:    10 * time.Second, // counting window
	Timeout:     30 * time.Second, // Open state duration
	ReadyToTrip: func(counts gobreaker.Counts) bool {
		return counts.ConsecutiveFailures >= 5
	},
	OnStateChange: func(name string, from, to gobreaker.State) {
		slog.Info("circuit breaker state change",
			"name", name, "from", from, "to", to)
	},
})
```
The hand-rolled implementation covers 90% of needs with no dependency.
gobreaker is worth the import if you need the sliding window
(to avoid a burst of 5 errors in 1 second opening the circuit when the overall
error rate is low) or state-change callbacks to feed Prometheus metrics.
For a service consuming 2-3 exchanges with predictable error patterns,
the hand-rolled version is more readable and easier to adapt.
For a platform managing 20 exchanges with SLA dashboards,
gobreaker is the right call.
Conclusion
The real cost of a Binance maintenance window without a circuit breaker isn't the 18,000 lost requests — it's the silent degradation. Accumulating goroutines don't produce an immediate error. Memory climbs slowly. The service stays "up" as far as health checks are concerned, but it's dying.
Circuit breaker + retry + timeout are three distinct mechanisms protecting against three different classes of problems: sustained outages, transient errors, and stalled connections. All three together form a resilience layer that leaves your service indifferent to exchange incidents.
The fallback decision is the only one requiring genuine thought. Technical code can be copied. Deciding whether a 5-minute-old order book is acceptable in your business context — that's something nobody else can decide for you.