Binance announces maintenance at 2am. Your service keeps hammering it at 10 req/sec for 30 minutes. Result: 18,000 lost requests, goroutines piling up as each call waits for its timeout, and memory climbing steadily until the scheduler starts to struggle.
A circuit breaker takes 50 lines to write. Binance maintenance is inevitable. OKEx returns random 503s. Coinbase rate limits without warning. Crypto exchanges have SLAs that would make any ops engineer used to AWS flinch. If your service consumes several of them, resilience is not optional.
The previous article covered the rate limiter: how to avoid exceeding an API's quota. This pattern answers a different question: what do you do when the API is down outright, not just slow?
The three states of a circuit breaker
A circuit breaker is a three-state machine. The analogy with an electrical breaker is accurate: when too many errors occur, the circuit opens and cuts the flow.
Closed — normal state. Requests pass through. Failures are counted. As long as the threshold isn't reached, everything runs normally.
Open — tripped. Requests fail immediately, without calling the exchange. No network connection is established. The upstream service doesn't know the exchange is down — it just receives a fast error. The state stays open for a configurable timeout.
Half-open — recovery attempt. After the timeout, a single "probe" request is allowed. If it succeeds: back to Closed, counters reset. If it fails: back to Open, timeout restarts.
```
CLOSED ──(5 failures in 10s)──► OPEN
   ▲                              │
   │                        (30s timeout)
   │                              ▼
   └────(probe success)──── HALF-OPEN
```
The Half-open state is what distinguishes a circuit breaker from a simple "disable". Recovery is automatic — the service resumes as soon as the exchange comes back, without manual intervention or restarts.
Implementation in Go
The implementation fits in under 60 lines. No external dependency,
thread-safe with a simple sync.Mutex.
```go
package circuit

import (
	"errors"
	"sync"
	"time"
)

var ErrCircuitOpen = errors.New("circuit breaker open")

type State int

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

type CircuitBreaker struct {
	mu          sync.Mutex
	state       State
	failures    int
	lastFailure time.Time

	threshold int           // failures before opening
	timeout   time.Duration // how long to stay open
}

func New(threshold int, timeout time.Duration) *CircuitBreaker {
	return &CircuitBreaker{
		threshold: threshold,
		timeout:   timeout,
	}
}

func (cb *CircuitBreaker) Call(fn func() error) error {
	cb.mu.Lock()
	switch cb.state {
	case StateOpen:
		if time.Since(cb.lastFailure) > cb.timeout {
			cb.state = StateHalfOpen
		} else {
			cb.mu.Unlock()
			return ErrCircuitOpen
		}
	}
	cb.mu.Unlock()

	err := fn()

	cb.mu.Lock()
	defer cb.mu.Unlock()
	if err != nil {
		cb.failures++
		cb.lastFailure = time.Now()
		if cb.failures >= cb.threshold || cb.state == StateHalfOpen {
			cb.state = StateOpen
		}
		return err
	}

	// success: reset
	cb.failures = 0
	cb.state = StateClosed
	return nil
}
```
A few important details. The mutex is released before calling fn() —
if held, all concurrent goroutines would block for the duration of the network request.
That would be worse than the problem we're trying to solve.
The Open → HalfOpen transition happens when the next request arrives after the timeout, not via a background goroutine. Simple, no timer, no goroutine that leaks if the circuit breaker is abandoned.
In HalfOpen, any error immediately pushes back to Open. This isn't the time to be lenient — if the probe fails, the exchange hasn't recovered yet. One caveat: this minimal version doesn't enforce a single in-flight probe. Several goroutines arriving at the same instant can each slip through before the first result lands; gobreaker's MaxRequests setting closes that gap if it matters for your workload.
Retry with exponential backoff
The circuit breaker handles sustained outages. Retry handles transient errors: a dropped packet, a severed TCP connection, a fleeting 429. The two patterns are complementary, not redundant.
```go
func withRetry(ctx context.Context, maxAttempts int, fn func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		err = fn()
		if err == nil {
			return nil
		}
		// Open circuit = structural outage, not a transient error.
		// Retrying immediately is pointless.
		if errors.Is(err, ErrCircuitOpen) {
			return err
		}
		wait := time.Duration(math.Pow(2, float64(i))) * 100 * time.Millisecond
		select {
		case <-time.After(wait):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("after %d attempts: %w", maxAttempts, err)
}
```
The key point is the errors.Is(err, ErrCircuitOpen) check.
When the circuit is open, retrying is pointless: the next attempt will return
the same error within a microsecond. Exponential backoff only applies
to genuine network errors.
The select on ctx.Done() ensures retries stop
if the parent context is cancelled — an HTTP request cancelled by the client,
a server shutdown, a global timeout.
Without this, the goroutine keeps retrying into the void.
Timeout and context: the third line of defense
Without a per-request timeout, a connection to Binance can block indefinitely if the server accepts the TCP connection but never responds. The circuit breaker doesn't trigger — calls don't fail, they just wait. Goroutines accumulate. The symptom is identical to a hard outage, but the protection mechanism never fires.
```go
type ExchangeService struct {
	binance *BinanceClient
	cb      *CircuitBreaker

	cacheMu sync.RWMutex
	cache   map[string]CachedOrderBook // fallback cache, read by getFallbackOrderBook (see below)
}

func (s *ExchangeService) GetOrderBook(ctx context.Context, pair string) (*OrderBook, error) {
	// Per-request timeout, layered on top of the parent context
	callCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()

	var result *OrderBook
	err := s.cb.Call(func() error {
		var err error
		result, err = s.binance.GetOrderBook(callCtx, pair)
		return err
	})
	if err != nil {
		if errors.Is(err, ErrCircuitOpen) {
			// Circuit open: serve fallback rather than a 500
			return s.getFallbackOrderBook(pair)
		}
		return nil, err
	}
	return result, nil
}
```
The three layers each play a distinct role. The timeout (2s)
ensures a slow call fails fast and increments the circuit breaker's failure counter.
The circuit breaker opens after repeated failures and avoids calling
a clearly down exchange. The fallback serves degraded data when the circuit is open.
callCtx derives from the parent context: if the parent is cancelled
(HTTP request dropped by client), the exchange call stops too.
The 2s timeout is an upper bound, not a guaranteed duration.
Fallback and graceful degradation
When the circuit is open, what do you return? The answer depends on the data type.
Order book, mid price, spread. The last known value with a timestamp is acceptable for a few minutes. A 3-minute-old order book is better than a 503 that crashes the calling service. The cache should expose the data age so consumers can decide for themselves.
```go
type CachedOrderBook struct {
	Data      *OrderBook
	FetchedAt time.Time
}

func (s *ExchangeService) getFallbackOrderBook(pair string) (*OrderBook, error) {
	s.cacheMu.RLock()
	cached, ok := s.cache[pair]
	s.cacheMu.RUnlock()

	if !ok {
		return nil, fmt.Errorf("circuit open and no cached data for %s", pair)
	}

	// Warn if data is too stale
	if time.Since(cached.FetchedAt) > 10*time.Minute {
		slog.Warn("serving stale order book", "pair", pair,
			"age", time.Since(cached.FetchedAt))
	}
	return cached.Data, nil
}
```
Account balances. Stale data is dangerous here. If the service makes trading decisions based on a 10-minute-old balance, it may exceed the real available amount. For this type of data, the right answer is an explicit error — not a silent fallback.
func (s *ExchangeService) GetBalance(ctx context.Context) (*Balance, error) {
callCtx, cancel := context.WithTimeout(ctx, 3*time.Second)
defer cancel()
var result *Balance
err := s.cb.Call(func() error {
var err error
result, err = s.binance.GetBalance(callCtx)
return err
})
if err != nil {
// No fallback for balances — stale data is worse than an error
return nil, fmt.Errorf("balance unavailable: %w", err)
}
return result, nil
}
The distinction "stale data acceptable / stale data dangerous" is a business decision, not a technical one. It needs to be made per domain, per endpoint, and documented explicitly in the code — not left to the judgment of whoever adds a fallback "to make it work".
sony/gobreaker vs rolling your own
The github.com/sony/gobreaker
library is the Go reference for circuit breakers. It's battle-tested,
well-documented, and covers cases the implementation above ignores:
sliding window counting, state-change callbacks,
configurable success/failure conditions.
```go
import "github.com/sony/gobreaker"

cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
	Name:        "binance-orderbook",
	MaxRequests: 1,                // probes allowed in HalfOpen
	Interval:    10 * time.Second, // counting window
	Timeout:     30 * time.Second, // Open state duration
	ReadyToTrip: func(counts gobreaker.Counts) bool {
		return counts.ConsecutiveFailures >= 5
	},
	OnStateChange: func(name string, from, to gobreaker.State) {
		slog.Info("circuit breaker state change",
			"name", name, "from", from, "to", to)
	},
})
```
The hand-rolled implementation covers 90% of needs with no dependency.
gobreaker is worth the import if you need the sliding window
(to avoid a burst of 5 errors in 1 second opening the circuit when the overall
error rate is low) or state-change callbacks to feed Prometheus metrics.
For a service consuming 2-3 exchanges with predictable error patterns,
the hand-rolled version is more readable and easier to adapt.
For a platform managing 20 exchanges with SLA dashboards,
gobreaker is the right call.
Conclusion
The real cost of a Binance maintenance window without a circuit breaker isn't the 18,000 lost requests — it's the silent degradation. Accumulating goroutines don't produce an immediate error. Memory climbs slowly. The service stays "up" as far as health checks are concerned, but it's dying.
Circuit breaker + retry + timeout are three distinct mechanisms protecting against three different classes of problems: sustained outages, transient errors, and stalled connections. All three together form a resilience layer that leaves your service indifferent to exchange incidents.
The fallback decision is the only one requiring genuine thought. Technical code can be copied. Deciding whether a 5-minute-old order book is acceptable in your business context — that's something nobody else can decide for you.