Concurrency in Go — Part 3: context, errgroup and production patterns

Series: Concurrency in Go

Goroutines and channels, you've got those down. Now the real questions: how do you cancel cleanly? How do you handle errors from multiple goroutines without losing details? How do you prevent the program from crashing at 3am because SIGTERM cut an operation right in the middle?

This part covers the tools and patterns I use in production. No pure theory — answers to problems I've actually encountered.

1. context.Context — cancelling goroutines cleanly

First instinct when discovering Go: look for how to "kill" a goroutine from the outside. The answer is simple and slightly unsettling: you can't. That's intentional. Go doesn't expose a forced-stop mechanism because a forced stop leaves resources in an unknown state — half-written open files, uncommitted transactions, locks never released.

The clean solution is a cooperative signal: you signal to the goroutine that it should stop, and it decides when it's safe to do so. That's exactly what context.Context does.

Three constructors to know:

  • context.WithCancel — manual cancellation, call cancel() when you want
  • context.WithTimeout — automatic cancellation after a duration
  • context.WithDeadline — cancellation at a specific time

In all cases, the pattern is identical: the goroutine listens to ctx.Done(), a channel closed when the context is cancelled.

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel() // releases resources even if the timeout hasn't fired yet

go func() {
    select {
    case <-ctx.Done():
        fmt.Println("cancelled:", ctx.Err()) // context.DeadlineExceeded or Canceled
        return
    case résultat := <-traitement(ctx):
        fmt.Println("result:", résultat)
    }
}()

The defer cancel() is mandatory, even if the timeout will fire on its own. Without it, the derived context and its resources (including the timer watching the deadline) stay alive until expiration. The lostcancel check in go vet will remind you if you forget it.

In practice, for an HTTP request that spawns multiple child goroutines:

func handleSearch(w http.ResponseWriter, r *http.Request) {
    ctx, cancel := context.WithTimeout(r.Context(), 3*time.Second)
    defer cancel()

    // Both parallel requests receive the same context.
    // If the timeout expires, both are cancelled automatically.
    résultatsDB    := make(chan []Item, 1)
    résultatsCache := make(chan []Item, 1)

    go func() {
        items, err := db.Search(ctx, r.URL.Query().Get("q"))
        if err != nil {
            // Simplification: the DB error is swallowed and the
            // cache result alone will be rendered.
            résultatsDB <- nil
            return
        }
        résultatsDB <- items
    }()

    go func() {
        items, _ := cache.Search(ctx, r.URL.Query().Get("q"))
        résultatsCache <- items
    }()

    fromDB    := <-résultatsDB
    fromCache := <-résultatsCache
    render(w, merge(fromDB, fromCache))
}

Important rule: context always flows downward, never upward. A function can restrict the context it receives (add a shorter timeout), but never cancel it from inside to signal something to the caller. For that, use normal return values.

2. errgroup — goroutines + clean error handling

sync.WaitGroup has a well-known limitation: it doesn't handle errors. The classic pattern to work around that is to add an error channel:

// WaitGroup + error channel pattern — works, but verbose
var wg sync.WaitGroup
errs := make(chan error, len(urls))

for _, url := range urls {
    url := url // redundant since Go 1.22's per-iteration loop variables
    wg.Add(1)
    go func() {
        defer wg.Done()
        if err := télécharger(url); err != nil {
            errs <- err
        }
    }()
}

wg.Wait()
close(errs)

for err := range errs {
    log.Println("error:", err)
}

It works, but it's boilerplate you'll rewrite every time. golang.org/x/sync/errgroup condenses this into a few lines:

import (
    "context"
    "log"

    "golang.org/x/sync/errgroup"
)

g, ctx := errgroup.WithContext(context.Background())

for _, url := range urls {
    url := url // redundant since Go 1.22's per-iteration loop variables
    g.Go(func() error {
        return télécharger(ctx, url)
    })
}

// Wait blocks until all goroutines have finished.
// It returns the first non-nil error encountered.
if err := g.Wait(); err != nil {
    log.Fatal("at least one error:", err)
}

With errgroup.WithContext, if a goroutine returns an error, the context is cancelled automatically. All other goroutines listening to ctx.Done() stop cleanly. This is fail-fast behavior: as soon as one part fails, we don't continue pointlessly.

An important nuance: g.Wait() returns the first error, not all of them. If you need to collect all errors to log or return them to the client, the WaitGroup + channel pattern is still relevant. errgroup fits cases where a single error is enough to invalidate the whole operation.

3. Graceful shutdown

SIGTERM arrives — deployment, restart, clean kill — and two behaviors are possible. The bad one: the process stops immediately, the in-progress job is lost halfway through, the database has an open transaction that will sit there until timeout. The good one: goroutines finish what they're doing, release their resources, then the process exits.

Since Go 1.16, signal.NotifyContext simplifies this greatly:

import (
    "context"
    "log"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
    defer stop() // deregisters the signal handler

    // ctx is cancelled as soon as SIGTERM or Ctrl+C arrives.
    // All goroutines that receive this context stop cleanly.
    if err := runApplication(ctx); err != nil {
        log.Fatal(err)
    }
}

The key point: this single context flows down through the entire application. No additional plumbing needed. When SIGTERM arrives, the signal propagates automatically to everything listening to ctx.Done().

For a worker processing jobs from a queue, graceful shutdown looks like this:

func worker(ctx context.Context, jobs <-chan Job) {
    for {
        select {
        case <-ctx.Done():
            slog.Info("worker: clean stop")
            return
        case job, ok := <-jobs:
            if !ok {
                return // channel closed, no more jobs to process
            }
            // The worker never abandons a job mid-loop. We pass ctx
            // to internal operations (DB, HTTP) so that processJob
            // itself decides how to honor cancellation.
            if err := processJob(ctx, job); err != nil {
                slog.Error("job failed", "id", job.ID, "error", err)
            }
        }
    }
}

Subtlety: in a select, if both cases are ready simultaneously (a job available AND the context cancelled), Go picks one at random. To guarantee we finish the current job before stopping, check ctx.Done() only between jobs, not during their processing — that's what this pattern does: processJob runs to completion, and it's only on the next loop iteration that the select detects the cancellation.

4. sync.Mutex and sync.RWMutex

Channels are elegant, but they have a cost. For sharing state between goroutines, a mutex is often more appropriate — especially when the "sharing" boils down to reading or writing to a map or struct.

sync.RWMutex distinguishes readers from writers. Multiple goroutines can read simultaneously; only one can write, and it waits until all current readers have finished. This is the ideal pattern for an in-memory cache:

type Cache struct {
    mu    sync.RWMutex
    store map[string]string
}

func NewCache() *Cache {
    return &Cache{store: make(map[string]string)}
}

func (c *Cache) Get(key string) (string, bool) {
    c.mu.RLock() // multiple simultaneous readers allowed
    defer c.mu.RUnlock()
    val, ok := c.store[key]
    return val, ok
}

func (c *Cache) Set(key, val string) {
    c.mu.Lock() // exclusive write — waits for all readers to finish
    defer c.mu.Unlock()
    c.store[key] = val
}

func (c *Cache) Delete(key string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    delete(c.store, key)
}

Mnemonic rule: if the coordination logic is naturally expressed as "send this to someone who will process it", use a channel. If it's "multiple goroutines accessing the same data", use a mutex.

5. sync.Once — thread-safe initialization

Classic case: an expensive resource to initialize — DB connection, loading a config file, parsing a TLS certificate — that you want to create on first use, only once, even if multiple goroutines arrive simultaneously.

type DBPool struct {
    once sync.Once
    pool *sql.DB
    err  error
}

func (d *DBPool) Get() (*sql.DB, error) {
    d.once.Do(func() {
        d.pool, d.err = sql.Open("postgres", os.Getenv("DATABASE_URL"))
    })
    return d.pool, d.err
}

var globalDB DBPool

func getDB() (*sql.DB, error) {
    return globalDB.Get()
}

sync.Once guarantees the function passed to Do executes only once, even if ten goroutines call Get() simultaneously. The other nine wait until the first finishes, then retrieve the stored result.

Note: if the function fails, it won't be re-executed. If you need initialization with retry, sync.Once is not enough — you'll need to handle that manually with a mutex and a state flag.

6. Real-world case — production worker pool

A complete example combining everything we've seen. This is the structure I use for job processing systems: data imports, email sending, batch report generation.

package main

import (
    "context"
    "errors"
    "fmt"
    "log/slog"
    "os"
    "os/signal"
    "sync"
    "syscall"
    "time"
)

type Job struct {
    ID      string
    Payload []byte
}

type Stats struct {
    mu      sync.Mutex
    success int
    failed  int
}

func (s *Stats) RecordSuccess() {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.success++
}

func (s *Stats) RecordFailure() {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.failed++
}

func (s *Stats) Log() {
    s.mu.Lock()
    defer s.mu.Unlock()
    slog.Info("final statistics", "success", s.success, "failed", s.failed)
}

func processJob(ctx context.Context, job Job) error {
    // Simulates an operation that respects context (DB request, HTTP call...)
    select {
    case <-ctx.Done():
        return ctx.Err()
    case <-time.After(100 * time.Millisecond):
    }

    if job.ID == "bad-job" {
        return errors.New("invalid payload")
    }
    return nil
}

func runWorker(ctx context.Context, id int, jobs <-chan Job, stats *Stats, wg *sync.WaitGroup) {
    defer wg.Done()
    slog.Info("worker started", "worker", id)

    for {
        select {
        case <-ctx.Done():
            slog.Info("worker stopped cleanly", "worker", id)
            return
        case job, ok := <-jobs:
            if !ok {
                slog.Info("worker done (queue empty)", "worker", id)
                return
            }

            start := time.Now()
            err := processJob(ctx, job)
            duration := time.Since(start)

            if err != nil {
                // Partial failure: log and continue.
                // A failed job doesn't impact others.
                slog.Error("job failed",
                    "worker", id,
                    "job_id", job.ID,
                    "duration_ms", duration.Milliseconds(),
                    "error", err,
                )
                stats.RecordFailure()
            } else {
                slog.Info("job processed",
                    "worker", id,
                    "job_id", job.ID,
                    "duration_ms", duration.Milliseconds(),
                )
                stats.RecordSuccess()
            }
        }
    }
}

func main() {
    // Graceful shutdown: ctx cancelled on SIGTERM or Ctrl+C
    ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
    defer stop()

    const numWorkers = 4
    jobs  := make(chan Job, 100) // buffer to absorb bursts
    stats := &Stats{}

    var wg sync.WaitGroup
    for i := range numWorkers {
        wg.Add(1)
        go runWorker(ctx, i+1, jobs, stats, &wg)
    }

    // Job producer
    // In practice: reading from Redis, SQS, Kafka, etc.
    go func() {
        defer close(jobs) // signal to workers: nothing more to distribute
        for i := range 20 {
            select {
            case <-ctx.Done():
                slog.Info("producer stopped, remaining jobs abandoned")
                return
            case jobs <- Job{ID: fmt.Sprintf("job-%d", i)}:
            }
        }
    }()

    wg.Wait() // waits for all workers to finish cleanly
    stats.Log()
    slog.Info("shutdown complete")
}

What this code guarantees in production:

  • SIGTERM arrives → the producer stops distributing new jobs
  • Workers don't abandon a job mid-loop; processJob decides how to honor cancellation
  • A failed job doesn't block others (partial failure pattern)
  • Statistics are mutex-protected and available at end of run
  • No goroutine leaks: all finish before wg.Wait()

7. Detecting problems

Even with the right patterns, concurrency bugs happen. Three essential tools:

The race detector

Enable it systematically in development and in CI. It instruments the binary to report unprotected concurrent accesses at the moment they occur:

go run -race ./...
go test -race ./...

The cost is real (the Go docs cite a 2x to 20x slowdown and 5x to 10x memory overhead), so don't enable -race in production builds. In tests, though, it's indispensable: a race condition that stays silent in dev becomes an incident in prod under load.

Monitoring goroutine count

import (
    "log/slog"
    "runtime"
)

// In a debug handler, or in a ticker every 30s
slog.Info("active goroutines", "count", runtime.NumGoroutine())

If this number grows without plateauing on a process handling a constant stream, you probably have a leak. For more on detection and fixes, I wrote a dedicated article: Goroutine leaks: detect and fix.

pprof in production

import _ "net/http/pprof"

// In main(), parallel to the main server.
// Never expose publicly — loopback only.
go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()

From your machine, via an SSH tunnel:

# Goroutine profile: see all blocked goroutines with their stacks
go tool pprof http://localhost:6060/debug/pprof/goroutine

# In the interactive interface
(pprof) top    # most represented stacks
(pprof) web    # SVG graph of the stacks (requires graphviz)

The pprof goroutine profile is the most powerful tool for diagnosing abnormal behavior in production. It shows exactly where each goroutine is blocked, with full stacks. In seconds, you know whether the problem comes from a contended lock, a channel that never receives, or a DB query without timeout.

What you can do now

In three parts, you've gone from the basics to production patterns:

  • Part 1 — goroutines, WaitGroup, race conditions and the race detector
  • Part 2 — buffered and unbuffered channels, select, worker pool
  • Part 3 — context, errgroup, graceful shutdown and production patterns

Concurrency in Go isn't magically simple, but it's consistent. The same patterns appear across all projects: context flowing down, channels coordinating, mutex protecting shared state. Once internalized, writing readable and robust concurrent code becomes natural.
