Name: Designing Data-Intensive Applications
Author: Martin Kleppmann
ISBN: 9781449373320

The complete map of data systems: SQL, NoSQL, replication, streaming. The absolute must-read.

Why this book

Every web developer lives on top of a stack they didn't build: a SQL database behind an ORM, a Redis cache, a message queue, a search index. As long as everything works, that stack is invisible. The day a MySQL replica returns stale data, two concurrent requests silently clobber a counter, or a network timeout leaves the system in an inconsistent state, you discover you don't really understand your own foundations.

Martin Kleppmann wrote the book that fills that gap. Almost four years of writing, more than 800 references, and a promise kept from cover to cover: explaining not how to use any particular tool, but why each tool works the way it works, and what questions to ask before picking one. Since 2017, "DDIA" has probably been the most recommended book in all of backend engineering. It deserves it.

The ideas that stay

1. You are already a data system designer

Tool categories have quietly merged: Redis is a datastore that also gets used as a message queue, and Kafka is a message queue with database-like durability guarantees. As soon as your application combines a database, a cache and a search index kept in sync by your own code, Kleppmann is blunt: "You are now not only an application developer, but also a data system designer" (p. 5).

The title isn't honorary, it comes with the liabilities: your application code is what guarantees (or fails to guarantee) that the cache gets invalidated at the right time. The whole book follows from this observation: since you're assembling data systems anyway, you might as well understand what's inside them.

2. Describe the load before it crushes you: Twitter's fan-out

"X is scalable" means nothing; the real question is "if the load grows in a particular way, what are our options?". The book's flagship example: Twitter in 2012, 4,600 tweets posted per second on average, but 300,000 timeline reads per second. Precomputing every timeline at write time means copying each tweet to every follower: at an average of 75 followers per user, 4,600 tweets/s become 345,000 writes/s, and an account with 30 million followers triggers 30 million writes for a single tweet (p. 11-13). The final solution is hybrid: fan-out at write time for most accounts, while tweets from celebrities are fetched separately and merged in at read time.

The lesson outlives Twitter: identify your load parameter (here, the distribution of followers) before choosing an architecture. And measure in percentiles, not averages: Amazon specifies its services at the 99.9th percentile because the slowest requests hit the customers with the fullest accounts, hence the most valuable ones (p. 15).
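Both halves of this section come down to small calculations, sketched below. The 75-follower average is the figure the book's arithmetic relies on; the function names and the latency samples are invented for illustration.

```python
import math

TWEETS_PER_SECOND = 4_600        # average write rate quoted above
AVG_FOLLOWERS = 75               # average follower count in the book's example
CELEBRITY_FOLLOWERS = 30_000_000


def timeline_writes(tweets: int, followers_per_author: int) -> int:
    """Writes needed to copy `tweets` into every follower's precomputed timeline."""
    return tweets * followers_per_author


def percentile(samples, p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = min(len(ordered), max(1, math.ceil(p / 100 * len(ordered))))
    return ordered[rank - 1]


print(timeline_writes(TWEETS_PER_SECOND, AVG_FOLLOWERS))  # 345000
print(timeline_writes(1, CELEBRITY_FOLLOWERS))            # 30000000

# Hypothetical response times in ms: 990 fast requests, 10 very slow ones.
latencies_ms = [20.0] * 990 + [2_000.0] * 10
print(sum(latencies_ms) / len(latencies_ms))  # 39.8
print(percentile(latencies_ms, 50))           # 20.0
print(percentile(latencies_ms, 99.9))         # 2000.0
```

The mean (39.8 ms) describes no real request: half the users see 20 ms, and the unlucky tail sees two full seconds. That tail is what a 99.9th-percentile target is meant to catch.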

3. The world's simplest database fits in two bash functions

db_set () {
    echo "$1,$2" >> database
}
db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

Constant-time writes (append to the end of the file), catastrophic reads (scan everything). That's the opening of chapter 3, and every real storage engine is just an answer to this imbalance, summed up in one law: "well-chosen indexes speed up read queries, but every index slows down writes" (p. 71). From there, two families:

LSM-trees (Cassandra, RocksDB) — keep the append-only spirit, compact in the background; fast at writing, better for high-volume ingest.
B-trees (PostgreSQL, MySQL) — overwrite pages in place; fast and predictable at reading, better for mixed workloads.

Kleppmann refuses to decide for you: "There is no quick and easy rule […] it is worth testing empirically" (p. 85).
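To see the index trade-off at its smallest, here is a rough Python sketch of the same append-only file with an in-memory hash index of byte offsets, along the lines of the hash indexes chapter 3 starts from. The class and method names are my own, and this is a toy: no compaction, no crash safety, and keys must not contain commas or newlines.

```python
import os
import tempfile
from typing import Optional


class AppendOnlyStore:
    """Append-only log file plus an in-memory hash index: key -> byte offset."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.index = {}                   # key -> offset of its latest record
        open(path, "ab").close()          # create the file if it is missing
        self._rebuild_index()

    def _rebuild_index(self) -> None:
        """Scan the log once at startup; later records win over earlier ones."""
        offset = 0
        with open(self.path, "rb") as log:
            for line in log:
                key = line.decode("utf-8").split(",", 1)[0]
                self.index[key] = offset
                offset += len(line)

    def set(self, key: str, value: str) -> None:
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as log:
            offset = log.seek(0, os.SEEK_END)   # where this record will start
            log.write(record)
        self.index[key] = offset                # the extra work every write now pays

    def get(self, key: str) -> Optional[str]:
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as log:
            log.seek(offset)                    # one seek instead of a full scan
            line = log.readline().decode("utf-8")
        return line.rstrip("\n").split(",", 1)[1]


with tempfile.TemporaryDirectory() as tmp:
    db = AppendOnlyStore(os.path.join(tmp, "database"))
    db.set("42", "San Francisco")
    db.set("42", "Berlin")
    print(db.get("42"))   # Berlin
```

Reads no longer scan the file, but every write now updates the index too, and the whole index has to fit in memory. That is the quoted law in miniature.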

4. NoSQL is a remake: this match was already played in 1970

IMS, the IBM database originally developed for stock-keeping in the Apollo space program (1968), stored data as nested trees that look strangely like the JSON of modern document databases. Same strengths (one-to-many feels natural), same dead ends: many-to-many relationships are painful and joins don't exist. "These problems of the 1960s and '70s were very much like the problems that developers are running into with document databases today" (p. 36). Codd's relational model won the "great debate" of the 70s precisely by hiding the implementation behind a clean interface.

As for "schemaless", Kleppmann dismantles it with one analogy: the schema always exists, the only question is who enforces it:

Schema-on-write (relational) — the database enforces the structure at insert time, like static typing: bad data is rejected upfront.
Schema-on-read (document) — the application interprets the structure at read time, like dynamic typing: flexibility now, potential surprises later.
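A small sketch of the two approaches side by side, assuming a hypothetical users collection whose single name field was later split into first_name and last_name. The field names and the helpers are mine, and SQLite stands in for "a relational database" only because it ships with Python.

```python
import sqlite3


def full_name_parts(doc: dict) -> tuple:
    """Schema-on-read: the reader has to cope with both document shapes."""
    if "first_name" in doc:                                  # new format
        return doc["first_name"], doc.get("last_name", "")
    first, _, last = doc.get("name", "").partition(" ")      # old format: one "name" field
    return first, last


def insert_is_rejected() -> bool:
    """Schema-on-write: the database refuses a row that breaks the declared structure."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (first_name TEXT NOT NULL, last_name TEXT NOT NULL)")
    try:
        conn.execute("INSERT INTO users (first_name) VALUES ('Ada')")  # last_name missing
        return False
    except sqlite3.IntegrityError:
        return True
    finally:
        conn.close()


print(full_name_parts({"name": "Ada Lovelace"}))                          # ('Ada', 'Lovelace')
print(full_name_parts({"first_name": "Ada", "last_name": "Lovelace"}))    # ('Ada', 'Lovelace')
print(insert_is_rejected())                                               # True
```

The schema did not disappear in the first function; it moved into an if statement that every reader of the data now has to carry.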

Illustration: a split scene in which a 1970s engineer at a mainframe holds a tree diagram on paper while a developer at a laptop looks at the exact same tree on screen. Caption: 1968, 2010: the same tree, the same dead ends, forty years apart.

5. Replicating is easy. Replicating changes is the whole problem

"All of the difficulty in replication lies in handling changes to replicated data" (p. 151). A leader takes the writes, followers trail behind with variable delay: this replication lag manufactures anomalies the chapter names one by one:

Read-your-own-writes — you post a comment, reload, it's gone: your read hit a lagging follower that doesn't have your write yet.
Consistent prefix reads — an answer appears before its question, a causality violation illustrated by a Pratchett-worthy exchange between Mr. Poons and Mrs. Cake (p. 165).
Failover hazards — at GitHub, a lagging follower promoted to leader reused primary keys already distributed; since those keys indexed a Redis store too, private data went to the wrong users (p. 157).

Every extra guarantee has a cost; pretending asynchronous replication is synchronous is "a recipe for problems down the line".
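The vanishing comment, and one of the fixes the chapter discusses (the client remembers the position of its last write, and a replica may only answer once it has caught up to it), can be sketched in a few lines. This is a single-process toy with invented class names, not how any particular database implements it.

```python
class Replica:
    """A node that applies writes in log order; `applied` is its log position."""

    def __init__(self) -> None:
        self.data = {}
        self.applied = 0

    def apply(self, entry: tuple) -> None:
        key, value = entry
        self.data[key] = value
        self.applied += 1


class Cluster:
    def __init__(self) -> None:
        self.log = []
        self.leader = Replica()
        self.follower = Replica()

    def write(self, key: str, value: str) -> int:
        """Writes go to the leader; returns the log position the client should remember."""
        self.log.append((key, value))
        self.leader.apply((key, value))
        return self.leader.applied

    def replicate(self) -> None:
        """Asynchronous replication, run some time later: the follower catches up."""
        for entry in self.log[self.follower.applied:]:
            self.follower.apply(entry)

    def read(self, key: str, min_position: int = 0):
        """Serve from the follower only if it has reached the client's last write."""
        node = self.follower if self.follower.applied >= min_position else self.leader
        return node.data.get(key)


cluster = Cluster()
pos = cluster.write("comment:1", "Great book")
print(cluster.read("comment:1"))                    # None: the lagging follower answered
print(cluster.read("comment:1", min_position=pos))  # Great book: routed to the leader
cluster.replicate()
print(cluster.read("comment:1"))                    # Great book: the follower caught up
```

The guarantee is not free: every read that carries a position the follower has not reached lands on the leader, which is exactly the load replication was supposed to take away.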

6. "ACID has unfortunately become mostly a marketing term"

The sentence is Kleppmann's (p. 223), and he documents it: the C (consistency) was "tossed in to make the acronym work" according to Joe Hellerstein, and it's a property of your application anyway, not of the database. Worse, the same isolation level goes by different names and the same name covers different guarantees: what Oracle calls "serializable" is actually snapshot isolation, which PostgreSQL and MySQL call "repeatable read", to the point that "nobody really knows what repeatable read means" (p. 242).

Chapter 7 is worth the book on its own for its gallery of named anomalies. The sneakiest, write skew: Alice and Bob, doctors on call, sign off at the same time. The drama reads in two SQL transactions running in parallel:

-- T1 (Alice)                          -- T2 (Bob), AT THE SAME TIME
SELECT count(*) FROM oncall            SELECT count(*) FROM oncall
 WHERE on_call = true;  -- → 2          WHERE on_call = true;  -- → 2 too
-- "2 on call, I can leave"            -- "2 on call, I can leave"
UPDATE oncall SET on_call = false      UPDATE oncall SET on_call = false
 WHERE doctor = 'Alice';                WHERE doctor = 'Bob';
COMMIT;  -- ✓                          COMMIT;  -- ✓ → zero doctors on call!

Each transaction read a true state at the moment it read it, then wrote on the strength of that now-stale read. Because the two transactions update different rows, no write conflict is detected: two valid transactions, zero doctors on call (p. 246-248). Serializable isolation (which behaves as if the transactions ran one after another) prevents it automatically; short of that, you have to lock the rows you read explicitly, for example with SELECT ... FOR UPDATE. Almost no database enables serializable isolation by default: it's on you to ask for it when the stakes justify it.
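The same scenario as a toy simulation, for readers who want to watch it fail. Everything here is invented for illustration: the classes are mine, and the commit-time check is only loosely in the spirit of the serializable snapshot isolation the book describes, not a model of any real engine.

```python
class Conflict(Exception):
    """Raised when a row the transaction relied on changed before it committed."""


class Database:
    def __init__(self, rows: dict) -> None:
        self.rows = dict(rows)
        self.version = {key: 0 for key in rows}   # bumped on every committed write


class Transaction:
    def __init__(self, db: Database, serializable: bool) -> None:
        self.db = db
        self.serializable = serializable
        self.snapshot = dict(db.rows)       # consistent snapshot taken at start
        self.seen = dict(db.version)        # row versions at snapshot time
        self.read_keys = set()
        self.writes = {}

    def count_on_call(self) -> int:
        self.read_keys.update(self.snapshot)    # the count depends on every row
        return sum(self.snapshot.values())

    def set_on_call(self, doctor: str, value: bool) -> None:
        self.writes[doctor] = value

    def commit(self) -> None:
        if self.serializable:
            for key in self.read_keys:
                if self.db.version[key] != self.seen[key]:
                    raise Conflict(f"{key} changed since it was read")
        for key, value in self.writes.items():
            self.db.rows[key] = value
            self.db.version[key] += 1


def both_sign_off(serializable: bool) -> int:
    """Alice and Bob both try to leave; returns how many doctors remain on call."""
    db = Database({"Alice": True, "Bob": True})
    t1, t2 = Transaction(db, serializable), Transaction(db, serializable)
    for txn, doctor in ((t1, "Alice"), (t2, "Bob")):
        if txn.count_on_call() >= 2:        # both see 2 in their own snapshot
            txn.set_on_call(doctor, False)
    for txn in (t1, t2):
        try:
            txn.commit()
        except Conflict:
            pass                            # a real application would retry here
    return sum(db.rows.values())


print(both_sign_off(serializable=False))  # 0: write skew
print(both_sign_off(serializable=True))   # 1: the second commit is aborted
```

On retry, Bob's transaction would take a fresh snapshot, count one doctor on call, and decline to sign off, which is the outcome the hospital wanted in the first place.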

7. The network lies, the clock lies, and so does your process

Chapter 8 is the book's most famous, and its most counter-intuitive. In a distributed system you can't trust anything at face value:

The network lies — an unanswered request doesn't tell you whether the request was lost, the server died, or it's just sitting in a garbage collector pause (those "stop-the-world" pauses can last several minutes, p. 296).
Clocks lie — Google budgets 200 ppm for its servers, which is 17 seconds of drift per day without resynchronization (p. 289). A thread can check that it holds a lock, freeze for 15 seconds, then write while believing it's still the leader; the cure, the fencing token, is a simple increasing counter that the storage verifies.
Processes lie — a process can pause mid-execution, lose its lease, resume, and still believe it's the leader. The litany of real-world causes runs from sharks biting undersea cables to a hypoglycemic driver crashing his pickup into a datacenter's HVAC (p. 275, 279).

Hence the chapter's motto: "In distributed systems, suspicion, pessimism, and paranoia pay off" (p. 277).
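The fencing token is simple enough to sketch in full. The class names are hypothetical and everything runs in one process; in a real deployment the lock service and the storage would be separate systems, and the check would live in the storage layer.

```python
class LockService:
    """Hands out a lease together with a monotonically increasing fencing token."""

    def __init__(self) -> None:
        self._token = 0

    def acquire(self) -> int:
        self._token += 1
        return self._token


class FencedStorage:
    """Rejects any write carrying a token older than the newest one it has seen."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value = None

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False                 # stale lease holder: refuse the write
        self.highest_token = token
        self.value = value
        return True


locks, storage = LockService(), FencedStorage()
token_a = locks.acquire()   # client A gets token 1, then freezes (GC pause, VM suspend...)
token_b = locks.acquire()   # A's lease has expired; client B gets token 2
print(storage.write(token_b, "from B"))   # True
print(storage.write(token_a, "from A"))   # False: A woke up, but token 1 < 2
print(storage.value)                      # from B
```

The point of the design is that the storage never has to trust a client's belief about who holds the lock: it only compares two integers.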

Illustration: a cross-section of an ocean in which a shark bites an undersea data cable, sparks fly, and a small datacenter on the distant shore lights a red warning. Caption: The network can fail for any reason. Including this one (the book is serious, p. 279).

8. Forget the CAP theorem, learn the word "linearizable"

CAP as "consistency, availability, partition tolerance: pick 2 out of 3" is misleading, says Kleppmann: a network partition isn't an architect's choice, it's a fault that will happen to you anyway. His verdict is dry: "CAP is best avoided" (p. 337).

The useful concept instead: linearizability, the illusion that only one copy of the data exists. The book's example is crystal clear: Alice and Bob are watching the 2014 World Cup final, Alice refreshes and announces the final score; Bob refreshes after hearing her, and his phone, hitting a lagging replica, still shows the game as ongoing (p. 325).

You have likely lived the dev version: a site replicating its database across servers to handle load; a user posts a comment (written to the primary), refreshes instantly, and their read hits a replica 200 ms behind that hasn't received it yet, so their own comment has "vanished". The linearizable fix (always read from the primary) kills the bug but gives up what the replicas were there for: lower read latency and fault tolerance. The book teaches you to know when you actually need it, and when causal consistency is enough.

9. Your application state is the integral of its event stream

The most beautiful idea of the book's final act fits in one sentence: state is what you get when you integrate an event stream over time; the change stream is what you get when you differentiate the state (p. 460).

Concretely: instead of writing to the database AND the cache AND the index (three writes that can contradict each other), you write to an ordered log (Kafka) and everything else derives from it in the same order. That's Change Data Capture, and it's the idea Kleppmann named "database inside-out". His unfinished dream: being able to write mysql | elasticsearch like a Unix pipe, the unbundled equivalent of CREATE INDEX (p. 503). Today, entire products (Debezium, Materialize) live off this idea.
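A sketch of that idea with a plain Python list standing in for the ordered log (no Kafka API involved; the event shape and the function names are assumptions of mine). The cache and the search index are both derived from the same log, and a diff function goes the other way, from states back to change events.

```python
from collections import defaultdict


def build_cache(log: list) -> dict:
    """'Integrate' the event stream: replay it in order to get the current state."""
    cache = {}
    for event in log:
        if event["op"] == "put":
            cache[event["key"]] = event["text"]
        elif event["op"] == "delete":
            cache.pop(event["key"], None)
    return cache


def build_search_index(log: list) -> dict:
    """A second derived view of the same log: word -> set of keys containing it."""
    index = defaultdict(set)
    for key, text in build_cache(log).items():
        for word in text.lower().split():
            index[word].add(key)
    return dict(index)


def diff(before: dict, after: dict) -> list:
    """'Differentiate' two states back into the change events between them."""
    events = [{"op": "delete", "key": k} for k in before if k not in after]
    events += [{"op": "put", "key": k, "text": v}
               for k, v in after.items() if before.get(k) != v]
    return events


log = [
    {"op": "put", "key": "doc1", "text": "replication lag"},
    {"op": "put", "key": "doc2", "text": "write skew"},
    {"op": "put", "key": "doc1", "text": "replication log"},
    {"op": "delete", "key": "doc2"},
]
state = build_cache(log)
print(state)                            # {'doc1': 'replication log'}
print(build_search_index(log)["log"])   # {'doc1'}
print(diff({}, state))                  # [{'op': 'put', 'key': 'doc1', 'text': 'replication log'}]
```

Because both views replay the same events in the same order, they cannot disagree about what happened, and a corrupted view can simply be rebuilt from the log.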

Three things I didn't know before reading it

Every chapter opens with an illustrated map, drawn like a treasure map of the chapter's territory (by Shabbir Diwan and Edie Freedman). Kleppmann thanks them for taking on "the unconventional idea of creating maps" (p. xix).
Mrs. Cake, the medium who answers questions before they're asked in the replication chapter, is an official Terry Pratchett reference: Reaper Man is citation [25] of chapter 5.
Reference [14] of the partitioning chapter is a Mashable article titled "3% of Twitter's Servers Dedicated to Justin Bieber" (2010). Celebrity hot spots are not a theoretical problem.

My take, honestly

It's the best backend engineering book I've read, and that opinion is banal: everyone has been saying it since 2017. What makes it timeless isn't its catalog of technologies, it's its method: every concept arrives with the problem it solves, a named example you never forget (the doctors on call, the World Cup score, the pickup in the HVAC), and the trade-off stated plainly. Kleppmann is unusually honest: he writes "there is no quick and easy rule, test empirically" where others would sell a method, and he switches to the first person in chapter 12 to keep facts and opinions clearly separated.

The flaws are real. It's long, dense, and unevenly useful. A web developer will have a great time in chapters 1 through 8, but Hadoop-era batch processing (chapter 10) now reads like ancient history, MapReduce being dead and buried. The technologies have aged almost ten years: Riak is gone, Kafka no longer needs ZooKeeper. The principles, though, haven't moved a millimeter: LSM vs B-tree, replication lag, write skew and fencing tokens are exactly the same in 2026.

And then there's the final chapter, which almost nobody mentions. The book is dedicated "to everyone working toward the good", and its last section offers a chilling thought experiment: replace the word "data" with "surveillance" in your sentences ("our surveillance warehouse", "our surveillance scientists") and listen to how it sounds (p. 537). Written before the GDPR came into force and before LLMs, that chapter has aged better than everything else. A database book that ends on users' dignity: I don't know another one.

Odilon

Still relevant in 2026?

The principles, yes, entirely: they were already twenty years old when the book came out. The technology snapshot is 2017 vintage, and it shows in places (Riak, MapReduce, ZooKeeper). Note that this page covers the first edition: a second edition co-written with Chris Riccomini came out in late 2025. If you're buying it today, buy that one; the spine of the book is the same.

Who is it for?

Read it if

You've done backend for 2-3 years and want to understand what happens beneath the ORM, the replica and the message queue you already use
You pick infrastructure (SQL vs NoSQL, queue vs log) and want criteria, not fashions
You've already lost a day to a concurrency or stale-data bug you couldn't name
You're preparing system design interviews: this is THE reference book of the field

Skip it if

You're a beginner: without prior SQL practice and an app in production, the trade-offs will stay abstract
You only do frontend: half the book covers problems you'll never face directly
You want a tutorial: there isn't a single line of "how to install PostgreSQL" in the whole book, by design

For going further

The storage and transaction concepts connect with the SQL course on this site. Chapter 8's concurrency mindset is practiced hands-on in the Go course (goroutines, channels, race conditions). And the request path from client to server is mapped in the HTTP course.
