Designing Data-Intensive Applications

By Martin Kleppmann. The map of the whole data territory: storage, replication, transactions, streaming, and everything that breaks in between.

Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

9/10

"Dense, demanding, monumental: the book that turns 'works on my machine' into 'I know why it breaks in production'."

  • Author: Martin Kleppmann
  • Year: 2017 · O'Reilly Media
  • Pages: 614
  • This page: ~10 min read

Book rating across 5 dimensions:

  • Ideas: 10/10
  • Practical: 7/10
  • Readability: 8/10
  • Aged well: 8/10
  • Examples: 9/10

The complete map of data systems: SQL, NoSQL, replication, streaming. The absolute must-read.

Why this book

Every web developer lives on top of a stack they didn't build: a SQL database behind an ORM, a Redis cache, a message queue, a search index. As long as everything works, that stack is invisible. The day a MySQL replica returns stale data, two concurrent requests silently clobber a counter, or a network timeout leaves the system in an inconsistent state, you discover you don't really understand your own foundations.

Martin Kleppmann wrote the book that fills that gap. Almost four years of writing, more than 800 references, and a promise kept from cover to cover: explaining not how to use any particular tool, but why each tool works the way it works, and what questions to ask before picking one. Since 2017, "DDIA" has probably been the most recommended book in all of backend engineering. It deserves it.

The ideas that stay

1. You are already a data system designer

Tool categories merged without telling anyone: Redis is a datastore used as a message queue, Kafka is a message queue with database-grade durability guarantees. As soon as your application combines a database, a cache and a search index kept in sync by your own code, Kleppmann is blunt: "You are now not only an application developer, but also a data system designer" (p. 5). The title isn't honorary, it comes with the liabilities: your application code is what guarantees (or fails to guarantee) that the cache gets invalidated at the right time. The whole book follows from this observation: since you're assembling data systems anyway, you might as well understand what's inside them.

2. Describe the load before it crushes you: Twitter's fan-out

"X is scalable" means nothing; the real question is "if the load grows in a particular way, what are our options?". The book's flagship example: Twitter in 2012, with 4,600 tweets posted per second on average but 300,000 timeline reads per second. Precomputing every timeline at write time, with about 75 followers per tweet on average, turns 4,600 tweets/s into 345,000 writes/s, and an account with 30 million followers triggers 30 million writes for a single tweet (p. 11-13). The final solution is hybrid: fan-out on write for most accounts, while tweets from celebrities are fetched separately and merged in at read time. The lesson outlives Twitter: identify your load parameter (here, the distribution of followers) before choosing an architecture. And measure in percentiles, not averages: Amazon specifies its services at the 99.9th percentile because the slowest requests hit the customers with the fullest accounts, hence the most valuable ones (p. 15).

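The arithmetic behind those figures, as a rough sketch. The 4,600 tweets/s and the average of about 75 followers per tweet are the book's numbers; the function names and the latency samples are made up for illustration, and the percentile uses the simple nearest-rank definition (real monitoring tools may interpolate differently).

```python
import math

def fanout_writes_per_second(tweets_per_second: float, avg_followers: float) -> float:
    """Timeline writes generated when every tweet is pushed to each follower."""
    return tweets_per_second * avg_followers

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Figures quoted in the book: 4,600 tweets/s, about 75 followers on average.
writes = fanout_writes_per_second(4_600, 75)

# Hypothetical response times in milliseconds: 98 fast requests, 2 very slow ones.
latencies_ms = [20.0] * 98 + [1_500.0, 3_000.0]
mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"fan-out writes/s: {writes:,.0f}")
print(f"mean={mean_ms:.1f} ms  p50={p50_ms:.0f} ms  p99={p99_ms:.0f} ms")
```

With these invented samples the mean lands on a latency that no request actually experienced, while the median and the 99th percentile describe the typical user and the unlucky one respectively.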
3. The world's simplest database fits in two bash functions

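# Write: append "key,value" to the end of the file.
db_set () {
    echo "$1,$2" >> database
}
# Read: scan the whole file for the key and keep only the most recent value.
db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}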

Constant-time writes (append to the end of the file), catastrophic reads (scan everything). That's the opening of chapter 3, and every real storage engine is just an answer to this imbalance, summed up in one law: "well-chosen indexes speed up read queries, but every index slows down writes" (p. 71). From there, two families: LSM-trees (Cassandra, RocksDB), which keep the append-only spirit and compact in the background, fast at writing; and B-trees (PostgreSQL, MySQL), which overwrite pages in place, fast and predictable at reading. Kleppmann refuses to decide for you: "There is no quick and easy rule […] it is worth testing empirically" (p. 85).
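To make the index trade-off concrete, here is a minimal sketch of the simplest possible index over that append-only file: an in-memory hash map from key to byte offset. The class name `AppendOnlyStore` and its methods are hypothetical, and the sketch assumes keys contain no commas, values contain no newlines, and the process never restarts (the index is not rebuilt from the file).

```python
import os
import tempfile
from typing import Optional

class AppendOnlyStore:
    """Append-only log file plus an in-memory hash index of byte offsets."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.index: dict[str, int] = {}   # key -> offset of its latest record
        open(path, "ab").close()           # create the file if it is missing

    def set(self, key: str, value: str) -> None:
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)   # position where the record will start
            f.write(record)
        # The extra work on every write: keeping the index up to date.
        self.index[key] = offset

    def get(self, key: str) -> Optional[str]:
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)                    # jump straight to the record, no scan
            line = f.readline().decode("utf-8").rstrip("\n")
        _, _, value = line.partition(",")
        return value

path = os.path.join(tempfile.mkdtemp(), "database")
db = AppendOnlyStore(path)
db.set("42", "San Francisco")
db.set("7", "Paris")
db.set("42", "Lyon")          # the old record stays in the file, the index moves on
print(db.get("42"))
```

Reads no longer scan the file, but every write now pays for an index update, and the stale records pile up until something compacts them: the imbalance has moved, not disappeared.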

4. NoSQL is a remake: this match was already played in 1970

IMS, the IBM database originally developed for stock-keeping in the Apollo space program (1968), stored data as nested trees that look strangely like the JSON of modern document databases. Same strengths (one-to-many feels natural), same dead ends: many-to-many relationships are painful and joins don't exist. "These problems of the 1960s and '70s were very much like the problems that developers are running into with document databases today" (p. 36). Codd's relational model won the "great debate" of the 70s precisely by hiding the implementation behind a clean interface. As for "schemaless", Kleppmann dismantles it with one analogy: schema-on-read is dynamic typing, schema-on-write is static typing. The schema always exists; the only question is who enforces it.
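The typing analogy can be shown in a few lines. This is only an illustration with hypothetical user documents (a `name` field later split into `first_name` and `last_name`); `display_name`, `User` and `parse_user` are names invented for the sketch.

```python
from dataclasses import dataclass

# Schema-on-read: documents are stored as-is, and every reader copes with every shape.
def display_name(doc: dict) -> str:
    if "first_name" in doc:                          # newer documents
        return f'{doc["first_name"]} {doc["last_name"]}'
    return doc["name"]                               # older documents

# Schema-on-write: the shape is checked once, before the data is accepted.
@dataclass(frozen=True)
class User:
    first_name: str
    last_name: str

def parse_user(doc: dict) -> User:
    missing = {"first_name", "last_name"} - set(doc)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return User(doc["first_name"], doc["last_name"])

old_doc = {"name": "Ada Lovelace"}
new_doc = {"first_name": "Ada", "last_name": "Lovelace"}

print(display_name(old_doc), "|", display_name(new_doc))
print(parse_user(new_doc))
```

In both versions the schema exists; it lives either in the reading code or at the gate where data enters.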

5. Replicating is easy. Replicating changes is the whole problem

"All of the difficulty in replication lies in handling changes to replicated data" (p. 151). A leader takes the writes, followers trail behind with variable delay: this replication lag manufactures anomalies the chapter names one by one. You post a comment, you reload, it's gone: you read from a lagging follower, and the guarantee you were missing is read-your-own-writes. An answer shows up before its question: a causality violation, which the consistent prefix reads guarantee rules out (illustrated by a Pratchett-worthy dialogue between Mr. Poons and Mrs. Cake, p. 165). And failover isn't free: at GitHub, a lagging MySQL follower promoted to leader reused primary keys that the old leader had already handed out, and since those keys also indexed a Redis store, private data went to the wrong users (p. 157). Every extra guarantee has a cost; pretending asynchronous replication is synchronous is "a recipe for problems down the line".

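A toy model of the vanishing comment, assuming a single leader, a single asynchronous follower, and lag represented as log entries not yet applied. `LeaderFollowerStore` and its methods are hypothetical names; routing a user's reads to the leader while their latest write is still in flight is one possible way to provide read-your-own-writes, not the only one.

```python
from typing import Optional

class LeaderFollowerStore:
    """Single leader, one asynchronous follower, lag modelled as unapplied log entries."""

    def __init__(self) -> None:
        self.leader: dict[str, str] = {}
        self.follower: dict[str, str] = {}
        self.log: list[tuple[str, str]] = []            # replication log, in leader order
        self.follower_position = 0                      # log entries applied by the follower
        self.last_write_position: dict[str, int] = {}   # user -> position of latest write

    def write(self, user: str, key: str, value: str) -> None:
        self.leader[key] = value
        self.log.append((key, value))
        self.last_write_position[user] = len(self.log)

    def replicate(self) -> None:
        """Apply every pending log entry to the follower, in order."""
        for key, value in self.log[self.follower_position:]:
            self.follower[key] = value
        self.follower_position = len(self.log)

    def read_from_follower(self, key: str) -> Optional[str]:
        return self.follower.get(key)

    def read_own_writes(self, user: str, key: str) -> Optional[str]:
        """Use the leader while this user's latest write has not reached the follower."""
        if self.last_write_position.get(user, 0) > self.follower_position:
            return self.leader.get(key)
        return self.follower.get(key)

store = LeaderFollowerStore()
store.write("alice", "comment:1", "Nice post!")

naive = store.read_from_follower("comment:1")        # follower is behind: comment "gone"
safe = store.read_own_writes("alice", "comment:1")   # routed to the leader
store.replicate()
after = store.read_from_follower("comment:1")        # the follower has caught up

print(naive, safe, after)
```

The guarantee only covers the author: another user reading from the follower during the lag may still see nothing, which is allowed.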
6. "ACID has unfortunately become mostly a marketing term"

The sentence is Kleppmann's (p. 223), and he documents it: the C, for Consistency, was "tossed in to make the acronym work" according to Joe Hellerstein, and it's a property of your application anyway, not of the database. Worse, isolation levels carry the same names everywhere while meaning different things: Oracle's "serializable" is actually snapshot isolation, and "nobody really knows what repeatable read means" (p. 242). Chapter 7 is worth the price of the book on its own for its gallery of named anomalies. The sneakiest: write skew. Alice and Bob, doctors on call, each check the schedule, both see "2 doctors on call", and both sign off at the same time. No conflict detected, two valid transactions, zero doctors on call (p. 246-248). Only serializable isolation prevents it automatically (short of that, you have to lock the rows yourself with SELECT FOR UPDATE), and almost no database enables it by default.
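The doctors scenario can be replayed in a few lines. This is a simulation in plain Python dictionaries, not real database code: snapshot isolation is modelled as each transaction deciding on its own copy of the data, and the function names are invented for the sketch.

```python
def request_leave(snapshot: dict[str, bool], doctor: str) -> dict[str, bool]:
    """Decide on a private snapshot; return the rows this transaction wants to write."""
    doctors_on_call = sum(snapshot.values())
    if doctors_on_call >= 2:
        return {doctor: False}      # "someone else is still on call, I can leave"
    return {}                       # not enough cover: write nothing

def run_under_snapshot_isolation(db: dict[str, bool]) -> dict[str, bool]:
    """Both transactions read the same starting state, then both commit."""
    writes_alice = request_leave(dict(db), "alice")
    writes_bob = request_leave(dict(db), "bob")
    # The write sets touch different rows, so no write-write conflict is detected.
    if writes_alice.keys() & writes_bob.keys():
        raise RuntimeError("write-write conflict")
    result = dict(db)
    result.update(writes_alice)
    result.update(writes_bob)
    return result

def run_serially(db: dict[str, bool]) -> dict[str, bool]:
    """One transaction after the other: the second one sees the first one's write."""
    result = dict(db)
    for doctor in ("alice", "bob"):
        result.update(request_leave(dict(result), doctor))
    return result

schedule = {"alice": True, "bob": True}
concurrent = run_under_snapshot_isolation(schedule)
serial = run_serially(schedule)

print("snapshot isolation:", concurrent, "on call:", sum(concurrent.values()))
print("serial execution:  ", serial, "on call:", sum(serial.values()))
```

Each transaction is correct on its own; the invariant only breaks because the premise each one read ("two doctors on call") was no longer true by the time it committed.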

7. The network lies, the clock lies, and so does your process

Chapter 8 is the book's most famous, and its most counter-intuitive. In a distributed system, an unanswered request doesn't tell you whether the request was lost, the server died, or the process is just sitting in a garbage collector pause (those "stop-the-world" pauses can last several minutes, p. 296). Clocks drift: Google assumes 200 ppm for its servers, which is 17 seconds of drift for a clock resynchronized once a day (p. 289). A thread can check that it holds a lock, freeze for 15 seconds, then write while believing it's still the leader; the cure, the fencing token, is a simple increasing counter that the storage service checks on every write. The litany of real-world failures runs from sharks biting undersea cables to a hypoglycemic driver crashing his pickup truck into a datacenter's HVAC system (p. 275, 279). Hence the chapter's motto: "In distributed systems, suspicion, pessimism, and paranoia pay off" (p. 277).
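Two of those claims are easy to check in code: the drift arithmetic, and the way a fencing token stops a paused client. A minimal sketch, with `LockService` and `FencedStorage` as hypothetical in-memory stand-ins for a real lock service and a real storage service.

```python
from typing import Optional

def drift_seconds(ppm: float, elapsed_seconds: float) -> float:
    """Worst-case clock drift accumulated over an interval at a given ppm rate."""
    return ppm * elapsed_seconds / 1_000_000

class LockService:
    """Hands out a strictly increasing token with every lock grant."""

    def __init__(self) -> None:
        self._last_token = 0

    def acquire(self) -> int:
        self._last_token += 1
        return self._last_token

class FencedStorage:
    """Rejects any write whose token is older than one it has already accepted."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value: Optional[str] = None

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False                 # stale lock holder: refuse the write
        self.highest_token = token
        self.value = value
        return True

print(f"{drift_seconds(200, 24 * 60 * 60):.2f} s of drift per day at 200 ppm")

locks = LockService()
storage = FencedStorage()

token_1 = locks.acquire()   # client 1 gets the lock, then freezes (long GC pause)
token_2 = locks.acquire()   # the lease expires and client 2 gets the lock
accepted_2 = storage.write(token_2, "written by client 2")
accepted_1 = storage.write(token_1, "written by client 1")   # client 1 wakes up late

print(accepted_2, accepted_1, storage.value)
```

The important design point is that the check lives in the storage, not in the client: the paused client cannot be trusted to notice that its lock has expired.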

8. Forget the CAP theorem, learn the word "linearizable"

CAP as "consistency, availability, partition tolerance: pick 2 out of 3" is misleading, says Kleppmann: a network partition isn't an architect's choice, it's a fault that will happen to you anyway. His verdict is dry: "CAP is best avoided" (p. 337). The useful concept instead: linearizability, the illusion that only one copy of the data exists. The book's example is crystal clear: Alice and Bob are watching the 2014 World Cup final, Alice refreshes and announces the final score; Bob refreshes after hearing her, and his phone, hitting a lagging replica, still shows the game as ongoing (p. 325). That freshness guarantee is expensive in latency; the book teaches you to know when you actually need it, and when causal consistency is enough.

9. Your application state is the integral of its event stream

The most beautiful idea of the book's final act fits in one sentence: state is what you get when you integrate an event stream over time; the change stream is what you get when you differentiate the state (p. 460). Concretely: instead of writing to the database AND the cache AND the index (three writes that can contradict each other), you write to an ordered log (Kafka) and everything else derives from it in the same order. That's the principle behind Change Data Capture and event sourcing, and it's the idea Kleppmann calls "turning the database inside-out". His unfinished dream: being able to write mysql | elasticsearch like a Unix pipe, the unbundled equivalent of CREATE INDEX (p. 503). Today, entire products (Debezium, Materialize) live off this idea.
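The integrate/differentiate image translates almost literally into a fold over a list. A minimal sketch with a hypothetical shopping-cart event format (`kind`, `item`, `quantity`); a plain Python list stands in for the ordered log, and `apply_event`, `integrate` and `differentiate` are names invented here.

```python
from functools import reduce

Event = tuple[str, str, int]   # (kind, item, quantity), a made-up event format

def apply_event(state: dict[str, int], event: Event) -> dict[str, int]:
    """Return the new state after one event, leaving the old state untouched."""
    kind, item, quantity = event
    new_state = dict(state)
    if kind == "added":
        new_state[item] = new_state.get(item, 0) + quantity
    elif kind == "removed":
        remaining = new_state.get(item, 0) - quantity
        if remaining > 0:
            new_state[item] = remaining
        else:
            new_state.pop(item, None)
    return new_state

def integrate(events: list[Event]) -> dict[str, int]:
    """State = fold of the event log, applied in log order."""
    return reduce(apply_event, events, {})

def differentiate(before: dict[str, int], after: dict[str, int]) -> list[Event]:
    """Change stream = the events that turn one state into the other."""
    changes: list[Event] = []
    for item in sorted(before.keys() | after.keys()):
        delta = after.get(item, 0) - before.get(item, 0)
        if delta > 0:
            changes.append(("added", item, delta))
        elif delta < 0:
            changes.append(("removed", item, -delta))
    return changes

log: list[Event] = [
    ("added", "book", 1),
    ("added", "pen", 3),
    ("removed", "pen", 1),
    ("removed", "book", 1),
]

cart = integrate(log)                  # the "database"
search_index = sorted(integrate(log))  # another view derived from the same log
changes = differentiate({}, cart)

print(cart, search_index, changes)
```

Because every view is computed from the same log in the same order, the cart and the index cannot disagree; they can only lag behind.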

My take, honestly

It's the best backend engineering book I've read, and that opinion is banal: everyone has been saying it since 2017. What makes it timeless isn't its catalog of technologies, it's its method: every concept arrives with the problem it solves, a named example you never forget (the doctors on call, the World Cup score, the pickup in the HVAC), and the trade-off stated plainly. Kleppmann is unusually honest: he writes "there is no quick and easy rule, test empirically" where others would sell a method, and he switches to the first person in chapter 12 to keep facts and opinions clearly separated.

The flaws are real. It's long, dense, and unevenly useful. A web developer will have a great time in chapters 1 through 8, but Hadoop-era batch processing (chapter 10) now reads like ancient history: MapReduce is dead and buried. The technologies have aged ten years: Riak is gone, Kafka no longer needs ZooKeeper. The principles, though, haven't moved a millimeter: LSM vs B-tree, replication lag, write skew and fencing tokens are exactly the same in 2026.

And then there's the final chapter, which almost nobody mentions. The book is dedicated "to everyone working toward the good", and its last section offers a chilling thought experiment: replace the word "data" with "surveillance" in your sentences ("our surveillance warehouse", "our surveillance scientists") and listen to how it sounds (p. 537). Written before the GDPR took effect and before LLMs, that chapter has aged better than everything else. A database book that ends on users' dignity: I don't know another one.

Odilon

Still relevant in 2026?

The principles, yes, entirely: they were already twenty years old when the book came out. The technology snapshot is 2017 vintage, and it shows in places (Riak, MapReduce, ZooKeeper). Note that this page covers the first edition: a second edition co-written with Chris Riccomini came out in late 2025. If you're buying it today, buy that one; the spine of the book is the same.

Who is it for?

Read it if

  • You've done backend for 2-3 years and want to understand what happens beneath the ORM, the replica and the message queue you already use
  • You pick infrastructure (SQL vs NoSQL, queue vs log) and want criteria, not fashions
  • You've already lost a day to a concurrency or stale-data bug you couldn't name
  • You're preparing system design interviews: this is THE reference book of the field

Skip it if

  • You're a beginner: without prior SQL practice and an app in production, the trade-offs will stay abstract
  • You only do frontend: half the book covers problems you'll never face directly
  • You want a tutorial: there isn't a single line of "how to install PostgreSQL" in the whole book, by design

For going further

The storage and transaction concepts connect with the SQL course on this site. Chapter 8's concurrency mindset is practiced hands-on in the Go course (goroutines, channels, race conditions). And the request path from client to server is mapped in the HTTP course.
