Understanding the Machine

Write Great Code, vol. 1, by Randall Hyde. What the machine really does with your code, and why it still matters.

Write Great Code, Volume 1: Understanding the Machine (2nd Edition)

7.2/10

"The architecture class self-taught devs never got: the numbers aged, the ratios didn't budge an inch."

  • Author: Randall Hyde · author of The Art of Assembly Language
  • Original: No Starch Press, 2020 · 472 pages
  • Edition: Notes based on the 1st ed. (2004) + verified 2nd ed. additions
  • This page: ~10 min read
Book rating across 5 dimensions:

  • Ideas: 9/10
  • Practical: 7/10
  • Readability: 6/10
  • Aged well: 6/10
  • Examples: 8/10

What the machine really does with your variables: floats, caches, pipelines, explained without assembly.

Why this book

Most web developers never took the computer architecture class. We learned PHP, JavaScript or Python on top of abstractions, and the machine underneath stayed a black box that "just works". This book is that missing class: how the machine actually represents your numbers, your strings, your objects, and what it really costs. Without asking you to learn assembly, which is its whole selling point.

Hyde's bar is set in the first pages: "While efficiency is not the only attribute great code possesses, inefficient code is never great" (p. 3). The book closes the loop at the end: "There is nothing preventing you from thinking in low-level terms while writing high-level code" (p. 405). Everything in between is the equipment for doing exactly that.

The ideas that stick

1. Think low-level, write high-level

The book refuses the false choice between elegant high-level code and fast low-level code. Great code, by Hyde's definition, is "software that is written using a consistent and prioritized set of good software characteristics" (p. 8): you pick your priorities, then hold them consistently. One attribute, though, is non-negotiable: not being wasteful for no reason.

The inefficiencies that hurt are rarely deliberate. They come from a mismatch between what the developer imagines and what the machine actually does:

// you IMAGINE: "dividing by 8 is one operation"
y = x / 8;     // an integer DIVISION: often 20 to 40 CPU cycles

// the machine, and so good code, prefers:
y = x >> 3;    // a 3-bit SHIFT: 1 cycle (for an unsigned integer)

Fix the mental model, and the code improves without a single line of assembly.
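The "(for an unsigned integer)" caveat in the snippet above is worth seeing in action. The following is a minimal Python sketch, not Hyde's code: `c_style_div8` is a hypothetical helper that mimics C's truncating division, and `shift_div8` is the shift. They agree on non-negative values and diverge on negative ones.

```python
def c_style_div8(x: int) -> int:
    """Divide by 8 the way C does: truncate toward zero."""
    quotient = abs(x) // 8
    return quotient if x >= 0 else -quotient


def shift_div8(x: int) -> int:
    """Arithmetic right shift by 3 bits: rounds toward minus infinity."""
    return x >> 3


# For non-negative values, the shift is a drop-in replacement.
for x in (0, 7, 8, 1000):
    assert c_style_div8(x) == shift_div8(x)

# For negative values, the two roundings disagree.
print(c_style_div8(-9), shift_div8(-9))  # -1 -2
```

So the shift is only a safe substitute when the value cannot be negative, which is exactly what "unsigned" guarantees.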

[Illustration: a developer typing at a tidy desk while, below the floor, a machine room of gears, conveyor belts and pipes does the real work.]
You write the top floor. The machine runs the basement.

2. A number is not its representation

"A number is an intangible, abstract, concept" (p. 10): what lives in memory is always a chosen representation. Hexadecimal, for instance, is not really a different number system; it is binary with glasses on, one digit for four bits.

And the representation has very concrete traps. Two examples with two's complement, the universal signed-integer format:

  • its asymmetry is built in: on 8 bits, the range runs from -128 to +127. So +128 does not exist, and negating -128 gives you... -128;
  • in this format, the leftmost bit carries the sign: -1 on 8 bits is written 11111111. To widen it to 16 bits, you must repeat that sign bit on the left (11111111 11111111, still -1). Pad with zeros instead, and 00000000 11111111 now reads as +255: the negative number has silently become a positive one.
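Both traps can be reproduced with a few lines of Python. This is a sketch under assumptions: Python integers have no fixed width, so the helpers `to_signed`, `negate` and `sign_extend` are hypothetical names that simulate a fixed-width register with masks.

```python
def to_signed(value: int, bits: int) -> int:
    """Read the low `bits` bits of value as a two's-complement integer."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


def negate(value: int, bits: int) -> int:
    """Two's-complement negation inside a fixed width: invert, add 1, wrap."""
    mask = (1 << bits) - 1
    return to_signed((~value + 1) & mask, bits)


def sign_extend(pattern: int, from_bits: int, to_bits: int) -> int:
    """Widen a bit pattern by copying its sign bit into the new high bits."""
    return to_signed(pattern, from_bits) & ((1 << to_bits) - 1)


minus_one_8 = 0b11111111
print(to_signed(minus_one_8, 8))                        # -1
print(negate(-128, 8))                                  # -128
print(format(sign_extend(minus_one_8, 8, 16), "016b"))  # 1111111111111111
print(to_signed(minus_one_8, 16))                       # 255 (zero-padded read)
```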

And these are not textbook curiosities. In 2014, YouTube moved to a 64-bit view counter because Gangnam Style was passing 2,147,483,647 views, the maximum of a signed 32-bit integer. The same ceiling stalks every auto-increment id declared INT in your database, and it is what causes the Year 2038 problem: Unix timestamps stored as signed 32-bit integers overflow on January 19, 2038. When you pick a column type, you pick a representation, and therefore a ceiling.
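The ceiling and the 2038 date can be checked directly. A minimal Python sketch; `wrap_int32` is a hypothetical helper that simulates what a signed 32-bit counter does when it goes one past its maximum.

```python
from datetime import datetime, timezone

INT32_MAX = 2**31 - 1
print(INT32_MAX)  # 2147483647

# The last second a signed 32-bit Unix timestamp can represent.
last_second = datetime.fromtimestamp(INT32_MAX, tz=timezone.utc)
print(last_second.isoformat())  # 2038-01-19T03:14:07+00:00


def wrap_int32(n: int) -> int:
    """Simulate signed 32-bit wraparound on a Python integer."""
    return (n + 2**31) % 2**32 - 2**31


# One more tick and the value wraps to the most negative number.
print(wrap_int32(INT32_MAX + 1))  # -2147483648
```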

3. Floats are approximations with survival rules

A float is stored like scientific notation: 1.23e3 means 1.23 × 10³. The digits part (1.23) is called the mantissa, the scale part (e3) the exponent. Crucial detail: the mantissa has a fixed number of digits. Everything that does not fit is dropped.

That is why 0.1 + 0.2 does not equal 0.3: in binary, 0.1 and 0.2 are repeating fractions, like 1/3 in decimal. The mantissa stores a truncated approximation, and the errors add up.

The mechanism is easiest to see in decimal, with a 3-digit mantissa. Adding 1.00e0 to 1.23e3 means aligning exponents first, and the small value slides right out of the 3-digit window. Result: 1.23e3, unchanged. The addition silently did nothing.
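The 3-digit thought experiment can be run as-is with Python's `decimal` module, which lets you choose the mantissa width. A small sketch; the variable names are illustrative only.

```python
from decimal import Context, Decimal

# A decimal "machine" whose mantissa holds exactly 3 significant digits.
three_digits = Context(prec=3)

big = Decimal("1.23e3")
small = Decimal("1.00e0")

total = three_digits.add(big, small)
print(total)         # 1.23E+3
print(total == big)  # True: the addition changed nothing
```

The exact sum is 1231, but only three digits fit, so the 1 in the last position is rounded away.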

Hyde turns this into survival rules (ch. 4). The absolute one: "you should never compare two floating-point values to see if they are equal" (p. 70); test abs(a - b) <= epsilon instead. The others:

  • add values of similar magnitude together first: the order of evaluation changes the result;
  • subtracting two nearly equal values destroys precision (cancellation);
  • a 32-bit float carries about 6.5 decimal digits; a 64-bit double about 14.5. That is little, across a chain of calculations;
  • for money: scaled integers (cents), never floats.
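Three of these rules fit in a short Python sketch (Python floats are 64-bit doubles). The tolerance `EPSILON` is an assumption chosen for the example; the right value depends on the magnitudes in your own calculation.

```python
import math

EPSILON = 1e-9  # assumed tolerance, pick one suited to your data

total = 0.1 + 0.2
print(total == 0.3)                  # False: never compare for equality
print(abs(total - 0.3) <= EPSILON)   # True: compare within a tolerance
print(math.isclose(total, 0.3))      # True: the standard-library version

# Order of evaluation changes the result: at 1e16, doubles are 2 apart,
# so adding 1.0 on its own is lost each time.
big = 1e16
left_to_right = (big + 1.0) + 1.0
small_first = big + (1.0 + 1.0)
print(left_to_right == small_first)  # False

# Money as scaled integers (cents): exact by construction.
price_cents = 10 + 20
print(price_cents == 30)             # True
```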

4. Bit operations are tools, not wizardry

Minimal scenery first. Inside the machine, every number is a row of bits, cells holding 0 or 1. Each cell weighs a power of 2 (from the right: 1, 2, 4, 8, 16…), and the number is the sum of the lit cells: 0010 1011 is 32 + 8 + 2 + 1 = 43. Bitwise operators work column by column, like long addition: AND (&) keeps a 1 only when both cells have one, OR (|) lights the cell as soon as either is 1, XOR (^) lights it when the two differ. And it is native: the CPU does this in a single instruction, the cheapest operation there is.
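To make the column-by-column picture concrete, here is a short Python sketch. The operands `a` and `b` are arbitrary values picked so that every combination of bits appears once.

```python
n = 0b00101011

# The number is the sum of the weights of its lit cells.
lit_weights = [1 << i for i in range(8) if (n >> i) & 1]
print(lit_weights, sum(lit_weights))  # [1, 2, 8, 32] 43

a, b = 0b1100, 0b1010
print(format(a & b, "04b"))  # 1000  AND: 1 only where both have a 1
print(format(a | b, "04b"))  # 1110  OR: 1 where either has a 1
print(format(a ^ b, "04b"))  # 0110  XOR: 1 where the two differ
```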

Chapter 3 then reads like a toolbox. AND masks bits, OR forces them, XOR flips them, and these patterns work identically in JS, PHP and Python:

// even/odd without modulo
if (n & 1) { /* odd */ }

// circular buffer index, when size is a power of 2
i = (i + 1) & (size - 1);   // replaces (i + 1) % size

// ASCII: upper and lower case differ by a single bit (bit 5);
// the operators apply to character codes, not to strings
0x41 | 0x20  // 'A' (0x41) → 'a' (0x61)
0x61 & 0x5F  // 'a' (0x61) → 'A' (0x41)

The deeper lesson is about designing packed formats: order fields from most to least significant (year, month, day) and two dates compare with a plain integer comparison. The representation does the work the code would otherwise do.
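A possible packed-date layout, sketched in Python. The field widths (5 bits for the day, 4 for the month, the rest for the year) and the names `pack_date` and `unpack_date` are assumptions for illustration, not necessarily the layout Hyde uses.

```python
def pack_date(year: int, month: int, day: int) -> int:
    """Pack a date with the most significant field in the highest bits."""
    return (year << 9) | (month << 5) | day


def unpack_date(packed: int) -> tuple[int, int, int]:
    """Recover (year, month, day) with shifts and masks."""
    return packed >> 9, (packed >> 5) & 0xF, packed & 0x1F


leap_day = pack_date(2024, 2, 29)
next_day = pack_date(2024, 3, 1)

# Chronological order is now plain integer order.
print(leap_day < next_day)    # True
print(unpack_date(leap_day))  # (2024, 2, 29)
```

Swap the field order (day in the high bits, say) and the comparison breaks, which is the point: the layout is doing the sorting logic.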

5. Memory is a hierarchy, not a flat array

The mental model that pays for the whole book: memory is not one uniform speed. Reading data already next to the CPU takes about a nanosecond. The same read takes several milliseconds if the data has to climb up from disk. That is what the book calls "over six orders of magnitude" (p. 300) of gap: a factor of a million. At human scale, that is the difference between one second and twelve days.

Two consequences worth memorizing:

  • misalignment: data straddling bus boundaries doubles the memory accesses, invisibly;
  • wait states: "Running with a wait state on every memory access is almost like cutting the processor clock frequency in half" (p. 152).
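The misalignment cost is simple arithmetic. A simplified Python model, assuming an 8-byte-wide bus that can only fetch naturally aligned 8-byte chunks; `bus_accesses` is a hypothetical helper, and real hardware adds caches and other details on top.

```python
def bus_accesses(address: int, size: int, bus_width: int = 8) -> int:
    """Count the aligned bus-width chunks a value of `size` bytes touches."""
    first_chunk = address // bus_width
    last_chunk = (address + size - 1) // bus_width
    return last_chunk - first_chunk + 1


print(bus_accesses(4, 4))  # 1: bytes 4..7 sit inside one chunk
print(bus_accesses(6, 4))  # 2: bytes 6..9 straddle the boundary at 8
```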

The whole hierarchy exists to exploit one statistical fact: programs spend their time on a small fraction of their data. That fact has a name: locality.

And if this picture feels familiar, it should: the whole web replays the same hierarchy one floor up. Redis in front of MySQL, OPcache in front of the disk, a CDN in front of your server. Each time, the same idea: put a small fast memory in front of a big slow one, and bet that the requested data is already in the small one. Understand the CPU version, and you understand in one shot why all those layers exist, and why they work.
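The "small fast memory in front of a big slow one" pattern fits in a few lines of Python with `functools.lru_cache`. This is only a sketch: `read_from_slow_store` is a hypothetical stand-in for a database or disk read, and the counter exists purely to show how often the slow path is taken.

```python
from functools import lru_cache

slow_reads = 0


def read_from_slow_store(key: str) -> str:
    """Hypothetical slow backend (database, disk, remote server)."""
    global slow_reads
    slow_reads += 1
    return key.upper()


@lru_cache(maxsize=128)  # the small, fast memory in front
def cached_read(key: str) -> str:
    return read_from_slow_store(key)


# Locality: the same few keys are requested again and again.
for key in ["home", "home", "about", "home"]:
    cached_read(key)

print(slow_reads)                     # 2: only the first read of each key
print(cached_read.cache_info().hits)  # 2: the repeats never left the cache
```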

6. The cache line decides your loop's speed

Below the L1 cache, memory never moves one byte at a time: it moves in cache lines of 16 to 64 consecutive bytes. Touch one element, and its neighbors arrive for free. Which is why the order of two nested loops, with identical logic, can change everything:

// slow: jumps 1,024 bytes at every iteration (column-wise)
for (i = 0; i < 256; ++i)
  for (j = 0; j < 256; ++j)
    array[j][i] = i * j;

// fast: walks memory sequentially (row-wise)
for (i = 0; i < 256; ++i)
  for (j = 0; j < 256; ++j)
    array[i][j] = i * j;

Hyde's measurement: "This small modification can be responsible for an order of magnitude (or two) difference in the run time of these two code sequences!" (p. 315). Same algorithm, ten to a hundred times faster, just by walking memory in the direction it is laid out.
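The "1,024 bytes" in the comment comes from the memory layout, and can be recomputed. A Python sketch of the address arithmetic, assuming a row-major 256 × 256 array of 4-byte elements and 64-byte cache lines (the element and line sizes are assumptions; `byte_offset` is an illustrative helper).

```python
ROWS = COLS = 256
ELEMENT_SIZE = 4  # assumed: 4-byte integers
CACHE_LINE = 64   # assumed: 64-byte cache lines


def byte_offset(row: int, col: int) -> int:
    """Offset of array[row][col] in a row-major layout."""
    return (row * COLS + col) * ELEMENT_SIZE


# Row-wise: the inner loop moves to the next column.
row_wise_step = byte_offset(0, 1) - byte_offset(0, 0)
# Column-wise: the inner loop moves to the next row.
column_wise_step = byte_offset(1, 0) - byte_offset(0, 0)

print(row_wise_step)               # 4
print(column_wise_step)            # 1024
print(CACHE_LINE // ELEMENT_SIZE)  # 16 elements arrive per cache line
```

Walking row-wise, one fetched line serves 16 consecutive iterations; walking column-wise, every iteration lands on a different line.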

The habits that go with it:

  • declare together the variables you use together;
  • prefer local variables: the stack is almost always in cache;
  • allocate in large blocks: tiny allocations can burn most of their bytes in allocator overhead.

7. The pipeline hates your jumps

A CPU does not execute one instruction at a time: it runs an assembly line (fetch, decode, address, load, compute, store), with one instruction at each station. At cruising speed, one instruction completes per cycle. The price: a taken branch throws away everything already on the line, and the pipeline refills from zero. Hyde's advice is blunt: "If you want to write fast code, avoid jumping around in your program as much as possible" (p. 242).

Same logic for data hazards: an instruction that reads a value the previous instruction is still writing stalls the line.

Modern CPUs fight back with branch prediction, out-of-order execution and register renaming, all invisible to your code. But they only soften the rule, they do not repeal it: predictable branches and independent computations are still what the machine likes to eat.

The most telling example of the phenomenon: walking a big array testing if (value > threshold) can be several times faster when the array is sorted. Same data, same loop. But sorted, the predictor guesses right almost every turn; shuffled, it is wrong half the time, and every miss empties the assembly line. This is the subject of one of the most upvoted questions in Stack Overflow history ("Why is processing a sorted array faster than processing an unsorted array?").
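The effect can be imitated with a toy model. The Python sketch below simulates a 2-bit saturating counter, a classic textbook predictor; real CPUs use far more elaborate schemes, so this only illustrates the principle. The function name, the seed, and the data sizes are arbitrary choices.

```python
import random


def mispredictions(values: list[int], threshold: int) -> int:
    """Count wrong guesses of a 2-bit saturating counter on `v > threshold`."""
    counter = 0  # 0-1: predict "not taken", 2-3: predict "taken"
    misses = 0
    for v in values:
        taken = v > threshold
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Nudge the counter toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses


rng = random.Random(42)
data = [rng.randrange(256) for _ in range(10_000)]

sorted_misses = mispredictions(sorted(data), 128)
shuffled_misses = mispredictions(data, 128)
print(sorted_misses, shuffled_misses)
```

On sorted data the branch outcome flips exactly once, so the toy predictor is wrong only while it adapts at that boundary; on shuffled data it has nothing to learn and misses about half the time.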

8. What the 2nd edition (2020) fixes

I read the 1st edition (2004), and its age shows in two places. The character chapter describes Unicode as fixed 16-bit, and never mentions UTF-8, the web standard that encodes each character at a variable width:

// what the 1st ed. teaches: "one character = 16 bits, period"
// the UTF-8 reality: width depends on the character
'A'   → 1 byte     (41)
'é'   → 2 bytes    (C3 A9)        // hence strlen("é") = 2 in PHP
'😀'  → 4 bytes    (F0 9F 98 80)
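The byte counts above are easy to verify in Python, where `str.encode` exposes the UTF-8 bytes. A small sketch; the characters are written as escapes so the example does not depend on how the file itself is saved.

```python
samples = {
    "A": "A",
    "e-acute": "\u00e9",        # é
    "grinning": "\U0001F600",   # 😀
}

widths = {}
for name, ch in samples.items():
    encoded = ch.encode("utf-8")
    widths[name] = len(encoded)
    print(name, len(encoded), encoded.hex(" ").upper())

# One character on screen, but a byte-counting length sees more.
print(len("\u00e9"), len("\u00e9".encode("utf-8")))  # 1 2
```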

The I/O chapter, meanwhile, is a museum of SCSI, IDE and parallel ports. The fundamental mechanisms it teaches through them (interrupts versus polling, DMA, sequential beats random) are intact, but the hardware is gone.

The 2nd edition (No Starch, July 2020, 472 p.) is the one to buy. Per the publisher, it adds what two decades changed:

  • code generation on modern 64-bit CPUs, and ARM processors;
  • examples drawn from Swift and Java alongside C;
  • newer peripherals, and large memory systems with SSDs.

The skeleton of the book, the part that made it last, is the same.

My take, honestly

Let me say it upfront: I am a self-taught web developer, nobody had ever explained to me what a cache line is, and I do not have the level to judge this book as a low-level expert. I can only say what it did to me. Before, 0.1 + 0.2 was a JavaScript joke I repeated without understanding it; after the floating-point chapter, it is a mechanism I can explain. Before, a slow loop was just bad luck; after the memory chapters, I at least have a diagnostic lead.

Two honest limits. It is a textbook: dense, methodical, sometimes dry, and I skimmed whole passages (the Y86 instruction encodings, the digital-design details). And I read the 1st edition, where the dated chapters are genuinely dated; buy the 2nd edition (2020), which fixes precisely those.

What earns the 7.2, seen from my desk: the ratios. The absolute numbers from 2004 are museum pieces, but the relationships between levels (cache to RAM, RAM to disk, sequential to random) have barely moved in twenty years. And they are what you reason with, even in PHP: a file re-read a hundred times, a loop jumping all over a big array, I now know why that is slow.

Odilon

Still relevant in 2026?

Yes, and the AI era sharpens it. When an assistant writes your code, the remaining human job is judging it, and this book provides the cost model: which loop order, which data layout, why this float comparison is a bug. The mechanisms it teaches still run everything: the event loop in Node.js is the interrupt model, io_uring is the war on syscall overhead, SSDs still reward sequential access. What to skip in the 1st edition: the character chapter (read about UTF-8 anywhere else) and the 2004 hardware catalog.

Who is it for?

Read it if

  • You are self-taught and the machine under your language is a black box
  • 0.1 + 0.2 ≠ 0.3 annoys you but you could not explain it to a junior
  • You optimize by guessing; this gives you the cost model to optimize by reasoning
  • You want the architecture class without learning assembly

Skip it if

  • You have a CS degree with a real architecture course: you know most of this
  • You want immediately applicable patterns for daily web work: this is foundations, not recipes
  • You only find the 1st edition: a third of it describes 2004 hardware

Going further

The mechanism-first mindset is the same one behind my courses, where every lesson explains what actually happens under the abstraction. The direct sequel is volume 2, which prices your statements once compiled. In the library, Designing Data-Intensive Applications applies the same memory-hierarchy reasoning at datacenter scale, and Fluent Python does for one language what Hyde does for the machine.
