Article: The Network is Reliable

One of my favorite excerpts from the article:

DRBD Split-brain

When a two-node cluster partitions, there are no cases in which a node can reliably declare itself to be the primary. When this happens to a DRBD file system, as one user reported (http://serverfault.com/questions/485545/dual-primary-ocfs2-drbd-encountered-split-brain-is-recovery-always-going-to-be), both nodes can remain online and accept writes, leading to divergent file system-level changes.
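To see why two nodes are the awkward case, here is a minimal sketch of a generic strict-majority quorum rule. This is not DRBD's actual logic, and the `has_quorum` helper is a hypothetical name introduced only for illustration. It assumes each node counts itself among the nodes it can reach.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True if a node can reach a strict majority of the cluster.

    Hypothetical helper: reachable_nodes includes the node itself.
    """
    return reachable_nodes > cluster_size // 2


# Two-node cluster, partitioned: each side can only see itself.
two_node = [has_quorum(1, 2), has_quorum(1, 2)]

# Three-node cluster split 2/1: only the larger side holds a majority.
three_node = [has_quorum(2, 3), has_quorum(1, 3)]

print(two_node)
print(three_node)
```

Under this rule, neither half of a partitioned two-node cluster holds a majority, so the choices are to stop accepting writes on both sides or to let both sides continue and risk the divergence described in the excerpt. With three nodes, at most one side of a partition can hold a majority.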

In this article, Peter Bailis and Kyle Kingsbury provide an awesome sampling of real-life network failures across a variety of actively deployed distributed systems.

It’s definitely worth the read, and the data from Google, Amazon, and Microsoft makes it especially interesting. You can view the living version of the document on GitHub: https://github.com/aphyr/partitions-post.