This series is a terse summary of the book Building Secure and Reliable Systems by Heather Adkins, Betsy Beyer, Paul Blankinship, Piotr Lewandowski, Ana Oprea, and Adam Stubblefield. In it, experts from Google share best practices for designing scalable and reliable systems that are fundamentally secure.

Chapter 1: The Intersection of Security and Reliability

Reliability issues usually have nonmalicious origins, while security issues are typically caused by adversaries. That is, reliability incidents tend to be unintentional, indirect failures, whereas in a security incident an adversary may actively drive the system to fail. The fail safe (open) vs. fail secure (closed) distinction is a handy way to prioritize which failures you defend against: reliability generally favors failing open to preserve availability, while security favors failing closed to deny access when something breaks (a sketch follows the bullet below).

  • System failure mode: fail safe or fail secure?
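For illustration, here is a minimal Go sketch of the two failure modes; `lookupPolicy` and `checkAccess` are hypothetical names invented for this example, not anything from the book:

```go
package main

import (
	"errors"
	"fmt"
)

// errBackendDown simulates the policy backend being unreachable.
var errBackendDown = errors.New("policy backend unreachable")

// lookupPolicy is a hypothetical stand-in for a real policy lookup
// that has just failed.
func lookupPolicy(user string) (bool, error) {
	return false, errBackendDown
}

// checkAccess shows the two failure modes side by side: when the
// policy lookup fails, a fail-safe (open) system keeps serving,
// while a fail-secure (closed) system denies by default.
func checkAccess(user string, failOpen bool) bool {
	allowed, err := lookupPolicy(user)
	if err != nil {
		return failOpen // the whole decision reduces to the chosen failure mode
	}
	return allowed
}

func main() {
	fmt.Println("fail safe (open):    ", checkAccess("alice", true))  // true: availability wins
	fmt.Println("fail secure (closed):", checkAccess("alice", false)) // false: protection wins
}
```

Which mode is right depends on what the component protects: a fail-open door lock keeps people from being trapped in a building, while a fail-closed ACL keeps data from leaking when the policy service is down.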

Reliability incidents are better managed by a team of many responders whose diverse perspectives can help. Security incidents are better handled by the smallest possible number of competent people, in order to avoid alerting the adversary.

  • Who is your response team and why?

The Confidentiality, Integrity, Availability trio (aka “CIA”). Both secure and reliable systems want these properties, but for different reasons and with different intentions, so they can even come into conflict: for example, replicating data across many machines improves availability but widens the surface over which confidentiality can be breached.

Both reliability and security are hard to bolt onto an existing system, so it’s better to take them into account from the early design stages and to maintain them throughout the software lifecycle. These considerations are typically invisible until a (catastrophic) failure occurs, so reliability and security are often dismissed, deprioritized, or debudgeted without proper risk assessment.

Designing simple systems is a decent way to provide some degree of reliability and security in an “accidental” manner, without dedicated design effort. However, evolution can turn an initially simple system into a mess: minor and seemingly innocuous changes accumulate into death by a thousand cuts, or trigger a butterfly effect.

Defence in depth provides independent interlocks, each of which can be kept individually simple and has a limited “blast radius” in case of failure. The principle of least privilege has a similar intent, as do multiparty authorization approaches (a sketch follows the list below).

  • How much security and reliability do you need?
  • Answer this question during the early design stage
  • Perform risk assessment for reliability and security
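As an illustration of multiparty authorization, here is a minimal Go sketch of a two-person rule; the `ApprovalRequest` type and its quorum logic are assumptions made up for this example, not an API from the book:

```go
package main

import "fmt"

// ApprovalRequest models a sensitive action gated by multiparty
// authorization (hypothetical names for this illustration).
type ApprovalRequest struct {
	Requester string
	Approvers map[string]bool // set of distinct approver identities
}

// Authorize enforces a two-person-style rule: the action proceeds only
// if at least quorum people other than the requester have approved,
// so no single (possibly compromised) identity can act alone.
func (r *ApprovalRequest) Authorize(quorum int) error {
	independent := 0
	for approver := range r.Approvers {
		if approver != r.Requester { // self-approval doesn't count
			independent++
		}
	}
	if independent < quorum {
		return fmt.Errorf("need %d independent approvals, have %d", quorum, independent)
	}
	return nil
}

func main() {
	req := &ApprovalRequest{
		Requester: "alice",
		Approvers: map[string]bool{"alice": true}, // only a self-approval so far
	}
	if err := req.Authorize(1); err != nil {
		fmt.Println("denied:", err) // alice cannot approve her own request
	}
	req.Approvers["bob"] = true
	fmt.Println("with bob's approval:", req.Authorize(1)) // <nil>: action may proceed
}
```

The interlock itself stays simple, which is exactly the point: each layer of defence in depth should be easy to reason about on its own.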

While the design stage is definitely important, you should not limit your efforts to it alone. Code review and code reuse can prevent issues before the QA stage. Testing and fuzzing can catch issues before deployment. Slow rollouts and canaries can catch issues before they hit users. That is, “defence in depth” applies to the software lifecycle too (a minimal fuzz target follows the bullet below).

  • Be aware of the Secure Software Development Lifecycle (SSDLC)
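For the fuzzing step specifically, here is a minimal native Go fuzz target (supported since Go 1.18; run with `go test -fuzz=FuzzUnquote`). The round-trip property it checks is just an illustrative stand-in for whatever code in your system parses untrusted input:

```go
// fuzz_test.go — the file must end in _test.go for `go test` to pick it up.
package fuzzdemo

import (
	"strconv"
	"testing"
)

// FuzzUnquote checks that quoting and then unquoting any string
// round-trips losslessly; the fuzzer mutates s to hunt for inputs
// that violate the property or crash the code under test.
func FuzzUnquote(f *testing.F) {
	f.Add("hello") // seed corpus entry; the fuzzer mutates from here
	f.Fuzz(func(t *testing.T, s string) {
		quoted := strconv.Quote(s)
		got, err := strconv.Unquote(quoted)
		if err != nil {
			t.Fatalf("Unquote(%q) failed: %v", quoted, err)
		}
		if got != s {
			t.Fatalf("round trip mismatch: %q != %q", got, s)
		}
	})
}
```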

Perfect security and reliability are impossible in practice. Preventive measures will fail; design with this in mind and have a recovery plan. Crisis response is usually hectic and tense; don’t let it turn into clueless panic. During a disaster it is essential to keep a cool head, maintain a clear chain of command, and have a solid set of checklists, playbooks, and protocols at hand. These need to be prepared beforehand, not during the incident, and they tend to rot when stored unused, so drills and exercises should be used to keep them sharp. Patching a system during recovery can become a compromise between the immediate risks posed by the failing system and the future risks introduced by quickly applied changes. Have a plan for after the incident too.

  • Have a disaster recovery plan
  • Perform drills and exercises