Chapter 6: Design for Understandability

Designing an understandable system requires effort, as well as maintaining it that way. This effort is usually repaid with sustained velocity, but there are other common specific benefits. The likelihood of security vulnerabilities and resilience failures is decreased, since you are less likely to introduce a change that has not-so-clear effect. Incident responses are more effective and accurate as you can pin down root causes quicker. You are also more confident in assertions about system’s invariants, especially the one related to security. Testing is a good way to show presence of bugs but hopelessly inadequate for proving the absence, while invariants can guarantee that, that is if they hold up.

An invariant is a property of the system that is always true, no matter the state and behavior. You are usually interested in valuable invariants, like “only authenticated users are able to access persistent storage”, or “if the load on the system is greater that it is able to handle, the overloaded component will return overload errors rather than crashing”.

Not all such properties are worth turning into invariants. Invariants don’t grow on trees. If the potential harm of not meeting the invariant is greater than the effort you have to spend to turn it into an invariant for your system, then it might not be worth doing it. On one end of the spectrum is runninng a couple of tests once and asserting that you’re safe, on the other end is going full out with formal verification of the system’s behavior. You’d like to be in the sweet spot somewhere in the middle. Especially if you target common vulnerabilities.

Highly complex systems are hard to reason about so people usually resort to mental models that present a simplified, abstracted view of the system. It makes the reasoning simpler but keep in mind that the model is not the system. Unusual situations might not be modelled well, but reliability and security issues are all about operating in unusual environment. You can be misled if you keep trusting the model where it no longer applies (e.g., memory thrashing when under pressure may be a surprise to throughput). However, it’s still worth to consider mental models, and ideally design systems so that their typical model made by a typical engineer is more or less accurate and useful, ideally even under unusual conditions (e.g., avoid thrashinng by disabling swap and failing allocations).

The primary enemy of understandability is unmanaged complexity. Note the different between inherent and accidental complexity. If you deal with hard problems or scaled problems, it may be easy to introduce more and more complexity by adding more features, more instances, more moving parts, etc., even while they all do add genuine value to the users. This might become a problem – but only if you do not manage this issue. The inherent complexity is not going away but you can compartmentalize the system so that it is possible to reason about specific properties and behaviors with sufficient fidelity.

Humans are not good at keeping up with large mental models. By composing systems of smaller components you can still reason about each of them individually, and if they can be combined in a predictable way then you can reason about combined effects too. It is not that straighforward, but it is a way to derive whole-system properties without loading the entire system into your head.

Take security and reliability requirements for example, they are typically horizonally applied thoughout the entire system. If each individual component is responsible for its audit logging it is hard to reason whether your system as a whole does audit logging properly. However, if there is a centralized component – like a library or a framework that you use everywhere – then you may be able to guarantee and easily review that requirement. Reliability often has similar opportunities, such as with using retries and backoffs to prevent cascading failures under load: having a common library for that makes such situations easier to handle. Centralized responsibility improves understandability of the system and increases the likelihood that it actually behaves correctly. The development efforts can typically be amortized across multiple projects.

Another useful approach is breaking up the system into layers. However, be aware of coupling: tightly coupled layers hardly make things easier. Furthermore, layers are natural trust boundaries and ideally a layer should not have any assumptions about its users: that is, treat them as potentially malicious. This simplifies reasoning as you can be sure that the layer treats every caller the same way.

Structured interfaces also improve understandability. Consider frameworks for exposing API: RESTful JSON requests, gRPC, DCOM, etc. Free-form requests as JSON objects are more flexible but approaches like gRPC provide more structure. They have built-in confornmance checks, validation, enhance discoverability, often have other features for versioning and backwards compatbility. They also provide explicit API specifications.

Common object model simplifies handling of multiple resource types. Each object in the system can be expected to satisfy certain invariants. There may be standard ways to scope, annotate, and reference any objects. Operations are consistent across all object types. Custom object types may be created where necessary while keeping the same overall mental model. Google provides guidelines for designing resource-oriented APIs.

Idempotent operations yield the same result when applied multiple times. This is powerful design pattern for distributed systems where requests may be duplicated or retried for reliability reasons, come out-of-order, or just be dropped due to network connectivity outage. Idempotant operations form a simpler mental model where you can keep retrying the operation until you get a confirmation that it has succeeded, without risking to apply a change twice because you thought that operation has not completed when it has actually completed. Some operations are naturally idempotent while other may be made idempotent, say, by associating a UUID with each request which will help the server identify duplicates and react accordingly.

Authentication and access control impact understandability as well. Obviously, you need to clearly understand who has access to sensitive data. This starts with identities and credentials of accessing entities, which present their credentials to prove their identity in an authentication protocol.

Note that modern systems have both human and machine users which have their own identities. You may like to identify humans, software components, hardware components. Back in the days it was common to use IP addresses and ports for that but The Cloud™ makes it mostly obsolete since the services and users might migrate and the same host might serve mutiple users at once.

The identifiers should be meaningful, understandable. store-frontent-prod is better than {98a33ffc-10e5-4a9c-92f1-31bf69faab4d} because humans are much more likely to spot mistakes that way.

Identifiers should also be robust to spoofing, in a way specific to the identifier. Passwords sent over cleartext channel are easy to spoof. X.509 certificates with private keys on hardware tokens are harder to spoof.

Finally, identifiers must be nonreusable. This way it is less likely for any previous configuration to be mistakingly applied to new entities. For example, suppose you reuse administrator’s email bob@acme.corp as an identifier. Your previous Bob quits, you hire Carol to take over Bob’s duties. Then a year later you hire another Bob for a data entry job — which now comes with administrative access to your system that you did not clear out.

For example, Google has the following identity model:

  • Administrators are humans that can change the system configuration.
  • Machines are physical machines residing in datacenters. They run various services, both productive and for internal bookkeeping.
  • Workloads can be scheduled by Borg to run on machines. They typically have identities distinct from the machine they are going to be run on.
  • Customers are users that access services.

Administrators cannot directly modify the state of the system (Zero Touch Prod) but they can start workloads that act on machines. All these actions are audited with identities so it is clear who started what and why this workload was authorized to run on this machine, who can manage a machine. Since machines are identified, it is possible to ensure that production machines never run testing workflows which are more risky. Customers are identified too so can be sure that actions performed on the user’s behalf can be properly authenticated and authorized throughout the whole system.

Authentication and transport security are hard. You should have specialized knowledge on topics like cryptography, protocol design, operating systems, in order to do that properly. Google has an Application Layer Transport Security (ALTS) which provides automagical, zero-configuration, service-to-service authentication and transport security so that application developers do not have to deal with that at all.

Using frameworks for access control also makes systems easier to understand. They reinforce common knowledge and provide a unified way to describe policies. Interactions with networked systems are inherently complex, there are multiple machines and layers between the user accessing the application from their browser and the machine doing database access. All of these chain links must be able to determine whether the request is authorized and on whose behalf they are acting. Unified framework applied throughout the entire system makes it much easier to reason about.

Deep dive starts here, you might want to reread the book for more details. A trusted computing base (TCB) of the system is the set of components whose correct functioning is sufficient for security of the system — or rather, whose failure to function is a breach of the security. TCB must uphold to this promise even if arbitrary actors outside of it misbehave, possibly maliciously.

The interface between TCB and the rest of the world is called security boundary. Anything crossing it must be treated with suspicious (that does not mean that you can trust anything that has crossed it). It is useful to think about it in layers: operating system does not allow processed to read memory or other processes or access files owned by other users, your containerization framework might add another layer of restrictions on top, and the application might impose its own checks for user identity as well.

Sometimes it may be hard to define where the security (or trust boundary) lies. For example, consider that one company selling widgets. With monolithic architecture effective all your of system probably has to be within the boundary as things like SQL injections and RCE vulnerabilities can lead to sensitive data exposure. Splitting the applications into microservices and using several database schemas allows to shrink the security boundary, making it easier to defend. Some of the data which is not sensitive can be made accessible through a different path that has no way to compromise the sensitive data.

The shape of TCB and the boundary depends on the security properties you want go guarantee and the threat model you use. Suppose your backend accepts any request from authenticated frontend. While it may be deemed impossible for attacker to access backend directly, it might be possible to exploit a vulnerability in the frontend and use it a proxy. In that case your frontend should also be withinth the trust boundary with adequate protection. This might be mitigated by different architectural decisions, such as using distinct web origins for payment and browsing frontends. [Read the book if you don’t remember what this is talking about.]

Now, back to design. Once you have defined a suitable boundary, you still need to reason about the software within it to make sure that it actually uphols all the security invariants you need. Google has found application frameworks to be a good fit for this task. As said before, frameworks make life a bit easier by providing opinionated solutions. However, you still have to choose authentication framework, autorization framework, RPC framework, orchestration, monitoring, release, and multiple other frameworks. Instead, you could use a single framework to rule them all, batteries included. That way developers will not forget about some minor detail that their should have handled:

  • Request dispatching, forwarding, and deadline tracking
  • User input sanitization, locale detection
  • Authentication, authorization, data access auditing
  • Logging, error reporting
  • Health management, monitoring, diagnostics
  • Quota enforcement
  • Load balancing, traffic management
  • Binary and configuration deployment
  • Integration, prerelease, load testing
  • Dashboards, alerting
  • Capacity planning and provisioning
  • Handling of planned infrastructure outages

[Yup, Google-sized planning.]

Application-level framework allows engineers to speak that same language which increases understandability of the system.

Another thing is data flow. Many security concerns revolve around pieces of data, such as URLs which are commonly expected to be well-formed and trusted when in fact they may not be. It may be hard to track values like that if you use a plain string data type throughout your system. By introducing a purposeful type like WellFormedURL you make that task easier. Now there is one place that does validation and after that you can easily inspect where you deal with well-formed data and where you have to validate it beforehand. This isolated type is more understandable and makes the system more understandable too. (Though, this does assume that the data handling code is correct and nonmalicious. That’s a whole different can of worms, but you should still keep that in mind.)

Furthermore, you should consider usability as well. For example, an autoescaping HTML templating system can make it impossible to have an XSS vulnerability. Cryptography is also on this side, being a thing that’s very hard to non-experts to do correctly. Google has developed Tink framework providing a secure-by-default basis for encryption, avoiding common subtle mistakes, providing highly usable API, readable and auditable code, which is nevertheless extensible as necessary and interoperable on many platforms. API design decisions can also lead to more secure applications. For example, Tink does not accept raw key material, encouraging usage of key management services. Though, it does not prevent high-level mistakes, such as using an unsalted hash to ‘protect’ credit card numbers.