Chapter 4: Design Tradeoffs

There are functional requirements which define how exactly the system is useful to its users. Typically they take a form of use cases or user stories. Among those it is convenient to distinguish critical requirements, without which the product as a whole is not of much use. It is hard to compromise of critical requirements but other functional requirements may allow tradeoffs. Some functional requirements are implicit and pervasive, such accessibility.

The other type of requirements focuses on attributes and behaviors of the system. Security and reliability are examples of nonfunctional requirements. They usually have wide-ranging impact of the system, not only on the interaction of the user with the system but on things like developer efficiency and deployment velocity.

Feature requirements are usually straightforwardly connected to the code which implements them and tests that verify the implementation. There are specifications define exactly what is needed of the system. Implementation consists of concrete, tangible programming entities responsible for particular requirements. Validation can be performed with reasonably detailed integration tests for the feature as a whole, as well as with a host of unit tests for individual steps and components. In contrast, nonfunctional requirements are rarely clear-cut like that, not easily pinned down. There is no single class or module responsible for reliability, there is no unit test for security.

Reliability is an emergent property of the system design as well as development, deployment, and operation processes. Many factors play their role here: how the responsibility is split between the components; what are dependencies and their availability; what communication mechanisms are used; how testing is performed, which approaches are used and not used; monitoring, logging, alerting are their own can of worms.

Similarly, security is an emergent property too, defined by the trust boundaries between systems and components; implemetation languages and platforms used; how security design, testing, and validation are intergrated into the workflow; coverage of security monitoring and auditing; etc.

Google’s Software Design Document template can be used as a check list for new features. [I’m giving an excerpt from the book because a link will inevitably rot.]

Here are the reliability- and security-related sections of the Google design document template:


How does your system scale? Consider both data size increase (if applicable) and traffic increase (if applicable).

Consider the current hardware situation: adding more resources might take much longer than you think, or might be too expensive for your project. What initial resources will you need? You should plan for high utilization, but be aware that using more resources than you need will block expansion of your service.

Redundancy and reliability

Discuss how the system will handle local data loss and transient errors (e.g., temporary outages), and how each affects your system.

Which systems or components require data backup? How is the data backed up? How is it restored? What happens between the time data is lost and the time it’s restored?

In the case of a partial loss, can you keep serving? Can you restore only missing portions of your backups to your serving data store?

Dependency considerations

What happens if your dependencies on other services are unavailable for a period of time?

Which services must be running in order for your application to start? Don’t forget subtle dependencies like resolving names using DNS or checking the local time.

Are you introducing any dependency cycles, such as blocking on a system that can’t run if your application isn’t already up? If you have doubts, discuss your use case with the team that owns the system you depend on.

Data integrity

How will you find out about data corruption or loss in your data stores?

What sources of data loss are detected (user error, application bugs, storage platform bugs, site/replica disasters)?

How long will it take to notice each of these types of losses?

What is your plan to recover from each of these types of losses?

SLA requirements

What mechanisms are in place for auditing and monitoring the service level guarantees of your application?

How can you guarantee the stated level of reliability?

Security and privacy considerations

Our systems get attacked regularly. Think about potential attacks relevant for this design and describe the worst-case impact it would have, along with the countermeasures you have in place to prevent or mitigate each attack.

List any known vulnerabilities or potentially insecure dependencies.

If, for some reason, your application doesn’t have security or privacy considerations, explicitly state so and why.

Once your design document is finalized, file a quick security design review. The design review will help avoid systemic security issues that can delay or block your final security review.

Balancing all design aspects is hard, prone to mistakes and biases. Emergent nature of security and reliability means that they are somewhat fundamental to system design. It‘s hard to simply “bolt on” these things onto existing complex system which was not designed in an appropriate way that takes these requirements into account, if it is a tangled mess of dependencies sharing data uncontrollably. Refactoring might take a lot of time and this is not pretty when you are under a lot of pressure from recent incident. (Let’s be fair, it is the incident which usually fuels a sudden need for refactoring.) That, combined with already existing pressure to deliver features on time, can lead to more mistakes. Repeating this again and again, security and reliability are best considered early.

An example is dissected: suppose you sell widgets to people and need to process their payments. This is a functional requirement. There are nonfunctional requirements for security and reliability too. Payment information is a textbook example of personal information (PI) or personally identifiable information (PII). Handling it needs special treatment and care, proper safeguards, it is often regulated by a whole host of industry standards, government laws and procedures, like PCI DSS. Compromise of sensitive data, especially PII, can have serious consequences for the project and likely for the organization as a whole. You risk losing users‘ trust, losing money on trials, penalties and reparations, and it is not particularly uncommon to go out of business after a serious security breach.

There are high-level tradeoffs – like using ad-based or community funding – which can allow you to avoid the problematic issue of payment processing entirely, but let‘s assume that it is a critical requirement to accept payments from individual users. One common approach to mitigating risks here is to outsource this issue and avoid handling payment information in your system by using a third-party payment processing service. You may not completely avoid the matter, but at least you avoid the need to persistently store sensitive data in your systems. Relying on a third-party expertise has a bunch of benefits:

  • You no longer deal with sensitive data. However, keep in mind that they still do and they can be compromised too.
  • Contractual and regulatory obligations for your organization may become easier to fulfull.
  • No need to build (expensive) infrastructure for protecting sensitive data at rest.
  • Third-party providers usually have ancillary benefits, such as providing fraud protection service which you don’t have to build yourself too.

Naturally, there are costs and new risks too. Processing service is not free and the cut they take may be too big for you. For big companies it might make economic sense to build an in-house solution after all. (Conversely, the opposite is true for small companies.)

External service becomes a reliability risk: if it goes down, you no longer fulfill critical requirements. Redundancy has its direct and indirects costs. You will have to pay fixes costs for having an alternate provider. They APIs are likely to be different, multiplying development costs too. Trying to balance it all out with some queuing on your side introduces its own complexity which comes with reliability risks. Mitigating these risks might require persistent storage. At this point you are going to (re)introduce the same problem you were trying to avoid: storage of sensitive data in your systems. Furthermore, local storage with third-party provider may introduce a new attack vector where bringing down access to external service is used by an insider to acquire local data.

Using third-party services comes with new specific security risks. First of all, you entrust data of your users to a third-party and expect them to care for the data at least as good as you do. You need to carefully evaluate your choice of provider and continuously monitor their performance. Then there is a possibility that you may need to run their code for integration with their platform, run their possibly closed-source code in your infrastructure, or make your users run that code on their machines in the age of web applications. Proper sandboxing might be necessary to mitigate these risks which has its development costs. Using open formats and protocols alleviates the need for running proprietary code. But any third-party libraries might still be an attack vector themselves.

Planning ahead and being prepared might save some costs and satisfy nonfunctional requirements without having to give up features for that. Often the goals here are aligned with general quality attibutes of software and best engineering practices. For example, Google has an internal framework for web applications and microservices. The framework incorporates many static and dynamic conformance checks, ensuring compliance with best practices and guidelines. It is possible to achieve conformance from day one for new projects because the framework has been designed with reliability and security from day one. Many common security and reliability concerns are already handled by this shared framework. Not only that, it can actively prevent developers from introducing common vulnerabilities. It also provides a baseline for other features, such as monitoring, health checks, etc. [Caveat emptor: Google can afford to develop an in-house framework at their scale, you are probably not Google.]

That framework is an example of how good engineering is aligned with security and reliability. Other concerns may also be aligned, such as maintainability of code, clarity of system invariants, design for recoverability, change, and adaptability.

There is a tendency (especially in small teams) to defer security and reliability to “some later point”, “when we have paying customers”, etc., stating the need to focus on “velocity” as reason for the decision because delaying the first release is unacceptable. There is an important distinction between initial and sustained velocity. Disregading critical requirements of security, maintainability, reliability at first may indeed increase early project velocity, but experience teaches that accrued debt will inevitably have to be paid back by slowing down development once you have to deal with accumulated issues. Invasive late-stage changes also carry their own risks.

Internet is an example of deferring concerns. Protocols like IP, TCP, DNS, BGP have excellent track record of reliability, mostly because this requirement has been incorporated in their design since ARPANET times. On the other hand, security was never a concern in the early days because everything happened in the same trusted lab network. Now this is not the case and we have to retrofit HTTPS everywhere, use VPNs like IPSec, deal with DNS hijacking, etc. Internet is close to 50 years old, secure Internet is about 25–30 years in but still not ubiquitious.

Another example is Agile development process which buys increased developement and deployment velocity for upfront investment in continuous integration (CI) infrastructure and continued maintenance of extensive integration and unit test suite. Reliable build pipelines, robust test suites, infrastructure for staggered rollouts and rollbacks, software architecture that enables the above — it’s a modest investment if done early, maintenance is easy and spread out in time. However, the cost may be enormous if you suddently want to get all of that one day in a project that has never considered such a development. Bigger organizations can amortize the cost over multiple projects by sharing expertise, code, infrastructure. Similarly, secure-by-construction frameworks allow to incrementally build a more secure product, without having giant security audits and all surprises they discover.