Chapter 7: Design for a Changing Landscape
There are multiple types of changes made to improve security and resilience. They have different reasons behind them: response to security incidents; response to discovered vulnerabilities; feature changes; internally motivated improvements; externally motivated changes (e.g., due to new regulations). Changes may have other considerations, such as optionality.
Security changes are not that different from other changes [quote Linus on that, for example], and general engineering principles apply to them just fine. Here are some characteristics that all changes should have.
Incremental, as small and standalone as possible.
Documented with sufficient detail to understand the change and its urgency. Mention requirements, scope, lessons learned, rationale, contact people.
Tested for confidence, at least with unit tests, preferably with integration tests as well. Successful peer review also increases confidence.
Isolated from other changes to avoid release incompatibility. Feature flags are really helpful here.
Qualified to handle production data. Don’t just drop it into production right away without following the usual rollout process.
Staged rollouts are the best, making it possible to monitor the change.
This process does slow down the rollout somewhat but you don’t really want to increase downtime risks caused by a broken change.
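A feature flag combined with a staged rollout can be sketched as below. This is a minimal illustration, not a description of any particular flag system: the function names, the flag name, and the choice of hashing a unit ID into 100 buckets are all assumptions made for the example.

```python
import hashlib


def rollout_bucket(flag_name: str, unit_id: str) -> int:
    """Map a (flag, unit) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name: str, unit_id: str, rollout_percent: int) -> bool:
    """Return True if the change is active for this unit at the current stage."""
    # Units whose bucket is below the threshold get the change. Raising the
    # percentage only adds units; nothing that was enabled gets disabled.
    return rollout_bucket(flag_name, unit_id) < rollout_percent
```

Because the bucket is derived from a hash rather than a random draw, a host that received the change at the 10% stage still has it at the 50% stage, which keeps monitoring results comparable between stages. Setting the percentage back to 0 acts as a rollback.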
Now let’s discuss some architectural decisions for infrastructure and processes that prepare you to handle changes with minimal friction.
Keep dependencies up to date; that makes your system less susceptible to known vulnerabilities. It’s especially important for open-source projects. Big ones like OpenSSL and Linux have established security response processes, so your work can boil down to just updating without having to backport fixes yourself. Also, make sure to rebuild and redeploy as necessary to actually apply the fix. In fact, do it regularly even if there are no changes, so you can be sure the process works when you need it.
Have regular releases; this is beneficial for emergency changes too. Splitting a big release into a bunch of small ones makes it all less risky. If you need a rollback, it will be easier to arrange and perform. Potential issues with changes are easier to pinpoint in smaller releases. Automated testing and validation of releases make you even more confident.
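One way to make "keep dependencies up to date" checkable is to compare pinned versions against the first version known to contain a fix. The sketch below assumes purely numeric dotted versions and uses made-up package names (`libfoo`, `libbar`); real version schemes and advisory feeds are more involved.

```python
def parse_version(text: str) -> tuple:
    """Parse a purely numeric dotted version such as '1.2.10'."""
    return tuple(int(part) for part in text.split("."))


def outdated(pinned: dict, minimum_fixed: dict) -> list:
    """Return names of dependencies pinned below the first patched version."""
    return sorted(
        name
        for name, version in pinned.items()
        if name in minimum_fixed
        and parse_version(version) < parse_version(minimum_fixed[name])
    )


# Hypothetical inventory and advisory data, for illustration only.
pinned_versions = {"libfoo": "1.2.9", "libbar": "2.0.0"}
first_fixed = {"libfoo": "1.2.10"}
needs_update = outdated(pinned_versions, first_fixed)
```

Comparing tuples of integers rather than strings matters here: as strings, "1.2.9" would sort after "1.2.10" and the outdated dependency would be missed.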
Containers make dependency tracking much easier. The applications do not have to depend on the host OS configuration that way. Also, the host OS can be made smaller and more hardened if it does not have to provide shared dependencies for multiple applications. Containers are also meant to be immutable, which facilitates frequent updates. You have to rebuild and retest the entire thing to deploy it, so you’d want that to be painless, which is a win-win. Immutability simplifies tracking what exactly you’re running, so it is easier to pinpoint which parts of the infrastructure are vulnerable.
Microservices split work into more manageable pieces while at the same time making the system easier to scale and providing more visibility into performance bottlenecks. They also naturally facilitate zero-trust networking, allowing for more flexible firewall configurations. Another point is that microservices improve code reuse, which is good from a security perspective.
Example: Google Front End (GFE) protecting Google’s services from the wild Internet. It provides the usual benefits of splitting your stuff into backend and frontend.
Load balancing to redirect traffic between instances, helpful during outages.
Ability to have multiple layers within backend and frontend layers themselves, making general and rapid changes easier.
GFE can serve as a mitigation point in case of overload. This means that not every backend layer needs its own protection.
Migration to new protocols and requirements becomes easier. You start with the frontend, accepting new data formats and applying new restrictions there. After that you can progress at a different pace in the backend, knowing that your users are already using the new thing successfully.
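The frontend-first migration from the last point can be sketched as a small translation layer. Everything here is hypothetical: the field names (`user`, `user_id`, `scopes`), the two request formats, and the `backend_supports_v2` switch are invented for the example and do not describe how GFE works.

```python
def frontend_translate(request: dict, backend_supports_v2: bool) -> dict:
    """Accept old and new request formats at the edge, emit what the backend expects."""
    if "user_id" in request:
        # New (hypothetical) format.
        user_id = request["user_id"]
        scopes = list(request.get("scopes", []))
    elif "user" in request:
        # Legacy (hypothetical) format, still accepted during the migration.
        user_id = request["user"]
        scopes = []
    else:
        raise ValueError("request carries no user identifier")

    # New restrictions are applied at the frontend first.
    if not user_id:
        raise ValueError("user identifier must not be empty")

    if backend_supports_v2:
        return {"user_id": user_id, "scopes": scopes}
    # Backend not migrated yet: downgrade to the legacy shape.
    return {"user": user_id}
```

Clients can move to the new format as soon as the frontend accepts it, while each backend flips `backend_supports_v2` on its own schedule.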
Microservices can achieve similar benefits with a layered service mesh architecture. For example, there is the concept of separating the data plane from the control plane, so that request handling and its configuration are managed separately. Load balancing, security, and observability are handled by the data plane. Policies and configuration are handled by the control plane, which scales on its own.
Not all changes proceed at the same speed; multiple factors define how quickly you want to move.
Severity of the issues plays a great role. Not all vulnerabilities are simultaneously critical, actively used, and applicable to your infrastructure. If all of that is true then you might need to act immediately (accepting the risks of accelerated rollouts). Otherwise you might want to stick to the regular update routine.
Dependencies, both internal and external, affect the pace too. You might be blocked by something or someone else.
Sensitivity of changes is also important. Nonessential changes are less important and might not be worth the risk of being applied at critical times for your business (think holiday sales).
Deadlines should be considered too, like regulatory changes, which typically come with specific compliance dates, or disclosure embargoes on vulnerabilities.
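The factors above can be turned into a small triage helper. This is only a sketch of the reasoning: the three pace levels, the field names, and the idea of a "freeze window" flag are assumptions chosen for the example, and a real process would also weigh dependencies and deadlines.

```python
from dataclasses import dataclass
from enum import Enum


class Pace(Enum):
    EMERGENCY = "emergency"  # accelerated rollout, accepting extra risk
    REGULAR = "regular"      # normal update routine
    DEFERRED = "deferred"    # wait until the business-critical period ends


@dataclass(frozen=True)
class Change:
    critical: bool
    actively_exploited: bool
    applies_to_us: bool
    essential: bool = True
    in_freeze_window: bool = False  # e.g., during holiday sales


def choose_pace(change: Change) -> Pace:
    """Pick a rollout pace from severity and sensitivity."""
    # Only act immediately when all three severity conditions hold.
    if change.critical and change.actively_exploited and change.applies_to_us:
        return Pace.EMERGENCY
    # Nonessential changes are not worth the risk at critical times.
    if not change.essential and change.in_freeze_window:
        return Pace.DEFERRED
    return Pace.REGULAR
```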
Known but undisclosed vulnerabilities can be tricky to handle. If a vulnerability is under embargo then you might break the embargo by applying a patch (indirect disclosure). Though sometimes you don’t have to wait for public announcement and instead need to complete the mitigations before that happens.
Zero-day vulnerabilities often require short-term changes. Such vulnerabilities may already be known to some attackers, and typically there are no patches available. Though, obviously, not all newly discovered vulnerabilities are critical just because they are new. Carefully triage the ones you become aware of. If you get your hands on a patch, verify that it actually appears to help. Having a working exploit comes in handy here. Be aware, though, that the patch might address only a part of the problem, or might introduce new vulnerabilities of its own. The patch should be applied gradually, gauging its effect on the testing infrastructure first before it reaches production. Sometimes it is not possible to directly fix the vulnerability, so you might mitigate the risk by restricting access to the vulnerable component instead, either temporarily or permanently.
Example: Shellshock, the remote code execution vulnerability of 2014. First public reports were quickly followed by exploits in the wild, so Google’s response team proceeded with their ‘black swan’ protocol for emergency mitigation:
- Identify the scope and relative risk for affected systems
- Communicate internally to potentially affected teams
- Patch all vulnerable systems as quickly as possible
- Communicate actions, including plans, externally to partners and customers
Thankfully, a patch was already available. The team assessed that production servers were low risk but could be easily patched with an automated rollout in an expedited manner. The large number of workstations were higher risk but easy to patch quickly. A small number of nonstandard servers required some manual intervention, but good communication made the response swift.
Simultaneously, the team developed software to detect vulnerable systems, which helped during the final stages of the rollout. It was added to the standard security monitoring afterwards.
Takeaways for organizations:
Standardized software distributions are great: they make patching a simple and easy way to remediate vulnerabilities. At the same time, make people understand the risk of choosing an alternate, nonsupported distribution.
Public distribution standards for patches are also helpful. If they can be applied right away your team can start doing useful things, instead of spending time on adapting the patch to their environment first.
Ensure that you have an acceleration mechanism to roll out emergency changes quicker, skipping some (but not all) validation steps.
Make sure you have monitoring in place to track the progress, identify unpatched systems, determine the point where you can slow down, etc.
Prepare external communications early; you don’t want to get bogged down by approvals when the media calls for a response.
Draft and have a reusable response plan for incidents and vulnerabilities before they happen. It should include the language for external communications. If not sure, start with your previous postmortem.
Know which parts of the infrastructure are nonstandard or need some special attention.
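The monitoring and inventory takeaways can be illustrated with a small progress report over a fleet inventory. This is a sketch under assumptions: the `Host` record, its fields, and the host names are invented here, and a real inventory would come from your asset management and detection tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    patched: bool
    standard_image: bool = True  # False for nonstandard systems


def patch_report(hosts: list) -> dict:
    """Summarize rollout progress and list what still needs work."""
    total = len(hosts)
    unpatched = [h for h in hosts if not h.patched]
    percent = 100.0 if total == 0 else 100.0 * (total - len(unpatched)) / total
    return {
        "percent_patched": percent,
        "unpatched": sorted(h.name for h in unpatched),
        # Nonstandard hosts are unlikely to be fixed by the automated rollout.
        "needs_manual_attention": sorted(
            h.name for h in unpatched if not h.standard_image
        ),
    }


# Hypothetical fleet, for illustration only.
fleet = [
    Host("web-1", patched=True),
    Host("web-2", patched=True),
    Host("ws-7", patched=False),
    Host("legacy-1", patched=False, standard_image=False),
]
report = patch_report(fleet)
```

Tracking the nonstandard hosts separately makes it visible early which systems need a person rather than the automated rollout.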
Now, onto the next example. Proactive changes to improve the security posture of the organization are an example of medium-term changes. They tend to be driven by internal and external deadlines and rarely need to be rolled out suddenly. You typically start with an assessment of the affected systems and prepare a plan, designing your change.
Plans are typically phased, with success criteria to fulfill before moving on. You can set a percentage of rollouts, split the changes in groups, or something like that. With groups there are two competing philosophies:
- Start with the easiest case to get most traction and prove value
- Start with the hardest case to find most issues and edge cases
The path you take depends on how much you have to ‘sell’ the change, how much power within the organization you have, and what risk mitigation strategy you have in mind. Either way, a successful proof-of-concept and dogfooding are helpful. Sometimes incremental rollouts are possible, making the changes opt-in at first, with dry runs, with an alerting mode exercised before switching to enforcement mode, etc.
Example: FIDO security keys. While 2FA with one-time passwords (OTPs) was widely deployed at Google, it is still susceptible to phishing attacks. Your users may be well prepared for those attacks, but the attacks may be elaborate too. After some evaluation of possible improvements, Google decided to use Universal 2nd Factor (U2F) hardware tokens to improve the situation. This decision is also a part of the change process. They carefully defined the desired security, usability, and privacy properties, and how those would affect day-to-day workflows. Usability was particularly important. For example, you don’t want your SRE team to slow down a response because 2FA misbehaves. You want your developers to easily integrate 2FA into their APIs. You want the users to be able to quickly recover if they lose the key.
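The progression from dry run to alerting to enforcement can be sketched as a mode switch around a policy check. The mode names, the logger name, and the `check_request` function are assumptions for this example rather than a reference to a specific policy engine.

```python
import logging
from enum import Enum

logger = logging.getLogger("policy_rollout")


class Mode(Enum):
    DISABLED = "disabled"  # policy not evaluated at all
    ALERT = "alert"        # violations are logged but allowed
    ENFORCE = "enforce"    # violations are rejected


def check_request(compliant: bool, mode: Mode, subject: str = "unknown") -> bool:
    """Return True if the request may proceed under the current mode."""
    if mode is Mode.DISABLED or compliant:
        return True
    if mode is Mode.ALERT:
        # Dry run: record what would have been blocked, without breaking anyone.
        logger.warning("would block noncompliant request from %s", subject)
        return True
    return False
```

Running in alert mode for a while shows how many requests enforcement would break, which is useful input for the success criteria of each phase.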
Google started the process in 2011 and by 2013 they started rolling out the keys. The process has been mostly self-service to ensure wide adoption.
Many initial users opted into keys voluntarily because they were much easier to use than their existing OTP tools: no codes to constantly type in, just plug this nano key into your USB port.
The keys were readily available at IT helpdesks at any office. (And yes, it was a minor legal nightmare to distribute them to global offices.)
The users enrolled their security keys themselves, with assistance available from the helpdesk people. Trust on first use (TOFU) makes it secure, and enrollment is the last time you have to use the previous OTP.
Multiple keys could be acquired and enrolled so losing one was not an issue.
Making the hardware available is only half of the problem; software has to follow suit too. In 2013, many applications did not readily support this novel technology. The team first focused on adding security key support to applications that were in daily use, like code review tools and dashboards. Sometimes Google had to contact vendors and request them to add support. There was a long tail of applications, but they were trackable due to centrally generated OTPs. It was easy to see which application to focus on next.
By 2015 the rollout reached the stage where the deprecations could start. Remaining users started getting reminders to switch to security keys when they were using OTPs. Eventually access via OTP was mostly blocked. Some applications still required OTPs, and for those cases Google had a fallback: a web-based OTP generator where you have to use a security key to prove your identity.
Takeaways and lessons learned:
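Tracking the long tail through centrally generated OTPs amounts to counting which applications still request them. The sketch below assumes a hypothetical event stream in which each OTP request is recorded as the requesting application's name; the application names are invented.

```python
from collections import Counter


def next_targets(otp_events: list, already_supported: set, top_n: int = 3) -> list:
    """Rank applications still relying on OTPs by how often they request one."""
    counts = Counter(app for app in otp_events if app not in already_supported)
    return counts.most_common(top_n)


# Hypothetical OTP request log: one application name per generated OTP.
events = ["wiki", "wiki", "wiki", "tickets", "tickets", "codereview", "payroll"]
targets = next_targets(events, already_supported={"codereview"}, top_n=2)
```

The same counts double as a progress metric: as applications gain security key support, the total number of OTP requests should trend toward zero.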
- Make sure it works for all users (e.g., blind people)
- Make the change easy to learn and ideally effortless
- Self-service reduces burden on the central IT team
- Tangible proof and receiving feedback are important
- Giving quick feedback on noncompliance is important too
- Track the progress and be ready for the long tail
So, moving on to long-term changes. Typically these are massive architectural changes or externally demanded ones, such as industry-wide regulatory changes. They can take multiple years to implement. It is critical to define and measure progress to make sure you are on track. Documentation is also essential, as people may come and go during the change. Continued leadership support and encouragement are indispensable too, as people are in a marathon with a long tail.
Example: increased HTTPS adoption. This is an important technology providing integrity and confidentiality guarantees. Thanks to a decade of combined effort by organizations like Google and Let’s Encrypt, HTTPS is getting wider adoption and becoming the new norm.
Ecosystem-wide changes required extensive prior research to build a strategy, prepare outreach channels, and lay out incentives for people to migrate.
Continuous monitoring and metrics were important as well, since Google controls Chrome and has access to user behavior data.
Overcommunication helped too, ensuring that no one is caught by surprise with all those pushes to use HTTPS.
Finally, Google has focused on creating and emphasizing incentives to provide business reasons for migration. [Migrate or your company’s site won’t open in Chrome and will be deranked in search, ha-ha!] Though, there are also genuine features like Service Workers which are enabled by using HTTPS.
It’s not only Google. Many other vendors followed suit, and multiple ecosystems have shown an increase in HTTPS adoption. Everybody shared information, struggles, etc. while making the shift happen.
Takeaways:
- Understand the ecosystem before committing to a strategy
- Overcommunicate to maximize outreach
- Tie security changes to business incentives
- Build an industry consensus
Now let’s talk about complications, when things go… not as planned. Sometimes you need to accelerate a rollout, say, because a vulnerability is actively exploited. But you need to be aware that this also increases the risk. Maybe you could change the order of rollouts to cover high-risk systems sooner, or restrict access to the vulnerable system.
A slowdown is also possible, usually caused by some issues with the patch which would negatively impact the business. There are other possible reasons, such as a need to support older protocols despite them being vulnerable, because they are important from a business perspective (TLS 1.0, anyone?).
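Reordering a rollout so that high-risk systems go first can be as simple as sorting by a risk score and cutting the result into waves. The scores, system names, and wave size below are made up for illustration; how you score risk is a separate question.

```python
def plan_waves(risk_by_system: dict, wave_size: int) -> list:
    """Order systems from highest to lowest risk and split them into waves."""
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    # Sort by descending risk; break ties by name so the plan is deterministic.
    ordered = sorted(risk_by_system, key=lambda name: (-risk_by_system[name], name))
    return [ordered[i:i + wave_size] for i in range(0, len(ordered), wave_size)]


# Hypothetical risk scores on a 0..10 scale.
waves = plan_waves({"web": 9, "batch": 2, "db": 7, "wiki": 2}, wave_size=2)
```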
Ideally the plans can change before you start implementing them. Here are some reasons and appropriate tactics for change of plans:
Delayed due to external factors, such as embargo: communicate with affected teams, see if they change the timeline.
Speed up due to a public announcement, such as a leak: have a plan for expedited rollout in case the embargo is broken.
Lack of severe impact: patch the critical infrastructure quickly but slow down for other parts.
Dependency on external parties is usually inevitable to some extent. Sometimes you just have to wait for your vendor to prepare a mitigation.
Example: Heartbleed, that one vulnerability from 2014 in OpenSSL’s implementation of the TLS heartbeat extension, which could be used to exfiltrate the peer’s memory content (e.g., private keys). This was the first time a vulnerability got such novel media coverage, and since then many disclosures have used the same approach of explanatory websites. Google was a codiscoverer of the vulnerability, so some critical infrastructure had already been stealth-patched before the public disclosure (with no unrelated internal teams knowing about it). Automated checks for vulnerabilities allowed Google to speed up patching after the official disclosure. Affected teams were notified, and the progress was tracked and documented.
Some takeaways here:
- Plan for the worst case of embargo being broken or lifted
- Prepare for rapid deployment at scale
- Rotate your keys and secrets regularly
- Make sure to have an established communication channel
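Regular rotation is easier to sustain when something flags keys that are overdue. The sketch below assumes you can list creation timestamps for your keys; the key names and the 90-day limit in the usage example are arbitrary choices for illustration, not a recommendation.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def keys_due_for_rotation(
    created_at: dict, max_age: timedelta, now: Optional[datetime] = None
) -> list:
    """Return names of keys whose age has reached the allowed maximum."""
    if now is None:
        now = datetime.now(timezone.utc)
    return sorted(
        name for name, created in created_at.items() if now - created >= max_age
    )


# Hypothetical key inventory with timezone-aware creation times.
inventory = {
    "tls-frontend": datetime(2024, 5, 20, tzinfo=timezone.utc),
    "db-password": datetime(2023, 12, 1, tzinfo=timezone.utc),
}
due = keys_due_for_rotation(
    inventory,
    max_age=timedelta(days=90),
    now=datetime(2024, 6, 1, tzinfo=timezone.utc),
)
```

Rotating on a schedule also exercises the rotation procedure itself, so an emergency rotation after an incident like Heartbleed is a routine operation rather than a first attempt.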