Chapter 5: Design for Least Privilege
Things can and will go wrong. People will make mistakes, fat-finger commands, get compromised, and fall for phishing emails. Expecting perfection is unrealistic. In other words, to quote an SRE maxim, hope is not a strategy.
Least privilege is a well-established concept in security, and it applies to humans and machines alike. Google tries to avoid implicit permissions and ambient authority: for example, the tools you run inheriting your root access to the whole system, or your having the ability to gain root access in the first place.
Zero trust networking means that network location does not grant privileges by itself. Plugging into an office network socket or accessing the office network via VPN does not authorize any action other than attempting to send and receive packets. Proper user and device credentials are still necessary.
Zero Touch is another Google concept: it disallows direct human access to automated systems. Changes to automated systems are performed by other automated systems, in a controlled and predictable way.
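A minimal sketch of the idea, assuming a hypothetical `Request` record whose credential fields have already been verified elsewhere. The point is only that the network the request came from plays no part in the decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    source_network: str          # e.g. "office-lan", "vpn", "internet"
    user_credential_ok: bool     # outcome of verifying the user's credential
    device_credential_ok: bool   # outcome of verifying the device's credential


def is_authorized(request: Request) -> bool:
    """Grant access on credentials alone; source_network is deliberately ignored."""
    return request.user_credential_ok and request.device_credential_ok
```

With this shape, a request from the office LAN without valid credentials is refused, while a fully credentialed request from the internet is accepted.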
Risk reduction always comes with tradeoffs. Different data has different properties and a different tolerance for risk, and the same goes for other things, such as API access. It is better to avoid all-or-nothing approaches, where you are either not allowed to do anything or have unrestricted admin access. This starts with classifying data, users, interfaces, and entities. The result might be two or three broad classes or a programmable multilevel system, depending on your organization’s structure and needs. Once the classification is done, think about appropriate controls: who can access what, how tight the control over access should be, what type of access is required (read/write/admin), and what infrastructure controls are already in place. Here is an example of a 3x3 classification:
|Description|Risk associated with access|
|---|---|
|Open to anyone in the company| |
|Restricted by business purpose| |
|Temporary access on need-to-know| |
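One way to make such a classification programmable is to map each combination of data class and access type to a required control. The sketch below is purely illustrative: the class names are derived from the descriptions above, and the three controls and the rule that combines them are hypothetical, not taken from the chapter.

```python
from enum import Enum


class DataClass(Enum):
    OPEN = "open to anyone in the company"
    RESTRICTED = "restricted by business purpose"
    NEED_TO_KNOW = "temporary access on need-to-know"


class AccessType(Enum):
    READ = "read"
    WRITE = "write"
    ADMIN = "admin"


# Hypothetical controls, ordered from loosest to tightest.
CONTROLS = ["log only", "manager approval", "second-party approval plus expiry"]


def required_control(data_class: DataClass, access: AccessType) -> str:
    """Pick a control that tightens with data sensitivity and with access power."""
    sensitivity = list(DataClass).index(data_class)   # 0, 1 or 2
    power = list(AccessType).index(access)            # 0, 1 or 2
    level = min(sensitivity + power, len(CONTROLS) - 1)
    return CONTROLS[level]
```

An explicit nine-entry table would work just as well; the arithmetic is only a compact way to say that either more sensitive data or more powerful access calls for a tighter control.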
There is a certain interplay with reliability here. You might want to start by controlling access to your infrastructure, such as the ability to configure ACLs, since it has a wide impact on the system. However, merely reading sensitive data can be damaging as well.
One of the best practices is small functional APIs: do one thing and do it well, the UNIX way. This applies not only to individual API endpoints but to the API surface as a whole. Also, take care to differentiate between user and administrative APIs, with the latter requiring more attention from a security standpoint.
Consider the POSIX API: it is widely used and feature-rich.
Every interactive session via SSH potentially exposes effectively the entire POSIX API.
A simple approach to auditing interactive sessions, capturing the commands that are run,
seems to be sufficient, at least against inexperienced users.
However, there are many tricks to circumvent it, like using Vim’s command console.
You can audit and control syscalls and library functions, but that rabbit hole is deep.
Exposing a dedicated, decomposed administrative API is often easier to control.
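To illustrate what a decomposed administrative API might look like, here is a minimal sketch. The `AdminApi` class, its two operations, and the `AuditEntry` fields are all hypothetical; the operations only return placeholder strings instead of touching a real service.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    who: str
    action: str
    target: str
    why: str
    when: datetime


@dataclass
class AdminApi:
    """Exposes a few named operations instead of a general-purpose shell."""
    audit_log: list = field(default_factory=list)

    def _record(self, who: str, action: str, target: str, why: str) -> None:
        self.audit_log.append(
            AuditEntry(who, action, target, why, datetime.now(timezone.utc)))

    def restart_service(self, who: str, service: str, why: str) -> str:
        self._record(who, "restart_service", service, why)
        return f"restart requested for {service}"  # placeholder for the real action

    def set_log_level(self, who: str, service: str, level: str, why: str) -> str:
        # Arguments are validated before anything is recorded or executed.
        if level not in {"debug", "info", "warning", "error"}:
            raise ValueError(f"unsupported log level: {level}")
        self._record(who, f"set_log_level={level}", service, why)
        return f"log level for {service} set to {level}"
```

Because every operation is a named call with explicit arguments, each audit entry states who did what to which target and why, which a captured shell session cannot guarantee.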
A breakglass mechanism is an emergency access measure that bypasses regular authorization. It might be necessary in advanced systems to prevent accidental lockouts caused by automated systems.
Auditing is used to detect incorrect use of authorization, whether malicious, the result of a compromise, or simply erroneous. The usefulness of audit logs depends on the design of the system: how granular the logs are (what is accessed and where) and how clear the metadata is (who accessed it, when, and why). Another important factor is the culture around auditing.
Small functional APIs usually yield a good, granular series of actions to audit. It also makes sense to think about how you would explain the audit log to your customer, and to focus on what is useful there. Note that exceptional situations might throw a wrench into the works. It is common to provide breakglass access as an interactive session or as direct access credentials for administrative use. Both scenarios preclude a granular audit trail, and this can be exploited by a knowledgeable insider. Even a full capture of an interactive session can be hard to audit due to the sheer amount of data.
Broad APIs and frequent breakglass usage can be mitigated with a culture of careful auditing. This is important for security and reliability alike: two pairs of eyes spot more mistakes, and no one should have unilateral access to user data. Such a culture supports building auditable APIs and teaches the value of collaboration and the usefulness of audit logs. Without it, audits may become simple rubber stamps, even when they are in place.
Good audit logs are not enough: you also need to choose an appropriate auditor to inspect your (hopefully rare) audit events. Auditors need to know what audit entries mean and why the recorded actions may have been needed. You have to strike a balance between context and objectivity to counter biases and shadow goals.
The purpose of the audit plays a significant role here. At Google, audits are typically performed either to monitor best practice usage or to investigate a possible security incident. “Best practice” reviews usually target reliability objectives, and it is typically a good idea to make them public to the affected teams to facilitate social norming. “Security” audits are better done in a centralized manner, across the entire organization, since individual teams may not see all the connections. Structured justifications are widely used: basically, a ticket number is associated with actions to enable programmatic checks (e.g., access to customer data is granted in order to respond to a ticket filed by that customer).
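A programmatic check of a structured justification could look roughly like this. The `TICKET-<number>` format and the `open_tickets` mapping are assumptions made for the example; a real system would query its own ticket tracker.

```python
import re

TICKET_ID = re.compile(r"TICKET-\d+")  # hypothetical ticket ID format


def justification_is_valid(justification: str, customer: str,
                           open_tickets: dict) -> bool:
    """Accept access to a customer's data only for an open ticket they filed.

    open_tickets maps a ticket ID to the customer who filed it.
    """
    if TICKET_ID.fullmatch(justification) is None:
        return False
    return open_tickets.get(justification) == customer
```

The check ties the access to a specific customer request, so citing an unrelated or closed ticket is not enough.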
Testing is fundamental to well-designed systems. Testing of least privilege verifies that clearly defined users have the access they need, but no more. You should be able to describe what users need to do for their jobs and which scenarios they use (e.g., single/bulk read/write access), then record their actual actions and impact and compare them with the expected results. Ideally, this test suite runs for each code and ACL change before it hits production. Incomplete coverage can be mitigated by additional monitoring and alerting.
Another aspect is testing with least privilege. It should be possible to test changes without affecting production services or assets. Testing an experimental configuration change should not bring down production. Testing a new data processing path should not corrupt production data or become a vector for production data leaks. Specialized or anonymized data sets are approaches to consider.
Instead of mirroring production, it might be tempting to just have a specialized test account. This might work in some situations, but it has drawbacks too, such as interfering with audit logging and with the granularity of access. If you have to, it is wise to approach this tradeoff incrementally, starting with specialized credentials and then moving to limited data access. Building a proper testing infrastructure can be expensive, but not having one is a risk of its own, since tests end up being performed directly on production. And if your testing infrastructure is hard to use, people may fall back to testing on production as well.
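As a sketch of such a test, the function below compares the permissions an ACL actually grants with the permissions each user is expected to need, and reports both missing and excess access. The data shapes (user name mapped to a set of permission names) are an assumption for the example.

```python
def acl_violations(acl: dict, expected: dict) -> list:
    """Return human-readable differences between granted and needed access.

    Both arguments map a user name to a set of permission names.
    """
    problems = []
    for user in sorted(set(acl) | set(expected)):
        granted = acl.get(user, set())
        needed = expected.get(user, set())
        for permission in sorted(needed - granted):
            problems.append(f"{user} is missing {permission}")
        for permission in sorted(granted - needed):
            problems.append(f"{user} has excess {permission}")
    return problems
```

A check like this can run on every ACL change; an empty list means users have exactly the access their described jobs require.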
Enforced least privilege will cause access denials. In a properly designed system, denials may have different causes and contexts. A denial might be legitimate and working as intended. The user might actually have a way to get the access they need by taking some additional action (e.g., authorization by a second party). Or it might be a seemingly incorrect denial, which the user may want to contest via the support system. In all cases, absent a clear reason to do otherwise, the default should be not to inform the user about the details, because that information might be exploited. However, once the user has a certain level of authorization and privileges, more options become possible. Typically, a token is provided to identify the individual access denial; the user can submit it to the support system for later investigation. Advanced users might be given remediation instructions so that they can perform the necessary actions and try again. This requires a careful balance, so initially it is wise to take a zero trust approach and disclose nothing other than a token.
Some more advice on the use and deployment of breakglass mechanisms. Access should generally be very restricted, typically to the SRE team responsible for operating the system. It is also a good idea to restrict access based on physical location, requiring connections to come via a particular network endpoint. (Yes, this resembles zero trust networking. However, in the breakglass case it is more like negative trust.) Usage of breakglass must be closely monitored and audited. It should also be regularly tested to ensure that it is operational during an actual emergency.
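The token-only default can be sketched as follows. `DenialTracker` and its methods are hypothetical names; the idea is that details are stored server-side and the user sees nothing but an opaque reference.

```python
import secrets


class DenialTracker:
    """Keeps denial details server-side, keyed by an opaque token."""

    def __init__(self):
        self._details = {}

    def deny(self, user: str, resource: str, reason: str) -> str:
        """Record a denial and return the token to show to the user."""
        token = secrets.token_hex(8)
        self._details[token] = {"user": user, "resource": resource, "reason": reason}
        return token

    def user_message(self, token: str) -> str:
        """The only text the user sees: no resource name and no reason."""
        return f"Access denied. Reference token: {token}"

    def lookup(self, token: str):
        """Used by support staff to investigate a contested denial."""
        return self._details.get(token)
```

Remediation instructions for sufficiently privileged users could be added later as a separate, explicitly authorized code path.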
An example: configuration distribution. It is a typical small API to which some general best practices apply: version control, change review, canary rollouts. There are different approaches to providing an API for the actual update, even when the process is completely automated.
POSIX API via SSH. Automation has shell access to the system, typically as the user the service runs as. It is simple and common, and it can reuse existing system access controls. But it also grants unnecessarily wide permissions, such as the ability to shut down the service completely, which might be exploitable.
Regular software update mechanisms, such as binary packages, containers, or whatever else is used to deploy software. This is often convenient and sufficiently limited. However, a software deployment mechanism may not be a good fit for configuration updates.
SSH with a limited command set. This is often flexible enough and integrates well with the surrounding ecosystem. However, it starts to look like a weird custom-made RPC, so you might as well consider using something more appropriate.
Custom API, either as a “sidecar” daemon for configuration updates or as a self-update ability integrated directly into the service. This is the most flexible approach, but it requires additional development.
There are various tradeoffs in these schemes. Ideas from multi-party authorization may come in handy here: requiring multiple signatures on the pushed configuration helps ensure that it is the right thing, coming through the right channels, to be rightfully applied. A bearer token might be required in addition to regular authentication, making it possible to rate-limit certain changes to configuration and infrastructure.
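The multiple-signature idea can be sketched with the standard library. To keep the example self-contained it uses HMAC with a per-signer shared key, which is an assumption of this sketch; a real deployment would more likely use asymmetric signatures. The function names are hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, config: bytes) -> str:
    """Produce a signer's signature over the exact configuration bytes."""
    return hmac.new(key, config, hashlib.sha256).hexdigest()


def enough_valid_signatures(config: bytes, signatures: dict,
                            signer_keys: dict, required: int = 2) -> bool:
    """Check that at least `required` distinct known signers signed `config`.

    signatures maps signer name -> hex signature (so each signer counts once);
    signer_keys maps signer name -> that signer's key.
    """
    valid = 0
    for signer, signature in signatures.items():
        key = signer_keys.get(signer)
        if key is not None and hmac.compare_digest(signature, sign(key, config)):
            valid += 1
    return valid >= required
```

The receiving side would call `enough_valid_signatures` before applying a pushed configuration, so a single compromised signer or a modified payload is not enough to get a change applied.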
Access control is split into two distinct steps: authentication and authorization. Authentication establishes the identity of the requester. Authorization decides whether that identity or role, the result of authentication, is allowed to perform the requested action. Authorization may be based on the action, its arguments, the source of the request, metadata and surrounding context, etc.
Designing complex access policies is difficult. Make the policy too simplistic and it fails to give enough flexibility, so people will have to dish out unnecessarily wide permissions. Make it too generic and no one will be able to reason about permissions, especially after a couple of years of evolution. This is similar to designing a software API, and similar approaches, like iterative design, work here too. Authorization configuration typically needs to be updated independently of the binary, as it is particularly sensitive information. Finally, consider that people will need to be taught and trained to use the configuration, often with the assistance and cooperation of the security and development teams.
The deep-dive section discusses advanced authorization controls. The main idea is to augment the usual binary “granted/denied” approach with a “maybe” result.
Multi-party authorization (MPA) has many benefits: it prevents mistakes, discourages bad actors, increases the cost of an attack, provides an audit trail, and comforts customers. It is often performed at a broad level: for example, by requiring approval to join an administrative group. This may also be a way to implement a breakglass mechanism. More granular, per-action mechanisms are better from both a security and a reliability standpoint, but they typically require additional effort to implement.
Potential pitfalls with MPA include failing to provide sufficient context for approvers to make an informed decision. Social pressure might also be a factor (it is difficult to refuse your boss, or a manager standing near your desk). Make sure that people can say “no” to authorization requests, or raise a flag by other means.
MPA in big organizations might also have a weakness if all workstations are centrally managed. If you can compromise Alice’s computer, then Bob’s is likely to be compromisable too, which yields the necessary two-person approval. The classic mitigation is to have separate workstations, one for regular tasks and one for controlled interaction with the production environment. However, this is expensive, and both machines present a comparable attack surface. Google suggests using so-called Three-Factor Authorization (3FA): requiring authorization via a mobile device in addition to the regular workstation. (Yes, they know it’s a silly and not quite accurate name.) The reasoning is that it is easier to harden a dedicated mobile device that serves as an additional authorization factor, which protects against compromise of the workstations. Note that this is not 2FA (or MFA, for that matter), as 3FA is about authorization, not authentication.
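A per-action MPA check with a three-valued result might be sketched like this. The `Decision` enum, the parameter names, and the rule that one independent approval suffices are all assumptions made for illustration.

```python
from enum import Enum


class Decision(Enum):
    GRANTED = "granted"
    DENIED = "denied"
    NEEDS_APPROVAL = "needs approval"   # the "maybe" outcome


def authorize(requester: str, action: str, approvals: set,
              permissions: dict, sensitive_actions: set,
              eligible_approvers: set) -> Decision:
    """Decide on one action; sensitive actions need an independent approver.

    permissions maps a user name to the set of actions that user may request.
    """
    if action not in permissions.get(requester, set()):
        return Decision.DENIED
    if action not in sensitive_actions:
        return Decision.GRANTED
    # Self-approval does not count, and neither do unknown approvers.
    valid = (approvals & eligible_approvers) - {requester}
    return Decision.GRANTED if valid else Decision.NEEDS_APPROVAL
```

Returning `NEEDS_APPROVAL` rather than a flat denial lets the caller route the request to an approver and retry once the approval exists.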
Another approach is to enforce authorization via business justification, with access requests backed by a particular ticket ID or similar. It may be hard to validate automatically, but it is a strong method when used properly. Note, though, that if it gets difficult to find an approver, justifications might degenerate into a generic “Team A needs access to do work”. An idea to consider: alerting on generic requests detected by pattern matching.
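Pattern-matching for generic justifications can start very simply. The two patterns below are hypothetical examples; in practice the list would be tuned against the justifications an organization actually sees.

```python
import re

# Hypothetical phrases that suggest a justification carries no real information.
GENERIC_PATTERNS = [
    re.compile(r"\bneeds? access\b", re.IGNORECASE),
    re.compile(r"\bto do (?:my|our|their)?\s*(?:work|job)\b", re.IGNORECASE),
]


def looks_generic(justification: str) -> bool:
    """Return True if the text matches any known generic phrase."""
    return any(p.search(justification) for p in GENERIC_PATTERNS)
```

A match would raise an alert for human review rather than block the request outright, since simple patterns like these will produce false positives.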
The risk of authorization decisions can be lowered by granting temporary authorization.
This is especially useful if fine-grained controls are not available.
Access might be structured and scheduled:
e.g., the on-call team has it for the duration of their shift.
This approach combines nicely with others, like MPA and justifications.
It is also easy to audit.
Consider sudo(8) as an example:
it grants elevated access for the duration of a single command,
as opposed to being constantly logged in as root.
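Scheduled, time-bounded access can be sketched as a grant with explicit start and expiry times. The `Grant` class and `grant_for_shift` helper are hypothetical names for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Grant:
    user: str
    permission: str
    starts_at: datetime
    expires_at: datetime

    def is_active(self, now: datetime) -> bool:
        """A grant is valid from starts_at (inclusive) to expires_at (exclusive)."""
        return self.starts_at <= now < self.expires_at


def grant_for_shift(user: str, permission: str,
                    shift_start: datetime, shift_length: timedelta) -> Grant:
    """Create a grant that covers exactly one on-call shift."""
    return Grant(user, permission, shift_start, shift_start + shift_length)
```

Because the expiry is part of the grant itself, nobody has to remember to revoke access when the shift ends, and the grant records make a natural audit trail.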
If fine-grained controls are not available, you can use a hardened proxy (aka bastion) and have the system accept requests originating only from it. The proxy can then perform access checks, rate limiting, additional logging, etc. Mind the development and operational costs of this approach, though.
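The three proxy duties mentioned above (access checks, rate limiting, logging) fit in a small sketch. `BastionProxy` is a hypothetical in-process stand-in: `backend` is any callable, and the caller passes the current time explicitly so the behavior is easy to test.

```python
from collections import defaultdict, deque


class BastionProxy:
    """Checks, rate-limits, and logs requests before forwarding them."""

    def __init__(self, backend, allowed_users, max_requests, window_seconds):
        self.backend = backend
        self.allowed_users = set(allowed_users)
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.recent = defaultdict(deque)   # user -> timestamps of forwarded requests
        self.log = []

    def handle(self, user, request, now):
        """Forward the request and return the backend result, or None if refused."""
        if user not in self.allowed_users:
            self.log.append((now, user, request, "denied"))
            return None
        window = self.recent[user]
        # Drop timestamps that have fallen out of the sliding window.
        while window and now - window[0] >= self.window_seconds:
            window.popleft()
        if len(window) >= self.max_requests:
            self.log.append((now, user, request, "rate-limited"))
            return None
        window.append(now)
        self.log.append((now, user, request, "forwarded"))
        return self.backend(request)
```

Every request, including refused ones, leaves a log entry, which is part of what makes the proxy useful when the backend itself offers only coarse controls.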
Security usually requires tradeoffs in usability. Some examples:
Granular security configuration is powerful, but it increases complexity, which can get out of hand if left unchecked. A good litmus test is being able to quickly answer, off the top of your head, whether a given user should have access to something and which other users should have access to it.
Collaboration. For example, giving developers access to almost all source code is a risk. However, it also has the benefit of improved learning via cross-pollination. Grouping access by sensitivity may help to strike a balance here.
The quality of the data used for authentication decisions is important. Low quality can lead to incorrect decisions, causing losses and/or friction.
Security measures may affect productivity. Users are typically not happy with additional hoops they need to jump through as it slows down actual business processes, so try to reduce friction. “Unreasonable” access denials tend to cause frustration, so it is a good idea to provide a fast track for their resolution.
Developer complexity will also grow, as people will need to be educated and trained in how to use security controls effectively.