Chapter 3: Case Study: Safe Proxies

Safe proxies are used at Google to limit the ability of privileged administrators to impact the production environment. They are also used to review and approve commands issued to servers, VMs, etc. Safe proxies enable audited and controlled access while protecting automated systems from human mistakes. Zero Touch Prod is a project at Google that pursues this approach and operational philosophy. [Heh, a typical Google approach.]

Clients are configured to always talk to servers via proxies, and services are configured to accept commands only via proxies. This guarantees that the proxy has a chance to enforce access control lists (ACLs). As centralised “seams”, proxies have multiple benefits:

  • enforce multiparty authorization where necessary
  • audit activity: who did what, when and why
  • rate-limit operations to restrict the impact of mistakes
  • interface with third-party, closed-source systems
  • apply continuous improvements and security fixes to all systems at once
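
The first three benefits can be sketched in a few lines of Python. This is only an illustration and not Google's implementation: the SafeProxy class, the ACL format (user mapped to a set of permitted actions) and the per-user sliding-window rate limit are all assumptions made up for the example.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SafeProxy:
    """Hypothetical proxy sitting between clients and a backend service."""
    acl: dict          # user -> set of permitted actions (made-up format)
    max_calls: int     # calls allowed per user within one window
    window_s: float    # length of the rate-limit window, in seconds
    audit_log: list = field(default_factory=list)
    _calls: dict = field(default_factory=dict)  # user -> recent call timestamps

    def handle(self, user, action, reason, now=None):
        now = time.time() if now is None else now
        decision = self._decide(user, action, now)
        # Audit every request, denied ones included: who, what, when, why.
        self.audit_log.append(
            {"who": user, "what": action, "when": now,
             "why": reason, "decision": decision}
        )
        return decision

    def _decide(self, user, action, now):
        if action not in self.acl.get(user, set()):
            return "denied: not in ACL"
        # Keep only the calls that still fall inside the current window.
        recent = [t for t in self._calls.get(user, []) if now - t < self.window_s]
        if len(recent) >= self.max_calls:
            self._calls[user] = recent
            return "denied: rate limit"
        recent.append(now)
        self._calls[user] = recent
        # A real proxy would forward the request to the backend here.
        return "allowed"
```

Because every request passes through handle(), the ACL check, the audit record and the rate limit apply to all backends behind the proxy without touching the backends themselves.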

This approach has its share of downsides as well:

  • increased operational cost
  • single point of failure; adding redundancy increases costs further
  • ACLs may contain errors themselves
  • an adversary may take control of the single proxy instead of multiple independent services; this can be mitigated by forwarding the caller's credentials instead of turning the proxy into a superuser
  • the proxy introduces friction, as people may dislike mandatory indirect interaction

At Google, most administrative tasks are performed via some sort of command-line interface (CLI). There are many CLI tools around, and some of them may be dangerous (e.g., they can shut down servers). Instrumenting each and every tool is hard. Google Tool Proxy is an RPC service that accepts and executes command lines like a shell, after applying the necessary policies, logging and auditing, and enforcing proper authorization. It has fine-grained configuration for access control. [Reminds me of sudo, actually.]
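
To make the idea of a per-command policy concrete, here is a minimal Python sketch. It is not the real Tool Proxy configuration: the POLICY table, the Rule type, the "serverctl" tool and the group names are all hypothetical. It only shows the shape of the check: parse the command line, look up a rule, verify the caller's group, and require a second person for dangerous commands.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    allowed_groups: frozenset     # groups whose members may run the command
    needs_approval: bool = False  # dangerous commands need a second person


# Hypothetical policy keyed by (tool, subcommand); anything unlisted is denied.
POLICY = {
    ("serverctl", "status"): Rule(frozenset({"sre", "dev"})),
    ("serverctl", "shutdown"): Rule(frozenset({"sre"}), needs_approval=True),
}


def authorize(command_line, caller, caller_groups, approver=None):
    """Return (allowed, reason) for a command line submitted to the proxy."""
    argv = shlex.split(command_line)
    if len(argv) < 2:
        return False, "no tool or subcommand given"
    rule = POLICY.get((argv[0], argv[1]))
    if rule is None:
        return False, "command not covered by policy"
    if not rule.allowed_groups & set(caller_groups):
        return False, "caller is not in an allowed group"
    # Multi-party authorization: the approver must be someone else.
    if rule.needs_approval and approver in (None, caller):
        return False, "approval from a second person required"
    return True, "ok"
```

The sketch denies by default and does not check whether the approver is authorized to approve; a real policy would presumably need both that check and audit logging of every decision.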

Another idea is the breakglass mechanism: a controlled and audited way to circumvent automated checks. It comes in handy in emergencies, such as recovery after an outage, when the situation needs to be resolved quickly and some divergence from the usual protocol is unavoidable.
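
A breakglass path might look roughly like the Python sketch below. The function name, its parameters and the "breakglass" logger are invented for illustration. The point is only that the bypass is explicit, requires a stated reason, and always leaves a trace.

```python
import logging

logger = logging.getLogger("breakglass")


def run_with_policy(action, user, policy_allows, breakglass_reason=None):
    """Run `action` if policy allows it, or if the caller breaks the glass."""
    if policy_allows:
        return action()
    if not breakglass_reason:
        raise PermissionError(f"{user}: denied by policy")
    # The bypass is permitted but never silent: record who did it and why.
    logger.warning("BREAKGLASS by %s: %s", user, breakglass_reason)
    return action()
```

In this sketch anyone who supplies a reason can bypass the check; a real system would likely also restrict who may break the glass and alert someone whenever it happens.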