How to win with the AWS Well-Architected Framework
Over the years, we've refined our standards for product management, Scrum management, code quality, peer review, continuous deployment, and more. Each time we standardize a lesson of our own, we also review the latest standards from the broader ecosystem and try to incorporate them. For cloud best practices, we've been leaning into the AWS Well-Architected Framework (WAF) and continue to learn new approaches from it. Our clients are always striving for world-class products, so below is an overview of what the WAF is and how to apply it in a WAF Audit.
The AWS Well-Architected Framework consists of five pillars, each containing five to seven measurable design principles, plus six general design principles that apply across all of the pillars.
The Five Pillars
The Five Pillars are clear enough divisions that each could be assigned a leader or SME in your organization. Additionally, metrics applied to their related principles could easily roll up to an overall score.
- Operational Excellence: The ability to run and monitor systems to deliver business value and to continually improve supporting processes and procedures.
- Security: The ability to protect information, systems, and assets while delivering business value through risk assessments and mitigation strategies.
- Reliability: The ability of a system to recover from infrastructure or service disruptions, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues.
- Performance Efficiency: The ability to use computing resources efficiently to meet system requirements, and to maintain that efficiency as demand changes and technologies evolve.
- Cost Optimization: The ability to run systems to deliver business value at the lowest price point.
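The idea of rolling per-principle metrics up into an overall score can be sketched in a few lines of Python. This is only an illustration, not part of the framework itself: the 0-100 scale, the equal weighting, and the sample numbers in `SAMPLE_AUDIT` are all assumptions made for the example.

```python
from statistics import mean

# Hypothetical audit results: pillar -> {principle: score on an assumed 0-100 scale}.
SAMPLE_AUDIT = {
    "Operational Excellence": {"Perform operations as code": 80, "Anticipate failure": 60},
    "Security": {"Enable traceability": 90},
}


def pillar_score(principle_scores):
    """Average the principle scores within a single pillar."""
    return mean(principle_scores.values())


def overall_score(audit):
    """Average the pillar scores, weighting every pillar equally (an assumption)."""
    return mean([pillar_score(scores) for scores in audit.values()])


if __name__ == "__main__":
    for pillar, scores in SAMPLE_AUDIT.items():
        print(pillar, pillar_score(scores))
    print("Overall", overall_score(SAMPLE_AUDIT))
```

Equal weighting keeps the sketch simple; an organization could just as well weight pillars by business priority, with each pillar's leader or SME owning the inputs.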
The Six General Design Principles
The six principles below apply horizontally across all five pillars, and should normally be the first round completed during a WAF Audit. The six principles are:
- Stop guessing your capacity needs: When you make a capacity decision before you deploy a system, you might end up sitting on expensive idle resources or dealing with the performance implications of limited capacity. With cloud computing, these problems can go away. You can use as much or as little capacity as you need, and scale up and down automatically.
- Test systems at production scale: In the cloud, you can create a production-scale test environment on demand, complete your testing, and then decommission the resources. Because you only pay for the test environment when it's running, you can simulate your live environment for a fraction of the cost of testing on premises.
- Automate to make architectural experimentation easier: Automation allows you to create and replicate your systems at low cost and avoid the expense of manual effort. You can track changes to your automation, audit the impact, and revert to previous parameters when necessary.
- Allow for evolutionary architectures: In a traditional environment, architectural decisions are often implemented as static, one-time events, with a few major versions of a system during its lifetime. As a business and its context continue to change, these initial decisions might hinder the system's ability to deliver changing business requirements. In the cloud, the capability to automate and test on demand lowers the risk of impact from design changes. This allows systems to evolve over time so that businesses can take advantage of innovations as a standard practice.
- Drive architectures using data: In the cloud you can collect data on how your architectural choices affect the behavior of your workload. This lets you make fact-based decisions on how to improve your workload. Your cloud infrastructure is code, so you can use that data to inform your architecture choices and improvements over time.
- Improve through game days: Test how your architecture and processes perform by regularly scheduling game days to simulate events in production. This will help you understand where improvements can be made and can help develop organizational experience in dealing with events.
Note: The source of the above, along with a deeper dive, is available in the AWS Well-Architected whitepapers.
The Six Principles of Operational Excellence
- Perform operations as code: The secret to reducing human error in cloud operations.
- Annotate documentation: Autogeneration of documentation (including architectural diagrams) keeps humans and systems on the same page.
- Make frequent, small, reversible changes: Greater automation brings a much faster and more contained rollback methodology.
- Refine operations procedures frequently: Operations as code extends to governance and policies as well. This level of process control isn't possible with human-only approaches.
- Anticipate failure: The “pre-mortem” is a mandatory exercise that we employ to exhaust all failure modes before we start.
- Learn from all operational failures: Recurring retrospectives, just as in Scrum, are the best ongoing forum for continuous improvement.
The Seven Principles of Security
- Implement a strong identity foundation: The foundation is built on the classics: the principle of least privilege, strict separation of duties, elimination of long-term credentials, and centralized privilege management.
- Enable traceability: This is one of my favorite features of cloud technology. The ability to automate alerts on every imaginable transaction, and store those transactions for the long haul, has transformed infrastructure security.
- Apply security at all layers: This is the 'defense-in-depth' approach, with security practice applied beyond the outer layers and deep into the ecosystem (especially in places you think are unreachable!).
- Automate security best practices: DevSecOps is the new buzzword, and it carries a lot of potential with enterprise adoption of cloud technologies.
- Protect data in transit and at rest: Much like government data, each transaction and data type should be classified into sensitivity levels, and those levels coded into every operation.
- Keep people away from data: Just as DevOps led to no-click deployments, high-standard cloud security calls for a no-eyes data policy.
- Prepare for security events: As with fire drills, role-play and test the human side of a security event on a regular basis.
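To make the 'Protect data in transit and at rest' principle more concrete, here is a minimal Python sketch of sensitivity levels that are coded into an operation. The four level names, the rule that anything above public must be encrypted, and the `read_record` helper are all hypothetical choices for illustration, not an AWS API or a prescribed classification scheme.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    """Hypothetical classification levels, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3


def requires_encryption(level):
    """Assumed policy: everything above PUBLIC must be encrypted in transit and at rest."""
    return level >= Sensitivity.INTERNAL


def read_record(record, clearance):
    """Return the payload only if the caller's clearance covers the record's level."""
    if clearance < record["sensitivity"]:
        raise PermissionError(
            f"clearance {clearance.name} is below {record['sensitivity'].name}"
        )
    return record["payload"]


if __name__ == "__main__":
    record = {"sensitivity": Sensitivity.CONFIDENTIAL, "payload": "quarterly figures"}
    print(requires_encryption(record["sensitivity"]))
    print(read_record(record, Sensitivity.RESTRICTED))
```

Because the level travels with the record, every operation that touches the data can check it the same way, instead of relying on people remembering which datasets are sensitive.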
The Five Principles of Reliability
- Test recovery procedures: Hold a 'break it' day, where cloud engineers intentionally break a cloned environment and then attempt to recover it. It's fun and often very insightful.
- Automatically recover from failure: With clever automation, rollovers and even predicted failures can trigger recovery code before the first human engineer answers the support call.
- Scale horizontally to increase aggregate system availability: This is a core value proposition of cloud technologies - the ability to geographically distribute risk.
- Stop guessing capacity: Along with classic performance and load testing, leaning into autoscaling and other bursting techniques is key to removing the guesswork of capacity.
- Manage change in automation: Infrastructure changes, when architected well, should read like an edit to a wiki article. All interested parties have clear documentation on what changed, who changed it, and when it was deployed.
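As one small illustration of 'Automatically recover from failure', the sketch below retries an operation with exponential backoff when it hits a transient error. It is only one building block of automated recovery, and the `TransientError` class, the `call_with_retries` helper, and the default delays are hypothetical names and values chosen for this example rather than anything provided by AWS.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying (e.g. a network blip)."""


def call_with_retries(operation, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying on TransientError with exponential backoff.

    The delay doubles after each failed attempt: base_delay, 2x, 4x, and so on.
    The sleep function is injectable so the logic can be tested without waiting.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # Out of attempts: surface the failure to the caller.
            sleep(base_delay * 2 ** (attempt - 1))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky():
        # Fails twice, then succeeds, to simulate a transient disruption.
        state["calls"] += 1
        if state["calls"] < 3:
            raise TransientError("temporary outage")
        return "recovered"

    print(call_with_retries(flaky, base_delay=0.01))
```

In practice a retry like this is usually paired with alerting, so that a failure that outlasts the retries still reaches a human engineer.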
The Five Principles of Performance Efficiency
- Democratize advanced technologies: The key tactic for IT teams partnering with the business is to enable direct use of specific cloud services.
- Go global in minutes: We often use minutes as the KPI for several cloud services. Organizational efficiency measured in minutes is a cloud sweet spot.
- Use serverless architectures: This is a large driver of modern software design. Replacing fat middleware layers with small serverless functions lowers costs and substantially reduces regressions.
- Experiment more often: Building and tearing down performance test environments is cheap and quick with cloud technologies. Developers and cloud engineers can easily run micro-experiments in their normal Scrum schedules.
- Mechanical sympathy: A hallmark here is thinking outside the relational database assumption and incorporating a variety of storage and data streaming solutions to solve today's challenges.
The Five Principles of Cost Optimization
- Adopt a consumption model: Think in terms of meters, much like an electricity bill. Forecast costs by looking at consumption needs first, not target spends.
- Measure overall efficiency: There are several KPIs to apply here, including finding a profit model based on labor needs and consumption trends.
- Stop spending money on data center operations: There are few (very few) organizations in the world that should be investing in their own data centers anymore.
- Analyze and attribute expenditure: Cost portfolios, expense sharing, and tagged reporting provide an immense amount of cost control and structure to operations.
- Use managed services to reduce cost of ownership: Companies like Metal Toad spend a substantial amount of time perfecting their cloud practice, and know the quickest route to cost optimization.
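The 'Adopt a consumption model' and 'Analyze and attribute expenditure' principles lend themselves to a small worked example. The Python sketch below forecasts cost from metered usage and then attributes line items to owners by tag. The meter names, unit prices, and the `team` tag key are made-up values for illustration and do not reflect real AWS pricing or billing formats.

```python
from collections import defaultdict


def forecast_cost(usage, unit_prices):
    """Forecast spend from expected consumption: sum of quantity x unit price per meter."""
    return sum(quantity * unit_prices[meter] for meter, quantity in usage.items())


def attribute_by_tag(line_items, tag_key):
    """Total the cost of each line item under the value of the given tag.

    Items missing the tag are grouped under "untagged" so gaps in tagging stay visible.
    """
    totals = defaultdict(float)
    for item in line_items:
        owner = item["tags"].get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical monthly consumption and prices.
    usage = {"compute_hours": 100, "storage_gb": 50}
    prices = {"compute_hours": 0.10, "storage_gb": 0.02}
    print(forecast_cost(usage, prices))

    items = [
        {"cost": 5.0, "tags": {"team": "web"}},
        {"cost": 7.5, "tags": {"team": "web"}},
        {"cost": 2.5, "tags": {}},
    ]
    print(attribute_by_tag(items, "team"))
```

Starting from consumption rather than a target spend means the forecast changes only when expected usage or unit prices change, which is the meter-style thinking described above.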
AWS Well-Architected Labs
The WAF Labs are a repository of code and documentation that helps developers and cloud engineers get started quickly with the framework. They are also a must-have resource for vendors seeking to join the WAF Partner Program.
AWS Well-Architected Tool
The WAF Tool provides a workload-specific checklist system to verify compliance. This is an easy way for management to set a goal across workloads, and a recurring step in the ongoing hygiene of a cloud practice.
AWS Well-Architected Partner Program
The WAF Partner Program provides a badge to Advanced or Premier tier vendors who conduct formal audits and reviews of client systems several times per year. A list of these vendors can be found in the AWS Partner Finder.
Amazon Leadership Principles
One interesting note is the relationship between the WAF's five pillars and the Amazon Leadership Principles. For example, the principle of 'Ownership' is exemplified by the 'democratize advanced technologies' principle, and 'Invent and Simplify' is represented by several of the principles above that recommend experimental environments, small changes, and automated controls. The alignment of the AWS Well-Architected Framework with the company's leadership principles is one of the reasons I love working in this ecosystem.