To keep up with demand during its record-breaking Prime Day shopping event this July, Amazon started to prepare its infrastructure a full year in advance, making sure that it was ready to scale and recover in the event of a database failure.
In a blog post on Thursday, AWS evangelist Jeff Barr shared some best practices and statistics around the AWS cloud usage that supported Prime Day 2017, Amazon’s third annual Prime Day, a 30-hour shopping event in which Prime members get access to hundreds of thousands of deals. Amazon said that it received more orders on Prime Day than on Black Friday and Cyber Monday, though it didn’t provide specific numbers.
Best Practices in Preparing for Prime Day
Barr said that the AWS teams began preparing for this year’s Prime Day by reviewing best practices and lessons learned from previous years, then moved on to implementation and stress testing. Amazon also details some of its best practices in a new white paper on planned infrastructure events.
Two best practices Barr highlights in his blog post are auditing and GameDay. Auditing requires each team to respond to a series of detailed questions designed to determine its readiness, including technical questions around time to recovery after a database failure, and operational questions around schedules for on-call personnel and points of contact.
GameDay validates capacity planning and preparation, and verifies all necessary operational practices are in place and work as expected, Barr said.
“It introduces simulated failures and helps to train the team to identify and quickly resolve issues, building muscle memory in the process. It also tests failover and recovery capabilities, and can expose latent defects that are lurking under the covers. GameDays help teams to understand scaling drivers (page views, orders, and so forth) and gives them an opportunity to test their scaling practices.”
In the white paper, AWS shares a timeline of activities in the four weeks leading up to an infrastructure event, which it defines as "a business-driven, anticipated, and scheduled event window during which it is business critical to maintain a highly responsive, highly scalable, and fault-tolerant web service," such as a product launch or a marketing-driven event like Prime Day:
• Nominate a team to drive planning and engineering for the infrastructure event.
• Conduct meetings between stakeholders to understand the parameters of the event (scale, duration, time, geo reach, affected workloads) and the success criteria.
• Engage any downstream or upstream partners and vendors.
• Review architecture and make adjustments as needed.
• Conduct operational review; make adjustments as needed.
• Follow best practices described in this paper and in footnoted references.
• Identify risks and develop mitigation plans.
• Develop a planned event runbook.
• Review all cloud vendor services that require scaling based on expected load.
• Check service limits, and increase limits as needed.
• Set up monitoring dashboard and alerts on defined thresholds.
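The last two checklist items, checking service limits and alerting on defined thresholds, can be illustrated with a minimal Python sketch. It is not taken from the AWS white paper: the service names, the 20 percent headroom figure, and the metric names and thresholds are all hypothetical, and a real setup would read limits and metrics from the cloud provider rather than from hard-coded values.

```python
from dataclasses import dataclass


@dataclass
class ServiceLimit:
    name: str              # hypothetical service or resource name
    current_limit: float   # limit currently granted by the provider
    expected_peak: float   # forecast peak usage during the event


def limits_to_raise(limits, headroom=0.2):
    """Return names of services whose limit lacks the requested headroom over the forecast peak."""
    return [
        limit.name
        for limit in limits
        if limit.current_limit < limit.expected_peak * (1 + headroom)
    ]


def breached_alerts(metrics, thresholds):
    """Return, in sorted order, the metrics whose latest value is at or above its threshold."""
    return sorted(
        name
        for name, value in metrics.items()
        if name in thresholds and value >= thresholds[name]
    )


# Hypothetical inputs, for illustration only.
limits = [
    ServiceLimit("compute_instances", current_limit=1000, expected_peak=900),
    ServiceLimit("block_storage_tb", current_limit=500, expected_peak=300),
]
print(limits_to_raise(limits))  # ['compute_instances']

latest = {"cpu_percent": 85.0, "error_rate": 0.002}
thresholds = {"cpu_percent": 80.0, "error_rate": 0.01}
print(breached_alerts(latest, thresholds))  # ['cpu_percent']
```

Here the compute limit of 1,000 falls short of the forecast peak of 900 plus 20 percent headroom, so it is flagged for an increase ahead of the event, while the storage limit is left alone.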
Amazon EBS Usage Up 40 Percent over Last Prime Day
AWS teams used Amazon Elastic Block Store (EBS), Amazon DynamoDB, AWS CloudFormation, AWS CloudTrail, and AWS Config to support Prime Day, and reported a significant increase in Amazon EBS usage in particular over last year.
According to log files and dashboards, use of Amazon EBS grew 40 percent year-over-year, with aggregate data transfer up 50 percent to 52 petabytes for the day. Total I/O requests grew 30 percent year-over-year to 835 million. With the elasticity of EBS, the team was able to “ramp down on capacity after Prime Day concluded instead of being stuck with it,” Barr said.
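For context, the year-over-year percentages imply rough figures for the previous Prime Day. The short Python sketch below backs them out; the resulting baselines are derived here from the numbers quoted above and were not reported by Amazon, and the function name is only illustrative.

```python
def implied_baseline(current, growth_pct):
    """Back out last year's value from this year's value and the stated percentage growth."""
    return current / (1 + growth_pct / 100)


# Figures quoted above: 52 PB is a 50 percent increase; 835 million is a 30 percent increase.
prior_transfer_pb = implied_baseline(52, 50)
prior_io_millions = implied_baseline(835, 30)

print(f"{prior_transfer_pb:.1f} PB")        # 34.7 PB
print(f"{prior_io_millions:.0f} million")   # 642 million
```

In other words, the stated growth rates correspond to roughly 35 petabytes of data transfer and about 640 million I/O requests on the previous Prime Day.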
Throughout Prime Day, Amazon DynamoDB requests from Alexa, the Amazon.com sites, and the Amazon fulfillment centers reached 3.4 trillion, peaking at 12.9 million per second.
In addition, AWS teams created nearly 31,000 AWS CloudFormation stacks for Prime Day to provision additional AWS resources.