Lessons from Disney+ Launch
We have been waiting, like many people, for the launch of Disney+. We were genuinely disappointed by the outages and management issues that were covered so widely in the press.
From our perspective, the outages could have been caused by a combination of three things:
- Infrastructure failing to scale on demand during load.
- Architecture of the platform causing bottlenecks.
- Business failing to anticipate the demand for the launch.
The assumption is that Disney+ is running on Azure (compared to Netflix, which runs on AWS). Regardless, the infrastructure failed to scale to handle the demand in ways that truly should not happen in 2020. A robust infrastructure should be able to grow to meet the demand placed on it, and the people running it should have plenty of tools to warn them of a problem in advance.
We are humble and curious about these types of challenges, and offer a few gotchas that we respect, such as the architecture causing a bottleneck that prevents scaling (we’ll touch on this in a second), non-self-healing infrastructure (which is unacceptable for this product), or a resource limit imposed by the cloud provider (which is worthy of contract termination). To detect and fix these, a robust load test pushes the scenario by periodically killing key resources. Nothing is cooler than running a load test, killing half your servers, and watching them recover.
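To make the "kill half your servers" idea concrete, here is a toy simulation of what that test checks. It is only a sketch: the `Fleet` class, the launch rate, and the tick-based loop are hypothetical stand-ins for a real auto-scaling group and a real chaos tool, not anything Disney+ or a specific cloud provider uses.

```python
class Fleet:
    """Hypothetical model of a server group with a desired size."""

    def __init__(self, desired):
        self.desired = desired
        self.healthy = desired

    def kill(self, fraction):
        """Terminate a fraction of the healthy servers, as a chaos test would."""
        killed = int(self.healthy * fraction)
        self.healthy -= killed
        return killed

    def reconcile(self, max_launch_per_tick):
        """One self-healing pass: launch replacements, up to a rate limit."""
        missing = self.desired - self.healthy
        launched = min(missing, max_launch_per_tick)
        self.healthy += launched
        return launched


def ticks_to_recover(fleet, max_launch_per_tick, limit=100):
    """Return how many passes it takes to heal, or None if it never does."""
    if fleet.healthy == fleet.desired:
        return 0
    for tick in range(1, limit + 1):
        fleet.reconcile(max_launch_per_tick)
        if fleet.healthy == fleet.desired:
            return tick
    return None


# Self-healing fleet: kill half, then watch it recover.
healing = Fleet(desired=20)
killed = healing.kill(0.5)
healing_ticks = ticks_to_recover(healing, max_launch_per_tick=4)

# Non-self-healing fleet (or one stuck behind a provider limit of zero).
stuck = Fleet(desired=20)
stuck.kill(0.5)
stuck_ticks = ticks_to_recover(stuck, max_launch_per_tick=0)

print(killed, healing_ticks, stuck_ticks)
```

In a real load test the interesting numbers are the same two: how long recovery takes while traffic is still arriving, and whether it happens at all without a person stepping in.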
Platform architecture is important because you want to make sure you aren’t creating bottlenecks for users to go through. The Theory of Constraints says that any scaling you do that isn’t at the most restrictive point won’t yield any improvements. These bottlenecks could ideally be identified by running a load test against real-world scenarios, including beyond-expectation launches. Once detected, the architecture could be changed and corrected.
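The Theory of Constraints point can be shown with a few lines of arithmetic. In this sketch, the stage names and requests-per-second figures are made up for illustration; the only claim is that a chain of services can serve no more than its slowest stage.

```python
def pipeline_throughput(capacities):
    """Sustained requests/second for a serial pipeline: the slowest stage."""
    return min(capacities.values())


def bottleneck(capacities):
    """Name of the stage that currently limits the whole pipeline."""
    return min(capacities, key=capacities.get)


def scale(capacities, stage, factor):
    """Return a copy of the capacities with one stage scaled up."""
    scaled = dict(capacities)
    scaled[stage] = scaled[stage] * factor
    return scaled


# Hypothetical stages a request passes through, in requests/second.
stages = {"cdn": 50_000, "auth": 8_000, "catalog": 20_000, "playback": 30_000}

base = pipeline_throughput(stages)                               # limited by auth
after_cdn = pipeline_throughput(scale(stages, "cdn", 2))         # no improvement
after_auth = pipeline_throughput(scale(stages, "auth", 3))       # limit moves on
next_bottleneck = bottleneck(scale(stages, "auth", 3))

print(base, after_cdn, after_auth, next_bottleneck)
```

Doubling the CDN changes nothing because the login service is still the constraint. Scaling the login service helps, but only until the next-slowest stage takes over, which is why the load test has to be repeated after every fix.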
Launching new products is a tricky business. We need to balance cost against projected interest in the product. We don’t want to over-provision and spend more money than necessary, and depending on the product and industry, gauging interest at any given moment can be tricky. For instance, a product might have one million people pre-register, but the initial interest spikes to two million. Adding to the challenge, platform problems often hit social media and can spiral out of control. What do you do? This is probably the hardest one to fix. Market surveys, pre-registration, and historic performance of platforms all help. But the best fix is to ensure your infrastructure and platform are designed to scale and handle the load.
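A rough way to see the cost trade-off is to size the fleet for both the forecast and the surprise. The sketch below reuses the one million and two million figures from the example above; the share of users online at once and the users each instance can serve are assumptions chosen only to make the arithmetic easy to follow.

```python
def instances_needed(users, concurrent_percent, users_per_instance):
    """Instances required at peak, rounded up, using integer math only."""
    peak_users_times_100 = users * concurrent_percent
    # Ceiling division without floating point.
    return -(-peak_users_times_100 // (100 * users_per_instance))


# Assumptions for illustration: 10% of users online at once,
# and each instance comfortably serving 500 concurrent users.
CONCURRENT_PERCENT = 10
USERS_PER_INSTANCE = 500

forecast = instances_needed(1_000_000, CONCURRENT_PERCENT, USERS_PER_INSTANCE)
surge = instances_needed(2_000_000, CONCURRENT_PERCENT, USERS_PER_INSTANCE)

# Provisioning for the forecast leaves a gap that scaling has to close on launch day.
shortfall = surge - forecast

print(forecast, surge, shortfall)
```

Paying for the surge number up front is the over-provisioning cost; relying on the forecast number means the shortfall has to be covered by scaling that has already been proven under load.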
The cloud puts the ability to do big product launches into the hands of every company. Because of that, every time there is a rocky launch, we re-evaluate how Metal Toad handles product launches for large, high-profile events to ensure we aren’t missing something. So far our checklist is holding up against a range of products, from CES launches to the Emmys, but we’re continuing to watch the story around Disney+ to see if there are other lessons to learn and improvements to make.