Streaming fail

Lessons from Disney+ Launch

Like many people, we had been waiting for the launch of Disney+, and we were genuinely disappointed by the outages and management issues that were covered so widely in the press.

From our perspective, the outages could have been caused by a combination of three things:

  1. Infrastructure failing to scale on demand during load. 
  2. Architecture of the platform causing bottlenecks. 
  3. Business failing to anticipate the demand for the launch. 

The assumption is that Disney+ is running on Azure (compared to Netflix, which runs on AWS). Regardless, the infrastructure failed to scale to handle the demand in ways that truly should not happen in 2019. A robust infrastructure should be able to grow to handle the demand placed on it, and the people involved should have plenty of tools to know about a problem in advance.

We are humble and curious about these types of challenges, and offer a few gotchas that we respect: the architecture causing a bottleneck that prevents scaling (we’ll touch on this in a second), non-self-healing infrastructure (which is unacceptable for this product), or a cloud-provider-imposed resource limit (which is worthy of contract termination). To detect and fix these, a robust load test pushes the scenario by periodically killing key resources. Nothing is cooler than running a load test, killing half your servers, and watching them recover.
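
To make the "kill half your servers and watch them recover" idea concrete, here is a minimal sketch in Python. It is a toy simulation only, not a real cloud API: the `ServerPool` class, the `launches_per_tick` limit, and the `ticks_to_recover` helper are all hypothetical names invented for illustration. The point it tries to show is the difference between a pool with a working replacement loop and one without.

```python
import random


class ServerPool:
    """Toy model of a self-healing server group (hypothetical, not a cloud API)."""

    def __init__(self, desired, launches_per_tick):
        self.desired = desired                      # target number of healthy servers
        self.launches_per_tick = launches_per_tick  # assumed replacement rate limit
        self.healthy = set(range(desired))
        self._next_id = desired

    def kill(self, fraction, rng):
        """Terminate a random fraction of the healthy servers."""
        count = int(len(self.healthy) * fraction)
        victims = rng.sample(sorted(self.healthy), count)
        self.healthy -= set(victims)
        return victims

    def reconcile(self):
        """One control-loop tick: launch replacements, rate-limited per tick."""
        missing = self.desired - len(self.healthy)
        for _ in range(min(missing, self.launches_per_tick)):
            self.healthy.add(self._next_id)
            self._next_id += 1


def ticks_to_recover(pool, fraction, max_ticks, seed=0):
    """Kill servers, then count ticks until the pool is back at its desired size.

    Returns None if the pool has not recovered within max_ticks.
    """
    pool.kill(fraction, random.Random(seed))
    for tick in range(1, max_ticks + 1):
        pool.reconcile()
        if len(pool.healthy) == pool.desired:
            return tick
    return None


healing = ServerPool(desired=20, launches_per_tick=4)
print(ticks_to_recover(healing, fraction=0.5, max_ticks=10))  # 3

broken = ServerPool(desired=20, launches_per_tick=0)
print(ticks_to_recover(broken, fraction=0.5, max_ticks=10))   # None
```

In a real load test the "ticks" would be wall-clock time and the pass criterion would be a recovery deadline measured while traffic is still flowing. The second pool stands in for non-self-healing infrastructure: nothing replaces the lost servers, so the test never passes.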

Platform architecture is important because you want to make sure you aren’t creating bottlenecks for users to go through. The Theory of Constraints says that any scaling you do that isn’t at the most restrictive point won’t yield any improvements. These bottlenecks could ideally be identified by running a load test against real-world scenarios, including beyond-expectation launches. Once detected, the architecture could be changed and corrected.
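
The Theory of Constraints point can be illustrated with a few lines of Python. This is a simplified sketch that assumes every request passes through each stage in series, so overall throughput is capped by the slowest stage. The stage names and capacity numbers are made up for illustration and say nothing about how Disney+ is actually built.

```python
def pipeline_throughput(capacities):
    """Requests per second a serial pipeline can sustain: its slowest stage."""
    return min(capacities.values())


def bottleneck(capacities):
    """Name of the stage with the lowest capacity."""
    return min(capacities, key=capacities.get)


# Hypothetical stage capacities in requests per second (illustrative only).
stages = {"cdn": 50_000, "auth": 4_000, "catalog": 20_000, "playback": 15_000}

baseline = pipeline_throughput(stages)

# Doubling a stage that is not the constraint changes nothing.
scaled_wrong = dict(stages, playback=30_000)

# Doubling the constraint helps, until the next constraint takes over.
scaled_right = dict(stages, auth=8_000)

print(bottleneck(stages), baseline)       # auth 4000
print(pipeline_throughput(scaled_wrong))  # 4000
print(pipeline_throughput(scaled_right))  # 8000
```

A load test plays the role of `bottleneck()` here: it tells you which stage to scale or redesign first.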

Launching new products is a tricky business. We need to balance cost with projected interest in the product. We don’t want to over-provision and spend more money than necessary, and depending on the product and industry, gauging interest at any given moment can be tricky. For instance, a product might have one million people pre-sign up, but the initial interest spikes to two million. Adding to the challenge, platform problems often hit social media and can spiral out of control. What do you do? This is probably the hardest one to fix. Market surveys, pre-registration, and historic performance of platforms all help. But the best fix is to ensure your infrastructure and platform are designed to scale and handle the load.
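
As a back-of-the-envelope sketch of that gap, the Python below compares the servers needed for the one million pre-sign-ups with the servers needed for a two million spike. The concurrency fraction and per-server capacity are assumptions picked purely for illustration, as is the `servers_needed` helper.

```python
import math


def servers_needed(signups, concurrent_fraction, users_per_server):
    """Servers required if a given fraction of sign-ups are online at once."""
    concurrent_users = signups * concurrent_fraction
    return math.ceil(concurrent_users / users_per_server)


# Assumed values, not measurements: 25% of users online at peak,
# and 500 concurrent users per server.
CONCURRENT_FRACTION = 0.25
USERS_PER_SERVER = 500

forecast = servers_needed(1_000_000, CONCURRENT_FRACTION, USERS_PER_SERVER)
spike = servers_needed(2_000_000, CONCURRENT_FRACTION, USERS_PER_SERVER)
shortfall = spike - forecast

print(forecast, spike, shortfall)  # 500 1000 500
```

Under these assumptions, a fleet sized for the forecast is short by as many servers as it started with. Either that capacity is paid for up front, or the platform has to add it on demand during the launch.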

The cloud puts the ability to do big product launches into the hands of every company. Because of that, every time there is a rocky launch, we re-evaluate how Metal Toad handles product launches for large, high-profile events to ensure we aren’t missing something. So far our checklist is holding up against a range of products from CES launches to the Emmys, but we’re continuing to watch the story around Disney+ to see if there are other lessons to learn and improvements to make.

Date posted: November 15, 2019

About the Author

Nathan Wilkerson, VP of Engineering

Nathan started building computers, programming, and networking with a home IPX network at age 13. Since then he has had a love of all things computer, working in programming, system administration, DevOps, and cloud computing. Over the years he has enriched his knowledge of computers with hands-on experience and by earning his AWS Certified Solutions Architect – Professional certification.

Recently, Nathan has transitioned to a Cloud Operations Manager role. He helps clients and internal teams interface with the Cloud Team using the best practices of Kanban to ensure a speedy response and resolution to tickets.
