Mean Time Between Loss of Sleep

Jan 26, 2018

“MTBLS”: I first encountered this phrase on a New Relic blog. It's a half-joking reference to a concept used by reliability engineers, Mean Time Between Failures (MTBF). I was intrigued, though, and thought it would be an interesting metric to track.

We have high-resolution data about our machines' health – down to the smallest minutiae – but precious little about the health of our people.

Sleep deprivation has been implicated in the recent fatal collisions of the Navy destroyers John S. McCain and Fitzgerald. It causes frightening increases in medical errors. Analyses of many historic disasters, including Chernobyl and the Challenger explosion, have named sleep loss as a factor.

Medical research has associated the cumulative long-term effects of sleep loss and sleep disorders with a wide range of deleterious health consequences, including an increased risk of hypertension, diabetes, obesity, depression, heart attack, and stroke.

Losing sleep is bad.

Using my favorite data visualization tools, SciPy and Jupyter, I was able to generate a plot from PagerDuty API data.

The plot counts consecutive runs of nights with uninterrupted sleep; multiple incidents in the same night don't change the value. A value of 1 means the on-call engineer is woken up at least once every night. The chart is smoothed with a one-sided Hann window, so that the most recent interruptions are weighted more heavily.
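For readers who want to see the arithmetic, here is a minimal sketch of how such a metric could be computed. It is not the code from the notebook, and several things in it are my assumptions: the function names, the definition of a "night" as 22:00 to 07:00, timestamps already converted to the on-call engineer's local time, and the reading of the metric as the number of nights from one interrupted night to the next (so that 1 means woken every night). The smoothing uses the rising half of NumPy's `hanning` window as one plausible take on "one-sided".

```python
from datetime import datetime, timedelta

import numpy as np


def interrupted_nights(incident_times, night_start_hour=22, night_end_hour=7):
    """Return the sorted, distinct nights that had at least one incident.

    Assumption: timestamps are naive datetimes in the engineer's local time,
    and a night runs from night_start_hour to night_end_hour the next morning.
    A night is labelled by the calendar date on which it began.
    """
    nights = set()
    for t in incident_times:
        if t.hour >= night_start_hour:
            nights.add(t.date())
        elif t.hour < night_end_hour:
            # Early-morning pages belong to the night that began the day before.
            nights.add(t.date() - timedelta(days=1))
        # Daytime incidents do not cost anyone sleep, so they are ignored.
    return sorted(nights)


def nights_between_interruptions(nights):
    """Gap in nights between consecutive interrupted nights (1 = every night)."""
    ordinals = np.array([d.toordinal() for d in nights])
    return np.diff(ordinals).astype(float)


def one_sided_hann_smooth(values, width=8):
    """Causal weighted average using the rising half of a Hann window.

    The newest sample in each window gets weight 1; older samples taper off.
    """
    values = np.asarray(values, dtype=float)
    # np.hanning(2*width + 1) peaks at index `width`; drop the leading zero.
    weights = np.hanning(2 * width + 1)[1:width + 1]
    smoothed = np.empty_like(values)
    for i in range(len(values)):
        window = values[max(0, i - width + 1):i + 1]
        w = weights[-len(window):]  # shorter window at the start of the series
        smoothed[i] = np.dot(window, w) / w.sum()
    return smoothed


if __name__ == "__main__":
    # Hypothetical incident timestamps, for illustration only.
    incidents = [
        datetime(2018, 1, 1, 23, 30),
        datetime(2018, 1, 2, 2, 0),    # same night as the page above
        datetime(2018, 1, 2, 14, 0),   # daytime, ignored
        datetime(2018, 1, 3, 3, 0),
        datetime(2018, 1, 6, 22, 15),
    ]
    gaps = nights_between_interruptions(interrupted_nights(incidents))
    print(one_sided_hann_smooth(gaps, width=2))
```

Because the window only looks backward, each smoothed point depends on past interruptions alone, which keeps the most recent value of the chart stable as new data arrives.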

Self-determination

While on-call is a fact of life for the managed cloud services we offer, I think there's room for improvement. One thing I have always believed strongly is that alarms must be actionable, and the team must have enough control over the affected system to meaningfully improve it in the future. There's nothing worse than being paged in the middle of the night over something you can't even fix! This may sound obvious, but I have seen many, many poorly configured monitoring systems over the years, and getting this right takes a lot of careful tuning. It also means the team needs the authority to disable alarms in areas where the business isn't willing to fund the reliability improvements that would silence them.

Try it out

If you want to experiment with your own data, download the Jupyter notebook:
