Mean Time Between Loss of Sleep


“MTBLS”: I first encountered this phrase on a New Relic blog. It's a half-joking reference to a concept used by reliability engineers, Mean Time Between Failures (MTBF). I was intrigued though, and thought it would be an interesting metric to track.

We have high-resolution data about our machines' health, down to the smallest minutiae, but precious little about the health of our people.

Sleep deprivation has been implicated in the recent fatal collisions of the Navy destroyers USS John S. McCain and USS Fitzgerald. It causes frightening increases in medical errors. Analyses of many historic disasters, including Chernobyl and the Challenger explosion, named sleep loss as a factor.

Medical research has associated the cumulative long-term effects of sleep loss and sleep disorders with a wide range of deleterious health consequences, including an increased risk of hypertension, diabetes, obesity, depression, heart attack, and stroke.

Losing sleep is bad.

Using my favorite data visualization toolkit, SciPy and Jupyter, I generated a plot from data pulled from the PagerDuty API.

The plot shows the length of each run of nights between sleep interruptions (multiple incidents in the same night don't change the value). A value of 1 means the on-call engineer is woken up at least once every night. The chart is smoothed with a one-sided Hann window, so that the most recent interruptions are weighted more heavily.
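For readers who want to see the mechanics, here is a minimal sketch of that calculation in Python with NumPy. It is not the code from the notebook. The function names, the 22:00 to 06:00 definition of "night", and the window width of 5 are all assumptions made for illustration, and the incident timestamps are assumed to have already been fetched from PagerDuty and converted to the engineer's local time.

```python
from datetime import datetime, timedelta

import numpy as np

# Assumed definition of "night": 22:00 up to (but not including) 06:00 local time.
NIGHT_START_HOUR = 22
NIGHT_END_HOUR = 6


def interrupted_nights(incident_times):
    """Return the sorted, de-duplicated nights that had at least one incident.

    A night is labelled by the calendar date on which it began, so several
    incidents in the same night collapse into a single entry.
    """
    nights = set()
    for t in incident_times:
        if t.hour >= NIGHT_START_HOUR:
            nights.add(t.date())
        elif t.hour < NIGHT_END_HOUR:
            # Early-morning pages belong to the night that began the day before.
            nights.add(t.date() - timedelta(days=1))
        # Daytime incidents cost no sleep and are ignored.
    return sorted(nights)


def nights_between_interruptions(nights):
    """Number of nights from each interrupted night to the next one.

    A value of 1 means the engineer was woken on consecutive nights.
    """
    ordinals = np.array([d.toordinal() for d in nights], dtype=int)
    return np.diff(ordinals)


def one_sided_hann_smooth(values, width=5):
    """Trailing weighted average using the rising half of a Hann window.

    The newest sample in each window gets weight 1.0 and older samples get
    progressively less, so recent interruptions dominate the smoothed value.
    """
    # Rising half of the window, skipping the leading zero weight.
    weights = np.hanning(2 * width + 1)[1:width + 1]
    values = np.asarray(values, dtype=float)
    smoothed = np.empty_like(values)
    for i in range(len(values)):
        window = values[max(0, i - width + 1):i + 1]
        w = weights[-len(window):]
        smoothed[i] = np.dot(window, w) / w.sum()
    return smoothed


# Hypothetical incident timestamps, for illustration only.
incidents = [
    datetime(2024, 1, 1, 23, 30),  # night of Jan 1
    datetime(2024, 1, 2, 2, 0),    # same night, does not change the value
    datetime(2024, 1, 2, 14, 0),   # daytime, ignored
    datetime(2024, 1, 3, 3, 0),    # night of Jan 2
    datetime(2024, 1, 6, 22, 15),  # night of Jan 6
]
nights = interrupted_nights(incidents)
gaps = nights_between_interruptions(nights)
mtbls = one_sided_hann_smooth(gaps)
```

The trailing window only looks backwards, so each smoothed point depends on past interruptions alone. A symmetric window would instead blend in future values, which is less useful for a metric meant to reflect how the rotation feels right now.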

Self-determination

While on-call is a fact of life for the managed cloud services we offer, I think there's definitely room for improvement. One thing I have always believed strongly is that alarms must be actionable, and the team must have enough control over the affected system to meaningfully improve it in the future. There's nothing worse than being paged in the middle of the night over something you can't even fix! This may sound obvious, but I have seen many, many poorly configured monitoring systems over the years, and getting this right takes a lot of careful tuning. It also means the team needs the authority to disable alarms in areas where the business isn't willing to fund such reliability improvements.

Try it out

If you want to experiment with your own data, download the Jupyter notebook:
