Mean Time Between Loss of Sleep
“MTBLS”: I first encountered this phrase on a New Relic blog. It's a half-joking reference to a concept used by reliability engineers, Mean Time Between Failures (MTBF). I was intrigued, though, and thought it would be an interesting metric to track.
We have high-resolution data about our machines' health, down to the smallest minutiae, but precious little about the health of our people.
Sleep deprivation has been implicated in the recent fatal collisions of the Navy destroyers John S. McCain and Fitzgerald. It causes frightening increases in medical errors. Analyses of many historic disasters, including Chernobyl and the Challenger explosion, have named sleep loss as a factor.
Medical research shows:
the cumulative long-term effects of sleep loss and sleep disorders have been associated with a wide range of deleterious health consequences including an increased risk of hypertension, diabetes, obesity, depression, heart attack, and stroke.
Losing sleep is bad.
Using my favorite data visualization tools, SciPy and Jupyter, I was able to generate a plot from PagerDuty API data:
This counts consecutive runs of nights with uninterrupted sleep (multiple incidents in the same night don't change the value). A value of 1 means the on-call engineer is woken up at least once every night. The chart is smoothed with a one-sided Hann window, such that the most recent interruptions are weighted more heavily.
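The calculation described above can be sketched roughly as follows. This is not the code from my notebook, just a minimal standard-library illustration under some stated assumptions: the incident timestamps have already been fetched and converted to local time, "night" means 22:00 to 06:00, and the smoothing width of 7 is arbitrary. The helper names (`night_of`, `nights_between_interruptions`, `one_sided_hann`, `smooth`) are hypothetical.

```python
import math
from datetime import datetime, timedelta

NIGHT_START_HOUR = 22  # assumption: the night begins at 22:00 local time
NIGHT_END_HOUR = 6     # assumption: the night ends at 06:00 local time


def is_night(ts):
    """True if the page arrived during the assumed sleeping hours."""
    return ts.hour >= NIGHT_START_HOUR or ts.hour < NIGHT_END_HOUR


def night_of(ts):
    """Date of the evening a night-time page belongs to.

    Shifting back by NIGHT_END_HOUR puts 23:30 Monday and 02:00 Tuesday
    in the same night, so several pages in one night count only once.
    """
    return (ts - timedelta(hours=NIGHT_END_HOUR)).date()


def nights_between_interruptions(incidents):
    """Days from each interrupted night to the next interrupted night.

    A value of 1 means two consecutive nights were both interrupted.
    """
    nights = sorted({night_of(ts) for ts in incidents if is_night(ts)})
    return [(b - a).days for a, b in zip(nights, nights[1:])]


def one_sided_hann(n):
    """Rising half of a symmetric Hann window, without the zero endpoint.

    The last weight is 1.0, so the most recent value counts the most.
    """
    return [0.5 - 0.5 * math.cos(math.pi * k / n) for k in range(1, n + 1)]


def smooth(values, n=7):
    """Weighted trailing average using the one-sided Hann weights."""
    weights = one_sided_hann(n)
    out = []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]
        w = weights[-len(window):]  # keep the heaviest (most recent) weights
        out.append(sum(v * x for v, x in zip(window, w)) / sum(w))
    return out


# Made-up example timestamps, for illustration only.
incidents = [
    datetime(2018, 1, 1, 23, 30),
    datetime(2018, 1, 2, 2, 0),    # same night as the page above
    datetime(2018, 1, 2, 23, 0),
    datetime(2018, 1, 5, 3, 0),
    datetime(2018, 1, 5, 14, 0),   # daytime page, ignored
]
gaps = nights_between_interruptions(incidents)
smoothed = smooth(gaps, n=2)
```

Plotting `smoothed` against the dates of the interrupted nights gives a chart in the same spirit as the one above. The width of the window controls how quickly a bad week fades from the curve.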
While on-call is a fact of life for the managed cloud services we offer, I think there's definitely room for improvement. One thing I have always believed strongly is that alarms must be actionable, and the team must have enough control over the affected system to meaningfully improve it in the future. There's nothing worse than being paged in the middle of the night over something you can't even fix! This may sound obvious, but I have seen many, many poorly configured monitoring systems over the years, and getting this right takes a lot of careful tuning. It also means the team needs the authority to disable alarms in areas where the business isn't willing to fund such reliability improvements.
Try it out
If you want to experiment with your own data, download the Jupyter notebook:
Have you seen Alice Goldfuss' talk about on-call? She gave it at Monitorama last year, and it was really good: https://speakerdeck.com/alicegoldfuss/martyrs-on-film-learning-to-hate-the-number-oncallselfie?slide=70
Thu, 03/15/2018 - 20:03
Tue, 01/30/2018 - 21:34