LISA19 - Blameless Incidents: Learning from Failure at Scale

Chip Turner, Facebook, Inc.

How a company handles outages is a conscious decision, and being intentional about the mindset you cultivate is critical to long-term reliability and operability. Building a culture that embraces crises as learning opportunities rather than failures is a crucial component of healthy Incident Management.

Facebook’s blameless, reflective approach tries to make the most of every outage, large and small. Our scalable Incident Management program is designed to be used for incidents of all sizes, from full-site issues to minor, localized problems affecting small, non-critical services. This talk will discuss the cultural and technical challenges of maintaining an open culture that focuses on moving fast while keeping a high bar for operational excellence and reliability. We will explore the principles, tools, and processes we use to accomplish these goals, how we scale communication during incidents, and how our open-door review culture reinforces our blameless approach while still maintaining high standards.


