What Constitutes a Service Failure?
The center of excellence in the company’s global IT operations division comprises ML engineers and data scientists who handle data onboarding, pipelining, modeling, and building reusable frameworks. According to the team, service failure isn’t easily encapsulated by “service down.” Most systems fail through a series of degradation steps rather than switching off all at once. The team started by defining failure as, for example, HTTP 500 responses returned from a server to a customer, but soon realized that such a response did not always indicate a service problem; sometimes it was a user problem.
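As a rough illustration of the difference between a single error response and gradual degradation, the sketch below labels a time window as degraded only when the share of 5xx responses stays above a threshold for several consecutive windows. It is one possible formulation, not the team's actual method; the `Window` schema, the 5% threshold, and the three-window minimum are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    """Aggregated request counts for one time window (hypothetical schema)."""
    total: int          # all requests seen in the window
    server_errors: int  # responses with a 5xx status


def error_rate(w: Window) -> float:
    """Share of requests in the window that returned a 5xx status."""
    return w.server_errors / w.total if w.total else 0.0


def label_degradation(
    windows: List[Window],
    threshold: float = 0.05,   # assumed cutoff, to be set by the SMEs
    min_consecutive: int = 3,  # assumed persistence requirement
) -> List[bool]:
    """Mark a window as degraded once the error rate has exceeded
    `threshold` for at least `min_consecutive` windows in a row."""
    labels = []
    streak = 0
    for w in windows:
        streak = streak + 1 if error_rate(w) > threshold else 0
        labels.append(streak >= min_consecutive)
    return labels


windows = [Window(100, 1), Window(100, 10), Window(100, 12),
           Window(100, 20), Window(100, 0)]
labels = label_degradation(windows)
```

With these sample counts, only the fourth window is labeled as degraded, because it is the third consecutive window above the threshold. Isolated errors, which may stem from user behavior rather than the service, never trigger the label on their own.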
Beyond the technical challenge of training and deploying complex models, the team needs close collaboration with the operations subject matter experts (SMEs). In the service failure prediction use case, the SMEs define the “failures” they want to predict, a challenging task because failures differ from architecture to architecture and solution to solution.
Using Dataiku, the data team put potential failure characteristics into a catalog, making it more efficient for the operations SMEs to identify and clarify what “failure” or “degradation” means. Eventually, the collaboration process became efficient enough that SMEs could define 20 models in hours, all just to pin down the meaning of “failure,” which is the target being predicted in the use case.
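To make the catalog idea concrete, here is a minimal sketch in plain Python of how named failure characteristics could be stored once and then combined by an SME into a per-service failure definition. It does not use or represent any Dataiku API; the characteristic names, metric keys, and thresholds are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]

# Hypothetical catalog: each entry is a named check over a metrics snapshot.
CATALOG: Dict[str, Callable[[Metrics], bool]] = {
    "high_5xx_rate": lambda m: m.get("error_rate_5xx", 0.0) > 0.05,
    "slow_p95_latency": lambda m: m.get("p95_latency_ms", 0.0) > 2000,
    "queue_backlog": lambda m: m.get("queue_depth", 0.0) > 10000,
}


def is_failure(metrics: Metrics, selected: List[str], min_matches: int = 1) -> bool:
    """Return True when at least `min_matches` of the characteristics
    chosen for a service hold for the given metrics snapshot."""
    matches = sum(1 for name in selected if CATALOG[name](metrics))
    return matches >= min_matches


snapshot = {"error_rate_5xx": 0.10, "p95_latency_ms": 500.0}
chosen = ["high_5xx_rate", "slow_p95_latency"]
loose = is_failure(snapshot, chosen)                  # one match is enough
strict = is_failure(snapshot, chosen, min_matches=2)  # both must hold
```

Because the definition for each service is just a list of catalog names plus a match count, an SME can adjust what “failure” means for a given architecture without touching the checks themselves, and the resulting boolean can serve as the label a model is trained to predict.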
The team knew they wanted this to become a self-service initiative over time. To make that a reality, the IT operations managers and SMEs in the global IT division have access to the reusable frameworks from the central product team. Now, they can extract insights autonomously and collaborate with the technical experts to establish a scalable solution for defining (and predicting) a failure.