
How a Multinational Telecommunications Company Developed AI-Enabled Service Failure Prediction

Service downtime and failure, which can lead to significant revenue loss, are especially prevalent among telcos. Machine learning can address the problem by predicting what might cause a service failure so that it can be prevented before it happens.

40 models produced within 6 weeks

40 IT components monitored simultaneously

50 minutes advance notice on incidents


What Constitutes a Service Failure?

The center of excellence in the company’s global IT operations division comprises ML engineers and data scientists who handle data onboarding, pipelining, and modeling, and who build reusable frameworks. According to the team, service failure isn’t easily encapsulated by “service down.” Most systems don’t simply switch off in an outage; they fail through a series of degradation steps. The team initially defined failure using signals such as HTTP 500 responses returned by a server to a customer, but soon realized that these did not always indicate a service problem; sometimes the cause was a user problem.
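The distinction between a stray HTTP 500 and genuine degradation can be sketched in code. The following is a minimal illustration, not the team's actual rule: it assumes request counts are already aggregated into fixed time windows, and the names `Window`, `rate_threshold`, and `min_consecutive`, as well as the threshold values, are hypothetical. A window is labeled as degraded only when the share of 5xx responses stays above the threshold for several consecutive windows, so an isolated burst of errors is ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    """Request counts for one time window (hypothetical shape)."""
    total: int          # all requests observed in the window
    server_errors: int  # responses with a 5xx status code


def error_rate(window: Window) -> float:
    """Share of requests that ended in a 5xx response."""
    return window.server_errors / window.total if window.total else 0.0


def label_degradation(
    windows: List[Window],
    rate_threshold: float = 0.05,  # assumed value, to be set by an SME
    min_consecutive: int = 3,      # assumed value, to be set by an SME
) -> List[bool]:
    """Mark windows that sit in a sustained run of high error rates."""
    labels = [False] * len(windows)
    run_length = 0
    for i, window in enumerate(windows):
        run_length = run_length + 1 if error_rate(window) > rate_threshold else 0
        if run_length >= min_consecutive:
            # Mark the whole run, including the windows that opened it.
            for j in range(i - run_length + 1, i + 1):
                labels[j] = True
    return labels


# Made-up counts: one isolated spike, then a sustained rise in errors.
windows = [Window(1000, 10), Window(1000, 80), Window(1000, 5),
           Window(1000, 90), Window(1000, 120), Window(1000, 200),
           Window(1000, 8)]
print(label_degradation(windows))
# [False, False, False, True, True, True, False]
```

The isolated spike in the second window is not labeled, while the three-window run is. In practice the threshold and run length would differ from architecture to architecture, which is why the definition sits with the operations SMEs.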

Alongside the technical challenge of training and deploying complex models, it is imperative to get the right collaboration from the operations subject matter experts (SMEs). In the service failure prediction use case, the SME defines the “failures” they want to predict, which is a challenging task as the failures differ from architecture to architecture and solution to solution. 

Using Dataiku, the data team was able to put potential failure characteristics into a catalog, making it more efficient for the operations SMEs to identify and clarify what “failure” or “degradation” means. Eventually, the collaboration process became efficient enough that SMEs could define 20 models in hours, and all of that effort goes into pinning down the meaning of “failure,” which is what the use case predicts.
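Once an SME has settled on what "failure" means for a component, that definition still has to become something a model can learn to predict. One common way to do this, shown here as a sketch and not as the team's confirmed approach, is to label each time step by whether a failure occurs within a fixed horizon after it. The function name `horizon_labels` and the `horizon` parameter are hypothetical.

```python
from typing import List, Optional


def horizon_labels(failing: List[bool], horizon: int) -> List[bool]:
    """Return labels where labels[t] is True if any step in (t, t + horizon]
    is failing, i.e. a failure lies within `horizon` steps after step t."""
    labels = [False] * len(failing)
    next_failure: Optional[int] = None  # nearest failing step after t
    # Walk backwards so the nearest upcoming failure is always known.
    for t in range(len(failing) - 1, -1, -1):
        labels[t] = next_failure is not None and next_failure - t <= horizon
        if failing[t]:
            next_failure = t
    return labels


# Made-up series with one failing step at index 4 and a horizon of 2 steps.
failing = [False, False, False, False, True, False, False]
print(horizon_labels(failing, horizon=2))
# [False, False, True, True, False, False, False]
```

A model trained on labels like these learns to raise an alert before the failure begins, and the choice of horizon determines how much advance notice it can give.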

The team knew they wanted this to become a self-service initiative over time. To make that a reality, the IT operations managers and SMEs in the global IT division have access to the reusable frameworks from the central product team. Now, they can extract insights autonomously and collaborate with the technical experts to establish a scalable solution for defining (and predicting) a failure.


Results: Reduced Service Failures, Faster Model Development, & More

Before using Dataiku, the data scientists engineered features manually for each model, which was very time consuming. They switched to deep learning approaches, which improved accuracy and freed up time that the data scientists could spend on other high-priority projects. A deep learning model in Dataiku can now train on months of data in less than 20 minutes. The catalog of indicators described above did much of the heavy lifting in helping the team identify the right failures to go after, with input from SMEs balancing failure rate against impact.

The team can now generate models for about 20 components in less than a month, covering everything from data preprocessing and data transfer from logs to modeling, automation, and testing. Both the models themselves and production are monitored with Dataiku, and the company also has a business layer of monitoring to ensure that the actionable data created for the business owners of each IT service is useful and understandable.

Additionally, the team has seen:

  • Accelerated speed to market for model development, from when the data becomes available to deployment into the live environment: Previously, one model took the team six months and, with Dataiku, they can now produce 40 models within 6 weeks (meaning 40 IT components are now being monitored in this innovative way) 
  • Reduced MTTR (mean time to resolution/restore), which enables the team to fix failures faster 
  • Increased service availability 
  • Reduced P1 service failures and quicker intervention times (i.e., the average major incident is now predicted 50 minutes in advance, giving the business time to proactively address it)
  • Time saved across the end-to-end failure process, so the team can focus on the priorities of IT operations and use the time freed from manually monitoring service failures to investigate further preventative measures
  • Greater agility from using Dataiku and the cloud together (without having to wait for on-prem infrastructure) and, therefore, the ability to build models in a more extensible way
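
Two of the figures in the list above, MTTR and advance notice, are simple averages over incident timestamps. The sketch below shows the arithmetic on made-up data; the `Incident` record and its field names are assumptions for illustration and do not reflect the company's incident schema. Here MTTR is measured from the start of an incident to its resolution.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean
from typing import List, Optional


@dataclass
class Incident:
    """One major incident (hypothetical record layout)."""
    started_at: datetime
    resolved_at: datetime
    predicted_at: Optional[datetime] = None  # None if no alert was raised


def minutes_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 60


def mttr_minutes(incidents: List[Incident]) -> float:
    """Mean time from incident start to resolution, in minutes."""
    return mean(minutes_between(i.started_at, i.resolved_at) for i in incidents)


def mean_lead_time_minutes(incidents: List[Incident]) -> float:
    """Mean advance notice, over the incidents that were predicted."""
    return mean(
        minutes_between(i.predicted_at, i.started_at)
        for i in incidents
        if i.predicted_at is not None
    )


def at(hour: int, minute: int) -> datetime:
    """Helper for the made-up sample data below (arbitrary date)."""
    return datetime(2024, 1, 15, hour, minute)


# Made-up incidents, not the company's data.
incidents = [
    Incident(at(10, 0), at(10, 30), predicted_at=at(9, 10)),   # 50 min notice
    Incident(at(14, 0), at(15, 0), predicted_at=at(13, 20)),   # 40 min notice
    Incident(at(18, 0), at(18, 30)),                           # not predicted
]
print(mttr_minutes(incidents))            # 40.0
print(mean_lead_time_minutes(incidents))  # 45.0
```

Both functions raise `statistics.StatisticsError` when there is nothing to average, for example when no incident in the list was predicted.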

The team is looking forward to scaling out the service failure prediction use case and to experimenting with auto-diagnostics that offer prescriptive resolutions, so that flagging a service as on its way to failure does not create panic.
