The past few years in AI have been marked by the great migration from on-premises data centers to public cloud infrastructure. Organizations are drawn to the cloud’s promise of pay-as-you-go, usage-based pricing and capacity that can easily scale up and down as needs change; however, in practice, leveraging big data with good performance and at reasonable costs is easier said than done.
In reality, processing large amounts of data in pipelines that deliver real business value via data science, machine learning, and AI projects requires not only serious computational power, but also optimized resource consumption and isolated environments for development and production. On top of all of this, businesses need to put best practices in place that drive efficiency and cost monitoring — clearly, managing all of these moving parts can get complex quickly for organizations of any size, digital native or not.
The Challenge at Heetch: “Data Warehouse Costs Spiraling Out of Control”
Since its launch, Heetch has gathered troves of data from drivers, passengers, global operations, and more, yet the company struggled to scale its ability to actually leverage that data.
Five years in, data warehouse costs were spiraling out of control, and performance was suffering as the amount of data grew. The company needed to find a solution that would allow anyone across the organization to work with large amounts of data while also ensuring optimized resource allocation.
In 2019, Heetch chose Dataiku as their single platform for building data pipelines and processing raw data, paired with Looker for the seamless visualization and exploration of those flows.
What Dataiku Brings & Long-Term Results
Beyond serving as a platform where Heetch could centralize knowledge and best practices, Dataiku, paired with Kubernetes, helped the team address its primary pain point: leveraging data while maintaining good performance and reasonable costs.
Thanks to Dataiku’s native integration with major cloud vendors’ managed Kubernetes services, Heetch was able to connect its AWS EKS cluster very quickly and saw a drastic increase in the value extracted from its data. Teams can now easily offload resource-intensive workloads, like big Python and R jobs, and leverage the EKS cluster to distribute compute and run Spark jobs. Using Dataiku means this functionality is available and accessible to any Heetch employee, regardless of their experience with distributed computing, because Dataiku abstracts away the complexity.
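For readers curious what offloading a Spark job to a managed Kubernetes cluster involves under the hood, here is a minimal PySpark sketch. The endpoint, container image, bucket paths, and sizing are placeholders for illustration, not Heetch’s actual configuration, and in practice Dataiku’s Spark-on-Kubernetes integration handles this wiring for the user.

```python
from pyspark.sql import SparkSession

# Hypothetical values: replace the API endpoint, container image, and sizing
# with whatever your EKS cluster and registry actually expose.
spark = (
    SparkSession.builder
    .appName("ride-data-aggregation-sketch")
    .master("k8s://https://<EKS_API_SERVER_ENDPOINT>:443")
    .config("spark.kubernetes.container.image", "<registry>/spark-py:3.4.1")
    .config("spark.kubernetes.namespace", "data-jobs")
    .config("spark.executor.instances", "10")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)

# Example workload: aggregate raw ride events stored on S3 and write the
# curated result back, with the heavy lifting distributed across executors.
rides = spark.read.parquet("s3a://<bucket>/raw/rides/")
daily_counts = rides.groupBy("city", "ride_date").count()
daily_counts.write.mode("overwrite").parquet("s3a://<bucket>/curated/daily_ride_counts/")
```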
However, unlimited power does not mean the organization wanted unlimited spending and surprise AWS bills. Calculating ROI on data projects means accounting for both hardware and software costs, so Heetch also wanted to leverage Dataiku to optimize resource consumption. The team therefore put in place the ability to differentiate CPU-intensive clusters from memory-intensive ones, optimizing user experience and computation speed depending on the type of job launched.
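As an illustration of how that kind of routing can work at the Kubernetes level, the sketch below builds a pod spec that targets either a compute-optimized or a memory-optimized node group via node selectors and resource requests. The label keys, instance families, and sizings are assumptions made for the example, not Heetch’s actual setup.

```python
from kubernetes import client

# Hypothetical node labels: this assumes the EKS node groups were created with
# a "workload-profile" label separating compute-optimized from memory-optimized
# instances (e.g. c5 vs. r5 families).
NODE_SELECTORS = {
    "cpu": {"workload-profile": "compute-optimized"},
    "memory": {"workload-profile": "memory-optimized"},
}

# Illustrative sizings: CPU-heavy jobs request many cores, memory-heavy jobs
# request a large memory allocation instead.
RESOURCES = {
    "cpu": client.V1ResourceRequirements(requests={"cpu": "8", "memory": "8Gi"}),
    "memory": client.V1ResourceRequirements(requests={"cpu": "2", "memory": "32Gi"}),
}


def job_pod(name: str, image: str, profile: str) -> client.V1Pod:
    """Build a pod spec that lands on the node group matching the job profile."""
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name, labels={"job-profile": profile}),
        spec=client.V1PodSpec(
            restart_policy="Never",
            node_selector=NODE_SELECTORS[profile],
            containers=[
                client.V1Container(name="job", image=image, resources=RESOURCES[profile])
            ],
        ),
    )


# A CPU-hungry feature-engineering job goes to the compute pool; a large
# in-memory join would instead pass profile="memory".
pod = job_pod("feature-build-example", "<registry>/python-job:latest", "cpu")
```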
Since moving to Dataiku, Heetch has seen both a decrease in frustration from the previous data warehouse bottleneck and a notably faster time to market for its data projects, coupled with greater ROI thanks to cost control. The team has launched hundreds of data projects with Dataiku, ranging from feature stores and ETL projects to real-time fraud detection, churn prediction, route optimization, passenger/driver segmentation, pricing models, marketing attribution tracking, and more.
Ultimately, Dataiku has allowed Heetch not only to transform its ability to leverage elastic resources, but to uplevel its overall AI maturity:
- Heetch now has a unified data platform on which different people work in parallel on data projects, from data engineers optimizing flow execution to data scientists working on advanced machine learning and deep learning models.
- Data is accessible thanks to Dataiku’s abstraction layer and can be leveraged on appropriate infrastructure (EKS) that is robust, elastic, and scalable.
- Collaboration and knowledge sharing have drastically improved, which was especially important for Heetch during remote work in 2020.
- Operationalization has benefited the entire organization with more than 100 projects in production on the automation node, running on a regular basis and driving daily business processes.