The prediction techniques we developed in this project advance the state of the art in three ways:
- by providing prediction mechanisms for a class of iterative analytics that had not been empirically addressed before, yet are widely used today in analytical workflows;
- by providing hybrid prediction models for different categories of data analytics (i.e., iterative machine learning, data pre-processing, and reporting SQL) and by analyzing the trade-offs at varying levels of model granularity;
- by providing mechanisms to reduce the training cost (in terms of running benchmark queries) while maintaining a competitive level of accuracy for the models.
PREDIcT improves on the accuracy of analytical upper bounds for estimating the number of PageRank iterations, reducing the relative error from [104, 168]% to [0, 11]%. Overall, the runtime estimates have an error of [10, 30]% for all scale-free graphs analyzed. The prediction techniques we propose for automated workload deployment of reporting analytics can be used to answer resource allocation questions, i.e., to identify the resource allocation(s) that can satisfy a target performance goal.
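As a rough illustration of how runtime estimates feed a resource allocation question, the sketch below computes a relative error and filters candidate allocations against a target runtime. It is not PREDIcT's implementation: the `Allocation` type, its fields, the function names, and the safety-margin parameter are all hypothetical, and the predicted runtimes are assumed to come from some prediction model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    """A candidate cluster configuration (hypothetical fields)."""
    nodes: int
    predicted_runtime_s: float  # assumed output of a runtime prediction model


def relative_error(predicted: float, actual: float) -> float:
    """Relative error in percent: |predicted - actual| / actual * 100."""
    if actual <= 0:
        raise ValueError("actual must be positive")
    return abs(predicted - actual) / actual * 100.0


def feasible_allocations(candidates, target_runtime_s, error_margin_pct=0.0):
    """Return the allocations expected to meet the target runtime.

    Each prediction is inflated by error_margin_pct (an assumed safety
    margin) before being compared with the target. The result is sorted
    by node count, so the smallest feasible allocation comes first.
    """
    factor = 1.0 + error_margin_pct / 100.0
    feasible = [
        a for a in candidates
        if a.predicted_runtime_s * factor <= target_runtime_s
    ]
    return sorted(feasible, key=lambda a: a.nodes)


if __name__ == "__main__":
    # Illustrative numbers only; they are not measurements from the project.
    candidates = [
        Allocation(nodes=4, predicted_runtime_s=1200.0),
        Allocation(nodes=8, predicted_runtime_s=700.0),
        Allocation(nodes=16, predicted_runtime_s=450.0),
    ]
    for a in feasible_allocations(candidates, 900.0, error_margin_pct=30.0):
        print(a)
```

Inflating each prediction by a margin is one simple way to account for estimation error when picking an allocation; a scheduler could equally compare the raw estimates and handle uncertainty elsewhere.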
Today’s analytical workloads include a mix of declarative queries and workflows of iterative machine learning algorithms executing on large-scale infrastructures. While SQL-like query languages (e.g., JAQL, Pig, Hive) make it possible to execute traditional analytical queries at large scale, user-defined functions written in high-level languages give users the opportunity to execute customized operators. Within the latter category, machine learning algorithms are prevalent today and are used to filter out irrelevant data or to find correlations in ever-increasing datasets. For instance, Facebook uses machine learning to order stories in the news feed (i.e., ranking) and to group users with similar interests together (i.e., clustering).
In this project we address the problem of estimating the performance of mixed analytical workflows, with the ultimate goal of providing a set of prediction models and techniques that help end users perform automatic workload deployment in a cloud setting. Runtime estimates are a prerequisite for optimizing cluster resource allocations, in the same way that query cost estimates are a prerequisite for DBMS optimizers. In particular, the resource managers and schedulers used to optimize resource provisioning require runtime estimates as input in order to quantify alternative resource allocations.
People involved: Adrian Popescu
Towards Predicting the Runtime of Iterative Analytics with PREDIcT