Analytical Processing Systems

The ongoing data explosion requires database software to use the available hardware efficiently and to exploit data properties in order to enable timely business intelligence. At the same time, both data and hardware are becoming increasingly heterogeneous: modern servers adopt a variety of hardware accelerators to increase their energy efficiency, while data scientists analyze a wide variety of heterogeneous datasets to gain insights. In this line of work, we enable query engines to adapt to the available hardware and data formats and to automatically exploit domain-specific properties by generating specialized query engines on demand, achieving the performance of specialized engines without the extra development effort and time.

Data Cleaning

Data cleaning has become an indispensable part of data analysis due to the increasing amount of dirty data. Data scientists spend most of their time preparing dirty data before it can be used for analysis. At the same time, existing tools that attempt to automate data cleaning typically focus on a specific use case and operation, and are unaware of the analysis that users perform. As a result, these specialized tools exhibit long running times or fail to process large datasets. In this project, we focus on approaches that address the coverage and performance issues of data cleaning, while also integrating data cleaning tasks seamlessly into the data analysis process.

Elastic & Distributed Query Engines

We build transactional and analytical engines that leverage native cloud functionality, such as elasticity and distribution. We provide fine-grained elasticity through cross-cutting system designs that span the whole software virtualization stack, and we build our distributed query processing systems on top of Spark and other parallel frameworks.

Modern Storage

Storage hardware has improved dramatically in the past decade. It is now possible to reach storage bandwidths in the hundreds of GB/s on a single server, approaching memory bandwidth. Conventional analytical engines rely on in-memory caching to avoid disk accesses, keeping the most frequently accessed data in memory to provide timely responses. However, the performance of high-bandwidth storage (HBS) is now sufficiently close to memory bandwidth that, for many workloads, processing input data stored on HBS can be as fast as full in-memory processing. In this line of work, we explore how high-performance analytical systems must be redesigned for the high-bandwidth era.

Query Accelerators

Traditionally, query engines have been optimized for CPUs, but modern servers are becoming increasingly heterogeneous and are equipped with multiple hardware accelerators, such as GPUs. In this line of work, we investigate how the query engine can use different accelerators to increase its performance and to provide isolation between queries. We design new hardware-conscious algorithms, study how existing ones perform across different micro-architectures, and investigate multi-device query execution. Lastly, we provide engine designs that generalize device-specific approaches to achieve efficient heterogeneous-device execution through just-in-time code generation.

Publications

HetExchange: Encapsulating heterogeneous CPU-GPU parallelism in JIT compiled engines

P. Chrysogelos; M. Karpathiotakis; R. Appuswamy; A. Ailamaki 

2019. 45th International Conference on Very Large Data Bases, Los Angeles, California, USA, August 26-30, 2019. p. 544–556. DOI : 10.14778/3303753.3303760.

Holistic, Efficient, and Real-time Cleaning of Heterogeneous Data

S. A. Giannakopoulou / A. Ailamaki (Dir.)  

Lausanne, EPFL, 2021. 

CleanM: An Optimizable Query Language for Unified Scale-Out Data Cleaning

S. A. Giannakopoulou; M. Karpathiotakis; B. C. D. Gaidioz; A. Ailamaki 

2017. 43rd International Conference on Very Large Data Bases, Munich, Germany, August 28 - September 1, 2017. p. 1466–1477. DOI : 10.14778/3137628.3137654.

Hardware-conscious Query Processing in GPU-accelerated Analytical Engines

P. Chrysogelos; P. Sioulas; A. Ailamaki 

2019. 9th Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, California, USA, January 13-16, 2019.

Fast Queries Over Heterogeneous Data Through Engine Customization

M. Karpathiotakis; I. Alagiannis; A. Ailamaki 

2016. 42nd International Conference on Very Large Data Bases, New Delhi, India, September 5-9, 2016. p. 972-983. DOI : 10.14778/2994509.2994516.

Slalom: Coasting Through Raw Data via Adaptive Partitioning and Indexing

M. Olma; M. Karpathiotakis; I. Alagiannis; M. Athanassoulis; A. Ailamaki 

2017-06-01. p. 1106-1117. DOI : 10.14778/3115404.3115415.

Just-In-Time Data Virtualization: Lightweight Data Management with ViDa

M. Karpathiotakis; I. Alagiannis; T. Heinis; M. Branco; A. Ailamaki 

2015. 7th Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, California, USA, January 4-7, 2015.

NoDB: Efficient Query Execution on Raw Data Files

I. Alagiannis; R. Borovica-Gajic; M. Branco; S. Idreos; A. Ailamaki 

Communications of the ACM. 2015. Vol. 58, num. 12, p. 112-121. DOI : 10.1145/2830508.

Hardware-conscious Hash-Joins on GPUs

P. Sioulas; P. Chrysogelos; M. Karpathiotakis; R. Appuswamy; A. Ailamaki 

2019. IEEE International Conference on Data Engineering, Macau SAR, China, April 8-12, 2019. p. 698-709. DOI : 10.1109/ICDE.2019.00068.