The ongoing data explosion requires database software to use the available hardware efficiently and to exploit data properties in order to enable timely business intelligence. At the same time, both data and hardware are becoming increasingly heterogeneous: modern servers adopt a variety of hardware accelerators to increase their energy efficiency, while data scientists analyze a wide variety of heterogeneous datasets to gain insights. In this line of work, we enable query engines to adapt to the available hardware and data formats, and to automatically exploit domain-specific properties, by generating specialized query engines on demand. This achieves the performance of specialized engines without the extra development effort and time.
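As a rough illustration of what generating a specialized engine on demand can look like, the sketch below emits and compiles Python source for a single filter-and-sum query, with the column positions and the predicate constant baked into the generated code. The names `generate_query` and `specialized_query` are hypothetical and chosen only for this example; the sketch does not reflect the actual code generator used in this work.

```python
def generate_query(columns, filter_col, threshold, agg_col):
    """Generate and compile a function specialized for one filter-and-sum query.

    Hypothetical helper for illustration: it resolves column names to
    positions once, at generation time, so the generated loop contains
    no schema lookups or predicate interpretation.
    """
    f_idx = columns.index(filter_col)
    a_idx = columns.index(agg_col)
    source = (
        "def specialized_query(rows):\n"
        "    total = 0\n"
        "    for row in rows:\n"
        f"        if row[{f_idx}] > {threshold!r}:\n"
        f"            total += row[{a_idx}]\n"
        "    return total\n"
    )
    namespace = {}
    # Compile the generated source and pull out the resulting function.
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["specialized_query"], source


rows = [(1, 10.0), (2, 20.0), (3, 30.0)]
query, generated_source = generate_query(["id", "price"], "id", 1, "price")
result = query(rows)  # sums price over rows with id > 1
```

Printing `generated_source` shows the specialized loop. A generic interpreter would instead look up columns and evaluate the predicate for every row, which is the overhead that specialization aims to remove.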
Data cleaning has become an indispensable part of data analysis due to the increasing amount of dirty data. Data scientists spend most of their time preparing dirty data before it can be used for analysis. At the same time, existing tools that attempt to automate data cleaning typically focus on a specific use case and operation, and are unaware of the analysis that users perform. As a result, these specialized tools exhibit long running times or fail to process large datasets. In this project, we focus on approaches that address the coverage and performance issues of data cleaning, while also integrating data cleaning tasks seamlessly into the data analysis process.
Elastic & Distributed Query Engines
We build transactional and analytical engines that leverage native cloud functionality, such as elasticity and distribution. We provide fine-grained elasticity through cross-cutting system designs that span the whole software virtualization stack, and we build our distributed query processing systems on top of Spark and other parallel frameworks.
Traditionally, query engines have been optimized for CPUs, but modern servers are becoming increasingly heterogeneous and are equipped with multiple hardware accelerators, such as GPUs. In this line of work, we investigate how the query engine can use different accelerators to increase its performance and to provide isolation between queries. We design new hardware-conscious algorithms, study how existing ones perform across different micro-architectures, and investigate multi-device query execution. Lastly, we provide engine designs that generalize device-specific approaches to achieve efficient heterogeneous-device execution through just-in-time code generation.