EPFL

Over the years, the DIAS lab has produced several pieces of software with various applicabilities. Below are their descriptions and the link to our software repository.

Proteus is an analytical query engine designed for fast analytics over data in heterogeneous formats.
Proteus provides:

In-situ query processing of heterogeneous data formats, including multi-format support — mixing flat (eg. binary, CSV) and hierarchical (eg. JSON) data formats.
Direct querying of data without pre-loading them into the database. The engine can directly query the raw data and any necessary conversion happens on-the-fly.
A caching infrastructure that uses positional maps to reduce the cost of repeatedly accessing data on their original format. As a by-product of query execution, parsed values of accessed attributes are saved in positional maps to reduce the cost of successive passes.
A code generation infrastructure that generates code specialized to the query and data formats.

website: https://proteusdb.com

Visual exploration of spatial data relies heavily on spatial aggregation queries that slice and summarize the data over different regions. These queries comprise computationally-intensive point-in-polygon tests that associate data points to polygonal regions, challenging the responsiveness of visualization tools. This challenge is compounded by the sheer amounts of data, requiring a large number of such tests to be performed. Traditional pre-aggregation approaches are unsuitable in this setting since they fix the query constraints and support only rectangular regions. On the other hand, query constraints are defined interactively in visual analytics systems, and polygons can be of arbitrary shapes.We convert a spatial aggregation query into a set of drawing operations on a canvas and leverage the rendering pipeline of the graphics hardware (GPU) to enable interactive response times. Our technique trades-off accuracy for response time by adjusting the canvas resolution, and can even provide accurate results when combined with a polygon index. We evaluate our technique on two large real-world data sets, exhibiting superior performance compared to index-based approaches.

publication

code

Slalom is an in-situ query processing system which adaptively partitions data files and enables partial indexes over subsets of the datasets. The prototype is used as a testbed for a variety of partitioning and indexing strategies for in-situ query processing.

The implemented prototype enables the execution of in situ queries containing point and range queries over CSV files and binary tables.

To reduce the cost of raw data access, it implements positional maps and binary caches, and SIMD operations for token identification.
To access only relevant data, the protoype implements a variety of indexes: (I) an in memory B+Tree, (ii) a Bloom Filter, (iii) Hash index and Iiv) a bitmap. As we compare Slalom with Database Cracking approaches, the prototype contains also the code for the original Cracking as well as Stochastic Cracking.

Finally, to reduce the cost of updates, the prototype enables the recognition of changes in the data file, by calculating hash value checksums and uses inotify to recognize updates to the file. This option has been tested only on Linux systems.

PostgresRAW, is implemented by modifying PostgreSQL 9.0. It overrides the scan operator of the query plan produced by the PostgreSQL optimizer with the ability to access raw data files directly and efficiently.

The modifications are geared towards improving access on raw data by:

Speeding up raw data accesses using an adaptive positional map. It maintains information such as positions of attributes in the raw file.
Reducing the tokenizing and parsing costs by aborting tokenization of tuples as soon as possible and creating tuples after the select operator, when it is known they are relevant.
Eliminating the need to access hot raw data via a cache which holds binary data and always gives priority to the most costly attributes to convert.

The positional map, cache and statistics are generated on-the-fly, only for the attributes accessed by the current query. Whenever needed elements are dropped based on LRU policies.

PostgresRAW also supports (a) external updates , a.k.a a user adds or modifies data files directly or (b) internal updates which are triggered by SQL statements.

code

Shore-MT is an experiment test-bed library for use by researchers who wish to write multi-threaded software that manages persistent data. Shore-MT as a storage engine provides the following capabilities:

transactions with ACID properties, with ARIES-based logging and recovery, primitives for partial rollback, transaction chaining, and early lock release
prepared-transaction support for two-phased commit (2PC)
persistent storage structures : B+ tree indexes, R* trees (spatial indexes), and files of untyped records
fine-grained locking for records and B+ tree indexes with deadlock detection, optional lock escalation and optional coarse-grained locking
in-memory buffer management with optional prefetching
extensible statistics-gathering, option-processing, and error-handling facilities

This software runs on Pthreads, thereby providing its client software (e.g., a database server) multi-threading capabilities and resulting scalability from modern SMP and NUMA architectures, and has been used on Linux/x86-64 and Solaris/Niagara architectures.

code