Practical Course: Database Implementation
Content
Short version: In this practical course, you will implement a main-memory database system from scratch.
Relational database systems are ubiquitous. Despite being designed decades ago, the fundamental architecture of these general-purpose systems has been virtually unchanged. In particular, these systems have been optimized to minimize the number of disk I/O operations. In recent times the increasing amount of RAM and fundamental changes to the hardware landscape, like the proliferation of multi-core CPUs, have changed the underlying performance tradeoffs. For example, the standard query evaluation model ("iterator model" based on interpretation) incurs huge overheads when data primarily resides in RAM. While modern column stores, which enable blazingly fast query processing but are not suitable for online transaction processing, have mostly converged on a common design, this is not the case for hybrid main-memory OLAP/OLTP systems. In this project, we focus on hybrid systems, where the architecture differs more between the different proposals.
Due to these trends, core database systems research focusing on optimizing database systems for modern hardware has proliferated. Two fundamental high-level approaches exist to design a high-performance database system for modern hardware. Some researchers start with traditional systems and try to find and optimize the bottlenecks. The advantage of this approach is that one has a fully-featured system as a basis (e.g., the open-source PostgreSQL system or the storage manager Shore). The disadvantage is that one is still constrained by the old architecture, risking getting stuck in a "local minimum."
In this course, we take the second, opposite approach. We start with a very fast but limited database functionality kernel and gradually extend it to make it more "database-like". Specifically, we start by hand-coding the core parts of TPC-C, a typical transaction processing (OLTP) benchmark, in C++. This enables very high performance from the start and allows us to focus on fundamental design decisions like how to store and index data most efficiently. It also allows easy experimentation with the tradeoffs between row and column stores. We then focus on efficient analytical query processing. To this end, we again manually implement queries based on the previously designed physical database storage format and again without being constrained by architectural design decisions.
In the second phase, we use compilation as our primary technique to automatically generate the code that was previously hand-coded. Instead of using the iterator model, this approach compiles queries directly to fast code that consists of tight loops and directly operates on the data representation. We use compilation both to generate the database storage data structures and access methods as well as code for individual queries. To compile complex, analytical multi-operator queries, which are generally represented as trees of algebra operators, we implement a generic query compiler that compiles a tree of algebra operators to machine code. We use C++ as the intermediate language for fast prototyping, which is compiled at runtime.
Prerequisites
- Lecture Fundamentals of Databases (IN0008) or similar course
- Systems programming experience in C/C++
- Beneficial: lecture Database Systems on Modern CPU Architectures (IN2118)
- The course will be in English
Organization
- Kickoff meeting: Wednesday, 10.07.2024 at 11:00 in 02.11.018
- In order to take this course, you have to complete a simple task, which is meant to test if you have the
prerequisites for the course and to introduce you to our build infrastructure.
- Register an account in our GitLab.
- Request access to the GitLab group and join the Mattermost team.
- Fork the task.
- Commit your solution as early as possible but latest on 21st of July, we will check the forked repositories after that deadline.
- Don't forget to register for the course in the matching system!