Practical Course: Database Implementation
Content
In this practical course you will implement a main-memory database system from scratch.
Prerequisites
- lecture Fundamentals of Databases (IN0008) or similar course
- beneficial: lecture Database Systems on Modern CPU Architectures (IN2118)
- systems programming experience in C/C++ (or lots of motivation)
Organization
- Pre-meeting: July 8, 5pm in room MI 02.11.058
General Notes
Please note the following suggestions for your project: Notes on Programming Assignments. Here is a summary:
- Use a build system, preferably make. Make is fairly easy to get started with; you can find a simple tutorial e.g. here.
- Use a source code control system, e.g. Git. You can find an interactive tutorial e.g. here
- Use a memory checker to find (weird) bugs in your program; we recommend Valgrind
- If you prefer to use a (graphical) IDE, we can recommend Eclipse with CDT or KDevelop
- Use a debugger, e.g. gdb. Many IDEs have debugging features built in, e.g. Eclipse
- If you use GCC (g++) as a compiler, here are some important compiler flags/options that may help you to find bugs:
- -g generate debug symbols for better debugging
- -O0 disable optimizations to allow for more reliable debugging; use -O3 instead when you measure performance
- -Wall generate helpful warnings
- -std=c++11 to enable C++11
- profiling with perf:
- record where time is spent: perf record -e cycles:pp -p PID (PID is the process id of the process you want to benchmark)
- show report of recorded events: perf report
- count number of cache misses: perf stat -e cache-misses -p PID
- list of available events: perf list
- monitor CPU-specific perf events (not listed in perf list)
- find event for your CPU in Intel Software Developer's Manual Volume 3 Chapter 19
- you need the "Event Num." (e.g. D2) and "Umask Value" (e.g. 01)
- the perf event name starts with r, then concatenate umask and event (umask first!): perf record -e r01D2
- perf tutorial
Task 1
Implement the TPC-C "New Order" transaction and storage for all relations as a standalone C++ program.
- Write a load procedure to populate the database with the data provided in the *.tbl files.
- Translate the "New Order" (neworder.script) transaction to C++ using the following mapping from SQL to C data types (a C++ sketch follows this list):
- integer: int32_t
- char, varchar: length plus character array (do not use pointers or std::string)
- timestamp: uint64_t (seconds since 01.01.1970)
- numeric(x,y): int64_t (x is the total number of digits, y is the number of digits after the decimal separator; for example, with numeric(10,2) the value 12.3 is represented as 1230)
- Test your implementation by translating newOrderRandom (from oltp.script) to C++ and executing it 1 million times; how many newOrder transactions per second can your implementation execute?
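The type mapping could, for example, look like the following C++ sketch (the struct and field names are illustrative, not prescribed):

    #include <cstdint>
    #include <cstring>

    // char/varchar: length plus character array (no pointers, no std::string)
    template <unsigned MaxLen>
    struct Varchar {
       uint16_t len = 0;
       char data[MaxLen];
       void set(const char* s) {
          len = static_cast<uint16_t>(std::strlen(s));
          std::memcpy(data, s, len);
       }
    };

    using Numeric = int64_t;    // numeric(x,y), scaled: numeric(10,2) value 12.3 -> 1230
    using Timestamp = uint64_t; // seconds since 01.01.1970

    // example: a (simplified) warehouse tuple using these types
    struct Warehouse {
       int32_t w_id;
       Varchar<10> w_name;
       Numeric w_tax; // numeric(4,4)
       Numeric w_ytd; // numeric(12,2)
    };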
Task 2
- Write a simple parser for SQL schema files. It should handle the TPC-C schema file. You can assume that all attributes are declared NOT NULL, and you can ignore additional non-primary key indexes for now. This code may be helpful, but it does not support all needed features.
- Write a generator that generates C++ code from the schema. It must be possible to store, insert, delete, access attributes, and find tuples (using the primary key); a sketch of possible generated code follows this list.
- (Manually) re-implement the newOrder transaction using the generated code.
- Add the delivery transaction to your existing TPC-C implementation.
- Execute 1 million transactions composed of a mix of newOrder (90%) and delivery (10%) transactions (see oltp.cpp); measure your performance. Send your solution to Viktor Leis by 30 Oct 2013, 2pm.
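For each relation, the generated code could look roughly like this sketch (the names, the column-wise layout, and the use of std::unordered_map are illustrative choices, not requirements):

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // hypothetical generated code for the warehouse relation
    struct WarehouseTable {
       // attribute storage (column-wise here; row-wise works as well)
       std::vector<int32_t> w_id;
       std::vector<int64_t> w_tax;
       // ... remaining attributes ...

       // primary key index: key -> tuple id (position in the vectors)
       std::unordered_map<int32_t, size_t> pkIndex;

       size_t insert(int32_t id, int64_t tax) {
          size_t tid = w_id.size();
          w_id.push_back(id);
          w_tax.push_back(tax);
          pkIndex[id] = tid;
          return tid;
       }

       // find the tuple id for a primary key; returns false if not present
       bool lookup(int32_t id, size_t& tid) const {
          auto it = pkIndex.find(id);
          if (it == pkIndex.end()) return false;
          tid = it->second;
          return true;
       }
    };

Deletion (needed for the delivery transaction) could, for example, move the last tuple into the deleted slot and update the affected index entries accordingly.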
Task 3
- Implement the following SQL query as a C++ program, using STL data structures for the join implementation. Do not use any existing index structures (e.g., the primary key indexes):
select sum(ol_quantity*ol_amount) from customer, "order", orderline where o_w_id = c_w_id and o_d_id = c_d_id and o_c_id = c_id and o_w_id = ol_w_id and o_d_id = ol_d_id and o_id = ol_o_id and c_last like 'B%'
- Run the query 10 times in sequence. For each run, print the result of the query and measure the execution time.
- Execute 1 million newOrder transactions. Using the fork system call, run the query concurrently. Whenever a query is finished, create a new snapshot using fork, so that exactly one snapshot and one query are active at any given time. This example for using fork may be helpful; a minimal sketch also follows this list.
- Measure how many transactions per second you can process using this snapshotting model, how long the queries take on average, and how long the fork takes.
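The snapshotting loop could be structured roughly as follows (runQuery and runNewOrderRandom are hypothetical placeholders for your query and transaction code):

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    void runQuery();          // runs the analytical query and prints its result
    void runNewOrderRandom(); // runs one random newOrder transaction

    int main() {
       bool queryActive = false;
       for (unsigned i = 0; i < 1000000; i++) {
          if (!queryActive) {
             // fork a new snapshot; the child sees a consistent
             // copy-on-write copy of the database
             pid_t pid = fork();
             if (pid == 0) {
                runQuery();
                _exit(0); // make sure runQuery flushes its output first
             }
             queryActive = true;
          }
          runNewOrderRandom(); // the parent keeps processing transactions
          // reap a finished query process without blocking
          if (queryActive && waitpid(-1, nullptr, WNOHANG) > 0)
             queryActive = false;
       }
       if (queryActive) wait(nullptr); // wait for the last query
       return 0;
    }

With this structure, the fork duration can be measured around the fork call in the parent, and the query time inside the child.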
Task 4
- Implement query compilation of relational algebra trees to C++ code. Use the produce/consume (push) model (see the sketch at the end of this task). You need to support the table scan, selection, print, and inner join operators.
- Manually build an algebra tree (including IU references) for the following SQL query:
select c_first,c_last,o_all_local,ol_amount from customer, "order", orderline where o_w_id = c_w_id and o_d_id = c_d_id and o_c_id = c_id and o_w_id = ol_w_id and o_d_id = ol_d_id and o_id = ol_o_id and c_id = 322 and c_w_id = 1 and c_d_id = 1
- Compile the algebra tree to C++ code and execute the resulting code (including a print operator at the root).
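A minimal sketch of the produce/consume interface (the operator and method names follow the model from the code generation paper; the string-based code emission is just one illustrative choice):

    #include <iostream>
    #include <memory>
    #include <string>

    // emit one line of generated C++ code
    static void emit(const std::string& line) { std::cout << line << "\n"; }

    struct Operator {
       Operator* consumer = nullptr;
       virtual void produce() = 0;                 // generate code that produces tuples
       virtual void consume(Operator* source) = 0; // a child pushes a tuple upward
       virtual ~Operator() = default;
    };

    struct TableScan : Operator {
       std::string table;
       void produce() override {
          emit("for (size_t tid = 0; tid != " + table + ".size(); ++tid) {");
          consumer->consume(this);
          emit("}");
       }
       void consume(Operator*) override {} // leaf operator, never consumes
    };

    struct Selection : Operator {
       std::unique_ptr<Operator> input; // input->consumer must point back to this
       std::string predicate;           // e.g. "c_id[tid] == 322"
       void produce() override { input->produce(); }
       void consume(Operator*) override {
          emit("if (" + predicate + ") {");
          consumer->consume(this);
          emit("}");
       }
    };

An inner join would analogously emit code that inserts into a hash table when its build-side child calls consume, and probes that table when its probe-side child does; the source parameter tells the join which child is pushing.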
Task 5
- Implement a simple parser for the following subset of SQL:
- The select clause contains one or more attribute names.
- The from clause contains one or more relation names.
- The where clause is optional and can contain one or more selections that are connected by "and". You only need to support selections of the form "attribute = attribute" and "attribute = constant".
- You can assume that each relation is referenced only once, and that all attribute names are unique.
- Implement semantic analysis, which takes the parsed SQL statement and the schema as input and computes an algebra tree that can be used as input to your code generator. Report errors when non-existing attributes or tables are referenced. Also, you should report an error if a table has no join condition (i.e., a cross product would be necessary). Build left-deep join trees based on the order of the relations in the from clause.
- Implement a "read-eval-print-loop" for your database system:
- read one line from standard input
- analyze it (report any errors and abort if necessary)
- generate C++ code and store it in a temporary file
- call a C++ compiler and generate a temporary shared library (.so file)
- using dlopen, dlsym, and dlclose you can now load the shared library and execute the query in it (have a look at the example, and at the sketch after this list)
- Measure and report the compilation and the execution times separately.
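The compile-and-load part of the loop could look roughly like this sketch (the file names and the runQuery entry point are illustrative; link the program itself with -ldl):

    #include <dlfcn.h>
    #include <cstdlib>
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
       std::string line;
       while (std::getline(std::cin, line)) {
          // ... parse and analyze 'line', then generate C++ code ...
          {
             std::ofstream out("/tmp/query.cpp");
             out << "extern \"C\" void runQuery() { /* generated code */ }\n";
          }
          // compile the generated code into a temporary shared library
          if (std::system("g++ -O3 -std=c++11 -fPIC -shared /tmp/query.cpp -o /tmp/query.so") != 0)
             continue; // report the compilation error and read the next line
          // load the library and execute the query
          void* handle = dlopen("/tmp/query.so", RTLD_NOW);
          if (!handle) { std::cerr << dlerror() << "\n"; continue; }
          auto query = reinterpret_cast<void (*)()>(dlsym(handle, "runQuery"));
          if (query) query();
          dlclose(handle);
       }
       return 0;
    }

Declaring the generated entry point extern "C" avoids C++ name mangling, so dlsym can find it by name.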
Related Papers
- Motivation for main-memory database systems: OLTP Through the Looking Glass, and What We Found There
- Snapshots using fork: HyPer - Hybrid OLTP&OLAP High Performance Database System
- Code generation: Efficiently Compiling Efficient Query Plans for Modern Hardware