Here are some notes intended to help you successfully complete the projects and programming assignments in our classes. They cover various aspects of software development, e.g. the choice of development platform, source code management tools, and testing strategy. This page does not teach you C++, for example (how could it?). Rather, it points you to helpful material and tools and aims to keep you from making bad decisions when starting a new project.
This page is a work in progress, but please contact me if you have any comments.
You can use whatever operating system you want for development (of course), as long as your programs can be built and run on Linux. Especially when low-level code is involved (as e.g. in the
Please write your programs in C++ (C++11, to be precise). I would claim that C++ is the best choice for writing a database system, as it gives you control over low-level details such as memory allocation and layout and allows you to write high-performance programs (unlike Java and other managed languages). Additionally, it is more expressive than C and has various features that simplify your life (unlike low-level languages). The downside is its steep learning curve. Nevertheless, I would recommend using the projects/assignments as an opportunity to learn C++ or improve your skills. C++ skills are rare and in demand.
auto type inference
for loops
unique_ptrs. There are typically very few cases where you cannot use either a reference or a unique_ptr (also important to mention in this context: move semantics).
unordered_set, unordered_map, unordered_multiset, unordered_multimap
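As a minimal sketch of how these features fit together: the Page type and the functions consume and countWords below are made up for illustration and are not part of any assignment.

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical example type: a page that owns a fixed-size buffer.
struct Page {
    explicit Page(unsigned pageId) : id(pageId), data(4096) {}
    unsigned id;
    std::vector<char> data;
};

// Takes ownership of the page; the caller hands it over with std::move.
unsigned consume(std::unique_ptr<Page> page) {
    // The page is freed automatically when `page` goes out of scope.
    return page->id;
}

// Counts how often each word occurs, using a hash-based container.
std::unordered_map<std::string, unsigned> countWords(const std::vector<std::string>& words) {
    std::unordered_map<std::string, unsigned> counts;
    // Range-based for loop; auto deduces the element type.
    for (const auto& word : words)
        ++counts[word];
    return counts;
}
```

A caller would create a page with std::unique_ptr<Page> page(new Page(7)); and pass it on with consume(std::move(page)). After the move, page is empty, which makes the transfer of ownership explicit in the code.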
Please refrain from using any libraries other than the STL (and googletest for unit testing) in your projects/assignments unless you have checked with me first.
Please comment your code. Comment all class definitions, non-trivial member functions and variables, and steps in your algorithms. Also, please use a consistent coding style across your project. I don't want to overspecify this, but be consistent with indentation (tabs or spaces) and naming schemes (e.g. UpperCaseCamelCase for classes, lowerCaseCamelCase for variables/methods/functions).
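To make this concrete, here is a small sketch of what commented code with a consistent naming scheme could look like. The class HitCounter and its members are hypothetical and exist only to show the style; any other consistent scheme is just as fine.

```cpp
#include <cstddef>
#include <vector>

/// Counts how often each slot of a fixed-size table is hit (hypothetical example class).
class HitCounter {
public:
    /// Creates a counter with slotCount slots, all starting at zero.
    /// Assumption for this sketch: slotCount is greater than zero.
    explicit HitCounter(std::size_t slotCount) : hitsPerSlot(slotCount, 0) {}

    /// Records one hit for the slot that key maps to.
    void recordHit(std::size_t key) {
        // Step 1: map the key to a slot (simple modulo, for illustration only).
        std::size_t slot = key % hitsPerSlot.size();
        // Step 2: increment that slot's counter.
        ++hitsPerSlot[slot];
    }

    /// Returns the number of hits recorded for the given slot.
    unsigned hitsInSlot(std::size_t slot) const { return hitsPerSlot[slot]; }

private:
    /// One counter per slot; the index is the slot number.
    std::vector<unsigned> hitsPerSlot;
};
```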
CppQuiz.org is a great resource for testing your understanding of the language!
Keep your project tidy on a file system level by using subfolders for different parts of your code. Example:
MyDBMS
+-Makefile
+-.gitignore
+-README.md
+-bin
+-index
| +-HashTable.hpp
| +-HashTable.cpp
| +-BTree.hpp
| +-BTree.cpp
+-buffer
| +-BufferManager.hpp
| +-BufferManager.cpp
| +-AsyncWriter.hpp
| +-AsyncWriter.cpp
+-testing
  +-BufferManagerTest.cpp
  +-HashTableTest.cpp
You don't have to replicate the exact same structure as depicted above, but you should ensure the following: separate binary files from source files, split your code into different components (here: index structures and buffer management), and separate production code from unit tests. If you build libraries, you should also separate .hpp files from .cpp files (this is not done in the example).
There are some files that make sense to keep at the top level: Makefile is a file used by the build system make, and .gitignore is the ignore list of your Git repository. README.md does not need to include a lengthy description of what your project is about (I know this anyway), but should briefly specify how your project can be built and run (including which parameters), how you tested it (platform, configuration, parameters, etc.) and what issues you are aware of.
There are many source code management systems out there, and I have a clear favorite: Git. It has numerous advantages compared to its competitors and is lightweight and easy to set up and use. There is a great free book, a free, interactive tutorial and there are great cheat sheets available to get you started. Plus, GitHub and Bitbucket give out free, private repositories to university students. Even though you do not need these in order to use Git, they can be helpful for collaboration and as a backup.
Once installed, setting up Git for your project is as easy as git init (make this directory a Git repository), git add <file1> <file2> ... <fileN> (add file1 to fileN to the repository) and git commit -m "initial commit" (commit the changes, i.e. the addition of file1 to fileN). Refer to the resources mentioned above to learn how to take it from here. Please maintain a .gitignore file to exclude any unwanted files (e.g. the directory with the binary files, backup/temporary files of your editor/IDE, large files containing generated test data, ...) from your repository.
When submitting your code or your solution to an assignment, instead of emailing me your (compressed) project files, you can simply refer me to your Git repository (I prefer that). Make sure I have read access to your repository, then send me an email (before the project is due) with the repository information and indicate which branch and commit ID you want me to grade.
Any publicly available build system is okay with me (as long as it runs on Linux), but especially for single-platform projects, good old make is an excellent choice. In a Makefile, you specify how your system can be built in an easy format:
target: dependency1 dependency2 dependencyN
	command1
	command2
	commandM
This tells make that the target (usually an object file or a binary) depends on the files dependency1 to dependencyN. That is, if one of these dependencies has been updated since the last build, target has to be rebuilt, and this can be done by invoking the commands command1 to commandM. The only pitfall is that the commands have to be indented using tabs, not spaces.
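As an illustration of this format, here is a minimal sketch of a Makefile for a hypothetical project with two source files, main.cpp and HashTable.cpp, that share the header HashTable.hpp. The file names, the bin/mydbms target and the chosen flags are assumptions made for this example, so adapt them to your own layout. Every command line starts with a tab character.

```make
# Hypothetical two-file project; command lines must start with a tab.
CXX      := g++
CXXFLAGS := -std=c++11 -Wall -g

# Link step: rebuild the binary when any object file is newer than it.
bin/mydbms: main.o HashTable.o
	mkdir -p bin
	$(CXX) $(CXXFLAGS) -o bin/mydbms main.o HashTable.o

# Compile steps: each object file depends on its source and the shared header.
main.o: main.cpp HashTable.hpp
	$(CXX) $(CXXFLAGS) -c main.cpp -o main.o

HashTable.o: HashTable.cpp HashTable.hpp
	$(CXX) $(CXXFLAGS) -c HashTable.cpp -o HashTable.o

# Remove all generated files.
.PHONY: clean
clean:
	rm -f main.o HashTable.o bin/mydbms
```

With this file in place, running make builds the first target (bin/mydbms), and touching HashTable.hpp causes both object files and the binary to be rebuilt on the next run.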
I have not found the perfect make tutorial yet, but this one seems decent and brief.
make has many powerful functions that help you keep your Makefile concise. However, if you are a make novice, it's best to stay away from stuff you don't fully understand, as debugging Makefiles is no fun.
GCC's g++ is popular and proven, while LLVM's clang++ is also a great, free C++ compiler and a promising challenger to g++ (its comprehensible error and warning messages are especially compelling). For both, I recommend the latest version, in particular because C++11 support is constantly being improved.
Useful compiler flags:
-std=c++11 / -std=c++0x: Enable (experimental) C++11 (C++0x) support
-g: Generate debug symbols
-O0: Disable optimizations to allow for more reliable debugging (update: use -Og, if supported by your compiler). Use -O3 when running benchmarks.
-Wall (GCC) or -Weverything (Clang): Generate helpful warnings. Do not ignore them! In fact, force yourself to deal with warnings by turning them into errors with -Werror.
Use a debugger to find bugs, don't rely on debug output. Good debuggers: GDB (the GNU debugger) and LLVM's LLDB. Most IDEs have a graphical debugger front-end, but the command line can already be very helpful when your program crashes. There's a curses-based interface for gdb called cgdb that I can recommend. Little-known fact: GDB now supports (limited) reverse debugging.
If you have never used a debugger, check out this toy example:
#include <iostream>
#include <cstdlib>

int bar(int len, char* args[]) {
    int sum = 0;
    for (unsigned i=0; i<len; ++i)
        sum += std::atoi(args[i]);
    return sum;
}

void foo(int len, char* array[]) {
    if (len > 1)
        std::cout << bar(len,array+1) << std::endl;
    else
        std::cout << 0 << std::endl;
}

int main(int argc, char* argv[]) {
    foo(argc, argv);
    return 0;
}
me@mymachine:/tmp$ ./test 123 456 789
Segmentation fault (core dumped)

me@mymachine:/tmp$ gdb --args ./test 123 456 789
GNU gdb (GDB) 7.5-ubuntu
Copyright (C) 2012 Free Software Foundation, Inc.
...
Reading symbols from /tmp/test...done.
(gdb) run
Starting program: /tmp/test 123 456 789

Program received signal SIGSEGV, Segmentation fault.
____strtol_l_internal (nptr=0x0, endptr=0x0, base=10, group=<optimized out>, loc=0x7ffff7ad1040 <_nl_global_locale>) at ../stdlib/strtol_l.c:298
(gdb) bt
#0 ____strtol_l_internal (nptr=0x0, endptr=0x0, base=10, group=<optimized out>, loc=0x7ffff7ad1040 <_nl_global_locale>) at ../stdlib/strtol_l.c:298
#1 0x00007ffff77519e0 in atoi (nptr=<optimized out>) at atoi.c:28
#2 0x0000000000400898 in bar (len=4, args=0x7fffffffe040) at test.cpp:7
#3 0x00000000004008db in foo (len=4, array=0x7fffffffe038) at test.cpp:13
#4 0x0000000000400934 in main (argc=4, argv=0x7fffffffe038) at test.cpp:19
(gdb) f 2
#2 0x0000000000400898 in bar (len=4, args=0x7fffffffe040) at test.cpp:7
7 sum += std::atoi(args[i]);
(gdb) p i
$1 = 3
(gdb) quit
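The session can be read as follows: in frame 2, i is 3 while len is 4, and args points one element past the start of argv. So args[3] is argv[4], the null pointer that terminates the argument list, which matches nptr=0x0 in frame 0. The caller skipped the program name but passed the length unchanged. A possible fix, sketched under the assumption that the intent is to sum all arguments after the program name, passes len - 1 instead. The function sumArguments is a hypothetical replacement for foo that returns the sum rather than printing it.

```cpp
#include <cstdlib>

// Sums the integer values of the `len` C strings args[0] .. args[len-1].
int bar(int len, char* args[]) {
    int sum = 0;
    for (int i = 0; i < len; ++i)
        sum += std::atoi(args[i]);
    return sum;
}

// Skips the program name (array[0]) and sums the remaining arguments.
// Returns 0 if there are no arguments besides the program name.
int sumArguments(int len, char* array[]) {
    if (len > 1)
        return bar(len - 1, array + 1);  // one element fewer, since array[0] is skipped
    return 0;
}
```

From main, this would be called as sumArguments(argc, argv).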
By default, printing objects works well for built-in types in GDB, but is often less helpful for STL data structures (e.g. std::string or std::vector), as you only see memory addresses, not the content of the container. Here is some information on how to change this for GDB. LLDB has a similar feature called synthetic children.
If your program behaves somehow "nondeterministically" or "mysteriously", Valgrind is your friend. Valgrind's memcheck finds illegal accesses to memory, uninitialized reads and much more. The option --db-attach=yes starts the debugger when an error is found (newer Valgrind releases dropped this option in favor of the built-in gdbserver, enabled with --vgdb=yes together with --vgdb-error=<n>). Check out this blog post on the interaction between GDB and Valgrind.
Valgrind's Helgrind and DRD can help you find thread-related problems. This short blog post gives some helpful advice on how to detect the cause of a deadlock. A significantly faster and only marginally less thorough alternative to Valgrind's memcheck is AddressSanitizer.
Before making a commit in your SCM system, make sure your program is memchecked and passes all unit tests.
If you prefer an IDE over a setup with just an editor and a command line, Eclipse with CDT is a good (but heavyweight) cross-platform IDE. KDevelop is also a good choice for KDE users. Both have the advantage that you can easily import make-based projects and build your programs from within the IDE using make. C++ guru (and Microsoft employee) Herb Sutter recommends the free version of Microsoft Visual C++ for Windows users.
Unit tests help to improve the correctness of your code and prevent regressions. googletest is a great unit-testing framework for C++. Use it to write test cases for each class/algorithm that actually try to
bcov is a code coverage analysis tool that tells you how much of your code is covered by your unit tests. Using a code coverage tool is probably a case of using a sledgehammer to crack a nut for your (smallish) project, but I found it worth mentioning... especially since my boss wrote it.
Profilers help you to understand the performance of your program (and the environment it is running in). As profiling is probably not required for assignments/projects (but may help!), I keep this section short. To put it bluntly: avoid oprofile (and gprof), prefer perf. It is easy to use perf, and it helps you understand where you spend your CPU cycles, how many cache misses you produce and much more.
If you qualify for a student license, you can get Intel® VTune™ for free. Check it out, it's complex but amazing.
Great! Talk to me about a student job.