Notes on Programming Assignments

Intro

Here are some notes intended to help you successfully complete the projects and programming assignments in our classes. They cover various aspects of software development, e.g. the choice of development platform, source code management tools, and testing strategy. This page does not teach you C++, for example (how could it?); rather, it points you to helpful material and tools and aims to keep you from making bad decisions when starting a new project.

This page is a work in progress; please contact me if you have any comments.

Operating System

You can use whatever operating system you want for development (of course), as long as your programs can be built and run on Linux. Especially when low-level code is involved (e.g. in the Database Implementation class), developing on Windows is not the best choice for our classes (even if you can alleviate the pain): it is not POSIX compliant and lacks many of the tools suggested here (it might have other advantages, though). Thus, if you run Windows, I recommend that you install Linux (e.g. Kubuntu), either in a VM or alongside your Windows installation, and use it for development. OS X users are usually fine and can install missing pieces, such as the latest GCC, via Homebrew, for example.

C++

Please write your programs in C++ (C++11, to be precise). I would claim that C++ is the best choice for writing a database system, as it gives you control over low-level details such as memory allocation and layout and allows you to write high-performance programs (unlike Java and other managed languages). Additionally, it is more expressive than C and has various features that simplify your life (unlike low-level languages). The downside is its steep learning curve. Nevertheless, I recommend using the projects/assignments as an opportunity to learn C++ or improve your skills. C++ skills are rare and in demand.

There are many good books about C++ (and even more crappy ones...). I can recommend Addison-Wesley's C++ In-Depth series. Get started with Stroustrup's A Tour of C++, which is a great introduction to the language and the STL using C++11. Alternatively, Koenig/Moo's Accelerated C++ is didactically excellent and helps you get a good understanding of the language and the Standard Template Library (STL), but it is a little outdated (no C++11 features). Scott Meyers' Effective C++: 55 Specific Ways to Improve Your Programs and Designs and Herb Sutter's C++ Coding Standards are also very helpful for intermediate programmers.
All C++11 language features are now supported by current versions of Clang and GCC (but the STL implementations are still lagging behind). Here are some helpful things you might want to begin with:
- auto type inference
- range-based for loops
- Smart pointers, in particular unique_ptr. There are typically very few cases where you cannot use either a reference or a unique_ptr (also important to mention in this context: move semantics).
- Hash table data structures: unordered_set, unordered_map, unordered_multiset, unordered_multimap
- Many others: Threads, deleted member functions, lambdas, ...
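
To give a feel for these features, here is a minimal sketch that combines several of them. The names Page, PageHolder, countWords and countLongWords are made up for illustration and are not part of any assignment. Note that std::make_unique is only available from C++14 on, so the usage shown below the code creates the unique_ptr with new.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical example type: a page of zero-initialized bytes.
struct Page {
  explicit Page(std::size_t size) : data(size, 0) {}
  std::vector<char> data;
};

// Hypothetical owner of at most one Page.
class PageHolder {
 public:
  // A unique_ptr cannot be copied; the caller hands over ownership with std::move.
  void adopt(std::unique_ptr<Page> newPage) { page = std::move(newPage); }
  bool hasPage() const { return page != nullptr; }
  std::size_t pageSize() const {
    assert(page != nullptr);
    return page->data.size();
  }

 private:
  std::unique_ptr<Page> page;  // released automatically when the holder is destroyed
};

// Counts how often each word occurs, using a hash table (unordered_map).
std::unordered_map<std::string, int> countWords(const std::vector<std::string>& words) {
  std::unordered_map<std::string, int> counts;
  for (const auto& word : words)  // range-based for loop with auto type inference
    ++counts[word];
  return counts;
}

// Counts the words that are longer than minLength, using a lambda as predicate.
std::size_t countLongWords(const std::vector<std::string>& words, std::size_t minLength) {
  return static_cast<std::size_t>(std::count_if(
      words.begin(), words.end(),
      [minLength](const std::string& word) { return word.size() > minLength; }));
}
```

A caller would write something like std::unique_ptr<Page> p(new Page(4096)); holder.adopt(std::move(p)); after that call p is empty and the holder owns the page.
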
References:

- The 4th edition of Stroustrup's "The C++ Programming Language" covers C++11.
- The C++11 standard is not available for free, but the latest draft is, and the two are virtually identical in content. The draft is a good reference, but a poor way to learn the language.
- A useful online reference for C++(11)'s powerful Standard Template Library, which offers various helpful data structures and algorithms.
- There is a great new FAQ on the Standard C++ Foundation's website covering lots of topics, from basics and getting started through OOP to advanced material and a preview of C++14.

Please refrain from using any libraries other than the STL (and googletest for unit testing) in your projects/assignments unless you have checked with me first.

Please comment your code. Comment all class definitions, non-trivial member functions and variables, and the steps in your algorithms. Also, please use a consistent coding style across your project. I don't want to overspecify this, but be consistent with indentation (tabs or spaces) and naming schemes (e.g. UpperCaseCamelCase for classes, lowerCaseCamelCase for variables/methods/functions).
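
As one possible illustration of this level of commenting and of the naming scheme mentioned above, consider the following sketch. PageIdStack is a hypothetical class invented for this example, and the /// comment style is just one option among many.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

/// Fixed-capacity stack of page IDs (hypothetical example class).
class PageIdStack {
 public:
  /// Creates an empty stack that can hold at most maxSize IDs.
  explicit PageIdStack(std::size_t maxSize) : capacity(maxSize) {}

  /// Pushes pageId. Returns false (and changes nothing) if the stack is full.
  bool push(unsigned pageId) {
    if (pageIds.size() == capacity) return false;
    pageIds.push_back(pageId);
    return true;
  }

  /// Removes and returns the most recently pushed ID. The stack must not be empty.
  unsigned pop() {
    assert(!pageIds.empty());
    unsigned top = pageIds.back();
    pageIds.pop_back();
    return top;
  }

  /// Returns the number of IDs currently stored.
  std::size_t size() const { return pageIds.size(); }

 private:
  std::size_t capacity;           ///< Maximum number of IDs the stack may hold
  std::vector<unsigned> pageIds;  ///< Stored IDs; the back is the top of the stack
};
```

The comments state what each member does and what it expects from the caller, which is usually more useful than restating the code line by line.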

CppQuiz.org is a great resource for testing your understanding of the language!

Directory Structure

Keep your project tidy at the file system level by using subfolders for the different parts of your code. Example:

MyDBMS
  +-Makefile
  +-.gitignore
  +-README.md
  +-bin
  +-index
  |  +-HashTable.hpp
  |  +-HashTable.cpp
  |  +-BTree.hpp
  |  +-BTree.cpp
  +-buffer
  |  +-BufferManager.hpp
  |  +-BufferManager.cpp
  |  +-AsyncWriter.hpp
  |  +-AsyncWriter.cpp
  +-testing
     +-BufferManagerTest.cpp
     +-HashTableTest.cpp

You don't have to replicate the exact same structure as depicted above, but should ensure the following: Separate binary files from source files, split your code into different components (here: index structures and buffer management) and separate production code from unit tests. If you build libraries, you should also separate .hpp files from .cpp files (this is not done in the example).

There are some files that make sense to keep at the top level: Makefile is the file used by the build system make, and .gitignore is the ignore list of your Git repository. README.md does not need to include a lengthy description of what your project is about (I know this anyway), but should briefly specify how your project can be built and run (including which parameters), how you tested it (platform, configuration, parameters, etc.), and what issues you are aware of.

Source Code Management

There are many source code management systems out there -- I have a clear favorite: Git. It has numerous advantages compared to its competitors and is lightweight and easy to set up and use. There is a great free book, a free interactive tutorial, and there are great cheat sheets available to get you started. Plus, GitHub and Bitbucket give out free private repositories to university students. Even though you do not need these in order to use Git, they can be helpful for collaboration and as a backup.

Once installed, setting up Git for your project is as easy as git init (make this directory a Git repository), git add <file1> <file2> ... <fileN> (add file1 to fileN to the repository) and git commit -m "initial commit" (commit the changes, i.e. the addition of file1 to fileN). Refer to the resources mentioned above to learn how to take it from here. Please maintain a .gitignore file to exclude any unwanted files (e.g. the directory with the binary files, backup/temporary files of your editor/IDE, large files containing generated test data, ...) from your repository.

When submitting your code or your solution to an assignment, instead of emailing me your (compressed) project files, you can simply refer me to your Git repository (I prefer that). Make sure I have read access to your repository, then send me an email (before the project is due) with the repository information and indicate which branch and commit ID you want me to grade.

Build System

Any publicly available build system is okay with me (as long as it runs on Linux), but especially for single-platform projects, good old make is an excellent choice. In a Makefile, you specify how your system can be built in an easy format:

target: dependency1 dependency2 dependencyN
	command1
	command2
	commandM

This tells make that the target (usually an object file or a binary) depends on the files dependency1 to dependencyN. That is, if one of these dependencies has been updated since the last build, target has to be rebuilt, which is done by invoking the commands command1 to commandM. The only pitfall is that the commands have to be indented using tabs, not spaces.

I have not found the perfect make tutorial yet, but this one seems decent and brief.

make has many powerful functions that help you keep your Makefile concise. However, if you are a make-novice, it's best to stay away from stuff you don't fully understand, as debugging Makefiles is no fun.

Compiler

GCC's g++ is popular and proven, while LLVM's clang++ is also a great, free C++ compiler and a promising challenger to g++ (its comprehensible error and warning messages are especially compelling). For both, I recommend the latest version, in particular because C++11 support is constantly being improved.

Useful compiler flags:

-std=c++11/-std=c++0x: Enable (experimental) C++11 (C++0x) support
-g: Generate debug symbols
-O0: Disable optimizations to allow for more reliable debugging (update: use -Og, if supported by your compiler). Use -O3 when running benchmarks.
-Wall (GCC) or -Weverything (Clang): Generate helpful warnings. Do not ignore them! In fact, force yourself to deal with warnings by turning them into errors with -Werror.

Debugging Tools

Use a debugger to find bugs; don't rely on debug output. Good debuggers: GDB (the GNU debugger) and LLVM's LLDB. Most IDEs have a graphical debugger front-end, but the command line can already be very helpful when your program crashes. I can recommend cgdb, a curses-based interface for GDB. Little-known fact: GDB now supports (limited) reverse debugging.
If you have never used a debugger, check out this toy example:

A buggy program

#include <iostream>
#include <cstdlib>

int bar(int len, char* args[]) {
  int sum = 0;
  for (unsigned i=0; i<len; ++i)
    sum += std::atoi(args[i]);
  return sum;
}

void foo(int len, char* array[]) {
  if (len > 1)
    std::cout << bar(len,array+1) << std::endl; 
  else
    std::cout << 0 << std::endl;
}

int main(int argc, char* argv[]) {
  foo(argc, argv);
  return 0;
}

Program crashes

me@mymachine:/tmp$ ./test 123 456 789
Segmentation fault (core dumped)

Debugging the program with GDB

me@mymachine:/tmp$ gdb --args ./test 123 456 789
GNU gdb (GDB) 7.5-ubuntu
Copyright (C) 2012 Free Software Foundation, Inc.
...
Reading symbols from /tmp/test...done.
(gdb) run
Starting program: /tmp/test 123 456 789

Program received signal SIGSEGV, Segmentation fault.
____strtol_l_internal (nptr=0x0, endptr=0x0, base=10, group=<optimized out>, loc=0x7ffff7ad1040 <_nl_global_locale>) at ../stdlib/strtol_l.c:298
(gdb) bt
#0  ____strtol_l_internal (nptr=0x0, endptr=0x0, base=10, group=<optimized out>, loc=0x7ffff7ad1040 <_nl_global_locale>) at ../stdlib/strtol_l.c:298
#1  0x00007ffff77519e0 in atoi (nptr=<optimized out>) at atoi.c:28
#2  0x0000000000400898 in bar (len=4, args=0x7fffffffe040) at test.cpp:7
#3  0x00000000004008db in foo (len=4, array=0x7fffffffe038) at test.cpp:13
#4  0x0000000000400934 in main (argc=4, argv=0x7fffffffe038) at test.cpp:19
(gdb) f 2
#2  0x0000000000400898 in bar (len=4, args=0x7fffffffe040) at test.cpp:7
7                       sum += std::atoi(args[i]);
(gdb) p i
$1 = 3
(gdb) quit
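
What the session tells us: atoi was called with a null pointer (nptr=0x0). In frame 2, len is 4 and i is 3, but args already points one entry past argv[0], so args[3] is argv[4], the null pointer that terminates argv. In other words, foo skips the program name but still passes the full length. One possible fix, as a sketch that leaves main unchanged, is to pass len - 1:

```cpp
#include <cassert>
#include <cstdlib>
#include <iostream>

// Sums the first len entries of args, each parsed as a decimal integer.
int bar(int len, char* args[]) {
  assert(len >= 0);
  int sum = 0;
  for (int i = 0; i < len; ++i)  // int index avoids the signed/unsigned comparison
    sum += std::atoi(args[i]);
  return sum;
}

// Prints the sum of all arguments after the program name (array[0]).
void foo(int len, char* array[]) {
  if (len > 1)
    std::cout << bar(len - 1, array + 1) << std::endl;  // one entry fewer after skipping array[0]
  else
    std::cout << 0 << std::endl;
}
```

With this change, ./test 123 456 789 should print 1368 instead of crashing.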

By default, printing objects works well for built-in types in GDB, but is often less helpful for STL data structures (e.g. std::string or std::vector) as you only see memory addresses, not the content of the container. Here is some information on how to change this for GDB. LLDB has a similar feature called synthetic children.

If your program behaves in a somehow "nondeterministic" or "mysterious" way, Valgrind is your friend. Valgrind's memcheck finds illegal accesses to memory, uninitialized reads and much more. The option --db-attach=yes starts the debugger when an error is found. Check out this blog post on the interaction between GDB and Valgrind.
Valgrind's Helgrind and DRD can help you find thread-related problems. This short blog post gives some helpful advice on how to detect the cause of a deadlock. A significantly faster and only marginally less thorough alternative to Valgrind's memcheck is AddressSanitizer.

Before making a commit in your SCM system, make sure your program is memchecked and passes all unit tests.

Integrated Development Environments (IDEs)

If you prefer an IDE over a setup with just an editor and a command line, Eclipse with CDT is a good (but heavyweight) cross-platform IDE. KDevelop is also a good choice for KDE users. Both have the advantage that you can easily import make-based projects and build your programs from within the IDE using make. C++ guru (and Microsoft employee) Herb Sutter recommends the free version of Microsoft Visual C++ for Windows users.

Testing

Unit tests help to improve the correctness of your code and prevent regressions. googletest is a great unit-testing framework for C++. Use it to write test cases for each class/algorithm that actually try to break it (this is easier if you write your unit tests before you implement the code itself). Include corner cases and try to find off-by-one errors. Use realistic parameters, e.g. dozens of threads and millions of elements in your data structures.
bcov is a code coverage analysis tool that tells you how much of your code is covered by your unit tests. Using a code coverage tool is probably a case of using a sledgehammer to crack a nut for your (smallish) project, but I found it worth mentioning... especially since my boss wrote it.
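
To make the advice on corner cases and off-by-one errors concrete, here is a small sketch. To keep it self-contained it uses plain assert instead of googletest; with googletest, each group of checks would typically become a TEST with EXPECT_EQ. The function lowerBound and the test function are hypothetical examples, not part of any assignment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical function under test: index of the first element >= key in a
// sorted vector, or values.size() if there is no such element.
std::size_t lowerBound(const std::vector<int>& values, int key) {
  std::size_t lo = 0, hi = values.size();  // search the half-open range [lo, hi)
  while (lo < hi) {
    std::size_t mid = lo + (hi - lo) / 2;
    if (values[mid] < key)
      lo = mid + 1;
    else
      hi = mid;
  }
  return lo;
}

// Checks that try to break lowerBound. Returns the number of checks run.
int testLowerBoundCornerCases() {
  int checks = 0;
  std::vector<int> empty;
  assert(lowerBound(empty, 5) == 0); ++checks;  // empty input
  std::vector<int> one = {5};
  assert(lowerBound(one, 4) == 0); ++checks;    // key below the only element
  assert(lowerBound(one, 5) == 0); ++checks;    // key equal to the only element
  assert(lowerBound(one, 6) == 1); ++checks;    // key above the only element
  std::vector<int> many = {1, 3, 3, 7};
  assert(lowerBound(many, 1) == 0); ++checks;   // first element
  assert(lowerBound(many, 3) == 1); ++checks;   // duplicates: first occurrence
  assert(lowerBound(many, 7) == 3); ++checks;   // last element
  assert(lowerBound(many, 8) == 4); ++checks;   // past the end
  return checks;
}
```

The interesting cases sit at the boundaries (empty input, first and last element, one past the end), which is exactly where off-by-one errors tend to hide.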

Profiling

Profilers help you to understand the performance of your program (and the environment it is running in). As profiling is probably not required for assignments/projects (but may help!), I keep this section short. To put it bluntly: Avoid oprofile (and gprof), prefer perf. It is easy to use perf and it helps you understand where you spend your CPU cycles, how many cache misses you produce and much more.
If you qualify for a student license, you can get Intel® VTune™ for free. Check it out, it's complex but amazing.

"I already knew all this (and much more) ... "

Great! Talk to me about a student job.