Intel's Threading Building Blocks (TBB) open source library looks really interesting. Even though there's an O'Reilly book on the subject, I don't hear about many people using it. I'm interested in using it for some multi-level parallel applications (MPI + threads) in Unix (Mac, Linux, etc.) environments. For what it's worth, I'm interested in high performance computing / numerical methods kinds of applications.
Does anyone have experience with TBB? Does it work well? Is it fairly portable (including GCC and other compilers)? Does the paradigm work well for programs you've written? Are there other libraries I should look into?
I've introduced it into our code base because we needed a better malloc when we moved to a 16-core machine. With 8 cores and under it wasn't a significant issue. It has worked well for us. We plan on using the fine-grained concurrent containers next. Ideally we can make use of the real meat of the product, but that requires rethinking how we build our code. I really like the ideas in TBB, but it's not easy to retrofit onto an existing code base.
You can't think of TBB as just another threading library. It has a whole new model that sits on top of threads and abstracts them away. You learn to think in terms of tasks, parallel_for-style operations, and pipelines. If I were to build a new project I would probably try to model it in this fashion.
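To make the "think in tasks, not threads" idea concrete, here is a minimal sketch that uses only the C++ standard library. It is not TBB's API or implementation: `split_for` and `scale_in_place` are hypothetical names invented for this illustration, and `std::async` stands in for a real task scheduler. In TBB itself, `tbb::parallel_for` over a `tbb::blocked_range` plays the corresponding role, with work stealing and adaptive chunking that this sketch does not attempt.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper (not part of TBB): apply body(i) to every i in
// [begin, end), recursively splitting the range into tasks until a chunk
// holds at most `grain` iterations.
template <typename Body>
void split_for(std::size_t begin, std::size_t end, std::size_t grain,
               const Body& body) {
    assert(grain > 0);
    if (begin >= end) return;  // empty range: nothing to do
    if (end - begin <= grain) {
        // Small enough: run this chunk serially in the current task.
        for (std::size_t i = begin; i < end; ++i) body(i);
        return;
    }
    const std::size_t mid = begin + (end - begin) / 2;
    // The left half becomes a separate task; the right half runs here.
    auto left = std::async(std::launch::async, [=, &body] {
        split_for(begin, mid, grain, body);
    });
    split_for(mid, end, grain, body);
    left.get();  // wait for the left half; rethrows if it threw
}

// Example use: scale every element of a vector in place.
// The caller describes the work per index and never touches a thread.
void scale_in_place(std::vector<double>& v, double factor) {
    split_for(0, v.size(), 1024,
              [&v, factor](std::size_t i) { v[i] *= factor; });
}
```

Calling `scale_in_place(v, 2.0)` doubles every element of `v`. The point is the shape of the code: you state the range and the per-element body, and the splitting decides how the work is distributed. Each iteration must be independent of the others for this to be safe.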
We work in Visual Studio and it works just fine. It was originally written for Linux/pthreads, so it runs just fine over there as well.
ZThread is LGPL, so if you are not working on an open source project you are limited to linking the library dynamically.
The open source version of Threading Building Blocks (TBB) (there is also a new commercial version for $299; I don't know the differences yet) is licensed under the GNU General Public License version 2 with a so-called “Runtime Exception” (that is specific to the use only on creating free software). I've seen other Runtime Exceptions that attempt to approach the LGPL while enabling commercial use and static linking; this is now the case. I'm only writing this because I took the chance to examine the libraries' licenses, and those should also be a consideration when selecting a library, depending on the use you intend for it.
Thanks, Jihn, for pointing out this update.
Portability
TBB is portable. It supports Intel and AMD (i.e. x86) processors, IBM PowerPC and POWER processors, ARM processors, and possibly others. If you look in the build directory, you can see all the configurations the build system supports, which include a wide range of operating systems (Linux, Windows, Android, MacOS, iOS, FreeBSD, AIX, etc.) and compilers (GCC, Intel, Clang/LLVM, IBM XL, etc.). I have not tried TBB with the PGI C++ compiler, and I know that it does not work with the Cray C++ compiler (as of 2017).
A few years ago, I was part of the effort to port TBB to IBM Blue Gene systems. Static linking was a challenge, but is now addressed by the big_iron.inc build system helper. The other issues were supporting relatively ancient versions of GCC (4.1 and 4.4) and ensuring the PowerPC atomics were working. I expect that porting to any currently unsupported architecture would be relatively straightforward on platforms that provide or are compatible with GCC and POSIX.
Usage in community codes
I am aware of at least two HPC application frameworks that use TBB: MOOSE and MADNESS.
I do not know how MOOSE uses TBB, but MADNESS uses TBB for its task queue and memory allocator.
Performance versus other threading models
I have personally used TBB in the Parallel Research Kernels project, within which I have compared TBB to OpenMP, OpenCL, Kokkos, RAJA, C++17 Parallel STL, and other models. See the C++ subdirectory for details.
The following figure shows the relative performance of the aforementioned models on an Intel Xeon Phi 7250 processor (the details aren't important - all models used the same settings). As you can see, TBB does quite well except for smaller problem sizes, where the overhead of adaptive scheduling is more relevant. TBB has tuning knobs that will affect these results.
Full disclosure: I work for Intel in a research/pathfinding capacity.
I use TBB in one project. It seemed easier to use than threads. You define tasks that can be run in parallel; a task is just a call to your parallelized subroutine. Load balancing is done automatically, which is why I consider it a higher-level parallelization library. I achieved a 2.5x speedup without much work on a 4-core Intel processor. There are examples, questions get answered on the forums, and it is maintained and free.
Have you looked at the Boost library and its thread API?
I'm not doing numerical computing but I work with data mining (think clustering and classification), and our workloads are probably similar: all the data is static and you have it at the beginning of the program. I briefly investigated Intel's TBB and found it overkill for my needs. After starting with raw pthread-based code, I switched to OpenMP and got the right mix of readability and performance.
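For a sense of why OpenMP reads well on static-data workloads like this, here is a minimal sketch of a reduction loop of the kind that shows up in clustering code. The function name `sum_squared_distance` and the one-dimensional data are assumptions made up for illustration, not anything from my actual code. It needs an OpenMP-enabled build (for example `-fopenmp` with GCC or Clang); without that flag the pragma is ignored and the loop simply runs serially with the same result.

```cpp
#include <vector>

// Hypothetical example: sum of squared distances from each value to a
// fixed center, as a within-cluster cost in 1-D clustering might use.
double sum_squared_distance(const std::vector<double>& data, double center) {
    double total = 0.0;
    // Signed loop index keeps the loop valid for older OpenMP versions.
    const long n = static_cast<long>(data.size());
    // Each thread accumulates a private partial sum; OpenMP combines the
    // partial sums into `total` when the loop finishes.
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        const double d = data[i] - center;
        total += d * d;
    }
    return total;
}
```

The serial loop and the parallel loop are the same code apart from one line, which is most of the readability argument. Note that with a parallel reduction the order of floating-point additions can differ from run to run, so results may vary in the last bits for general inputs.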