I'm currently developing an OpenCL-application for a very heterogeneous set of computers (using JavaCL to be specific). In order to maximize performance I want to use a GPU if it's available otherwise I want to fall back to the CPU and use SIMD-instructions. My plan is to implement the OpenCL-code using vector-types because my understanding is that this allows CPUs to vectorize the instructions and use SIMD-instructions.
My question however is regarding which OpenCL-implementation to use. E.g. if the computer has a Nvidia GPU I assume it's best to use Nvidia's library but if no GPU is available I want to use Intel's library to use the SIMD-instructions.
How do I achieve this? Is this handled automatically or do I have to include all libraries and implement some logic to pick the right one? It feels like this is a problem that more people than I are facing.
Update After testing the different OpenCL-drivers this is my experience so far:
Intel: crashed the JVM when JavaCL tried to call it. After a restart it didn't crash the JVM but it also didn't return any usable devices (I was using an Intel I7-CPU). When I compiled the OpenCL-code offline it seemed to be able to do some auto-vectorization so Intel's compiler seems quite nice.
Nvidia: Refused to install their WHQL-drivers because it claimed I didn't have Nvidia-card (that computer has a Geforce GT 330M). When I tried it on a different computer I managed to get all the way to create a kernel but at the first execution it crashed the drivers (the screen flickered for a while and Windows 7 said it had to restart the drivers). The second execution caused a bluee-screen of death.
AMD/ATI: Refused to install 32-bit SDK (I tried that since I will be using a 32-bit JVM) but 64-bit SDK worked well. This is the only driver which I've managed to execute the code on (after a restart because at first it gave a cryptic error-message when compiling). However it doesn't seem to be able to do any implicit vectorization and since I don't have any ATI GPU I didn't get any performance increase compared to the Java-implementation. If I use vector-types I might see some improvements though.
TL;DR None of the drivers seem ready for commercial use. I'm probably better of creating JNI-module with C-code compiled to use SSE-instructions.