I'm trying to understand the integrate_functor in particles_kernel.cu from the CUDA SDK examples:
struct integrate_functor
{
    float deltaTime;

    // constructor for functor
    // ...

    template <typename Tuple>
    __device__
    void operator()(Tuple t)
    {
        volatile float4 posData = thrust::get<2>(t);
        volatile float4 velData = thrust::get<3>(t);
        float3 pos = make_float3(posData.x, posData.y, posData.z);
        float3 vel = make_float3(velData.x, velData.y, velData.z);

        // update position and velocity
        // ...

        // store new position and velocity
        thrust::get<0>(t) = make_float4(pos, posData.w);
        thrust::get<1>(t) = make_float4(vel, velData.w);
    }
};
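For context, the sample applies this functor with thrust::for_each over a zip iterator. A minimal sketch of that call follows; the wrapper and pointer names are my own, and I'm assuming the elided constructor takes deltaTime:

#include <thrust/for_each.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/device_ptr.h>

// newPos/newVel are written (get<0>, get<1>); oldPos/oldVel are read (get<2>, get<3>)
void integrateSystem(float4 *newPos, float4 *newVel, float4 *oldPos, float4 *oldVel,
                     float deltaTime, unsigned int numParticles)
{
    thrust::device_ptr<float4> d_newPos(newPos), d_newVel(newVel);
    thrust::device_ptr<float4> d_oldPos(oldPos), d_oldVel(oldVel);

    thrust::for_each(
        thrust::make_zip_iterator(thrust::make_tuple(d_newPos, d_newVel, d_oldPos, d_oldVel)),
        thrust::make_zip_iterator(thrust::make_tuple(d_newPos + numParticles,
                                                     d_newVel + numParticles,
                                                     d_oldPos + numParticles,
                                                     d_oldVel + numParticles)),
        integrate_functor(deltaTime));
}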
We call make_float4(pos, age), but make_float4 is defined in vector_functions.h as
static __inline__ __host__ __device__ float4 make_float4(float x, float y, float z, float w)
{
    float4 t; t.x = x; t.y = y; t.z = z; t.w = w; return t;
}
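Presumably the overload taking a float3 comes from the samples' math helper header (cutil_math.h in older SDKs, helper_math.h in newer toolkits), along the lines of:

inline __host__ __device__ float4 make_float4(float3 a, float w)
{
    // forwards to the four-scalar version above
    return make_float4(a.x, a.y, a.z, w);
}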
Are CUDA vector types (float3 and float4) more efficient for the GPU, and how does the compiler know how to overload the function make_float4?
I'm expanding njuffa's comment into a worked example. In that example, I'm simply adding two arrays in three different ways: loading the data as float, float2, or float4. Timing the three versions on a GT540M and on a Kepler K20c card shows that loading the data as float4 is the fastest approach.

I then disassembled the three kernels, add_float, add_float2, and add_float4, compiling for compute capability 2.1.
As mentioned by njuffa, different load instructions are used for the three cases: LD, LD.64, and LD.128, respectively (the SASS listings can be reproduced by running cuobjdump -sass on the compiled executable).

Finally, the code:
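What follows is a minimal sketch of the three kernels rather than a full benchmark; the allocation, launch, and timing scaffolding is omitted, and N is assumed to be a multiple of 4.

#include <cuda_runtime.h>

// One float per thread: each element load compiles to a 32-bit LD.
__global__ void add_float(const float *a, const float *b, float *c, int N)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}

// One float2 per thread: each element load compiles to a 64-bit LD.64.
__global__ void add_float2(const float2 *a, const float2 *b, float2 *c, int N)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N / 2) {
        float2 va = a[tid];
        float2 vb = b[tid];
        c[tid] = make_float2(va.x + vb.x, va.y + vb.y);
    }
}

// One float4 per thread: each element load compiles to a 128-bit LD.128.
__global__ void add_float4(const float4 *a, const float4 *b, float4 *c, int N)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N / 4) {
        float4 va = a[tid];
        float4 vb = b[tid];
        c[tid] = make_float4(va.x + vb.x, va.y + vb.y, va.z + vb.z, va.w + vb.w);
    }
}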