In my OpenCL kernel I need to use what should normally be a small array of 4 entries, but because of my concerns over how that array would be stored (probably in a much slower kind of memory than regular variables) I'm instead using 4 separate variables and a switch-case statement to access the correct one based on an index.
Is there a way to make a small array of 4 x float4 work as fast and seamlessly as 4 separate float4 variables?
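For reference, the switch-based access mentioned above could look roughly like the sketch below. The names `saved_t`, `save_at` and `load_at` are made up for illustration, and the four values are grouped in a struct only to keep the sketch short. The `#ifndef` shim is there solely so the code also compiles as plain C on the host with GCC or Clang.

```c
#ifndef __OPENCL_VERSION__
/* Host-only shim (GCC/Clang vector extension) so the sketch also builds as plain C. */
typedef float float4 __attribute__((vector_size(16)));
#endif

/* Hypothetical holder for the four saved values, one per bracket depth. */
typedef struct { float4 v0, v1, v2, v3; } saved_t;

/* Store v into the variable that belongs to the given bracket depth.
   Any depth other than 0, 1 or 2 falls through to v3. */
void save_at(saved_t *s, int depth, float4 v)
{
    switch (depth) {
    case 0:  s->v0 = v; break;
    case 1:  s->v1 = v; break;
    case 2:  s->v2 = v; break;
    default: s->v3 = v; break;
    }
}

/* Read back the value saved for the given bracket depth. */
float4 load_at(const saved_t *s, int depth)
{
    switch (depth) {
    case 0:  return s->v0;
    case 1:  return s->v1;
    case 2:  return s->v2;
    default: return s->v3;
    }
}
```

Every access pays for a branch on `depth`, which is the part an indexed array would express directly.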
Here's what I'm trying to do: my kernel generates a single float4 variable v by going through a list of operations to apply to v. It runs sequentially, applying one operation after another from the list to v. However, the list can contain brackets/parentheses which, just like in arithmetic, isolate a group of operations so that they are evaluated on their own before the result of the bracket is brought back in with the rest.
So if a bracket is being opened then I should temporarily store the value of v into, let's say, v0 (to represent the current value at the bracket depth of 0). Then v can be reset to 0 and the operations inside the bracket performed on it. If there's yet another bracket inside that bracket I'd put v into v1, and so on with v2 and v3 as we go deeper into nested brackets. This is so that I can, for instance, apply a multiplication inside a bracket that would only affect the other things created inside that bracket and not the rest.
And once a bracket closes I would retrieve e.g. v3 and add v to it. In the end all brackets would close, and v would represent the final desired value of the series of operations and be written to a global buffer. This is doable using switch-case statements to select the correct variable according to the current bracket depth, but that is quite absurd, as this is exactly what arrays are for. So I'm not sure what the best thing to do is.
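To make the description concrete, here is a minimal sketch of the same bracket logic written with a private array indexed by depth. The opcodes, the `op_t` record, `eval_ops` and the `apply_ops` kernel are all assumptions made up for illustration, since the real operation list format isn't shown. The `GLOBAL`/`SPLAT4` shim only exists so the same code also compiles as plain C on the host with GCC or Clang. The sketch says nothing about speed: whether `saved[]` stays in registers is up to the compiler, and a dynamically indexed private array may end up in slower memory.

```c
#ifdef __OPENCL_VERSION__
#define GLOBAL __global
#define SPLAT4(x) ((float4)(x))
#else
/* Host-only shim (GCC/Clang vector extension) so the sketch also builds as plain C. */
typedef float float4 __attribute__((vector_size(16)));
#define GLOBAL
#define SPLAT4(x) ((float4){(x), (x), (x), (x)})
#endif

#define MAX_DEPTH 4

/* Hypothetical opcodes and operation record. */
enum { OP_ADD, OP_MUL, OP_OPEN, OP_CLOSE };
typedef struct { int code; float arg; } op_t;

/* Apply the operation list to v, saving v per bracket depth in a private array.
   Assumes well-formed input that never nests deeper than MAX_DEPTH; the depth
   checks only prevent out-of-bounds accesses. */
float4 eval_ops(GLOBAL const op_t *ops, int n)
{
    float4 saved[MAX_DEPTH];   /* one slot per bracket depth */
    float4 v = SPLAT4(0.0f);
    int depth = 0;

    for (int i = 0; i < n; i++) {
        switch (ops[i].code) {
        case OP_ADD: v += ops[i].arg; break;
        case OP_MUL: v *= ops[i].arg; break;
        case OP_OPEN:                      /* save v, start the bracket from 0 */
            if (depth < MAX_DEPTH) {
                saved[depth++] = v;
                v = SPLAT4(0.0f);
            }
            break;
        case OP_CLOSE:                     /* add the bracket result to the saved value */
            if (depth > 0)
                v = saved[--depth] + v;
            break;
        }
    }
    return v;
}

#ifdef __OPENCL_VERSION__
/* Hypothetical kernel wrapper: writes the result to a global buffer. */
__kernel void apply_ops(__global const op_t *ops, int n, __global float4 *out)
{
    out[get_global_id(0)] = eval_ops(ops, n);
}
#endif
```

For the list "add 1, open, add 2, multiply by 3, close", this computes 1 + (0 + 2) * 3, so every component of v ends up as 7.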