Small array stored like variables in a kernel?

Posted 2019-09-14 10:51

Question:

In my OpenCL kernel I need what would normally be a small array of 4 entries. Because I'm concerned about how that array would be stored (probably in a much slower kind of memory than regular variables), I'm instead using 4 separate variables and a switch-case statement to access the correct one based on an index.

Is there a way to make a small array of 4 x float4 work as fast and seamlessly as 4 separate float4 variables?

Here's what I'm trying to do: my kernel generates a single float4 variable v by going through a list of operations and applying them to v one after another. The list can also contain brackets/parentheses which, just like in arithmetic, isolate a group of operations so that they are evaluated on their own before the result of that bracket is brought back in with the rest.

So if a bracket is being opened then I should temporarily store the value of v into let's say v0 (to represent the current value at the bracket depth of 0), then v can be reset to 0 and perform the operations inside the bracket, and if there's yet another bracket inside that bracket I'd put v into v1 and so on with v2 and v3 as we go deeper into nested brackets. This is so that I can for instance apply a multiplication inside a bracket that would only affect the other things created inside that bracket and not the rest.

Once a bracket closes, I retrieve e.g. v3 and add v to it. In the end all brackets are closed, and v holds the final value of the series of operations and is written to a global buffer. This is doable using switch-case statements to select the correct variable according to the current bracket depth, but it is quite absurd, as this is exactly what arrays are for. So I'm not sure what the best thing to do is.
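
The bracket handling described above, with four separate variables selected by a switch on the depth, could look roughly like the sketch below. It is plain C rather than OpenCL C so that it can be compiled and run on the host. The float4 struct stands in for OpenCL's built-in type, and the op_t encoding, the helper functions and the name eval_switch are all hypothetical, since the question does not show the actual format of the operation list.

```c
#include <assert.h>

/* Host-side stand-in for OpenCL's built-in float4 (assumption for this sketch). */
typedef struct { float x, y, z, w; } float4;

/* Hypothetical operation encoding; the real list format is not shown. */
typedef enum { OP_ADD, OP_MUL, OP_OPEN, OP_CLOSE } op_type;
typedef struct { op_type type; float value; } op_t;

static float4 splat(float s) { float4 r = { s, s, s, s }; return r; }

static float4 add4(float4 a, float4 b)
{
    float4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

static float4 scale4(float4 a, float s)
{
    float4 r = { a.x * s, a.y * s, a.z * s, a.w * s };
    return r;
}

/* Evaluates the list using four separate save slots picked by switch-case.
   Assumes balanced brackets nested at most 4 deep. */
static float4 eval_switch(const op_t *ops, int n)
{
    float4 v = splat(0.0f);
    float4 v0 = v, v1 = v, v2 = v, v3 = v;
    int depth = 0;

    for (int i = 0; i < n; i++) {
        switch (ops[i].type) {
        case OP_ADD: v = add4(v, splat(ops[i].value)); break;
        case OP_MUL: v = scale4(v, ops[i].value); break;
        case OP_OPEN:                  /* save v at the current depth, restart from 0 */
            assert(depth < 4);
            switch (depth) {
            case 0:  v0 = v; break;
            case 1:  v1 = v; break;
            case 2:  v2 = v; break;
            default: v3 = v; break;
            }
            depth++;
            v = splat(0.0f);
            break;
        case OP_CLOSE:                 /* add the bracket's result to the saved value */
            assert(depth > 0);
            depth--;
            switch (depth) {
            case 0:  v = add4(v0, v); break;
            case 1:  v = add4(v1, v); break;
            case 2:  v = add4(v2, v); break;
            default: v = add4(v3, v); break;
            }
            break;
        }
    }
    return v;
}
```

For the list "add 1, open, add 2, multiply by 3, close", every component of the result is 1 + (0 + 2) * 3 = 7.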

Answer 1:

From what I've seen, compilers will usually put small arrays declared in the private address space directly in registers. This is not guaranteed, of course, and several factors probably influence whether that optimization is applied, such as:

  • Array size;
  • Register pressure;
  • Cost of spilling;
  • And others.

As is usual with optimizations, the only way to be sure is to verify what the compiler is doing by checking the generated assembly.

Quoting the question: "So if a bracket is being opened then I should temporarily store the value of v into let's say v0 (to represent the current value at the bracket depth of 0), then v can be reset to 0 and perform the operations inside the bracket, and if there's yet another bracket inside that bracket I'd put v into v1 and so on with v2 and v3 as we go deeper into nested brackets. This is so that I can for instance apply a multiplication inside a bracket that would only affect the other things created inside that bracket and not the rest."

I don't think that would help. The compiler optimizes across scopes anyway. Just do the straightforward thing and let the optimizer do its job. Then, if you notice suboptimal codegen, you may start thinking about an alternate solution, but not before.
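
Doing the straightforward thing would mean keeping the saved values in one small array indexed by the bracket depth. Here is a minimal sketch of that, written in plain C so it can be run on the host. The float4 struct, the op_t encoding, the helpers, MAX_DEPTH and the name eval_array are all assumptions made for illustration, not part of the original kernel. In an actual kernel, an array declared inside the kernel function without an address space qualifier lives in the private address space; whether it ends up in registers is still up to the compiler.

```c
#include <assert.h>

/* Host-side stand-in for OpenCL's built-in float4 (assumption for this sketch). */
typedef struct { float x, y, z, w; } float4;

/* Hypothetical operation encoding; the real list format is not shown. */
typedef enum { OP_ADD, OP_MUL, OP_OPEN, OP_CLOSE } op_type;
typedef struct { op_type type; float value; } op_t;

#define MAX_DEPTH 4   /* deepest bracket nesting supported, as in the question */

static float4 splat(float s) { float4 r = { s, s, s, s }; return r; }

static float4 add4(float4 a, float4 b)
{
    float4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

static float4 scale4(float4 a, float s)
{
    float4 r = { a.x * s, a.y * s, a.z * s, a.w * s };
    return r;
}

/* Evaluates the list using a small array as a stack of saved values.
   Assumes balanced brackets nested at most MAX_DEPTH deep. */
static float4 eval_array(const op_t *ops, int n)
{
    float4 v = splat(0.0f);
    float4 saved[MAX_DEPTH];   /* would be a private array in a kernel */
    int depth = 0;

    for (int i = 0; i < n; i++) {
        switch (ops[i].type) {
        case OP_ADD: v = add4(v, splat(ops[i].value)); break;
        case OP_MUL: v = scale4(v, ops[i].value); break;
        case OP_OPEN:                  /* push v, restart from 0 */
            assert(depth < MAX_DEPTH);
            saved[depth++] = v;
            v = splat(0.0f);
            break;
        case OP_CLOSE:                 /* pop and add the bracket's result */
            assert(depth > 0);
            v = add4(saved[--depth], v);
            break;
        }
    }
    return v;
}
```

With two nested brackets, "add 1, open, add 2, open, add 3, multiply by 2, close, multiply by 10, close" evaluates to 1 + (2 + (3 * 2)) * 10 = 81 in every component.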



Tags: opencl