Copying global on-device pointer address back and

I created a buffer on the OpenCL device (a GPU), and from the host I need to know the global on-device pointer address so that I can put that on-device address in another buffer so that the kernel can then read from that buffer that contains the address of the first buffer so that then it can access the contents of that buffer.

If that's confusing here's what I'm trying to do: I create a generic floats-containing buffer representing a 2D image, then from the host I create a todo list of all the things my kernel needs to draw, which lines, which circles, which images... So from that list the kernel has to know where to find that image, but the reference to that image cannot be passed as a kernel argument, because that kernel might draw no image, or a thousand different images, all depending on what the list says, so it has to be referenced in that buffer that serves as a todo list for my kernel.

The awkward way I've done it so far:

To do so I tried making a function that calls a kernel after the creation of the image buffer that gets the buffer and returns the global on-device address as a ulong in another buffer, then the host stores that value in a 64-bit integer, like this:

uint64_t get_clmem_device_address(clctx_t *clctx, cl_mem buf)
{
    const char kernel_source[] =
"kernel void get_global_ptr_address(global void *ptr, global ulong *devaddr)        \n"
"{                                          \n"
"   *devaddr = (ulong) ptr;                             \n"
"}                                          \n";

    int32_t i;
    cl_int ret;
    static int init=1;
    static cl_program program;
    static cl_kernel kernel;
    size_t global_work_size[1];
    static cl_mem ret_buffer;
    uint64_t devaddr;

    if (init)
    {
        init=0;
        ret = build_cl_program(clctx, &program, kernel_source);
        ret = create_cl_kernel(clctx, program, &kernel, "get_global_ptr_address");
        ret_buffer = clCreateBuffer(clctx->context, CL_MEM_WRITE_ONLY, 1*sizeof(uint64_t), NULL, &ret);
    }
    if (kernel==NULL)
        return ;

    // Run the kernel
    ret = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    ret = clSetKernelArg(kernel, 1, sizeof(cl_mem), &ret_buffer);

    global_work_size[0] = 1;
    ret = clEnqueueNDRangeKernel(clctx->command_queue, kernel, 1, NULL, global_work_size, NULL, 0, NULL, NULL);         // enqueue the kernel
    ret = clEnqueueReadBuffer(clctx->command_queue, ret_buffer, CL_FALSE, 0, 1*sizeof(uint64_t), &devaddr, 0, NULL, NULL);      // copy the value
    ret = clFlush(clctx->command_queue);
    clFinish(clctx->command_queue);

    return devaddr;
}

Apparently this works (it does return a number, although it's hard to know if it's correct), but then I put this devaddr (a 64-bit integer on the host) in the todo list buffer that the kernel uses to know what to do, and then if necessary (according to the list) the kernel calls the function below, le here being a pointer to the relevant entry in the todo list, and the 64-bit address being the first element:

float4 blit_sprite(global uint *le, float4 pv)
{
    const int2 p = (int2) (get_global_id(0), get_global_id(1));
    ulong devaddr;
    global float4 *im;
    int2 im_dim;

    devaddr = ((global ulong *) le)[0];     // global address for the start of the image as a ulong
    im_dim.x = le[2];
    im_dim.y = le[3];

    im = (global float4 *) devaddr;     // ulong is turned into a proper global pointer

    if (p.x < im_dim.x)
        if (p.y < im_dim.y)
            pv += im[p.y * im_dim.x + p.x];     // this gives me a CL_OUT_OF_RESOURCES error, even when changing it to im[0]

    return pv;
}

~~but big surprise this doesn't work, it gives me a CL_OUT_OF_RESOURCES error, which I assume means my im pointer isn't valid.~~ Actually it works, it didn't work when I used two different contexts. But it's still pretty unwieldy.

Is there a less weird way to do what I want to do?

OpenCL standard doesn't guarantee that memory objects will not be physically reallocated between kernel calls. So, original Device-side address is valid only within single kernel NDRange. That's one of the reasons why OpenCL memory objects are represented on Host side as transparent structure pointers.

Though, you can save offset to memory object's first byte in 1st kernel and pass it to 2nd kernel. Every time you launch your kernel, you will obtain actual Device-side address within your kernel & increment it by saved shift value. That would be perfectly "legal".