OpenGL 4.0 GPU Draw Feature?

Posted 2019-05-27 01:43

Question:

In the descriptions of OpenGL 4.0 on Wikipedia and other sources, I read about this feature:

Drawing of data generated by OpenGL or external APIs such as OpenCL, without CPU intervention.

What is this referring to?

Edit:

It seems this must be referring to Draw_Indirect, which I believe somehow extends the draw phase to take feedback from shader programs or from interop programs (basically OpenCL/CUDA). There appear to be a few caveats and tricks to keeping the calls on the GPU for any extended amount of time past the second run, but it should be possible.

If anyone can provide more information on issuing draw commands without the CPU, or can describe draw indirect better, please feel free to do so. It would be greatly appreciated.

Answer 1:

I believe you may be referring to the GL_ARB_draw_indirect functionality, which allows OpenGL to source the DrawArrays or DrawElements parameters from a GPU buffer object that can be filled by OpenGL or OpenCL.

If I'm not mistaken, it's included in core OpenGL 4.



Answer 2:

I haven't figured out how OpenGL 4.0 in particular makes this feature work, since as far as I understand it existed before as well. I'm not sure this answers your question, but I'll share what I know about the subject anyway.

It refers to a situation where some library other than OpenGL, such as OpenCL or CUDA, produces data directly in the memory of the graphics card, and OpenGL then picks up where the other library left off and uses that data as one of the following:

  • a pixel buffer object (PBO), when you want to draw the data to the screen as it is
  • a texture, when you want to use the graphics data as part of some other scene
  • a vertex buffer object (VBO), when you want to use the produced data as arbitrary attribute input for a vertex shader (one example is a particle system that is simulated with CUDA and rendered with OpenGL)

In a situation like this, it's a very good idea to keep the data on the graphics card the whole time and not copy it around, and especially not to copy it through the CPU, because the PCIe bus is very slow compared to the memory bus of the graphics card.

Here's some sample code to do the trick with CUDA and OpenGL for VBOs and PBOs:

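// requires <cuda_gl_interop.h>; these are the older (now deprecated) CUDA runtime interop calls

// in the beginning
GLuint id;
glGenBuffers(1, &id);  // count first, then the output array
// (allocate the buffer's storage with glBufferData before registering it)

// for every frame
cudaGLRegisterBufferObject(id);
void* ptr;  // device pointer to the buffer's memory
cudaGLMapBufferObject(&ptr, id);
// <launch kernel here, writing through ptr>
cudaGLUnmapBufferObject(id);
// <now use the buffer "id" with OpenGL>
cudaGLUnregisterBufferObject(id);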

And here's how you can load the data into a texture:

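// with a buffer bound to GL_PIXEL_UNPACK_BUFFER, the last argument of
// glTexSubImage2D is a byte offset into that buffer, not a CPU pointer
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, id);
glBindTexture(GL_TEXTURE_2D, your_tex_id);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 256, 256, GL_RGBA, GL_UNSIGNED_BYTE, 0);

Also note that if you use a more unusual format instead of GL_RGBA, the upload might be slower because all the values have to be converted.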
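As a small aside on the upload above: the buffer has to be large enough for the region being read. The sketch below computes that size for tightly packed GL_RGBA / GL_UNSIGNED_BYTE data; the helper name rgba8_image_bytes is made up for illustration and is not part of OpenGL or CUDA.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes needed in the pixel unpack buffer for a
   tightly packed GL_RGBA / GL_UNSIGNED_BYTE image, which stores four
   one-byte channels per texel. */
static size_t rgba8_image_bytes(size_t width, size_t height)
{
    const size_t bytes_per_texel = 4;
    assert(width > 0 && height > 0);  /* an empty region needs no upload */
    return width * height * bytes_per_texel;
}
```

For the 256 x 256 upload shown above this comes to 262144 bytes, so the buffer that CUDA writes into should be allocated with at least that much storage.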

I don't know OpenCL but the idea is the same. Only function names are different.

Another way to do the same thing is what is called host pinned memory. In that approach you map a range of CPU memory into the address space of the graphics card.



Answer 3:

To understand what this feature is, you must understand how things worked before.

Pre 4.0, OpenCL could fill OpenGL buffer objects with data. Indeed, regular OpenGL commands could fill OpenGL buffer objects with data, either with transform feedback or by rendering to a buffer texture. This data could be vertex data to be used for rendering.

Only the CPU can initiate the rendering of vertex data (by calling one of the glDraw* functions). Even so, there isn't a need for explicit synchronization here (outside of whatever OpenCL/OpenGL interop requires). Specifically, the CPU doesn't have to read data written by GPU operations.

But this leads to a problem. If OpenCL, or whatever GPU operation, always writes a known number of vertices to the buffer, then everything is fine. However, this does not have to be the case. It is often desirable for a GPU process to write an arbitrary number of vertices. Obviously there needs to be a maximum limit (the size of the buffer). But other than that, you want it to be able to write whatever it wants.

The problem is that OpenCL decided how many to write, but the CPU now needs that number in order to call one of the glDraw* functions. If OpenCL wrote 22,000 vertices, then the CPU needs to pass 22,000 to glDrawArrays, which means reading that count back from the GPU first.

What ARB_draw_indirect (a core feature of GL 4.0) does is allow a GPU process to write values into a buffer object that represent the parameters you would pass to a glDraw* function. The only parameters not covered by this are the primitive type and, for indexed draws, the index type; the CPU still passes those directly.
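To make the buffer contents concrete, here is a minimal sketch of the command layout that glDrawArraysIndirect reads in OpenGL 4.0: four tightly packed 32-bit values. The struct mirrors the layout given in the specification, while make_draw_command is a hypothetical CPU-side stand-in for whatever GPU process (OpenCL kernel, transform feedback, and so on) would really write these values.

```c
#include <assert.h>
#include <stdint.h>

/* One glDrawArraysIndirect command as stored in the buffer bound to
   GL_DRAW_INDIRECT_BUFFER (OpenGL 4.0 layout). */
typedef struct {
    uint32_t count;              /* number of vertices to draw          */
    uint32_t primCount;          /* number of instances                 */
    uint32_t first;              /* index of the first vertex           */
    uint32_t reservedMustBeZero; /* reserved in 4.0 (later: baseInstance) */
} DrawArraysIndirectCommand;

static_assert(sizeof(DrawArraysIndirectCommand) == 16,
              "command must be four packed 32-bit values");

/* Hypothetical stand-in for the GPU-side producer: records how many
   vertices were generated, clamped to the vertex buffer's capacity. */
static DrawArraysIndirectCommand
make_draw_command(uint32_t vertices_written, uint32_t buffer_capacity)
{
    DrawArraysIndirectCommand cmd;
    cmd.count = vertices_written < buffer_capacity ? vertices_written
                                                   : buffer_capacity;
    cmd.primCount = 1;           /* a single, non-instanced draw */
    cmd.first = 0;
    cmd.reservedMustBeZero = 0;
    return cmd;
}
```

With such a command sitting at offset 0 of the bound indirect buffer, the CPU-side call would be along the lines of glDrawArraysIndirect(GL_TRIANGLES, 0). The vertex count never appears in CPU code.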

Note that the CPU still controls when the rendering happens. The CPU still decides what buffers vertex data are pulled from. So OpenCL can write several of these glDraw* commands, but until the CPU actually calls glDrawElementsIndirect for one of them, nothing actually gets rendered.

So what you can do is run an OpenCL process that writes some data to existing buffer objects. Then you bind those buffers using the usual vertex setup, such as a VAO. The OpenCL process also writes the appropriate rendering command data to other buffer objects, which you bind as indirect buffers. Then you use glDraw*Indirect to render these commands.
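When several commands are stored back to back in one indirect buffer, the CPU selects one by passing its byte offset as the last argument of glDrawElementsIndirect. The following is a rough sketch of that offset calculation, assuming the commands are tightly packed; the struct follows the OpenGL 4.0 layout, and indirect_offset is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One glDrawElementsIndirect command (OpenGL 4.0 layout). */
typedef struct {
    uint32_t count;              /* number of indices to draw               */
    uint32_t primCount;          /* number of instances                     */
    uint32_t firstIndex;         /* first index, in units of the index type */
    int32_t  baseVertex;         /* value added to every fetched index      */
    uint32_t reservedMustBeZero; /* reserved in 4.0 (later: baseInstance)   */
} DrawElementsIndirectCommand;

static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "command must be five packed 32-bit values");

/* Hypothetical helper: byte offset of the n-th command in a tightly
   packed array of commands inside the indirect buffer. */
static size_t indirect_offset(size_t command_index)
{
    return command_index * sizeof(DrawElementsIndirectCommand);
}
```

Rendering the fourth command would then look something like glDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, (const void *)indirect_offset(3)), with the command buffer bound to GL_DRAW_INDIRECT_BUFFER. The primitive type and index type shown here are only example choices.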

At no time does the CPU have to read data back from the GPU.