OpenGL ES 2.0 : Seeking VBO Performance/Optimisati

In my ongoing attempt to convert to OpenGL ES 2.0 from ES 1.x I'm currently converting some code to use Vertex Buffer Objects ('VBOs') rather than the existing unbuffered glDrawArrays calls.

I've set up the VBOs and got them working, but have hit a design dilemma and would be grateful for the advice of someone more experienced with OpenGL ES 2.0.

I want to draw a bunch of polygonal sprites which move often, because they're dynamic Box2D bodies, if you're familiar with Box2D. These polygon bodies are each drawn using GL_TRIANGLE_FAN, which is somewhat critical since GL_POLYGON is unavailable on ES 2.0.

The polygons have other attributes such as color which may be changed at some stage during the application lifecycle, but it's the vertex positions which are guaranteed to change almost every frame.

The polygons are grouped in-game, so it's my intention to manage and draw an interleaved vertex/colour array per-group rather than per-body, in an attempt to minimise GPU communication.

There are a number of routes to success here. I've been reading the OpenGL ES 2.0 Programming Guide to find as much info and as many optimisation tips as I can relating to VBOs, and here's what it says:

Interleaved data is favourable since "attribute data for each vertex can be read in sequential fashion".

The book recommends that "if a subset of vertex attribute data needs to be modified..one can.. store vertex attributes that are dynamic in nature in a separate buffer".

The recommendation is to "use GL_HALF_FLOAT_OES wherever possible", most notably for colors, since unprojected vertex locations may need more range or precision than a half-float offers.

glMapBufferOES should only be used if the whole buffer is being updated, and even in this case the operation "can still be expensive compared to glBufferData".

Here are my questions:

If using GL_TRIANGLE_FAN as the drawing mode, does this force me to store a VBO per-body rather than per-group? Or will a common vertex location to 'end' the fan and current body cause a new fan to be drawn for the next body in a group VBO?

Should I interleave all of my data despite vertex locations being updated at a high frequency, or should I separate all of it/just the locations into their own VBO?

Following the book advice above, presumably I should glBufferData my vertex locations in their entirety every time I update the render, rather than using glMapBufferOES or glBufferSubData to update the buffered locations?

Are there any unmentioned functions/design choices I should be utilising to enhance performance in a many-moving-polygons context?

Should I attempt to use GL_HALF_FLOAT_OES for color storage, i.e. in the space of 2 floats I instead store 4 half-float numbers? If the answer is 'yes', would I just use any GL type which is half the size of GLfloat for each colour, then bitwise OR them, then insert into the appropriate attribute array?

Once I've created X many VBOs, are the only calls I need to make for each render glBindBuffer, glBufferData, and glDrawElements/Arrays, or must I also call glEnableVertexAttribArray and glVertexAttribPointer each time I use glBufferData?

I'd be extremely grateful for further advice on this, thank you.

I don't have any ES experience, but I think many of the same things still apply.

Partly. It doesn't force you to use one VBO per body, but you do have to issue one glDrawArrays call per body. These calls can still source their data from the same buffer, but that is still not advisable. Instead, I would move away from complex primitives like triangle fans or strips and use indexed triangle lists; this way everything can be drawn in a single call. I doubt that ES supports the primitive_restart extension, which would let you specify a special vertex index that restarts the primitive.
If you have many other static attributes, it would be a good idea to separate the vertex positions into their own buffer (which of course has GL_DYNAMIC_DRAW or even GL_STREAM_DRAW usage). But if you only have an additional 4ub color or something like that, the extra copying cost might not be that bad and you might profit more from interleaving; this needs to be tested.
If you update them all every frame, then a complete glBufferData might be better than a glBufferSubData. Alternatively, if you don't want to hold a CPU array for your data, you can call glBufferData(..., NULL) and then glMapBuffer(..., GL_WRITE_ONLY). The glBufferData call with a null pointer tells the driver that you don't care about the previous data anymore. This way the driver can allocate a completely new buffer for you while the previous data is still being used for rendering, so you can upload the new data while the old is still in use (the old buffer is freed by the driver once it is no longer used).
place-holder
For colors, GL_UNSIGNED_BYTE might be even better, as colors usually don't need that high a precision. This might also play well with alignment optimizations: if you have, for example, 3 float coordinates and 4 byte-sized color channels, that makes a vertex of 16 bytes, which is very alignment friendly. In this case it might be advisable to keep vertices and colors in the same buffer.
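
To make the 16-byte arithmetic concrete, here is a minimal sketch of such an interleaved vertex in C. The names Vertex and to_unorm8, as well as the attribute locations posLoc and colLoc in the comment, are made up for illustration; only the GL constants and the glVertexAttribPointer signature are real.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved layout: 3 floats (12 bytes) + 4 color bytes. */
typedef struct {
    float x, y, z;              /* position */
    unsigned char r, g, b, a;   /* color, one unsigned byte per channel */
} Vertex;

/* Holds on platforms with 4-byte floats, which is the usual case. */
static_assert(sizeof(Vertex) == 16, "expected a 16-byte vertex");

/* Hypothetical helper: convert a color channel in [0, 1] to an unsigned
 * byte, clamping out-of-range input (NaN maps to 0). */
unsigned char to_unorm8(float c)
{
    if (!(c > 0.0f)) return 0;
    if (c >= 1.0f) return 255;
    return (unsigned char)(c * 255.0f + 0.5f);
}

/* With the VBO bound, the two attributes could then be described like
 * this (posLoc and colLoc are assumed attribute locations):
 *
 *   glVertexAttribPointer(posLoc, 3, GL_FLOAT, GL_FALSE,
 *                         sizeof(Vertex), (const void *)offsetof(Vertex, x));
 *   glVertexAttribPointer(colLoc, 4, GL_UNSIGNED_BYTE, GL_TRUE,
 *                         sizeof(Vertex), (const void *)offsetof(Vertex, r));
 *
 * GL_TRUE for "normalized" makes the shader see each byte as a value in
 * the range 0.0 to 1.0.
 */
```

Because the color bytes are normalized on the way into the shader, no bit-packing is needed on the application side; the four channels are simply stored as four consecutive bytes.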

EDIT: To clarify point 3 a bit: If you have your data in a CPU array anyway, then you can just call glBufferData with your data. If you want the driver to allocate that storage for you, you can use glMapBuffer, which gives you a pointer to the buffer memory mapped into CPU address space (and of course you use GL_WRITE_ONLY, as you don't care about the previous data). In this case, glBufferData with a null pointer allocates completely new storage for the buffer (without copying any data), which tells the driver that we don't care about the previous contents (even though they might currently still be in use for rendering). The driver can optimize this case and allocate new storage under the hood without freeing the previous storage yet; that is freed once the previous data is no longer used for rendering. Keep in mind that you don't create another buffer object; all of this happens under the hood of the driver. So when you want to update the whole buffer, you can either do

updateData(dataArray);
glBufferData(GL_ARRAY_BUFFER, size, dataArray, usage);

if you have the data in your own CPU array anyway, or

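/* Orphan the old storage: NULL data means the previous contents are discarded. */
glBufferData(GL_ARRAY_BUFFER, size, NULL, usage);
/* On ES 2.0, mapping comes from the OES_mapbuffer extension:
   glMapBufferOES / glUnmapBufferOES with GL_WRITE_ONLY_OES. */
dataArray = glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
updateData(dataArray);
glUnmapBuffer(GL_ARRAY_BUFFER);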

if you don't need a CPU copy of your data and want the driver to take care of the storage. But if you update the data step-wise over the whole run of the application, then the first solution might be better, since you cannot use a buffer for rendering while it is mapped, and you should only map a buffer for a short time.
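
Coming back to the first point (indexed triangle lists instead of fans), here is a rough sketch of how the index data for a whole group could be built. It assumes each body is a convex polygon whose vertices sit contiguously in the group's vertex array; append_fan_as_triangles is a hypothetical helper, not part of OpenGL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes triangle-list indices for one polygon that
 * would otherwise be drawn as a fan of `vertex_count` vertices starting at
 * `base_vertex` in the shared vertex array. Indices are written to
 * `indices` starting at `index_offset`.
 * Returns the number of indices written, 3 * (vertex_count - 2). */
size_t append_fan_as_triangles(unsigned short *indices, size_t index_offset,
                               unsigned short base_vertex,
                               unsigned short vertex_count)
{
    size_t written = 0;
    unsigned short i;

    assert(vertex_count >= 3);

    /* A fan v0, v1, ..., vn-1 is the triangles (v0, vi, vi+1), i = 1..n-2. */
    for (i = 1; i + 1 < vertex_count; ++i) {
        indices[index_offset + written++] = base_vertex;
        indices[index_offset + written++] = (unsigned short)(base_vertex + i);
        indices[index_offset + written++] = (unsigned short)(base_vertex + i + 1);
    }
    return written;
}
```

Calling this once per body, with index_offset advanced by the returned count each time, gives one index array for the group, which could then be drawn with a single glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_SHORT, ...) call. Since the indices only depend on the vertex counts, they would only need to be rebuilt when a body is added or removed, not when positions change.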