We are using renderscript for audio dsp processing. It is simple and improves performance significantly for our use-case. But we run into an annoying issue with USAGE_SHARED
on devices that have custom driver with GPU execution enabled.
As you may know, USAGE_SHARED
flag makes the renderscript allocation to reuse the given memory without having to create a copy of it. As a consequence, it not only saves memory, in our case, improves performance to desired level.
The following code with USAGE_SHARED
works fine on default renderscript driver (libRSDriver.so
). With custom driver (libRSDriver_adreno.so
) USAGE_SHARED
does not reuse given memory and thus data.
This is the code that makes use of USAGE_SHARED
and calls renderscript kernel
void process(float* in1, float* in2, float* out, size_t size) {
sp<RS> rs = new RS();
rs->init(app_cache_dir);
sp<const Element> e = Element::F32(rs);
sp<const Type> t = Type::create(rs, e, size, 0, 0);
sp<Allocation> in1Alloc = Allocation::createTyped(
rs, t,
RS_ALLOCATION_MIPMAP_NONE,
RS_ALLOCATION_USAGE_SCRIPT | RS_ALLOCATION_USAGE_SHARED,
in1);
sp<Allocation> in2Alloc = Allocation::createTyped(
rs, t,
RS_ALLOCATION_MIPMAP_NONE,
RS_ALLOCATION_USAGE_SCRIPT | RS_ALLOCATION_USAGE_SHARED,
in2);
sp<Allocation> outAlloc = Allocation::createTyped(
rs, t,
RS_ALLOCATION_MIPMAP_NONE,
RS_ALLOCATION_USAGE_SCRIPT | RS_ALLOCATION_USAGE_SHARED,
out);
ScriptC_x* rsX = new ScriptC_x(rs);
rsX->set_in1Alloc(in1Alloc);
rsX->set_in2Alloc(in2Alloc);
rsX->set_size(size);
rsX->forEach_compute(in1Alloc, outAlloc);
}
NOTE: This variation of Allocation::createTyped()
is not mentioned in the documentation, but code rsCppStructs.h
has it. This is the allocation factory method that allows providing backing pointer and respects USAGE_SHARED
flag. This is how it is declared:
/**
* Creates an Allocation for use by scripts with a given Type and a backing pointer. For use
* with RS_ALLOCATION_USAGE_SHARED.
* @param[in] rs Context to which the Allocation will belong
* @param[in] type Type of the Allocation
* @param[in] mipmaps desired mipmap behavior for the Allocation
* @param[in] usage usage for the Allocation
* @param[in] pointer existing backing store to use for this Allocation if possible
* @return new Allocation
*/
static sp<Allocation> createTyped(
const sp<RS>& rs, const sp<const Type>& type,
RsAllocationMipmapControl mipmaps,
uint32_t usage,
void * pointer);
This is the renderscript kernel
rs_allocation in1Alloc, in2Alloc;
uint32_t size;
// JUST AN EXAMPLE KERNEL
// Not using reduction kernel since it is only available in later API levels.
// Not sure if support library helps here. Anyways, unrelated to the current problem
float compute(float ignored, uint32_t x) {
float result = 0.0f;
for (uint32_t i=0; i<size; i++) {
result += rsGetElementAt_float(in1Alloc, x) * rsGetElementAt_float(in2Alloc, size-i-1); // just an example computation
}
return result;
}
As mentioned, out
doesn't have any of the result of the calculation.
syncAll(RS_ALLOCATION_USAGE_SHARED)
also didn't help.
The following works though (but much slower)
void process(float* in1, float* in2, float* out, size_t size) {
sp<RS> rs = new RS();
rs->init(app_cache_dir);
sp<const Element> e = Element::F32(rs);
sp<const Type> t = Type::create(rs, e, size, 0, 0);
sp<Allocation> in1Alloc = Allocation::createTyped(rs, t);
in1Alloc->copy1DFrom(in1);
sp<Allocation> in2Alloc = Allocation::createTyped(rs, t);
in2Alloc->copy1DFrom(in2);
sp<Allocation> outAlloc = Allocation::createTyped(rs, t);
ScriptC_x* rsX = new ScriptC_x(rs);
rsX->set_in1Alloc(in1Alloc);
rsX->set_in2Alloc(in2Alloc);
rsX->set_size(size);
rsX->forEach_compute(in1Alloc, outAlloc);
outAlloc->copy1DTo(out);
}
Copying makes it to work, but in our testing, copying back and forth significantly degrades performance.
If we switch off GPU execution through debug.rs.default-CPU-driver
system property, we could see that custom driver works well with desired performance.
Aligning memory given to renderscript to 16,32,.., or 1024, etc did not help to make the custom driver respect USAGE_SHARED.
Question
So, our question is this: How to make this kernel work for devices that use custom renderscript driver that enables GPU execution?