Renderscript fails on GPU enabled driver if USAGE_

Posted 2019-02-22 01:56

Question:

We are using RenderScript for audio DSP processing. It is simple and improves performance significantly for our use case. But we have run into an annoying issue with USAGE_SHARED on devices that ship a custom driver with GPU execution enabled.

As you may know, the USAGE_SHARED flag lets a RenderScript allocation reuse the given memory instead of creating a copy of it. This not only saves memory but, in our case, also brings performance up to the level we need.

The following code with USAGE_SHARED works fine on the default RenderScript driver (libRSDriver.so). With the custom driver (libRSDriver_adreno.so), USAGE_SHARED does not reuse the given memory, and therefore not the data in it.

This is the code that uses USAGE_SHARED and calls the RenderScript kernel:

void process(float* in1, float* in2, float* out, size_t size) {
  sp<RS> rs = new RS();
  rs->init(app_cache_dir);

  sp<const Element> e = Element::F32(rs);
  sp<const Type> t = Type::create(rs, e, size, 0, 0);

  sp<Allocation> in1Alloc = Allocation::createTyped(
                rs, t,
                RS_ALLOCATION_MIPMAP_NONE, 
                RS_ALLOCATION_USAGE_SCRIPT | RS_ALLOCATION_USAGE_SHARED,
                in1);

  sp<Allocation> in2Alloc = Allocation::createTyped(
                rs, t,
                RS_ALLOCATION_MIPMAP_NONE, 
                RS_ALLOCATION_USAGE_SCRIPT | RS_ALLOCATION_USAGE_SHARED,
                in2);

  sp<Allocation> outAlloc = Allocation::createTyped(
                rs, t,
                RS_ALLOCATION_MIPMAP_NONE, 
                RS_ALLOCATION_USAGE_SCRIPT | RS_ALLOCATION_USAGE_SHARED,
                out);

  sp<ScriptC_x> rsX = new ScriptC_x(rs);  // hold the script in sp<> so it is released on return
  rsX->set_in1Alloc(in1Alloc);
  rsX->set_in2Alloc(in2Alloc);
  rsX->set_size(size);

  rsX->forEach_compute(in1Alloc, outAlloc);
}

NOTE: This overload of Allocation::createTyped() is not mentioned in the documentation, but rsCppStructs.h declares it. It is the allocation factory method that accepts a backing pointer and respects the USAGE_SHARED flag. This is how it is declared:

/**
 * Creates an Allocation for use by scripts with a given Type and a backing pointer. For use
 * with RS_ALLOCATION_USAGE_SHARED.
 * @param[in] rs Context to which the Allocation will belong
 * @param[in] type Type of the Allocation
 * @param[in] mipmaps desired mipmap behavior for the Allocation
 * @param[in] usage usage for the Allocation
 * @param[in] pointer existing backing store to use for this Allocation if possible
 * @return new Allocation
 */
static sp<Allocation> createTyped(
            const sp<RS>& rs, const sp<const Type>& type,
            RsAllocationMipmapControl mipmaps, 
            uint32_t usage, 
            void * pointer);

This is the RenderScript kernel:

rs_allocation in1Alloc, in2Alloc;
uint32_t size;

// JUST AN EXAMPLE KERNEL
// Not using reduction kernel since it is only available in later API levels.
// Not sure if support library helps here. Anyways, unrelated to the current problem

// A value-returning kernel must be marked with the kernel attribute.
float __attribute__((kernel)) compute(float ignored, uint32_t x) {
  float result = 0.0f;
  for (uint32_t i=0; i<size; i++) {
    result += rsGetElementAt_float(in1Alloc, x) * rsGetElementAt_float(in2Alloc, size-i-1); // just an example computation
  }

  return result;
}

As mentioned, out does not contain any of the results of the calculation. Calling syncAll(RS_ALLOCATION_USAGE_SHARED) did not help either.
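One way to confirm on a given device that out was never written is to compare it with a CPU reference of the same example computation. The following is only a sketch: computeReference and matchesReference are hypothetical helper names, not part of RenderScript, and the tolerance is an arbitrary choice.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference for the example kernel above:
// out[x] = sum over i in [0, size) of in1[x] * in2[size - i - 1].
std::vector<float> computeReference(const float* in1, const float* in2, size_t size) {
    std::vector<float> out(size, 0.0f);
    for (size_t x = 0; x < size; ++x) {
        float result = 0.0f;
        for (size_t i = 0; i < size; ++i) {
            result += in1[x] * in2[size - i - 1];
        }
        out[x] = result;
    }
    return out;
}

// Returns true when every element of `actual` is within `tol` of the reference.
// `actual` must point to at least expected.size() floats.
bool matchesReference(const float* actual, const std::vector<float>& expected,
                      float tol = 1e-4f) {
    for (size_t i = 0; i < expected.size(); ++i) {
        if (std::fabs(actual[i] - expected[i]) > tol) return false;
    }
    return true;
}
```

Calling matchesReference(out, computeReference(in1, in2, size)) after process() returns would show whether the driver wrote the results into the shared buffer.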

The following works, though it is much slower:

void process(float* in1, float* in2, float* out, size_t size) {
  sp<RS> rs = new RS();
  rs->init(app_cache_dir);

  sp<const Element> e = Element::F32(rs);
  sp<const Type> t = Type::create(rs, e, size, 0, 0);

  sp<Allocation> in1Alloc = Allocation::createTyped(rs, t);
  in1Alloc->copy1DFrom(in1);

  sp<Allocation> in2Alloc = Allocation::createTyped(rs, t);
  in2Alloc->copy1DFrom(in2);

  sp<Allocation> outAlloc = Allocation::createTyped(rs, t);

  sp<ScriptC_x> rsX = new ScriptC_x(rs);  // hold the script in sp<> so it is released on return
  rsX->set_in1Alloc(in1Alloc);
  rsX->set_in2Alloc(in2Alloc);
  rsX->set_size(size);

  rsX->forEach_compute(in1Alloc, outAlloc);
  outAlloc->copy1DTo(out);
}

Copying makes it work, but in our testing, copying back and forth significantly degrades performance.

If we switch off GPU execution through the debug.rs.default-CPU-driver system property, the custom driver works well, with the desired performance.

Aligning the memory given to RenderScript to 16, 32, ..., 1024, etc. did not help make the custom driver respect USAGE_SHARED.
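The question does not say how the buffers were aligned. As a minimal sketch, assuming a POSIX platform where posix_memalign is available, an aligned float buffer could be obtained roughly as follows. allocAlignedFloats and isAligned are hypothetical helper names, not part of RenderScript.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Allocates `count` floats whose address is a multiple of `alignment`.
// `alignment` must be a power of two and a multiple of sizeof(void*).
// Returns nullptr on failure; release the buffer with free().
float* allocAlignedFloats(size_t count, size_t alignment) {
    void* p = nullptr;
    if (posix_memalign(&p, alignment, count * sizeof(float)) != 0) {
        return nullptr;
    }
    return static_cast<float*>(p);
}

// Returns true when `p` is a multiple of `alignment`.
bool isAligned(const void* p, size_t alignment) {
    return reinterpret_cast<uintptr_t>(p) % alignment == 0;
}
```

A buffer obtained this way can be passed as the backing pointer to createTyped(), although, as described above, alignment alone did not change the custom driver's behavior.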

Question

So, our question is this: how do we make this kernel work on devices that use a custom RenderScript driver with GPU execution enabled?

Answer 1:

You need to keep the copy even if you use USAGE_SHARED.

USAGE_SHARED is just a hint to the driver; the driver doesn’t have to honor it.

If the driver does share the memory, the copy will be ignored and performance will be the same.
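The reasoning can be illustrated with a small, self-contained model. This is only a sketch of the idea and not the RenderScript API or any driver's actual code: ModelAllocation, honorShared, and doubleInPlace are hypothetical names, and the pointer comparison is an assumption about how a copy might be skipped when the memory is shared.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an allocation; NOT the RenderScript API.
// `honorShared` models whether the driver accepts the USAGE_SHARED hint.
class ModelAllocation {
public:
    ModelAllocation(float* userPtr, size_t count, bool honorShared)
        : count_(count) {
        if (honorShared) {
            data_ = userPtr;           // alias the caller's memory
        } else {
            own_.assign(count, 0.0f);  // driver-private backing store
            data_ = own_.data();
        }
    }

    // Copy into the allocation; a no-op when src already is the backing store.
    void copyFrom(const float* src) {
        if (src != data_) std::memcpy(data_, src, count_ * sizeof(float));
    }

    // Copy out of the allocation; a no-op when dst already is the backing store.
    void copyTo(float* dst) const {
        if (dst != data_) std::memcpy(dst, data_, count_ * sizeof(float));
    }

    float* data() { return data_; }

private:
    size_t count_;
    std::vector<float> own_;
    float* data_ = nullptr;
};

// Caller-side pattern: always copy in, run the work, always copy out.
void doubleInPlace(float* buf, size_t count, bool honorShared) {
    ModelAllocation alloc(buf, count, honorShared);
    alloc.copyFrom(buf);
    for (size_t i = 0; i < count; ++i) {
        alloc.data()[i] *= 2.0f;  // stand-in for the kernel launch
    }
    alloc.copyTo(buf);
}
```

In this model the caller gets the same result whether or not the hint is honored, and pays for the copies only when the memory is not shared. Applied to the question's code, that would mean keeping the copy1DFrom() and copy1DTo() calls from the second listing while still creating the allocations with RS_ALLOCATION_USAGE_SHARED.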