iOS Accelerate Framework vImage - Performance improvement

Posted 2020-06-06 04:15

I've been working with OpenCV and Apple's Accelerate framework, and I find Accelerate's performance slow and Apple's documentation limited. Take this example:

void equalizeHistogram(const cv::Mat &planar8Image, cv::Mat &equalizedImage)
{
    cv::Size size = planar8Image.size();
    vImage_Buffer planarImageBuffer = {
        .width = static_cast<vImagePixelCount>(size.width),
        .height = static_cast<vImagePixelCount>(size.height),
        .rowBytes = planar8Image.step,
        .data = planar8Image.data
    };

    vImage_Buffer equalizedImageBuffer = {
        .width = static_cast<vImagePixelCount>(size.width),
        .height = static_cast<vImagePixelCount>(size.height),
        .rowBytes = equalizedImage.step,
        .data = equalizedImage.data
    };

    // TIME_START/TIME_END are timing macros from the surrounding project.
    TIME_START(VIMAGE_EQUALIZE_HISTOGRAM);
    vImage_Error error = vImageEqualization_Planar8(&planarImageBuffer, &equalizedImageBuffer, kvImageNoFlags);
    TIME_END(VIMAGE_EQUALIZE_HISTOGRAM);
    if (error != kvImageNoError) {
        NSLog(@"%s, vImage error %zd", __PRETTY_FUNCTION__, error);
    }
}

This call takes roughly 20 ms, which makes it unusable in my application. Maybe histogram equalization is inherently slow, but I've also tested BGRA->grayscale conversion and found that OpenCV does it in ~5 ms while vImage takes ~20 ms.
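
For reference, the vImage side of that grayscale conversion can be done in one call; a minimal sketch (the buffer names are illustrative, and the coefficients are the usual Rec. 601 luma weights scaled by 256):

// Sketch: BGRA8888 -> Planar8 grayscale in a single matrix multiply.
// Coefficients are given in the source's B, G, R, A memory order and sum to 256.
const int16_t matrix[4] = { 29, 150, 77, 0 };
vImage_Error err = vImageMatrixMultiply_ARGB8888ToPlanar8(&bgraBuffer,
                                                          &grayBuffer,
                                                          matrix,
                                                          256,  // divisor
                                                          NULL, // no pre-bias
                                                          0,    // no post-bias
                                                          kvImageNoFlags);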

While testing other functions, I found a project with a simple slider app demonstrating a blur function (gist), which I cleaned up to test. It also took roughly ~20 ms.

Is there some trick to getting these functions to be faster?

3 Answers

Answer 1 (走好不送, 2020-06-06 04:46)

Don't keep re-allocating vImage_Buffer if you can avoid it.

One thing that is critical to vImage/Accelerate performance is reusing vImage_Buffers. I can't count how many times Apple's limited documentation hinted at this; I just wasn't listening.

In the aforementioned blur code example, I reworked the test app to set up the vImage_Buffer input and output buffers once per image rather than once per call to boxBlur. That saved nearly 10 ms per call, which made a noticeable difference in response time. (A sketch of the one-time setup follows the method below.)

Accelerate also needs time to warm up before you see the full performance improvement: the first call to this method took 34 ms.

- (UIImage *)boxBlurWithSize:(int)boxSize
{
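    // _inputImageBuffer, _outputImageBuffer, and _inputImageFormat are
    // instance variables, set up once per image rather than once per call.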
    vImage_Error error;
    error = vImageBoxConvolve_ARGB8888(&_inputImageBuffer,
                                       &_outputImageBuffer,
                                       NULL,
                                       0,
                                       0,
                                       boxSize,
                                       boxSize,
                                       NULL,
                                       kvImageEdgeExtend);
    if (error) {
        NSLog(@"vImage error %zd", error);
    }

    CGImageRef modifiedImageRef = vImageCreateCGImageFromBuffer(&_outputImageBuffer,
                                                                &_inputImageFormat,
                                                                NULL,
                                                                NULL,
                                                                kvImageNoFlags,
                                                                &error);

    UIImage *returnImage = [UIImage imageWithCGImage:modifiedImageRef];
    CGImageRelease(modifiedImageRef);

    return returnImage;
}
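
For completeness, here is a sketch of the one-time setup the method above relies on. The ivar names match the method; everything else (including using vImageBuffer_InitWithCGImage) is my assumption about the wiring:

- (void)setupBuffersWithImage:(UIImage *)image
{
    // Describe the pixel format of the source CGImage (ARGB8888 here).
    _inputImageFormat = (vImage_CGImageFormat){
        .bitsPerComponent = 8,
        .bitsPerPixel = 32,
        .colorSpace = NULL, // NULL is interpreted as sRGB
        .bitmapInfo = (CGBitmapInfo)kCGImageAlphaFirst
    };

    // Allocate and fill the input buffer once per image, not once per call.
    vImage_Error error = vImageBuffer_InitWithCGImage(&_inputImageBuffer,
                                                      &_inputImageFormat,
                                                      NULL,
                                                      image.CGImage,
                                                      kvImageNoFlags);
    if (error != kvImageNoError) {
        NSLog(@"vImageBuffer_InitWithCGImage error %zd", error);
    }

    // Allocate a matching 32-bit-per-pixel output buffer once as well;
    // every call to boxBlurWithSize: then reuses both buffers.
    error = vImageBuffer_Init(&_outputImageBuffer,
                              _inputImageBuffer.height,
                              _inputImageBuffer.width,
                              32,
                              kvImageNoFlags);
    if (error != kvImageNoError) {
        NSLog(@"vImageBuffer_Init error %zd", error);
    }
}
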
Answer 2 (别忘想泡老子, 2020-06-06 04:51)

To use vImage with OpenCV, pass a reference to your OpenCV matrix to a method like this one:

long contrastStretch_Accelerate(const Mat& src, Mat& dst) {
    vImagePixelCount rows = static_cast<vImagePixelCount>(src.rows);
    vImagePixelCount cols = static_cast<vImagePixelCount>(src.cols);

    // vImage_Buffer field order is { data, height, width, rowBytes }.
    vImage_Buffer _src = { src.data, rows, cols, src.step };
    vImage_Buffer _dst = { dst.data, rows, cols, dst.step };

    vImage_Error err;

    err = vImageContrastStretch_ARGB8888( &_src, &_dst, kvImageNoFlags );
    return err;
}

The call to this method, from your OpenCV code block, looks like this:

- (void)processImage:(Mat&)image
{
    contrastStretch_Accelerate(image, image);
}

It's that simple, and since everything is passed by pointer, there's no deep copying of any kind. It's about as fast and efficient as it can possibly be, all questions of context and other related performance considerations aside (I can help you with those, too).

SIDENOTE: Did you know that you have to change the channel order when mixing OpenCV with vImage? OpenCV stores pixels as BGRA, while vImage's ARGB8888 functions assume ARGB, so prior to calling any vImage function on an OpenCV matrix, call:

const uint8_t map[4] = { 3, 2, 1, 0 }; // reverses the byte order: BGRA <-> ARGB
err = vImagePermuteChannels_ARGB8888(&_img, &_img, map, kvImageNoFlags);
if (err != kvImageNoError)
    NSLog(@"vImagePermuteChannels_ARGB8888 error: %zd", err);

Afterwards, perform the same call, map and all, to return the image to the channel order an OpenCV matrix expects (the map is its own inverse).
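
Putting the two together, a minimal sketch (the helper name contrastStretchBGRA is hypothetical; it assumes an 8-bit, 4-channel cv::Mat):

long contrastStretchBGRA(cv::Mat& img) {
    // Wrap the matrix's memory in place; vImage_Buffer field order is
    // { data, height, width, rowBytes }.
    vImage_Buffer buf = {
        img.data,
        static_cast<vImagePixelCount>(img.rows),
        static_cast<vImagePixelCount>(img.cols),
        img.step
    };

    // BGRA -> ARGB; the map is its own inverse.
    const uint8_t map[4] = { 3, 2, 1, 0 };
    vImage_Error err = vImagePermuteChannels_ARGB8888(&buf, &buf, map, kvImageNoFlags);
    if (err != kvImageNoError) return err;

    err = vImageContrastStretch_ARGB8888(&buf, &buf, kvImageNoFlags);
    if (err != kvImageNoError) return err;

    // ARGB -> BGRA, restoring the order OpenCV expects.
    return vImagePermuteChannels_ARGB8888(&buf, &buf, map, kvImageNoFlags);
}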

Answer 3 (放我归山, 2020-06-06 04:56)

To get 30 frames per second out of the equalizeHistogram function, you must deinterleave the image (convert from interleaved ARGBxxxx to PlanarX) and equalize ONLY the R(ed), G(reen), and B(lue) channels; if you equalize A(lpha) too, the frame rate will drop to 24 fps or below.

Here is the code that does exactly what you want, as fast as you want:

- (CVPixelBufferRef)copyRenderedPixelBuffer:(CVPixelBufferRef)pixelBuffer {

    CVPixelBufferLockBaseAddress( pixelBuffer, 0 );

    unsigned char *base = (unsigned char *)CVPixelBufferGetBaseAddress( pixelBuffer );
    size_t width = CVPixelBufferGetWidth( pixelBuffer );
    size_t height = CVPixelBufferGetHeight( pixelBuffer );
    size_t stride = CVPixelBufferGetBytesPerRow( pixelBuffer );

    // Wrap the pixel buffer's memory in place; no copy is made.
    vImage_Buffer _img = {
        .data = base,
        .height = height,
        .width = width,
        .rowBytes = stride
    };

    vImage_Error err;
    vImage_Buffer _dstA, _dstR, _dstG, _dstB;

    // One Planar8 destination buffer per channel (8 bits per pixel each).
    err = vImageBuffer_Init( &_dstA, height, width, 8 * sizeof( uint8_t ), kvImageNoFlags );
    if (err != kvImageNoError)
        NSLog(@"vImageBuffer_Init (alpha) error: %zd", err);

    err = vImageBuffer_Init( &_dstR, height, width, 8 * sizeof( uint8_t ), kvImageNoFlags );
    if (err != kvImageNoError)
        NSLog(@"vImageBuffer_Init (red) error: %zd", err);

    err = vImageBuffer_Init( &_dstG, height, width, 8 * sizeof( uint8_t ), kvImageNoFlags );
    if (err != kvImageNoError)
        NSLog(@"vImageBuffer_Init (green) error: %zd", err);

    err = vImageBuffer_Init( &_dstB, height, width, 8 * sizeof( uint8_t ), kvImageNoFlags );
    if (err != kvImageNoError)
        NSLog(@"vImageBuffer_Init (blue) error: %zd", err);

    // Deinterleave ARGB8888 into the four planar buffers.
    err = vImageConvert_ARGB8888toPlanar8( &_img, &_dstA, &_dstR, &_dstG, &_dstB, kvImageNoFlags );
    if (err != kvImageNoError)
        NSLog(@"vImageConvert_ARGB8888toPlanar8 error: %zd", err);

    // Equalize only R, G, and B; skipping alpha is what keeps this at 30 fps.
    err = vImageEqualization_Planar8( &_dstR, &_dstR, kvImageNoFlags );
    if (err != kvImageNoError)
        NSLog(@"vImageEqualization_Planar8 (red) error: %zd", err);

    err = vImageEqualization_Planar8( &_dstG, &_dstG, kvImageNoFlags );
    if (err != kvImageNoError)
        NSLog(@"vImageEqualization_Planar8 (green) error: %zd", err);

    err = vImageEqualization_Planar8( &_dstB, &_dstB, kvImageNoFlags );
    if (err != kvImageNoError)
        NSLog(@"vImageEqualization_Planar8 (blue) error: %zd", err);

    // Re-interleave the planar buffers back into the pixel buffer.
    err = vImageConvert_Planar8toARGB8888( &_dstA, &_dstR, &_dstG, &_dstB, &_img, kvImageNoFlags );
    if (err != kvImageNoError)
        NSLog(@"vImageConvert_Planar8toARGB8888 error: %zd", err);

    err = vImageContrastStretch_ARGB8888( &_img, &_img, kvImageNoFlags );
    if (err != kvImageNoError)
        NSLog(@"vImageContrastStretch_ARGB8888 error: %zd", err);

    // Free the temporary planar buffers allocated by vImageBuffer_Init.
    free(_dstA.data);
    free(_dstR.data);
    free(_dstG.data);
    free(_dstB.data);

    CVPixelBufferUnlockBaseAddress( pixelBuffer, 0 );

    return (CVPixelBufferRef)CFRetain( pixelBuffer );
}

Notice that I allocate the alpha channel even though I perform nothing on it; that's simply because converting back and forth between ARGB8888 and Planar8 requires an alpha-channel buffer to be allocated and referenced. The performance and quality gains are the same regardless.

Also note that I perform the contrast stretch after converting the Planar8 buffers back into a single ARGB8888 buffer, because that's faster than applying the function channel by channel, as I did with histogram equalization. The results are the same either way: contrast stretching does not cause the alpha-channel distortion that histogram equalization does.
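
A minimal usage sketch, assuming the method above lives in an AVCaptureVideoDataOutput sample-buffer delegate (the delegate wiring is my assumption, not part of the answer):

- (void)captureOutput:(AVCaptureOutput *)output
didOutputSampleBuffer:(CMSampleBufferRef)sampleBuffer
       fromConnection:(AVCaptureConnection *)connection
{
    // CVPixelBufferRef and CVImageBufferRef are the same underlying type.
    CVPixelBufferRef pixelBuffer = CMSampleBufferGetImageBuffer(sampleBuffer);

    // Equalize in place; the method returns the same buffer, retained.
    CVPixelBufferRef equalized = [self copyRenderedPixelBuffer:pixelBuffer];

    // ... hand the equalized buffer to a preview layer or encoder ...

    CFRelease(equalized); // balance the CFRetain inside copyRenderedPixelBuffer
}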
