Speed up Matrix Addition in C#

Posted 2020-05-27 03:32

I'd like to optimize this piece of code:

public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{            
        for (int x = 0; x < Width; x++)
        {
            for (int y = 0; y < Height; y++)
            {
                Byte  pixelValue = image.GetPixel(x, y).B;
                this.sumOfPixelValues[x, y] += pixelValue;
                this.sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
            }
        }
}

This is to be used for image processing, and we're currently running this for about 200 images. We've optimized the GetPixel value to use unsafe code, and we're not using image.Width, or image.Height, as those properties were adding to our runtime costs.

However, we're still stuck at a low speed. The problem is that our images are 640x480, so the middle of the loop is being called about 640x480x200 times. I'd like to ask if there's a way to speed it up somehow, or convince me that it's fast enough as it is. Perhaps a way is through some fast Matrix Addition, or is Matrix Addition inherently an n^2 operation with no way to speed it up?

Perhaps doing array accesses via unsafe code would speed it up, but I'm not sure how to go about doing it, and whether it would be worth the time. Probably not. Thanks.

EDIT : Thank you for all your answers.

This is the GetPixel method we're using:

 public Color GetPixel(int x, int y)
    {
        int offsetFromOrigin = (y * this.stride) + (x * 3);
        unsafe
        {
            return Color.FromArgb(this.imagePtr[offsetFromOrigin + 2], this.imagePtr[offsetFromOrigin + 1], this.imagePtr[offsetFromOrigin]);
        }
    }

15 answers
爷的心禁止访问
#2 · 2020-05-27 04:10

Code profiling is the best place to start.
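A full profiler gives the most detail, but a rough first measurement can be taken with System.Diagnostics.Stopwatch. The following is only a minimal sketch: the LoopTimer class and its AverageMilliseconds method are hypothetical names introduced here, not part of the question's code.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing helper (an assumption for illustration only).
public static class LoopTimer
{
    // Runs the action once as a warm-up, then `iterations` more times,
    // and returns the mean wall-clock milliseconds per timed call.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        if (action == null) throw new ArgumentNullException(nameof(action));
        if (iterations <= 0) throw new ArgumentOutOfRangeException(nameof(iterations));

        action(); // warm-up call so JIT compilation is not included in the timing

        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        watch.Stop();

        return watch.Elapsed.TotalMilliseconds / iterations;
    }
}
```

It could be called as LoopTimer.AverageMilliseconds(() => PopulatePixelValueMatrices(image, 640, 480), 20), before and after each change, so that only changes with a measurable effect are kept. Note that every call also adds to the running sums, so do this on scratch data.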

Matrix addition is a highly parallel operation and can be sped up by splitting the work across multiple threads.

I would recommend Intel's IPP library, which contains a threaded, highly optimized API for this sort of operation. Perhaps surprisingly it's only about $100, but it would add significant complexity to your project.

If you don't want to trouble yourself with mixed-language programming and IPP, you could try CenterSpace's C# math libraries. The NMath API contains easy-to-use, forward-scaling matrix operations.

Paul

别忘想泡老子
#3 · 2020-05-27 04:10

Matrix addition is O(n^2) in the number of additions for an n x n matrix. That cannot be reduced, because every element has to be touched once.

However, since there are no intermediate results, you can parallelize the additions using threads:

  1. it is easy to prove that the resulting algorithm is lock-free
  2. you can tune the number of threads to find the optimum
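The two points above can be sketched with Parallel.For from System.Threading.Tasks. This is a sketch under stated assumptions, not the asker's code: the ParallelAccumulator class, the flat row-major byte[] frame holding one channel, and the long[] accumulators are all hypothetical. Each row is written by exactly one loop iteration, so no locks are needed, and maxThreads is the tuning knob.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper (an assumption for illustration only).
public static class ParallelAccumulator
{
    // Adds one frame of 8-bit samples into running per-pixel sums and sums of squares.
    // All three arrays are assumed to be row-major with width * height elements.
    // maxThreads must be positive, or -1 to let the runtime decide.
    public static void Accumulate(byte[] frame, long[] sums, long[] squares,
                                  int width, int height, int maxThreads)
    {
        int count = width * height;
        if (frame.Length < count || sums.Length < count || squares.Length < count)
            throw new ArgumentException("Arrays are smaller than width * height.");

        var options = new ParallelOptions { MaxDegreeOfParallelism = maxThreads };

        // One iteration per row: rows never overlap, so no two threads
        // ever write to the same element.
        Parallel.For(0, height, options, y =>
        {
            int rowStart = y * width;
            for (int x = 0; x < width; x++)
            {
                int value = frame[rowStart + x];
                sums[rowStart + x] += value;
                squares[rowStart + x] += value * value;
            }
        });
    }
}
```

Whether this helps at 640x480 depends on how much of the time is memory-bound, so it is worth measuring with a few values of maxThreads rather than assuming a speed-up.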
戒情不戒烟
#4 · 2020-05-27 04:13

System.Drawing.Color is a structure, which on current versions of .NET kills most optimizations. Since you're only interested in the blue component anyway, use a method that only gets the data you need.

public byte GetPixelBlue(int x, int y)
{
    int offsetFromOrigin = (y * this.stride) + (x * 3);
    unsafe
    {
        return this.imagePtr[offsetFromOrigin];
    }
}

Now, exchange the order of iteration of x and y:

public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{            
    for (int y = 0; y < Height; y++)
    {
        for (int x = 0; x < Width; x++)
        {
            Byte  pixelValue = image.GetPixelBlue(x, y);
            this.sumOfPixelValues[y, x] += pixelValue;
            this.sumOfPixelValuesSquared[y, x] += pixelValue * pixelValue;
        }
    }
}

Now you're accessing all values within a scan line sequentially, which makes much better use of the CPU cache for all three matrices involved (image.imagePtr, sumOfPixelValues, and sumOfPixelValuesSquared). Note that the output arrays must now be allocated as [Height, Width]. [Thanks to Jon for noticing that when I fixed access to image.imagePtr, I broke the other two. Now the output array indexing is swapped to keep it optimal.]

Next, get rid of the member references. Another thread could theoretically be setting sumOfPixelValues to another array midway through, which does horrible horrible things to optimizations.

public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{          
    uint [,] sums = this.sumOfPixelValues;
    ulong [,] squares = this.sumOfPixelValuesSquared;
    for (int y = 0; y < Height; y++)
    {
        for (int x = 0; x < Width; x++)
        {
            Byte  pixelValue = image.GetPixelBlue(x, y);
            sums[y, x] += pixelValue;
            // byte * byte is an int, which cannot be added to a ulong without a cast
            squares[y, x] += (uint)(pixelValue * pixelValue);
        }
    }
}

Now the compiler can generate optimal code for moving through the two output arrays, and after inlining and optimization, the inner loop can step through the image.imagePtr array with a stride of 3 instead of recalculating the offset all the time. Now an unsafe version for good measure, doing the optimizations that I think .NET ought to be smart enough to do but probably isn't:

unsafe public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{          
    byte* scanline = image.imagePtr;
    // Element types follow the previous version: uint sums, ulong squares.
    fixed (uint* sumsStart = &this.sumOfPixelValues[0,0])
    fixed (ulong* squaresStart = &this.sumOfPixelValuesSquared[0,0])
    {
        // Pointers declared by fixed are read-only, so advance copies of them.
        uint* sums = sumsStart;
        ulong* squares = squaresStart;
        for (int y = 0; y < Height; y++)
        {
            byte* blue = scanline;
            for (int x = 0; x < Width; x++)
            {
                byte pixelValue = *blue;
                *sums += pixelValue;
                *squares += (uint)(pixelValue * pixelValue);
                blue += 3;
                sums++;
                squares++;
            }
            scanline += image.stride;
        }
    }
}
够拽才男人
#5 · 2020-05-27 04:14

This is a classic case of micro-optimisation failing horribly. You're not going to get anything from looking at that loop. To get real speed benefits you need to start off by looking at the big picture:-

  • Can you asynchronously preload image[n+1] whilst processing image[n]?
  • Can you load just the B channel from the image? This would reduce the memory bandwidth used.
  • Can you load the B value and update the sumOfPixelValues(Squared) arrays directly, i.e. read the file and update instead of read file, store, read, update? Again, this decreases memory bandwidth.
  • Can you use one dimensional arrays instead of two dimensional? Maybe create your own array class that works either way.
  • Perhaps you could look into using Mono and the SIMD extensions?
  • Can you process the image in chunks and assign them to idle CPUs in a multi-cpu environment?
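
As a rough illustration of the one-dimensional array suggestion in the list above, here is a minimal sketch of an array class that works either way. The FlatMatrix name, its long element type, and its row-major layout are assumptions made for this example.

```csharp
using System;

// Hypothetical wrapper (an assumption for illustration only): a flat,
// row-major array that can also be addressed with two indices.
public sealed class FlatMatrix
{
    public int Width { get; }
    public int Height { get; }

    // The backing store, exposed so hot loops can walk it linearly.
    public long[] Data { get; }

    public FlatMatrix(int width, int height)
    {
        if (width <= 0) throw new ArgumentOutOfRangeException(nameof(width));
        if (height <= 0) throw new ArgumentOutOfRangeException(nameof(height));
        Width = width;
        Height = height;
        Data = new long[width * height];
    }

    // Two-dimensional access for code that prefers (x, y) coordinates.
    public long this[int x, int y]
    {
        get { return Data[y * Width + x]; }
        set { Data[y * Width + x] = value; }
    }
}
```

Code outside the hot path can keep writing matrix[x, y], while the inner loop iterates over Data with a single running index and avoids the per-element index arithmetic of a rectangular array.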

EDIT:

Try having specialised image accessors so you're not wasting memory bandwidth:

public byte GetBPixel (int x, int y)
{
    int offsetFromOrigin = (y * this.stride) + (x * 3);
    unsafe
    {
        // Blue is the first byte of each pixel (see GetPixel in the question).
        return this.imagePtr [offsetFromOrigin];
    }
}

or, better still:

public byte GetBPixel (int offset)
{
    unsafe
    {
        return this.imagePtr [offset];
    }
}

and use the above in a loop like:

for (int start_offset = 0, y = 0 ; y < Height ; start_offset += stride, ++y)
{
   for (int x = 0, offset = start_offset ; x < Width ; offset += 3, ++x)
   {
      byte pixel = GetBPixel (offset);
      // do stuff
   }
}
太酷不给撩
#6 · 2020-05-27 04:18

I'm not sure if it's faster, but you could write something like this:

public void PopulatePixelValueMatrices(GenericImage image,int Width, int Height)
{            
        Byte pixelValue;
        for (int x = 0; x < Width; x++)
        {
            for (int y = 0; y < Height; y++)
            {
                pixelValue = image.GetPixel(x, y).B;
                this.sumOfPixelValues[x, y] += pixelValue;
                this.sumOfPixelValuesSquared[x, y] += pixelValue * pixelValue;
            }
        }
}
欢心
#7 · 2020-05-27 04:20

Sometimes doing things in native C#, even unsafe calls, is just slower than using methods that have already been optimized.

No results guaranteed, but you may want to investigate the System.Windows.Media.Imaging namespace and look at your whole problem in a different way.
