Improving memory layout for parallel computing

Published 2020-03-06 04:31

Question:

I'm trying to optimize an algorithm (Lattice Boltzmann) for parallel computing using C++ AMP, and I'm looking for suggestions to optimize the memory layout. Just moving one field out of the structure into another vector (the blocked vector) gave an increase of about 10%.

Does anyone have tips that could improve this further, or something I should take into consideration? Below are the most time-consuming function, which is executed for each timestep, and the structure used for the layout.

struct grid_cell {
//  int blocked;    // Define if blocked
    float n;        // North
    float ne;       // North-East
    float e;        // East
    float se;       // South-East
    float s;
    float sw;
    float w;
    float nw;
    float c;        // Center
};

int collision(const struct st_parameters param, vector<struct grid_cell> &node, vector<struct grid_cell> &tmp_node, vector<int> &obstacle) {
    int x,y;
    int i = 0;
    float c_sq = 1.0f/3.0f;     // Square of speed of sound
    float w0 = 4.0f/9.0f;       // Weighting factors
    float w1 = 1.0f/9.0f;
    float w2 = 1.0f/36.0f;

    int chunk = param.ny/20;
    float total_density = 0;

    float u_x,u_y;              // Average velocities in x and y direction
    float u[9];                 // Directional velocities
    float d_equ[9];             // Equilibrium densities
    float u_sq;                 // Squared velocity
    float local_density;        // Sum of densities in a particular node

    for(y=0;y<param.ny;y++) {
        for(x=0;x<param.nx;x++) {
            i = y*param.nx + x; // Node index
            // Don't consider blocked cells
            if (obstacle[i] == 0) {
                // Calculate local density
                local_density = 0.0;
                local_density += tmp_node[i].n;
                local_density += tmp_node[i].e;
                local_density += tmp_node[i].s;
                local_density += tmp_node[i].w;
                local_density += tmp_node[i].ne;
                local_density += tmp_node[i].se;
                local_density += tmp_node[i].sw;
                local_density += tmp_node[i].nw;
                local_density += tmp_node[i].c;

                // Calculate x velocity component
                u_x = (tmp_node[i].e + tmp_node[i].ne + tmp_node[i].se - 
                      (tmp_node[i].w + tmp_node[i].nw + tmp_node[i].sw)) 
                      / local_density;
                // Calculate y velocity component
                u_y = (tmp_node[i].n + tmp_node[i].ne + tmp_node[i].nw - 
                      (tmp_node[i].s + tmp_node[i].sw + tmp_node[i].se)) 
                      / local_density;
                // Velocity squared
                u_sq = u_x*u_x +u_y*u_y;

                // Directional velocity components
                u[1] =  u_x;        // East
                u[2] =        u_y;  // North
                u[3] = -u_x;        // West
                u[4] =      - u_y;  // South
                u[5] =  u_x + u_y;  // North-East
                u[6] = -u_x + u_y;  // North-West
                u[7] = -u_x - u_y;  // South-West
                u[8] =  u_x - u_y;  // South-East

                // Equilibrium densities
                // Zero velocity density: weight w0
                d_equ[0] = w0 * local_density * (1.0f - u_sq / (2.0f * c_sq));
                // Axis speeds: weight w1
                d_equ[1] = w1 * local_density * (1.0f + u[1] / c_sq
                                 + (u[1] * u[1]) / (2.0f * c_sq * c_sq)
                                 - u_sq / (2.0f * c_sq));
                d_equ[2] = w1 * local_density * (1.0f + u[2] / c_sq
                                 + (u[2] * u[2]) / (2.0f * c_sq * c_sq)
                                 - u_sq / (2.0f * c_sq));
                d_equ[3] = w1 * local_density * (1.0f + u[3] / c_sq
                                 + (u[3] * u[3]) / (2.0f * c_sq * c_sq)
                                 - u_sq / (2.0f * c_sq));
                d_equ[4] = w1 * local_density * (1.0f + u[4] / c_sq
                                 + (u[4] * u[4]) / (2.0f * c_sq * c_sq)
                                 - u_sq / (2.0f * c_sq));
                // Diagonal speeds: weight w2
                d_equ[5] = w2 * local_density * (1.0f + u[5] / c_sq
                                 + (u[5] * u[5]) / (2.0f * c_sq * c_sq)
                                 - u_sq / (2.0f * c_sq));
                d_equ[6] = w2 * local_density * (1.0f + u[6] / c_sq
                                 + (u[6] * u[6]) / (2.0f * c_sq * c_sq)
                                 - u_sq / (2.0f * c_sq));
                d_equ[7] = w2 * local_density * (1.0f + u[7] / c_sq
                                 + (u[7] * u[7]) / (2.0f * c_sq * c_sq)
                                 - u_sq / (2.0f * c_sq));
                d_equ[8] = w2 * local_density * (1.0f + u[8] / c_sq
                                 + (u[8] * u[8]) / (2.0f * c_sq * c_sq)
                                 - u_sq / (2.0f * c_sq));

                // Relaxation step
                node[i].c = (tmp_node[i].c + param.omega * (d_equ[0] - tmp_node[i].c));
                node[i].e = (tmp_node[i].e + param.omega * (d_equ[1] - tmp_node[i].e));
                node[i].n = (tmp_node[i].n + param.omega * (d_equ[2] - tmp_node[i].n));
                node[i].w = (tmp_node[i].w + param.omega * (d_equ[3] - tmp_node[i].w));
                node[i].s = (tmp_node[i].s + param.omega * (d_equ[4] - tmp_node[i].s));
                node[i].ne = (tmp_node[i].ne + param.omega * (d_equ[5] - tmp_node[i].ne));
                node[i].nw = (tmp_node[i].nw + param.omega * (d_equ[6] - tmp_node[i].nw));
                node[i].sw = (tmp_node[i].sw + param.omega * (d_equ[7] - tmp_node[i].sw));
                node[i].se = (tmp_node[i].se + param.omega * (d_equ[8] - tmp_node[i].se));
            }
        }
    }
    return 1;
}

Answer 1:

Current GPUs are notoriously sensitive to memory layout. Without more details about your application, here are some things I would suggest you explore:

  1. Unit-stride access is very important, so GPUs prefer “structures of arrays” to “arrays of structures”. Just as you moved the field “blocked” into the vector “obstacle”, it should be advantageous to convert all of the fields of “grid_cell” into separate vectors. This should show a benefit on the CPU as well for loops that don’t access all of the fields.

  2. If “obstacle” is very sparse (which I guess is unlikely), then moving it to its own vector is particularly valuable. GPUs, like CPUs, load more than one word from the memory system, either as cache lines or in some other form, and you waste bandwidth when you don’t need some of the data. For many systems memory bandwidth is the bottleneck resource, so any way to reduce bandwidth helps.

  3. This is more speculative, but now that you are writing all of the output vector, it is possible the memory subsystem can avoid reading values in “node” that will simply be overwritten.

  4. On some systems, the on-chip memory is split into banks, and having an odd number of fields within your structure may help remove bank conflicts.

  5. Some systems will also “vectorize” loads and stores, so again, removing “blocked” from the structure might enable more vectorization. The shift to structure-of-arrays mitigates this worry.
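As a rough illustration of point 1, a structure-of-arrays version of grid_cell could look like this (a minimal sketch; grid_soa and local_density are hypothetical names, not part of the original code):

```cpp
#include <cstddef>
#include <vector>

// Structure-of-arrays layout: one contiguous vector per lattice direction.
// Field names mirror the original grid_cell; each vector holds nx*ny cells.
struct grid_soa {
    std::vector<float> n, ne, e, se, s, sw, w, nw, c;
    explicit grid_soa(std::size_t cells)
        : n(cells), ne(cells), e(cells), se(cells), s(cells),
          sw(cells), w(cells), nw(cells), c(cells) {}
};

// The local-density sum now reads nine unit-stride streams instead of
// striding through a 36-byte struct per cell.
float local_density(const grid_soa& g, std::size_t i) {
    return g.n[i] + g.ne[i] + g.e[i] + g.se[i] + g.s[i]
         + g.sw[i] + g.w[i] + g.nw[i] + g.c[i];
}
```

On a GPU this makes neighbouring threads (consecutive i) touch consecutive addresses within each field, which is exactly the unit-stride pattern point 1 asks for.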

Thanks for your interest in C++ AMP.

David Callahan

http://blogs.msdn.com/b/nativeconcurrency/ C++ AMP Team Blog



Answer 2:

In general, you should make sure that data used by different CPUs is not shared (easy) and does not sit on the same cache line (false sharing; see for example here: False Sharing is No Fun). Data used by the same CPU should be close together to benefit from caches.
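To illustrate, a common way to keep per-thread data off a shared cache line is to pad each item to the 64-byte line size (a minimal sketch; padded_counter is a hypothetical name):

```cpp
#include <cstddef>

// Per-thread accumulator padded so that two adjacent instances can never
// share a 64-byte cache line: alignas(64) forces sizeof(padded_counter)
// up to a multiple of 64, so counters[0] and counters[1] land on
// different lines and threads updating them do not invalidate each other.
struct alignas(64) padded_counter {
    float value = 0.0f;
};

static_assert(sizeof(padded_counter) == 64, "one counter per cache line");
```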



Answer 3:

Some small generic tips:

  • Any data structure that is shared across multiple processors should be read-only.

  • Any data structure that requires modification should be unique to one processor and should not share memory locality with data required by another processor.

  • Make sure your memory is arranged so that your code scans through it serially (not taking huge strides or jumping around).
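For the two-dimensional grid in the question, scanning serially means keeping the inner loop over x, since the row-major index i = y*nx + x makes consecutive x values adjacent in memory (a minimal sketch; node_index is a hypothetical helper):

```cpp
#include <cstddef>

// Row-major index as used in the question's collision() loop: with the
// inner loop over x, successive iterations touch successive addresses.
inline std::size_t node_index(std::size_t x, std::size_t y, std::size_t nx) {
    return y * nx + x;
}
```

Swapping the loop order (inner loop over y) would turn every access into a stride of nx elements and defeat the hardware prefetcher.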



Answer 4:

For anyone looking into this topic, some hints. Lattice Boltzmann is generally bandwidth-limited: its performance depends mainly on the amount of data that can be loaded from and written to memory.

  • Use a highly efficient compiled programming language: C or C++ are good choices for CPU-based implementations.

  • Choose an architecture with high bandwidth. For a CPU this means high-clock RAM and many memory channels (quad-channel or more).

  • This makes it crucial to use an appropriate linear memory layout that makes effective use of cache prefetching. Data is arranged in memory in small portions, so-called cache lines. Whenever a processor accesses an element, the entire cache line it lies in (64 bytes on modern architectures) is loaded. This means 8 double or 16 float values are loaded at once! I have not found this to be a problem on multi-core processors, as they share the L3 cache, but it can lead to problems on many-core architectures, since changes to the same cache line have to be kept in sync, and trouble arises when one processor works on data that another processor is also working on (false sharing). This can be avoided by introducing padding, meaning you add elements you won't use to fill the rest of the cache line. Assume you want to update a cell of a discretisation with 27 speeds (the D3Q27 lattice). With doubles (8 bytes each), the 27 values occupy 216 bytes and span 4 distinct cache lines. You should add 5 doubles of padding to fill the remaining 40 bytes (5*8 bytes), so that every cell occupies exactly 4 cache lines (256 bytes).

unsigned int const PAD          = (64 - sizeof(double)*D3Q27.SPEEDS % 64) / sizeof(double); ///< padding: number of doubles
size_t       const MEM_SIZE_POP = sizeof(double)*NZ*NY*NX*(SPEEDS+PAD);                     ///< amount of memory to be allocated

Most compilers naturally align the start of the array with a cache line so you don't have to take care of that.
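If you want to guarantee cache-line alignment rather than rely on the compiler, C++17's std::aligned_alloc can request it explicitly (a sketch; alloc_cacheline_doubles is a hypothetical helper, and note that std::aligned_alloc is not available on every toolchain, e.g. MSVC provides _aligned_malloc instead):

```cpp
#include <cstddef>
#include <cstdlib>

// Allocate n doubles starting on a 64-byte cache-line boundary.
// std::aligned_alloc requires the size to be a multiple of the
// alignment, so round the byte count up first.
double* alloc_cacheline_doubles(std::size_t n) {
    std::size_t bytes = n * sizeof(double);
    bytes = (bytes + 63) / 64 * 64;   // round up to a multiple of 64
    return static_cast<double*>(std::aligned_alloc(64, bytes));
}
```

Memory obtained this way is released with std::free as usual.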

  • The linear indices are inconvenient to work with, so you should make the index computation as efficient as possible, for example by writing a wrapper class. In any case, inline those functions, meaning every call is replaced by its definition in the code.
inline size_t D3Q27_PopIndex(unsigned int const x, unsigned int const y, unsigned int const z, unsigned int const d)
{
    return (D3Q27.SPEEDS + D3Q27.PAD)*(NX*(NY*z + y) + x) + d;
}
  • Furthermore, cache locality can be increased by maximising the ratio between computation and communication, for example using three-dimensional spatial loop blocking (Scaling issues with OpenMP), meaning the code works on a cube of cells instead of a single cell at a time.
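A minimal sketch of such spatial blocking on a 3D grid (update_blocked is a hypothetical stand-in for the collide/stream kernel; the tile size BLOCK is a tunable assumption, not a prescribed value):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Three-dimensional loop blocking: visit the grid cube by cube so that a
// BLOCK^3 working set stays cache-resident while it is updated.
constexpr std::size_t BLOCK = 8;   // hypothetical tile size, tune per machine

void update_blocked(std::vector<float>& f,
                    std::size_t nx, std::size_t ny, std::size_t nz) {
    for (std::size_t zb = 0; zb < nz; zb += BLOCK)
      for (std::size_t yb = 0; yb < ny; yb += BLOCK)
        for (std::size_t xb = 0; xb < nx; xb += BLOCK)
          // Inner loops stay within one tile; each cell is visited once.
          for (std::size_t z = zb; z < std::min(zb + BLOCK, nz); ++z)
            for (std::size_t y = yb; y < std::min(yb + BLOCK, ny); ++y)
              for (std::size_t x = xb; x < std::min(xb + BLOCK, nx); ++x)
                f[nx*(ny*z + y) + x] += 1.0f;   // stand-in for collide/stream
}
```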

  • Generally, implementations make use of two distinct populations A and B and perform collision and streaming from one into the other. This means every value in memory exists twice, once pre- and once post-collision. There exist different strategies for recombining the steps and storing the data in such a way that you only have to keep one population copy in memory. For instance, see the A-A pattern as proposed by P. Bailey et al. - "Accelerating Lattice Boltzmann Fluid Flow Simulations Using Graphics Processors" (https://www2.cs.arizona.edu/people/pbailey/Accelerating_GPU_LBM.pdf) or the Esoteric Twist by M. Geier & M. Schönherr - "Esoteric Twist: An Efficient in-Place Streaming Algorithmus for the Lattice Boltzmann Method on Massively Parallel Hardware" (https://pdfs.semanticscholar.org/ea64/3d63667900b60e6ff49f2746211700e63802.pdf). I have implemented the first with the use of macros, meaning every access of a population goes through a macro of the form:

#define O_E(a,b)       a*odd + b*(!odd)

#define  READ_f_0      D3Q27_PopIndex(x,           y, z,           0, p)
#define  READ_f_1      D3Q27_PopIndex(O_E(x_m, x), y, z, O_E( 1,  2), p)
#define  READ_f_2      D3Q27_PopIndex(O_E(x_p, x), y, z, O_E( 2,  1), p)
...

#define  WRITE_f_0     D3Q27_PopIndex(x,           y, z,           0, p)
#define  WRITE_f_1     D3Q27_PopIndex(O_E(x_p, x), y, z, O_E( 2,  1), p)
#define  WRITE_f_2     D3Q27_PopIndex(O_E(x_m, x), y, z, O_E( 1,  2), p)
...
  • If you have multiple interacting populations, use grid merging: lay the indices out linearly in memory and put the two distinct populations side by side. Accessing population p then works as follows:
inline size_t D3Q27_PopIndex(unsigned int const x, unsigned int const y, unsigned int const z, unsigned int const d, unsigned int const p = 0)
{
    return (D3Q27.SPEEDS*D3Q27.NPOP + D3Q27.PAD)*(NX*(NY*z + y) + x) + D3Q27.SPEEDS*p + d;
}
  • For a regular grid, make the algorithm as predictable as possible: let every cell perform collision and streaming, and then apply the boundary conditions in reverse afterwards. If you have many cells that do not contribute directly to the algorithm, omit them with a logical mask, which you can store in the padding as well!

  • Make everything known to the compiler at compilation time: template, for example, the boundary conditions with a function that takes care of index changes, so you don't have to rewrite every boundary condition.
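A minimal sketch of this idea: encode each boundary condition as a type and template the treatment function over it, so the compiler can inline and specialise each variant at compile time (BounceBack, AntiBounceBack and treat_boundary are hypothetical names, not from the original answer):

```cpp
// Tag types select the boundary condition at compile time.
struct BounceBack {
    // Reflect the incoming population unchanged (solid wall).
    static float apply(float incoming) { return incoming; }
};
struct AntiBounceBack {
    // Negate the incoming population (e.g. a pressure boundary).
    static float apply(float incoming) { return -incoming; }
};

// One generic driver instead of a hand-written loop per boundary
// condition; each instantiation is fully inlined by the compiler.
template <typename BC>
float treat_boundary(float incoming) {
    return BC::apply(incoming);
}
```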

  • Modern architectures have registers that can perform SIMD operations, that is, the same instruction on multiple data. Some processors (AVX-512) can process up to 512 bits of data, and thus 8 doubles, almost as fast as a single number. This seems very attractive for LBM, in particular ever since gather and scatter instructions were introduced (https://en.wikipedia.org/wiki/Gather-scatter_(vector_addressing)), but with the current bandwidth limitations (maybe it will be worth it with DDR5 and processors with few cores) it is, in my opinion, not worth the hassle: the single-core performance and parallel scaling are better (M. Wittmann et al. - "Lattice Boltzmann Benchmark Kernels as a Testbed for Performance Analysis" - https://arxiv.org/abs/1711.11468), but the overall algorithm performs no better, as it is bandwidth-limited. So it only makes sense on architectures that are limited by compute capacity rather than by bandwidth. On the Xeon Phi architecture the results seem to be remarkable: Robertsen et al. - "High‐performance SIMD implementation of the lattice‐Boltzmann method on the Xeon Phi processor" (https://onlinelibrary.wiley.com/doi/abs/10.1002/cpe.5072).

In my opinion, most of this is not worth the effort for simple 2D implementations. Do the easy optimisations there (loop blocking, a linear memory layout) but forget about the more complex access patterns. In 3D the effect can be enormous: I have achieved up to 95% parallel scalability and an overall performance of over 150 Mlups with a D3Q19 lattice on a modern 12-core processor. For more performance, switch to more suitable architectures, such as GPUs with CUDA C, which are optimised for bandwidth.