C++ zero-cost abstraction for SoA/AoS memory layou

2019-05-07 07:36发布

Say I have a large code using Array of Structures (AoS) memory layout. I would like to build a zero-cost abstraction in C++ which allows me to switch between AoS and SoA with as little refactoring effort as possible. For instance take a class with access member functions

 struct Item{
   auto& myDouble(){ return mDouble; }
   auto& myChar(){ return mChar; }
   auto& myString(){ return mString; }
 private:
   double mDouble;
   char mChar;
   std::string mString;
 };

which is used inside a container in a loop

std::vector<Item> vec_(1000);
for (auto& i : vec_)
  i.myDouble()=5.;

I would like to change the first snippet while the second one remains similar.. e.g. having something like

MyContainer<Item, SoA> vec_(1000)
for (auto& i : vec_)
  i.myDouble()=5.;

in which I can select the memory layout with a "SoA" or "AoS" template parameters. My questions are: does such thing exist somewhere? And if it doesn't, how would it be implemented at best?

2条回答
做个烂人
2楼-- · 2019-05-07 07:46

To achieve what you want, you simply have to make your new structure, iterable. Forgive my Java lingo, what I mean by iterable in C++, is simply that you should create functions within your class called begin and end. These should return an iterator object which has the (pre)++ or ++(post) overloaded, as well as the *(pointer) operator.

Another way is this: Why use non-member begin and end functions in C++11?

This will now allow you simply swap the container type and have the for-range loop still work the way it should.

查看更多
萌系小妹纸
3楼-- · 2019-05-07 07:48

I implemented a generic solution, I'll explain it here below (it will be a long post). This is not the only possible answer of course, and it'd be great to gather feedback. I placed the full code of this solution here https://github.com/crosetto/SoAvsAoS

We create two helper classes which given an item generate the container type as a vector of tuples or a tuple of vectors, depending on a tag template argument. We call this class a DataLayoutPolicy and we are going to use it e.g. in this way:

DataLayoutPolicy<std::vector, SoA, char, double, std::string>

to generate a tuple of vectors of char, int and double.

enum class DataLayout { SoA, //structure of arrays
                        AoS //array of structures
};
template <template <typename...> class Container, DataLayout TDataLayout, typename TItem>
struct DataLayoutPolicy;

This class will only contain static member functions to interact with the container (e.g. extract an element, insert, resize, etc...). We write two template specializations. The first one (trivial) for the array of structures situation:

template <template <typename...> class Container, template<typename...> class TItem, typename... Types>
struct DataLayoutPolicy<Container, DataLayout::AoS, TItem<Types...>> {
    using type = Container<TItem<Types...>>;
    using value_type = TItem<Types...>&;

    constexpr static value_type get( type& c_, std::size_t position_ ){ return value_type(*static_cast<TItem<Types...>*>(&c_[ position_ ])); }

    constexpr static void resize( type& c_, std::size_t size_ ) { c_.resize( size_ ); }

    template <typename TValue>
    constexpr static void push_back( type& c_, TValue&& val_ ){ c_.push_back( val_ ); }
    static constexpr std::size_t size(type& c_){ return  c_.size(); }
};

... just forwarding. We do the same for the structure of arrays case.

Note: there are a few things to be explained about the code below.

It wraps all the types in a ref_wrap type, which is a "decorated" std::reference_wrapper. This is because we want to access the elements as lvalue references, to be able to change their values. using a regular reference we would be in trouble if e.g. Types contains any reference. One thing worth noticing is that in the AoS case the DataLayoutPolicy::value_type is a reference, while in the SoA case is the value of a ref_wrap type.

we return by value a newly created tuple of ref_wrap of the values. This is surpisingly OK, because the compiler is optimizing all copies away, and it is even more OK in C++17 (the returned tuple is a 'prvalue'), because of the guaranteed copy elision added to the standard: the tuple is not copied, this code would work even if std::tuple and std::reference_wrapper didn't have a copy/move constructor.

we use a std::integer sequence to statically unroll a parameter pack: this is ugly but it is "the way" to do it since C++14 (and in C++11 one had to use template recursion to achieve the same). There is not yet such thing like a "for_each" for parameter packs.

we use C++17 fold expressions to call a function returning void multiple times. Before C++17 this was achieved concisely with tricky hacks.

template <typename T>
struct ref_wrap : public std::reference_wrapper<T>{
    operator T&() const noexcept { return this->get(); }
    ref_wrap(T& other_) : std::reference_wrapper<T>(other_){}
    void operator =(T && other_) {this->get()=other_;}
};

template <template <typename...> class Container, template<typename...> class TItem, typename... Types>
struct DataLayoutPolicy<Container, DataLayout::SoA, TItem<Types...>> {
    using type = std::tuple<Container<Types>...>;
    using value_type = TItem<ref_wrap<Types>...>;

    constexpr static value_type get( type& c_, std::size_t position_ )
    {
        return doGet( c_, position_, std::make_integer_sequence<unsigned, sizeof...( Types )>() ); // unrolling parameter pack
    }

    constexpr static void resize( type& c_, std::size_t size_ ) {
        doResize( c_, size_, std::make_integer_sequence<unsigned, sizeof...( Types )>() ); // unrolling parameter pack
    }

    template <typename TValue>
    constexpr static void push_back( type& c_, TValue&& val_ ){
        doPushBack( c_, std::forward<TValue>(val_), std::make_integer_sequence<unsigned, sizeof...( Types )>() ); // unrolling parameter pack
    }

    static constexpr std::size_t size(type& c_){ return std::get<0>( c_ ).size(); }

    private:

    template <unsigned... Ids>
    constexpr static auto doGet( type& c_, std::size_t position_, std::integer_sequence<unsigned, Ids...> )
    {
        return value_type{ ref_wrap( std::get<Ids>( c_ )[ position_ ] )... }; // guaranteed copy elision
    }

    template <unsigned... Ids>
    constexpr static void doResize( type& c_, unsigned size_, std::integer_sequence<unsigned, Ids...> )
    {
        ( std::get<Ids>( c_ ).resize( size_ ), ... ); //fold expressions
    }

    template <typename TValue, unsigned... Ids>
    constexpr static void doPushBack( type& c_, TValue&& val_, std::integer_sequence<unsigned, Ids...> )
    {
        ( std::get<Ids>( c_ ).push_back( std::get<Ids>( std::forward<TValue>( val_ ) ) ), ... ); // fold expressions
    }
};

So now this code shows pretty clearly how this abstraction can be built. We show below a possible strategy to use it. We define the policy_t type using the DataLayoutPolicy and a generic TItem type

template <template <typename T> class TContainer, DataLayout TDataLayout, typename TItem>
using policy_t = DataLayoutPolicy<TContainer, TDataLayout, TItem>;

The container class forwards most of the calls to the static functions defined by the policy_t type. It might look like the following

template <template <typename ValueType> class TContainer, DataLayout TDataLayout, typename TItem>
struct BaseContainer
{
    /*member functions like puhs_back, resize,...*/
    value_type operator[]( std::size_t position_ )
    {
            return policy_t::get( mValues, position_ );
    }

    iterator       begin() { return iterator( this, 0 ); }
    iterator       end() { return iterator( this, size() ); }

    private:

    typename policy_t::type mValues;

};

Now this is no standard container, so we have to define an iterator in order to use it whithin STL algorithms. The iterator we build looks like a STL iterator for a container of tuple, except for the fact that it must hold a reference to the container, because when we call the dereference operator we want to call our storage's operator[], which statically dispatches the operation using the container's data layout policy.

template <typename  TContainer>
class Iterator
{

private:
    using container_t = TContainer;
public:

    /* ... usual iterator member functions and type definitions ...*/

    template<typename TTContainer>
    Iterator( TTContainer* container_, std::size_t position_ = 0 ):
        mContainer( container_ )
        , mIterPosition( position_ )
    {
    }

    value_type operator*() {
        return (*mContainer)[ mIterPosition ];
    }

    private:
    container_t*        mContainer = nullptr;
    std::size_t         mIterPosition = std::numeric_limits<std::size_t>::infinity();
};

Eventually we define our "item" data structure: we make it a decorator of a std::tuple, with some specific member functions (in this case only getters/setters).

template<typename ... T>
struct Item : public std::tuple<T ...>{
    using std::tuple<T...>::tuple;
    auto & myDouble(){return std::get<0>(*this);}
    auto & myChar()  {return std::get<1>(*this);}
    auto & myString(){return std::get<2>(*this);}
};

When we call Item's member functions we have to rely on compiler optimization in order for our abstraction to be "zero-cost": we don't want to call the Item constructor, because we are creating a temporary tuple just to access one of it's members each time and then we thrash it right away.

so eventually we can write the program:

template<typename T>
using MyVector = std::vector<T, std::allocator<T>>;

int main(int argc, char** argv){
using container_t = BaseContainer<MyVector, DataLayout::SoA, Item<double, char, std::string, Pad> >;
container_t container_(1000);

 for(auto&& i : container_){
    i.myDouble()=static_cast<double>(argc);
}

and we can write generic and efficient code regardless of the memory layout underneath. What's left to do is to check that this is a zero cost abstraction. The easiest way for me to check that is using a debugger: compile the example with debug symbols on,

> clang++ -std=c++1z -O3 -g main.cpp -o test

run it with gdb, set a brakpoint in the for loop, and step through the assembly instructions (the layout split command shows the source code and disassembled instructions at the same time)

> gdb test
(gdb) break main.cpp : 10 # set breakpoint inside the loop
(gdb) run # execute until the breakpoint
(gdb) layout split # show assembly and source code in 2 separate frames
(gdb) stepi # execute one instruction

The instructions being executed inside the loop are in case of AoS data layout

0x400b00 <main(int, char**)+192>        movsd  %xmm0,(%rsi)
0x400b04 <main(int, char**)+196>        add    $0x610,%rsi
0x400b0b <main(int, char**)+203>        add    $0xffffffffffffffff,%rcx
0x400b0f <main(int, char**)+207>        jne    0x400b00 <main(int, char**)+192>

Notice in particular that in the second line the offset being add to compute the address is 0x160. This changes depending on the size of the data members in the item object. On the other hand for the SoA data structure we have

0x400b60 <main(int, char**)+224>        movups %xmm1,(%rdi,%rsi,8)
0x400b64 <main(int, char**)+228>        movups %xmm1,0x10(%rdi,%rsi,8)
0x400b69 <main(int, char**)+233>        movups %xmm1,0x20(%rdi,%rsi,8)
0x400b6e <main(int, char**)+238>        movups %xmm1,0x30(%rdi,%rsi,8)
0x400b73 <main(int, char**)+243>        movups %xmm1,0x40(%rdi,%rsi,8)
0x400b78 <main(int, char**)+248>        movups %xmm1,0x50(%rdi,%rsi,8)
0x400b7d <main(int, char**)+253>        movups %xmm1,0x60(%rdi,%rsi,8)
0x400b82 <main(int, char**)+258>        movups %xmm1,0x70(%rdi,%rsi,8)
0x400b87 <main(int, char**)+263>        movups %xmm1,0x80(%rdi,%rsi,8)
0x400b8f <main(int, char**)+271>        movups %xmm1,0x90(%rdi,%rsi,8)
0x400b97 <main(int, char**)+279>        movups %xmm1,0xa0(%rdi,%rsi,8)
0x400b9f <main(int, char**)+287>        movups %xmm1,0xb0(%rdi,%rsi,8)
0x400ba7 <main(int, char**)+295>        movups %xmm1,0xc0(%rdi,%rsi,8)
0x400baf <main(int, char**)+303>        movups %xmm1,0xd0(%rdi,%rsi,8)
0x400bb7 <main(int, char**)+311>        movups %xmm1,0xe0(%rdi,%rsi,8)
0x400bbf <main(int, char**)+319>        movups %xmm1,0xf0(%rdi,%rsi,8)
0x400bc7 <main(int, char**)+327>        add    $0x20,%rsi
0x400bcb <main(int, char**)+331>        add    $0x8,%rbx
0x400bcf <main(int, char**)+335>        jne    0x400b60 <main(int, char**)+224>

We see the loop is unrolled and vectorized by Clang (version 6.0.0), and the increment for the address is 0x20, independent of the number of data members present in the item struct.

查看更多
登录 后发表回答