Fastest method of accessing elements in Boost Mult

2019-08-05 05:19发布

What is faster - accessing elements of a multiarray using element selection operator, or traversing over the multiarray using iterators?

In my case, I need to do a full pass over all elements of multiarray each time.

1条回答
乱世女痞
2楼-- · 2019-08-05 05:45

The fastest way to access every element of a boost::multi_array is via data() and num_elements().

With data() you gain access to the underlying raw storage (a contiguous block that contains the array‘s data) so there isn't need for multiple index calculations (consider also that multi_array can index arrays from bases different from 0 and this is a further complication).

A simple test gives:

g++ -O3 -fomit-frame-pointer -march=native   (GCC v4.8.2)
Writing (index): 9.70651
Writing (data):  2.22353
Reading (index): 4.5973 (found 1)
Reading (data):  3.53811 (found 1)

clang++ -O3 -fomit-frame-pointer -march=native   (CLANG v3.3)
Writing (index): 5.49858
Writing (data):  2.13678
Reading (index): 5.07324 (found 1)
Reading (data):  2.55109 (found 1)

By default boost access methods perform range checking. If a supplied index is out of the range defined for an array, an assertion will abort the program. To disable range checking, you can define the BOOST_DISABLE_ASSERTS preprocessor macro prior to including multi_array.hpp in your application.

This will reduce a lot the performance difference:

g++ -O3 -fomit-frame-pointer -march=native   (GCC v4.8.2)
Writing (index): 3.15244
Writing (data):  2.23002
Reading (index): 1.89553 (found 1)
Reading (data):  1.54427 (found 1)

clang++ -O3 -fomit-frame-pointer -march=native   (CLANG v3.3)
Writing (index): 2.24831
Writing (data):  2.12853
Reading (index): 2.59164 (found 1)
Reading (data):  2.52141 (found 1)

Performance difference increases (i.e. data() is faster):

  • with a higher number of dimensions;
  • with less elements (for a large number of elements the access to elements is not going to be as significant as the cache pressure to load those elements into the CPU cache. The prefetcher is going to be sitting there trying to load those elements, which is going to take a large proportion of the time).

Anyway this kind of optimization is unlikely to make a measurable difference in a real program. You shouldn't worry about this unless you've conclusively determined, through extensive testing, that it is the source of some sort of bottleneck.

Source:

#include <chrono>
#include <iostream>

// #define BOOST_DISABLE_ASSERTS
#include <boost/multi_array.hpp>

int main()
{
  using array3 = boost::multi_array<unsigned, 3>;
  using index = array3::index;

  using clock = std::chrono::high_resolution_clock;
  using duration = std::chrono::duration<double>;

  constexpr unsigned d1(300), d2(400), d3(200), sup(100);

  array3 A(boost::extents[d1][d2][d3]);

  // Writing via index
  const auto t_begin1(clock::now());
  unsigned values1(0);
  for (unsigned n(0); n < sup; ++n)
    for (index i(0); i != d1; ++i)
      for (index j(0); j != d2; ++j)
        for (index k(0); k != d3; ++k)
          A[i][j][k] = ++values1;
  const auto t_end1(clock::now());

  // Writing directly
  const auto t_begin2(clock::now());
  unsigned values2(0);
  for (unsigned n(0); n < sup; ++n)
  {
    const auto sup(A.data() + A.num_elements());

    for (auto i(A.data()); i != sup; ++i)
      *i = ++values2;
  }
  const auto t_end2(clock::now());

  // Reading via index
  const auto t_begin3(clock::now());
  bool found1(false);
  for (unsigned n(0); n < sup; ++n)
    for (index i(0); i != d1; ++i)
      for (index j(0); j != d2; ++j)
        for (index k(0); k != d3; ++k)
          if (A[i][j][k] == values1)
            found1 = true;
  const auto t_end3(clock::now());

  // Reading directly
  const auto t_begin4(clock::now());
  bool found2(false);
  for (unsigned n(0); n < sup; ++n)
  {
    const auto sup(A.data() + A.num_elements());

    for (auto i(A.data()); i != sup; ++i)
      if (*i == values2)
        found2 = true;
  }
  const auto t_end4(clock::now());

  std::cout << "Writing (index): "
            << std::chrono::duration_cast<duration>(t_end1 - t_begin1).count()
            << std::endl
            << "Writing (data):  "
            << std::chrono::duration_cast<duration>(t_end2 - t_begin2).count()
            << std::endl
            << "Reading (index): "
            << std::chrono::duration_cast<duration>(t_end3 - t_begin3).count()
            << " (found " << found1 << ")" << std::endl
            << "Reading (data):  "
            << std::chrono::duration_cast<duration>(t_end4 - t_begin4).count()
            << " (found " << found2 << ")" << std::endl;

  return 0;
}
查看更多
登录 后发表回答