C++11 async is using only one core

2020-07-11 09:56发布

问题:

I'm trying to parallelise a long running function in C++ and using std::async it only uses one core.

It's not the running time of the function is too small, as I'm currently using test data that takes about 10 mins to run.

From my logic I create NThreads worth of Futures (each taking a proportion of the loop rather than an individual cell so it is a nicely long running thread), each of which will dispatch an async task. Then after they've been created the program spin locks waiting for them to complete. However it always uses one core?!

This isn't me looking at top either and saying it looks roughly like one CPU, my ZSH config outputs the CPU % of the last command, and it always exactly 100%, never above

auto NThreads = 12;
auto BlockSize = (int)std::ceil((int)(NThreads / PathCountLength));

std::vector<std::future<std::vector<unsigned __int128>>> Futures;

for (auto I = 0; I < NThreads; ++I) {
    std::cout << "HERE" << std::endl;
    unsigned __int128 Min = I * BlockSize;
    unsigned __int128 Max = I * BlockSize + BlockSize;

    if (I == NThreads - 1)
        Max = PathCountLength;

    Futures.push_back(std::async(
        [](unsigned __int128 WMin, unsigned __int128 Min, unsigned__int128 Max,
           std::vector<unsigned __int128> ZeroChildren,
           std::vector<unsigned __int128> OneChildren,
           unsigned __int128 PathCountLength)
           -> std::vector<unsigned __int128> {
           std::vector<unsigned __int128> LocalCount;
           for (unsigned __int128 I = Min; I < Max; ++I)
               LocalCount.push_back(KneeParallel::pathCountOrStatic(
                   WMin, I, ZeroChildren, OneChildren, PathCountLength));
          return LocalCount;
    },
    WMin, Min, Max, ZeroChildInit, OneChildInit, PathCountLength));
}

for (auto &Future : Futures) {
    Future.get();
}

Does anyone have any insight.

I'm compiling with clang and LLVM on Arch Linux. Are there any compile flags I need, but from what I can tell C++11 standardised the thread library?

Edit: If it helps anyone giving any further clues, when I comment out the local vector it runs on all cores as it should, when I drop it back in rolls back to one core.

Edit 2: So I pinned down the solution, but it seems very bizarre. Returning the vector from the lambda function fixed it to one core, so now I get round this by passing in a shared_ptr to the output vector and manipulating that. And hey presto, it fires up on the cores!

I figured it was pointless now using futures as I don't have a return and I'd use threads instead, nope!, using threads with no returns also uses one core. Weird eh?

Fine, go back to using futures, just return an into to throw away or something. Yep you guessed it, even returning an int from the thread sticks the program to one core. Except futures can't have void lambda functions. So my solution is to pass a pointer in to store the output, to an int lambda function that never returns anything. Yeah it feels like duct tape, but I can't see a better solution.

It seems so...bizzare? Like the compiler is somehow interpreting the lambda incorrectly. Could it be because I use the dev release of LLVM and not a stable branch...?

Anyway my solution, because I hate nothing more than finding my problm on here and having no answer:

auto NThreads = 4;
auto BlockSize = (int)std::ceil((int)(NThreads / PathCountLength));

auto Futures = std::vector<std::future<int>>(NThreads);
auto OutputVectors =
    std::vector<std::shared_ptr<std::vector<unsigned __int128>>>(
        NThreads, std::make_shared<std::vector<unsigned __int128>>());

for (auto I = 0; I < NThreads; ++I) {
  unsigned __int128 Min = I * BlockSize;
  unsigned __int128 Max = I * BlockSize + BlockSize;

if (I == NThreads - 1)
  Max = PathCountLength;

Futures[I] = std::async(
  std::launch::async,
  [](unsigned __int128 WMin, unsigned __int128 Min, unsigned __int128 Max,
       std::vector<unsigned __int128> ZeroChildren,
       std::vector<unsigned __int128> OneChildren,
       unsigned __int128 PathCountLength,
       std::shared_ptr<std::vector<unsigned __int128>> OutputVector)
        -> int {
      for (unsigned __int128 I = Min; I < Max; ++I) {
        OutputVector->push_back(KneeParallel::pathCountOrStatic(
            WMin, I, ZeroChildren, OneChildren, PathCountLength));
      }
    },
    WMin, Min, Max, ZeroChildInit, OneChildInit, PathCountLength,
    OutputVectors[I]);
}

for (auto &Future : Futures) {
  Future.get();
}

回答1:

By providing a first argument to async, you can configure it to run deferred (std::launch::deferred), to run in its own thread (std::launch::async), or let the system decide between both options (std::launch::async | std::launch::deferred). The latter is the default behavior.

So, to force it to run in another thread, adapt your call of std::async to std::async(std::launch::async, /*...*/).