Can we parallelize this task?

Posted 2019-08-27 15:49

Given a C string (an array of characters terminated by a null character), we have to find the length of the string. Could you please suggest some ways to parallelize this across N threads of execution? I am having trouble dividing it into sub-problems, since accessing an array location past the end of the string will cause a segmentation fault.

EDIT: I am not concerned about whether doing this task in parallel has much greater overhead or not. I just want to know if it can be done (using something like OpenMP, etc.).

6 Answers
迷人小祖宗
#2 · 2019-08-27 16:10

You could do something ugly like this on Windows, enclosing the unsafe memory reads in an SEH __try block:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 2

DWORD WINAPI FindZeroThread(LPVOID lpParameter)
{
  const char* volatile* pp = (const char* volatile*)lpParameter;

  __try
  {
    /* Stride through the string N bytes at a time until a zero byte is
       found or we run off the end of mapped memory. */
    while (**pp)
    {
      (*pp) += N;
    }
  }
  __except (EXCEPTION_EXECUTE_HANDLER)
  {
    /* Faulted past the end of the string: this thread found nothing. */
    *pp = NULL;
  }

  return 0;
}

size_t pstrlen(const char* s)
{
  int i;
  HANDLE handles[N];
  const char* volatile ptrs[N];
  const char* p = (const char*)(UINT_PTR)-1;

  /* Thread i starts at offset i and strides by N, so between them the
     N threads inspect every byte. */
  for (i = 0; i < N; i++)
  {
    ptrs[i] = s + i;
    handles[i] = CreateThread(NULL, 0, &FindZeroThread, (LPVOID)&ptrs[i], 0, NULL);
  }

  WaitForMultipleObjects(N, handles, TRUE /* bWaitAll */, INFINITE);

  /* The terminator is at the smallest address any thread stopped on. */
  for (i = 0; i < N; i++)
  {
    CloseHandle(handles[i]);
    if (ptrs[i] && p > ptrs[i]) p = ptrs[i];
  }

  return (size_t)(p - s);
}

#define LEN (20 * 1000 * 1000)

int main(void)
{
  char* s = malloc(LEN);
  if (!s) return 1;

  memset(s, '*', LEN);
  s[LEN - 1] = 0;

  printf("strlen()=%zu pstrlen()=%zu\n", strlen(s), pstrlen(s));

  free(s);
  return 0;
}

Output:

strlen()=19999999 pstrlen()=19999999

I think it may be better to use MMX/SSE instructions to scan the string in a data-parallel way instead.

EDIT: This may not be a very good idea on Windows after all; see Raymond Chen's IsBadXxxPtr should really be called CrashProgramRandomly.

够拽才男人
#3 · 2019-08-27 16:15

No, it can't. Each step requires knowing the previous state (did we encounter a null on the previous character?), so you can only safely check one character at a time.

Imagine you are turning over rocks and you MUST stop at the one with white paint underneath (the null), or you will die (i.e., segfault).

You can't have people "working ahead" of each other, as the white-paint rock might lie in between.

Having multiple people (threads/processes) would simply mean them taking turns being the one to turn over the next rock. They would never be turning over rocks at the same time as each other.

家丑人穷心不美
#4 · 2019-08-27 16:19

Let me acknowledge this up front:

the following code is written in C#, not C, but it should convey the idea I am trying to articulate. Most of the content comes from a Microsoft draft document on parallel patterns.

To do the best static partitioning possible, you need to be able to accurately predict ahead of time how long all the iterations will take. That’s rarely feasible, resulting in a need for a more dynamic partitioning, where the system can adapt to changing workloads quickly. We can address this by shifting to the other end of the partitioning tradeoffs spectrum, with as much load-balancing as possible.

To do that, rather than pushing to each of the threads a given set of indices to process, we can have the threads compete for iterations. We employ a pool of the remaining iterations to be processed, which initially starts filled with all iterations. Until all of the iterations have been processed, each thread goes to the iteration pool, removes an iteration value, processes it, and then repeats. In this manner, we can achieve in a greedy fashion an approximation for the optimal level of load-balancing possible (the true optimum could only be achieved with a priori knowledge of exactly how long each iteration would take). If a thread gets stuck processing a particular long iteration, the other threads will compensate by processing work from the pool in the meantime. Of course, even with this scheme you can still find yourself with a far from optimal partitioning (which could occur if one thread happened to get stuck with several pieces of work significantly larger than the rest), but without knowledge of how much processing time a given piece of work will require, there’s little more that can be done.

Here’s an example implementation that takes load-balancing to this extreme. The pool of iteration values is maintained as a single integer representing the next iteration available, and the threads involved in the processing “remove items” by atomically incrementing this integer:

public static void MyParallelFor(
    int inclusiveLowerBound, int exclusiveUpperBound, Action<int> body)
{
    // Get the number of processors, initialize the number of remaining
    // threads, and set the starting point for the iteration.
    int numProcs = Environment.ProcessorCount;
    int remainingWorkItems = numProcs;
    int nextIteration = inclusiveLowerBound;
    using (ManualResetEvent mre = new ManualResetEvent(false))
    {
        // Create each of the work items.
        for (int p = 0; p < numProcs; p++)
        {
            ThreadPool.QueueUserWorkItem(delegate
            {
                int index;
                while ((index = Interlocked.Increment(
                           ref nextIteration) - 1) < exclusiveUpperBound)
                {
                    body(index);
                }
                if (Interlocked.Decrement(ref remainingWorkItems) == 0)
                    mre.Set();
            });
        }
        // Wait for all threads to complete.
        mre.WaitOne();
    }
}
你好瞎i
#5 · 2019-08-27 16:24

Do you know the maximum size of that char array? If so, you can search different chunks in parallel and return the smallest index at which a terminator was found. Since you are then only working on allocated memory, you cannot get segfaults.

Of course this is not as sophisticated as s_nair's answer, but it is pretty straightforward. Example:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

int main(int argc, char **argv)
{
    int N = 1000;
    char *str = calloc(N, sizeof(char));
    strcpy(str, "This is a test string!");
    fprintf(stdout, "%s\n", str);

    /* Size the result array for the number of threads that can
       actually run in the parallel region. */
    int nthreads = omp_get_max_threads();
    int i;
    int ind[nthreads];
    for (i = 0; i < nthreads; i++) {
        ind[i] = -1;
    }

    int procn;
    int flag;
#pragma omp parallel private(procn, flag)
    {
        flag = 1;
        procn = omp_get_thread_num();
        /* Each thread records the first terminator it sees within its
           own chunk of iterations. */
#pragma omp for
        for (i = 0; i < N; i++) {
            if (str[i] == '\0' && flag == 1) {
                ind[procn] = i;
                flag = 0;
            }
        }
    }
    /* The string length is the smallest index at which any thread
       found a terminator. */
    int len = N;
    for (i = 0; i < nthreads; i++) {
        if (ind[i] > -1 && ind[i] < len) {
            len = ind[i];
        }
    }
    fprintf(stdout, "strlen %d\n", len);
    free(str);
    return 0;
}
来,给爷笑一个
#6 · 2019-08-27 16:26

It's probably not even worth trying. If the string is short, the overhead will outweigh any gain in processing speed. If the string is really long, throughput will probably be limited by memory bandwidth, not by CPU processing speed.

等我变得足够好
#7 · 2019-08-27 16:30

I'd say with just a standard C string this cannot be done. However, if you can define your own termination sequence, with as many terminator characters as there are threads, it's straightforward.
