parallel file parsing, multiple CPU cores

I asked a related but very general question earlier (see especially this response).

This question is very specific. This is all the code I care about:

result = {}
for line in open('input.txt'):
  key, value = parse(line)
  result[key] = value

The function parse is completely self-contained (i.e., doesn't use any shared resources).

I have Intel i7-920 CPU (4 cores, 8 threads; I think the threads are more relevant, but I'm not sure).

What can I do to make my program use all the parallel capabilities of this CPU?

I assume I can open this file for reading in 8 different threads without much performance penalty since disk access time is small relative to the total time.

标签： python python-3.x parallel-processing

6条回答

姐就是有狂的资本

2楼-- · 2020-02-07 20:06

This can be done using Ray, which is a library for writing parallel and distributed Python.

To run the code below, first create input.txt as follows.

printf "1\n2\n3\n4\n5\n6\n" > input.txt

Then you can process the file in parallel by adding the @ray.remote decorator to the parse function and executing many copies in parallel as follows

import ray
import time

ray.init()

@ray.remote
def parse(line):
    time.sleep(1)
    return 'key' + str(line), 'value'

# Submit all of the "parse" tasks in parallel and wait for the results.
keys_and_values = ray.get([parse.remote(line) for line in open('input.txt')])
# Create a dictionary out of the results.
result = dict(keys_and_values)

Note that the optimal way to do this will depend on how long it takes to run the parse function. If it takes one second (as above), then parsing one line per Ray task makes sense. If it takes 1 millisecond, then it probably makes sense to parse a bunch of lines (e.g., 100) per Ray task.

Your script is simple enough that the multiprocessing module can also be used, however as soon as you want to do anything more complicated or want to leverage multiple machines instead of just one machine, then it will be much easier with Ray.

See the Ray documentation.

0人赞添加讨论(0) 举报

Juvenile、少年°

3楼-- · 2020-02-07 20:07

As TokenMacGuy said, You can use multiprocessing module. If You really need to parse massive amount of data, You should check out the disco project.

Disco is a distributed computing framework based on the MapReduce paradigm. Disco is open-source; developed by Nokia Research Center to solve real problems in handling massive amounts of data.

It really scales up for jobs where Your parse() job is "pure" (i.e., doesn't use any shared resources) and is CPU intensive. I tested a job on a single core and then compared to running it on 3 hosts with 8 cores each. It actually ran 24 times faster when run on the Disco cluster (NOTE: tested for an unreasonably CPU intensive job).

0人赞添加讨论(0) 举报

戒情不戒烟

4楼-- · 2020-02-07 20:13

You can use the multiprocessing module, but if parse() is quick, you won't get much performance improvement by doing that.

0人赞添加讨论(0) 举报

爱情/是我丢掉的垃圾

5楼-- · 2020-02-07 20:21

make distributed architecture with rabbitMQ, one task producer read file line by line and send lines to workers via rabbitMQ

use console utility like unix/parallel, xargs

$ python makelist.py | parallel -j+2 'wget "{}" -O - | python parse.py'

or this style

$ ls *.wav | xargs -n1 --max-procs=4 -I {} lame {} -o {}.mp3

Anyway, you need to realize map/reduce paradigm

0人赞添加讨论(0) 举报

SAY GOODBYE

6楼-- · 2020-02-07 20:22

cPython does not provide the threading model you are looking for easily. You can get something similar using the multiprocessing module and a process pool

such a solution could look something like this:

def worker(lines):
    """Make a dict out of the parsed, supplied lines"""
    result = {}
    for line in lines.split('\n'):
        k, v = parse(line)
        result[k] = v
    return result

if __name__ == '__main__':
    # configurable options.  different values may work better.
    numthreads = 8
    numlines = 100

    lines = open('input.txt').readlines()

    # create the process pool
    pool = multiprocessing.Pool(processes=numthreads)

    # map the list of lines into a list of result dicts
    result_list = pool.map(worker, 
        (lines[line:line+numlines] for line in xrange(0,len(lines),numlines) ) )

    # reduce the result dicts into a single dict
    result = {}
    map(result.update, result_list)

0人赞添加讨论(0) 举报

欢心

7楼-- · 2020-02-07 20:23

split the file in 8 smaller files
launch a separate script to process each file
join the results

Why that's the best way...

That's simple and easy - you don't have to program in any way different from linear processing.
You have the best performance by launching a small number of long-running processes.
The OS will deal with context switching and IO multiplexing so you don't have to worry about this stuff (the OS does a good job).
You can scale to multiple machines, without changing the code at all
...

0人赞添加讨论(0) 举报

parallel file parsing, multiple CPU cores

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间