How can I efficiently implement multithreading/multiprocessing?

Posted 2019-07-16 09:14

Question:

Let's say I have a web bot written in Python that sends data via POST request to a web site. The data is pulled from a text file line by line and stored in an array. Currently, I'm testing each element in the array with a simple for-loop. How can I effectively implement multithreading to iterate through the data more quickly? Let's say the text file is fairly large. Would attaching a thread to each request be smart? What do you think the best approach to this would be?

with open(r"c:\file.txt") as file:  # raw string: "\f" would otherwise be a form feed
    dataArr = file.read().splitlines()

def test(data):
    # This next part is pseudo code
    result = testData('www.example.com', data)
    if result == 'whatever':
        print('success')

for data in dataArr:
    test(data)

I was thinking of something along the lines of this, but I feel it would cause issues depending on the size of the text file. I know there is software that lets the end user specify the number of threads when working with large amounts of data. I'm not entirely sure how that works, but that's something I'd like to implement.

import threading

with open(r"c:\file.txt") as file:
    dataArr = file.read().splitlines()

def test(data):
    # This next part is pseudo code
    result = testData('www.example.com', data)
    if result == 'whatever':
        print('success')

jobs = []

for data in dataArr:
    # args must be a tuple; note the trailing comma
    thread = threading.Thread(target=test, args=(data,))
    jobs.append(thread)

for j in jobs:
    j.start()
for j in jobs:
    j.join()

Answer 1:

This sounds like a recipe for multiprocessing.Pool

See here: https://docs.python.org/2/library/multiprocessing.html#introduction

from multiprocessing import Pool

def test(num):
    # True for even numbers, False for odd
    return num % 2 == 0

if __name__ == "__main__":
    list_of_datas_to_test = [0, 1, 2, 3, 4, 5, 6, 7, 8]

    p = Pool(4)  # create 4 processes to do our work
    print(p.map(test, list_of_datas_to_test))  # distribute our work

Output looks like:

[True, False, True, False, True, False, True, False, True]


Answer 2:

Threads in Python cannot run Python bytecode in parallel because of the Global Interpreter Lock (GIL), so for CPU-bound work you should consider multiple processes with the Python multiprocessing module instead of threads. Using multiple processes can increase the "ramp up" time of your code, as spawning a real process takes more time than a light thread, but for CPU-bound tasks the GIL means threading won't do what you're after. (Threads can still help with I/O-bound work such as network requests, since the GIL is released while waiting on I/O.)

Here and here are a couple of basic resources on using the multiprocessing module. Here's an example from the second link:

import multiprocessing as mp
import random
import string

# Define an output queue
output = mp.Queue()

# define a example function
def rand_string(length, output):
    """ Generates a random string of numbers, lower- and uppercase chars. """
    rand_str = ''.join(random.choice(
                    string.ascii_lowercase
                    + string.ascii_uppercase
                    + string.digits)
               for i in range(length))
    output.put(rand_str)

# Setup a list of processes that we want to run
processes = [mp.Process(target=rand_string, args=(5, output)) for x in range(4)]

# Run processes
for p in processes:
    p.start()

# Exit the completed processes
for p in processes:
    p.join()

# Get process results from the output queue
results = [output.get() for p in processes]

print(results)
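Adapting the process-pool idea back to the question's file-of-lines setup could look like the sketch below. The matching logic is a placeholder for the question's `testData('www.example.com', data)`, and `chunksize=16` is an assumed tuning value, not a requirement.

```python
from multiprocessing import Pool

def check(data):
    # stand-in for the question's testData('www.example.com', data)
    return 'success' if data == 'whatever' else 'failure'

def run(lines, workers=4):
    """Map check() over lines using a pool of worker processes."""
    with Pool(workers) as pool:
        # chunksize batches several lines per task to cut inter-process overhead
        return pool.map(check, lines, chunksize=16)

if __name__ == "__main__":
    # in the real script, lines would come from the text file's splitlines()
    print(run(['whatever', 'nope', 'whatever']))
```

The `if __name__ == "__main__":` guard matters here: on platforms that spawn rather than fork, child processes re-import the main module, and unguarded pool creation would recurse.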