How to get a faster speed when using multi-threading in Python

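I am currently studying how to fetch data from a website as fast as possible. To get a faster speed, I'm considering using multi-threading. Here is the code I used to test the difference between a multi-threaded POST and a simple POST.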

import threading
import time
import urllib
import urllib2


class Post:

    def __init__(self, website, data, mode):
        self.website = website
        self.data = data

        #mode is either "Simple"(Simple POST) or "Multiple"(Multi-thread POST)
        self.mode = mode

    def post(self):

        #post data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)

        if self.mode == "Multiple":
            time.sleep(0.001)

        #read HTMLData
        HTMLData = open_url.read()



        print "OK"

if __name__ == "__main__":

    current_post = Post("http://forum.xda-developers.com/login.php", "vb_login_username=test&vb_login_password&securitytoken=guest&do=login", \
                        "Simple")

    #save the time before post data
    origin_time = time.time()

    if(current_post.mode == "Multiple"):

        #multithreading POST

        for i in range(0, 10):
           thread = threading.Thread(target = current_post.post)
           thread.start()
           thread.join()

        #calculate the time interval
        time_interval = time.time() - origin_time

        print time_interval

    if(current_post.mode == "Simple"):

        #simple POST

        for i in range(0, 10):
            current_post.post()

        #calculate the time interval
        time_interval = time.time() - origin_time

        print time_interval

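As you can see, this is very simple code. First I set the mode to "Simple" and got a time interval of 50 s (maybe my connection is a bit slow :( ). Then I set the mode to "Multiple" and got a time interval of 35 s. From this I can see that multi-threading does increase the speed, but the result is not as good as I imagined. I want to get a much faster speed.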

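From debugging, I found that the program mainly blocks at the line open_url = urllib2.urlopen(req, self.data). This line takes a lot of time to post to and receive data from the specified website. I guessed that I could get a faster speed by adding time.sleep() and using multi-threading inside the urlopen function, but I cannot do that because it is Python's own function.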

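Leaving aside any limits the server itself places on the posting speed, what else can I do to get a faster speed? Or is there any other code I can modify? Thanks a lot!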

Answer 1:

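In many cases, Python's threading doesn't improve execution speed very well... sometimes it makes it worse. For more information, see David Beazley's PyCon2010 presentation on the Global Interpreter Lock / PyCon2010 GIL slides. The presentation is very informative, and I highly recommend it to anyone considering threading...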

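Even though David Beazley's talk explains that network traffic improves the scheduling of Python's threading module, you should use the multiprocessing module. I included this as an option in your code (see the bottom of my answer).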

Running this on one of my older machines (Python 2.6.6):

current_post.mode == "Process"  (multiprocessing)  --> 0.2609 seconds
current_post.mode == "Multiple" (threading)        --> 0.3947 seconds
current_post.mode == "Simple"   (serial execution) --> 1.650 seconds

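I agree with TokenMacGuy's comment, and the numbers above include moving the .join() to a separate loop. As you can see, Python's multiprocessing is significantly faster than threading.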

from multiprocessing import Process
import threading
import time
import urllib
import urllib2


class Post:

    def __init__(self, website, data, mode):
        self.website = website
        self.data = data

        #mode is either:
        #   "Simple"      (Simple POST)
        #   "Multiple"    (Multi-thread POST)
        #   "Process"     (Multiprocessing)
        self.mode = mode
        self.run_job()

    def post(self):

        #post data
        req = urllib2.Request(self.website)
        open_url = urllib2.urlopen(req, self.data)

        if self.mode == "Multiple":
            time.sleep(0.001)

        #read HTMLData
        HTMLData = open_url.read()

        #print "OK"

    def run_job(self):
        """This was refactored from the OP's code"""
        origin_time = time.time()
        if(self.mode == "Multiple"):

            #multithreading POST
            threads = list()
            for i in range(0, 10):
               thread = threading.Thread(target = self.post)
               thread.start()
               threads.append(thread)
            for thread in threads:
               thread.join()
            #calculate the time interval
            time_interval = time.time() - origin_time
            # use self.mode rather than relying on a module-level global
            print "mode - {0}: {1}".format(self.mode, time_interval)

        if(self.mode == "Process"):

            #multiprocessing POST
            processes = list()
            for i in range(0, 10):
               process = Process(target=self.post)
               process.start()
               processes.append(process)
            for process in processes:
               process.join()
            #calculate the time interval
            time_interval = time.time() - origin_time
            # use self.mode rather than relying on a module-level global
            print "mode - {0}: {1}".format(self.mode, time_interval)

        if(self.mode == "Simple"):

            #simple POST
            for i in range(0, 10):
                self.post()
            #calculate the time interval
            time_interval = time.time() - origin_time
            # use self.mode rather than relying on a module-level global
            print "mode - {0}: {1}".format(self.mode, time_interval)
        return time_interval

if __name__ == "__main__":

    for method in ["Process", "Multiple", "Simple"]:
        Post("http://forum.xda-developers.com/login.php", 
            "vb_login_username=test&vb_login_password&securitytoken=guest&do=login",
            method
            )

Answer 2:

The biggest thing you are doing wrong, and the one that hurts your throughput the most, is the way you are calling thread.start() and thread.join():

for i in range(0, 10):
   thread = threading.Thread(target = current_post.post)
   thread.start()
   thread.join()

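Each time through the loop, you create a thread, start it, and then wait for it to finish before moving on to the next thread. You aren't doing anything concurrently at all!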

What you should probably be doing instead is:

threads = []

# start all of the threads
for i in range(0, 10):
   thread = threading.Thread(target = current_post.post)
   thread.start()
   threads.append(thread)

# now wait for them all to finish
for thread in threads:
   thread.join()
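
To see the effect of the two join patterns without hitting a real server, here is a minimal sketch. It is written for Python 3 (the code above is Python 2), and fake_post, run_join_inside_loop and run_join_after_loop are hypothetical names made up for this illustration; time.sleep() stands in for the blocking urlopen call.

```python
import threading
import time


def fake_post(delay=0.1):
    """Hypothetical stand-in for a blocking network call: it only sleeps."""
    time.sleep(delay)


def run_join_inside_loop(n=5):
    """Start and join each thread in the same loop (the pattern from the question)."""
    start = time.perf_counter()
    for _ in range(n):
        thread = threading.Thread(target=fake_post)
        thread.start()
        thread.join()  # waits for this thread to end, so the waits never overlap
    return time.perf_counter() - start


def run_join_after_loop(n=5):
    """Start every thread first, then join them all in a second loop."""
    start = time.perf_counter()
    threads = []
    for _ in range(n):
        thread = threading.Thread(target=fake_post)
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("join inside loop: %.2f s" % run_join_inside_loop())
    print("join after loop:  %.2f s" % run_join_after_loop())
```

With the join inside the loop, the total time should be roughly n times the delay. With the join in a second loop, the waits overlap and the total should be close to a single delay.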

Answer 3:

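Keep in mind that the only case where multi-threading can "increase speed" in Python is when you have operations like this one that are heavily I/O bound. Otherwise multi-threading does not increase "speed", since it cannot run on more than one CPU (no, not even if you have multiple cores; Python doesn't work that way). You should use multi-threading when you want two things to be done at the same time, not when you want two things to run in parallel (i.e. two processes running separately).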

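Now, what you're actually doing will not increase the speed of any single DNS lookup, but it will allow multiple requests to be fired off while waiting for the results of others. Be careful about how many you fire off, though, or you will just make the response times even worse than they already are.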
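
One way to keep the number of simultaneous requests under control is a fixed-size thread pool. The following is only a sketch for Python 3, under the assumption that a cap of 4 workers is acceptable for the target server; InFlightCounter, run_capped and fake_request are hypothetical names, and time.sleep() again stands in for the real request.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class InFlightCounter:
    """Hypothetical helper that records how many requests ran at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.current = 0
        self.peak = 0

    def __enter__(self):
        with self._lock:
            self.current += 1
            self.peak = max(self.peak, self.current)
        return self

    def __exit__(self, *exc_info):
        with self._lock:
            self.current -= 1
        return False


def run_capped(n_requests=20, max_workers=4, delay=0.02):
    """Run n_requests simulated requests with at most max_workers in flight."""
    counter = InFlightCounter()

    def fake_request(i):
        # Stand-in for a blocking POST; a real version would call the server here.
        with counter:
            time.sleep(delay)
        return i

    # The pool never runs more than max_workers tasks at once;
    # map() returns the results in the order of the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fake_request, range(n_requests)))
    return results, counter.peak


if __name__ == "__main__":
    results, peak = run_capped()
    print("completed %d requests, peak concurrency %d" % (len(results), peak))
```

The right value for max_workers depends on the server and the network, so it is worth measuring a few settings rather than assuming that more threads is always faster.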

Also, please stop using urllib2 and use Requests instead: http://docs.python-requests.org