How to increase Python's speed over loops?

Posted 2019-07-10 04:02

I have a dataset of 370k records stored in a pandas DataFrame that needs to be processed. I tried multiprocessing, threading, Cython, and loop unrolling, but I was not successful, and the estimated time to compute was 22 hours. The task is as follows:

%matplotlib inline
from numba import jit  # note: autojit was removed from recent numba releases
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

with open('data/full_text.txt', encoding="ISO-8859-1") as f:
    strdata = f.readlines()

data = []
for string in strdata:
    data.append(string.split('\t'))

df = pd.DataFrame(data, columns=["uname", "date", "UT", "lat", "long", "msg"])
df = df.drop('UT', axis=1)
df[['lat', 'long']] = df[['lat', 'long']].apply(pd.to_numeric)

from textblob import TextBlob
from tqdm import tqdm

df['polarity'] = np.zeros(len(df))
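As an aside, the manual readlines/split loop above can usually be replaced by a single `pd.read_csv` call. A minimal sketch, using an in-memory stand-in for `data/full_text.txt` (in the real script you would pass the file path and `encoding='ISO-8859-1'` instead):

```python
import io
import pandas as pd

# Stand-in for data/full_text.txt: two tab-separated sample rows
sample = io.StringIO(
    "alice\t2019-07-10\t0\t12.5\t77.6\tgreat day\n"
    "bob\t2019-07-10\t0\t12.6\t77.7\tterrible traffic\n"
)

# One read_csv call replaces the readlines/split loop: sep='\t' matches
# string.split('\t'), usecols drops UT at parse time, and lat/long are
# parsed as floats automatically (no pd.to_numeric step needed).
df = pd.read_csv(
    sample,
    sep='\t',
    header=None,
    names=["uname", "date", "UT", "lat", "long", "msg"],
    usecols=["uname", "date", "lat", "long", "msg"],
)
print(df.shape)  # (2, 5)
```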

Threading:

from queue import Queue
from threading import Thread
import logging

logging.basicConfig(
    level=logging.DEBUG,
    format='(%(threadName)-10s) %(message)s',
)


class DownloadWorker(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            # Get the work from the queue and expand the tuple
            lowIndex, highIndex = self.queue.get()
            for i in range(lowIndex, highIndex):
                # .loc avoids pandas chained-assignment, which may silently
                # write to a copy instead of the original DataFrame
                df.loc[i, 'polarity'] = TextBlob(df.loc[i, 'msg']).sentiment.polarity
            self.queue.task_done()

from time import time

def main():
    ts = time()
    # Create a queue to communicate with the worker threads
    queue = Queue()
    # Create 8 worker threads
    for x in range(8):
        worker = DownloadWorker(queue)
        worker.daemon = True
        worker.start()
    # Put the tasks into the queue as (low, high) index tuples
    chunk = 62936
    for i in tqdm(range(0, len(df), chunk)):
        logging.debug('Queueing')
        queue.put((i, min(i + chunk, len(df))))
    queue.join()  # wait once for all chunks, not inside the loop
    print('Took {}'.format(time() - ts))

main()

Multiprocessing with loop unrolling:

import multiprocessing

def assign_polarity(df):
    # Unrolled by a factor of 5; stop before len(df) - 4 so the
    # i + 4 access cannot run past the end of the frame
    for i in tqdm(range(0, len(df) - 4, 5)):
        df.loc[i,     'polarity'] = TextBlob(df.loc[i,     'msg']).sentiment.polarity
        df.loc[i + 1, 'polarity'] = TextBlob(df.loc[i + 1, 'msg']).sentiment.polarity
        df.loc[i + 2, 'polarity'] = TextBlob(df.loc[i + 2, 'msg']).sentiment.polarity
        df.loc[i + 3, 'polarity'] = TextBlob(df.loc[i + 3, 'msg']).sentiment.polarity
        df.loc[i + 4, 'polarity'] = TextBlob(df.loc[i + 4, 'msg']).sentiment.polarity

pool = multiprocessing.Pool(processes=2)
# Caution: mapping over a DataFrame iterates its column NAMES, and each
# worker mutates its own copy of df, so the parent never sees the results
r = pool.map(assign_polarity, df)
pool.close()
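For reference, the usual `Pool` pattern has workers *return* values rather than mutate a shared DataFrame, and the parent assigns them once. A sketch under that assumption; `fake_polarity` is a hypothetical stand-in for `TextBlob(msg).sentiment.polarity` so the example runs without textblob installed:

```python
import multiprocessing

def fake_polarity(msg):
    # Hypothetical stand-in for TextBlob(msg).sentiment.polarity:
    # +1 per 'good', -1 per 'bad'
    return msg.count('good') - msg.count('bad')

def score_chunk(msgs):
    # Workers return results instead of writing into a shared DataFrame
    return [fake_polarity(m) for m in msgs]

def parallel_polarity(msgs, processes=2, chunk=1000):
    # Split the messages into contiguous chunks, one task per chunk
    chunks = [msgs[i:i + chunk] for i in range(0, len(msgs), chunk)]
    with multiprocessing.Pool(processes=processes) as pool:
        results = pool.map(score_chunk, chunks)
    # Flatten in the parent; assign ONCE, e.g. df['polarity'] = flat
    return [p for part in results for p in part]
```

With the real data this would be `df['polarity'] = parallel_polarity(df['msg'].tolist())`; on Windows the call must sit under an `if __name__ == '__main__':` guard, since worker processes re-import the main module.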

How can I increase the speed of this computation, or store the results in the DataFrame in a faster way? My laptop configuration:

  • Ram: 8GB
  • Physical cores: 2
  • Logical cores: 8
  • Windows 10

Implementing multiprocessing gave me a higher computation time. Threading executed sequentially (I think because of the GIL). Loop unrolling gave me the same computation speed. Cython gave me errors while importing libraries.

1 answer
Luminary・发光体
#2 · 2019-07-10 04:40

ASD -- I noticed that storing something in a DataFrame iteratively is VERY slow. I'd try storing your TextBlob polarities in a list (or another structure) and then converting that list into a column of the DataFrame.
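The answer's suggestion as a minimal sketch; `fake_polarity` is a hypothetical stand-in for `TextBlob(msg).sentiment.polarity` so the example runs without textblob installed:

```python
import pandas as pd

def fake_polarity(msg):
    # Hypothetical stand-in for TextBlob(msg).sentiment.polarity
    return msg.count('good') - msg.count('bad')

df = pd.DataFrame({'msg': ['good stuff', 'bad day', 'good good']})

# Accumulate into a plain Python list, then create the column in ONE
# assignment, instead of writing df.loc[i, 'polarity'] once per row.
polarities = [fake_polarity(m) for m in df['msg']]
df['polarity'] = polarities
print(df['polarity'].tolist())  # [1, -1, 2]
```

The per-row cost of `df.loc[i, col] = value` (index lookup plus possible dtype checks on every write) is what makes the iterative version slow; the list append is cheap and the final assignment is a single vectorized operation.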
