I want to run my program on all the cores I have. Below is the code I used (it is only part of my full program; I managed to reduce it to the working flow).
def ssmake(data):
    sslist=[]
    for cols in data.columns:
        sslist.append(cols)
    return sslist

def scorecal(slisted):
    subspaceScoresList=[]
    if __name__ == '__main__':
        pool = mp.Pool(4)
        feature,FinalsubSpaceScore = pool.map(performDBScan, ssList)
        subspaceScoresList.append([feature, FinalsubSpaceScore])
    #for feature in ssList:
        #FinalsubSpaceScore = performDBScan(feature)
        #subspaceScoresList.append([feature,FinalsubSpaceScore])
    return subspaceScoresList

def performDBScan(subspace):
    minpoi=2
    Epsj=2
    final_data = df[subspace]
    db = DBSCAN(eps=Epsj, min_samples=minpoi, metric='euclidean').fit(final_data)
    labels = db.labels_
    FScore = calculateSScore(labels)
    return subspace, FScore

def calculateSScore(cluresult):
    score = random.randint(1,21)*5
    return score

def StartingFunction(prvscore,curscore,fe_select,df):
    while prvscore<=curscore:
        featurelist=ssmake(df)
        scorelist=scorecal(featurelist)

a = {'a' : [1,2,3,1,2,3], 'b' : [5,6,7,4,6,5], 'c' : ['dog', 'cat', 'tree','slow','fast','hurry']}
df2 = pd.DataFrame(a)
previous=0
current=0
dim=[]
StartingFunction(previous,current,dim,df2)
I had a for loop in the scorecal(slisted) method, now commented out, which takes each column, performs DBSCAN on it, and calculates a score for that column based on the result (in this example I use a random score instead). This loop makes my code run for a long time. So I tried to parallelize the work, running DBSCAN for each column of the DataFrame on the cores of my system, and wrote the code as shown above, but it does not give the result I need. I am new to the multiprocessing library and am not sure about the placement of '__main__' in my program. I would also like to know whether there is any other way to run code in parallel in Python. Any help is appreciated.
Your code has everything needed to run on more than one core of a multi-core processor, but it is a mess. I don't know what problem you are trying to solve with the code. I also cannot run it, since I don't know what DBSCAN is. To fix your code, you should take several steps.
Function scorecal():
result is a list containing all the results returned by performDBScan(). You don't have to populate the list manually.
Main body of the program:
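The two steps above can be sketched roughly as follows. This is only a minimal illustration, not the asker's real program: since DBSCAN isn't available here, perform_dbscan() computes a made-up score, and DATA is a hypothetical plain-dict stand-in for the DataFrame. The parts that matter are that pool.map() already returns the full list of results in input order, and that the `if __name__ == "__main__":` guard sits at the top level of the script rather than inside scorecal().

```python
import multiprocessing as mp

# Hypothetical stand-in for the DataFrame: column name -> values.
# Defined at module level so worker processes can see it on every platform.
DATA = {
    "a": [1, 2, 3, 1, 2, 3],
    "b": [5, 6, 7, 4, 6, 5],
}


def perform_dbscan(subspace):
    """Placeholder worker: the real DBSCAN fit and scoring would go here."""
    values = DATA[subspace]
    score = sum(v * v for v in values)  # dummy score, not a clustering metric
    return subspace, score


def scorecal(features, pool):
    """Score every feature in parallel; pool.map() already returns a list."""
    result = pool.map(perform_dbscan, features)
    # Convert each (feature, score) tuple to the [feature, score] shape
    # used in the question.
    return [list(pair) for pair in result]


if __name__ == "__main__":
    # The guard belongs here, in the main body, so that child processes
    # importing this module do not start pools of their own.
    with mp.Pool(4) as pool:
        scores = scorecal(list(DATA), pool)
    print(scores)
```

Passing the pool into scorecal() is a design choice in this sketch: the pool is created once in the main body and can be reused if scorecal() is called repeatedly, for example from a loop like the one in StartingFunction().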
I created a very simplified version of your code (a pool of 4 processes handling 8 columns of my data) with dummy for loops (to make the work CPU-bound) and tried it. I got 100% CPU load (I have a 4-core i5 processor), which naturally resulted in roughly 4x faster computation (20 seconds vs. 74 seconds) compared with a single-process implementation using a for loop.
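A comparison of this kind can be sketched as below. This is not the exact code behind the timings above; busy_work(), the task size, and the helper names are all assumptions chosen to give a small CPU-bound job, and the speedup you see will depend on your machine and on how heavy each task is.

```python
import multiprocessing as mp
import time


def busy_work(n):
    """Dummy CPU-bound task: sum of squares of the integers below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_serial(tasks):
    """Single-process baseline: a plain for loop over the tasks."""
    return [busy_work(n) for n in tasks]


def run_parallel(tasks, processes=4):
    """Same work distributed over a pool of worker processes."""
    with mp.Pool(processes) as pool:
        return pool.map(busy_work, tasks)


if __name__ == "__main__":
    tasks = [1_000_000] * 8  # 8 "columns" of equal, made-up size

    start = time.perf_counter()
    serial = run_serial(tasks)
    serial_time = time.perf_counter() - start

    start = time.perf_counter()
    parallel = run_parallel(tasks)
    parallel_time = time.perf_counter() - start

    # Both approaches must produce identical results, in the same order.
    assert serial == parallel
    print(f"serial: {serial_time:.2f}s, pool of 4: {parallel_time:.2f}s")
```

With very small tasks, the cost of starting processes and pickling arguments can outweigh the gain, so the pool only pays off when each call does a meaningful amount of computation.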
EDIT.
The complete code I used to try multiprocessing (I use Anaconda (Spyder) / Python 3.6.5 / Win10):