How to optimize this Pandas code to run faster

2019-09-22 01:17发布

问题:

I have this code to create a swarmplot from data from a DataFrame:

df = pd.DataFrame({"Refined__Some_ID":some_id_list,
                   "Refined_Age":age_list,
                   "Name":name_list                   
                          }
                         )
#Creating dataframe with strings from the lists
select  = df.apply(lambda row : any([isinstance(e, str) for e in row  ]),axis=1) 
#Exlcluding data from select in a new dataframe
dfAnalysis = df[~select]
dfAnalysis['Refined_Age'].replace('', np.nan, inplace=True)
dfAnalysis = dfAnalysis.dropna()
dfAnalysis['Refined_Age'] = dfAnalysis['Refined_Age'].apply(int)
# print dfAnalysis
print type(dfAnalysis['Refined_Patient_Age'][1])
g = sns.swarmplot(x = dfAnalysis['Refined_ID'],y = dfAnalysis['Refined_Age'], hue = dfAnalysis['Name'], orient="v")
g.set_xticklabels(g.get_xticklabels(),rotation=30)
# print g

It's taking a crazy amount of time to run (14 hours and counting!). How can I speed it up? Also, why is the code so slow in the first place?

The 3 lists being included in the dataframe are from a Couchdb database with about 320k documents.

UPDATE 1

I had intended to view the first 20 categories only but excluded the code to do so.

The line should have been:

x = dfAnalysis['Refined_ID'].iloc[:20]

回答1:

Do you really mean a swarmplot with several hundred thousand points? Besides it's gonna take forever, it's nonsense. Try with the first 1000 and see what kind of mess you get. Then use a boxplot or a violinplot instead. Try to understand your tools before using them.

From the docstring:

[...] it does not scale as well to large numbers of observations (both in terms of the ability to show all the points and in terms of the computation needed to arrange them).