I have the following dataframe holding a topic-document probability matrix, with the first row being the names of the text files.
   1                      2                      ...  80                     81
0  778.txt                856.txt                ...  831.txt                850.txt
1  0.002735042735042732   0.0054700854700846634  ...  0.01641025640567632    4.2490294446698094e-09
2  2.146512500161246e-28  8.006312700113502e-16  ...  4.580074538571013e-12  0.02017093592191074
where column 0, with values (0.0, 1.0), holds the index for topics 1 and 2 respectively. After sorting each column in descending order with
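import numpy as np
import pandas as pd

def rank_topics_by_probability(df):
    # Sort the probabilities within each column in descending order
    df = df.astype(float)
    df2 = pd.DataFrame(-np.sort(-df, axis=0), columns=df.columns, index=df.index)
    return df2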
I got the following output:
0 1 2 3 4 ... 77 78 79 80 81
1 1.0 2.735043e-03 0.004329 6.837607e-04 0.010396 ... 0.005399 1.367521e-02 1.641026e-02 1.641023e-02 2.017094e-02
2 0.0 9.941665e-23 0.001141 1.915713e-20 0.000202 ... 0.000071 6.475626e-10 1.816478e-12 2.494897e-08 1.366020e-10
I want to display a topic-document rank matrix that lists, for each document, the topics ordered by probability, such as
id    topic-rank
778   1, 0
856   1, 0
835   0, 1
786   0, 1
...
831   0, 1
850   1, 0
For the document with id 1 I assigned 1, 0 because the probability of topic 2 is greater than that of topic 1, and so on. How can I do that? Sample data for the edited question (these are only the head() values of the dataframe):
id text
0 15623 Y:\n1. Ran preliminary experiments to set para...
1 15625 Scrum Minutes- Hersheys\nPresent: Eyob, Masres...
2 15627 Present: Eyob, Masresha, Zelalem\nhersheys:\n...
3 15628 **********************************************...
4 15629 Scrum Minutes- Hersheys\nPresent: Eyob, Masres...
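For reference, one possible way to get the topic-rank table described above is to use np.argsort instead of np.sort, since sorting the values alone discards which topic each probability belonged to. This is only a minimal sketch: the `probs` frame is a hypothetical stand-in built from the four columns visible in the question (file names as column labels, 0-based topic numbers as the index), and `rank_topics` is an illustrative helper name, not part of pandas.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the matrix in the question:
# rows are topics (0-based labels), columns are the text files.
probs = pd.DataFrame(
    {
        "778.txt": [0.002735042735042732, 2.146512500161246e-28],
        "856.txt": [0.0054700854700846634, 8.006312700113502e-16],
        "831.txt": [0.01641025640567632, 4.580074538571013e-12],
        "850.txt": [4.2490294446698094e-09, 0.02017093592191074],
    },
    index=[0, 1],
)


def rank_topics(probs):
    """Return one row per document with its topics ordered by descending probability."""
    values = probs.to_numpy(dtype=float)
    # argsort on the negated values gives row positions in descending order, per column
    order = np.argsort(-values, axis=0, kind="stable")
    # Map row positions back to topic labels; result has shape (n_topics, n_docs)
    topics = probs.index.to_numpy()[order]
    # Strip a trailing ".txt" so that the id column shows only the document number
    ids = [c[:-4] if str(c).endswith(".txt") else str(c) for c in probs.columns]
    ranking = [", ".join(str(t) for t in col) for col in topics.T]
    return pd.DataFrame({"id": ids, "topic-rank": ranking})


ranks = rank_topics(probs)
print(ranks.to_string(index=False))
```

The labels in the topic-rank column are whatever the index of `probs` contains, so switching to 1-based topic numbers only requires changing that index. With the values shown here, a document whose first topic has the higher probability comes out as "0, 1".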