I'm trying out the example code from MALLET's topic modeling developers' guide, and I'd like to understand the meaning of its output.
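For reference, here is roughly what I'm running (a minimal sketch based on the developers' guide; the input file name and stoplist path are placeholders for my local setup):

```java
import cc.mallet.pipe.*;
import cc.mallet.pipe.iterator.CsvIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.InstanceList;

import java.io.*;
import java.util.ArrayList;
import java.util.regex.Pattern;

public class TopicModelExample {
    public static void main(String[] args) throws Exception {
        // Standard import pipeline from the guide: lowercase, tokenize,
        // remove stopwords, map tokens to feature indices.
        ArrayList<Pipe> pipeList = new ArrayList<Pipe>();
        pipeList.add(new CharSequenceLowercase());
        pipeList.add(new CharSequence2TokenSequence(
                Pattern.compile("\\p{L}[\\p{L}\\p{P}]+\\p{L}")));
        pipeList.add(new TokenSequenceRemoveStopwords(
                new File("stoplists/en.txt"), "UTF-8", false, false, false));
        pipeList.add(new TokenSequence2FeatureSequence());

        InstanceList instances = new InstanceList(new SerialPipes(pipeList));
        Reader reader = new InputStreamReader(
                new FileInputStream(new File("input.txt")), "UTF-8"); // placeholder input
        instances.addThruPipe(new CsvIterator(reader,
                Pattern.compile("^(\\S*)[\\s,]*(\\S*)[\\s,]*(.*)$"),
                3, 2, 1)); // regex groups: data, label, name

        // 10 topics, alphaSum = 1.0, beta = 0.01, 200 sampling iterations
        ParallelTopicModel model = new ParallelTopicModel(10, 1.0, 0.01);
        model.addInstances(instances);
        model.setNumThreads(2);
        model.setNumIterations(200);
        model.estimate();
    }
}
```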
First, while it runs, it prints the following:
Coded LDA: 10 topics, 4 topic bits, 1111 topic mask
max tokens: 148
total tokens: 1333
<10> LL/token: -9,24097
<20> LL/token: -9,1026
<30> LL/token: -8,95386
<40> LL/token: -8,75353
0 0,5 battle union confederate tennessee american states
1 0,5 hawes sunderland echo war paper commonwealth
2 0,5 test including cricket australian hill career
3 0,5 average equipartition theorem law energy system
4 0,5 kentucky army grant gen confederates buell
5 0,5 years yard national thylacine wilderness parks
6 0,5 gunnhild norway life extinct gilbert thespis
7 0,5 zinta role hindi actress film indian
8 0,5 rings south ring dust 2 uranus
9 0,5 tasmanian back time sullivan london century
<50> LL/token: -8,59033
<60> LL/token: -8,63711
<70> LL/token: -8,56168
<80> LL/token: -8,57189
<90> LL/token: -8,46669
0 0,5 battle union confederate tennessee united numerous
1 0,5 hawes sunderland echo paper commonwealth early
2 0,5 test cricket south australian hill england
3 0,5 average equipartition theorem law energy system
4 0,5 kentucky army grant gen war time
5 0,5 yard national thylacine years wilderness tasmanian
6 0,5 including gunnhild norway life time thespis
7 0,5 zinta role hindi actress film indian
8 0,5 rings ring dust 2 uranus survived
9 0,5 back london modern sullivan gilbert needham
<100> LL/token: -8,49005
<110> LL/token: -8,57995
<120> LL/token: -8,55601
<130> LL/token: -8,50673
<140> LL/token: -8,46388
0 0,5 battle union confederate tennessee war united
1 0,5 sunderland echo paper edward england world
2 0,5 test cricket south australian hill record
3 0,5 average equipartition theorem energy system kinetic
4 0,5 hawes kentucky army gen grant confederates
5 0,5 years yard national thylacine wilderness tasmanian
6 0,5 gunnhild norway including king life devil
7 0,5 zinta role hindi actress film indian
8 0,5 rings ring dust 2 uranus number
9 0,5 london sullivan gilbert thespis back mother
<150> LL/token: -8,51129
<160> LL/token: -8,50269
<170> LL/token: -8,44308
<180> LL/token: -8,47441
<190> LL/token: -8,62186
0 0,5 battle union confederate grant tennessee numerous
1 0,5 sunderland echo survived paper edward england
2 0,5 test cricket south australian hill park
3 0,5 average equipartition theorem energy system law
4 0,5 hawes kentucky army gen time confederates
5 0,5 yard national thylacine years wilderness tasmanian
6 0,5 gunnhild including norway life king time
7 0,5 zinta role hindi actress film indian
8 0,5 rings ring dust 2 uranus number
9 0,5 back london sullivan gilbert thespis 3
<200> LL/token: -8,54771
Total time: 6 seconds
Question 1: What does "Coded LDA: 10 topics, 4 topic bits, 1111 topic mask" in the first line mean? I only understand the "10 topics" part.
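My current guess, which I'd like confirmed: since topic IDs 0..9 fit in 4 bits (2^4 = 16 ≥ 10), MALLET seems to pack each topic ID into the low bits of an int and use the mask to extract it. The reported numbers match this arithmetic (my own reconstruction, not MALLET's actual source):

```java
int numTopics = 10;
// Smallest number of bits that can hold topic IDs 0..9: ceil(log2(10)) = 4
int topicBits = (int) Math.ceil(Math.log(numTopics) / Math.log(2)); // 4
// Mask with the low topicBits set: 2^4 - 1 = 15 = binary 1111
int topicMask = (1 << topicBits) - 1;
System.out.println(topicBits + " topic bits, "
        + Integer.toBinaryString(topicMask) + " topic mask");
// prints: 4 topic bits, 1111 topic mask
```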
Question 2: What does LL/token mean in lines like "<10> LL/token: -9,24097"? It looks like a metric for the Gibbs sampling process, but shouldn't it be monotonically increasing? Here it rises at first and then fluctuates (e.g. -8,46669 at iteration <90> but -8,62186 at <190>).
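To make the question concrete: I assumed LL/token is the model's log-likelihood of the training data divided by the total token count, so that it could be converted into an average per-token probability like this (my assumption, not something stated in the guide):

```java
double llPerToken = -8.75353; // value reported at iteration <40>
// If LL/token is the mean log-probability per token, this would be the
// average probability the model assigns to a single token:
double perTokenProb = Math.exp(llPerToken); // ~1.58e-4
```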
And after that, the following is printed:
elizabeth-9 needham-9 died-7 3-9 1731-6 mother-6 needham-9 english-7 procuress-6 brothel-4 keeper-9 18th-8.......
0 0.008 battle (8) union (7) confederate (6) grant (4) tennessee (4)
1 0.008 sunderland (6) years (6) echo (5) survived (3) paper (3)
2 0.040 test (6) cricket (5) hill (4) park (3) career (3)
3 0.008 average (6) equipartition (6) system (5) theorem (5) law (4)
4 0.073 hawes (7) kentucky (6) army (5) gen (4) war (4)
5 0.008 yard (6) national (6) thylacine (5) wilderness (4) tasmanian (4)
6 0.202 gunnhild (5) norway (4) life (4) including (3) king (3)
7 0.202 zinta (4) role (3) hindi (3) actress (3) film (3)
8 0.040 rings (10) ring (3) dust (3) 2 (3) uranus (3)
9 0.411 london (4) sullivan (3) gilbert (3) thespis (3) back (3)
0 0.55
The first line in this part is probably the per-token topic assignment, right? I.e. "elizabeth-9" means the token "elizabeth" is currently assigned to topic 9?
Question 3: For the first topic,
0 0.008 battle (8) union (7) confederate (6) grant (4) tennessee (4)
the value 0.008 is described as the "topic distribution". Is that the proportion of this topic in the whole corpus? If so, there seems to be a conflict: judging from the printed counts, topic 0's tokens appear 8+7+6+4+4+... times in the corpus, while topic 7's appear only 4+3+3+3+3+... times, so topic 7 should have a lower proportion than topic 0, yet it is reported as 0.202 versus topic 0's 0.008. This is what I can't understand. Furthermore, what is the "0 0.55" at the very end?
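To make my confusion concrete, here is the arithmetic I'm doing, using only the top-five counts that are printed (the full totals would include every word in each topic, which is exactly what I can't see):

```java
// Sums of the displayed top-5 token counts per topic
int topic0Counts = 8 + 7 + 6 + 4 + 4; // = 29
int topic7Counts = 4 + 3 + 3 + 3 + 3; // = 16
// Yet the reported proportions are 0.008 for topic 0 and 0.202 for
// topic 7, the opposite ordering of these partial counts.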
Thank you very much for reading this long post. I hope you can answer these questions, and that the answers will be helpful to others interested in MALLET.
Best