Combining Lists of Word Frequency Data

2019-04-20 17:03发布

问题:

This seems like it should be an obvious question, but the tutorials and documentation on lists are not forthcoming. Many of these issues stem from the sheer size of my text files (hundreds of MB) and my attempts to boil them down to something manageable by my system. As a result, I'm doing my work in segments and am now trying to combine the results.

I have multiple word frequency lists (~40 of them). The lists can either be taken through Import[ ] or as variables generated in Mathematica. Each list appears as the following and has been generated using the Tally[ ] and Sort[ ] commands:

{{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 16850}, {"in",
16164}, {"de", 14930}, {"a", 14660}, {"to", 14175}, {"la", 7347}, {"was", 6030}, {"l", 5981}, {"le", 5735}, <<51293>>, {"abattoir", 1}, {"abattement", 1}, {"abattagen", 1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}

Here is an example of the second file:

{{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 16262}, {"and",
14488}, {"to", 12726}, {"a", 12635}, {"in", 11141}, {"la", 10739}, {"et", 9016}, {"les", 8675}, {"le", 7748}, <<101032>>, {"abattement", 1}, {"abattagen", 1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}

I want to combine them so that the frequency data aggregates: i.e. if the second file has 30,419 occurrences of 'the' and is joined to the first file, it should return that there are 72,635 occurrences (and so on as I move through the entire collection).

回答1:

It sounds like you need GatherBy.

Suppose your two lists are named data1 and data2, then use

{#[[1, 1]], Total[#[[All, 2]]]} & /@ GatherBy[Join[data1, data2], First]

This easily generalizes to any number of lists, not just two.



回答2:

Try using a hash table, like this. First set things up:

ClearAll[freq];
freq[_] = 0;

Now eg freq["safas"] returns 0. Next, if the lists are defined as

lst1 = {{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 
    16850}, {"in", 16164}, {"de", 14930}, {"a", 14660}, {"to", 
    14175}, {"la", 7347}, {"was", 6030}, {"l", 5981}, {"le", 
    5735}, {"abattoir", 1}, {"abattement", 1}, {"abattagen", 
    1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 
    1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 
    1}, {"aaa", 1}};
lst2 = {{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 
    16262}, {"and", 14488}, {"to", 12726}, {"a", 12635}, {"in", 
    11141}, {"la", 10739}, {"et", 9016}, {"les", 8675}, {"le", 
    7748}, {"abattement", 1}, {"abattagen", 1}, {"abattage", 
    1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 
    1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}};

you may run this

Scan[(freq[#[[1]]] += #[[2]]) &, lst1]

after which eg

freq["the"]
(*
42216
*)

and then the next list

Scan[(freq[#[[1]]] += #[[2]]) &, lst2]

after which eg

freq["the"]
72635

while still

freq["safas"]
(*
0
*)


回答3:

Here is a direct Sow/Reap function:

Reap[#2~Sow~# & @@@ data1~Join~data2;, _, {#, Tr@#2} &][[2]]

Here is a concise form of acl's method:

Module[{c},
  c[_] = 0;

  c[#] += #2 & @@@ data1~Join~data2;

  {#[[1, 1]], #2} & @@@ Most@DownValues@c
]

This appears to be a bit faster than Szabolcs code on my system:

data1 ~Join~ data2 ~GatherBy~ First /.
  {{{x_, a_}, {x_, b_}} :> {x, a + b}, {x : {_, _}} :> x}


回答4:

There's an old saying, "if all you have is a hammer, everything becomes a nail." So, here's my hammer: SelectEquivalents.

This can be done a little quicker using SelectEquivalents:

SelectEquivalents[data1~Join~data2, #[[1]]&, #[[2]]&, {#1, Total[#2]}&]

In order, the first param is obviously just the joined lists, the second one is what they're grouped by (in this case the first element), the third param strips off the string leaving just the count, and the fourth param puts it back together with the string as #1 and the counts in a list as #2.



回答5:

Try ReplaceRepeated.

Join the lists. Then use

//. {{f1___, {a_, c1_}, f2___, {a_, c2_}, f3___} -> {f1, f2, f3, {a, c1 + c2}}}