This seems like it should be an obvious question, but the tutorials and documentation on lists are not forthcoming. Many of these issues stem from the sheer size of my text files (hundreds of MB) and my attempts to boil them down to something manageable by my system. As a result, I'm doing my work in segments and am now trying to combine the results.
I have multiple word frequency lists (~40 of them). The lists can either be read in with Import[ ] or kept as variables generated within Mathematica. Each list was produced with the Tally[ ] and Sort[ ] commands and looks like the following:
{{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n", 16850}, {"in",
16164}, {"de", 14930}, {"a", 14660}, {"to", 14175}, {"la", 7347},
{"was", 6030}, {"l", 5981}, {"le", 5735}, <<51293>>, {"abattoir",
1}, {"abattement", 1}, {"abattagen", 1}, {"abattage", 1},
{"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1},
{"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}
Here is an example of the second file:
{{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of", 16262}, {"and",
14488}, {"to", 12726}, {"a", 12635}, {"in", 11141}, {"la", 10739},
{"et", 9016}, {"les", 8675}, {"le", 7748}, <<101032>>,
{"abattement", 1}, {"abattagen", 1}, {"abattage", 1}, {"abated",
1}, {"abandonn", 1}, {"abaiss", 1}, {"aback", 1}, {"aase", 1},
{"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}}
I want to combine them so that the frequency counts aggregate: for example, if the second file has 30,419 occurrences of 'the' and is joined to the first file (42,216 occurrences), the combined result should report 72,635 occurrences, and so on as I move through the entire collection.
It sounds like you need GatherBy.
Suppose your two lists are named data1 and data2; then use
{#[[1, 1]], Total[#[[All, 2]]]} & /@ GatherBy[Join[data1, data2], First]
This easily generalizes to any number of lists, not just two.
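As a sketch of that generalization: if all ~40 lists are collected into a single list of lists (allLists is a hypothetical name here, not something from the question), the same idiom works in one pass:
allTallies = Join @@ allLists;  (* allLists: assumed variable holding all ~40 frequency lists *)
{#[[1, 1]], Total[#[[All, 2]]]} & /@ GatherBy[allTallies, First]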
Try using a hash table, like this. First set things up:
ClearAll[freq];
freq[_] = 0;
Now, e.g., freq["safas"] returns 0. Next, if the lists are defined as
lst1 = {{"the", 42216}, {"of", 24903}, {"and", 18624}, {"n",
16850}, {"in", 16164}, {"de", 14930}, {"a", 14660}, {"to",
14175}, {"la", 7347}, {"was", 6030}, {"l", 5981}, {"le",
5735}, {"abattoir", 1}, {"abattement", 1}, {"abattagen",
1}, {"abattage", 1}, {"abated", 1}, {"abandonn", 1}, {"abaiss",
1}, {"aback", 1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah",
1}, {"aaa", 1}};
lst2 = {{"the", 30419}, {"n", 20414}, {"de", 19956}, {"of",
16262}, {"and", 14488}, {"to", 12726}, {"a", 12635}, {"in",
11141}, {"la", 10739}, {"et", 9016}, {"les", 8675}, {"le",
7748}, {"abattement", 1}, {"abattagen", 1}, {"abattage",
1}, {"abated", 1}, {"abandonn", 1}, {"abaiss", 1}, {"aback",
1}, {"aase", 1}, {"aaijaut", 1}, {"aaaah", 1}, {"aaa", 1}};
you may run this
Scan[(freq[#[[1]]] += #[[2]]) &, lst1]
after which, e.g.,
freq["the"]
(*
42216
*)
and then do the same for the next list
Scan[(freq[#[[1]]] += #[[2]]) &, lst2]
after which, e.g.,
freq["the"]
(*
72635
*)
while still
freq["safas"]
(*
0
*)
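The same approach extends to the whole collection: accumulate the counts from every list, then convert the hash table back into a {word, count} list. A rough sketch, again assuming a hypothetical allLists holding all ~40 lists:
ClearAll[freq];
freq[_] = 0;
Scan[(freq[#[[1]]] += #[[2]]) &, Join @@ allLists];            (* accumulate counts from every list *)
combined = {#[[1, 1, 1]], #[[2]]} & /@ Most[DownValues[freq]]  (* Most drops the default freq[_] = 0 rule *)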
Here is a direct Sow/Reap function:
Reap[#2~Sow~# & @@@ data1~Join~data2;, _, {#, Tr@#2} &][[2]]
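For readability, here is a more explicit spelling of the same idea (my paraphrase, not the author's code): each count is sown under its word as the tag, and Reap's third argument totals the values collected for each tag.
Reap[
  Scan[Sow[#[[2]], #[[1]]] &, Join[data1, data2]], (* sow each count, tagged with its word *)
  _,                                               (* collect every tag *)
  {#1, Total[#2]} &                                (* return {word, total count} per tag *)
 ][[2]]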
Here is a concise form of acl's method:
Module[{c},
c[_] = 0;
c[#] += #2 & @@@ data1~Join~data2;
{#[[1, 1]], #2} & @@@ Most@DownValues@c
]
This appears to be a bit faster than Szabolcs's code on my system (the rules below cover groups of one or two pairs, which is enough when joining two Tally lists at a time, since each word appears at most once per list):
data1 ~Join~ data2 ~GatherBy~ First /.
{{{x_, a_}, {x_, b_}} :> {x, a + b}, {x : {_, _}} :> x}
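If more than two lists are joined at once, a group can contain more than two pairs and the two rules above no longer suffice. One way to keep the rule-based flavor while handling groups of any size (a sketch, not benchmarked; allLists is again a hypothetical name) is to replace at level 1 of the gathered list:
Replace[
 GatherBy[Join @@ allLists, First],
 group_ :> {group[[1, 1]], Total[group[[All, 2]]]},  (* one {word, total} per group *)
 {1}
]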
There's an old saying, "if all you have is a hammer, everything becomes a nail." So, here's my hammer: SelectEquivalents.
This can be done a little quicker using SelectEquivalents:
SelectEquivalents[data1~Join~data2, #[[1]]&, #[[2]]&, {#1, Total[#2]}&]
In order: the first parameter is just the joined lists; the second is what the elements are grouped by (here the first element, i.e. the word); the third strips off the string, leaving just the count; and the fourth puts it back together, with the string as #1 and the list of counts as #2.
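Note that SelectEquivalents is not a built-in function; it comes from the community "tool bag". The definition I have in mind is roughly the following (treat the exact form as an assumption, since versions of it differ):
(* a commonly circulated definition of SelectEquivalents; the author's version may differ in detail *)
SelectEquivalents[x_List, f_: Identity, g_: Identity, h_: (#2 &)] :=
 Reap[Sow[g[#], {f[#]}] & /@ x, _, h][[2]]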
Try ReplaceRepeated.
Join the lists, then use
//. {{f1___, {a_, c1_}, f2___, {a_, c2_}, f3___} -> {f1, f2, f3, {a, c1 + c2}}}
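Spelled out on the example data, applying the rule to the joined list might look like the sketch below; note that repeatedly rescanning lists of the size described in the question is likely to be much slower than the GatherBy or hash-table approaches.
merged = Join[lst1, lst2] //.
  {{f1___, {a_, c1_}, f2___, {a_, c2_}, f3___} -> {f1, f2, f3, {a, c1 + c2}}}  (* merge duplicate words until none remain *)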