I'm trying to build an index of word trigrams using a dictionary-type structure. Keys are strings and values are the numbers of occurrences.
    for t in arrayOfTrigrams:
        if t in trigrams:
            trigrams[t] += 1
        else:
            trigrams[t] = 1
But the data is very big (more than 500 MB of raw text) and I don't know how to cope with the resulting MemoryError. And unlike the situation in Python memoryerror creating large dictionary, I am not creating anything irrelevant; each trigram is important.
On Further Edit -- code included

If you are able to hold `arrayOfTrigrams` in memory, see the original solution at the bottom. But if you haven't been able to create `arrayOfTrigrams` in the first place (and I am somewhat skeptical that you have gotten that far, given the data size), you could still shoot for creating a dictionary of repeated trigrams. Repeated bigrams always contain repeated words, and repeated trigrams contain repeated bigrams. So process your 500 MB of data in stages:

1. First create a set of repeated words.
2. Using this set, create a dictionary of repeated bigrams: do a raw frequency count of the bigrams which contain one of the repeated words, then discard any whose count is just 1.
3. Process the data a third time, creating the dictionary of repeated trigrams: do a raw frequency count of the trigrams that contain a repeated bigram (which should be a small subset of all possible trigrams), then discard those whose final count is just 1.

In this way you can build up the dictionary without ever needing to keep all trigrams in memory at once.

Proof of concept:
This code turns up 10016 trigrams which appear more than once. In contrast, when I count all distinct trigrams in the same text, I get 188285, so in this somewhat sizeable natural-language example only 10016/188285 = 5.3% of the possible trigrams are repeated trigrams. Assuming that a similar ratio holds for your data, I estimate that a frequency dictionary of repeated trigrams will take around 100 MB.
original solution:
Your code and your question suggest that you are able to hold `arrayOfTrigrams` in memory but can't create the dictionary. One potential workaround is to first sort this array and then create a frequency count of the repeated trigrams:
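The sort-and-count snippet was stripped from the answer; one way it might go (using `itertools.groupby` over the sorted array, with a toy `arrayOfTrigrams`):

```python
from itertools import groupby

arrayOfTrigrams = [("a", "b", "c"), ("d", "e", "f"),
                   ("a", "b", "c"), ("a", "b", "c")]

arrayOfTrigrams.sort()  # equal trigrams become adjacent
repeatedTrigrams = {}
for trigram, group in groupby(arrayOfTrigrams):
    count = sum(1 for _ in group)
    if count > 1:
        repeatedTrigrams[trigram] = count

print(repeatedTrigrams)  # {('a', 'b', 'c'): 3}
```

Sorting costs no extra memory beyond the list you already hold, and the dictionary only ever receives the repeated trigrams.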
After constructing `repeatedTrigrams`, you could use a set comprehension to collect the trigrams that occur only once:
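The comprehension itself was lost in the copy; presumably something along these lines (toy data for illustration):

```python
arrayOfTrigrams = [("a", "b", "c"), ("d", "e", "f"), ("a", "b", "c")]
repeatedTrigrams = {("a", "b", "c"): 2}  # from the previous step

# anything absent from the repeated-count dict occurred exactly once
uniques = {t for t in arrayOfTrigrams if t not in repeatedTrigrams}
print(uniques)  # {('d', 'e', 'f')}
```

Once `uniques` and `repeatedTrigrams` exist, `del arrayOfTrigrams` releases the big list.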
Then `t in uniques` would convey the information that the count of `t` is 1, which I would suspect to be true for the overwhelming majority of trigrams. At this stage you have all of the relevant frequency information and can discard the list of trigrams to free up some of the memory that you've consumed.

My first suggestion would be to not hold `arrayOfTrigrams` in memory completely, but to use streaming. You are reading it from somewhere, so you can control how you read it. Python's generators are very handy here. Let's pretend you're reading it from a file. Two things here:

1. Reading a string `S` from a file `N` times occupies `N*len(S)` bytes. By using a dictionary as a cache, we make sure there's only one unique copy of each word in the input. This does consume some memory, of course.
2. This function may look different for you, depending on where you read your trigrams from. Keep in mind that the tokenizer used here is extremely basic.
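The generator itself is missing from the copy; a minimal sketch consistent with the description (whitespace tokenizer, dictionary used to keep a single copy of each word):

```python
def read_trigrams(lines):
    """Yield word trigrams from an iterable of lines (e.g. an open file)
    without ever materializing the full list of trigrams."""
    seen = {}    # word -> canonical copy: each distinct word is stored once
    window = []
    for line in lines:
        for word in line.split():  # extremely basic tokenizer
            word = seen.setdefault(word, word)
            window.append(word)
            if len(window) == 3:
                yield tuple(window)
                window.pop(0)

# with a real corpus: trigrams = read_trigrams(open("corpus.txt"))
print(list(read_trigrams(["a b c", "d a b", "c"])))
```

The `seen.setdefault(word, word)` trick replaces every freshly read string with the first copy that was ever seen, so duplicated words cost a dictionary slot rather than a new string each time.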
This already saves a little bit of memory, though not too much. So let's store intermediate results on disk:
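The chunking code was also stripped; here is a hedged sketch of what it might look like (the value of `LIMIT`, the file naming, and the tab-separated line format are illustrative choices of mine):

```python
from collections import Counter

LIMIT = 5_000_000  # distinct trigrams held in memory at once; reduce on MemoryError

def write_chunks(trigrams, prefix="counts", limit=LIMIT):
    """Count trigrams until `limit` distinct keys accumulate, then flush
    the counts to disk sorted by trigram and start a fresh counter.
    Returns the list of chunk file names."""
    files = []
    counts = Counter()

    def flush():
        name = "%s.%d.txt" % (prefix, len(files))
        with open(name, "w", encoding="utf-8") as out:
            for tri in sorted(counts):  # sorted so chunks can be merged later
                out.write(" ".join(tri) + "\t%d\n" % counts[tri])
        files.append(name)
        counts.clear()

    for tri in trigrams:
        counts[tri] += 1
        if len(counts) >= limit:
            flush()
    if counts:
        flush()
    return files
```

Each flush writes its partial counts sorted by trigram, which is what makes the later merge step a cheap streaming operation.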
In this step, you can tweak `LIMIT` to not use too much memory, i.e. just reduce it until you no longer hit a `MemoryError`.

Now you have `N` files on disk, each containing a sorted list of trigrams with counts. In a separate program, you can read them back in and sum up all intermediate counts. If you have 4 GiB of memory, you may actually have to use the splitting functionality. With 8 GiB, you should be able to keep it all in RAM.
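The merging code is likewise missing. Because every chunk file is sorted, the intermediate counts can be combined in a single streaming pass, for example with `heapq.merge` (assuming each chunk file holds tab-separated lines of the form "w1 w2 w3", then the count, sorted by trigram; this format is my assumption):

```python
import heapq

def merge_counts(filenames):
    """Stream-merge sorted chunk files and yield (trigram, total_count)
    pairs, keeping only one pending line per file in memory."""
    def parse(name):
        with open(name, encoding="utf-8") as f:
            for line in f:
                tri, count = line.rsplit("\t", 1)
                yield tri, int(count)

    current, total = None, 0
    for tri, count in heapq.merge(*(parse(n) for n in filenames)):
        if tri == current:
            total += count
        else:
            if current is not None:
                yield current, total
            current, total = tri, count
    if current is not None:
        yield current, total
```

Since `heapq.merge` never loads a whole file, peak memory stays proportional to the number of chunk files, not to the number of trigrams.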