
How to get rid of MemoryError while dealing with a large dictionary of trigrams

Published 2019-07-17 19:10

Question:

I'm trying to build an index of word trigrams using a dictionary-type structure. Keys are strings and values are numbers of occurrences.

for t in arrayOfTrigrams:
    if t in trigrams:
        trigrams[t] += 1
    else:
        trigrams[t] = 1

But the data is very big - more than 500 MB of raw text - and I don't know how to cope with the MemoryError. And unlike Python memoryerror creating large dictionary, I don't create any irrelevant stuff; each trigram is important.

Answer 1:

On further edit (code included): if you are able to hold arrayOfTrigrams in memory, see the original solution at the bottom. But if you haven't been able to create arrayOfTrigrams at all (and I am somewhat skeptical that you have gotten that far, given the data size), you can still aim for a dictionary of repeated trigrams. Repeated bigrams always contain repeated words, and repeated trigrams always contain repeated bigrams, so you can process your 500 MB of data in stages:

  1. Create a set of repeated words.
  2. Using that set, build a dictionary of repeated bigrams: do a raw frequency count of bigrams that contain at least one repeated word, then discard any whose count is just 1.
  3. Process the data a third time to build the dictionary of repeated trigrams: do a raw frequency count of trigrams that contain a repeated bigram (which should be a small subset of all possible trigrams), then discard those whose final count is just 1.

In this way you can build up the dictionary without ever needing to keep all trigrams in memory at once.

Proof of concept:

from collections import defaultdict

chars = set('ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789')

def cleanWord(s):
    return ''.join(c for c in s if c in chars)

f = open('moby dick.txt') #downloaded from Project Gutenberg: http://www.gutenberg.org/ebooks/2701 -- Thanks!
words = f.read().split()
f.close()

words = [cleanWord(w.upper()) for w in words]
# keep multi-letter words plus the meaningful single-letter words A, I, O, Y
words = [w for w in words if len(w) > 1 or w in set('AIOY')]

# first pass: count how often each word occurs
repeatedWords = defaultdict(int)
for w in words:
    repeatedWords[w] += 1

repeatedWords = set(w for w in repeatedWords if repeatedWords[w] > 1)

# second pass: count only bigrams that contain at least one repeated word
repeatedBigrams = defaultdict(int)
for i in range(len(words) - 1):
    x,y = words[i:i+2]
    if x in repeatedWords or y in repeatedWords:
        repeatedBigrams[x + ' ' + y] +=1

repeatedBigrams = set(b for b in repeatedBigrams if repeatedBigrams[b] > 1)

# third pass: count only trigrams whose two constituent bigrams are both repeated
repeatedTrigrams = defaultdict(int)

for i in range(len(words) - 2):
    x,y,z = words[i:i+3]
    if x + ' ' + y in repeatedBigrams and y + ' ' + z in repeatedBigrams:
        repeatedTrigrams[x + ' ' + y + ' ' + z] +=1

repeatedTrigrams = {t:c for t,c in repeatedTrigrams.items() if c > 1}

This code turns up 10016 trigrams which appear more than once. In contrast, when I evaluate

len(set(' '.join(words[i:i+3]) for i in range(len(words)-2)))

I get 188285, so in this somewhat sizeable natural-language example, only 10016/188285 = 5.3% of the distinct trigrams that occur are repeated trigrams. Assuming a similar ratio holds for your data, I estimate that the size of a frequency dictionary for repeated trigrams will be around 100 MB.
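As a rough sanity check of such an estimate (not part of the original answer), you could measure the proof-of-concept dictionary directly. sys.getsizeof only accounts for the dict and the objects it references at the top level, so treat the result as a ballpark figure:

import sys

def approx_dict_bytes(d):
    # very rough estimate: the dict itself plus its keys and values
    total = sys.getsizeof(d)
    for k, v in d.items():
        total += sys.getsizeof(k) + sys.getsizeof(v)
    return total

print(approx_dict_bytes(repeatedTrigrams) / 1e6, 'MB')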


original solution:


Your code and your question suggest that you are able to hold arrayOfTrigrams in memory but can't create the dictionary. One potential workaround is to first sort this array and then create a frequency count of the repeated trigrams:

arrayOfTrigrams.sort()
repeatedTrigrams = {}

# after sorting, equal trigrams are adjacent, so a trigram is repeated exactly
# when it matches its predecessor; the first match starts the count at 2
for i,t in enumerate(arrayOfTrigrams):
    if i > 0 and arrayOfTrigrams[i-1] == t:
        if t in repeatedTrigrams:
            repeatedTrigrams[t] += 1
        else:
            repeatedTrigrams[t] = 2

After constructing repeatedTrigrams you could use a set comprehension:

uniques = {t for t in arrayOfTrigrams if t not in repeatedTrigrams}

Then t in uniques would convey the information that the count of t is 1, which I would suspect to be true for the overwhelming majority of trigrams. At this stage you have all of the relevant frequency information and can discard the list of trigrams to free up some of the memory that you've consumed:

arrayOfTrigrams = None 
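
If it helps, the two surviving structures can be wrapped in a small lookup helper. This is just a sketch using the repeatedTrigrams and uniques names from above:

def trigram_count(t):
    # repeated trigrams carry their exact count, unique ones occurred exactly
    # once, and anything else never appeared in the data
    if t in repeatedTrigrams:
        return repeatedTrigrams[t]
    return 1 if t in uniques else 0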


Answer 2:

My first suggestion would be to not hold arrayOfTrigrams in memory completely, but use streaming. You are reading it from somewhere, so you can control how you read it. Python's generators are very handy here. Let's pretend you're reading it from a file:

def read_trigrams(fobj):
    unique = {}
    def make_unique(w):
        # strip surrounding punctuation and "intern" the word so that equal
        # words share a single string object
        w = w.strip("\"'`!?,.():-;{}").lower()
        return unique.setdefault(w, w)
    fobj.seek(0, 2)
    total_size = fobj.tell()  # total file size, handy for progress reporting
    fobj.seek(0, 0)

    read = 0  # bytes consumed so far
    prev_words = []
    for line in fobj:
        read += len(line)
        # carry over the last two words of the previous line so that
        # trigrams can span line boundaries
        words = prev_words
        words.extend(filter(None, (make_unique(w) for w in line.split())))
        if len(words) >= 3:
            for i in range(len(words) - 2):
                yield tuple(words[i:i+3])
        prev_words = words[-2:]

Two things here:

  1. We're using a generator, so instead of reading the whole file and returning a list of trigrams, we yield one trigram at a time. This is a little slower, but conserves memory.
  2. We make sure that eventually there is at most one copy of each string we read, by keeping a dict that maps each string to itself. While it may seem weird at first, reading the same byte sequence S from a file N times occupies N*len(S) bytes; with the dictionary, exactly one copy of each distinct word from the input stays in memory. The dictionary itself does consume some memory, of course (a small demonstration of the idea follows this list).
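
As a quick illustration of that interning trick (not from the original answer): the setdefault call guarantees that two equal strings end up referring to one object, which is roughly what the standard library's sys.intern also does for plain strings.

unique = {}
def make_unique(w):
    return unique.setdefault(w, w)

a = make_unique("whale")
b = make_unique("".join(["wh", "ale"]))  # equal content, separately allocated
print(a == b)  # True: same text
print(a is b)  # True: after interning, both names refer to the same object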

This function may look different for you, depending on where you read your trigrams from. Keep in mind that the tokenizer I use here is extremely basic.

This already saves a little bit of memory, not too much though.

So, let's store intermediate results on disk:

import sys
import pickle
from collections import defaultdict

# keep at most LIMIT distinct trigrams in memory before spilling to disk
LIMIT = 5e6

def flush(counts, idx):
    # write the current partial counts, sorted by trigram, to an intermediate file
    with open('counts-%d' % (idx,), 'wb') as fobj:
        p = pickle.Pickler(fobj)
        for item in sorted(counts.items()):
            p.dump(item)

counts = defaultdict(int)
caches = 0
with open(sys.argv[1], 'r') as fobj:
    for t in read_trigrams(fobj):
        counts[t] += 1
        # once the in-memory dictionary grows too large, spill its counts to disk
        if len(counts) > LIMIT:
            flush(counts, caches)
            caches += 1
            counts.clear()
flush(counts, caches)  # flush the final partial batch

In this step, you can tweak LIMIT to not use too much memory, i.e. just reduce it until you don't experience a MemoryError any more.

Now, you have N files on drive with a sorted list of trigrams. In a separate program, you can read them in and sum up all intermediate counts:

import sys
import pickle

def merger(inputs):
    # k-way merge of the sorted intermediate files: repeatedly pick the
    # smallest key across all files and sum its partial counts
    unpicklers = [pickle.Unpickler(open(f, 'rb')) for f in inputs]
    DONE = (object(),)  # sentinel: this file is exhausted
    NEXT = (object(),)  # sentinel: fetch the next item from this file

    peek = [NEXT] * len(unpicklers)

    while True:
        # refill every slot that was consumed in the previous round
        for idx in range(len(unpicklers)):
            if peek[idx] is NEXT:
                try:
                    peek[idx] = unpicklers[idx].load()
                except EOFError:
                    peek[idx] = DONE

        if all(v is DONE for v in peek):
            return
        min_key = min(v[0] for v in peek if v is not DONE)
        yield min_key, sum(v[1] for v in peek if v[0] == min_key)
        peek = [NEXT if (v[0] == min_key) else v for v in peek]


for trigram, count in merger(sys.argv[1:]):
    print(trigram, count)

If you have 4 GiB of memory, you may actually have to use the splitting functionality. With 8 GiB, you should be able to keep it all in RAM.
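
Just to make the two stages concrete: the counting script is run once over the raw text and leaves the intermediate counts-0, counts-1, ... files on disk, and the merger is then pointed at those files. A minimal way to wire the second stage up without command-line arguments, assuming merger is in scope, might be:

import glob

# merge every intermediate file produced by the counting step and print totals
for trigram, count in merger(sorted(glob.glob('counts-*'))):
    print(' '.join(trigram), count)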