Question:
I have a large file which I need to read in and make a dictionary from. I would like this to be as fast as possible. However, my code in Python is too slow. Here is a minimal example that shows the problem.
First, make some fake data:
paste <(seq 20000000) <(seq 2 20000001) > largefile.txt
Now here is a minimal piece of python code to read it in and make a dictionary.
import sys
from collections import defaultdict

fin = open(sys.argv[1])
dict = defaultdict(list)
for line in fin:
    parts = line.split()
    dict[parts[0]].append(parts[1])
Timings:
time ./read.py largefile.txt
real 0m55.746s
However, it is not I/O bound, as shown by:
time cut -f1 largefile.txt > /dev/null
real 0m1.702s
If I comment out the dict line it takes 9 seconds. It seems that almost all the time is spent by dict[parts[0]].append(parts[1]).
Is there any way to speed this up? I don't mind using cython or even C if that is going to make a big difference. Or can pandas help here?
Here is the profile output on a file of size 10000000 lines.
python -m cProfile read.py test.data

20000009 function calls in 42.494 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.000 0.000 bisect.py:1(<module>)
1 0.000 0.000 0.001 0.001 collections.py:1(<module>)
1 0.000 0.000 0.000 0.000 collections.py:25(OrderedDict)
1 0.000 0.000 0.000 0.000 collections.py:386(Counter)
1 0.000 0.000 0.000 0.000 heapq.py:31(<module>)
1 0.000 0.000 0.000 0.000 keyword.py:11(<module>)
1 30.727 30.727 42.494 42.494 read.py:2(<module>)
10000000 4.855 0.000 4.855 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
10000000 6.912 0.000 6.912 0.000 {method 'split' of 'str' objects}
1 0.000 0.000 0.000 0.000 {open}
Update. We can assume that parts[1] is an integer and that parts[0] is a short fixed length string.
My fake data isn't very good as you only get one value per key. Here is a better version.
perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt
The only operation I will do is to query a key to return the list of values associated with it.
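For example, the kind of usage I have in mind is roughly the following (just a sketch; the key '5' and the int() conversion are only illustrative):

import sys
from collections import defaultdict

d = defaultdict(list)
with open(sys.argv[1]) as fin:
    for line in fin:
        key, value = line.split()
        d[key].append(int(value))    # parts[1] is always an integer

print(d.get('5', []))                # query one key, get back all its values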
Answer 1:
If you want what you described in the comment, then you can do it easily in pandas.
Let's say you have a file with the same layout, but the entries get duplicated, since in your example you add all the duplicates into a list:
1 1
2 2
1 3
3 4
1 5
5 6
Then you can read and manipulate the data:
In [1]: df = pd.read_table('largefile.txt', header=None, index_col=0)
In [2]: df.loc[2]
Out[2]:
1 2
Name: 2, dtype: int64
In [3]: df.loc[1]
Out[3]:
1
0
1 1
1 3
1 5
Pandas stores everything in DataFrame and Series objects, which are indexed, so don't worry too much about how the output looks: the first column is the index and the second column is the one that matters, and it will give you the numbers you need.
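If you prefer to get a plain Python list back from a lookup, something along these lines should work (just a sketch; values_for is a helper name I made up, and selecting with a one-element list keeps the behaviour the same whether a key appears once or many times):

import pandas as pd

df = pd.read_table('largefile.txt', header=None, index_col=0)

def values_for(key):
    # Indexing with a one-element list always returns a Series, whether the
    # key appears once or many times, so .tolist() works in both cases.
    return df.loc[[key], 1].tolist()

values_for(1)   # e.g. [1, 3, 5] for the sample file above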
You can do a lot more with pandas, though. For example, you can group by the first column in your file and perform aggregations:
In [64]: df = pd.read_table('largefile.txt', header=None).groupby(0)
In [65]: df.sum()
Out[65]:
1
0
1 9
2 2
3 4
5 6
In [66]: df.mean()
Out[66]:
1
0
1 3
2 2
3 4
5 6
In [67]: df[0].count()
Out[67]:
0
1 3
2 1
3 1
5 1
dtype: int64
I know that this is not the answer to how to speed up the dictionary thing, but from what you mentioned in the comment, this could be an alternate solution.
Edit - Added timing
Comparing the fastest dictionary solution with loading the data into a pandas DataFrame:
test_dict.py
import sys

d = {}
with open(sys.argv[1]) as fin:
    for line in fin:
        parts = line.split(None, 1)
        d[parts[0]] = d.get(parts[0], []) + [parts[1]]
test_pandas.py
import sys
import pandas as pd
df = pd.read_table(sys.argv[1], header=None, index_col=0)
Timed on a linux machine:
$ time python test_dict.py largefile.txt
real 1m13.794s
user 1m10.148s
sys 0m3.075s
$ time python test_pandas.py largefile.txt
real 0m10.937s
user 0m9.819s
sys 0m0.504s
Edit: for the new example file
In [1]: import pandas as pd
In [2]: df = pd.read_table('largefile.txt', header=None,
sep=' ', index_col=0).sort_index()
In [3]: df.index
Out[3]: Int64Index([0, 1, 1, ..., 9999998, 9999999, 9999999], dtype=int64)
In [4]: df[1][0]
Out[4]: 6301
In [5]: df[1][1].values
Out[5]: array([8936, 5983])
Answer 2:
Here are a few quick performance improvements I managed to get:
Using a plain dict instead of defaultdict, and changing d[parts[0]].append(parts[1]) to d[parts[0]] = d.get(parts[0], []) + [parts[1]], cut the time by 10%. I don't know whether it's eliminating all those calls to a Python __missing__ function, not mutating the lists in-place, or something else that deserves the credit.
Just using setdefault on a plain dict instead of defaultdict also cuts the time by 8%, which implies that it's the extra dict work rather than the in-place appends.
Meanwhile, replacing the split() with split(None, 1) helps by 9%.
Running in PyPy 1.9.0 instead of CPython 2.7.2 cut the time by 52%; PyPy 2.0b by 55%.
If you can't use PyPy, CPython 3.3.0 cut the time by 9%.
Running in 32-bit mode instead of 64-bit increased the time by 170%, which implies that if you're using 32-bit you may want to switch.
The fact that the dict takes over 2GB to store (slightly less in 32-bit) is probably a big part of the problem. The only real alternative is to store everything on disk. (In a real-life app you'd probably want to manage an in-memory cache, but here, you're just generating the data and quitting, which makes things simpler.) Whether this helps depends on a number of factors. I suspect that on a system with an SSD and not much RAM it'll speed things up, while on a system with a 5400rpm hard drive and 16GB of RAM (like the laptop I'm using at the moment) it won't… But depending on your system's disk cache, etc., who knows, without testing.
There's no quick&dirty way to store lists of strings in disk-based storage (shelve will presumably waste more time with the pickling and unpickling than it saves), but changing it to just concatenate strings instead and using gdbm kept the memory usage down below 200MB and finished in about the same time, and has the nice side effect that if you want to use the data more than once, you've got them stored persistently. Unfortunately, plain old dbm wouldn't work because the default page size is too small for this many entries, and the Python interface doesn't provide any way to override the default.
Switching to a simple sqlite3 database that just has non-unique Key and Value columns and doing it in :memory: took about 80% longer, while on disk it took 85% longer. I suspect that denormalizing things to store multiple values with each key wouldn't help, and would in fact make things worse. (Still, for many real life uses, this may be a better solution.)
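For reference, the sqlite3 variant I mean is roughly this (a sketch rather than the exact code I timed; the kv table name and the extra index are my own choices here):

import sqlite3
import sys

conn = sqlite3.connect(':memory:')                       # or a filename for the on-disk case
conn.execute('CREATE TABLE kv (key TEXT, value TEXT)')   # key is deliberately non-unique

with open(sys.argv[1]) as fin:
    conn.executemany('INSERT INTO kv VALUES (?, ?)',
                     (line.split(None, 1) for line in fin))
conn.execute('CREATE INDEX kv_key ON kv (key)')          # needed for reasonable lookup speed
conn.commit()

# Querying one key returns all of its values:
values = [v for (v,) in conn.execute('SELECT value FROM kv WHERE key = ?', ('1',))]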
Meanwhile, wrapping cProfile around your main loop:
40000002 function calls in 31.222 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 21.770 21.770 31.222 31.222 <string>:2(<module>)
20000000 2.373 0.000 2.373 0.000 {method 'append' of 'list' objects}
1 0.000 0.000 0.000 0.000 {method 'disable' of '_lsprof.Profiler' objects}
20000000 7.079 0.000 7.079 0.000 {method 'split' of 'str' objects}
So, that's one third of your time spent in string.split, 10% spent in append, and the rest spent in code that cProfile couldn't see, which here includes both iterating the file and the defaultdict method calls.
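In case it's useful, one way to wrap the profiler around just the loop (rather than the whole script, so import and startup costs don't show up) is roughly the following; it's a sketch and not necessarily exactly how the numbers above were produced:

import cProfile
import sys
from collections import defaultdict

d = defaultdict(list)
profiler = cProfile.Profile()
with open(sys.argv[1]) as fin:
    profiler.enable()                  # only profile the parsing loop itself
    for line in fin:
        parts = line.split()
        d[parts[0]].append(parts[1])
    profiler.disable()
profiler.print_stats()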
Switching to a regular dict with setdefault (which, remember, was a little faster) shows 3.774 seconds spent in setdefault, so that's about 15% of the time, or presumably around 20% for the defaultdict version. Presumably the __setitem__ method isn't going to be any worse than setdefault or defaultdict.__getitem__ were.
However, we may not be seeing the time charged by malloc calls here, and they may be a huge chunk of the performance. To test that, you'll need a C-level profiler. So let's come back to that.
Meanwhile, at least some of the leftover time is probably taken up by the line-splitting as well, since that must cost on the same order as space-splitting, right? But I don't know of any way to improve that significantly.
Finally, a C-level profiler is going to help here, but one run on my system may not help much for your system, so I'll leave that to you.
The fastest version on my system depends on which Python I run, but it's either this:
d = {}
for line in fin:
    parts = line.split(None, 1)
    d[parts[0]] = d.get(parts[0], []) + [parts[1]]
Or this:
d = {}
for line in fin:
    parts = line.split(None, 1)
    d.setdefault(parts[0], []).append(parts[1])
… And they're both pretty close to each other.
The gdbm solution, which was about the same speed, and has obvious advantages and disadvantages, looks like this:
d = gdbm.open(sys.argv[1] + '.db', 'c')
for line in fin:
    parts = line.split(None, 1)
    d[parts[0]] = d.get(parts[0], '') + ',' + parts[1]
(Obviously if you want to be able to run this repeatedly, you will need to add a line to delete any pre-existing database—or, better, if it fits your use case, to check its timestamp against the input file's and skip the whole loop if it's already up-to-date.)
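A minimal sketch of that check might look like the following (the os.path.getmtime comparison and the 'n' flag for recreating the database are my choices here; the '.db' suffix just mirrors the snippet above):

import os
import sys
import gdbm

src = sys.argv[1]
db_path = src + '.db'

if os.path.exists(db_path) and os.path.getmtime(db_path) >= os.path.getmtime(src):
    d = gdbm.open(db_path, 'r')          # database is already up to date; just reuse it
else:
    d = gdbm.open(db_path, 'n')          # 'n' always creates a fresh, empty database
    with open(src) as fin:
        for line in fin:
            parts = line.split(None, 1)
            d[parts[0]] = d.get(parts[0], '') + ',' + parts[1]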
Answer 3:
Here's a quick C version for anyone who is interested. Headline times on my machine:
Python (>5Gb memory)
time ./read.py largefile.txt
real 0m48.711s
user 0m46.911s
sys 0m1.783s
C (~1.9Gb memory)
gcc -O3 read.c -o read
time ./read largefile.txt
real 0m6.250s
user 0m5.676s
sys 0m0.573s
So about 7.8x quicker in C. :)
And I should add that my version of seq would not create a usable list without changing the command to:
paste <(seq -f "%.0f" 20000000) <(seq -f "%.0f" 2 20000001) > largefile.txt
Code below; credit must go to Vijay Mathew, who copied the dict example from Section 6.6 of The C Programming Language into his answer (which I in turn copied into my answer below):
Quick Way to Implement Dictionary in C
====== Edit ====== (13/08/2013)
Following on from comment #2 on my answer, I've updated the code to that in code listing 2 to allow multiple values for a single key, and also started using the updated perl code to generate the test file (which is half the size, hence the approximately halved execution times).
Headline times are:
Python (>5Gb memory)
time ./read.py largefile.txt
real 0m25.925s
user 0m25.228s
sys 0m0.688s
C (~1.9Gb memory)
gcc -O3 read.c -o read
time ./read largefile.txt
real 0m3.497s (although sub 3 seconds is possible by reducing the hash size?!?!?)
user 0m3.183s
sys 0m0.315s
So about 7.4x quicker in C, although pandas is probably close.
However, the point about the hash size is important. I could 'cheat' by reducing the hash size down to a very small number, which for a multivalued dictionary would increase insert speed at the expense of lookup. Therefore, to really test either of these implementations, you also need to test the lookup speed.
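For example, on the Python side a rough lookup benchmark could look like this (just a sketch; the random keys are drawn from the same range as the perl generator above):

import random
import sys
import time
from collections import defaultdict

d = defaultdict(list)
with open(sys.argv[1]) as fin:
    for line in fin:
        parts = line.split(None, 1)
        d[parts[0]].append(parts[1])

keys = [str(random.randrange(10 ** 7)) for _ in range(10 ** 6)]
start = time.time()
hits = sum(1 for k in keys if k in d)
print('%d lookups (%d hits) in %.2fs' % (len(keys), hits, time.time() - start))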
Code 2 (multivalue dictionary)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct nlist {             /* table entry: */
    struct nlist *next;    /* next entry in chain */
    char *name;            /* defined name */
    char *defn;            /* replacement text */
};

#define HASHSIZE 10000001
static struct nlist *hashtab[HASHSIZE]; /* pointer table */

/* hash: form hash value for string s */
unsigned hash(char *s)
{
    unsigned hashval;
    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31 * hashval;
    return hashval % HASHSIZE;
}

/* lookup: look for s in hashtab */
struct nlist *lookup(char *s)
{
    struct nlist *np;
    for (np = hashtab[hash(s)]; np != NULL; np = np->next)
        if (strcmp(s, np->name) == 0)
            return np;     /* found */
    return NULL;           /* not found */
}

/* lookup_all: build a list of every entry whose key matches */
struct nlist *lookup_all(char *key)
{
    struct nlist *np, *np2, *ret;
    unsigned hashval = hash(key);
    ret = NULL;
    for (np = hashtab[hashval]; np != NULL; np = np->next) {
        if (strcmp(key, np->name) == 0) {
            np2 = malloc(sizeof(*np2));
            np2->name = np->name;
            np2->defn = np->defn;
            np2->next = ret;
            ret = np2;
        }
    }
    return ret;            /* NULL if not found */
}

/* install: put (name, defn) in hashtab */
struct nlist *install(char *name, char *defn)
{
    struct nlist *np, *np2;
    unsigned hashval = hash(name);
    //if ((np = lookup(name, hashval)) == NULL) { /* not found */
    np = (struct nlist *) malloc(sizeof(*np));
    if (np == NULL || (np->name = strdup(name)) == NULL)
        return NULL;
    np->next = hashtab[hashval];
    hashtab[hashval] = np;
    if ((np->defn = strdup(defn)) == NULL)
        return NULL;
    return np;
}

#ifdef STRDUP
char *strdup(char *s)      /* make a duplicate of s */
{
    char *p;
    p = (char *) malloc(strlen(s) + 1); /* +1 for '\0' */
    if (p != NULL)
        strcpy(p, s);
    return p;
}
#endif /* STRDUP */

int main(int argc, char *argv[]) {
    FILE *fp;
    char str1[20];
    char str2[20];
    int size = 0;
    int progress = 0;
    struct nlist *result;
    fp = fopen(argv[1], "r");
    if (fp == NULL) { return 1; }
    fseek(fp, 0, SEEK_END);
    size = ftell(fp);
    rewind(fp);
    while (size != ftell(fp)) {
        if (0 == fscanf(fp, "%s %s", str1, str2))
            break;
        (void)install(str1, str2);
    }
    printf("Done\n");
    fclose(fp);
    // Test a lookup to see if we get multiple items back.
    result = lookup_all("1");
    while (result) {
        printf("Key = %s Value = %s\n", result->name, result->defn);
        result = result->next;
    }
    return 0;
}
Code 1 (single value dictionary)
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct nlist {             /* table entry: */
    struct nlist *next;    /* next entry in chain */
    char *name;            /* defined name */
    char *defn;            /* replacement text */
};

#define HASHSIZE 10000001
static struct nlist *hashtab[HASHSIZE]; /* pointer table */

/* hash: form hash value for string s */
unsigned hash(char *s)
{
    unsigned hashval;
    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31 * hashval;
    return hashval % HASHSIZE;
}

/* lookup: look for s in hashtab */
struct nlist *lookup(char *s)
{
    struct nlist *np;
    for (np = hashtab[hash(s)]; np != NULL; np = np->next)
        if (strcmp(s, np->name) == 0)
            return np;     /* found */
    return NULL;           /* not found */
}

/* install: put (name, defn) in hashtab */
struct nlist *install(char *name, char *defn)
{
    struct nlist *np;
    unsigned hashval;
    if ((np = lookup(name)) == NULL) { /* not found */
        np = (struct nlist *) malloc(sizeof(*np));
        if (np == NULL || (np->name = strdup(name)) == NULL)
            return NULL;
        hashval = hash(name);
        np->next = hashtab[hashval];
        hashtab[hashval] = np;
    } else                 /* already there */
        free((void *) np->defn); /* free previous defn */
    if ((np->defn = strdup(defn)) == NULL)
        return NULL;
    return np;
}

#ifdef STRDUP
char *strdup(char *s)      /* make a duplicate of s */
{
    char *p;
    p = (char *) malloc(strlen(s) + 1); /* +1 for '\0' */
    if (p != NULL)
        strcpy(p, s);
    return p;
}
#endif /* STRDUP */

int main(int argc, char *argv[]) {
    FILE *fp;
    char str1[20];
    char str2[20];
    int size = 0;
    int progress = 0;
    fp = fopen(argv[1], "r");
    if (fp == NULL) { return 1; }
    fseek(fp, 0, SEEK_END);
    size = ftell(fp);
    rewind(fp);
    while (size != ftell(fp)) {
        if (0 == fscanf(fp, "%s %s", str1, str2))
            break;
        //printf(">%s<>%s<\n", str1, str2);
        (void)install(str1, str2);
        ++progress;
        if ((progress % 100000) == 0)
            printf("Progress = %d\n", progress);
    }
    printf("Done\n");
    fclose(fp);
    return 0;
}
Answer 4:
You can still add an extra optimization on top of the others:
Since your keys are strings of "nearly" consecutive integers, you can speed up the creation of the dict by inserting the elements into it in order. This will reduce dict collisions. See the comments in the Python dict implementation:
Major subtleties ahead: Most hash schemes depend on having a "good"
hash function, in the sense of simulating randomness. Python doesn't:
its most important hash functions (for strings and ints) are very
regular in common cases:
>>> map(hash, (0, 1, 2, 3))
[0, 1, 2, 3]
>>> map(hash, ("namea", "nameb", "namec", "named"))
[-1658398457, -1658398460, -1658398459, -1658398462]
This isn't necessarily bad! To the contrary, in a table of size 2**i,
taking the low-order i bits as the initial table index is extremely
fast, and there are no collisions at all for dicts indexed by a
contiguous range of ints. The same is approximately true when keys are
"consecutive" strings. So this gives better-than-random behavior in
common cases, and that's very desirable.
So if you can pre-process the file to sort it, the Python execution can be much faster.
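A minimal sketch of that preprocessing, assuming the whole file fits in memory (for a file this size you could equally well sort it once with the external sort tool and keep the original read loop unchanged):

import sys
from collections import defaultdict

with open(sys.argv[1]) as fin:
    lines = fin.readlines()
lines.sort()                         # identical keys become adjacent, and similar
                                     # key strings are inserted one after another

d = defaultdict(list)
for line in lines:
    key, value = line.split(None, 1)
    d[key].append(value)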