Why is numpy/pandas parsing of a csv file with long lines so slow?

Posted 2019-02-14 18:28

I'm trying to efficiently parse a csv file with around 20,000 entries per line (and a few thousand lines) to a numpy array (or list of arrays, or anything similar really). I found a number of other questions, along with this blog post, which suggest that pandas's csv parser is extremely fast. However I've benchmarked pandas, numpy and some pure-python approaches and it appears that the trivial pure-python string splitting + list comprehension beats everything else by quite a large margin.

  • What's going on here?

  • Are there any csv parsers that would be more efficient?

  • If I change the format of the input data will it help?

Here's the source code I'm benchmarking with (the sum() is just to make sure any lazy iterators are forced to evaluate everything):

#! /usr/bin/env python3

import sys

import time
import gc

import numpy as np
from pandas.io.parsers import read_csv
import csv

def python_iterator_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            all_data = line.strip().split(",")
            print(sum(float(x) for x in all_data))


def python_list_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            all_data = line.strip().split(",")
            print(sum([float(x) for x in all_data]))


def python_array_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            all_data = line.strip().split(",")
            print(sum(np.array([float(x) for x in all_data])))


def numpy_fromstring():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for line in f.readlines():
            print(sum(np.fromstring(line, sep = ",")))


def numpy_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for row in np.loadtxt(f, delimiter = ",", dtype = np.float64, ndmin = 2):
            print(sum(row))


def csv_loader(csvfile):
    return read_csv(csvfile,
                      header = None,
                      engine = "c",
                      na_filter = False,
                      quoting = csv.QUOTE_NONE,
                      index_col = False,
                      sep = ",")

def pandas_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        for row in np.asarray(csv_loader(f).values, dtype = np.float64):
            print(sum(row))


def pandas_csv_2():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        print(csv_loader(f).sum(axis=1))


def simple_time(func, repeats = 3):
    gc.disable()

    for i in range(0, repeats):
        start = time.perf_counter()
        func()
        end = time.perf_counter()
        print(func, end - start, file = sys.stderr)
        gc.collect()

    gc.enable()
    return


if __name__ == "__main__":

    simple_time(python_iterator_csv)
    simple_time(python_list_csv)
    simple_time(python_array_csv)
    simple_time(numpy_csv)
    simple_time(pandas_csv)
    simple_time(numpy_fromstring)

    simple_time(pandas_csv_2)

The output (to stderr) is:

<function python_iterator_csv at 0x7f22302b1378> 19.754893831999652
<function python_iterator_csv at 0x7f22302b1378> 19.62786615600271
<function python_iterator_csv at 0x7f22302b1378> 19.66641107099713

<function python_list_csv at 0x7f22302b1ae8> 18.761991592000413
<function python_list_csv at 0x7f22302b1ae8> 18.722911622000538
<function python_list_csv at 0x7f22302b1ae8> 19.00348913199923

<function python_array_csv at 0x7f222baffa60> 41.8681991630001
<function python_array_csv at 0x7f222baffa60> 42.141840383999806
<function python_array_csv at 0x7f222baffa60> 41.86879085799956

<function numpy_csv at 0x7f222ba5cc80> 47.957625758001086
<function numpy_csv at 0x7f222ba5cc80> 47.245571732000826
<function numpy_csv at 0x7f222ba5cc80> 47.25457685799847

<function pandas_csv at 0x7f2228572620> 43.39656048499819
<function pandas_csv at 0x7f2228572620> 43.5016079220004
<function pandas_csv at 0x7f2228572620> 43.567352316000324

<function numpy_fromstring at 0x7f593ed3cc80> 32.490607361
<function numpy_fromstring at 0x7f593ed3cc80> 32.421125410997774
<function numpy_fromstring at 0x7f593ed3cc80> 32.37903898300283

<function pandas_csv_2 at 0x7f846d1aa730> 24.903284349999012
<function pandas_csv_2 at 0x7f846d1aa730> 25.498485038999206
<function pandas_csv_2 at 0x7f846d1aa730> 25.03262125800029

From the blog post linked above it seems that pandas can import a csv matrix of random doubles at a data rate of 145/1.279502 = 113 MB/s. My file is 814 MB, so pandas only manages ~19 MB/s for me!
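
For reference, a quick back-of-the-envelope check of those two rates (a minimal sketch; the numbers are the ones quoted above, with ~43 s being the pandas_csv timing):

blog_rate = 145 / 1.279502   # ~113 MB/s from the linked blog post
my_rate = 814 / 43.4         # ~19 MB/s for the 814 MB file parsed in ~43 s
print(round(blog_rate), round(my_rate))   # 113 19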

edit: As pointed out by @ASGM, this wasn't really fair to pandas because it is not designed for row-wise iteration. I've included the suggested improvement in the benchmark, but it's still slower than the pure python approaches. (Also: I've played around with profiling similar code, before simplifying it to this benchmark, and the parsing always dominated the time taken.)

edit2: Best of three times without the sum:

python_list_csv    17.8
python_array_csv   23.0
numpy_csv          28.6
numpy_fromstring   13.3
pandas_csv_2       24.2

So without the summation, numpy.fromstring beats pure python by a small margin (I think fromstring is implemented in C, so this makes sense).

edit3:

I've done some experimentation with the C/C++ float parsing code here and it looks like I'm probably expecting too much from pandas/numpy. Most of the robust parsers listed there give times of 10+ seconds just to parse this number of floats. The only parser which resoundingly beats numpy.fromstring is boost's spirit::qi, which is C++ and so not likely to make it into any python libraries.

[ More precise results: spirit::qi ~ 3s, lexical_cast ~ 7s, atof and strtod ~ 10s, sscanf ~ 18s, stringstream and stringstream reused are incredibly slow at 50s and 28s. ]

4 Answers

看我几分像从前 -- 2019-02-14 18:36

The python_array_csv and numpy_csv times are quite similar. If you look at the loadtxt code you'll see that it does much the same work. With python_array_csv you construct an array for each line and use it, while numpy_csv collects the parsed (and converted) lines into one list, which is converted to an array at the end.

loadtxt for each row does:

        vals = split_line(line)
        ...
        # Convert each value according to its column and store
        items = [conv(val) for (conv, val) in zip(converters, vals)]
        # Then pack it according to the dtype's nesting
        items = pack_items(items, packing)
        X.append(items)

with a final

X = np.array(X, dtype)

That [conv(val) for ...] line is just a generalization of your [float(val) for val in ...].

If a plain list does the job, don't convert it to an array. That just adds unnecessary overhead.

Functions like loadtxt are most valuable when the csv columns contain a mix of data types. They streamline the work of creating structured arrays from that data. For pure numeric data such as yours they don't add much.
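
As an illustration of that mixed-type case, here is a minimal sketch (the column names and data are made up, not from the question's file):

import io
import numpy as np

# Hypothetical mixed-type rows: integer id, byte-string label, float value.
mixed = io.StringIO("1,alice,3.5\n2,bob,4.25\n")
arr = np.loadtxt(mixed, delimiter=",",
                 dtype={"names": ("id", "label", "value"),
                        "formats": ("i4", "S10", "f8")})
print(arr["value"].sum())   # 7.75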

I can't speak for pandas, except that it has yet another layer on top of numpy, and does a lot of its own hardcoding.

看我几分像从前 -- 2019-02-14 18:48

Does your CSV file contain column headers? If not, then explicitly passing header=None to pandas.read_csv can give a substantial performance improvement for the Python parsing engine (but not for the C engine):

In [1]: np.savetxt('test.csv', np.random.randn(1000, 20000), delimiter=',')

In [2]: %timeit pd.read_csv('test.csv', delimiter=',', engine='python')
1 loops, best of 3: 9.19 s per loop

In [3]: %timeit pd.read_csv('test.csv', delimiter=',', engine='c')
1 loops, best of 3: 6.47 s per loop

In [4]: %timeit pd.read_csv('test.csv', delimiter=',', engine='python', header=None)
1 loops, best of 3: 6.26 s per loop

In [5]: %timeit pd.read_csv('test.csv', delimiter=',', engine='c', header=None)
1 loops, best of 3: 6.46 s per loop

Update

If there are no missing or invalid values then you can do a little better by passing na_filter=False (only valid for the C engine):

In [6]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None)
1 loops, best of 3: 6.42 s per loop

In [7]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False)
1 loops, best of 3: 4.72 s per loop

There may also be small gains to be had by specifying the dtype explicitly:

In [8]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64)
1 loops, best of 3: 4.36 s per loop

Update 2

Following up on @morningsun's comment, setting low_memory=False squeezes out a bit more speed:

In [9]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64, low_memory=True)
1 loops, best of 3: 4.3 s per loop

In [10]: %timeit pd.read_csv('test.csv', sep=',', engine='c', header=None, na_filter=False, dtype=np.float64, low_memory=False)
1 loops, best of 3: 3.27 s per loop

For what it's worth, these benchmarks were all done using the current dev version of pandas (0.16.0-19-g8d2818e).

够拽才男人 -- 2019-02-14 18:49

In the pure python case, you're iterating over the rows and printing as you go. In the pandas case, you're importing the whole thing into a DataFrame, and then iterating over the rows. But pandas' strength isn't in iterating over the rows - it's in operations that take place over the whole DataFrame. Compare the speed of:

def pandas_csv():
    with open("../data/temp_fixed_l_no_initial", "r") as f:
        print(csv_loader(f).sum(axis=1))

This is still somewhat slower than the pure python approach, which you're welcome to use if this is the extent of your use case. But as @ali_m's comment points out, if you want to do more than print the sum of the rows, or if you want to transform the data in any way, you will probably find pandas or numpy to be more efficient both in processing time and programming time.
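
To make that concrete, here is a small sketch (using a randomly generated DataFrame of roughly the question's shape, not the actual file) contrasting a whole-DataFrame operation with row-wise iteration:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 20000))

# Vectorized: one call over the whole DataFrame, executed in C.
row_sums = df.sum(axis=1)

# Row-wise: builds a Series per row and drops into Python for each one.
row_sums_slow = [row.sum() for _, row in df.iterrows()]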

贼婆χ -- 2019-02-14 19:01

If you give pandas the dtypes as a dictionary (pd.read_csv(..., dtype={'x': np.float64})), it will make things much faster, since pandas otherwise tries to infer the data type for every column.
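
A minimal sketch of what that looks like when there is no header and the columns are just numbered 0..N-1 (the file name and column count here are placeholders, not taken from the question):

import numpy as np
import pandas as pd

n_cols = 20000  # assumption: roughly the number of entries per line in the question
dtypes = {i: np.float64 for i in range(n_cols)}

df = pd.read_csv("data.csv", header=None, na_filter=False, dtype=dtypes)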
