How to improve Python C Extensions file line reading?

Posted 2020-03-31 08:54

Question:

Originally asked on Are there alternative and portable algorithm implementation for reading lines from a file on Windows (Visual Studio Compiler) and Linux?, but it was closed as too broad. Here I am trying to reduce its scope to a more concise use case.

My goal is to implement my own file reading module for Python with Python C Extensions, with a line caching policy. The pure Python implementation, without any line caching policy, is this:

# This takes 1 second to parse 100MB of log data
with open('myfile', 'r', errors='replace') as myfile:
    for line in myfile:
        if 'word' in line: 
            pass

Summarizing the Python C Extensions implementation (see here the full code with the line caching policy):

// other code to open the file on the std::ifstream object and create the iterator
...

static PyObject * PyFastFile_iternext(PyFastFile* self, PyObject* args)
{
    std::string newline;

    if( std::getline( self->fileifstream, newline ) ) {
        return PyUnicode_DecodeUTF8( newline.c_str(), newline.size(), "replace" );
    }

    PyErr_SetNone( PyExc_StopIteration );
    return NULL;
}

static PyTypeObject PyFastFileType =
{
    PyVarObject_HEAD_INIT( NULL, 0 )
    "fastfilepackage.FastFile" /* tp_name */
};

// create the module
PyMODINIT_FUNC PyInit_fastfilepackage(void)
{
    PyFastFileType.tp_iternext = (iternextfunc) PyFastFile_iternext;
    Py_INCREF( &PyFastFileType );

    PyObject* thismodule;
    // other module code creating the iterator and context manager
    ...

    PyModule_AddObject( thismodule, "FastFile", (PyObject *) &PyFastFileType );
    return thismodule;
}

And this is the Python code which uses the Python C Extensions code to open a file and read its lines one by one:

import fastfilepackage

# This takes 3 seconds to parse 100MB of log data
iterable = fastfilepackage.FastFile( 'myfile' )
for item in iterable:
    if 'word' in iterable():
        pass

Right now the Python C Extensions code fastfilepackage.FastFile with C++11 std::ifstream takes 3 seconds to parse 100MB of log data, while the Python implementation presented takes 1 second.

The contents of the file myfile are just log lines with around 100~300 characters on each line. The characters are plain ASCII (modulo 256), but due to bugs in the logger engine it can emit invalid ASCII or Unicode characters. Hence I used the errors='replace' policy while opening the file.

I just wonder whether I can replace or improve this Python C Extension implementation, to reduce the 3 seconds it takes to run the Python program.

I used this to do the benchmark:

import time
import datetime
import fastfilepackage

# usually a file with 100MB
testfile = './myfile.log'

timenow = time.time()
with open( testfile, 'r', errors='replace' ) as myfile:
    for item in myfile:
        if None:  # always false; the body never runs, so only iteration is timed
            var = item

python_time = time.time() - timenow
timedifference = datetime.timedelta( seconds=python_time )
print( 'Python   timedifference', timedifference, flush=True )
# prints about 1 second

timenow = time.time()
iterable = fastfilepackage.FastFile( testfile )
for item in iterable:
    if None:  # always false; the body never runs, so only iteration is timed
        var = iterable()

fastfile_time = time.time() - timenow
timedifference = datetime.timedelta( seconds=fastfile_time )
print( 'FastFile timedifference', timedifference, flush=True )
# prints about 3 seconds

# NOTE: despite the %% in the format, these are ratios, not percentages
print( 'fastfile_time %.2f%%, python_time %.2f%%' % (
        fastfile_time/python_time, python_time/fastfile_time ), flush=True )

Related questions:

  1. Reading file Line By Line in C
  2. Improving C++'s reading file line by line?

Answer 1:

Reading line by line is going to cause unavoidable slowdowns here. Python's built-in text-oriented read-only file objects are actually three layers:

  1. io.FileIO - Raw, unbuffered access to the file
  2. io.BufferedReader - Buffers the underlying FileIO
  3. io.TextIOWrapper - Wraps the BufferedReader to implement buffered decode to str

While iostream does perform buffering, it's only doing the job of io.BufferedReader, not io.TextIOWrapper. io.TextIOWrapper adds an extra layer of buffering, reading 8 KB chunks out of the BufferedReader and decoding them in bulk to str (when a chunk ends in an incomplete character, it saves off the remaining bytes to prepend to the next chunk), then yielding individual lines from the decoded chunk on request until it's exhausted (when a decoded chunk ends in a partial line, the remainder is prepended to the next decoded chunk).

By contrast, you're consuming a line at a time with std::getline, then decoding a line at a time with PyUnicode_DecodeUTF8, then yielding back to the caller; by the time the caller requests the next line, odds are at least some of the code associated with your tp_iternext implementation has left the CPU cache (or at least, left the fastest parts of the cache). A tight loop decoding 8 KB of UTF-8 text is going to go extremely fast; repeatedly leaving the loop and decoding only 100-300 bytes at a time is going to be slower.

The solution is to do roughly what io.TextIOWrapper does: Read in chunks, not lines, and decode them in bulk (preserving incomplete UTF-8 encoded characters for the next chunk), then search for newlines to fish out substrings from the decoded buffer until it's exhausted (don't trim the buffer each time, just track indices). When no more complete lines remain in the decoded buffer, trim the stuff you've already yielded, and read, decode, and append a new chunk.
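
For illustration, here is a minimal sketch of that scheme (not the original extension code): it assumes the PyFastFile struct gains three hypothetical members alongside the existing std::ifstream, namely a raw byte carry-over std::string rawpending, the current decoded chunk PyObject* decodedchunk, and a Py_ssize_t cursor. It relies on the fact that a '\n' byte can never appear inside a multibyte UTF-8 sequence, so cutting the raw buffer at the last newline always lands on a character boundary:

static const size_t CHUNK_SIZE = 8192;  // mirrors io.TextIOWrapper's chunk size

static PyObject * PyFastFile_iternext(PyFastFile* self, PyObject* args)
{
    for(;;) {
        // Serve the next line from the already decoded chunk, if one is loaded.
        if( self->decodedchunk != NULL ) {
            Py_ssize_t length = PyUnicode_GET_LENGTH( self->decodedchunk );

            if( self->cursor < length ) {
                Py_ssize_t newline = PyUnicode_FindChar(
                        self->decodedchunk, '\n', self->cursor, length, 1 );

                // keep the trailing '\n', as Python's file iteration does
                Py_ssize_t stop = (newline >= 0) ? newline + 1 : length;
                PyObject* line = PyUnicode_Substring(
                        self->decodedchunk, self->cursor, stop );

                self->cursor = stop;
                return line;
            }
            Py_CLEAR( self->decodedchunk );
        }

        // Refill: read one raw chunk and decode it in bulk. A '\n' byte can
        // never occur inside a multibyte UTF-8 sequence, so cutting at the
        // last newline keeps incomplete characters (and the partial last
        // line) in `rawpending` for the next round.
        char buffer[CHUNK_SIZE];
        self->fileifstream.read( buffer, CHUNK_SIZE );
        std::streamsize readcount = self->fileifstream.gcount();

        if( readcount > 0 ) self->rawpending.append( buffer, (size_t) readcount );
        if( self->rawpending.empty() ) return NULL;  // exhausted; no exception required

        size_t cut = self->rawpending.rfind( '\n' );

        if( cut == std::string::npos ) {
            if( readcount > 0 ) continue;         // very long line, keep reading
            cut = self->rawpending.size() - 1;    // EOF: flush the unterminated tail
        }
        self->decodedchunk = PyUnicode_DecodeUTF8(
                self->rawpending.data(), (Py_ssize_t) (cut + 1), "replace" );
        self->rawpending.erase( 0, cut + 1 );
        self->cursor = 0;

        if( self->decodedchunk == NULL ) return NULL;  // propagate decoding errors
    }
}

Compared to the original tp_iternext, each PyUnicode_DecodeUTF8 call now processes roughly 8 KB at once, and most next() calls are satisfied by two cheap lookups into the already decoded chunk.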

There is some room for improvement on Python's underlying implementation of io.TextIOWrapper.readline (e.g. they have to construct a Python-level int each time they read a chunk, and they call indirectly since they can't guarantee they're wrapping a BufferedReader), but it's a solid basis for reimplementing your own scheme.

Update: On checking your full code (which is wildly different from what you've posted), you've got other issues. Your tp_iternext just repeatedly yields None, requiring you to call your object to retrieve the string. That's... unfortunate. That more than doubles the Python interpreter overhead per item: tp_iternext is cheap to call, being quite specialized, while tp_call is not nearly so cheap, going through convoluted general-purpose code paths and requiring the interpreter to pass an empty tuple of args you never use. Side note: PyFastFile_tp_call should accept a third argument for the kwds, which you may ignore, but it must still be accepted; casting to ternaryfunc silences the compiler error, but this will break on some platforms.
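
For reference, a tp_call slot with the correct arity looks like the sketch below; self->currentline is a hypothetical member standing in for whatever cached object your tp_call currently returns:

static PyObject* PyFastFile_tp_call(PyFastFile* self, PyObject* args, PyObject* kwds)
{
    // `kwds` is deliberately unused, but the parameter must still exist so
    // the function matches ternaryfunc's three-argument calling convention.
    Py_XINCREF( self->currentline );  // hypothetical cached-line member
    return self->currentline;
}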

Final note (not really relevant to performance for all but the smallest files): the contract for tp_iternext does not require you to set an exception when the iterator is exhausted, just that you return NULL. You can remove your call to PyErr_SetNone( PyExc_StopIteration );: as long as no other exception is set, return NULL; alone indicates the end of iteration, so you can save some work by not setting it at all.
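
Applied to the tp_iternext shown in the question, the exhaustion path shrinks to this:

static PyObject * PyFastFile_iternext(PyFastFile* self, PyObject* args)
{
    std::string newline;

    if( std::getline( self->fileifstream, newline ) ) {
        return PyUnicode_DecodeUTF8( newline.c_str(), newline.size(), "replace" );
    }

    // No PyErr_SetNone( PyExc_StopIteration ); needed: returning NULL with
    // no exception set already signals the end of iteration.
    return NULL;
}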



Answer 2:

These results apply only to Linux or Cygwin compilers. If you are using the Visual Studio Compiler, std::getline and std::ifstream.getline are 100% or more slower than the Python builtin for line in file iterator.

You will see linecache.push_back( emtpycacheobject ) used throughout the code because this way I only benchmark the time used to read the lines, excluding the time Python would spend converting the input string into a Python Unicode object. Therefore, I commented out all lines which call PyUnicode_DecodeUTF8.

These are the global definitions used on the examples:

const char* filepath = "./myfile.log";
size_t linecachesize = 131072;

PyObject* emtpycacheobject = PyUnicode_DecodeUTF8( "", 0, "replace" );

I managed to optimize my POSIX C getline usage (by caching the total buffer size instead of always passing 0), and now POSIX C getline beats the Python builtin for line in file by 5%. I guess that if I removed all the Python and C++ code around POSIX C getline, it would gain some more performance:

char* readline = (char*) malloc( linecachesize );
FILE* cfilestream = fopen( filepath, "r" );

if( cfilestream == NULL ) {
    std::cerr << "ERROR: Failed to open the file '" << filepath << "'!" << std::endl;
}

if( readline == NULL ) {
    std::cerr << "ERROR: Failed to alocate internal line buffer!" << std::endl;
}

bool getline() {
    ssize_t charsread;
    if( ( charsread = getline( &readline, &linecachesize, cfilestream ) ) != -1 ) {
        // PyObject* pythonobject = PyUnicode_DecodeUTF8( readline, charsread, "replace" );
        // linecache.push_back( pythonobject );
        // return true;

        Py_XINCREF( emtpycacheobject );
        linecache.push_back( emtpycacheobject );
        return true;
    }
    return false;
}

if( readline ) {
    free( readline );
    readline = NULL;
}

if( cfilestream != NULL) {
    fclose( cfilestream );
    cfilestream = NULL;
}

I also managed to improve the C++ performance to only 20% slower than the Python builtin for line in file by using std::ifstream.getline():

char* readline = (char*) malloc( linecachesize );
std::ifstream fileobj;
fileobj.open( filepath );

if( fileobj.fail() ) {
    std::cerr << "ERROR: Failed to open the file '" << filepath << "'!" << std::endl;
}

if( readline == NULL ) {
    std::cerr << "ERROR: Failed to allocate internal line buffer!" << std::endl;
}

bool getline() {

    // test the stream state after reading; checking eof() before the read
    // would yield one spurious extra line at end of file
    if( fileobj.getline( readline, linecachesize ) ) {
        // PyObject* pyobj = PyUnicode_DecodeUTF8( readline, fileobj.gcount(), "replace" );
        // linecache.push_back( pyobj );
        // return true;

        Py_XINCREF( emtpycacheobject );
        linecache.push_back( emtpycacheobject );
        return true;
    }
    return false;
}

if( readline ) {
    free( readline );
    readline = NULL;
}

if( fileobj.is_open() ) {
    fileobj.close();
}

Finally, I also managed to get to only 10% slower than the Python builtin for line in file by using std::getline and caching the std::string it uses as input:

std::string line;
std::ifstream fileobj;
fileobj.open( filepath );

if( fileobj.fail() ) {
    std::cerr << "ERROR: Failed to open the file '" << filepath << "'!" << std::endl;
}

try {
    line.reserve( linecachesize );
}
catch( const std::exception& error ) {
    std::cerr << "ERROR: Failed to allocate internal line buffer!" << std::endl;
}

bool getline() {

    if( std::getline( fileobj, line ) ) {
        // PyObject* pyobj = PyUnicode_DecodeUTF8( line.c_str(), line.size(), "replace" );
        // linecache.push_back( pyobj );
        // return true;

        Py_XINCREF( emtpycacheobject );
        linecache.push_back( emtpycacheobject );
        return true;
    }
    return false;
}

if( fileobj.is_open() ) {
    fileobj.close();
}

After removing all the boilerplate from C++, the POSIX C getline version was 10% slower than the Python builtin for line in file:

const char* filepath = "./myfile.log";
size_t linecachesize = 131072;

PyObject* emtpycacheobject = PyUnicode_DecodeUTF8( "", 0, "replace" );
char* readline = (char*) malloc( linecachesize );
FILE* cfilestream = fopen( filepath, "r" );

static PyObject* PyFastFile_tp_call(PyFastFile* self, PyObject* args, PyObject *kwargs) {
    Py_XINCREF( emtpycacheobject );
    return emtpycacheobject;
}

static PyObject* PyFastFile_iternext(PyFastFile* self, PyObject* args) {
    ssize_t charsread;
    if( ( charsread = getline( &readline, &linecachesize, cfilestream ) ) == -1 ) {
        return NULL;
    }
    Py_XINCREF( emtpycacheobject );
    return emtpycacheobject;
}

static PyObject* PyFastFile_getlines(PyFastFile* self, PyObject* args) {
    Py_XINCREF( emtpycacheobject );
    return emtpycacheobject;
}

static PyObject* PyFastFile_resetlines(PyFastFile* self, PyObject* args) {
    Py_INCREF( Py_None );
    return Py_None;
}

static PyObject* PyFastFile_close(PyFastFile* self, PyObject* args) {
    Py_INCREF( Py_None );
    return Py_None;
}

Values from the last test runs, where POSIX C getline was 10% slower than Python:

$ /bin/python3.6 fastfileperformance.py
fastfile_time 1.15%, python_time 0.87%
Python   timedifference 0:00:00.695292
FastFile timedifference 0:00:00.796305

$ /bin/python3.6 fastfileperformance.py
fastfile_time 1.13%, python_time 0.88%
Python   timedifference 0:00:00.708298
FastFile timedifference 0:00:00.803594

$ /bin/python3.6 fastfileperformance.py
fastfile_time 1.14%, python_time 0.88%
Python   timedifference 0:00:00.699614
FastFile timedifference 0:00:00.795259

$ /bin/python3.6 fastfileperformance.py
fastfile_time 1.15%, python_time 0.87%
Python   timedifference 0:00:00.699585
FastFile timedifference 0:00:00.802173

$ /bin/python3.6 fastfileperformance.py
fastfile_time 1.15%, python_time 0.87%
Python   timedifference 0:00:00.703085
FastFile timedifference 0:00:00.807528

$ /bin/python3.6 fastfileperformance.py
fastfile_time 1.17%, python_time 0.85%
Python   timedifference 0:00:00.677507
FastFile timedifference 0:00:00.794591

$ /bin/python3.6 fastfileperformance.py
fastfile_time 1.20%, python_time 0.83%
Python   timedifference 0:00:00.670492
FastFile timedifference 0:00:00.804689