Originally asked on "Are there alternative and portable algorithm implementation for reading lines from a file on Windows (Visual Studio Compiler) and Linux?", but closed as too broad; so here I am trying to reduce its scope to a more concise use case.

My goal is to implement my own file reading module for Python with Python C Extensions, with a line caching policy. The pure Python implementation, without any line caching policy, is this:
# This takes 1 second to parse 100MB of log data
with open('myfile', 'r', errors='replace') as myfile:
    for line in myfile:
        if 'word' in line:
            pass
Summarizing the Python C Extensions implementation (see here the full code with the line caching policy):
// other code to open the file on the std::ifstream object and create the iterator
...

static PyObject * PyFastFile_iternext(PyFastFile* self, PyObject* args)
{
    std::string newline;

    if( std::getline( self->fileifstream, newline ) ) {
        return PyUnicode_DecodeUTF8( newline.c_str(), newline.size(), "replace" );
    }

    PyErr_SetNone( PyExc_StopIteration );
    return NULL;
}
static PyTypeObject PyFastFileType =
{
    PyVarObject_HEAD_INIT( NULL, 0 )
    "fastfilepackage.FastFile" /* tp_name */
};

// create the module
PyMODINIT_FUNC PyInit_fastfilepackage(void)
{
    PyFastFileType.tp_iternext = (iternextfunc) PyFastFile_iternext;
    Py_INCREF( &PyFastFileType );

    PyObject* thismodule;
    // other module code creating the iterator and context manager
    ...

    PyModule_AddObject( thismodule, "FastFile", (PyObject *) &PyFastFileType );
    return thismodule;
}
And this is the Python code which uses the Python C Extensions code to open a file and read its lines one by one:
from fastfilepackage import FastFile

# This takes 3 seconds to parse 100MB of log data
iterable = FastFile( 'myfile' )

for item in iterable:
    if 'word' in iterable():
        pass
Right now the Python C Extensions code fastfilepackage.FastFile with C++11 std::ifstream takes 3 seconds to parse 100MB of log data, while the Python implementation presented takes 1 second.

The content of the file myfile is just log lines with around 100~300 characters on each line. The characters are just ASCII (modulo 256), but due to bugs in the logger engine it can emit invalid ASCII or Unicode characters. This is why I used the errors='replace' policy while opening the file.
I just wonder whether I can replace or improve this Python C Extensions implementation, reducing the 3 seconds it takes to run the Python program.
I used this to do the benchmark:
import time
import datetime
import fastfilepackage

# usually a file with 100MB
testfile = './myfile.log'

timenow = time.time()
with open( testfile, 'r', errors='replace' ) as myfile:
    for item in myfile:
        if None:  # dead branch, keeps the loop body as cheap as possible
            var = item

python_time = time.time() - timenow
timedifference = datetime.timedelta( seconds=python_time )
print( 'Python timedifference', timedifference, flush=True )
# prints about 1 second

timenow = time.time()
iterable = fastfilepackage.FastFile( testfile )
for item in iterable:
    if None:  # dead branch, keeps the loop body as cheap as possible
        var = iterable()

fastfile_time = time.time() - timenow
timedifference = datetime.timedelta( seconds=fastfile_time )
print( 'FastFile timedifference', timedifference, flush=True )
# prints about 3 seconds

print( 'fastfile_time %.2f%%, python_time %.2f%%' % (
        fastfile_time/python_time, python_time/fastfile_time ), flush=True )
Related questions:
- Reading file Line By Line in C
- Improving C++'s reading file line by line?
Reading line by line is going to cause unavoidable slowdowns here. Python's built-in text-oriented read-only file objects are actually three layers:

- io.FileIO - Raw, unbuffered access to the file
- io.BufferedReader - Buffers the underlying FileIO
- io.TextIOWrapper - Wraps the BufferedReader to implement buffered decode to str
While iostream does perform buffering, it's only doing the job of io.BufferedReader, not io.TextIOWrapper. io.TextIOWrapper adds an extra layer of buffering, reading 8 KB chunks out of the BufferedReader and decoding them in bulk to str (when a chunk ends in an incomplete character, it saves off the remaining bytes to prepend to the next chunk), then yielding individual lines from the decoded chunk on request until it's exhausted (when a decoded chunk ends in a partial line, the remainder is prepended to the next decoded chunk).
By contrast, you're consuming a line at a time with std::getline, then decoding a line at a time with PyUnicode_DecodeUTF8, then yielding back to the caller; by the time the caller requests the next line, odds are at least some of the code associated with your tp_iternext implementation has left the CPU cache (or at least, left the fastest parts of the cache). A tight loop decoding 8 KB of UTF-8 text is going to go extremely fast; repeatedly leaving the loop and only decoding 100-300 bytes at a time is going to be slower.
The solution is to do roughly what io.TextIOWrapper does: read in chunks, not lines, and decode them in bulk (preserving incomplete UTF-8 encoded characters for the next chunk), then search for newlines to fish out substrings from the decoded buffer until it's exhausted (don't trim the buffer each time, just track indices). When no more complete lines remain in the decoded buffer, trim the stuff you've already yielded, and read, decode, and append a new chunk.
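For illustration, here is a minimal sketch of that chunked scheme, assuming a plain FILE* source; the ChunkedReader name, the complete_prefix helper, and the 8 KB chunk size are my own placeholders rather than part of the FastFile code, and the newline search over the decoded buffer is left to the caller:

#include <Python.h>
#include <cstdio>
#include <string>

// Return the length of the longest prefix of (data, size) that does not end
// in an incomplete UTF-8 sequence; the cut-off tail is saved for the next read.
static size_t complete_prefix( const char* data, size_t size ) {
    // A UTF-8 sequence is at most 4 bytes, so scan back at most 3 bytes
    // looking for a lead byte whose sequence runs past the buffer end.
    for( size_t back = 1; back <= 3 && back <= size; ++back ) {
        unsigned char byte = data[size - back];
        if( ( byte & 0xC0 ) == 0x80 ) continue;  // continuation byte, keep scanning
        size_t expected = ( byte & 0x80 ) == 0x00 ? 1   // ASCII
                        : ( byte & 0xE0 ) == 0xC0 ? 2
                        : ( byte & 0xF0 ) == 0xE0 ? 3
                        : ( byte & 0xF8 ) == 0xF0 ? 4 : 1;
        if( expected > back ) return size - back;  // sequence is cut off
        break;  // the last sequence is complete
    }
    return size;
}

struct ChunkedReader {
    FILE* stream;
    std::string pending;  // raw bytes carried over between reads

    // Read up to 8 KB, decode the complete prefix in bulk, and save any
    // trailing partial character for the next call; NULL means EOF.
    PyObject* decode_chunk() {
        char buffer[8192];
        size_t bytesread = fread( buffer, 1, sizeof( buffer ), stream );
        pending.append( buffer, bytesread );
        if( pending.empty() ) return NULL;

        // At EOF decode everything; 'replace' cleans up any broken tail.
        size_t decodable = bytesread ? complete_prefix( pending.data(), pending.size() )
                                     : pending.size();
        PyObject* decoded = PyUnicode_DecodeUTF8( pending.data(), decodable, "replace" );
        pending.erase( 0, decodable );  // keep only the incomplete tail
        return decoded;  // caller fishes lines out of this by tracking indices
    }
};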
There is some room for improvement on Python's underlying implementation of io.TextIOWrapper.readline (e.g. they have to construct a Python level int each time they read a chunk, and call indirectly since they can't guarantee they're wrapping a BufferedReader), but it's a solid basis for reimplementing your own scheme.
Update: On checking your full code (which is wildly different from what you've posted), you've got other issues. Your tp_iternext just repeatedly yields None, requiring you to call your object to retrieve the string. That's... unfortunate. That's more than doubling the Python interpreter overhead per item (tp_iternext is cheap to call, being quite specialized; tp_call is not nearly so cheap, going through convoluted general purpose code paths, requiring the interpreter to pass an empty tuple of args you never use, etc.; side-note: PyFastFile_tp_call should accept a third argument for the kwds, which you ignore but must still accept; casting to ternaryfunc silences the compiler error, but this will break on some platforms).
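For reference, a hedged sketch of the corrected signature; the body here is illustrative, not the real FastFile implementation:

// tp_call slots have the ternaryfunc shape: (self, args, kwds). The third
// parameter receives keyword arguments and must be accepted even if unused.
static PyObject* PyFastFile_tp_call( PyFastFile* self, PyObject* args, PyObject* kwds )
{
    (void) args;  // unused positional arguments
    (void) kwds;  // unused keyword arguments, but the parameter must exist
    // ... return the currently cached line here ...
    Py_RETURN_NONE;
}

// The remaining cast then only adjusts the self type, the usual CPython idiom:
// PyFastFileType.tp_call = (ternaryfunc) PyFastFile_tp_call;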
Final note (not really relevant to performance for all but the smallest files): the contract for tp_iternext does not require you to set an exception when the iterator is exhausted, just that you return NULL;. You can remove your call to PyErr_SetNone( PyExc_StopIteration ); as long as no other exception is set, return NULL; alone indicates end of iteration, so you can save some work by not setting it at all.
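Applied to the iternext from the question, that shrinks to something like this (same behavior, minus the redundant exception):

static PyObject * PyFastFile_iternext(PyFastFile* self, PyObject* args)
{
    std::string newline;

    if( std::getline( self->fileifstream, newline ) ) {
        return PyUnicode_DecodeUTF8( newline.c_str(), newline.size(), "replace" );
    }

    // Returning NULL with no exception set already signals end of iteration;
    // no PyErr_SetNone( PyExc_StopIteration ) is needed.
    return NULL;
}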
These results are only for the Linux or Cygwin compilers. If you are using the Visual Studio Compiler, the results for std::getline and std::ifstream.getline are 100% or more slower than the Python builtin for line in file iterator.
You will see linecache.push_back( emtpycacheobject ) being used around the code because this way I am only benchmarking the time used to read the lines, excluding the time Python would spend converting the input string into a Python Unicode Object. Therefore, I commented out all lines which call PyUnicode_DecodeUTF8.
These are the global definitions used in the examples:

const char* filepath = "./myfile.log";
size_t linecachesize = 131072;

PyObject* emtpycacheobject = PyUnicode_DecodeUTF8( "", 0, "replace" );
I managed to optimize my POSIX C getline usage (by caching the total buffer size instead of always passing 0), and now the POSIX C getline beats the Python builtin for line in file by 5%. I guess if I remove all the Python and C++ code around the POSIX C getline, it should gain some more performance:
char* readline = (char*) malloc( linecachesize );
FILE* cfilestream = fopen( filepath, "r" );

if( cfilestream == NULL ) {
    std::cerr << "ERROR: Failed to open the file '" << filepath << "'!" << std::endl;
}

if( readline == NULL ) {
    std::cerr << "ERROR: Failed to allocate internal line buffer!" << std::endl;
}

bool getline() {
    ssize_t charsread;

    if( ( charsread = getline( &readline, &linecachesize, cfilestream ) ) != -1 ) {
        // PyObject* pythonobject = PyUnicode_DecodeUTF8( readline, charsread, "replace" );
        // linecache.push_back( pythonobject );
        // return true;

        Py_XINCREF( emtpycacheobject );
        linecache.push_back( emtpycacheobject );
        return true;
    }

    return false;
}

if( readline ) {
    free( readline );
    readline = NULL;
}

if( cfilestream != NULL ) {
    fclose( cfilestream );
    cfilestream = NULL;
}
I also managed to improve the C++ performance to only 20% slower than the builtin Python for line in file by using std::ifstream.getline():
char* readline = (char*) malloc( linecachesize );
std::ifstream fileobj;
fileobj.open( filepath );

if( fileobj.fail() ) {
    std::cerr << "ERROR: Failed to open the file '" << filepath << "'!" << std::endl;
}

if( readline == NULL ) {
    std::cerr << "ERROR: Failed to allocate internal line buffer!" << std::endl;
}

bool getline() {
    // checking the getline() result instead of eof() avoids a bogus
    // final iteration when the last line ends exactly at end of file
    if( fileobj.getline( readline, linecachesize ) ) {
        // PyObject* pyobj = PyUnicode_DecodeUTF8( readline, fileobj.gcount(), "replace" );
        // linecache.push_back( pyobj );
        // return true;

        Py_XINCREF( emtpycacheobject );
        linecache.push_back( emtpycacheobject );
        return true;
    }

    return false;
}

if( readline ) {
    free( readline );
    readline = NULL;
}

if( fileobj.is_open() ) {
    fileobj.close();
}
Finally, I also managed to get performance only 10% slower than the builtin Python for line in file with std::getline, by caching the std::string it uses as input:
std::string line;
std::ifstream fileobj;
fileobj.open( filepath );

if( fileobj.fail() ) {
    std::cerr << "ERROR: Failed to open the file '" << filepath << "'!" << std::endl;
}

try {
    line.reserve( linecachesize );
}
catch( const std::exception& error ) {
    std::cerr << "ERROR: Failed to allocate internal line buffer!" << std::endl;
}

bool getline() {
    if( std::getline( fileobj, line ) ) {
        // PyObject* pyobj = PyUnicode_DecodeUTF8( line.c_str(), line.size(), "replace" );
        // linecache.push_back( pyobj );
        // return true;

        Py_XINCREF( emtpycacheobject );
        linecache.push_back( emtpycacheobject );
        return true;
    }

    return false;
}

if( fileobj.is_open() ) {
    fileobj.close();
}
After removing all the boilerplate from C++, the performance of the POSIX C getline was 10% inferior to the Python builtin for line in file:
const char* filepath = "./myfile.log";
size_t linecachesize = 131072;

PyObject* emtpycacheobject = PyUnicode_DecodeUTF8( "", 0, "replace" );
char* readline = (char*) malloc( linecachesize );
FILE* cfilestream = fopen( filepath, "r" );

static PyObject* PyFastFile_tp_call(PyFastFile* self, PyObject* args, PyObject *kwargs) {
    Py_XINCREF( emtpycacheobject );
    return emtpycacheobject;
}

static PyObject* PyFastFile_iternext(PyFastFile* self, PyObject* args) {
    ssize_t charsread;

    if( ( charsread = getline( &readline, &linecachesize, cfilestream ) ) == -1 ) {
        return NULL;
    }

    Py_XINCREF( emtpycacheobject );
    return emtpycacheobject;
}

static PyObject* PyFastFile_getlines(PyFastFile* self, PyObject* args) {
    Py_XINCREF( emtpycacheobject );
    return emtpycacheobject;
}

static PyObject* PyFastFile_resetlines(PyFastFile* self, PyObject* args) {
    Py_INCREF( Py_None );
    return Py_None;
}

static PyObject* PyFastFile_close(PyFastFile* self, PyObject* args) {
    Py_INCREF( Py_None );
    return Py_None;
}
Values from the last test run, where the POSIX C getline was 10% inferior to Python:
$ /bin/python3.6 fastfileperformance.py
Python timedifference 0:00:00.695292
FastFile timedifference 0:00:00.796305
fastfile_time 1.15%, python_time 0.87%

$ /bin/python3.6 fastfileperformance.py
Python timedifference 0:00:00.708298
FastFile timedifference 0:00:00.803594
fastfile_time 1.13%, python_time 0.88%

$ /bin/python3.6 fastfileperformance.py
Python timedifference 0:00:00.699614
FastFile timedifference 0:00:00.795259
fastfile_time 1.14%, python_time 0.88%

$ /bin/python3.6 fastfileperformance.py
Python timedifference 0:00:00.699585
FastFile timedifference 0:00:00.802173
fastfile_time 1.15%, python_time 0.87%

$ /bin/python3.6 fastfileperformance.py
Python timedifference 0:00:00.703085
FastFile timedifference 0:00:00.807528
fastfile_time 1.15%, python_time 0.87%

$ /bin/python3.6 fastfileperformance.py
Python timedifference 0:00:00.677507
FastFile timedifference 0:00:00.794591
fastfile_time 1.17%, python_time 0.85%

$ /bin/python3.6 fastfileperformance.py
Python timedifference 0:00:00.670492
FastFile timedifference 0:00:00.804689
fastfile_time 1.20%, python_time 0.83%