In Python, the open() function has an errors='ignore' option:
open( '/filepath.txt', 'r', encoding='UTF-8', errors='ignore' )
With this, reading a file with invalid UTF-8 characters will replace them with nothing, i.e., they are ignored. For example, a file with the characters Føö»BÃ¥r is going to be read as FøöBår.
If a line such as Føö»BÃ¥r is read with getline() from stdio.h, it will be read as Føö�Bår:
FILE* cfilestream = fopen( "/filepath.txt", "r" );
size_t linebuffersize = 131072;
char* readline = (char*) malloc( linebuffersize );

while( true )
{
    if( getline( &readline, &linebuffersize, cfilestream ) != -1 ) {
        std::cerr << "readline=" << readline << std::endl;
    }
    else {
        break;
    }
}
How can I make stdio.h getline() read it as FøöBår instead of Føö�Bår, i.e., ignoring invalid UTF-8 characters?
One brute-force solution I can think of is to iterate over all characters on each line read and build a new readline without any of these characters. For example:
FILE* cfilestream = fopen( "/filepath.txt", "r" );
size_t linebuffersize = 131072;
char* readline = (char*) malloc( linebuffersize );
char* fixedreadline = (char*) malloc( linebuffersize );
ssize_t index;
ssize_t charsread;
ssize_t invalidcharsoffset;

while( true )
{
    if( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
    {
        invalidcharsoffset = 0;
        for( index = 0; index < charsread; ++index )
        {
            // note: '�' is a multi-byte literal, so this comparison does not
            // actually match the invalid bytes (see the answer below)
            if( readline[index] != '�' ) {
                fixedreadline[index - invalidcharsoffset] = readline[index];
            }
            else {
                ++invalidcharsoffset;
            }
        }
        fixedreadline[charsread - invalidcharsoffset] = '\0';
        std::cerr << "fixedreadline=" << fixedreadline << std::endl;
    }
    else {
        break;
    }
}
You are confusing what you see with what is really going on. The getline function does not do any replacement of characters. [Note 1] You are seeing a replacement character (U+FFFD) because your console outputs that character when it is asked to render an invalid UTF-8 code. Most consoles will do that if they are in UTF-8 mode; that is, the current locale is UTF-8.
Also, saying that a file contains the "characters Føö»BÃ¥r" is at best imprecise. A file does not really contain characters. It contains byte sequences which may be interpreted as characters -- for example, by a console or other user presentation software which renders them into glyphs -- according to some encoding. Different encodings produce different results; in this particular case, you have a file which was created by software using the Windows-1252 encoding (or, roughly equivalently, ISO 8859-15), and you are rendering it on a console using UTF-8.

What that means is that the data read by getline contains an invalid UTF-8 sequence, but it (probably) does not contain the replacement character code. Based on the character string you present, it contains the hex character \xbb, which is a guillemet (») in Windows code page 1252.

Finding all the invalid UTF-8 sequences in a string read by getline (or any other C library function which reads files) requires scanning the string, but not for a particular code sequence. Rather, you need to decode UTF-8 sequences one at a time, looking for the ones which are not valid. That's not a simple task, but the mbtowc function can help (if you have enabled a UTF-8 locale). As you'll see in the linked manpage, mbtowc returns the number of bytes contained in a valid "multibyte sequence" (which is UTF-8 in a UTF-8 locale), or -1 to indicate an invalid or incomplete sequence. In the scan, you should pass through the bytes in a valid sequence, or remove/ignore the single byte starting an invalid sequence, and then continue the scan until you reach the end of the string.

Here's some lightly-tested example code (in C):
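A minimal sketch of that scan, assuming a UTF-8 locale has already been enabled with setlocale(); the function name and the in-place compaction are illustrative:

#include <stdlib.h>
#include <string.h>

/* Removes invalid UTF-8 sequences from s in place, returning the new length.
 * Assumes a UTF-8 locale, e.g. setlocale(LC_ALL, "en_US.UTF-8"). */
size_t remove_invalid(char *s, size_t len) {
    char *in = s;
    char *out = s;
    mbtowc(NULL, NULL, 0);                /* reset the conversion state */
    while (len > 0) {
        int charlen = mbtowc(NULL, in, len);
        if (charlen > 0) {                /* a valid multibyte sequence */
            memmove(out, in, charlen);    /* memmove: the regions overlap */
            out += charlen;
            in += charlen;
            len -= charlen;
        } else if (charlen == 0) {        /* an embedded NUL byte */
            *out++ = *in++;
            --len;
        } else {                          /* invalid or incomplete sequence */
            mbtowc(NULL, NULL, 0);        /* reset after the error */
            ++in;                         /* drop the single offending byte */
            --len;
        }
    }
    return (size_t)(out - s);
}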
Notes

1. Strictly speaking, the C library can still translate line endings: a stream opened in text mode converts CR-LF to \n on systems like Windows where the two-character CR-LF sequence is used as a line-end indication.

I also managed to fix it by trimming/cutting out all non-ASCII characters.

This one takes about 2.6 seconds to parse 319MB:
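A sketch of this version, following the question's buffer setup (the ./text.txt path and the 131072 buffer size are the benchmark's assumptions):

#include <cstdio>
#include <cstdlib>
#include <iostream>

int main() {
    FILE* cfilestream = fopen( "./text.txt", "r" );
    size_t linebuffersize = 131072;
    char* readline = (char*) malloc( linebuffersize );
    char* fixedreadline = (char*) malloc( linebuffersize );
    ssize_t charsread;

    while( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
    {
        ssize_t destination = 0;
        for( ssize_t source = 0; source < charsread; ++source )
        {
            // keep plain ASCII bytes only; any byte >= 0x80 belongs to a
            // (possibly invalid) multi-byte sequence and is trimmed away
            if( (unsigned char) readline[source] < 0x80 ) {
                fixedreadline[destination++] = readline[source];
            }
        }
        fixedreadline[destination] = '\0';
        std::cerr << "fixedreadline=" << fixedreadline << std::endl;
    }
    return 0;
}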
Alternative and slower version using memcpy

Using memmove does not improve speed much, so you could use either one. This one takes about 3.1 seconds to parse 319MB:
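A sketch of the memcpy variant, under the same assumptions; it copies whole runs of ASCII bytes instead of assigning them one by one:

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iostream>

int main() {
    FILE* cfilestream = fopen( "./text.txt", "r" );
    size_t linebuffersize = 131072;
    char* readline = (char*) malloc( linebuffersize );
    char* fixedreadline = (char*) malloc( linebuffersize );
    ssize_t charsread;

    while( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
    {
        ssize_t source = 0;
        ssize_t destination = 0;
        while( source < charsread )
        {
            ssize_t runstart = source;
            // measure a run of plain ASCII bytes ...
            while( source < charsread && (unsigned char) readline[source] < 0x80 ) ++source;
            // ... copy the whole run in one call ...
            memcpy( fixedreadline + destination, readline + runstart, source - runstart );
            destination += source - runstart;
            // ... and skip the non-ASCII bytes that follow it
            while( source < charsread && (unsigned char) readline[source] >= 0x80 ) ++source;
        }
        fixedreadline[destination] = '\0';
        std::cerr << "fixedreadline=" << fixedreadline << std::endl;
    }
    return 0;
}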
Optimized solution using iconv

This takes about 4.6 seconds to parse 319MB of text.
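A sketch of the iconv version; the //IGNORE suffix (a GNU iconv extension assumed here) tells the converter to discard invalid input sequences instead of failing:

#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <iconv.h>

int main() {
    FILE* cfilestream = fopen( "./text.txt", "r" );
    size_t linebuffersize = 131072;
    char* readline = (char*) malloc( linebuffersize );
    char* fixedreadline = (char*) malloc( linebuffersize );
    ssize_t charsread;

    // UTF-8 to UTF-8 "conversion" only validates; //IGNORE drops bad sequences
    iconv_t converter = iconv_open( "UTF-8//IGNORE", "UTF-8" );

    while( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
    {
        char* inbuffer = readline;
        char* outbuffer = fixedreadline;
        size_t inbytes = (size_t) charsread;
        size_t outbytes = linebuffersize - 1;

        // returns (size_t) -1 with errno == EILSEQ when sequences were dropped
        iconv( converter, &inbuffer, &inbytes, &outbuffer, &outbytes );
        *outbuffer = '\0';
        std::cerr << "fixedreadline=" << fixedreadline << std::endl;
    }

    iconv_close( converter );
    return 0;
}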
Slowest solution ever using mbtowc

This takes about 24.2 seconds to parse 319MB of text. If you comment out the line fixedchar = mbtowc(NULL, source, charsread); and uncomment the line charsread -= fixedchar; in the sketch below (breaking the invalid characters removal), this will take 1.9 seconds instead of 24.2 seconds (also compiled with the -O3 optimization level).
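A sketch consistent with the lines referenced above (the surrounding loop structure is an assumption; a UTF-8 locale must be active for mbtowc to decode UTF-8):

#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <clocale>
#include <iostream>

int main() {
    setlocale( LC_ALL, "en_US.UTF-8" );

    FILE* cfilestream = fopen( "./text.txt", "r" );
    size_t linebuffersize = 131072;
    char* readline = (char*) malloc( linebuffersize );
    char* fixedreadline = (char*) malloc( linebuffersize );
    ssize_t charsread;
    int fixedchar;

    while( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
    {
        char* source = readline;
        char* destination = fixedreadline;

        while( charsread > 0 )
        {
            // decode one UTF-8 sequence: its byte length, or -1 if invalid
            fixedchar = mbtowc( NULL, source, charsread );
            // charsread -= fixedchar;

            if( fixedchar > 0 ) {              // valid sequence: keep it
                memcpy( destination, source, fixedchar );
                destination += fixedchar;
                source += fixedchar;
                charsread -= fixedchar;
            }
            else {                             // invalid byte: drop it
                mbtowc( NULL, NULL, 0 );       // reset the conversion state
                ++source;
                --charsread;
            }
        }
        *destination = '\0';
        std::cerr << "fixedreadline=" << fixedreadline << std::endl;
    }
    return 0;
}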
Fastest version from all my others above, using memmove

You cannot use memcpy here because the memory regions overlap! This takes about 2.4 seconds to parse 319MB. If you comment out the lines *destination = *source; and memmove( destination, source, 1 ); in the sketch below (breaking the invalid characters removal), the performance stays almost the same as when memmove is being called. Calling memmove( destination, source, 1 ) is a little slower than directly doing *destination = *source;
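A sketch of this version; both byte-moving forms mentioned above are shown, with the memmove call left as a comment:

#include <cstdio>
#include <cstdlib>
#include <iostream>

int main() {
    FILE* cfilestream = fopen( "./text.txt", "r" );
    size_t linebuffersize = 131072;
    char* readline = (char*) malloc( linebuffersize );
    ssize_t charsread;

    while( ( charsread = getline( &readline, &linebuffersize, cfilestream ) ) != -1 )
    {
        char* source = readline;
        char* destination = readline;  // same buffer, so the regions overlap

        for( ; charsread > 0; ++source, --charsread )
        {
            if( (unsigned char) *source < 0x80 ) {
                *destination = *source;
                // memmove( destination, source, 1 );  // slower alternative
                ++destination;
            }
        }
        *destination = '\0';
        std::cerr << "readline=" << readline << std::endl;
    }
    return 0;
}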
Bonus

You can also use Python C Extensions (API). It takes about 2.3 seconds to parse 319MB without converting the lines to a cached UTF-8 char*, about 3.2 seconds to parse 319MB converting them to UTF-8 char*, and also about 3.2 seconds to parse 319MB converting them to a cached ASCII char*.

To build it, create the file source/fastfilewrappar.cpp with the contents of the above file and setup.py with the following contents:
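A sketch of such a setup.py; the module name fastfilepackage is an assumption, since only the source path is given above:

# setup.py -- sketch; the module name 'fastfilepackage' is an assumption
from setuptools import setup, Extension

setup(
    name='fastfilepackage',
    version='1.0',
    ext_modules=[
        Extension(
            name='fastfilepackage',
            sources=['source/fastfilewrappar.cpp'],
            extra_compile_args=['-O3'],
            language='c++',
        )
    ],
)

You can then build the extension in place with python setup.py build_ext --inplace.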
To run the example, use the following Python script:
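A sketch of a driver script; the extension's interface (a FastFile type iterable over lines) is an assumption based on the description above:

import time
import fastfilepackage  # the C extension built above (name assumed)

started = time.time()
iterable = fastfilepackage.FastFile('./text.txt')  # assumed interface

linecount = 0
for line in iterable:
    linecount += 1

print('Lines read:', linecount, 'in', time.time() - started, 'seconds')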
Using std::getline

This takes about 4.7 seconds to parse 319MB. If you remove the UTF-8 removal algorithm borrowed from the fastest benchmark using stdlib.h getline(), it takes 1.7 seconds to run.
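A sketch of the std::getline version, reusing the ASCII-trimming idea from the fastest benchmark:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream fileifstream( "./text.txt" );
    std::string readline;

    while( std::getline( fileifstream, readline ) )
    {
        // compact the string in place, keeping only plain ASCII bytes
        std::string::size_type destination = 0;
        for( std::string::size_type source = 0; source < readline.size(); ++source )
        {
            if( (unsigned char) readline[source] < 0x80 ) {
                readline[destination++] = readline[source];
            }
        }
        readline.resize( destination );
        std::cerr << "readline=" << readline << std::endl;
    }
    return 0;
}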
Summary

1. 2.6 seconds trimming UTF-8 using two buffers with indexing
2. 3.1 seconds trimming UTF-8 using two buffers with memcpy
3. 4.6 seconds removing invalid UTF-8 with iconv
4. 24.2 seconds removing invalid UTF-8 with mbtowc
5. 2.4 seconds trimming UTF-8 using one buffer with pointer direct assigning

Bonus

1. 2.3 seconds removing invalid UTF-8 without converting the lines to a cached UTF-8 char*
2. 3.2 seconds removing invalid UTF-8 converting them to a cached UTF-8 char*
3. 3.2 seconds trimming UTF-8 and caching as ASCII char*
4. 4.7 seconds trimming UTF-8 with std::getline() using one buffer with pointer direct assigning

The used file ./text.txt had 820,800 lines, where each line was equal to:

id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char&id-é-char\r\n

And all versions were compiled with:

g++ (GCC) 7.4.0
iconv (GNU libiconv 1.14)
g++ -o main test.cpp -O3 -liconv && time ./main
As @rici well explains in his answer, there can be several invalid UTF-8 sequences in a byte sequence.
Possibly iconv(3) could be worth a look, e.g. see https://linux.die.net/man/3/iconv_open.
Example

A byte sequence that contains some invalid UTF-8 would, if displayed on a UTF-8 console, show replacement characters in place of the invalid bytes. When such a string passes through the remove_invalid_utf8 function in the following C program, the invalid UTF-8 bytes are removed using the iconv function mentioned above, so the result contains only the valid UTF-8 content.
C Program
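A sketch of such a program; the example string embeds a stray \xbb byte like the one from the question, and the buffer sizes are illustrative:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <iconv.h>

/* Copies input to output, dropping invalid UTF-8 byte sequences.
 * On EILSEQ the single offending byte is skipped and conversion resumes. */
void remove_invalid_utf8( const char *input, char *output, size_t outsize )
{
    iconv_t converter = iconv_open( "UTF-8", "UTF-8" );  /* validate UTF-8 */
    char *inbuffer = (char *) input;
    char *outbuffer = output;
    size_t inbytes = strlen( input );
    size_t outbytes = outsize - 1;

    while( inbytes > 0 )
    {
        if( iconv( converter, &inbuffer, &inbytes, &outbuffer, &outbytes ) == (size_t) -1 )
        {
            if( errno == EILSEQ ) {  /* invalid sequence: skip one byte */
                ++inbuffer;
                --inbytes;
                continue;
            }
            break;                   /* EINVAL (truncated) or E2BIG: stop */
        }
    }
    *outbuffer = '\0';
    iconv_close( converter );
}

int main( void )
{
    /* "Føö" + stray 0xBB byte + "Bår", written as raw UTF-8 bytes */
    const char input[] = "F\xc3\xb8\xc3\xb6\xbb" "B\xc3\xa5r";
    char output[sizeof input];

    remove_invalid_utf8( input, output, sizeof output );
    printf( "%s\n", output );  /* prints FøöBår */
    return 0;
}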