I would like to Split some files (around 1000) into words and remove numbers and punctuation. I will then process these tokenized words accordingly... However, the files are mostly in German language and are encoded in different types:
- ISO-8859-1
- ISO Latin-1
- ASCII
- UTF-8
The problem that I am facing is that I cannot find a correct way to apply Character Conversion functions such as tolower()
and I also get some weird icons in the terminal when I use std::cout
at Ubuntu linux
.
For example, in non UTF-8 files, the word französische
is shown as franz�sische
, für
as
f�r
etc... Also, words like Örebro
or Österreich
are ignored by tolower()
. From what I know the "Unicode replacement character" � (U+FFFD)
is inserted for any character that the program cannot decode correctly when trying to handle Unicode.
When I open UTF-8 files i dont get any weird characters but i still cannot convert upper case special characters such as Ö
to lower case... I used std::setlocale(LC_ALL, "de_DE.iso88591");
and some other options that I have found on stackoverflow but I still dont get the desired output.
My guess on how I should solve this is:
- Check encoding of file that is about to be opened
- open file according to its specific encoding
- Convert file input to UTF-8
- Process file and apply
tolower()
etc
Is the above algorithm
feasible or the complexity will skyrocket?
What is the correct approach for this problem? How can I open the files with some sort of encoding options?
1. Should my OS have the corresponding locale enabled as global variable to process (without bothering how console displays it) text? (in linux for example I do not have de_DE enabled when i use -locale -a
)
2. Is this problem only visible due to terminal default encoding? Do I need to take any further steps before i process the extracted string normally in c++?
My linux locale:
LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=el_GR.UTF-8
LC_TIME=el_GR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=el_GR.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=el_GR.UTF-8
LC_NAME=el_GR.UTF-8
LC_ADDRESS=el_GR.UTF-8
LC_TELEPHONE=el_GR.UTF-8
LC_MEASUREMENT=el_GR.UTF-8
LC_IDENTIFICATION=el_GR.UTF-8
LC_ALL=
C
C.UTF-8
el_GR.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX
Here is some sample code that I wrote that doesnt work as I want atm.
void processFiles() {
std::string filename = "17454-8.txt";
std::ifstream inFile;
inFile.open(filename);
if (!inFile) {
std::cerr << "Failed to open file" << std::endl;
exit(1);
}
//calculate file size
std::string s = "";
s.reserve(filesize(filename) + std::ifstream::pos_type(1));
std::string line;
while( (inFile.good()) && std::getline(inFile, line) ) {
s.append(line + "\n");
}
inFile.close();
std::cout << s << std::endl;
//remove punctuation, numbers, tolower,
//TODO encoding detection and specific transformation (cannot catch Ö, Ä etc) will add too much complexity...
std::setlocale(LC_ALL, "de_DE.iso88591");
for (unsigned int i = 0; i < s.length(); ++i) {
if (std::ispunct(s[i]) || std::isdigit(s[i]))
s[i] = ' ';
if (std::isupper(s[i]))
s[i]=std::tolower(s[i]);
}
//std::cout << s << std::endl;
//tokenize string
std::istringstream iss(s);
tokens.clear();
tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};
for (auto & i : tokens)
std::cout << i << std::endl;
//PROCESS TOKENS
return;
}