How to apply character-conversion functions to text files with different encodings

Published 2019-03-02 07:04

Question:

I would like to split some files (around 1000) into words and remove numbers and punctuation. I will then process these tokenized words accordingly. However, the files are mostly in German and use several encodings:

  • ISO-8859-1 (also called ISO Latin-1)
  • ASCII
  • UTF-8

The problem I am facing is that I cannot find a correct way to apply character-conversion functions such as tolower(), and I also get some strange glyphs in the terminal when I use std::cout on Ubuntu Linux.

For example, in non-UTF-8 files the word französische is shown as franz�sische, für as f�r, and so on. Also, words like Örebro or Österreich are ignored by tolower(). As far as I know, the Unicode replacement character � (U+FFFD) is inserted for any byte sequence that the program cannot decode when it expects valid Unicode.

When I open UTF-8 files I don't get any strange characters, but I still cannot convert upper-case special characters such as Ö to lower case. I used std::setlocale(LC_ALL, "de_DE.iso88591"); and some other options I found on Stack Overflow, but I still don't get the desired output.

My guess on how I should solve this is:

  1. Check encoding of file that is about to be opened
  2. open file according to its specific encoding
  3. Convert file input to UTF-8
  4. Process file and apply tolower() etc

Is the above approach feasible, or will the complexity skyrocket?
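For step 1, one low-cost heuristic (no external library; the function name is mine) is to check whether the bytes form well-formed UTF-8. Latin-1 text containing umlauts is almost never valid UTF-8 by accident, so a failed check can fall back to ISO-8859-1:

```cpp
#include <cstddef>
#include <string>

// Returns true if `bytes` is well-formed UTF-8.
// Simplification: overlong sequences and surrogate ranges are not
// rejected, which is fine for a quick UTF-8-vs-Latin-1 guess.
bool looksLikeUtf8(const std::string& bytes) {
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char c = static_cast<unsigned char>(bytes[i]);
        std::size_t len;
        if (c < 0x80)                len = 1;  // 0xxxxxxx (ASCII)
        else if ((c & 0xE0) == 0xC0) len = 2;  // 110xxxxx
        else if ((c & 0xF0) == 0xE0) len = 3;  // 1110xxxx
        else if ((c & 0xF8) == 0xF0) len = 4;  // 11110xxx
        else return false;                     // stray continuation or invalid byte
        if (i + len > bytes.size()) return false;  // truncated sequence
        for (std::size_t j = 1; j < len; ++j)      // continuation bytes: 10xxxxxx
            if ((static_cast<unsigned char>(bytes[i + j]) & 0xC0) != 0x80)
                return false;
        i += len;
    }
    return true;
}
```

For example, the UTF-8 bytes of "für" ("f\xC3\xBCr") pass the check, while the Latin-1 bytes ("f\xFCr") fail it.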

What is the correct approach for this problem? How can I open the files with some sort of encoding options?

1. Should my OS have the corresponding locale installed system-wide in order to process the text (regardless of how the console displays it)? (On Linux, for example, de_DE is not listed when I run locale -a.)

2. Is this problem only visible because of the terminal's default encoding? Do I need to take any further steps before processing the extracted string normally in C++?

My Linux locale:

LANG=en_US.UTF-8
LANGUAGE=en_US
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=el_GR.UTF-8
LC_TIME=el_GR.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=el_GR.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=el_GR.UTF-8
LC_NAME=el_GR.UTF-8
LC_ADDRESS=el_GR.UTF-8
LC_TELEPHONE=el_GR.UTF-8
LC_MEASUREMENT=el_GR.UTF-8
LC_IDENTIFICATION=el_GR.UTF-8
LC_ALL=

Output of locale -a:

C
C.UTF-8
el_GR.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX

Here is some sample code I wrote that doesn't work as I want at the moment.

void processFiles() {
    std::string filename = "17454-8.txt";
    std::ifstream inFile;
    inFile.open(filename);
    if (!inFile) {
        std::cerr << "Failed to open file" << std::endl;
        exit(1);
    }

    //calculate file size
    std::string s = "";
    s.reserve(filesize(filename) + std::ifstream::pos_type(1));
    std::string line;
    while( (inFile.good()) && std::getline(inFile, line) ) {
        s.append(line + "\n");
    }
    inFile.close();

    std::cout << s << std::endl;
    //remove punctuation, numbers, tolower,
    //TODO encoding detection and specific transformation (cannot catch Ö, Ä etc) will add too much complexity...
    std::setlocale(LC_ALL, "de_DE.iso88591");
    for (unsigned int i = 0; i < s.length(); ++i) {
        // cast to unsigned char: passing a negative char value to the
        // <cctype> functions is undefined behaviour
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (std::ispunct(c) || std::isdigit(c))
            s[i] = ' ';
        if (std::isupper(c))
            s[i] = std::tolower(c);
    }
    //std::cout << s << std::endl;
    //tokenize string
    std::istringstream iss(s);
    tokens.clear();
    tokens = {std::istream_iterator<std::string>{iss}, std::istream_iterator<std::string>{}};
    for (auto & i : tokens)
        std::cout << i << std::endl;

    //PROCESS TOKENS
    return;
}

Answer 1:

Unicode assigns a "code point" to each character. A code point is a number in the range U+0000 to U+10FFFF, so it fits comfortably in a 32-bit value.

There are several encodings. ASCII uses only 7 bits, which gives 128 different characters. The 8th bit was later used to define another 128 characters, with the mapping depending on the locale: Microsoft called these mappings "code pages", and ISO standardized similar 8-bit sets such as ISO-8859-1 (also known as Latin-1). Nowadays Windows uses UTF-16, which encodes a character in 2 bytes, or in a pair of 2-byte "surrogates" for characters outside the Basic Multilingual Plane; unlike code pages, UTF-16 covers the whole Unicode set and is not locale dependent.

The most common encoding on Linux (typically for files) is UTF-8, which uses a variable number of bytes per character: the first 128 characters are encoded exactly like ASCII, in a single byte, while other characters take 2 to 4 bytes. More info on Wikipedia.

While Windows uses UTF-16 both for files and in RAM, Linux programs typically use UTF-32 in RAM (wchar_t is 4 bytes there).

In order to read a file you need to know its encoding. Trying to detect it automatically is a real nightmare and may not succeed. std::basic_ios::imbue lets you set the desired locale for your stream, as in this SO answer.

std::tolower and similar functions have overloads in &lt;locale&gt; that take a locale, e.g.

#include <iostream>
#include <locale>

int main() {
    wchar_t s = L'\u00D6'; //latin capital 'o' with diaeresis, decimal 214
    wchar_t sL = std::tolower(s, std::locale("en_US.UTF-8")); //hex= 00F6, dec= 246
    std::cout << "s = " << s << std::endl;
    std::cout << "sL= " << sL << std::endl;

    return 0;
}

outputs:

s = 214
sL= 246

In this other SO answer you can find good solutions, such as using the iconv library (native on Linux, with a Win32 port available).
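A minimal sketch of the iconv route (convert is my own helper name; error handling is reduced to exceptions, and the E2BIG resumption loop a production version would need is omitted):

```cpp
#include <iconv.h>     // POSIX iconv; part of glibc on Linux
#include <cstddef>
#include <stdexcept>
#include <string>

// Convert `input` from encoding `fromEnc` to `toEnc` using iconv.
std::string convert(const std::string& input,
                    const char* fromEnc, const char* toEnc) {
    iconv_t cd = iconv_open(toEnc, fromEnc);   // note the (to, from) order
    if (cd == reinterpret_cast<iconv_t>(-1))
        throw std::runtime_error("unsupported conversion");

    // A UTF-8 target needs at most 4 output bytes per input byte here.
    std::string out(input.size() * 4 + 4, '\0');
    char* inPtr  = const_cast<char*>(input.data());
    std::size_t inLeft  = input.size();
    char* outPtr = &out[0];
    std::size_t outLeft = out.size();

    if (iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft)
            == static_cast<std::size_t>(-1)) {
        iconv_close(cd);
        throw std::runtime_error("conversion failed");
    }
    iconv_close(cd);
    out.resize(out.size() - outLeft);
    return out;
}
```

For example, convert(line, "ISO-8859-1", "UTF-8") turns Latin-1 bytes into UTF-8, after which the whole corpus can be tokenized uniformly.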

In Linux the terminal can be set to use a locale via the LC_ALL, LANG and LANGUAGE environment variables, e.g.:

# Deutsch
LC_ALL="de_DE.UTF-8"
LANG="de_DE.UTF-8"
LANGUAGE="de_DE:de:en_US:en"

# English
LC_ALL="en_US.UTF-8"
LANG="en_US.UTF-8"
LANGUAGE="en_US:en"