Unsigned integer as UTF-8 value

Posted 2019-03-22 08:06

Assuming that I have

uint32_t a(3084);

I would like to create a string that stores the Unicode character U+3084, which means that I should take the value of a and use it as the code point of the character to be encoded in UTF-8.

Clearly std::to_string() doesn't work for this. The standard has many functions for converting between numeric values and char, but I can't find one that supports UTF-8 and outputs an std::string.

Do I have to write this function from scratch, or is there something in the C++11 standard that can help me with that? Please note that my compiler (gcc/g++ 4.8.1) doesn't offer complete support for codecvt.

4 answers
男人必须洒脱
#2 · 2019-03-22 08:32
auto s = u8"\343\202\204"; // Octal escaped representation of HIRAGANA LETTER YA
std::cout << s << std::endl;

prints や for me (using g++ 4.8.1). s has type const char*, as you'd expect, but I don't know if this is implementation-defined. Unfortunately, C++ doesn't have any support for manipulating UTF-8 strings as far as I know; for that you need to use a library like Glib::ustring.
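
Because the octal escapes name the bytes directly, the result can be checked without depending on how a terminal renders it. A minimal sketch follows; the helper name HexDump and the constant kYa are hypothetical, and the plain (unprefixed) literal is used on the assumption that the code might also be built as C++20, where a u8 literal has type const char8_t* and no longer converts to std::string.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: renders each byte of s as two lowercase hex digits,
// separated by single spaces.
std::string HexDump(const std::string& s)
{
    std::string out;
    char buf[3];
    for (unsigned char c : s)
    {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(c));
        if (!out.empty())
            out += ' ';
        out += buf;
    }
    return out;
}

// The octal escapes spell out the three UTF-8 bytes of U+3084 one by one.
const std::string kYa = "\343\202\204";
```

With these definitions, HexDump(kYa) yields "e3 82 84", the same bytes a hex-escaped literal "\xe3\x82\x84" would contain.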

疯言疯语
#3 · 2019-03-22 08:38

Here's some C++ code that wouldn't be hard to convert to C. Adapted from an older answer.

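#include <string>

// Encodes a single code point as UTF-8. No validation is performed, so the
// caller must pass a valid code point (at most 0x10FFFF, not a surrogate).
std::string UnicodeToUTF8(unsigned int codepoint)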
{
    std::string out;

    if (codepoint <= 0x7f)
        out.append(1, static_cast<char>(codepoint));
    else if (codepoint <= 0x7ff)
    {
        out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else if (codepoint <= 0xffff)
    {
        out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else
    {
        out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    return out;
}
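
The function above encodes whatever value it is given, including values that are not Unicode scalar values (the surrogate range U+D800 to U+DFFF and anything above U+10FFFF). A minimal sketch of a checked variant follows; the name EncodeUtf8Checked and the choice to report failure through a bool are assumptions made for illustration, not part of the original answer.

```cpp
#include <string>

// Hypothetical checked variant: appends the UTF-8 encoding of cp to out and
// returns true, or returns false and leaves out unchanged when cp is not a
// Unicode scalar value.
bool EncodeUtf8Checked(char32_t cp, std::string& out)
{
    // Surrogates are reserved for UTF-16; nothing is defined above U+10FFFF.
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        return false;

    if (cp <= 0x7F)
    {
        out += static_cast<char>(cp);
    }
    else if (cp <= 0x7FF)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp <= 0xFFFF)
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

Called with 0x3084 it appends the bytes e3 82 84; called with 0xD800 or 0x110000 it returns false and appends nothing.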
chillily
#4 · 2019-03-22 08:39

std::wstring_convert::to_bytes has a single-char overload just for you.

#include <cstdint>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <iomanip>

// utility function for output
void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    for(unsigned char c : s)
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
    std::cout << std::dec << '\n';
}

int main()
{
    uint32_t a(3084);

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv1;
    std::string u8str = conv1.to_bytes(a);
    std::cout << "UTF-8 conversion produced " << u8str.size() << " bytes:\n";
    hex_print(u8str);
}

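I get the following with libc++. Note that 3084 is a decimal literal, i.e. U+0C0C, which is what these bytes encode; for U+3084 the value would be 0x3084 and the bytes e3 82 84.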

$ ./test
UTF-8 conversion produced 3 bytes:
e0 b0 8c 
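
The same converter also accepts a whole std::u32string, and from_bytes goes in the other direction. A sketch follows, assuming a standard library that actually ships <codecvt> (the question notes that g++ 4.8.1 does not fully) and keeping in mind that std::wstring_convert and <codecvt> are deprecated since C++17. The function names ToUtf8 and FromUtf8 are hypothetical.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical wrappers around std::wstring_convert. Because no error string
// is passed to the converter, both throw std::range_error on failure.
std::string ToUtf8(const std::u32string& text)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(text);
}

std::u32string FromUtf8(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}
```

Passing a one-element string holding 0x3084 to ToUtf8 gives e3 82 84, while the decimal value 3084 gives e0 b0 8c as in the output above.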
Fickle 薄情
#5 · 2019-03-22 08:51

The C++ standard contains the std::codecvt<char32_t, char, mbstate_t> facet, which converts between UTF-32 and UTF-8 according to 22.4.1.4 [locale.codecvt] paragraph 3. Sadly, the std::codecvt<...> facets aren't easy to use. At some point there was discussion about filtering stream buffers which would take care of the code conversion (the standard C++ library needs to implement them anyway for std::basic_filebuf<...>), but I can't see any trace of these.
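
To give an idea of what using the facet directly involves, here is a rough sketch that converts a single code point through its out() member. It assumes a library that provides this specialization (it may be missing on older toolchains such as the g++ 4.8.1 mentioned in the question), and the specialization is deprecated as of C++20. The names Utf32ToUtf8Facet and EncodeWithFacet are hypothetical.

```cpp
#include <cwchar>
#include <locale>
#include <string>

// std::codecvt has a protected destructor, so a small derived type is needed
// to create one outside of a std::locale. The name is hypothetical.
struct Utf32ToUtf8Facet : std::codecvt<char32_t, char, std::mbstate_t>
{
    ~Utf32ToUtf8Facet() {}
};

// Hypothetical helper: converts one code point and returns its UTF-8 bytes,
// or an empty string if the facet reports anything other than ok.
std::string EncodeWithFacet(char32_t cp)
{
    Utf32ToUtf8Facet facet;
    std::mbstate_t state = std::mbstate_t();
    char buffer[4];  // UTF-8 needs at most 4 bytes per code point
    const char32_t* from_next = nullptr;
    char* to_next = nullptr;

    std::codecvt_base::result r = facet.out(
        state, &cp, &cp + 1, from_next,
        buffer, buffer + sizeof buffer, to_next);

    if (r != std::codecvt_base::ok)
        return std::string();
    return std::string(buffer, to_next);
}
```

The amount of bookkeeping (conversion state, two pairs of begin/end pointers, two output cursors) for a single character illustrates why the facets are considered awkward to use.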
