How to decode a URI with UTF-8 characters in C++

Posted 2019-09-11 19:32

Question:

I need to decode a URI in C++. I found several questions about it, but they all fail to deal with UTF-8 encoding and accents (I'm interested in accurately dealing with ASCII characters).

Then I went with a widely used library, libcurl, but it also seemed to fail on the UTF-8 encoding. Here's what I'm doing:

string UriHelper::Decode(const string &encoded)
{
    CURL *curl = curl_easy_init();
    int outlength;
    char *cres = curl_easy_unescape(curl, encoded.c_str(), encoded.length(), &outlength);
    string res(cres, cres + outlength);
    curl_free(cres);
    curl_easy_cleanup(curl);
    return res;
}

The problem is that a%C3%A1e%C3%A9i%C3%ADo%C3%B3u%C3%BA gets decoded as aÃ¡eÃ©iÃ­oÃ³uÃº when it should be aáeéiíoóuú. If I use a%E1e%E9i%EDo%F3u%FA it works just fine.

Is there any library out there that can take care of differently encoded URIs and deal with them?

Thanks!

Answer 1:

There is nothing wrong with your decoding. The printing of the decoded URL is the problem. The output device that you print to is configured to accept strings encoded in ISO-8859-1, not in UTF-8.
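One way to confirm this is to look at the raw bytes instead of the rendered text. The sketch below is only an illustration: percent_decode is a hypothetical stand-in for curl_easy_unescape, and to_hex and hex_dump_example are helper names made up for this example. If the hex dump shows C3 A1 where you expect á, the decoded string is valid UTF-8 and only the display is wrong.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: minimal percent-decoder standing in for curl_easy_unescape.
// A '%' that is not followed by two hex digits is copied through unchanged.
std::string percent_decode(const std::string &in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size() &&
            std::isxdigit(static_cast<unsigned char>(in[i + 1])) &&
            std::isxdigit(static_cast<unsigned char>(in[i + 2]))) {
            // Turn the two hex digits into one byte
            out.push_back(static_cast<char>(std::stoi(in.substr(i + 1, 2), nullptr, 16)));
            i += 2;
        } else {
            out.push_back(in[i]);
        }
    }
    return out;
}

// Hypothetical helper: render each byte as two uppercase hex digits, space separated.
std::string to_hex(const std::string &bytes)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : bytes) {
        if (!out.empty()) out.push_back(' ');
        out.push_back(digits[c >> 4]);
        out.push_back(digits[c & 0x0F]);
    }
    return out;
}

// "a%C3%A1" decodes to 'a' (0x61) followed by C3 A1, the UTF-8 encoding of U+00E1.
void hex_dump_example()
{
    assert(to_hex(percent_decode("a%C3%A1")) == "61 C3 A1");
}
```

Printing to_hex(res) right after the curl_easy_unescape call in the question should show the same C3 xx pairs.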

Either configure the output device to accept strings encoded in UTF-8 or convert the decoded URL from UTF-8 to ISO-8859-1.
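For the second option, a library such as iconv is the usual route. As a rough, dependency-free sketch of what the conversion involves (utf8_to_latin1 and latin1_example are hypothetical names, not part of any library): ISO-8859-1 only covers code points U+0000 to U+00FF, so only ASCII bytes and two-byte UTF-8 sequences starting with 0xC2 or 0xC3 can be converted. This sketch assumes that throwing on anything else is acceptable.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert UTF-8 to ISO-8859-1 by hand.
// Throws std::invalid_argument for malformed input or for characters
// that do not exist in ISO-8859-1 (anything above U+00FF).
std::string utf8_to_latin1(const std::string &utf8)
{
    std::string out;
    for (std::size_t i = 0; i < utf8.size(); ++i) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {
            // Plain ASCII is identical in both encodings
            out.push_back(static_cast<char>(lead));
        } else if ((lead == 0xC2 || lead == 0xC3) && i + 1 < utf8.size()) {
            const unsigned char trail = static_cast<unsigned char>(utf8[i + 1]);
            if ((trail & 0xC0) != 0x80)
                throw std::invalid_argument("malformed UTF-8 continuation byte");
            // Two-byte sequence: low 2 bits of the lead byte, low 6 bits of the trail byte
            out.push_back(static_cast<char>(((lead & 0x03) << 6) | (trail & 0x3F)));
            ++i;
        } else {
            throw std::invalid_argument("not representable in ISO-8859-1");
        }
    }
    return out;
}

// UTF-8 "C3 A1" (U+00E1) becomes the single ISO-8859-1 byte E1.
void latin1_example()
{
    assert(utf8_to_latin1("a\xC3\xA1") == "a\xE1");
}
```

Unlike iconv, this does not handle other target encodings or transliteration, so treat it as an explanation of the byte arithmetic rather than a replacement.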



Answer 2:

As Oswald said, the problem isn't with the decoding... but with the method I'm using for showing the string. As I don't really need to deal with UTF-8 strings, I'm going to go with his second suggestion and convert it to ISO-8859-1.

I borrowed the idea (and most of the code) from this answer: "Is there a way to convert from UTF8 to iso-8859-1?"

In order to do this, I added a dependency to iconv.

Here's my UriHelper.h

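#pragma once

#include <string>

using namespace std;

//C++ has no "static class" (that is a C# construct); a plain class with static members does the job
class UriHelper
{
public:
    static string Encode(const string &source);
    static string Decode(const string &encoded);
};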

And this is my UriHelper.cpp

#include "UriHelper.h"
#include <curl/curl.h>
#include <iconv.h>

string UriHelper::Encode(const string &source)
{
    CURL *curl = curl_easy_init();
    //curl_easy_escape returns NULL on failure; fall back to the input in that case
    char *cres = curl_easy_escape(curl, source.c_str(), static_cast<int>(source.length()));
    string res = cres ? string(cres) : source;
    curl_free(cres);
    curl_easy_cleanup(curl);
    return res;
}

string UriHelper::Decode(const string &encoded)
{
    CURL *curl = curl_easy_init();
    int outlength = 0;
    char *cres = curl_easy_unescape(curl, encoded.c_str(), static_cast<int>(encoded.length()), &outlength);
    if (cres == NULL) {
        //failed to decode, just return the input
        curl_easy_cleanup(curl);
        return encoded;
    }
    string res(cres, cres + outlength);
    curl_free(cres);
    curl_easy_cleanup(curl);

    //if it's UTF-8, convert it to ISO-8859-1. Based on https://stackoverflow.com/questions/11156473/is-there-a-way-to-convert-from-utf8-to-iso-8859-1/11156490#11156490
    iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");
    if (cd == (iconv_t)-1) {
        //conversion not available, just return the value received from curl
        return res;
    }

    //POSIX declares the input pointer as char **; some iconv implementations use const char ** instead
    char *in_buf = const_cast<char *>(res.data());
    size_t in_left = res.length();

    //ISO-8859-1 output is never longer than its UTF-8 input
    string output(res.length(), '\0');
    char *out_buf = &output[0];
    size_t out_left = output.length();

    while (in_left > 0) {
        if (iconv(cd, &in_buf, &in_left, &out_buf, &out_left) == (size_t)-1) {
            //failed to convert (e.g. the input was not UTF-8), just return the value received from curl
            iconv_close(cd);
            return res;
        }
    }
    iconv_close(cd);

    //keep only the bytes iconv actually wrote
    output.resize(output.length() - out_left);
    return output;
}