How Do You Write Code That Is Safe for UTF-8?

We have a set of applications that were developed for the ASCII character set. Now we're trying to install them in Iceland, and are running into problems where the Icelandic characters are getting screwed up.

We are working through our issues, but I was wondering: Is there a good "guide" out there for writing C++ code that is designed for 8-bit characters and which will work properly when UTF-8 data is given to it?

I can't expect everyone to read the whole Unicode standard, but if there is something more digestible available, I'd like to share it with the team so we don't run into these issues again.

Re-writing all the applications to use wchar_t or some other string representation is not feasible at this time. I'll also note that these applications communicate over networks with servers and devices that use 8-bit characters, so even if we did Unicode internally, we'd still have issues with translation at the boundaries. For the most part, these applications just pass data around; they don't "process" the text in any way other than copying it from place to place.

The operating systems used are Windows and Linux. We use std::string and plain-old C strings. (And don't ask me to defend any of the design decisions. I'm just trying to help fix the mess.)

Here is a list of what has been suggested:

Tags: c++ unicode utf-8 globalization

8 Answers

家丑人穷心不美

Reply #2 · 2019-03-13 07:13

Icelandic uses ISO Latin 1, so eight bits should be enough. We need more details to figure out what's happening.


我命由我不由天

Reply #3 · 2019-03-13 07:16

UTF-8 was designed with exactly your problems in mind. One thing I would be careful about is that ASCII is really a 7-bit encoding, so if any part of your infrastructure uses the 8th bit for other purposes, that may be tricky.
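To make the 8th-bit concern concrete, here is a minimal sketch. Both function names are hypothetical and only for illustration. `is_ascii` reports whether a string contains any byte with the high bit set, and `strip_high_bit` imitates the kind of legacy code that is not 8-bit clean. In UTF-8, every byte of a non-ASCII character has the high bit set, so code like the second function destroys such characters while leaving plain ASCII untouched.

```cpp
#include <string>

// Returns true if every byte is 7-bit ASCII (high bit clear).
bool is_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c & 0x80) {
            return false;
        }
    }
    return true;
}

// Hypothetical example of code that is NOT 8-bit clean:
// it masks every byte down to 7 bits, as some old protocols did.
std::string strip_high_bit(const std::string& s) {
    std::string out;
    out.reserve(s.size());
    for (unsigned char c : s) {
        out.push_back(static_cast<char>(c & 0x7F));
    }
    return out;
}
```

For instance, the Icelandic letter "í" (U+00ED) is the two bytes 0xC3 0xAD in UTF-8. `is_ascii` returns false for it, and `strip_high_bit` turns it into the two ASCII characters "C-". Running something like `is_ascii` over data at each hop is one way to find where the corruption starts.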


爱情/是我丢掉的垃圾

Reply #4 · 2019-03-13 07:20

This looks like a comprehensive quick guide:
http://www.cl.cam.ac.uk/~mgk25/unicode.html


时光不老，我们不散

Reply #5 · 2019-03-13 07:20

Be aware that full Unicode doesn't fit in 16-bit characters, so either use 32-bit characters or a variable-width encoding (UTF-8 is the most popular).
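As a rough illustration of what "variable-width" means, the sketch below encodes a single code point as UTF-8 and counts code points in a UTF-8 string. The function names are made up for this example, and `count_code_points` assumes its input is already valid UTF-8. It does no validation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for surrogates and values above U+10FFFF.
std::string encode_utf8(std::uint32_t cp) {
    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);                         // 0xxxxxxx
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));           // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));         // 10xxxxxx
    } else if (cp <= 0xFFFF) {
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;                                       // surrogate: not encodable
        }
        out += static_cast<char>(0xE0 | (cp >> 12));          // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {
        out += static_cast<char>(0xF0 | (cp >> 18));          // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Count code points by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes valid UTF-8 input.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

With this, `encode_utf8(0x41)` is the single byte "A", `encode_utf8(0xF0)` (the letter "ð") is the two bytes 0xC3 0xB0, and `encode_utf8(0x1F600)`, a code point that does not fit in 16 bits, takes four bytes. The practical consequence is that `std::string::size()` gives a byte count, not a character count.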


在下西门庆

Reply #6 · 2019-03-13 07:24

You may want to use wide characters (wchar_t instead of char and std::wstring instead of std::string). This doesn't automatically solve 100% of your problems, but it is a good first step.

Also use string functions that are Unicode-aware (refer to the documentation). If something manipulates wide characters or strings, it is generally aware that they are wide.


Melony?

Reply #7 · 2019-03-13 07:28

You might want to check out ICU. It may have functions that make working with UTF-8 strings easier.
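For code that only copies text around, a full library may be more than is needed. One operation that commonly breaks UTF-8 in pass-through code is truncating to a fixed byte limit, which can cut a multi-byte character in half. The helper below is a hand-rolled sketch of a safe truncation, not an ICU function. Its name is hypothetical, and it assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Truncate s to at most max_bytes bytes without splitting a
// multi-byte UTF-8 sequence. Assumes s is valid UTF-8.
std::string truncate_utf8(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) {
        return s;
    }
    std::size_t end = max_bytes;
    // s[end] is the first byte that would be dropped. If it is a
    // continuation byte (10xxxxxx), the cut falls inside a character,
    // so back up until the cut lands on a character boundary.
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80) {
        --end;
    }
    return s.substr(0, end);
}
```

Given "a" followed by "ð" (three bytes in total), truncating to two bytes returns just "a" rather than "a" plus a dangling 0xC3 lead byte. The result can be shorter than `max_bytes`, which is the price of never emitting a partial character.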

