How to remove non-printable/invisible characters i

Sometimes I have evil non-printable characters in the middle of a string. These strings are user input, so my program must handle them gracefully instead of trying to fix the source of the problem.

For example, they can contain a zero-width no-break space in the middle of the string. While parsing a .po file, one problematic part was the string "he is a man of god" in the middle of the file. While everything seems correct, inspecting it with irb shows:

 "he is a \uFEFFman of god".codepoints
 => [104, 101, 32, 105, 115, 32, 97, 32, 65279, 109, 97, 110, 32, 111, 102, 32, 103, 111, 100]

I believe that I know what a BOM is, and I even handle it nicely. However, sometimes I have such characters in the middle of the file, so it is not a BOM.

My current approach is to remove all characters that I find evil, in a really smelly fashion:

text = (text.codepoints - CODEPOINTS_BLACKLIST).pack("U*")

The closest I got was following this post, which led me to the [[:print:]] class in regexps. However, it did not help:

"\uFEFFm".scan(/[[:print:]]/).join.codepoints
 => [65279, 109]

So the question is: how can I remove all non-printable characters from a string in Ruby?

Tags: ruby encoding non-printing-characters

3 answers

Viruses.

#2 · 2019-02-13 07:12

Try this:

>>"aaa\f\d\x00abcd".gsub(/[^[:print:]]/,'.')
=>"aaa.d.abcd"
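Note that [[:print:]] on its own would not catch the character from the question: U+FEFF is a format character (Unicode category Cf), and as the question shows, Ruby's regexp engine counts it as printable. Below is a minimal sketch, assuming you want to drop control (Cc) and format (Cf) characters while keeping tabs and line breaks. The names INVISIBLE and strip_invisible are hypothetical, not part of any library.

```ruby
# Hypothetical helper: matches Unicode control (Cc) and format (Cf)
# characters, except tab, line feed and carriage return, which the
# negative lookahead excludes.
INVISIBLE = /(?![\t\n\r])[\p{Cc}\p{Cf}]/

def strip_invisible(text)
  text.gsub(INVISIBLE, '')
end

dirty = "he is a \uFEFFman of god"
clean = strip_invisible(dirty)

puts clean
puts clean.codepoints.include?(0xFEFF)
```

Be aware that category Cf also contains characters that carry meaning in some text, such as the zero-width joiner (U+200D), so you may want to narrow the class if your input includes such content.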


你好瞎i

#3 · 2019-02-13 07:28

I was also having the same issue in ROR version 3.9.3, and I was using Visual Studio 2010 as my editor. Notepad++ solved my problem.

If you are using Notepad++ and the problem is in a UTF-8 file:

1. Open the file.
2. In the Encoding menu, select "Encode in UTF-8 without BOM", as shown in the screenshot.

[Screenshot showing the aforesaid menu item]

For more details, refer to this.


倾城　Initia

#4 · 2019-02-13 07:29

Ruby can help you convert from one multi-byte character set to another. Check these search results, and read up on Ruby String's encode method.

Also, Ruby's Iconv is your friend.

Finally, James Gray wrote a series of articles which cover this in good detail.

One of the things you can do using those tools is to tell them to transcode to a visually similar character, or ignore them completely.
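As a rough sketch of what "transcode or ignore" could look like, here is String#encode with its :undef, :replace and :fallback options (Iconv was dropped from the standard library in Ruby 2.0, so the sketch sticks to String#encode). The sample strings and the SIMILAR table are made up for illustration. Keep in mind that these options only affect characters the target encoding cannot represent, so this would not remove U+FEFF if the target were UTF-8 as well.

```ruby
text = "caf\u00E9 \uFEFFman"

# Ignore completely: drop every character US-ASCII cannot represent.
ascii_only = text.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "")
puts ascii_only

# Transcode to a visually similar character.
# SIMILAR is a hypothetical lookup table; a character missing from it
# would raise Encoding::UndefinedConversionError.
SIMILAR = { "\u00E9" => "e" }
similar = "caf\u00E9".encode("US-ASCII", fallback: SIMILAR)
puts similar
```

The first call silently throws data away, while the second fails loudly on anything the table does not cover, so which one fits depends on how much you trust the input.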

Dealing with alternate character sets is one of the most... irritating things I've ever had to do, because files can contain anything, but be marked as text. You might not expect it and then your code dies or starts throwing errors, because people are so ingenious when coming up with ways to insert alternate characters into content.
