Identify and remove strange characters

Posted 2019-09-18 03:34

Question:

What command can I use to identify and remove certain strange characters that form "words" such as:

í‰äó_
í€Œ¢í‰ä‰åí‰ä‹¢
ití€Œ¢í‰ä‰åí‰ä‹¢
í‰äóìgo

from a series of files? Those are some examples... I want to remove such occurrences.

Using the string module after you've gotten the data from the file:

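import string

# my_str holds the text already read from the file
final_str = ''
for char in my_str:
    # keep only printable ASCII characters (letters, digits, punctuation, whitespace)
    if char in string.printable:
        final_str += char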

Alternative one-liner:

''.join(char for char in my_str if char in string.printable)
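For a series of files, the same filter can be wrapped in a small helper. The following is a minimal sketch rather than a definitive implementation: the names `strip_non_printable` and `clean_files` are hypothetical, and it assumes the files are UTF-8 text that may safely be overwritten in place.

```python
import string
from pathlib import Path

# A set makes the membership test cheaper than scanning string.printable each time.
PRINTABLE = frozenset(string.printable)


def strip_non_printable(text):
    """Return text with every character outside string.printable removed."""
    return ''.join(ch for ch in text if ch in PRINTABLE)


def clean_files(paths, encoding='utf-8'):
    """Rewrite each file in place without non-printable characters.

    Assumption: overwriting the originals is acceptable. Bytes that are
    invalid in the given encoding become U+FFFD and are then filtered out.
    """
    for path in map(Path, paths):
        text = path.read_text(encoding=encoding, errors='replace')
        path.write_text(strip_non_printable(text), encoding=encoding)
```

With the samples from the question, `strip_non_printable('í‰äóìgo')` keeps only `go`. Working on copies of the files first is the safer choice, since the rewrite is destructive.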

Since you tagged shell and command-line, here you go:

$ tr -cd '[:graph:][:space:]' < foo.txt
_

it
go
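The tr command filters bytes rather than decoded characters, so it does not trip over badly encoded input. A rough Python counterpart is sketched below. It assumes the C/POSIX locale meaning of `[:graph:]` (ASCII 0x21 to 0x7E) and `[:space:]` (space, tab, newline, vertical tab, form feed, carriage return). The name `keep_graph_and_space` is hypothetical.

```python
# Bytes to keep: visible ASCII (0x21-0x7E) plus ASCII whitespace.
KEEP = bytes(range(0x21, 0x7F)) + b' \t\n\v\f\r'
# Every other byte value is deleted.
DELETE = bytes(b for b in range(256) if b not in KEEP)


def keep_graph_and_space(data):
    """Delete every byte that is neither visible ASCII nor ASCII whitespace."""
    # table=None means no byte is remapped; only the deletion set is applied.
    return data.translate(None, DELETE)
```

The input should be read in binary mode (`open(path, 'rb')`) so that no decoding happens before the filter runs.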

How about a regex substitution? Something like:

import re

clean_name = re.sub(r'[^a-zA-Z0-9\._-]', '', dirty_name)

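Add any other allowed characters to the character class. As written, the pattern also removes spaces and newlines; add \s to the class to keep them.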
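The substitution above keeps the ASCII remnants of a garbled word (for example, `go` from the last sample in the question). If the intent is to drop the whole word instead, which is only an assumption about what the question wants, one possible sketch removes every whitespace-delimited token that contains a non-ASCII character. The name `drop_garbled_tokens` is hypothetical.

```python
import re

# A run of non-whitespace characters containing at least one non-ASCII character.
GARBLED_TOKEN = re.compile(r'\S*[^\x00-\x7F]\S*')


def drop_garbled_tokens(text):
    """Remove whole tokens that contain any non-ASCII character."""
    return GARBLED_TOKEN.sub('', text)
```

The whitespace around a removed token is left untouched, so line structure is preserved. This approach also deletes legitimate non-ASCII words such as "café", so it only suits data that is expected to be pure ASCII.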

Tags: python shell command-line

Author: 一夜七次
