Identify and remove strange characters

Posted 2019-09-18 03:34

Question:

What command can I use to identify and remove certain strange characters that form "words" such as:

í‰äó_
퀌¢í‰ä‰åí‰ä‹¢
it퀌¢í‰ä‰åí‰ä‹¢
í‰äóìgo

from a series of files? Those are some examples... I want to remove such occurrences.

Answer 1:

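Use the string module once you've read the file's contents into a string (my_str below). Note that string.printable contains only ASCII characters, so this also drops legitimate accented letters: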

import string
final_str = ''
for char in my_str:
    if char in string.printable:
        final_str += char

Alternative one-liner:

''.join(char for char in my_str if char in string.printable)
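
As a quick check, the filter above can be wrapped in a function and run on the sample words from the question. This is only a sketch: the name strip_nonprintable is made up for illustration, and the set lookup is a small speed-up that is not part of the original answer.

```python
import string

# Build the lookup set once; membership tests on a set are O(1).
PRINTABLE = set(string.printable)

def strip_nonprintable(text):
    """Return text with every character outside string.printable removed."""
    return ''.join(ch for ch in text if ch in PRINTABLE)

# Sample "words" copied from the question.
samples = ['í‰äó_', 'it퀌¢í‰ä‰åí‰ä‹¢', 'í‰äóìgo']
for word in samples:
    print(repr(word), '->', repr(strip_nonprintable(word)))
```

Only the ASCII part of each word survives, so a word made entirely of strange characters becomes an empty string.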


Answer 2:

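Since you tagged this shell and command-line, here is a tr version. Quote the character classes so the shell does not try to expand them as a glob pattern:

$ tr -cd '[:graph:][:space:]' < foo.txt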
_

it
go
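
For a series of files, the same byte-level filtering can be sketched in Python. This is an approximation rather than an exact port of tr: it assumes the C-locale meaning of [:graph:] and [:space:] (ASCII only), the helper name clean_file is hypothetical, and it overwrites each file in place, so keep a backup.

```python
import string
from pathlib import Path

# Bytes to keep: ASCII graphic characters (0x21-0x7E) plus ASCII whitespace,
# roughly what [:graph:][:space:] match in the C locale.
KEEP = bytes(range(0x21, 0x7F)) + string.whitespace.encode('ascii')
# Every other byte value gets deleted.
DELETE = bytes(b for b in range(256) if b not in KEEP)

def clean_file(path):
    """Rewrite the file at path in place, dropping bytes not in KEEP."""
    path = Path(path)
    data = path.read_bytes()
    path.write_bytes(data.translate(None, DELETE))

# Hypothetical usage: clean every .txt file in the current directory.
# for p in Path('.').glob('*.txt'):
#     clean_file(p)
```

Working on raw bytes avoids having to guess each file's encoding, at the cost of also removing valid multi-byte UTF-8 characters.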


Answer 3:

How about a regex substitution? Something like:

import re

clean_name = re.sub(r'[^a-zA-Z0-9\._-]', '', dirty_name)

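Add any other characters you want to keep to the character class. As written, the pattern also strips whitespace; add \s to the class to keep it.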
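
If the goal is to identify the strange "words" and drop them whole, rather than strip individual characters, a different pattern may fit better. The sketch below is a variation and not the substitution shown above: it assumes a strange word is any whitespace-delimited token containing at least one non-ASCII character, and the names STRANGE_WORD, find_strange_words, and remove_strange_words are made up for illustration.

```python
import re

# A run of non-whitespace characters that contains at least one
# non-ASCII character (an assumption about what counts as "strange").
STRANGE_WORD = re.compile(r'\S*[^\x00-\x7F]\S*')

def find_strange_words(text):
    """List every whitespace-delimited token containing a non-ASCII character."""
    return STRANGE_WORD.findall(text)

def remove_strange_words(text):
    """Delete those tokens entirely, leaving the surrounding whitespace alone."""
    return STRANGE_WORD.sub('', text)
```

Unlike the character-level filters, this removes the ASCII remnants as well, so a token that mixes "it" with strange characters disappears completely instead of being reduced to "it".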