I am cleaning up a text-based data file and cannot figure out why gsub("[[:punct:]]", "", X1)
does not match all punctuation. Unfortunately, I cannot reproduce the problem here, which makes me think it is a character encoding issue: the punctuation marks in question look obviously different from standard ASCII punctuation.
Is this a problem that I can solve after reading in the files, or do I have to do something at the front end? For example, Hadley's post on an encoding issue makes me think that I need to specify the encoding when I read the files. However, I am reading a bunch of different txt files from a folder, so I am not sure what the best solution is. Basically, I just want to retain all letters [A-Za-z] and exclude everything else. (That said, gsub("[^A-Za-z]", "", X1)
doesn't work either!)
Any suggestions on handling this problem would be greatly appreciated!
The punctuation characters are probably outside the ASCII range. By default
[[:punct:]]
contains only ASCII punctuation characters, but you can extend the class to Unicode with the (*UCP)
directive. That alone does not suffice: you also need to tell the regex engine to read the target string as UTF-8 with (*UTF)
(otherwise a multibyte encoded character will be seen as several one-byte characters). So:
gsub("(*UCP)(*UTF)[[:punct:]]", "", X1, perl = TRUE)
Note: these two directives exist only in perl mode (perl = TRUE) and must be at the very beginning of the pattern.
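To try the two directives on something concrete, here is a minimal sketch. The sample vector is made up for illustration (the real X1 comes from the asker's files), and the non-ASCII characters are written as \u escapes so the script does not depend on the source file's own encoding.

```r
# Hypothetical sample input: curly quotes, an en dash and an ellipsis,
# plus one purely ASCII string for comparison.
X1 <- c("\u201cHello\u201d \u2013 world\u2026", "plain, ASCII; text!")

# (*UCP) extends [[:punct:]] to Unicode punctuation,
# (*UTF) makes the engine read the input as UTF-8.
# Both must lead the pattern and both require perl = TRUE.
cleaned <- gsub("(*UCP)(*UTF)[[:punct:]]", "", X1, perl = TRUE)

print(cleaned)
```

Only the punctuation is removed, so the spaces that surrounded the dash in the first string are left behind.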
Note 2: you can do the same like this:
gsub("(*UTF)\\pP", "", X1, perl = TRUE)
Because
\pP
is a shorthand for all Unicode punctuation characters, (*UCP)
becomes unnecessary.
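One caveat when trying this: \pP covers only the Unicode punctuation category, whereas ASCII symbols such as "$" and "+" belong to the symbol category (\pS). The sketch below, on a made-up sample string, compares \pP alone, \pP together with \pS, and the letters-only pattern from the question. It assumes the input is valid UTF-8.

```r
# Hypothetical sample mixing Unicode punctuation with ASCII symbols.
X1 <- "\u201cprice\u201d: $5 + tax\u2026"

# \pP strips punctuation only; "$" and "+" are symbols and survive.
only_punct <- gsub("(*UTF)\\pP", "", X1, perl = TRUE)

# Adding \pS to the class strips symbols as well.
punct_and_symbols <- gsub("(*UTF)[\\pP\\pS]", "", X1, perl = TRUE)

# Keeping only ASCII letters, as the question ultimately asks for.
letters_only <- gsub("[^A-Za-z]", "", X1, perl = TRUE)

print(c(only_punct, punct_and_symbols, letters_only))
```

If the goal really is to keep nothing but A-Z and a-z, the last pattern is probably the simplest choice, provided the files were read with the correct encoding in the first place.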