I'm cleaning text strings in R. I want to remove all punctuation except apostrophes and hyphens. This means I can't use the [:punct:] character class (unless there's a way of saying "[:punct:] but not ' or -").
So all of the following must come out: ! " # $ % & ( ) * + , . / : ; < = > ? @ [ \ ] ^ _ { | } ~ and the backtick.
For most of the above, escaping is not an issue. But for square brackets, I'm really having issues. Here's what I've tried:
gsub('[abc]', 'L', 'abcdef') #expected behaviour, shown as sanity check
# [1] "LLLdef"
gsub('[[]]', 'B', 'it[]') #only 1 substitution, ie [] treated as a single character
# [1] "itB"
gsub('[\[\]]', 'B', 'it[]') #single escape, errors as expected
Error: '\[' is an unrecognized escape in character string starting "'[\["
gsub('[\\[\\]]', 'B', 'it[]') #double escape, single substitution
# [1] "itB"
gsub('[\\]\\[]', 'B', 'it[]') #double escape, reversed order, NO substitution
# [1] "it[]"
I'd prefer not to use fixed=TRUE with gsub, since that would prevent me from using a character class. So, how do I include square brackets in a regex character class?
ETA additional trials:
gsub('[[\\]]', 'B', 'it[]') #double escape on closing ] only, single substitution
# [1] "itB"
gsub('[[\]]', 'B', 'it[]') #single escape on closing ] only, expected error
Error: '\]' is an unrecognized escape in character string starting "'[[\]"
ETA: the single substitution was caused by not setting perl=T in my gsub calls, i.e.:
gsub('[[\\]]', 'B', 'it[]', perl=T)
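For reference, here is a minimal sketch of how the perl=TRUE fix might extend to the full cleaning task described at the top. The helper names clean_punct and clean_punct_lookahead are hypothetical (not from base R), and the sample string is made up for illustration. The second helper assumes a negative lookahead is an acceptable way of saying "[:punct:] but not ' or -"; the last line shows the POSIX-style alternative of listing ] first in the class, which works with the default engine.

```r
# Sketch only: helper names and the sample string are illustrative assumptions.

# Explicit class: with perl = TRUE, \\[ , \\\\ and \\] are escaped literals
# inside the class; apostrophe and hyphen are simply left out of it.
clean_punct <- function(x) {
  gsub("[!\"#$%&()*+,./:;<=>?@\\[\\\\\\]^_`{|}~]", "", x, perl = TRUE)
}

# Lookahead variant: match any [:punct:] character that is not ' or -.
clean_punct_lookahead <- function(x) {
  gsub("(?!['-])[[:punct:]]", "", x, perl = TRUE)
}

x <- "it[]'s a well-known [test], isn't it? (yes!)"
clean_punct(x)
# [1] "it's a well-known test isn't it yes"
clean_punct_lookahead(x)
# [1] "it's a well-known test isn't it yes"

# Both brackets are now replaced individually:
gsub("[[\\]]", "B", "it[]", perl = TRUE)
# [1] "itBB"

# Default (non-perl) engine: put ] first in the class so it is taken literally.
gsub("[][]", "B", "it[]")
# [1] "itBB"
```

The lookahead version avoids hand-listing every symbol, at the cost of requiring perl=TRUE; the explicit class makes it easier to see exactly what gets removed.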