Remove punctuation but keep emoticons?

Posted 2019-02-08 23:11

我命由我不由天

Question:

Is it possible to remove all punctuation while keeping emoticons such as

:-(

structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label =     c("ãããæããããéãããæãããInappropriate announce:-(", 
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something     you are working to fix?", 
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)", 
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D", 
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...", 
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L, 
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54", 
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text", 
"created"), class = "data.frame", row.names = c(NA, -6L))

Answer 1:

Here's an approach that is less sophisticated and likely slower than @gagolews's solution. It requires that you feed it an emoticon dictionary; you can create one yourself or use the one in the qdapDictionaries package. The basic approach converts the emoticons to placeholder text that couldn't be mistaken for anything else (I use the EMOTICONREPLACE prefix to ensure this). You then strip out the punctuation using qdap::strip and convert the placeholders back into emoticons via mgsub:

library(qdap)
#reps <- emoticon
emos <- c(":-(", ":)", ":D", ":p", "X-(")
reps <- data.frame(seq_along(emos), emos)

reps[, 1] <- paste0("EMOTICONREPLACE", reps[, 1])
dat$Temp <- mgsub(as.character(reps[, 2]), reps[, 1], dat[, 1])
dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]), 
    strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE))

View it:

truncdf(left_just(dat[, 3, drop=F]), 50)

##   Temp                                              
## 1 RT AirAsia ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í No
## 2 You know there is a problem when customer service 
## 3 ãããæããããéãããæãããInappropriate announce:-(         
## 4 AirAsia your direct debit Maybank payment gateways
## 5 xdek ke flight AirAsia Malaysia to LA hahah:p bagi
## 6 AirAsia Apart from the slight delay and shortage o

EDIT: To keep ? and ! as requested, pass the char.keep argument to the strip function:

dat$Temp <- mgsub(reps[, 1], as.character(reps[, 2]), 
    strip(dat$Temp, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))
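
For readers without qdap, the placeholder idea can be sketched in base R. This is only an illustration under stated assumptions: keep_emoticons is a hypothetical helper name, the placeholders are made up for this example, and punctuation is removed with the POSIX class [[:punct:]] rather than with qdap::strip, so the result may differ from qdap's in details such as whitespace handling.

```r
# Hypothetical helper (not part of qdap): protect emoticons with placeholders,
# strip punctuation, then restore the emoticons.
keep_emoticons <- function(x, emos) {
  x <- as.character(x)
  # Replace longer emoticons first so that ":-(" is handled before ":(" etc.
  emos <- emos[order(nchar(emos), decreasing = TRUE)]
  # Fixed-width, punctuation-free placeholders that are unlikely to occur in text.
  holders <- sprintf("EMOTICONREPLACE%03dX", seq_along(emos))
  for (i in seq_along(emos)) x <- gsub(emos[i], holders[i], x, fixed = TRUE)
  x <- gsub("[[:punct:]]", "", x)   # remove the remaining punctuation
  for (i in seq_along(emos)) x <- gsub(holders[i], emos[i], x, fixed = TRUE)
  x
}

emos <- c(":-(", ":)", ":D", ":p", "X-(")
res <- keep_emoticons(c("Kudos :)", "hahah..:p bagi la, promo!", "queue. X-("), emos)
print(res)
## [1] "Kudos :)"              "hahah:p bagi la promo" "queue X-("
```

The fixed-width numbering avoids one placeholder being a prefix of another (for example 1 versus 10) when the dictionary grows beyond nine entries.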

Answer 2:

1. A working pure-regex solution (a.k.a. Edit#2)

This task can be done purely with regular expressions (many thanks to @Mike Samuel)

First we build a database of emoticons:

library(stringi)
(emots <- as.character(outer(c(":", ";", ":-", ";-"),
               c(")", "(", "]", "[", "D", "o", "O", "P", "p"), stri_paste)))
## [1] ":)"  ";)"  ":-)" ";-)" ":("  ";("  ":-(" ";-(" ":]"  ";]"  ":-]" ";-]" ":["  ";["  ":-[" ";-[" ":D"  ";D"  ":-D" ";-D"
## [21] ":o"  ";o"  ":-o" ";-o" ":O"  ";O"  ":-O" ";-O" ":P"  ";P"  ":-P" ";-P" ":p"  ";p"  ":-p" ";-p"

An exemplary input text:

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

A helper function that escapes some special characters so that they may be used in a regex pattern (using the stringi package):

library(stringi)
escape_regex <- function(r) {
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

A regular expression to match the emoticons:

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

Now, as @Mike Samuel suggested below, we just match (emoticon)|punctuation (note that emoticons are in a capturing group) and then replace the matches with the result of capturing group 1 (so if it's an emoticon, we have replacement=this emoticon, if it's a punctuation char, we have replacement=nothing). This will work because the alternation with | in ICU Regex (which is the regex engine used in stri_replace_all_regex) is greedy and left-biased: emoticons will be matched earlier than the punctuation characters.

stri_replace_all_regex(text, stri_c(regex1, "|\\p{P}"), "$1")
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-)  and the salesperson said Oh boy"

BTW, if you want to remove only a selected set of characters, use e.g. [.,] instead of \\p{P} above.
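
The same capture-group trick can also be sketched without stringi, using base R's gsub with perl = TRUE. This is a minimal sketch, not the answer's own code: it assumes the POSIX class [[:punct:]] is an acceptable stand-in for \\p{P} (the two classes differ for symbols such as + or $), and the shortened sample string is made up for illustration.

```r
# Emoticon dictionary, built as in the answer but with base paste0.
emots <- as.character(outer(c(":", ";", ":-", ";-"),
                            c(")", "(", "]", "[", "D", "o", "O", "P", "p"), paste0))

# Escape the brackets and parentheses that are regex metacharacters.
escape_regex <- function(x) gsub("([\\[\\]()])", "\\\\\\1", x, perl = TRUE)

# (emoticon)|punctuation: group 1 is set only when an emoticon matched.
pattern <- paste0("(", paste(escape_regex(emots), collapse = "|"), ")|[[:punct:]]")

text <- "LOL :) I've been to... the (grocery) st{o}re :P :-) --- Oh boy!"
result <- gsub(pattern, "\\1", text, perl = TRUE)
print(result)
## [1] "LOL :) Ive been to the grocery store :P :-)  Oh boy"
```

Because the emoticon alternatives come first in the pattern, an emoticon is matched as a whole before its individual characters can be matched as punctuation.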

2. Regex solution hint - my first (not wise) attempt (a.k.a. original answer)

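My very first idea (left here mainly for "historical reasons") was to approach this problem by using look-aheads and look-behinds, but - as you see - that's far from perfect. Note that the outputs shown in this section do not exactly match the sample text above (they contain :8 and lack the brackets), so they appear to come from an earlier version of the input.

To remove all : and ; not followed by ), P, (, D, X, 8, [, or ], use a negative look-ahead: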

stri_replace_all_regex(text, "[:;](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P -) --- and the salesperson said Oh boy!"

Now we can add some old-school emoticons (with noses, e.g. :-), ;-D etc.)

stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-) --- and the salesperson said Oh boy!"

Now remove the hyphens as well (negative look-behind and look-ahead):

stri_replace_all_regex(text, "[:;](?![-]?[)P(DX8\\[\\]])|(?<![:;])[-](?![)P(DX8\\[\\]])", "")
## [1] ":) :8 ;P :] :) ;D :( LOL :) I've been to... the grocery store :P :-)  and the salesperson said Oh boy!"

and so on. Of course, first you should build your own database of emoticons (to leave as they are) and punctuation marks (to remove). The regex will highly depend on these two sets, so it will be difficult to add new emoticons --- it's definitely not worth applying (and may twist your brain).

3. The second attempt (regex-dumb friendlier, a.k.a. Edit#1)

On the other hand, if you're allergic to complex regexes, try this. This approach has some "didactic benefits" - we have full insight on what's being done in each of the following steps:

1. Locate all emoticons within text;
2. Locate all the punctuation characters within text;
3. Find positions of the punctuation characters that are not parts of emoticons;
4. Remove the characters located in step 3.

An exemplary input text - 1 string only - a generalized case is left as an exercise ;)

text <- ":) ;P :] :) ;D :( LOL :) I've been to... the (grocery) st{o}re :P :-) --- and the salesperson said: Oh boy!"

A helper function that escapes some special characters so that they may be used in a regex:

escape_regex <- function(r) {
   library("stringi")
   stri_replace_all_regex(r, "\\(|\\)|\\[|\\]", "\\\\$0")
}

A regular expression to match the emoticons:

(regex1 <- stri_c("(", stri_c(escape_regex(emots), collapse="|"), ")"))
## [1] "(:\\)|;\\)|:-\\)|;-\\)|:\\(|;\\(|:-\\(|;-\\(|:\\]|;\\]|:-\\]|;-\\]|:\\[|;\\[|:-\\[|;-\\[|:D|;D|:-D|;-D|:o|;o|:-o|;-o|:O|;O|:-O|;-O|:P|;P|:-P|;-P|:p|;p|:-p|;-p)"

Locate the start and end positions of all the emoticons (i.e. locate the first OR the second OR ... emoticon):

where_emots <- stri_locate_all_regex(text, regex1)[[1]] # only for the first string of text
print(where_emots)
##       start end
##  [1,]     1   2
##  [2,]     4   5
##  [3,]     7   8
##  [4,]    10  11
##  [5,]    13  14
##  [6,]    16  17
##  [7,]    23  24
##  [8,]    64  65
##  [9,]    67  69

Locate all the punctuation chars (Here \\p{P} is the Unicode character class representing punctuation characters):

where_punct <- stri_locate_all_regex(text, "\\p{P}")[[1]]
print(where_punct)
##       start end
##  [1,]     1   1
##  [2,]     2   2
##  [3,]     4   4
##  [4,]     7   7
##  [5,]     8   8
## ...
## [26,]    72  72
## [27,]    73  73
## [28,]    99  99
## [29,]   107 107

As some punctuation chars occur within the emoticons, we should not stage them for removal:

which_punct_omit <- sapply(1:nrow(where_punct), function(i) {
   any(where_punct[i,1] >= where_emots[,1] &
        where_punct[i,2] <= where_emots[,2]) })
where_punct <- where_punct[!which_punct_omit,] # update where_punct
print(where_punct)
##       start end
##  [1,]    27  27
##  [2,]    38  38
##  [3,]    39  39
##  [4,]    40  40
##  [5,]    46  46
##  [6,]    54  54
##  [7,]    58  58
##  [8,]    60  60
##  [9,]    71  71
## [10,]    72  72
## [11,]    73  73
## [12,]    99  99
## [13,]   107 107

Each punctuation mark surely consists only of 1 character, thus always where_punct[,1]==where_punct[,2].

Now the final part. As you see, where_punct[,1] contains the positions of characters to be removed. IMHO the easiest way to do that (without loops) is by converting a string to UTF-32 (each character == 1 integer), remove undesired elements, and then go back to the textual representation:

text_tmp <- stri_enc_toutf32(text)[[1]]
print(text_tmp) # here - just ASCII codes...
## [1]  58  41  32  59  80  32  58  93  32  58....
text_tmp <- text_tmp[-where_punct[,1]] # removal, but be sure that where_punct is not empty!

And the result is:

stri_enc_fromutf32(text_tmp)
## [1] ":) ;P :] :) ;D :( LOL :) Ive been to the grocery store :P :-)  and the salesperson said Oh boy"

Here you are.
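
One way the generalized (vector) case might look is sketched below in base R. It follows the same four steps, but marks protected positions with a logical mask instead of converting to UTF-32. The helper name strip_punct_keep_emots is hypothetical, [[:punct:]] stands in for \\p{P}, and the sketch assumes the input contains no NA values.

```r
emots <- as.character(outer(c(":", ";", ":-", ";-"),
                            c(")", "(", "]", "[", "D", "o", "O", "P", "p"), paste0))
escape_regex <- function(x) gsub("([\\[\\]()])", "\\\\\\1", x, perl = TRUE)
emot_regex <- paste(escape_regex(emots), collapse = "|")

# Hypothetical helper: drop every punctuation character lying outside an emoticon.
strip_punct_keep_emots <- function(x, emot_regex) {
  x <- as.character(x)
  vapply(x, function(s) {
    chars <- strsplit(s, "")[[1]]                    # one element per character
    m <- gregexpr(emot_regex, s, perl = TRUE)[[1]]   # step 1: emoticon positions
    protected <- logical(length(chars))
    if (m[1] != -1) {
      len <- attr(m, "match.length")
      for (k in seq_along(m)) protected[seq(m[k], length.out = len[k])] <- TRUE
    }
    is_punct <- grepl("[[:punct:]]", chars)          # step 2: punctuation positions
    # steps 3 and 4: keep characters that are protected or are not punctuation
    paste(chars[protected | !is_punct], collapse = "")
  }, character(1), USE.NAMES = FALSE)
}

out <- strip_punct_keep_emots(c("Kudos :)", "wait... what?!", "st{o}re :P :-) ---"),
                              emot_regex)
print(out)
## [1] "Kudos :)"      "wait what"     "store :P :-) "
```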

Answer 3:

I added this functionality to qdap version > 2.0.0 as the sub_holder function. Basically this function uses the response I gave above but lightens the coding load. The sub_holder function takes a text vector and items you want to sub out (such as emoticons). It returns a list with:

The vector of the text with the items subbed out for placeholders
A function (called unhold) that swaps the placeholders back for the original terms

Here's the code:

emos <- c(":-(", ":)", ":D", ":p", "X-(")
(m <- sub_holder(emos, dat[,1]))
m$unhold(strip(m$output, digit.remove = FALSE, lower.case=FALSE, char.keep=c("!", "?")))

Answer 4:

Using rex may make this type of task a little simpler. It automatically escapes characters as necessary, and it ORs together all the elements of a vector passed to the or() function. re_matches() with the global argument returns a list of all the emoticons found in each line.

x = structure(list(text = structure(c(4L, 6L, 1L, 2L, 5L, 3L), .Label =     c("ãããæããããéãããæãããInappropriate announce:-(", 
"@AirAsia your direct debit (Maybank) payment gateways is not working. Is it something     you are working to fix?", 
"@AirAsia Apart from the slight delay and shortage of food on our way back from Phuket, both flights were very smooth. Kudos :)", 
"RT @AirAsia: ØØÙØÙÙÙÙ ÙØØØ ØØØÙ ÙØØØØÙ ØØØØÙÙÙí í Now you can enjoy a #great :D breakfast onboard with our new breakfast meals! :D", 
"xdek ke flight @AirAsia Malaysia to LA... hahah..:p bagi la promo murah2 sikit, kompom aku beli...", 
"You know there is a problem when customer service asks you to wait for 103 minutes and your no is 42 in the queue. X-("
), class = "factor"), created = structure(c(5L, 4L, 4L, 3L, 2L, 
1L), .Label = c("1/2/2014 16:14", "1/2/2014 17:00", "3/2/2014 0:54", 
"3/2/2014 0:58", "3/2/2014 1:28"), class = "factor")), .Names = c("text", 
"created"), class = "data.frame", row.names = c(NA, -6L))

emots <- as.character(outer(c(":", ";", ":-", ";-"), c(")", "(", "]", "[", "D", "o", "O", "P", "p"), paste0))

library(rex)
re_matches(x$text,
  rex(
    capture(name = 'emoticon',
      or(emots)
    )
  ),
  global = TRUE)

#>[[1]]
#>  emoticon
#>1       :D
#>2       :D
#>
#>[[2]]
#>  emoticon
#>1     <NA>
#>
#>[[3]]
#>  emoticon
#>1      :-(
#>
#>[[4]]
#>  emoticon
#>1     <NA>
#>
#>[[5]]
#>  emoticon
#>1       :p
#>
#>[[6]]
#>  emoticon
#>1       :)
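
For comparison, a similar extraction can be sketched in base R with gregexpr and regmatches. This is not the rex API: the escaping helper is written by hand here, and the short input vector is made up for illustration.

```r
emots <- as.character(outer(c(":", ";", ":-", ";-"),
                            c(")", "(", "]", "[", "D", "o", "O", "P", "p"), paste0))

# rex escapes automatically; in base R the metacharacters are escaped manually.
escape_regex <- function(x) gsub("([\\[\\]()])", "\\\\\\1", x, perl = TRUE)
pattern <- paste(escape_regex(emots), collapse = "|")

lines <- c("enjoy a #great :D breakfast! :D", "no emoticon here.", "Kudos :)")

# One list element per input string; character(0) where nothing matched.
found <- regmatches(lines, gregexpr(pattern, lines, perl = TRUE))
print(found[[1]])
## [1] ":D" ":D"
```

Unlike re_matches(), which reports NA for lines without a match, regmatches() returns an empty character vector for them.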