How to count CAPSLOCK in a string using R

Posted 2019-04-02 04:44

Question:

Each row of src$Review contains text in Russian. I want to count the CAPSLOCK in each row. For example, in "My apple is GREEN" I want to count not every capital letter but only the letters written in CAPSLOCK (only "GREEN", not "My"). In other words, letters should count only if at least two characters in a row are uppercase.

I currently have the following code in my script:

capscount <- str_count(src$Review, "[А-Я]")

It counts the total number of capital letters. I need only the number of characters that are in CAPSLOCK, meaning a character is counted only if at least two consecutive letters in a word (e.g., "GR" in "GREEN") are uppercase.

Thank you in advance.

Answer 1:

The pattern you are looking for is "\\b[A-Z]{2,}\\b". It matches a run of two or more consecutive capital letters with a word boundary, \\b, on each side, that is, a whole word written in capitals. That is the overall structure; substitute the Russian alphabet where necessary.

#test string. A correct count should be 1 0 2
x <- c("My GREEN", "My Green", "MY GREEN")

library(stringr)
str_count(x, "\\b[A-Z]{2,}\\b")
#[1] 1 0 2

library(stringi)
stri_count(x, regex="\\b[A-Z]{2,}\\b")
#[1] 1 0 2

#base R: gregexpr returns -1 when there is no match, so count positive positions
sapply(gregexpr("\\b[A-Z]{2,}\\b", x), function(m) sum(m > 0))
#[1] 1 0 2
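
The answer leaves the Cyrillic substitution to the reader. Below is a minimal sketch of that adaptation, assuming stringr, whose ICU regex engine treats Cyrillic letters as word characters for \\b. The `reviews` vector is a made-up stand-in for src$Review, and the two patterns are one possible way to fill in the alphabet, not the only one.

```r
library(stringr)

# Hypothetical sample data standing in for src$Review
reviews <- c("Моё яблоко ЗЕЛЁНОЕ", "Моё яблоко зелёное", "МОЁ ЯБЛОКО зелёное")

# Ё is listed explicitly because it lies outside the А-Я code point range
caps_words_ru <- str_count(reviews, "\\b[А-ЯЁ]{2,}\\b")
caps_words_ru
#[1] 1 0 2

# \\p{Lu} matches any uppercase letter, so one pattern covers Latin and Cyrillic
caps_words_any <- str_count(reviews, "\\b\\p{Lu}{2,}\\b")
caps_words_any
#[1] 1 0 2
```

The \\p{Lu} form is convenient when reviews mix alphabets, while the explicit [А-ЯЁ] class restricts the count to Russian capitals.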

Update

If you would like the number of characters in those words instead of the number of words:

sapply(str_extract_all(x, "\\b[A-Z]{2,}\\b"), function(m) sum(nchar(m)))
#[1] 5 0 7
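
For a total per string without extra packages, the character count can also be sketched in base R. The helper name `caps_chars` is hypothetical, and perl = TRUE is an assumption made here so that \\b and [A-Z] behave the same regardless of locale; as written it covers only Latin A-Z.

```r
# Hypothetical helper: total number of characters in all-caps words, per string
caps_chars <- function(text, pattern = "\\b[A-Z]{2,}\\b") {
  # regmatches() returns one character vector of matches per input string
  matches <- regmatches(text, gregexpr(pattern, text, perl = TRUE))
  # Sum the lengths of the matches; strings with no match give 0
  vapply(matches, function(m) sum(nchar(m)), integer(1))
}

x <- c("My GREEN", "My Green", "MY GREEN")
caps_chars(x)
#[1] 5 0 7
```

Because the function returns one integer per input string, the result can be assigned straight to a new data frame column.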


Answer 2:

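Use Pierre's regex with nchar and str_extract_all. With simplify = TRUE, str_extract_all returns a character matrix with one row per input string (padded with ""), so rowSums over nchar gives the number of CAPSLOCK characters in each string. Collapsing the whole matrix with paste0 instead would give a single total across all strings.

library(stringr)

string <- c("My apple is GREEN and Her Majesty's apricot is ORANGE", "I have a LARGE sword", "My baby is sick")

rowSums(
  nchar(
    str_extract_all(string = string, pattern = "\\b[A-Z]{2,}\\b", simplify = TRUE)
  )
)
#[1] 11  5  0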


Answer 3:

The qdapRegex package, which I maintain, has a regular expression for this. It is the same as the regex @Hugh used, but IMO it's nice to have lots of common regexes stored in a library that I can just grab. qdapRegex uses stringi as its backend, so stringi should already be available if you've installed qdapRegex.

On @Pierre Lafortune's string:

x <- c("My GREEN", "My Green", "MY GREEN")

library(qdapRegex)
stringi::stri_count_regex(x, grab("@rm_caps"))

## [1] 1 0 2

Let's look at the regex:

grab("@rm_caps")

## "(\\b[A-Z]{2,}\\b)"

On @Hugh's string:

x2 <- c("My apple is GREEN and Her Majesty's apricot is ORANGE", "I have a LARGE sword", "My baby is sick")
stringi::stri_count_regex(x2, grab("@rm_caps"))

## [1] 2 1 0