Extract characters bound by parentheses

2019-08-01 14:52发布

I have data imported from a .csv file. The first column contains character strings that contain text within parentheses. The data look like:

    symbol
    ___________________________________________
1 | Apollo Senior Floating Rate Fund Inc. (AFT)
2 | Apollo Tactical Income Fund Inc. (AIF)
3 | Altra Industrial Motion Corp. (AIMC)
4 | Allegion plc (ALLE) 
5 | Amphenol Corporation (APH)
6 | Ares Management Corporation (ARES)
7 | ARMOUR Residential REIT, Inc. (ARR)
8 | Banc of California, Inc. (BANC)
9 | BlackRock Resources (BCX)
10| Belden Inc (BDC)
...

I need to convert that column of data into a list such as:

    symbol2
    ___________________________________________
1 | AFT
2 | AIF
3 | AIMC
4 | ALLE
5 | APH
6 | ARES
7 | ARR
8 | BANC
9 | BCX
10| BDC
...

My ultimate goal is to get a single character string where the text bound by parentheses are separated by a ";" like this:

"AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC;..."

I can do this last step with

paste(symbol2, collapes = ";")

but I can't figure out how to isolate the desired text.

I've tried everything listed here (extract a substring in R according to a pattern) by replacing the ":" with "(" and could not get anything to work. I tried:

gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", symbol, perl=T)

as recommended here (Extract text in parentheses in R), but the output is

"c(4, 5, 2, 1, 3, 6, 7, 8, 17, 9,...)"

Any help?

3条回答
Lonely孤独者°
2楼-- · 2019-08-01 15:03

We can extract the contents using str_extract from stringr

library(stringr)
symbol2 <- str_extract(df$symbol, "(?<=\\().+?(?=\\))")
symbol2
#[1] "AFT"  "AIF"  "AIMC" "ALLE" "APH"  "ARES"

Regex taken from here.

You can then paste them together

paste(symbol2, collapse = ";")
#[1] "AFT;AIF;AIMC;ALLE;APH;ARES"
查看更多
Bombasti
3楼-- · 2019-08-01 15:06

1) read.table Use read.table with the indicated sep and comment values to get a 2 column data frame in which the first column is the names and the second column is the symbols. Finally take that second column and collapse it into a single string. No packages or regular expressions are used.

DF2 <- read.table(text = unlist(DF), sep = "(", comment = ")")
paste(DF2[[2]], collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"

2) dplyr We can use separate from tidyr to separate the name and symbol columns dropping the name column at the same time. unlist that and collapse it into a single string. tidyr 0.8.2 or later must be used.

library(dplyr)
library(tidyr)

DF %>%
  separate(symbol, c(NA, "symbol2"), "[()]", extra = "drop") %>%
  unlist %>%
  paste(collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"

3) gsub We can match everything up to and including (, i.e. ".*\\(" and also everything from ) onwards, i.e. "\\).*" and replace those with the empty string. Then collapse as before.

paste(gsub(".*\\(|\\).*", "", DF$symbol), collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"

4) trimws This is another base solution. It requires R 3.6.0 or later (currently r-devel). We define whitespace as anything other than parentheses and use trimws to trim it away. Then we define whitespace as parentheses and then trim that away. That leaves us with the symbols which we can now collapse.

paste(trimws(trimws(DF$symbol, white = "[^()]"), white = "[()]"), collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"

Note

The input in reproducible form is:

Lines <- "
    symbol
1 | Apollo Senior Floating Rate Fund Inc. (AFT)
2 | Apollo Tactical Income Fund Inc. (AIF)
3 | Altra Industrial Motion Corp. (AIMC)
4 | Allegion plc (ALLE) 
5 | Amphenol Corporation (APH)
6 | Ares Management Corporation (ARES)
7 | ARMOUR Residential REIT, Inc. (ARR)
8 | Banc of California, Inc. (BANC)
9 | BlackRock Resources (BCX)
10| Belden Inc (BDC)"
DF <- read.table(text = Lines, sep = "|", strip.white = TRUE, as.is = TRUE)
查看更多
你好瞎i
4楼-- · 2019-08-01 15:24

Here is an option using base R's sub and a capture group

df$symbol2 <- sub(".+\\((\\w+)\\)$", "\\1", df$V1)
df
#                                            V1 symbol2
#1  Apollo Senior Floating Rate Fund Inc. (AFT)     AFT
#2       Apollo Tactical Income Fund Inc. (AIF)     AIF
#3         Altra Industrial Motion Corp. (AIMC)    AIMC
#4                          Allegion plc (ALLE)    ALLE
#5                   Amphenol Corporation (APH)     APH
#6           Ares Management Corporation (ARES)    ARES
#7          ARMOUR Residential REIT, Inc. (ARR)     ARR
#8              Banc of California, Inc. (BANC)    BANC
#9                    BlackRock Resources (BCX)     BCX
#10                            Belden Inc (BDC)     BDC 

Sample data

df <- read.table(text =
"'Apollo Senior Floating Rate Fund Inc. (AFT)'
'Apollo Tactical Income Fund Inc. (AIF)'
'Altra Industrial Motion Corp. (AIMC)'
'Allegion plc (ALLE)'
'Amphenol Corporation (APH)'
'Ares Management Corporation (ARES)'
'ARMOUR Residential REIT, Inc. (ARR)'
'Banc of California, Inc. (BANC)'
'BlackRock Resources (BCX)'
'Belden Inc (BDC)'", header = F)
查看更多
登录 后发表回答