I have data imported from a .csv file. The first column contains character strings that contain text within parentheses. The data look like:
symbol
___________________________________________
1 | Apollo Senior Floating Rate Fund Inc. (AFT)
2 | Apollo Tactical Income Fund Inc. (AIF)
3 | Altra Industrial Motion Corp. (AIMC)
4 | Allegion plc (ALLE)
5 | Amphenol Corporation (APH)
6 | Ares Management Corporation (ARES)
7 | ARMOUR Residential REIT, Inc. (ARR)
8 | Banc of California, Inc. (BANC)
9 | BlackRock Resources (BCX)
10| Belden Inc (BDC)
...
I need to convert that column of data into a list such as:
symbol2
___________________________________________
1 | AFT
2 | AIF
3 | AIMC
4 | ALLE
5 | APH
6 | ARES
7 | ARR
8 | BANC
9 | BCX
10| BDC
...
My ultimate goal is to get a single character string where the text bound by parentheses are separated by a ";" like this:
"AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC;..."
I can do this last step with
paste(symbol2, collapes = ";")
but I can't figure out how to isolate the desired text.
I've tried everything listed here (extract a substring in R according to a pattern) by replacing the ":" with "(" and could not get anything to work. I tried:
gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", symbol, perl=T)
as recommended here (Extract text in parentheses in R), but the output is
"c(4, 5, 2, 1, 3, 6, 7, 8, 17, 9,...)"
Any help?
We can extract the contents using str_extract
from stringr
library(stringr)
symbol2 <- str_extract(df$symbol, "(?<=\\().+?(?=\\))")
symbol2
#[1] "AFT" "AIF" "AIMC" "ALLE" "APH" "ARES"
Regex taken from here.
You can then paste
them together
paste(symbol2, collapse = ";")
#[1] "AFT;AIF;AIMC;ALLE;APH;ARES"
Here is an option using base R's sub
and a capture group
df$symbol2 <- sub(".+\\((\\w+)\\)$", "\\1", df$V1)
df
# V1 symbol2
#1 Apollo Senior Floating Rate Fund Inc. (AFT) AFT
#2 Apollo Tactical Income Fund Inc. (AIF) AIF
#3 Altra Industrial Motion Corp. (AIMC) AIMC
#4 Allegion plc (ALLE) ALLE
#5 Amphenol Corporation (APH) APH
#6 Ares Management Corporation (ARES) ARES
#7 ARMOUR Residential REIT, Inc. (ARR) ARR
#8 Banc of California, Inc. (BANC) BANC
#9 BlackRock Resources (BCX) BCX
#10 Belden Inc (BDC) BDC
Sample data
df <- read.table(text =
"'Apollo Senior Floating Rate Fund Inc. (AFT)'
'Apollo Tactical Income Fund Inc. (AIF)'
'Altra Industrial Motion Corp. (AIMC)'
'Allegion plc (ALLE)'
'Amphenol Corporation (APH)'
'Ares Management Corporation (ARES)'
'ARMOUR Residential REIT, Inc. (ARR)'
'Banc of California, Inc. (BANC)'
'BlackRock Resources (BCX)'
'Belden Inc (BDC)'", header = F)
1) read.table Use read.table
with the indicated sep
and comment
values to get a 2 column data frame in which the first column is the names and the second column is the symbols. Finally take that second column and collapse it into a single string. No packages or regular expressions are used.
DF2 <- read.table(text = unlist(DF), sep = "(", comment = ")")
paste(DF2[[2]], collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"
2) dplyr We can use separate
from tidyr to separate the name and symbol columns dropping the name column at the same time. unlist
that and collapse it into a single string. tidyr 0.8.2 or later must be used.
library(dplyr)
library(tidyr)
DF %>%
separate(symbol, c(NA, "symbol2"), "[()]", extra = "drop") %>%
unlist %>%
paste(collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"
3) gsub We can match everything up to and including (, i.e. ".*\\("
and also everything from ) onwards, i.e. "\\).*"
and replace those with the empty string. Then collapse as before.
paste(gsub(".*\\(|\\).*", "", DF$symbol), collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"
4) trimws This is another base solution. It requires R 3.6.0 or later (currently r-devel). We define whitespace as anything other than parentheses and use trimws
to trim it away. Then we define whitespace as parentheses and then trim that away. That leaves us with the symbols which we can now collapse.
paste(trimws(trimws(DF$symbol, white = "[^()]"), white = "[()]"), collapse = ";")
## [1] "AFT;AIF;AIMC;ALLE;APH;ARES;ARR;BANC;BCX;BDC"
Note
The input in reproducible form is:
Lines <- "
symbol
1 | Apollo Senior Floating Rate Fund Inc. (AFT)
2 | Apollo Tactical Income Fund Inc. (AIF)
3 | Altra Industrial Motion Corp. (AIMC)
4 | Allegion plc (ALLE)
5 | Amphenol Corporation (APH)
6 | Ares Management Corporation (ARES)
7 | ARMOUR Residential REIT, Inc. (ARR)
8 | Banc of California, Inc. (BANC)
9 | BlackRock Resources (BCX)
10| Belden Inc (BDC)"
DF <- read.table(text = Lines, sep = "|", strip.white = TRUE, as.is = TRUE)