str_extract specific patterns (example)

Posted 2020-03-04 06:31

I'm still a little confused by regex syntax. Can you please help me with these patterns:

_A00_A1234B_
_A00_A12345B_
_A1_A12345_

My approaches so far:

vapply(strsplit(files, "[_.]"), function(files) files[nchar(files) == 7][1], character(1))

or

str_extract(str2, "[A-Z][0-9]{5}[A-Z]")

The expected outputs are

A1234B
A12345B
A12345

Thanks!

Tags: regex r
4 answers
够拽才男人
#2 · 2020-03-04 07:13

You can do this without using a regular expression ...

x <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')
sapply(strsplit(x, '_', fixed = TRUE), '[', 3)
# [1] "A1234B"  "A12345B" "A12345" 

If you insist on using a regular expression, the following will suffice.

regmatches(x, regexpr('[^_]+(?=_$)', x, perl = TRUE))
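
To see what the lookahead is doing, here is a minimal sketch in base R. `[^_]+` matches a run of non-underscore characters, and `(?=_$)` requires that run to be followed by the final underscore without consuming it. The fourth input, `'no_trailing'`, is a made-up example (not from the question) that shows how `regmatches` drops elements that have no match.

```r
x <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')

# [^_]+   one or more characters that are not underscores
# (?=_$)  lookahead: the next character must be the final underscore
m <- regexpr('[^_]+(?=_$)', x, perl = TRUE)
last_field <- regmatches(x, m)
print(last_field)

# Hypothetical input without a trailing underscore: regexpr() returns -1
# for it, and regmatches() silently drops that element from the result.
y <- c(x, 'no_trailing')
m2 <- regexpr('[^_]+(?=_$)', y, perl = TRUE)
n_matched <- length(regmatches(y, m2))
print(n_matched)
```

Because unmatched elements are dropped, the result can be shorter than the input, which matters if you plan to assign it back to a data frame column.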
何必那么认真
#3 · 2020-03-04 07:16

Using rex to construct the regular expression may make it more understandable.

library(rex)

x <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")

# approach #1, assumes the target is always between the second and third underscores.
re_matches(x,
  rex(
    "_",
    anything,
    "_",
    capture(anything),
    "_"
  )
)

#>         1
#> 1  A1234B
#> 2 A12345B
#> 3  A12345


# approach #2, assumes a letter followed by 4 or 5 digits, with an optional trailing letter.
re_matches(x,
  rex(
    capture(
      alpha,
      between(digit, 4, 5),
      maybe(alpha)
    )
  )
)

#>         1
#> 1  A1234B
#> 2 A12345B
#> 3  A12345
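
For comparison, here is a rough base R sketch of the same two approaches with hand-written patterns. The pattern strings are my assumption about what the rex calls roughly correspond to, not output copied from rex. `regexec` returns the full match followed by each capture group, so element 2 is the captured field.

```r
x <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")

# Assumed equivalent of approach #1:
# underscore, anything, underscore, captured field, underscore.
m1 <- regmatches(x, regexec("_.*_(.*)_", x))
field1 <- vapply(m1, function(m) m[2], character(1))
print(field1)

# Assumed equivalent of approach #2:
# a letter, 4 or 5 digits, then an optional trailing letter.
m2 <- regmatches(x, regexec("([[:alpha:]][[:digit:]]{4,5}[[:alpha:]]?)", x))
field2 <- vapply(m2, function(m) m[2], character(1))
print(field2)
```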
狗以群分
#4 · 2020-03-04 07:24
vec <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_")

You can use sub and this regex:

sub(".*([A-Z]\\d{4,5}[A-Z]?).*", "\\1", vec)
# [1] "A1234B"  "A12345B" "A12345" 
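
One caveat with `sub`: when the pattern does not match, it returns the input string unchanged rather than `NA`. The sketch below guards against that with `grepl`. The fourth input, `"_no_code_here_"`, is a made-up example that is not in the question.

```r
# The last element is a hypothetical input that contains no code.
vec <- c("_A00_A1234B_", "_A00_A12345B_", "_A1_A12345_", "_no_code_here_")
pattern <- ".*([A-Z]\\d{4,5}[A-Z]?).*"

res <- sub(pattern, "\\1", vec)
# At this point res[4] is still "_no_code_here_", because sub() leaves
# non-matching strings untouched.

# Replace the non-matching entries with NA so they are easy to spot.
hit <- grepl(pattern, vec)
res[!hit] <- NA_character_
print(res)
```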
贼婆χ
#5 · 2020-03-04 07:25

You can try

library(stringr)
str_extract(str2, "[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B"  "A12345B" "A12345" 

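Here, the pattern looks for a capital letter [A-Z], followed by 4 or 5 digits [0-9]{4,5}, optionally followed by a capital letter [A-Z]?. The trailing ? makes the last letter optional, which is why A12345 (with no trailing letter) still matches.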
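
As a quick check of why the pattern in the question misses two of the three inputs, this sketch compares both patterns using base R's `grepl` (used here instead of stringr only to keep the example dependency-free). The variable names `strict` and `relaxed` are mine.

```r
str2 <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')

# Pattern from the question: exactly 5 digits and a mandatory trailing letter.
strict <- '[A-Z][0-9]{5}[A-Z]'
# Relaxed pattern: 4 or 5 digits, trailing letter optional.
relaxed <- '[A-Z][0-9]{4,5}[A-Z]?'

strict_hit <- grepl(strict, str2)
relaxed_hit <- grepl(relaxed, str2)

print(strict_hit)   # only the second string has 5 digits plus a letter
print(relaxed_hit)  # all three strings match
```

The first string fails the strict pattern because it has only 4 digits, and the third fails because it has no trailing letter.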

Or you can use stringi, which may be faster:

library(stringi)
stri_extract(str2, regex = "[A-Z][0-9]{4,5}[A-Z]?")
#[1] "A1234B"  "A12345B" "A12345"

Or a base R option would be

regmatches(str2, regexpr('[A-Z][0-9]{4,5}[A-Z]?', str2))
#[1] "A1234B"  "A12345B" "A12345"

data

str2 <- c('_A00_A1234B_', '_A00_A12345B_', '_A1_A12345_')