I'd like to separate column values using tidyr::separate and a regular expression, but I am new to regex.
df <- data.frame(A = c("enc0", "enc10", "enc25", "enc100", "harab0", "harab25", "harab100", "requi0", "requi25", "requi100"), stringsAsFactors = FALSE)
This is what I've tried:
library(tidyr)
df %>%
  separate(A, c("name", "value"), sep = "[a-z]+")
Bad output (the name column comes back empty):
   name value
1           0
2          10
3          25
4         100
5           0
# etc
How do I keep the name column as well?
For a base R version without a lookaround-based regex, define the regular expression first:
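The original snippet is not reproduced here, so the following is only a minimal sketch of this approach, assuming regexpr() is used to locate the match. The names regex and pos are illustrative. It covers both the pattern definition and the split done by the two substr() calls:

```r
# Sample data from the question
df <- data.frame(
  A = c("enc0", "enc10", "enc25", "enc100", "harab0",
        "harab25", "harab100", "requi0", "requi25", "requi100"),
  stringsAsFactors = FALSE
)

# One letter immediately followed by one digit marks the boundary
regex <- "[a-zA-Z][0-9]"

# regexpr() returns the start of the first match,
# which is the position of the last letter (assumed helper variable)
pos <- as.integer(regexpr(regex, df$A))

# Everything up to and including the last letter
df$name <- substr(df$A, 1, pos)
# Everything after the last letter
df$value <- substr(df$A, pos + 1, nchar(df$A))
```

This assumes every value contains a letter directly followed by a digit; regexpr() returns -1 when there is no match, and such rows would need separate handling.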
Then use two substr() calls to separate and return the two desired components, before and after the matched pattern. The regex here looks for the pattern "any alpha" [a-zA-Z] followed by "any numeric" [0-9]. I believe this is what the reshape command does if the sep argument is specified as "".

You could use the package unglue:
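The code for this answer is missing from the text, so here is a hedged sketch of how unglue might be applied, assuming unglue_unnest() and a pattern in which the name part is restricted to non-digits. The pattern and the name res are my own choices, not necessarily what the answer used:

```r
library(unglue)

# Sample data from the question
df <- data.frame(
  A = c("enc0", "enc10", "enc25", "enc100", "harab0",
        "harab25", "harab100", "requi0", "requi25", "requi100"),
  stringsAsFactors = FALSE
)

# {name=\\D+} captures one or more non-digit characters;
# {value} captures whatever remains of the string
res <- unglue_unnest(df, A, "{name=\\D+}{value}")
```

Both new columns are character vectors; passing convert = TRUE should turn value into a numeric column if that is preferred.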
Created on 2019-10-08 by the reprex package (v0.3.0)
If you really want to do it with separate, you can add one more step, although I don't see the point (using the same regex as @WiktorStribiżew):

You may use a (?<=[a-z])(?=[0-9]) lookaround-based regex with tidyr::separate:

The (?<=[a-z])(?=[0-9]) pattern matches a location in the string right in between a lowercase ASCII letter ((?<=[a-z])) and a digit ((?=[0-9])). The (?<=...) is a positive lookbehind that requires the presence of some pattern immediately to the left of the current location, and (?=...) is a positive lookahead that requires the presence of its pattern immediately to the right of the current location. Thus, the letters and digits are kept intact when splitting.

Alternatively, you may use extract:

Output:
The ^([a-z]+)(\\d+)$ pattern matches:
^ - start of input
([a-z]+) - Capturing group 1 (column name): one or more lowercase ASCII letters
(\\d+) - Capturing group 2 (column value): one or more digits
$ - end of string.
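
The two tidyr calls described in this answer can be sketched as follows. The regexes are the ones quoted above; the result names by_separate and by_extract are illustrative only:

```r
library(tidyr)

# Sample data from the question
df <- data.frame(
  A = c("enc0", "enc10", "enc25", "enc100", "harab0",
        "harab25", "harab100", "requi0", "requi25", "requi100"),
  stringsAsFactors = FALSE
)

# Split at the zero-width position between a lowercase letter and a digit,
# so neither the letters nor the digits are consumed by the separator
by_separate <- separate(df, A, c("name", "value"),
                        sep = "(?<=[a-z])(?=[0-9])")

# Capture the letters and the digits as two groups, one per new column
by_extract <- tidyr::extract(df, A, c("name", "value"),
                             regex = "^([a-z]+)(\\d+)$")
```

With extract, rows that do not match the whole pattern yield NA in both columns, which can make malformed values easier to spot than with a split.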