I'm working with a string vector with a structure corresponding to the one below:
messy_vec <- c("0 - 9","100 - 150","21 - abc","50 - 56","70abc - 80")
I would like to convert this vector to a factor whose levels are ordered by the leading number in each element. The code:
messy_vec_fac <- as.factor(messy_vec)
would produce
> messy_vec_fac
[1] 0 - 9 100 - 150 21 - abc 50 - 56 70abc - 80
Levels: 0 - 9 100 - 150 21 - abc 50 - 56 70abc - 80
whereas I would like to obtain a factor with the following characteristics:
[1] 0 - 9 100 - 150 21 - abc 50 - 56 70abc - 80
Levels: 0 - 9 21 - abc 50 - 56 70abc - 80 100 - 150
As indicated, the order of levels corresponds to the order:
0 21 50 70 100
which are the first digits derived from the elements of the messy vector.
Side points
This is not crucial, but it would be good if the proposed solution did not assume a maximum number of digits in the first part of the vector elements. The following values may occur:
- 8787abc - 89898 deff - in this case the value 8787 should be used to determine the order
- 001 def - 1111 OHMG - in this case the value 1 should be used to determine the order
- It can be safely assumed that all vector elements contain the separator " - ", i.e. strings matching [[:space:]]-[[:space:]]
- Duplicate values occur
Edits
Following the very useful suggestion by CathG, I'm trying to fit this solution into a larger dplyr pipeline:
# ... %>%
mutate(very_needed_factor = factor(messy_vec,
                                   levels = messy_vec[
                                     order(
                                       as.numeric(
                                         sub("(\\d+)[^\\d]* - .*", "\\1",
                                             messy_vec)))]))
# %>% ...
But I keep getting the following warnings:
Warning messages:
1: In order(as.numeric(sub("(\\d+)[^\\d]* - .*", "\\1", c("12-14", :
NAs introduced by coercion
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
duplicated levels in factors are deprecated
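Both warnings can be reproduced outside dplyr, which suggests they come from the data rather than from the pipeline. The first warning mentions "12-14", an element without spaces around the hyphen, so the pattern (which requires " - ") does not match and the whole string is passed to as.numeric. The second comes from repeated values ending up in levels. The following is only a minimal base R sketch under those assumptions: the input vector, the names strict and first_num, and the looser pattern are all illustrative and not taken from the original code.

```r
# Hypothetical input: one element without spaces around the hyphen,
# plus one duplicated value.
messy_vec <- c("0 - 9", "100 - 150", "12-14", "21 - abc", "70abc - 80", "0 - 9")

# The original pattern needs " - ", so "12-14" is returned unchanged by sub()
# and becomes NA when converted (warning suppressed here on purpose).
strict <- suppressWarnings(
  as.numeric(sub("(\\d+)[^\\d]* - .*", "\\1", messy_vec))
)

# Assumed alternative: keep only the first run of digits, whatever follows.
first_num <- as.numeric(sub("^[^0-9]*([0-9]+).*$", "\\1", messy_vec))

# unique() removes repeated values so that the levels are not duplicated.
messy_fac <- factor(messy_vec, levels = unique(messy_vec[order(first_num)]))

levels(messy_fac)
```

The same looser pattern gives 8787 and 1 for "8787abc - 89898 deff" and "001 def - 1111 OHMG". Elements containing no digits at all would still be coerced to NA, so they would need separate handling.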
If I understood correctly what you want to do, you can capture the first digits appearing in each string with
sub
and convert them to numeric, so that they can then be used to order the levels in the factor call.
NB: in case of duplicated values, you can do
levels=unique(messy_vec[order(num_vec)])
in the factor call, where num_vec is the numeric vector of extracted digits.
Here is another solution