Ordering a complex string vector in order to obtai

Posted 2019-02-21 00:25

Question:

I'm working with a string vector with a structure corresponding to the one below:

messy_vec <- c("0 - 9","100 - 150","21 - abc","50 - 56","70abc - 80")

I'd like to convert this vector to a factor whose levels are ordered according to the leading digit(s) of each element. The code:

messy_vec_fac <- as.factor(messy_vec)

would produce

> messy_vec_fac
[1] 0 - 9      100 - 150  21 - abc   50 - 56    70abc - 80
Levels: 0 - 9 100 - 150 21 - abc 50 - 56 70abc - 80

whereas I'd like to obtain a factor with the following characteristics:

[1] 0 - 9      100 - 150  21 - abc   50 - 56    70abc - 80
Levels: 0 - 9 21 - abc 50 - 56 70abc - 80 100 - 150

As indicated, the order of levels corresponds to the order:

0 21 50 70 100

which are the first digits derived from the elements of the messy vector.

Side points

This is not crucial, but ideally the solution would not assume a maximum number of digits in the first part of each element. Values such as the following may occur:

  • 8787abc - 89898 deff: in this case the value 8787 should determine the order
  • 001 def - 1111 OHMG: in this case the value 1 should determine the order
  • It can be safely assumed that every vector element contains the separator [[:space:]]-[[:space:]]
  • Duplicate values occur
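
The side points above can be sketched in base R as follows. This is a minimal illustration, not a definitive solution: the helper name `leading_number` and the sample vector `extra_vec` are hypothetical, and the sketch assumes that every element starts with at least one digit.

```r
# Hypothetical helper: return the leading run of digits of each element as a number.
leading_number <- function(x) {
  # regexpr() locates the first match of "^[0-9]+"; regmatches() extracts it.
  # Assumption: every element starts with a digit, so each one yields a match.
  as.numeric(regmatches(x, regexpr("^[0-9]+", x)))
}

# Made-up sample covering many digits, leading zeros and a duplicate.
extra_vec <- c("8787abc - 89898 deff", "001 def - 1111 OHMG", "50 - 56", "50 - 56")

keys <- leading_number(extra_vec)

# unique() drops repeated labels so that the levels are not duplicated.
extra_fac <- factor(extra_vec, levels = unique(extra_vec[order(keys)]))
levels(extra_fac)
```

Because the digits are converted with `as.numeric`, no limit on the number of digits is built into the pattern, and leading zeros are dropped by the conversion.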

Edits

Following the very useful suggestion by CathG, I'm trying to fit this solution into a larger dplyr pipeline:

# ... %>%
  mutate(very_needed_factor= factor(messy_vec,
                                      levels = messy_vec[
                                        order(
                                          as.numeric(
                                            sub("(\\d+)[^\\d]* - .*", "\\1",
                                                messy_vec)))]))
# %>% ...

But I keep getting the following warnings:

Warning messages:
1: In order(as.numeric(sub("(\\d+)[^\\d]* - .*", "\\1", c("12-14",  :
  NAs introduced by coercion
2: In `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels,  :
  duplicated levels in factors are deprecated

Answer 1:

If I understood correctly what you want to do, you can capture the first digits appearing in each string with sub and convert them to numeric; the result can then be used to order the levels in the factor call.

num_vec <- as.numeric(sub("(\\d+)[^\\d]* - .*", "\\1", messy_vec))
messy_vec_fac <- factor(messy_vec, levels=messy_vec[order(num_vec)])

messy_vec_fac
#[1] 0 - 9      100 - 150  21 - abc   50 - 56    70abc - 80
#Levels: 0 - 9 21 - abc 50 - 56 70abc - 80 100 - 150

NB: in case of duplicated values, you can use levels=unique(messy_vec[order(num_vec)]) in the factor call.
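
To illustrate the note on duplicates together with the coercion warning quoted in the question, here is a hedged sketch. The vector `dup_vec` is made up; it includes an element without spaces around the hyphen, like the `"12-14"` visible in the warning message. With the pattern above, an element that lacks ` - ` is returned unchanged by `sub`, and `as.numeric` then yields `NA` with a warning. The pattern below is a swapped-in alternative, not the one from this answer: it is anchored at the start of the string and does not depend on the separator.

```r
# Made-up data: one repeated label, written without spaces around the hyphen.
dup_vec <- c("100 - 150", "12-14", "21 - abc", "12-14", "70abc - 80")

# Alternative pattern (an assumption, not the one used above): keep only the
# leading digits and discard everything after them, whatever the separator is.
num_vec <- as.numeric(sub("^([0-9]+).*$", "\\1", dup_vec))

# unique() is applied after ordering so that each label appears once as a level.
dup_fac <- factor(dup_vec, levels = unique(dup_vec[order(num_vec)]))
levels(dup_fac)
```

Inside `mutate`, the same expression should work with the column name in place of `dup_vec`, although that has not been checked against the asker's data.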



Answer 2:

Here is another solution:

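library(magrittr)  # provides the %>% pipe
messy_vec <- c("0 - 9","100 - 150","21 - abc","50 - 56","70abc - 80")
# Split on "-", strip spaces and letters, and arrange the numbers in a 2-row matrix:
# row 1 holds the first number of each element, row 2 the second (NA if absent).
ints <- strsplit(messy_vec, "-") %>%
  unlist() %>%
  gsub(pattern = "([[:space:]]|[[:alpha:]])*", replacement = "") %>%
  as.integer() %>%
  matrix(nrow = 2)
# Order by the first number, breaking ties with the second.
factor(messy_vec, levels = messy_vec[order(ints[1, ], ints[2, ])])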
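
A base R variant of the same idea is sketched below. The original relies on every element splitting into exactly two pieces so that the matrix has two rows per element; this version splits on the full ` - ` separator and handles each half separately. The helper `first_int` and the result name `messy_vec_fac2` are hypothetical, and `unique()` is added to cope with duplicated labels.

```r
messy_vec <- c("0 - 9", "100 - 150", "21 - abc", "50 - 56", "70abc - 80")

# Hypothetical helper: first run of digits in a string as an integer, NA if none.
first_int <- function(s) {
  digits <- sub("^[^0-9]*([0-9]*).*$", "\\1", s)
  suppressWarnings(as.integer(digits))
}

# Split each element on the literal " - " separator (fixed = TRUE: no regex).
parts <- strsplit(messy_vec, " - ", fixed = TRUE)
lower <- first_int(vapply(parts, `[`, character(1), 1))
upper <- first_int(vapply(parts, `[`, character(1), 2))

# Order by the first number, breaking ties with the second one.
messy_vec_fac2 <- factor(messy_vec, levels = unique(messy_vec[order(lower, upper)]))
levels(messy_vec_fac2)
```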