Sequence length encoding using R

2020-02-01 01:49发布

问题:

Is there a way to encode increasing integer sequences in R, analogous to encoding run lengths using run length encoding (rle)?

I'll illustrate with an example:

Analogy: Run length encoding

r <- c(rep(1, 4), 2, 3, 4, rep(5, 5))
rle(r)
Run Length Encoding
  lengths: int [1:5] 4 1 1 1 5
  values : num [1:5] 1 2 3 4 5

Desired: sequence length encoding

s <- c(1:4, rep(5, 4), 6:9)
s
[1] 1 2 3 4 5 5 5 5 6 7 8 9

somefunction(s)
Sequence lengths
  lengths: int [1:4] 5 1 1 5
  value1 : num [1:4] 1 5 5 5

Edit 1

Thus, somefunction(1:10) will give the result:

Sequence lengths
  lengths: int [1:1] 10
  value1 : num [1:1] 1 

This results means that there is an integer sequence of length 10 with starting value of 1, i.e. seq(1, 10)

Note that there isn't a mistake in my example result. The vector in fact ends in the sequence 5:9, not 6:9 which was used to construct it.

My use case is that I am working with survey data in an SPSS export file. Each subquestion in a grid of questions will have a name of the pattern paste("q", 1:5), but sometimes there is an "other" category which will be marked q_99, q_other or something else. I wish to find a way of identifying the sequences.

Edit 2

In a way, my desired function is the inverse of the base function sequence, with the start value, value1 in my example, added.

lengths <- c(5, 1, 1, 5)
value1 <- c(1, 5, 5, 5)

s
[1] 1 2 3 4 5 5 5 5 6 7 8 9
sequence(lengths) + rep(value1-1, lengths) 
[1] 1 2 3 4 5 5 5 5 6 7 8 9

Edit 3

I should have stated that for my purposes a sequence is defined as increasing integer sequences as opposed to monotonically increasing sequences, e.g. c(4,5,6,7) but not c(2,4,6,8) nor c(5,4,3,2,1). However, any other integer can appear between sequences.

This means a solution should be able to cope with this test case:

somefunction(c(2, 4, 1:4, 5, 5))
    Sequence lengths
      lengths: int [1:4] 1 1 5 1
      value1 : num [1:4] 2 4 1 5 

In the ideal case, the solution can also cope with the use case suggested originally, which would include characters in the vector, e.g.

somefunction(c(2, 4, 1:4, 5, "other"))
    Sequence lengths
      lengths: int [1:5] 1 1 5 1 1
      value1 : num [1:5] 2 4 1 5 "other"

回答1:

EDIT : added control to do the character vectors as well.

Based on rle, I come to following solution :

somefunction <- function(x){

    if(!is.numeric(x)) x <- as.numeric(x)
    n <- length(x)
    y <- x[-1L] != x[-n] + 1L
    i <- c(which(y|is.na(y)),n)

    list(
      lengths = diff(c(0L,i)),
      values = x[head(c(0L,i)+1L,-1L)]
    )

}

> s <- c(2,4,1:4, rep(5, 4), 6:9,4,4,4)

    > somefunction(s)
    $lengths
    [1] 1 1 5 1 1 5 1 1 1

    $values
    [1] 2 4 1 5 5 5 4 4 4

This one works on every test case I tried and uses vectorized values without ifelse clauses. Should run faster. It converts strings to NA, so you keep a numeric output.

> S <- c(4,2,1:5,5, "other" , "other",4:6,2)

> somefunction(S)
$lengths
[1] 1 1 5 1 1 1 3 1

$values
[1]  4  2  1  5 NA NA  4  2

Warning message:
In somefunction(S) : NAs introduced by coercion


回答2:

Here is my solution

diff_s = which(diff(s) != 1)
lengths = diff(c(0, diff_s, length(s)))
values  = s[c(1, diff_s + 1)]

EDIT: function to take care of strings too

sle2 = function(s){
  s2 = as.numeric(s)
  s2[is.na(s2)] = 100 + as.numeric(factor(s[is.na(s2)]))
  diff_s2 = which(diff(s2) != 1)
  lengths = diff(c(0, diff_s2, length(s)))
  values  = s[c(1, diff_s2 + 1)]
  return(list(lengths = lengths, values = values))
}

sle2(c(4,2,1:5,5, "other" , "other",4:6,2, "someother", "someother"))

lengths
 [1] 1 1 5 1 1 1 3 1 1 1

$values
 [1] "4"   "2"  "1"   "5"  "other" "other"  "4"   "2"  "someother" "someother"

Warning message:
In sle2(c(4, 2, 1:5, 5, "other", "other", 4:6, 2, "someother", "someother")) :
  NAs introduced by coercion


回答3:

You could use this for a start (given you s above):

s2<-c(0, diff(s))
s3<-ifelse((c(s2[-1], 0)==1) & (s2!=1), 1, s2)
rle(ifelse(s3==1, -1, seq_along(s3)))

It doesn't return the values yet, there are probably easy enough ways to adpat the code. At least you have the sequence lengths, so you can easily retrieve the starting values for the sequences.



回答4:

How about:

sle <- function(s)
{
    diffs <- which(diff(s)!=1)
    lengths <- c(diffs[1],diff(diffs),length(s)-diffs[length(diffs)])
    value1 <- s[c(1,diffs+1)]
    cat("", "Sequence Length Encoding\n", " lengths:")
    str(lengths)
    cat("  value1:")
    str(value1)
}


sle(s)
 Sequence Length Encoding
  lengths: int [1:4] 5 1 1 5
  value1: num [1:4] 1 5 5 5

sle(c(2,4,1:4,rep(5,4),6:9,4,4,4))
 Sequence Length Encoding
  lengths: int [1:9] 1 1 5 1 1 5 1 1 1
  value1: num [1:9] 2 4 1 5 5 5 4 4 4


回答5:

Here's an enhancement to Joris Meys's solution. Consider this a solution to a future problem :-) .

Carl

seqle <- function(x,incr=1) {
    if(!is.numeric(x)) x <- as.numeric(x)
    n <- length(x)
    #y <- x[-1L] != x[-n] + 1L
    y <- x[-1L] != x[-n] + incr
    i <- c(which(y|is.na(y)),n)
    list( lengths = diff(c(0L,i)),  values = x[head(c(0L,i)+1L,-1L)])
}


回答6:

"My use case is that I am working with survey data in an SPSS export file. Each subquestion in a grid of questions will have a name of the pattern paste("q", 1:5), but sometimes there is an "other" category which will be marked q_99, q_other or something else. I wish to find a way of identifying the sequences."

I usually do something like this when I'm pulling data from confirmit, DASH, SPSS, SAS, MySQL or whatever depending on the source it always gets punted into a data.frame():

surv.pull <- function(dat, pattern) {
  dat <- data.frame(dat[,grep(pattern,colnames(dat))],check.names=F)
return(dat)
}

If you use pattern like [q][_][9][9] you can decide to pull a data.frame of other data spaces by or not by adding "." to the end [q][_][9][9]. so that it pulls q_99whatever

Most of my data columns are in the form like this q8a.1, .3, .4, .5, .6, .7, .8, ... so surv.pull(dat, "[q][8][a].") would pull them all, including the other if there was a specify. Obviously, using regex you could decide whether or not to pull the other.

Alternatively, the general convention is to push other specify questions to the end of the data space, so a quick df <- df[-ncol(df)] would drop it or other_list <- df[ncol(df)] would save it.



标签: r encoding