I see this error every time I try to run seqdef on data that has already been converted to STS format using seqformat. A sample of my data frame looks like this:
head(df.new, 10)
   user_id orderdate         cart to
1        8         1      produce 30
2        8        31      produce 60
3        8        61      produce 70
4        8        71      produce 92
5       10         1      produce 30
6       10        31      produce 42
7       10        43 meat seafood 56
8       10        57         deli 77
9       17         1    beverages  3
10      17         4    beverages  8
It has a total of 14000 rows of orders, and some orders start and end on the same day for a user (i.e. orderdate == to). Below is the code I used to create the STS data that is passed to seqdef.
df.form <- seqformat(df.new, id='user_id', begin='orderdate', end='to', status='cart', from='SPELL', to='STS', process=FALSE)
df.seq <- seqdef(df.form, left='DEL', right = 'unknown', xtstep=10, void = 'unknown')
The error message I get when running seqdef is
[>] found missing values ('NA') in sequence data
[>] preparing 35000 sequences
[>] coding void elements with 'unknown' and missing values with '*'
[>] 21 distinct states appear in the data:
1 = alcohol
2 = babies
3 = bakery
4 = beverages
5 = breakfast
6 = bulk
7 = canned goods
8 = dairy eggs
9 = deli
10 = dry goods pasta
11 = frozen
12 = household
...
[>] adding special state(s) to the alphabet: unknown
Error in `levels<-`(`*tmp*`, value = if (nl == nL) as.character(labels) else paste0(labels, :
factor level [24] is duplicated
I tried removing those orders where orderdate == to and the same error still occurs. I would appreciate any help I can get to fix this problem. Thanks.
The error occurs because you are using the same code ('unknown') for right missings and voids.
When the sequences contain missing values, these are treated as a separate state when you set with.missing = TRUE in functions such as seqdist or seqdplot, while voids are only used to adjust the row lengths and are simply ignored when plotting the sequences (seqplot) or computing dissimilarities (seqdist).
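As a minimal sketch of how the clash could be avoided, the call below keeps the void code distinct from the code used for right missings. The toy data frame `sts` and the object names `seq.del` and `seq.miss` are made up for illustration; which option is appropriate depends on whether you want the trailing missing values to count as a state.

```r
library(TraMineR)

# Hypothetical toy data already in STS format;
# NA marks positions after the last observed state of a sequence.
sts <- data.frame(
  t1 = c("produce", "produce", "beverages"),
  t2 = c("produce", "meat seafood", "beverages"),
  t3 = c("deli", NA, NA),
  stringsAsFactors = FALSE
)

# Option 1: delete right missings, so shorter sequences end in voids.
seq.del <- seqdef(sts, left = "DEL", right = "DEL", void = "%")

# Option 2: recode right missings as an explicit state ("unknown"),
# using a void code ("%") that differs from that state code.
seq.miss <- seqdef(sts, left = "DEL", right = "unknown", void = "%")
```

With option 1 the trailing positions are ignored by plots and dissimilarities. With option 2 they become an ordinary state of the alphabet, so they will influence both.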