As I have a big dataset and only limited computational resources, I want to make use of aggregated sequence objects for a discrepancy analysis using the R packages TraMineR and WeightedCluster. But I struggle to find the right syntax for doing so.
In the example code below you will find two discrepancy analyses: the first tree diagram uses the original dataset, while the second uses aggregated data (that is, only unique sequences weighted by their frequencies).
Unfortunately, the results do not match. Do you have any idea why?
Example code
library(TraMineR)
library(WeightedCluster)
## Load example data and assign labels
data(mvad)
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.labels <- c("Employment", "Further Education", "Higher Education",
"Joblessness", "School", "Training")
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")
## Aggregate example data
mvad.agg <- wcAggregateCases(mvad[, 17:86], weights=mvad$weight)
mvad.agg
## Define sequence object
mvad.seq <- seqdef(mvad[, 17:86], alphabet=mvad.alphabet, states=mvad.scodes,
labels=mvad.labels, weights=mvad$weight, xtstep=6)
mvad.agg.seq <- seqdef(mvad[mvad.agg$aggIndex, 17:86], alphabet=mvad.alphabet,
states=mvad.scodes, labels=mvad.labels,
weights=mvad.agg$aggWeights, xtstep=6)
## Computing OM dissimilarities
mvad.dist <- seqdist(mvad.seq, method="OM", indel=1.5, sm="CONSTANT")
mvad.agg.dist <- seqdist(mvad.agg.seq, method="OM", indel=1.5, sm="CONSTANT")
## Discrepancy analysis
tree <- seqtree(mvad.seq ~ gcse5eq + Grammar + funemp,
data=mvad, diss=mvad.dist, weight.permutation="diss")
seqtreedisplay(tree, type="d", border=NA)
tree.agg <- seqtree(mvad.agg.seq ~ gcse5eq + Grammar + funemp,
data=mvad[mvad.agg$aggIndex, ], diss=mvad.agg.dist,
weight.permutation="diss")
seqtreedisplay(tree.agg, type="d", border=NA)
This question is related to big data and the computation of sequence distances.
The procedure you are using for aggregated data is wrong, because you do not consider the explanatory covariates when aggregating the data. As a result, each unique sequence is attributed to an almost random covariate profile, which gives wrong results.
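What you need to do is aggregate both the sequences and the covariates. Here the covariates "Grammar", "funemp" and "gcse5eq" are located in columns 10 to 12, so these columns have to be passed to the aggregation together with the sequence columns 17 to 86.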
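A minimal sketch of this joint aggregation, reusing the column positions and labels from the question. The object names ac and mvad.uniq are chosen for illustration only.

```r
library(TraMineR)
library(WeightedCluster)

data(mvad)
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.labels <- c("Employment", "Further Education", "Higher Education",
                 "Joblessness", "School", "Training")
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")

## Aggregate on the covariates (columns 10 to 12) AND the sequences
## (columns 17 to 86), so cases are merged only when both are identical
ac <- wcAggregateCases(mvad[, c(10:12, 17:86)], weights = mvad$weight)

## mvad.uniq (illustrative name): one row per unique
## (covariate profile, sequence) combination
mvad.uniq <- mvad[ac$aggIndex, ]

## Sequence object for the aggregated data, weighted by the aggregated weights
mvad.agg.seq <- seqdef(mvad.uniq[, 17:86], alphabet = mvad.alphabet,
                       states = mvad.scodes, labels = mvad.labels,
                       weights = ac$aggWeights, xtstep = 6)
```

The covariates passed to seqtree should then come from the same aggregated rows (here mvad.uniq), so that each aggregate keeps its own covariate profile.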
We then come to the next problem: the permutation test. If you do nothing, only the aggregates are permuted (permutations inside aggregates are omitted), which gives you wrong p-values. Two solutions can be used:
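One way to include permutations inside aggregates, sketched here under stated assumptions, is the weight.permutation argument of seqtree. Its "replicate" option replicates cases according to their weights before permuting. The sketch assumes that the weights are plain integer counts, so it aggregates without the survey weights mvad$weight; with non-integer weights another option (for instance "rounded-replicate" or "random-sampling") would be needed. The names ac and mvad.uniq are illustrative.

```r
library(TraMineR)
library(WeightedCluster)

data(mvad)

## Aggregate covariates and sequences WITHOUT survey weights, so that the
## aggregated weights are integer counts (an assumption of this sketch)
ac <- wcAggregateCases(mvad[, c(10:12, 17:86)])
mvad.uniq <- mvad[ac$aggIndex, ]

## Sequence object and OM dissimilarities for the aggregated data
mvad.agg.seq <- seqdef(mvad.uniq[, 17:86], weights = ac$aggWeights, xtstep = 6)
mvad.agg.dist <- seqdist(mvad.agg.seq, method = "OM", indel = 1.5,
                         sm = "CONSTANT")

## "replicate" expands each aggregate according to its weight for the
## permutation test, so permutations inside aggregates are taken into account
tree.agg <- seqtree(mvad.agg.seq ~ gcse5eq + Grammar + funemp,
                    data = mvad.uniq, diss = mvad.agg.dist,
                    weight.permutation = "replicate")
print(tree.agg)
```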
In all cases, you may still observe small differences in the p-values, both because the procedure is different and because the p-values are estimated using permutation tests. To get more precise p-values, try using a higher R value (number of permutations). In the tree procedure, the p-value threshold required to make a split can be changed using the pval argument. You can try setting it just a little higher to see whether the differences come from there. I hope it helps.