Because I have a large dataset and only limited computational resources, I want to use aggregated sequence objects for a discrepancy analysis with the R packages TraMineR and WeightedCluster, but I am struggling to find the right syntax.
The example code below runs two discrepancy analyses: the first tree diagram uses the original dataset, and the second uses aggregated data (that is, only the unique sequences, weighted by their frequencies).
Unfortunately, the results do not match. Do you have any idea why?
Example code
library(TraMineR)
library(WeightedCluster)
## Load example data and assign labels
data(mvad)
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.labels <- c("Employment", "Further Education", "Higher Education",
                 "Joblessness", "School", "Training")
mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")
## Aggregate example data
mvad.agg <- wcAggregateCases(mvad[, 17:86], weights=mvad$weight)
mvad.agg
## Define sequence object
mvad.seq <- seqdef(mvad[, 17:86], alphabet=mvad.alphabet, states=mvad.scodes,
                   labels=mvad.labels, weights=mvad$weight, xtstep=6)
mvad.agg.seq <- seqdef(mvad[mvad.agg$aggIndex, 17:86], alphabet=mvad.alphabet,
                       states=mvad.scodes, labels=mvad.labels,
                       weights=mvad.agg$aggWeights, xtstep=6)
## Computing OM dissimilarities
mvad.dist <- seqdist(mvad.seq, method="OM", indel=1.5, sm="CONSTANT")
mvad.agg.dist <- seqdist(mvad.agg.seq, method="OM", indel=1.5, sm="CONSTANT")
## Discrepancy analysis
tree <- seqtree(mvad.seq ~ gcse5eq + Grammar + funemp,
                data=mvad, diss=mvad.dist, weight.permutation="diss")
seqtreedisplay(tree, type="d", border=NA)
tree.agg <- seqtree(mvad.agg.seq ~ gcse5eq + Grammar + funemp,
                    data=mvad[mvad.agg$aggIndex, ], diss=mvad.agg.dist,
                    weight.permutation="diss")
seqtreedisplay(tree.agg, type="d", border=NA)
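One thing that may be worth checking is whether cases that share the same sequence also share the same covariate values. The aggregation above groups on the sequence columns only, while the aggregated tree takes its covariates from a single representative row per group (`aggIndex`). The following is only a diagnostic sketch, not a confirmed explanation of the mismatch; `covs` and `n.profiles` are illustrative names that are not part of either package.

```r
library(TraMineR)
library(WeightedCluster)

data(mvad)

## Aggregate on the sequence columns only, as in the example above
agg <- wcAggregateCases(mvad[, 17:86], weights = mvad$weight)

## One string per case describing its covariate profile
## (assumption: these three covariates are the only ones used in the tree)
covs <- paste(mvad$gcse5eq, mvad$Grammar, mvad$funemp, sep = "|")

## disaggIndex maps every original case to its aggregated group;
## count the distinct covariate profiles found inside each group
n.profiles <- tapply(covs, agg$disaggIndex, function(x) length(unique(x)))

## Number of aggregated sequences whose members differ on the covariates
sum(n.profiles > 1)

## Sanity check: the aggregated weights should add up to the original total
all.equal(sum(agg$aggWeights), sum(mvad$weight))
```

If the first count is greater than zero, some aggregated rows stand in for cases with different covariate values, so the two trees would not be fitted on equivalent data.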
This question is related to big data and the computation of sequence distances.