I have two matrices of normalized read counts for control and treatment in a time series day1 to day26. I want to calculate distance matrix by Dynamic Time Wrapping afterward use that for clustering but seems too complicated. I did so; who can help for more clarification please? Thanks a lot
> head(control[,1:4])
MAST2 WWC2 PHYHIPL R3HDM2
Control_D1 6.591024 5.695156 3.388652 5.756384
Control_D1 8.043454 5.365221 6.859768 6.936970
Control_D3 7.731590 4.868267 6.919972 6.931073
Control_D4 8.129948 5.105528 6.627016 7.090268
Control_D5 7.690863 4.729501 6.824746 6.904610
Control_D6 8.101723 5.334501 6.868990 7.115883
>
> head(lead[,1:4])
MAST2 WWC2 PHYHIPL R3HDM2
Lead30_D1 6.418423 5.610699 3.734425 5.778046
Lead30_D2 7.918360 4.295191 6.559294 6.780952
Lead30_D3 7.807142 4.294722 6.599187 6.716040
Lead30_D4 7.856720 4.432136 6.572337 6.848483
Lead30_D5 7.827311 4.204738 6.607107 6.784094
Lead30_D6 7.848760 4.458451 6.581216 6.943003
>
> dim(control)
[1] 26 2603
> dim(lead)
[1] 26 2603
library(dtw)
for (i in control) {
for (j in lead) {
result[i,j] <- dtw( dist(control[,,i],lead[,,j]), distance.only=T )$normalizedDistance
}
}
Says that
Error in lead[, , j] : incorrect number of dimensions
There have already been questions similar to yours,
but the answers haven't been too detailed.
Here's a breakdown of what you need to know,
in the specific case of R.
Calculating cross-distance matrices
The proxy
package is made specifically for the calculation of cross-distance matrices.
You should check its vignette to know which measures are already implemented by it.
An example of its use:
set.seed(1L)
sample_data <- matrix(rnorm(50L), nrow = 5L, ncol = 10L)
suppressPackageStartupMessages(library(proxy))
distance_matrix <- proxy::dist(sample_data, method = "euclidean",
upper = TRUE, diag = TRUE)
print(distance_matrix)
#> 1 2 3 4 5
#> 1 0.000000 2.636027 3.834764 5.943374 3.704322
#> 2 2.636027 0.000000 2.587398 4.515470 2.310364
#> 3 3.834764 2.587398 0.000000 4.008678 3.899561
#> 4 5.943374 4.515470 4.008678 0.000000 5.059321
#> 5 3.704322 2.310364 3.899561 5.059321 0.000000
Note: in the context of time series,
proxy
treats each row in a matrix as a series,
which can be confirmed by the fact that sample_data
above is a 5x10
matrix and the resulting cross-distance matrix is 5x5
.
Using the DTW distance
The dtw
package implements many variations of DTW,
and it also leverages proxy
.
You could calculate a DTW distance matrix with:
suppressPackageStartupMessages(library(dtw))
dtw_distmat <- proxy::dist(sample_data, method = "dtw",
upper = TRUE, diag = TRUE)
print(distance_matrix)
#> 1 2 3 4 5
#> 1 0.000000 2.636027 3.834764 5.943374 3.704322
#> 2 2.636027 0.000000 2.587398 4.515470 2.310364
#> 3 3.834764 2.587398 0.000000 4.008678 3.899561
#> 4 5.943374 4.515470 4.008678 0.000000 5.059321
#> 5 3.704322 2.310364 3.899561 5.059321 0.000000
Using custom distances
One nice thing about proxy
is that it gives you the option to register custom functions.
You seem to be interested in the normalized version of DTW,
so you could do something like this:
ndtw <- function(x, y = NULL, ...) {
dtw::dtw(x, y, ..., distance.only = TRUE)$normalizedDistance
}
pr_DB$set_entry(
FUN = ndtw,
names = "ndtw",
loop = TRUE,
distance = TRUE
)
ndtw_distmat <- proxy::dist(sample_data, method = "ndtw",
upper = TRUE, diag = TRUE)
print(ndtw_distmat)
#> 1 2 3 4 5
#> 1 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
#> 2 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
#> 3 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
#> 4 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
#> 5 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000
See the documentation of pr_DB
for more information.
Other DTW implementations
The dtwclust
package
(which I made)
implements a basic but faster version of DTW which can use multi-threading and also leverages proxy
:
suppressPackageStartupMessages(library(dtwclust))
dtw_basic_distmat <- proxy::dist(sample_data, method = "dtw_basic", normalize = TRUE)
print(dtw_basic_distmat)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.0000000 0.4046622 0.5075772 0.6789465 0.5290478
#> [2,] 0.4046622 0.0000000 0.3630849 0.4866252 0.3612722
#> [3,] 0.5075772 0.3630849 0.0000000 0.5678698 0.3303344
#> [4,] 0.6789465 0.4866252 0.5678698 0.0000000 0.5078112
#> [5,] 0.5290478 0.3612722 0.3303344 0.5078112 0.0000000
The dtw_basic
implementation only supports two step patterns and one window type,
but it is considerably faster:
suppressPackageStartupMessages(library(microbenchmark))
microbenchmark(
proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba", window.size = 5L),
proxy::dist(sample_data, method = "dtw_basic", window.size = 5L)
)
Unit: microseconds
expr min lq mean
proxy::dist(sample_data, method = "dtw", window.type = "sakoechiba", window.size = 5L) 5279.124 5621.742 6070.069
proxy::dist(sample_data, method = "dtw_basic", window.size = 5L) 657.966 710.418 776.474
median uq max neval cld
5802.354 6348.199 10411.000 100 b
752.282 814.037 1161.626 100 a
Another multi-threaded implementation is included in the parallelDist
package,
although I haven't personally tested it.
Multivariate or multi-dimensional time series
A single multivariate series is commonly a matrix where time spans the rows and the multiple variables span the columns.
DTW also works for them:
mv_series1 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
mv_series2 <- matrix(rnorm(15L), nrow = 5L, ncol = 3L)
print(dtw_distance <- dtw_basic(mv_series1, mv_series2))
#> [1] 22.80421
The nice thing about proxy
is that it can calculate distances between objects contained in lists too,
so you can put several multivariate series in lists of matrices:
mv_series <- lapply(1L:5L, function(dummy) {
matrix(rnorm(15L), nrow = 5L, ncol = 3L)
})
mv_distmat_dtwclust <- proxy::dist(mv_series, method = "dtw_basic")
print(mv_distmat_dtwclust)
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] 0.00000 27.43599 32.14207 36.42211 31.19279
#> [2,] 27.43599 0.00000 20.88470 23.88436 29.73219
#> [3,] 32.14207 20.88470 0.00000 22.14376 29.99899
#> [4,] 36.42211 23.88436 22.14376 0.00000 28.81111
#> [5,] 31.19279 29.73219 29.99899 28.81111 0.00000
Your case
Regardless of what you choose,
you can probably use proxy
to get your result,
but since you haven't provided your whole data,
I can't give you a more specific example.
I presume that dtwclust::dtw_basic(control[, 1:4], lead[, 1:4], normalize = TRUE)
would give you the distance between one pair of series,
assuming you're treating each one as a multivariate series with 4 variables.
If your question is "why am I getting this error?" the answer is that you're trying to subset a matrix, which is a two dimensional array, according to a 3rd dimension.
see:
dim(lead)
# [1] 26 2603
lead[,,6.418423] # yes, that's the value j has the first time through the loop
# This will reproduce your error
lead[,,1]
# This will also reproduce your error
Hopefully you can see now that you have a few problems:
- You're trying to subset a matrix according to a 3rd dimension
- Your
i
and j
values are the values in control
and lead
respectively. You can use them as their values, or you can generate the index, e.g., for(i in seq_along(control)
if you're planning to use it for something other than getting that same value out.
- Taking it to the next step, it's unclear what you want to pass to the
dist
function. dist
takes a single matrix and computes the distance between its rows. You seem to be trying to pass it two values from two different matrices, or perhaps two subsets of two different matrices. It looks like you might need to go back and look at the examples in the documentation for xtr