Partially related to my other question here.
In my case the 'original' aim was to choose n=50 objects out of N=292, such that the sum of all pairwise distances between the chosen objects is maximized (maxsum or p-dispersion sum).
Thanks to the users who provided advice, I did some further reading, and now I understand that the problem is indeed quadratic in its simplest form, and a solver like CPLEX may be able to solve it.
However, this article by Kuby points out that the maxsum results does not guarantee that there will be no objects very close to each other; and indeed, from some tests I did by brute force on simulated smaller cases, I found that solutions with very high maxsum sometimes contain very close objects.
So now I'm thinking that the p-dispersion (maxmin) approach could be more suited to what I want to achieve. This is also originally a quadratic problem.
As I don't have CPLEX yet, I can't try the quadratic formulation, so I looked at linearization approaches. These 2 articles seem quite interesting to me:
Franco, Uchoa
Sayah, 2015
The latter points to another article, which I find very interesting, too:
Pisinger, 2006
My next step was to try out the following:
- linearized p-dispersion according to Kuby/Erkut, with N binary variables for the objects and 1 continuous variable for the maximal min distance, bounded between the smallest and largest distance in the distance matrix
- brute force, enumerating all combinations of n objects out of N, and finding the one with the largest minimal distance
- like 1, but setting a tighter upper bound for the continuous variable using the method by Sayah/Pisinger
- linearized p-dispersion according to Sayah, with N binary variables for the objects, and up to N*(N-1)/2 additional binary variables for the pairwise distances
I did not try to tighten the lower bound or to add more inequalities, because the methods suggested in the articles are beyond my level of maths.
What puzzles me is that method 4, which is supposed to be 'compact', in fact has a huge number of binary variables and consequent constraints, and in the tests I ran it performed much worse than methods 1 and 2. Tightening the upper bound on the other hand had a huge effect, and in fact method 2 at the moment is the only one that seems to be able to tackle large-ish problems in a reasonable time.
But it's true that I did not implement exactly the method in Sayah's paper, so maybe my observations are not valid.
Questions: what do you think of the various linearization methods described in these articles? Can you suggest better ones? Do you think keeping the max min distance as a continuous variable like in Kuby's formulation is better than making it 'quantized' like in Sayah's formulation?
In fact further complications and developments came up in the meantime, e.g. the presence of 'forced' objects and the need to use scores for each object, but I'd like to address the above first.
I pasted below the R code I used for testing this.
Thanks!
#Test of linearized methods for the solution of p-dispersion (maxmin) problems
#-----------------------------------------------------------------------------
#Definitions
#Given N objects, whose distance matrix 'distmat' is available:
#p-dispersion (maxmin): select n (n >= 2, n < N) objects such that the minimal distance between any two objects is maximised
#p-dispersion sum (maxsum): select n (n >= 2, n < N) objects such that the sum of all the pairwise distances between them is maximised
#Literature
#Kuby, 1987: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1538-4632.1987.tb00133.x
#Pisinger, 1999: https://pdfs.semanticscholar.org/1eb3/810077c0af9d46ed5ff2b0819d954c97dcae.pdf
#Pisinger, 2006: http://yalma.fime.uanl.mx/~roger/work/teaching/clase_tso/docs_project/problems/PDP/cor-2006-Pisinger.pdf
#Franco, Uchoa: https://pdfs.semanticscholar.org/4092/d2c98cdb46d5d625a580bac08fcddc4c1e60.pdf
#Sayah, 2015: https://download.uni-mainz.de/RePEc/pdf/Discussion_Paper_1517.pdf
#Initialization
require(Matrix)
if (length(find.package(package="Rsymphony",quiet=TRUE))==0) install.packages("Rsymphony")
require(Rsymphony)
par(mfrow = c(2,2))
#0. Choose N, n and which methods to run
N = 20
n = ceiling(0.17*N)
run_PD_Erkut = TRUE
run_PD_brute_force = TRUE
run_PD_Erkut_UB_Sayah = TRUE
run_PD_Sayah = TRUE
#1. Make random distance matrix for testing
set.seed(1)
coords <- cbind(runif(N,-5,5),runif(N,-5,5))
distmat <- t(as.matrix(dist(coords,diag=T)))
distmat[lower.tri(distmat)] <- 0
distmat <- Matrix(distmat,sparse=T)
N.i <- NROW(distmat)
colnames(distmat) <- paste("j",1:N.i,sep="_")
rownames(distmat) <- paste("i",1:N.i,sep="_")
#2. Make a 2D representation of the points using classic multidimensional scaling
cmds <- cmdscale(as.dist(t(distmat)))
#3. Link the pairwise distances to the rows and columns of the distmat
distmat_summary <- summary(distmat)
N.ij <- NROW(distmat_summary)
distmat_summary["ID"] <- 1:(N.ij)
i.mat <- xtabs(~ID+i,distmat_summary,sparse=T)
j.mat <- xtabs(~ID+j,distmat_summary,sparse=T)
ij.mat <- cbind(i.mat,0)+cbind(0,j.mat)
colnames(ij.mat)[[N.i]] <- as.character(N.i)
zij.mat <- .sparseDiagonal(n=N.ij,x=1)
#4. MaxMin task by Kuby/Erkut (N binary variables + 1 continuous variable for max Dmin)
if (run_PD_Erkut == TRUE) {
#4a. Building the constraint matrix (mat), direction (dir), right-hand-side (rhs) and objective (obj) for the LP task
dij <- distmat_summary$x
M <- max(dij)
m <- min(dij)
#Erkut's condition: for each i,j i<j, D (min distance to maximise) + M*xi + M*xj <= 2*M + dij
constr.dij <- cbind("D"=1,ij.mat*M)
dir.dij <- rep("<=",N.ij)
rhs.dij <- 2*M+dij
constr.D <- c(1,rep(0,N.i))
dir.DM <- "<="
rhs.DM <- M
dir.Dm <- ">="
rhs.Dm <- m
#constraining the total number of objects to be n
constr.n <- c(0,rep(1,N.i))
dir.n <- "=="
rhs.n <- n
#assembling the constraints
mat <- rbind(constr.n,constr.dij,constr.D,constr.D)
dir <- c(dir.n,dir.dij,dir.DM,dir.Dm)
rhs <- c(rhs.n,rhs.dij,rhs.DM,rhs.Dm)
#objective
obj <- setNames(c(1,rep(0,N.i)), c("D",colnames(ij.mat)))
#4.b. Solution
st <- system.time(LP.sol <- Rsymphony_solve_LP(obj,mat,dir,rhs,types=c("C",rep("B",N.i)),max=TRUE,verbosity = -2, time_limit = 5*60))
ij.sol <- names(obj[-1])[as.logical(LP.sol$solution[-1])]
items.sol <- rownames(distmat)[as.numeric(ij.sol)]
Dmin <- LP.sol$solution[1]
#4.c. Plotting the results
plot(cmds,main=paste(c("p-dispersion (Erkut), N =",N,", n =",n,"\nUB =",round(M,2),", time =",round(st[3],2),"s, Dmin =",round(Dmin,2)),collapse=" ") )
points(cmds[as.numeric(ij.sol),],pch=16,col="red")
text(cmds[as.numeric(ij.sol),],ij.sol,cex=0.9,col="red",adj=c(0,1))
}
#5. MaxMin task by brute force
if (run_PD_brute_force == TRUE) {
if (choose(N,n) <= 200000) {
st <- system.time({combs <- as.data.frame(t(combn(N,n)))
combs["maxmin"] <- apply(combs, 1, function(x) {min(distmat_summary[(distmat_summary$j %in% x) & (distmat_summary$i %in% x),"x"])})
combs["maxsum"] <- apply(combs, 1, function(x) {sum(distmat_summary[(distmat_summary$j %in% x) & (distmat_summary$i %in% x),"x"])})
combs_maxmin_max <- combs[combs$maxmin == max(combs$maxmin),][1,]})
ij.sol <- as.character(combs_maxmin_max[,1:n])
items.sol <- rownames(distmat)[as.numeric(ij.sol)]
Dmin <- combs_maxmin_max[1,"maxmin"]
plot(cmds,main=paste(c("p-dispersion (brute force), N =",N,", n =",n,"\ntime =",round(st[3],2),"s, Dmin =",round(Dmin,2)),collapse=" ") )
points(cmds[as.numeric(ij.sol),],pch=16,col="red")
text(cmds[as.numeric(ij.sol),],ij.sol,cex=0.9,col="red",adj=c(0,1))
}
}
#6. MaxMin task by Erkut with Sayah's upper bound
if (run_PD_Erkut_UB_Sayah == TRUE) {
#6a. Building the constraint matrix (mat), direction (dir), right-hand-side (rhs) and objective (obj) for the LP task
m <- min(distmat_summary$x)
M <- sort(sapply(1:(N.i), function(it) {min((sort(distmat_summary[(distmat_summary$i == it) | (distmat_summary$j == it),"x"],decreasing = TRUE)[1:(n-1)]))}),decreasing=TRUE)[n]
#Erkut's condition: for each i,j i<j, D (min distance to maximise) + M*xi + M*xj <= 2*M + dij
constr.dij <- cbind("D"=1,ij.mat*M)
dir.dij <- rep("<=",N.ij)
rhs.dij <- 2*M+dij
constr.D <- c(1,rep(0,N.i))
dir.DM <- "<="
rhs.DM <- M
dir.Dm <- ">="
rhs.Dm <- m
#constraining the total number of objects to be n
constr.n <- c(0,rep(1,N.i))
dir.n <- "=="
rhs.n <- n
#assembling the constraints
mat <- rbind(constr.n,constr.dij,constr.D,constr.D)
dir <- c(dir.n,dir.dij,dir.DM,dir.Dm)
rhs <- c(rhs.n,rhs.dij,rhs.DM,rhs.Dm)
#objective
obj <- setNames(c(1,rep(0,N.i)), c("D",colnames(ij.mat)))
#6.b. Solution
st <- system.time(LP.sol <- Rsymphony_solve_LP(obj,mat,dir,rhs,types=c("C",rep("B",N.i)),max=TRUE,verbosity = -2, time_limit = 5*60))
ij.sol <- names(obj[-1])[as.logical(LP.sol$solution[-1])]
items.sol <- rownames(distmat)[as.numeric(ij.sol)]
Dmin <- LP.sol$solution[1]
#6.c. Plotting the results
plot(cmds,main=paste(c("p-dispersion (Erkut, UB by Sayah), N =",N,", n =",n,"\nUB =",round(M,2),", time =",round(st[3],2),"s, Dmin =",round(Dmin,2)),collapse=" ") )
points(cmds[as.numeric(ij.sol),],pch=16,col="red")
text(cmds[as.numeric(ij.sol),],ij.sol,cex=0.9,col="red",adj=c(0,1))
}
#7. MaxMin task by Sayah (N binary variables + binary variables from unique values of dij)
if (run_PD_Sayah == TRUE) {
#7a. Building the constraint matrix (mat), direction (dir), right-hand-side (rhs) and objective (obj) for the LP task
#7a.1. Finding the upper (M) and lower (m) bound for the minimal distance
m <- min(distmat_summary$x)
M <- sort(sapply(1:(N.i), function(it) {min((sort(distmat_summary[(distmat_summary$i == it) | (distmat_summary$j == it),"x"],decreasing = TRUE)[1:(n-1)]))}),decreasing=TRUE)[n]
dijs <- unique(sort(distmat_summary$x))
dijs <- dijs[dijs <= M]
N.dijs <- length(dijs)
z.mat <- .sparseDiagonal(N.dijs,1)
#Sayah's formulation:
#applying z[k] <= z[k-1]
constr.z <- cbind(rep(0,N.i*(N.dijs-1)),cbind(0,z.mat[-1,-1])-z.mat[-NROW(z.mat),])
dir.z <- rep("<=",N.dijs-1)
rhs.z <- rep(0,N.dijs-1)
#applying x[i]+x[j]+z[k] <= 2
constr.ijk <- NULL
for (k in 2:N.dijs) {
IDs <- distmat_summary[distmat_summary$x < dijs[k],"ID"]
constr.ijk <- rbind(constr.ijk,cbind(ij.mat[IDs,,drop=F],z.mat[rep(k,length(IDs)),,drop=F]))
}
dir.ijk <- rep("<=",NROW(constr.ijk))
rhs.ijk <- rep(2,NROW(constr.ijk))
#constraining the total number of objects to be n
constr.n <- c(rep(1,N.i),rep(0,N.dijs))
dir.n <- "=="
rhs.n <- n
#assembling the constraints
mat <- rbind(constr.n,constr.z,constr.ijk)
dir <- c(dir.n,dir.z,dir.ijk)
rhs <- c(rhs.n,rhs.z,rhs.ijk)
#objective
obj <- setNames(c(rep(0,N.i),1,diff(dijs)), c(colnames(ij.mat),paste("z",1:N.dijs,sep="_")))
#7.b. Solution
st <- system.time(LP.sol <- Rsymphony_solve_LP(obj,mat,dir,rhs,types="B",max=TRUE,verbosity = -2, time_limit = 5*60))
ij.sol <- names(obj[1:N.i])[as.logical(LP.sol$solution[1:N.i])]
items.sol <- rownames(distmat)[as.numeric(ij.sol)]
Dmin <- sum(LP.sol$solution[(1+N.i):(N.dijs+N.i)]*obj[(1+N.i):(N.dijs+N.i)])
#7.c. Plotting the results
plot(cmds,main=paste(c("p-dispersion (Sayah), N =",N,", n =",n,"\nUB =",round(M,2),", time =",round(st[3],2),"s, Dmin =",round(Dmin,2)),collapse=" ") )
points(cmds[as.numeric(ij.sol),],pch=16,col="red")
text(cmds[as.numeric(ij.sol),],ij.sol,cex=0.9,col="red",adj=c(0,1))
}
You don't mention if you can tolerate non-optimal solutions. But you should be able to because you cannot expect to be able to generally find optimal solutions to this problem. In this case, there is a factor-2 approximation.
This approximation algorithm is guaranteed to find a solution with a value no more than twice the optimal value and, unless P=NP, no polynomial-time heuristic can provide a better performance guarantee.
The optimality bound is proved in White (1991) and Ravi et al. (1994). The latter proves the heuristic is the best possible.
For reference, I ran the full MIP for p=50, n=400. After 6000s, the optimality gap was still 568%. The approximation algorithm took 0.47s to obtain an optimality gap of 100% (or less).
A Python (sorry, I don't model in R) representation of the approximation algorithm is as follows:
Whereas a Gurobi Python representation might look like this: