how to count the number of distinct items in a lin

I am studying some linear programming problems with all-binary variables, where it is necessary to count (and then either constrain or maximise/minimise) the number of distinct items in the solution.

This is the post I could find that seemed closest to it:

https://stats.stackexchange.com/questions/136608/constrained-assignment-problem-linear-programming-genetic-algorithm-etc

The 'items' being counted in this case are the supply centers used. I am trying to understand if the approach suggested in the above post is correct for my purposes.

In the answer by user 'TAS', the example is 3 shops by 2 supply centers, and the idea is (A) to assign one (and only one) supply center to each shop, so that: (B) the distance travelled is minimal, (C) no supply center must supply more than a given maximal number of shops (in this case 3, i.e. no limit), and (D) the max total number of supply centers used is limited (in this case to 2).

I tried to reconstruct how the problem was set up, starting from a dataset like the one I would have in my case.

df <- cbind(expand.grid(shop=1:3,supply=1:2),distance=c(2.8,5.4,1.4,4.2,3.0,6.3))

df["Entry"] <- 1:dim(df)[[1]]

shop.mat <- table(df$shop,df$Entry)
shop.mat

    1 2 3 4 5 6
  1 1 0 0 1 0 0
  2 0 1 0 0 1 0
  3 0 0 1 0 0 1

supply.mat <- table(df$supply,df$Entry)
supply.mat

    1 2 3 4 5 6
  1 1 1 1 0 0 0
  2 0 0 0 1 1 1

N_supply <- dim(supply.mat)[[1]]
N_shop <- dim(shop.mat)[[1]]
N_entry <- dim(df)[[1]]

The solution vector will have length N_entry + N_supply, and each row of the constraint matrix will need to have the same length.

constr.mat <- NULL
dir <- NULL
rhs <- NULL

(A) is addressed by constraining each line in the shop.mat to be == 1:

constr.mat <- rbind(constr.mat,cbind(shop.mat,matrix(0,N_shop,N_supply)))
dir <- c(dir,rep("==",N_shop))
rhs <- c(rhs,rep(1,N_shop))

(B) is addressed by setting the objective vector to the distance for each Entry, and 0 for each shop (because there is no cost in adding one more supply center, although in reality there might be):

obj <- c(aggregate(distance~Entry,df,c)[["distance"]],rep(0,N_supply))

constr.mat <- rbind(constr.mat,cbind(supply.mat,-diag(table(df$supply))))
dir <- c(dir,rep("<=",N_supply))
rhs <- c(rhs,rep(0,N_supply))

(D) is addressed by adding a constraint <= 2:

constr.mat <- rbind(constr.mat,c(rep(0,N_entry),rep(1,N_supply)))
dir <- c(dir,"<=")
rhs <- c(rhs,2)

The problem can then be solved using lpSolve:

require(lpSolve)
sol <- lp("min", obj, constr.mat, dir, rhs, all.bin = TRUE,num.bin.solns = 1, use.rw=FALSE, transpose.constr=TRUE)

sol$solution
[1] 1 0 1 0 1 0 1 1

sol$objval
[1] 7.2

selected_Entry <- dimnames(shop.mat)[[2]][as.logical(sol$solution[1:N_entry])]
selected_Entry
[1] "1" "3" "5"

df[df$Entry %in% selected_Entry,]
  shop supply distance Entry
1    1      1      2.8     1
3    3      1      1.4     3
5    2      2      3.0     5

I can see that in this specific case the solution vector is forced (by constraints (C)) to have '1' in any of the 'supply' variables for which at least one corresponding Entry is selected. If this were not the case, the row sums for constraints (C) would be > 0.

But: suppose the distances were different and only supply center 1 were chosen for all 3 shops. What would stop the solution vector variable for supply center 2 from being set to '1'?

The current solution gives:

constr.mat %*% sol$solution
  [,1]
1    1
2    1
3    1
1   -1
2   -2
     2

But this alternative solution would still meet all the constraints:

constr.mat %*% c(1,1,1,0,0,0,1,1)
  [,1]
1    1
2    1
3    1
1    0
2   -3
     2

despite the fact that supplier center 2 was not used.

In this case this would not affect the solution, because there is no cost associated to including the supply centers (the corresponding elements of the objective vector are 0).

But if I wanted to get from the solution the count of distinct supply centers used, I think this would not work.

A few years ago I asked for advice on this problem on another forum, and someone immediately gave me the solution, however saying that he/she 'was not sure it was the most efficient'.

It was the following: all the same as above, and then for each of the supply centers, add to the constr.mat twice the supply.mat augmented by the negated diagonal matrix of the number of entries per supply center, constraining the first N_supply added rows to be <= 0, and the last N_supply rows to be >= 1 - the diagonal of the above mentioned diagonal matrix.

constr.mat <- rbind(constr.mat,cbind(supply.mat,-diag(table(df$supply))),cbind(supply.mat,-diag(table(df$supply))))
dir <- c(dir,rep("<=",N_supply),rep(">=",N_supply))
rhs <- c(rhs,rep(0,N_supply),1-table(df$supply))

The addition of these constraints makes sure that the 'supply' variables in the solution vector are 1 if and only if the corresponding supply center has been used, and 0 if and only if it hasn't been used.

For instance, the original solution would still work:

paste(t(unlist(constr.mat %*% sol$solution)),dir,rhs)
 [1] "1 == 1"   "1 == 1"   "1 == 1"   "-1 <= 0" 
 [5] "-2 <= 0"  "2 <= 2"   "-1 <= 0"  "-2 <= 0" 
 [9] "-1 >= -2" "-2 >= -2"

[BTW I would not know how to turn this into an evaluated logical vector; any idea?]

whereas the other solution, which erroneously set the variable for supply center 2 to 1 although this supply center wasn't used, would instead not be valid:

paste(t(unlist(constr.mat %*% c(1,1,1,0,0,0,1,1))),dir,rhs)
 [1] "1 == 1"   "1 == 1"   "1 == 1"   "0 <= 0"  
 [5] "-3 <= 0"  "2 <= 2"   "0 <= 0"   "-3 <= 0" 
 [9] "0 >= -2"  "-3 >= -2"

(the last constraint would not be met).

Q1 Do you think the above makes sense, i.e. is it true that we need the additional constraint rows I mentioned to make sure that the 'supply' variables in the solution vector are appropriately set, or am I wrong?

Q2 Can you think of a more efficient way to count occurrences of distinct items in such problems (the example here is small, but I am often dealing with VERY large ones, where adding so many more constraints is not really helping, despite all the presolve in the world)?

Thanks!

Note: this question was originally posted in another community. I deleted it from there.

EDIT after consulting the Wikipedia page on 'Uncapacited Facility Location Problem', mentioned in the original post I linked above.

In fact there is a cost associated to opening a new supply center, so the objective vector should not have 0's at the end, but some cost ($f_i$ in the Wikipedia formulation).
Only then the issue of $\sum_iy_i$ not always being the number of open supply centers disappears, because the $\sum_jx_{i,j} \le m \cdot y_i$ contraints will still ensure that whenever a given center is used, the corresponding $y_i$ is 1; and there will be no need for the other condition I imposed, because there is now a cost associated with setting to 1 each $y_i$, therefore only the strictly necessary $y_i$'s will be set to 1.

So in short, if the objective vector is properly constructed, with costs for each supply center, I can do without several constraints.
In fact, depending on the value for the supply center opening cost, the constraint on the max total number of centers may even be superseded.
Then it would be interesting to evaluate the suggestion made in the Wikipedia discussion, namely to split the 'big M' constraints into several smaller ones. If it's true that it makes the problem easier to solve computationally, why not...

EDIT: The algorithm is fixed.

I had a similar issue that I wanted to minimize the number of distinct values in a solution. The following is how I came up with the answer to mathematically calculate the number of distinct items.

Suppose we have the following set:

11 12 13 11 11

We can see there are 3 distinct numbers in there (11, 12, and 13). Following is the way to compute it. Write the numbers in triangle matrix like this:

11 12 13 11 11    row=0
   12 13 11 11    row=1
      13 11 11    row=2
         11 11    row=3

if I get the difference of 11 and 12 and assign a binary variable to

1 if |a1 - a2| != 0

0 if |a1 - a2| == 0

then I have the following:

1 1 0 0   --> 0
  1 1 1   --> 1
    1 1   --> 1
      0   --> 0

1 + 1 + (extra 1) = 3

if a number is distinct then its row should be all 1s. So for the above case we have 2 rows of full 1s, meaning we have 2 numbers that are distinct from the first number. So, in total we have 3.

Now to translate into Linear Programming:

Assume:

Variables = a(1), a(2), a(3), a(4), ..., a(n)
Binary Variables b(i,j) where i,j in [0...n]
Binary Variable c(i) where i in [0...n]

The Linear Program would be:

obj = 1
for i in range(0, n):
  for j in range(i+1, n):
    # This is |a(i) - a(j)| part
    addConstr( b(i,j) * BigM >= a(i) - a(j)) 
    addConstr( b(i,j) * BigM >= -(a(i) - a(j)))
  # So here c(i) will be 0 if there is any 0 in the row otherwise it is 1.
  addConstr(c(i) * BigM >= sum(b(i,j) for all j) - (n-i))
  obj = obj + c(i)
Minimize(Sum(obj))