I believe that the question Is there a good way to do this type of mining? could be solved using linear programming techniques. But I am completely new to this and do not know the best way to frame this as a minimization.
Would the following approach be OK?
- Have a continuous variable for each row and column which is the "length" spanned by all members in that row/column
- Have a variable for each "point" (each black dot) that indicates whether it is a member of the row or column group
- Minimize the sum of the first variables
And is there a better way of doing this? Is it possible to somehow frame this as a pure constraint problem (ie without the minimisation)? Do I have my terminology correct? Thanks!
I finally worked out how to represent this question in a linear form. There's a full description in my answer at Is there a good way to do this type of mining? but here is a quick summary:
Use binary (0/1) variables for each neighbouring pair on a row,
F_i
. This will be 1 when the pair are in the same group and 0 otherwise.Use constants
S_i
to describe the number of spaces between each pair of points.Minimise the sum of two terms:
The sum of
1 - F-i
. Minimising this pushes pairs together into larger groups.The sum of
F_i * S_i
. Minimising this separates paris with large spacings.By altering the relative weighting of the two terms you can change the importance of spacing between horizontal groups.
This relies on an asymmetry in the problem, in which horizontal groups are sensitive to spacing but vertical groups are not.
Yes, you could definitely use linear programming for this, but it is hard and I think you have to define your problem more precisely. I have too many questions for a comment, I hope you don't mind I write this as an answer...
Your points can be either in the "column group" or in the "row group". From your proposition above, I understand that you know the number of column groups and row groups in advance?
So you know your groups composition, you just want to find a repartition of the points in those groups in order to minimise the sum of the costs, determined by:
c(H) = max (i,j in H) |yi - yj|
)c(V) = max (i,j in V) |xi - xj|
)With
H
an horizontal cluster,V
a vertical cluster, and the total cost will be:with
n
(number of horizontal clusters) andp
(number of vertical clusters) known in advance. Is this correct?For the horizontal groups, you say you can't have "holes". I would represent this as a constraint of your problem, if you can quantify the size of the holes. For instance:
will insure that you don't have a gap of more than r in the horizontal cluster C. Is this what you want? Is
r
a fixed number?Is this the complete problem, or do you have other constraints (minimal number of points per group, or something)?
Do you need an exact minimal solution, or a "good" solution would be enough?
Finally, for the technical part, since your previous post was tagged 'python' and this one is not, do you have to use python to solve the model?