The distance between a long sequence and a short sequence, is the minimum distance between the short sequence and any subsequence of the long sequence that is the same length as the short sequence.
The distance I am using is I think the Manhattan distance. (But this should be unimportant as I would like to be able to change distance functions).
This first version shows a naive implementation without early abandon. I generate all subsequence of the same length, map these to find the distance between them and the short sequence and then use aggregate/3 to find the min.
abs(X,Y,Z):-
Z is abs(X-Y).
seq_seq_absdis(Seq1,Seq2,Dis):-
same_length(Seq1,Seq2),
maplist(abs,Seq1,Seq2,Dislist),
sumlist(Dislist,Dis).
seq_subseq(List1,List2):-
append(List2,_,List1),
dif(List2,[]).
seq_subseq([_|T],Subseq):-
seq_subseq(T,Subseq).
smallseq_largeseq_dis(Sseq,Lseq,Dis):-
findall(Subseq, (same_length(Subseq,Sseq),seq_subseq(Lseq,Subseq)),Subseqs),
maplist(seq_seq_absdis(Sseq),Subseqs,Distances),
aggregate(min(D),member(D,Distances),Dis).
Example query:
?-smallseq_largeseq_dis([1,2,4],[1,2,3,1,2,5],Dis).
Dis = 1
This next version should be more efficient, as it will abandon calculating the distance between a subsequence of the long sequence and the short sequence once the distance is over the minimum already found.
ea_smallseq_largeseq_dis(Sseq,Lseq,Subseq,Dis):-
retractall(best(_,_)),
assert(best(initial,10000)),
findall(Subseq-Dis, ea_smallseq_largeseq_dis_h(Sseq,Lseq,10000,Subseq,Dis),Pairs),
append(_,[Subseq-Dis|[]],Pairs).
ea_smallseq_largeseq_dis_h(Sseq,Lseq,BestSofar1,Subseq,Dis):-
same_length(Sseq,Subseq),
seq_subseq(Lseq,Subseq),
best(_,BestSofar2),
( ( BestSofar2 < BestSofar1) ->
accumulate_dis(Sseq,Subseq,BestSofar2,Dis),
retractall(best(_,_)),
assert(best(Subseq,Dis))
;(
accumulate_dis(Sseq,Subseq,BestSofar1,Dis),
retractall(best(_,_)),
assert(best(Subseq,Dis))
)
).
accumulate_dis(Seq1,Seq2,Best,Dis):-
accumulate_dis(Seq1,Seq2,Best,Dis,0).
accumulate_dis([],[],_Best,Dis,Dis).
accumulate_dis(Seq1,Seq2,Best,Dis,Ac):-
Seq1=[H1|T1],
Seq2=[H2|T2],
abs(H1,H2,Dis1),
Ac1 is Dis1 + Ac,
Ac1 <Best,
accumulate_dis(T1,T2,Best,Dis,Ac1).
Query:
?-ea_smallseq_largeseq_dis([1,2,3],[1,2,4,5,6,7,8,1,2,3],Subseq,Dis).
Dis = 0,
Subseq = [1, 2, 3]
But in this I have used assert and retract so I want to have a version which does the same algorithm but with out these. I think I should be able to do this with a dcg with semicontext notation but find it hard to grasp... how do I keep track of the subsequences I am generating by backtracking and at the same time the 'state' of the minimum distance found so far?
The problem I have.. seq_subseq/2 is generating the sub-sequences by back tracking. The first subseq tested needs to be set to the min distance. I then want to loop, so back track to generate another sequence. But to back track I have to fail. But then I cant bring back the min distance so far to check on the next sequence.
If I don't want to use backtracking, I think I need to define a state transition predicate for generating the sub-sequences in order.
At the moment
? seq_subseq([1,2,3,4],X).
X = [1]
X = [1, 2]
X = [1, 2, 3]
X = [1, 2, 3, 4]
X = [2]
X = [2, 3]
X = [2, 3, 4]
X = [3]
X = [3, 4]
X = [4]
So I think I need to define a relation:
subseq0_seq_subseq1(Subseq0,Seq,Subseq1)
That would work like:
?-subseq0_seq_subseq1([1,2,3,4],[1,2,3,4],Subseq1).
Subseq1 = [2].
and
?-subseq0_seq_subseq1([1,2,3],[1,2,3,4],Subseq1).
Subseq1 = [1,2,3,4].
But I need to do this in an efficient way.
Update- Thanks to the answer from Mat I now have this, which is a great improvement I think. Can anyone see any further improvements to this? I have a double nested -> structure and a ! in the accumulate_dis/4 definition both of which seem ugly. I have also made it return the sub-sequence of the long-sequence which is the shortest distance away from the short sequence.
It needs to work with non integers so clpfd is not appropriate I think.
abs(X,Y,Z):-
Z is abs(X-Y).
list_subseq_min(Ls, Subs, Min,BestSeq1) :-
prefix_dist(Ls, Subs, 1000, Front, D0),
BestSeq0=Front,
min_sublist(Ls, Subs,BestSeq0,BestSeq1, D0, Min).
prefix_dist(Ls, Ps, Best,Front,D) :-
same_length(Front, Ps),
append(Front, _, Ls),
accumulate_dis(Front, Ps, Best, D).
min_sublist(Ls0, Subs, BestSeq0,BestSeq2, D0, D) :-
( prefix_dist(Ls0, Subs, D0, Front,D1) ->
min_list([D0,D1],D2),
Ls0 = [_|Ls],
( D0 < D1 ->
BestSeq1 =BestSeq0
;
BestSeq1 =Front
),
min_sublist(Ls, Subs, BestSeq1,BestSeq2, D2, D)
; D = D0,BestSeq0 =BestSeq2
).
accumulate_dis(Seq1,Seq2,Best,Dis):-
accumulate_dis(Seq1,Seq2,Best,Dis,0),!.
accumulate_dis([],[],_Best,Dis,Dis).
accumulate_dis(Seq1,Seq2,Best,Dis,Ac):-
Seq1=[H1|T1],
Seq2=[H2|T2],
abs(H1,H2,Dis1),
Ac1 is Dis1 + Ac,
Ac1 <Best,
accumulate_dis(T1,T2,Best,Dis,Ac1).
accumulate_dis(Seq1,Seq2,Best,Dis):-Dis is Best+1.
query:
?- list_subseq_min([2.1,3.4,4,1.1,2,4,10,12,15],[1,2,3],D,B).
D = 1.1,
B = [1.1, 2, 4].
One important note: You should have clearified that you are talking about the Manhatten distance between lists. This was only clear from your code, whereas your wording can easily lead readers to assume you are talking about the edit distance.
Here is a solution that simply walks the list, keeps track of the minimum, and eventually yields the found minimum.
Example query and its result:
It's quite straight-forward, and since the bookkeeping is limited to only one predicate, using semicontext notations does not really pay off. It is especially useful to use semicontext notation—and DCGs in general—when what is being described spans different rules and communication between them would otherwise be harder.
The running time is in O(N×M).
And now the point, which I leave as an exercise: Modify this solution to prune earlier, if the previously found minimum is already exceeded. Do so in a pure way, or at least as pure as possible:
assertz/1
and friends are definitely out of the question, since they prevent your testing these predicates in isolation. Pass around arguments and build the distance more incrementally! This may help you to improve the average running time, though of course not the worst case complexity.It is for this passing around states between different clauses that semicontext notation may become useful at last.
EDIT: Very well, you have implemented a solution that does the pruning. I will now also show mine. I will reuse the auxiliary predicates
absdiff/3
andlists_dist/3
from above, and the following additional auxiliary predicate:list_subseq_min/3
is now slightly different:And now the point:
min_sublist//2
is a DCG nonterminal that concisely describes the main idea of the algorithm:From this description, it is very clear that we are considering the list element by element. It uses fewer (explicit) arguments than previously. The additional two arguments are implicitly passed around as a pair
D-Front
, which keeps track of the best distance and subsequence found so far. Note how DCG notation exposes the core of the computation, and hides what is not relevant at this place.The rest is quite self-explanatory, and is analogous to the pruning you have implemented. I highlight the single use of semicontext notation in this program, which lets us express any change of the optimal sequence found so far.
I can't bring myself to accept the nastiness and primitiveness of contemporary floating-point numbers, so I have retained the integer arithmetic and simply multiply all numbers you show so that they become integers:
I leave extending this so that it also shows the found subsequence as an easy exercise.
One important note: The capping only affects the calculation of the distance; note in particular that the running time is Θ(N×M) due to the way
same_length_prefix/3
is used infront//2
! I leave improving this as a slightly harder exercise.