I'm reading this tutorial on context free grammars in Prolog, and they mention at the bottom of the page implementing a context free grammar in Prolog using difference lists, with the following code block included:
s(X,Z):- np(X,Y), vp(Y,Z).
np(X,Z):- det(X,Y), n(Y,Z).
vp(X,Z):- v(X,Y), np(Y,Z).
vp(X,Z):- v(X,Z).
det([the|W],W).
det([a|W],W).
n([woman|W],W).
n([man|W],W).
v([shoots|W],W).
It mentions:
Consider these rules carefully. For example, the s rule says: I know that the pair of lists X and Z represents a sentence if (1) I can consume X and leave behind a Y , and the pair X and Y represents a noun phrase, and (2) I can then go on to consume Y leaving Z behind , and the pair Y Z represents a verb phrase . The np rule and the second of the vp rules work similarly.
And
Think of the first list as what needs to be consumed (or if you prefer: the input list ), and the second list as what we should leave behind (or: the output list ). Viewed from this (rather procedural) perspective the difference list
[a,woman,shoots,a,man] [].
represents the sentence a woman shoots a man because it says: If I consume all the symbols on the left, and leave behind the symbols on the right, then I have the sentence I am interested in. That is, the sentence we are interested in is the difference between the contents of these two lists.
That’s all we need to know about difference lists to rewrite our recogniser. If we simply bear in mind the idea of consuming something, and leaving something behind in mind, we obtain the following recogniser:
As an explanation, but that just isn't doing anything to clarify this code to me. I understand it's more efficient than using append/3
, but as for the notion of consuming things and leaving others behind, it seems so much more confusing than the previous append/3
implementation, and I just can't make heads or tails of it. Furthermore when I copy and paste that code into a Prolog visualizer to try to understand it, the visualizer says there are errors. Could anyone shed some light on this?
Listing as facts
Lets try to explain this with a counter example. Let's specify the nouns, verbs, etc. with simple facts:
det(the).
det(a).
n(woman).
n(man).
v(shoots).
Now we can implement a noun phrase np
as:
np([X,Y]) :-
det(X),
n(Y).
In other words we say "a noun phrase is a sentence with two words the first being a det
, the second a n
". And this will work: if we query np([a,woman])
, it will succeed etc.
But now we need to do something more advance, define the verb phrase. There are two possible verb phrases: the one with a verb and a noun phrase that was originally defined as:
vp(X,Z):- v(X,Y),np(Y,Z).
We could define it as:
vp([X|Y]) :-
v(X),
np(Y).
And the one with only a single verb:
vp(X,Z):- v(X,Z).
That can be converted into:
vp([X]) :-
v(X).
The guessing problem
The problem however is that both variants have a different number of words: there are verb phrases with one word and with three words. That's not really a problem, but now say - I know this is not correct English - there exists a sentence defined as vp
followed by np
, so this would be:
s(X,Z):- vp(X,Y), np(Y,Z).
in the original grammar.
The problem is that if we want to transform this into our new way of representing it, we need to know how much vp
will consume (how much words will be eaten by vp
). We cannot know this in advance: since at this point we don't know much about the sentence, we cannot make a guess whether vp
will eat one or three words.
We might of course guess the number of words with:
s([X|Y]) :-
vp([X]),
np(Y).
s([X,Y,Z|T]) :-
vp([X,Y,Z]),
np(Z).
But I hope you can imagine that if you would define verb phrases with 1, 3, 5 and 7 words, things will get problematic. Another way to solve this, is leave this to Prolog:
s(S) :-
append(VP,NP,S),
vp(VP),
np(NP).
Now Prolog will guess how to subdivide the sentence into two parts first and then try to match every part. But the problem is that for a sentence with n words, there are n breakpoints.
So Prolog will for instance first split it as:
VP=[],NP=[shoots,the,man,the,woman]
(remember we swapped the order of the verb phrase and noun phrase). Evidently vp
will not be very happy if it gets an empty string. So that will be rejected easily. But next it says:
VP=[shoots],NP=[the,man,the,woman]
Now vp
is happy with only shoots
, but it will require some computational effort to realize that. np
is however not amused with this long part. So Prolog backtracks again:
VP=[shoots,the],NP=[man,the,woman]
now vp
will complain again that it has been given too much words. Finally Prolog will split it correctly with:
VP=[shoots,the,woman],NP=[the,woman]
The point is that it requires a large number of guesses. And for each of these guesses vp
and np
will require work as well. For a real complicated grammar, vp
and np
might split the sentence further resulting in a huge amount of try-and-error.
The true reason is the append/3
has no "semantical" clue how to split the sentence, so it tries all possibilities. One is however more interested in an approach where vp
could provide information about what share of the sentence it really wants.
Furthermore if you have to split the sentence into 3 parts, the number of ways to do this even increases to O(n^2) and so on. So guessing will not do the trick.
You might also try to generate a random verb phrase, and then hope the verb phrase matches:
s(S) :-
vp(VP),
append(VP,NP,S),
np(NP).
But in that case the number of guessed verb phrases will blow up exponentially. Of course you can give "hints" etc. to speed up the process, but it will still take some time.
The solution
What you want to do is to provide the part of the sentence to every predicate such that a predicate looks like:
predicate(Subsentence,Remaining)
Subsentence
is a list of words that start for that predicate. For instance for a noun phrase, it might look like [the,woman,shoots,the,man]
. Every predicate consumes the words it is interested in: the words up to a certain point. In this case, the noun-phrase is only interested in ['the','woman']
, because that is a noun-phrase. To do the remaining parsing, it returns the remaining part [shoots,the,woman]
, in the hope that some other predicate can consume the remainder of the sentence.
For our table of facts that is easy:
det([the|W],W).
det([a|W],W).
n([woman|W],W).
n([man|W],W).
v([shoots|W],W).
It thus means that if you query a setence: [the,woman,shoots,...]
, and you ask the det/2
whether that's a determiner, it will say: "yes, the
is a determiner, but the remaining part [woman,shoots,...]
" is not part of the determiner, please match it with something else.
This matching is done because a list is represented as a linked list. [the,woman,shoots,...]
, is actually represented as [the|[woman|[shoots|...]]]
(so it points to the next "sublist"). If you match:
[the|[woman|[shoots|...]]]
det([the|W] ,W)
It will unify [woman|[shoots|...]]
with W
and thus result in:
det([the|[woman|[shoots|...]],[woman|[shoots|...]]).
Thus returning the remaining list, it has thus consumed the the
part.
Higher level predicates
Now in case we define the noun phrase:
np(X,Z):- det(X,Y), n(Y,Z).
And we again call with [the,woman,shoots,...]
, it will query unifying X
with that list. It will first call det
that will consume the
, without any need for backtracking. Next Y
is equal to [woman,shoots,...]
, now the n/2
will consume woman and return [shoots,...]
. This is also the result the np
will return, and another predicate will have to consume this.
Usefulness
Say you introduce your name as an additional noun:
n([doug,smith|W],W).
(sorry for using small cases, but otherwise Prolog sees these as variables).
It will simply try to match the first two words with doug
and smith
, and if that succeeds, try to match the remaining of the sentence. So one can make a setence like: [the,doug,smith,shoots,the,woman]
(sorry about that, furthermore in English, some noun phrases map to a noun directly np(X,Y) :- n(X,Y)
so the the
can be removed for a more complex English grammar).
Guessing completely eliminated?
Is the guessing completely eliminated? No. It is still possible that there is overlap in consuming. For instance you might add:
n([doug,smith|W],W).
n([doug|W],W).
In that case if you query for [the,doug,smith,shoots,the,woman]
. It will first consume/eat the in det
, next it will look for a noun to consume from [doug,smith,...]
. There are two candidates. Prolog will first try to eat only doug
, and match [smith,shoots,...]
as an entire verb phrase, but since smith
is not a verb, it will backtrack, reconsider eating only one word and thus decinde to eat both doug
and smith
as a noun instead.
The point is that when using append, Prolog would have to backtrack as well.
Conclusion
By using difference lists, you can eat an arbitrary amount of words. The remainder is returned such that other sentence parts like the verb phrase aim to consume the remainder. The list is furthermore always fully grounded, so one definitely does not use brute force to generate all kinds of verb phrases first.
This answer is a follow-up to the answer by @mat. Let's proceed step-by-step:
We start with the code shown in @mat's answer:
s --> np, vp.
np --> det, n.
vp --> v, np.
vp --> v.
det --> [the].
det --> [a].
n --> [woman].
n --> [man].
v --> [shoots].
So far, we have used (',')/2
: (A,B)
means sequence A
followed by sequence B
.
Next, we will also use ('|')/2
: (A|B)
means sequence A
or sequence B
.
For information on control-constructs in grammar bodies read this manual section.
s --> np, vp.
np --> det, n.
vp --> v, np | v.
det --> [the] | [a].
n --> [woman] | [man].
v --> [shoots].
We can make the code more concise by "inlining" a little:
s --> np, vp.
np --> ([the] | [a]), ([woman] | [man]).
vp --> v,np | v.
v --> [shoots].
How about some more inlining---as suggested by @mat in his comment...
s --> np, [shoots], (np | []).
np --> ([the] | [a]), ([woman] | [man]).
Take it to the max! (The following could be written as a one-liner.)
?- phrase((( [the]
| [a]), ( [woman]
| [man]), [shoots], ( ( [the]
| [a]), ( [woman]
| [man])
| [])),
Ws).
Different formulations all have their up-/down-sides, e.g., very compact code is harder to extend, but may be required when space is scarce—like when putting code on presentation slides.
Sample query:
?- phrase(s,Ws).
Ws = [the, woman, shoots, the, woman]
; Ws = [the, woman, shoots, the, man]
; Ws = [the, woman, shoots, a, woman]
; Ws = [the, woman, shoots, a, man]
; Ws = [the, woman, shoots]
; Ws = [the, man, shoots, the, woman]
; Ws = [the, man, shoots, the, man]
; Ws = [the, man, shoots, a, woman]
; Ws = [the, man, shoots, a, man]
; Ws = [the, man, shoots]
; Ws = [a, woman, shoots, the, woman]
; Ws = [a, woman, shoots, the, man]
; Ws = [a, woman, shoots, a, woman]
; Ws = [a, woman, shoots, a, man]
; Ws = [a, woman, shoots]
; Ws = [a, man, shoots, the, woman]
; Ws = [a, man, shoots, the, man]
; Ws = [a, man, shoots, a, woman]
; Ws = [a, man, shoots, a, man]
; Ws = [a, man, shoots]. % 20 solutions
I know exactly what you mean. It is quite hard at first to think in terms of list differences. The good news is that you usually do not have to do this.
There is a built-in formalism called Definite Clause Grammars (DCGs) which make it completely unnecessary to manually encode list differences in cases like yours.
In your example, I recommend you simply write this as:
s --> np, vp.
np --> det, n.
vp --> v, np.
vp --> v.
det --> [the].
det --> [a].
n --> [woman].
n --> [man].
v --> [shoots].
and accept the fact that within DCGs, [T]
represents the terminal T
and,
is read as "and then". That's how to describe lists with DCGs.
You can already use this to ask for all solutions, using the phrase/2
interface to DCGs:
?- phrase(s, Ls).
Ls = [the, woman, shoots, the, woman] ;
Ls = [the, woman, shoots, the, man] ;
Ls = [the, woman, shoots, a, woman] ;
etc.
It is hard to understand this in procedural terms at first, so I suggest that you do not try this. Instead, focus an a declarative reading of the grammar and see that it describes lists.
Later, you can go into more implementation details. Start with the way how simple terminals are translated, and then move on to more advanced grammar constructs: Rules containing a single terminal, then rules containing a terminal and a nonterminal etc.