Let's say I want to tokenize a string of words (symbols) and numbers separated by whitespace. For example, the expected result of tokenizing "aa 11" would be [tkSym("aa"), tkNum(11)].
My first attempt was the code below:
whitespace --> [Ws], { code_type(Ws, space) }, whitespace.
whitespace --> [].
letter(Let) --> [Let], { code_type(Let, alpha) }.
symbol([Sym|T]) --> letter(Sym), symbol(T).
symbol([Sym]) --> letter(Sym).
digit(Dg) --> [Dg], { code_type(Dg, digit) }.
digits([Dg|Dgs]) --> digit(Dg), digits(Dgs).
digits([Dg]) --> digit(Dg).
token(tkSym(Token)) --> symbol(Token).
token(tkNum(Token)) --> digits(Digits), { number_chars(Token, Digits) }.
tokenize([Token|Tokens]) --> whitespace, token(Token), tokenize(Tokens).
tokenize([]) --> whitespace, [].
Calling tokenize on "aa bb" leaves me with several possible answers:
?- tokenize(X, "aa bb", []).
X = [tkSym([97, 97]), tkSym([98, 98])] ;
X = [tkSym([97, 97]), tkSym([98]), tkSym([98])] ;
X = [tkSym([97]), tkSym([97]), tkSym([98, 98])] ;
X = [tkSym([97]), tkSym([97]), tkSym([98]), tkSym([98])] ;
false.
In this case, however, it seems appropriate to expect only one correct answer. Here's another, more deterministic approach:
whitespace --> [Space], { code_type(Space, space) }, whitespace.
whitespace --> [].
symbol([Sym|T]) --> letter(Sym), !, symbol(T).
symbol([]) --> [].
letter(Let) --> [Let], { code_type(Let, alpha) }.
% similarly for numbers
token(tkSym(Token)) --> symbol(Token).
tokenize([Token|Tokens]) --> whitespace, token(Token), !, tokenize(Tokens).
tokenize([]) --> whitespace, [].
But there is a problem: although the single answer to token called on "aa" is now a nice list, the tokenize predicate ends up in an infinite recursion:
?- token(X, "aa", []).
X = tkSym([97, 97]).
?- tokenize(X, "aa", []).
ERROR: Out of global stack
What am I missing? How is the problem usually solved in Prolog?
The underlying problem is that in your second version, token//1 also succeeds for the "empty" token, and therefore, unintentionally, arbitrarily many empty tokens succeed as well.
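The following queries illustrate this (a reconstruction, assuming your second version is loaded in SWI-Prolog, where double-quoted strings denote code lists):

?- phrase(token(T), "").
T = tkSym([]).

?- phrase((token(T1),token(T2)), "").
T1 = T2, T2 = tkSym([]).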
To fix this, I recommend you adjust the definitions so that a token must consist of at least one lexical element, as is also typical. A good way to ensure that at least one element is described is to split the DCG rules into two sets, shown below for symbol//1. This way, you avoid an unbounded recursion that can endlessly consume empty tokens.
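A minimal sketch of the split (the auxiliary nonterminal symbol_r//1 is an illustrative name; the entry point requires at least one letter):

symbol([L|Ls]) --> letter(L), symbol_r(Ls).

symbol_r([L|Ls]) --> letter(L), symbol_r(Ls).
symbol_r([])     --> [].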
Other points:
- Always use phrase/2 to access DCGs in a portable way, i.e., independent of the actual implementation method used by any particular Prolog system.
- The [] in the final DCG clause is superfluous; you can simply remove it.
- Also, avoid using so many !/0. It is OK to commit to the first matching tokenization, but do it only at a single place, like via a once/1 wrapped around the phrase/2 call.
- For naming, see my comment above. I recommend using tokens//1 to make this more declarative.

Sample queries, using the above definition of symbol//1:
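A reconstruction of the expected interaction (again assuming double-quoted strings denote code lists):

?- phrase(symbol(Ls), "abc").
Ls = [97, 98, 99] ;
false.

?- phrase(symbol(Ls), "").
false.

Committing to the first tokenization only once at the top level could then look like once(phrase(tokens(Ts), "aa 11")), assuming a tokens//1 defined along the lines above.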