Suppose we have a simple grammar specification. There is a way to enumerate terms of that grammar that guarantees that any finite term will have a finite position, by iterating it diagonally. For example, for the following grammar:
S ::= add
add ::= mul | add + mul
mul ::= term | mul * term
term ::= number | ( S )
number ::= digit | digit number
digit ::= 0 | 1 | ... | 9
You can enumerate terms like that:
0
1
0+0
0*0
0+1
(0)
1+0
0*1
0+0*0
00
... etc
My question is: is there a way to do the opposite? That is, to take a valid term of that grammar, say, 0+0*0
, and find its position on such enumeration - in that case, 9?
For this specific problem, we can cook up something fairly simple, if we allow ourselves to choose a different enumeration ordering. The idea is basically the one in Every Bit Counts, which I also mentioned in the comments. First, some preliminaries: some imports/extensions, a data type representing the grammar, and a pretty-printer. For the sake of simplicity, my digits only go up to 2 (big enough to not be binary any more, but small enough not to wear out my fingers and your eyes).
Now let's define the enumeration order. We'll use two basic combinators,
+++
for interleaving two lists (mnemonic: the middle character is a sum, so we're taking elements from either the first argument or the second) and+*+
for the diagonalization (mnemonic: the middle character is a product, so we're taking elements from both the first and second arguments). More information on these in the universe documentation. One invariant we'll maintain is that our lists -- with the exception ofdigits
-- are always infinite. This will be important later.Let's see a few terms:
Okay, now let's get to the good bit. Let's assume we have two infinite lists
a
andb
. There's two things to notice. First, ina +++ b
, all the even indices come froma
, and all the odd indices come fromb
. So we can look at the last bit of an index to see which list to look in, and the remaining bits to pick an index in that list. Second, ina +*+ b
, we can use the standard bijection between pairs of numbers and single numbers to translate between indices in the big list and pairs of indices in thea
andb
lists. Nice! Let's get to it. We'll define a class for Godel-able things that can be translated back and forth between numbers -- indices into the infinite list of inhabitants. Later we'll check that this translation matches the enumeration we defined above.The instance for pairs here is the standard Cantor diagonal. It's just a bit of algebra: use the triangle numbers to figure out where you're going/coming from. Now building up instances for this class is a breeze.
Number
s are just represented in base 3:For the remaining three types, we will, as suggested above, check the tag bit to decide which constructor to emit, and use the remaining bits as indices into a diagonalized list. All three instances necessarily look very similar.
And that's it! We can now "efficiently" translate back and forth between parse trees and their Godel numbering for this grammar. Moreover, this translation matches the above enumeration, as you can verify:
We did abuse many nice properties of this particular grammar -- non-ambiguity, the fact that almost all the nonterminals had infinitely many derivations -- but variations on this technique can get you quite far, especially if you are not too strict on requiring every number to be associated with something unique.
Also, by the way, you might notice that, except for the instance for
(Nat, Nat)
, these Godel numberings are particularly nice in that they look at/produce one bit (or trit) at a time. So you could imagine doing some streaming. But the(Nat, Nat)
one is pretty nasty: you have to know the whole number ahead of time to compute thesqrt
. You actually can turn this into a streaming guy, too, without losing the property of being dense (everyNat
being associated with a unique(Nat, Nat)
), but that's a topic for another answer...