What's a common way of generating sentences from a grammar?
I want an algorithm that's sort of the opposite of a parser. That is, given a formal context-free grammar (say LL), I want to generate an arbitrary sentence that conforms to that grammar. I use sentence here to mean any valid body of text, so it can actually be a whole program (even if it doesn't make any sense—as long as it's syntactially correct).
Example grammar:
program : <imports> NEWLINE? <namespace>
imports : ("import" <identifier> NEWLINE)*
namespace : "namespace " <identifier> NEWLINE "{" <classes> "}"
identifier: (A-Za-z_) (A-Za-z0-9_)*
...
Example generated program:
import jkhbhhuob
import aaaaa888_
namespace u8nFGubgykb
{ class ui0op_np { ... }
}
I don't know that there's a "common" algorithm for doing this. Random program generation is used in genetic programming so you could look for a grammar based GP system and see how they handle program generation. I would do a recursive rule generation algorithm like the pseudo-code:
This assumes that you've read in all of the parts into some data structure. You'd also need to handle the repetitions(randomly generate the number of times they occur) and optional rules (flip a coin to see if they are there or not).
Edit: Oh, and if the rule has more than one option, you'd just pick one of the options to go with, and process it the same way. So if some rule was (Literal|Variable), you'd randomly pick between the two.
As usual, I’m going to advise against reinventing the wheel. I’ve written one of these for ARM assembler, but I’m on record as regretting it (Software: Practice and Experience April 2007):
“In retrospect, an off-the-shelf expression generator should have been used to generate random ARM assembly instructions for comparison. Instead a Perl script was built incrementally, taking each ARM instruction definition and generating instances. An advantage, however, of the incremental in-house approach was that simple substitutions detected simple bugs, and bug hunting could proceed incrementally.”
I’m afraid I don’t recall what made me change my mind, and I doubt it would be relevant to your specific needs, but I do suggest looking harder for a preexisting solution. It requires less discipline to write stuff like this yourself, but it always takes longer than you expect.
Off the top of my head:
I'd work recursively (basically the opposite of a recursive decent parser) using some heuristics about what to do with ranges (
(...)
: probably pick at random) optionals (?
: see [], below), repetitions('' Poisson distribution?). Literals ("..."
) are simple written to the output, and subtokens (`<...>') generate a recursion.This shouldn't be too hard unless you want to guarantee some sort of complete coverage. Even then, just generating a bunch of data would be a help...
[*] You need to include optionals less than 50% of the time to prevent infinite regress when processing rules like
Good catch by plinth.
Likewise with repetitions, throw a distributions that converges strongly.
You'll need to parse the input grammar first if it is presented in a BNF form as here. Simplest thing to do would be use a mapping
(name, string)
, then start with the highest level token (which you might assume means the first one...).This gives you:
The you start with "program", hit "<imports>" so you recur...on coming back, hist "NEWLINE?", so throw the dice and write or not, hit "<namespace>" so recur...on return you're done.
I find my self suspecting that this has been done before. If you just need the output, I'd search the web... Perhaps http://portal.acm.org/citation.cfm?doid=966137.966142, though the huge number of parser generators out there clutter up the search space... Try this paper, too.
BTW-- You local university probably has online subscriptions to these journals, so you can get them for free by hooking up at the library.
The problem you will have is that the recursive nature of the graph is such that you can generate correct grammars of infinite size. You will probably want to do something like set up a hash of node types in your grammar with counts and limits of how many times you're allowing yourself to hit that node. Then depth first search to your heart's content.
Your solution should follow the inductive structure of the grammar. How do you generate a random utterance for each of the following?
This will all be much clearer if you write down the data structure you use to represent a grammar. The structure of your set of mutually recursive generator functions will mirror that data structure very closely.
Dealing with infinite recursion is a bit dicey. The easiest way is to generate a stream of utterances and keep a depth cutoff. Or if you're using a lazy language like Haskell you can generate all utterances and peel off as many finite ones as you like (a trickier problem than the original question, but very entertaining).
not an answer, but check the wikipedia entry on grammar generation: http://en.wikipedia.org/wiki/Context-free_grammar_generation_algorithms
it described some common algorithms used.