I have a dictionary of Stata keywords and reasonable knowledge of Stata syntax. I would like to devote a few hours to turning it into a Stata lexer for Pygments.
However, I cannot find enough documentation about the syntax of lexers and find myself unable to start coding the lexer. Could someone point out a good tutorial for writing new lexers for Pygments?
I know about the Pygments API and the lexer development page, but honestly, these are not enough for someone like me with very limited knowledge of Python.
My strategy so far has been to look for examples. I have found quite a few, e.g. Puppet, Sass, Scala, Ada, but they helped only so much. Any help with how to get started from my Stata keywords would be welcome.
If you just wanted to highlight the keywords, you'd start with this (replacing the keywords with your own list of Stata keywords):
    import re

    from pygments.lexer import RegexLexer
    from pygments.token import Keyword


    class StataLexer(RegexLexer):
        name = 'Stata'
        aliases = ['stata']
        filenames = ['*.do', '*.ado']  # must be a list; Stata do-files and ado-files
        flags = re.MULTILINE | re.DOTALL

        tokens = {
            'root': [
                # Placeholder keyword list: replace with your Stata keywords.
                (r'\b(abstract|case|catch|class|do|else|extends|false|final|'
                 r'finally|for|forSome|if|implicit|import|lazy|match|new|null|'
                 r'object|override|package|private|protected|requires|return|'
                 r'sealed|super|this|throw|trait|try|true|type|while|with|'
                 r'yield)\b', Keyword),
            ],
        }
I suspect your problem is less a lack of Python than limited experience with how a lexer works, because this implementation is fairly straightforward.
Then, if you want to add more, append another element to the 'root' list: a two-element tuple whose first element is a regular expression and whose second element is the token type (the syntactic class) to assign to whatever the expression matches. The rules are tried in order, so put more specific patterns before more general ones.
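To illustrate adding rules to the 'root' list, here is a minimal sketch with a few more tuples. The class name StataSketchLexer, its alias, and the short keyword tuple are made up for the example, and the patterns cover only a small part of real Stata syntax (`//` comments, double-quoted strings, plain numbers). The last two rules are catch-alls, so text that no other rule matches is emitted as plain text rather than as an error token.

```python
import re

from pygments.lexer import RegexLexer, words
from pygments.token import Comment, Keyword, Name, Number, String, Text

# Hypothetical, deliberately short list; substitute your own keywords.
KEYWORDS = ('generate', 'replace', 'regress', 'summarize', 'foreach',
            'forvalues', 'local', 'global', 'if', 'in')


class StataSketchLexer(RegexLexer):
    name = 'StataSketch'          # illustrative name, not a built-in lexer
    aliases = ['stata-sketch']
    filenames = ['*.do', '*.ado']
    flags = re.MULTILINE          # makes `$` match at the end of each line

    tokens = {
        'root': [
            # Each rule is (regular expression, token type); order matters.
            (r'//.*?$', Comment.Single),                 # line comment
            (r'"[^"\n]*"', String.Double),               # simple string
            (words(KEYWORDS, prefix=r'\b', suffix=r'\b'), Keyword),
            (r'[A-Za-z_]\w*', Name),                     # other identifiers
            (r'\d+(\.\d+)?', Number),                    # integer or decimal
            (r'\s+', Text),                              # whitespace
            (r'.', Text),                                # anything else
        ],
    }


if __name__ == '__main__':
    sample = 'generate x = 2 // a note\n'
    for token_type, value in StataSketchLexer().get_tokens(sample):
        print(token_type, repr(value))
```

Printing the token stream like this is a quick way to check that each new rule picks up what you expect before you wire the lexer into a formatter with pygments.highlight().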
I recently attempted to write a Pygments lexer (for BibTeX, which has a simple syntax) and agree with your assessment that the resources out there aren't very helpful for people unfamiliar with Python or general code-parsing concepts.
What I found to be most helpful was the collection of lexers included with Pygments.
There is a file, _mapping.py, that lists all of the recognized language formats and links to the lexer object for each one. To construct my lexer, I tried to think of languages that had similar constructs to the ones I was handling and checked if I could tease out something useful. Some of the built-in lexers are more complex than I wanted, but others were helpful.
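As an alternative to reading _mapping.py by hand, the same listing can be queried from Python. The sketch below uses get_all_lexers() and get_lexer_by_name() from pygments.lexers; the helper find_lexers is a made-up name for this example, and the search is a plain substring match on the lexer name.

```python
from pygments.lexers import get_all_lexers, get_lexer_by_name


def find_lexers(fragment):
    """Return (name, aliases) for built-in lexers whose name contains fragment."""
    fragment = fragment.lower()
    return [
        (name, aliases)
        for name, aliases, filenames, mimetypes in get_all_lexers()
        if fragment in name.lower()
    ]


if __name__ == '__main__':
    # Look for lexers of languages similar to the one you are targeting.
    for name, aliases in find_lexers('python'):
        print(name, aliases)

    # The module name tells you which source file to open and read.
    lexer = get_lexer_by_name('python')
    print(type(lexer).__module__)
```

Once you know the module, you can open the corresponding file in your Pygments installation and borrow the rules that resemble your own language.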