Pythonic way to implement a tokenizer

Posted 2019-02-06 12:30

I'm going to implement a tokenizer in Python and I was wondering if you could offer some style advice?

I've implemented a tokenizer before in C and in Java, so I'm fine with the theory; I'd just like to make sure I'm following Pythonic style and best practices.

Listing Token Types:

In Java, for example, I would have a list of fields like so:

public static final int TOKEN_INTEGER = 0;

But, obviously, there's no way (as far as I know) to declare a constant in Python, so I could just replace this with ordinary variable assignments; that doesn't strike me as a great solution, though, since the values could later be altered.
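For example, something like this, which nothing stops later code from reassigning:

TOKEN_INTEGER = 0
TOKEN_STRING = 1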

Returning Tokens From The Tokenizer:

Is there a better alternative to simply returning a list of tuples, e.g.

[(TOKEN_INTEGER, 17), (TOKEN_STRING, "Sixteen")]?

Cheers,

Pete

12 answers
爱情/是我丢掉的垃圾
Answer 2 · 2019-02-06 12:48

"Is there a better alternative to just simply returning a list of tuples"

I had to implement a tokenizer once, and it required a more complex approach than a list of tuples, so I implemented a class for each token. You can then return a list of class instances, or, if you want to save resources, you can return something implementing the iterator interface and generate the next token as you progress through the parsing.
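A minimal sketch of that shape (the names and the trivial whitespace splitting are purely illustrative; the point is the token class plus a generator that yields instances lazily):

class Token:
    """One token: a type tag plus the matched value."""
    def __init__(self, type_, value):
        self.type = type_
        self.value = value

    def __repr__(self):
        return "Token(%r, %r)" % (self.type, self.value)

def tokenize(text):
    # A generator: the parser pulls tokens one at a time instead of
    # forcing the whole token list to be built up front.
    for word in text.split():
        if word.isdigit():
            yield Token("INTEGER", int(word))
        else:
            yield Token("IDENTIFIER", word)

list(tokenize("17 sixteen")) then gives [Token('INTEGER', 17), Token('IDENTIFIER', 'sixteen')].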

在下西门庆
Answer 3 · 2019-02-06 12:53

There's an undocumented class in the re module called re.Scanner. It's very straightforward to use for a tokenizer:

import re

scanner = re.Scanner([
    (r"[0-9]+",  lambda scanner, token: ("INTEGER", token)),
    (r"[a-z_]+", lambda scanner, token: ("IDENTIFIER", token)),
    (r"[,.]+",   lambda scanner, token: ("PUNCTUATION", token)),
    (r"\s+", None),  # None means: skip the token.
])

results, remainder = scanner.scan("45 pigeons, 23 cows, 11 spiders.")
print(results)

will result in

[('INTEGER', '45'),
 ('IDENTIFIER', 'pigeons'),
 ('PUNCTUATION', ','),
 ('INTEGER', '23'),
 ('IDENTIFIER', 'cows'),
 ('PUNCTUATION', ','),
 ('INTEGER', '11'),
 ('IDENTIFIER', 'spiders'),
 ('PUNCTUATION', '.')]

I used re.Scanner to write a pretty nifty configuration/structured data format parser in only a couple of hundred lines.

\"骚年 ilove
Answer 4 · 2019-02-06 12:55

I'd turn to the excellent Text Processing in Python by David Mertz.

家丑人穷心不美
Answer 5 · 2019-02-06 12:57

This being a late answer: there is now something in the official documentation, Writing a tokenizer with the re standard library. It appears in the Python 3 documentation but not in the Python 2.7 docs; the technique is still applicable to older Pythons, though.

It offers short code, easy setup, and a generator-based tokenizer, as several answers here have proposed.
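Condensed, the approach from that part of the docs looks roughly like this (a sketch from memory, not the docs' exact code; the token set here is invented):

import collections
import re

Token = collections.namedtuple("Token", ["type", "value"])

def tokenize(code):
    # One big alternation of named groups; the name of the group that
    # matched tells us which kind of token we found.
    token_spec = [
        ("NUMBER",   r"\d+"),
        ("IDENT",    r"[A-Za-z_]\w*"),
        ("OP",       r"[+\-*/=]"),
        ("SKIP",     r"\s+"),
        ("MISMATCH", r"."),
    ]
    pattern = "|".join("(?P<%s>%s)" % pair for pair in token_spec)
    for mo in re.finditer(pattern, code):
        kind = mo.lastgroup
        if kind == "SKIP":
            continue
        if kind == "MISMATCH":
            raise RuntimeError("unexpected character %r" % mo.group())
        yield Token(kind, mo.group())

list(tokenize("x = 17 + y")) yields named tuples such as Token(type='IDENT', value='x'), Token(type='OP', value='='), and so on.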

If the docs are not Pythonic, I don't know what is :-)

可以哭但决不认输i
Answer 6 · 2019-02-06 12:58

I've implemented a tokenizer for a C-like programming language. What I did was to split up the creation of tokens into two layers:

  • a surface scanner: This one actually reads the text and uses regular expressions to split it up into only the most primitive tokens (operators, identifiers, numbers, ...); it yields tuples (tokenname, scannedstring, startpos, endpos).
  • a tokenizer: This one consumes the tuples from the first layer, turning them into token objects (named tuples would do as well, I think). Its purpose is to detect some long-range dependencies in the token stream, particularly strings (with their opening and closing quotes) and comments (with their opening and closing lexemes; yes, I wanted to retain comments!), and to coerce them into single tokens. The resulting stream of token objects is then returned to a consuming parser.

Both are generators. The benefits of this approach were:

  • Reading of the raw text is done only in the most primitive way, with simple regexps - fast and clean.
  • The second layer is already implemented as a primitive parser, to detect string literals and comments - re-use of parser technology.
  • You don't have to strain the surface scanner with complex detections.
  • But the real parser gets tokens on the semantic level of the language to be parsed (again strings, comments).

I feel quite happy with this layered approach.
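A very rough sketch of what the two layers could look like (this is not the code from that project; the token set, and the fact that only comments get coerced here, are simplifications for illustration):

import re

def scan(text):
    # Layer 1: surface scanner. Only primitive tokens, matched with
    # simple regexps, each reported with its start and end position.
    pattern = re.compile(r"(?P<COMMENT_START>/\*)|(?P<COMMENT_END>\*/)"
                         r"|(?P<NUMBER>\d+)|(?P<NAME>\w+)"
                         r"|(?P<OP>[+\-*/=])|(?P<WS>\s+)")
    for mo in pattern.finditer(text):
        yield mo.lastgroup, mo.group(), mo.start(), mo.end()

def tokenize(text):
    # Layer 2: a tiny parser over the raw stream that coerces
    # multi-piece constructs (here: /* ... */ comments) into one token.
    raw = scan(text)
    for kind, value, start, end in raw:
        if kind == "WS":
            continue
        if kind == "COMMENT_START":
            parts = [value]
            for kind2, value2, _, end2 in raw:
                parts.append(value2)
                if kind2 == "COMMENT_END":
                    yield "COMMENT", "".join(parts), start, end2
                    break
        else:
            yield kind, value, start, end

Both layers are generators, so nothing is materialised until the consuming parser asks for the next token.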

趁早两清
Answer 7 · 2019-02-06 12:59

When I start something new in Python, I usually look first for existing modules or libraries to use. There's a better than 90% chance that something is already available.

For tokenizers and parsers this is certainly the case. Have you looked at PyParsing?
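For the same toy input as the re.Scanner answer above, a pyparsing version might look something like this (a sketch with an invented grammar; pyparsing skips whitespace between matched elements by default):

from pyparsing import OneOrMore, Word, alphas, nums, oneOf

integer = Word(nums)
identifier = Word(alphas + "_")
punctuation = oneOf(", .")

# Repeatedly match whichever alternative fits next.
tokenizer = OneOrMore(integer | identifier | punctuation)

print(tokenizer.parseString("45 pigeons, 23 cows, 11 spiders."))

To get (type, value) pairs like the other answers produce, you would typically attach results names or parse actions to each alternative.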
