Pythonic way to implement a tokenizer

I'm going to implement a tokenizer in Python and I was wondering if you could offer some style advice?

I've implemented a tokenizer before in C and in Java so I'm fine with the theory, I'd just like to ensure I'm following pythonic styles and best practices.

Listing Token Types:

In Java, for example, I would have a list of fields like so:

public static final int TOKEN_INTEGER = 0

But, obviously, there's no way (I think) to declare a constant variable in Python, so I could just replace this with normal variable declarations but that doesn't strike me as a great solution since the declarations could be altered.

Returning Tokens From The Tokenizer:

Is there a better alternative to just simply returning a list of tuples e.g.

[ (TOKEN_INTEGER, 17), (TOKEN_STRING, "Sixteen")]?

Cheers,

Pete

标签： python coding-style tokenize

12条回答

趁早两清

2楼-- · 2019-02-06 12:38

Python takes a "we're all consenting adults" approach to information hiding. It's OK to use variables as though they were constants, and trust that users of your code won't do something stupid.

0人赞添加讨论(0) 举报

贪生不怕死

3楼-- · 2019-02-06 12:40

I have recently built a tokenizer, too, and passed through some of your issues.

Token types are declared as "constants", i.e. variables with ALL_CAPS names, at the module level. For example,

_INTEGER = 0x0007
_FLOAT = 0x0008
_VARIABLE = 0x0009

and so on. I have used an underscore in front of the name to point out that somehow those fields are "private" for the module, but I really don't know if this is typical or advisable, not even how much Pythonic. (Also, I'll probably ditch numbers in favour of strings, because during debugging they are much more readable.)

Tokens are returned as named tuples.

from collections import namedtuple
Token = namedtuple('Token', ['value', 'type'])
# so that e.g. somewhere in a function/method I can write...
t = Token(n, _INTEGER)
# ...and return it properly

I have used named tuples because the tokenizer's client code (e.g. the parser) seems a little clearer while using names (e.g. token.value) instead of indexes (e.g. token[0]).

Finally, I've noticed that sometimes, especially writing tests, I prefer to pass a string to the tokenizer instead of a file object. I call it a "reader", and have a specific method to open it and let the tokenizer access it through the same interface.

def open_reader(self, source):
    """
    Produces a file object from source.
    The source can be either a file object already, or a string.
    """
    if hasattr(source, 'read'):
        return source
    else:
        from io import StringIO
        return StringIO(source)

0人赞添加讨论(0) 举报

Anthone

4楼-- · 2019-02-06 12:44

"Is there a better alternative to just simply returning a list of tuples?"

That's the approach used by the "tokenize" module for parsing Python source code. Returning a simple list of tuples can work very well.

0人赞添加讨论(0) 举报

放我归山

5楼-- · 2019-02-06 12:45

In many situations, exp. when parsing long input streams, you may find it more useful to implement you tokenizer as a generator function. This way you can easily iterate over all the tokens without the need for lots of memory to build the list of tokens first.

For generator see the original proposal or other online docs

0人赞添加讨论(0) 举报

Emotional °昔

6楼-- · 2019-02-06 12:48

Thanks for your help, I've started to bring these ideas together, and I've come up with the following. Is there anything terribly wrong with this implementation (particularly I'm concerned about passing a file object to the tokenizer):

class Tokenizer(object):

  def __init__(self,file):
     self.file = file

  def __get_next_character(self):
      return self.file.read(1)

  def __peek_next_character(self):
      character = self.file.read(1)
      self.file.seek(self.file.tell()-1,0)
      return character

  def __read_number(self):
      value = ""
      while self.__peek_next_character().isdigit():
          value += self.__get_next_character()
      return value

  def next_token(self):
      character = self.__peek_next_character()

      if character.isdigit():
          return self.__read_number()

0人赞添加讨论(0) 举报

疯言疯语

7楼-- · 2019-02-06 12:48

"Is there a better alternative to just simply returning a list of tuples?"

Nope. It works really well.

0人赞添加讨论(0) 举报

1 2 下一页

Pythonic way to implement a tokenizer

采纳回答

编辑标签

举报内容

检举类型

检举原因

检举说明(必填)

打开微信“扫一扫”，打开网页后点击屏幕右上角分享按钮

付费偷看金额在0.1-10元之间