Umlauts in JSON files lead to errors in Python code

Posted 2019-07-04 16:09

Question:

I've created python modules from the JSON grammar on github / antlr4 with

antlr4 -Dlanguage=Python3 JSON.g4

I've written a main program "JSON2.py" following this guide: https://github.com/antlr/antlr4/blob/master/doc/python-target.md and downloaded the example1.json also from github.

python3 ./JSON2.py example1.json # works perfectly, but 
python3 ./JSON2.py bookmarks-2017-05-24.json # the bookmarks contain German Umlauts like "ü"

...
File "/home/xyz/lib/python3.5/site-packages/antlr4/FileStream.py", line 27, in readDataFrom
    return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 227: ordinal not in range(128)

The offending line in JSON2.py is

input = FileStream(argv[1])

I've searched stackoverflow and tried this instead of using the above FileStream:

fp = codecs.open(argv[1], 'rb', 'utf-8')
try:
    input = fp.read()
finally:
    fp.close()
lexer = JSONLexer(input)
stream = CommonTokenStream(lexer)
parser = JSONParser(stream)
tree = parser.json() # This is line 39, mentioned in the error message     

Execution of this program ends with an error message, even if the input file doesn't contain Umlauts:

python3 ./JSON2.py example1.json 
Traceback (most recent call last):
  File "./JSON2.py", line 46, in <module>
    main(sys.argv)
  File "./JSON2.py", line 39, in main
    tree = parser.json()    
  File "/home/x/Entwicklung/antlr/links/JSONParser.py", line 108, in json
    self.enterRule(localctx, 0, self.RULE_json)
  File "/home/xyz/lib/python3.5/site-packages/antlr4/Parser.py", line 358, in enterRule
    self._ctx.start = self._input.LT(1)
  File "/home/xyz/lib/python3.5/site-packages/antlr4/CommonTokenStream.py", line 61, in LT
    self.lazyInit()
  File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 186, in lazyInit
    self.setup()
  File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 189, in setup
    self.sync(0)
  File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 111, in sync
    fetched = self.fetch(n)
  File "/home/xyz/lib/python3.5/site-packages/antlr4/BufferedTokenStream.py", line 123, in fetch
    t = self.tokenSource.nextToken()
  File "/home/xyz/lib/python3.5/site-packages/antlr4/Lexer.py", line 111, in nextToken
    tokenStartMarker = self._input.mark()
AttributeError: 'str' object has no attribute 'mark'
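
The AttributeError at the bottom points at the actual bug: fp.read() returns a plain Python str, but the ANTLR lexer expects a character-stream object that provides mark() (in the Python runtime that is antlr4.InputStream, so wrapping the string, e.g. JSONLexer(InputStream(input)), is the usual fix). A stdlib-only sketch of the mismatch, without the ANTLR runtime:

```python
# The lexer's nextToken() begins with self._input.mark().
# A plain str has no mark() method, so passing the result of
# fp.read() directly to the lexer fails with this AttributeError.
text = '{ "name": "Bücher" }'

assert not hasattr(text, "mark")   # str is not a character stream
assert hasattr(text, "encode")     # it is just text

print(type(text).__name__)         # str
```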

This parses correctly:

javac *.java
grun JSON json -gui bookmarks-2017-05-24.json

So the grammar itself is not the problem.

So finally the question: how should I read the input file in Python so that the lexer and parser can digest it?

Thanks in advance.

Answer 1:

Make sure your input file is actually encoded as UTF-8. Many problems with character recognition in the lexer are caused by other encodings. I just took a testbed application, added ë to the list of characters allowed in an IDENTIFIER, and it works again. UTF-8 is the key, and make sure your grammar also allows these characters where you want to accept them.
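
One way to check whether a file really is UTF-8 is simply to try decoding its bytes. A minimal stdlib sketch (the function name is mine, not from ANTLR):

```python
import os
import tempfile

def is_valid_utf8(path):
    """Return True if the file's raw bytes decode cleanly as UTF-8."""
    with open(path, "rb") as f:
        try:
            f.read().decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

# Example: a Latin-1 encoded "ü" is the single byte 0xfc,
# which is not a valid UTF-8 sequence.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("ü".encode("latin-1"))

print(is_valid_utf8(tmp.name))  # False
os.unlink(tmp.name)
```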



Answer 2:

I solved it by passing the encoding info:

input = FileStream(sys.argv[1], encoding = 'utf8')


Without the encoding info, I get the same error as you did:

Traceback (most recent call last):
  File "test.py", line 20, in <module>
    main()
  File "test.py", line 9, in main
    input = FileStream(sys.argv[1])
  File ".../lib/python3.5/site-packages/antlr4/FileStream.py", line 20, in __init__
    super().__init__(self.readDataFrom(fileName, encoding, errors))
  File ".../lib/python3.5/site-packages/antlr4/FileStream.py", line 27, in readDataFrom
    return codecs.decode(bytes, encoding, errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 1: ordinal not in range(128)

Here my input data is [今明]天(台南|高雄)的?天氣如何 (a Chinese pattern meaning roughly "What is the weather in Tainan or Kaohsiung today/tomorrow?").
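
Both tracebacks have the same cause: without an explicit encoding, the bytes are decoded as ASCII, and any non-ASCII character (the German "ü", or the first Chinese character 今 here, whose UTF-8 encoding begins with 0xe4) contains bytes above 127. A stdlib-only illustration:

```python
# "ü" encodes to two bytes in UTF-8; 今 encodes to three bytes
# starting with 0xe4 -- exactly the bytes named in the tracebacks.
assert "ü".encode("utf-8") == b"\xc3\xbc"
assert "今".encode("utf-8") == b"\xe4\xbb\x8a"

raw = "ü".encode("utf-8")
try:
    raw.decode("ascii")            # what happens without encoding info
except UnicodeDecodeError as exc:
    print(exc)                     # ... can't decode byte 0xc3 ...

print(raw.decode("utf-8"))         # ü
```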