Can Bison parse UTF-8 characters?

Posted 2020-06-16 04:16

I'm trying to make a Bison parser to handle UTF-8 characters. I don't want the parser to actually interpret the Unicode character values, but I want it to parse the UTF-8 string as a sequence of bytes.

Right now, Bison generates the following code which is problematic:

  if (yychar <= YYEOF)
    {
      yychar = yytoken = YYEOF;
      YYDPRINTF ((stderr, "Now at end of input.\n"));
    }

The problem is that many bytes of a UTF-8 string have a negative value when read as a signed char, and Bison interprets negative values as EOF and stops.

Is there a way around this?

Tags: c++ utf-8 bison

3 Answers

仙女界的扛把子

#2 · 2020-06-16 04:30

This is a question from 4 years ago, but I'm facing the same issue and I'd like to share my ideas.

The difficulty is that UTF-8 characters vary in length (1 to 4 bytes), so you don't know in advance how many bytes to read. As suggested in another answer, you can use your own lexer and have it either read whole lines or read 4 bytes at a time. Then extract the UTF-8 character from that buffer, and read more bytes to refill the buffer to 4 bytes.
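
One detail that helps with the buffering idea: the first byte of a UTF-8 sequence already tells you how long the sequence is. Below is a minimal sketch of that check. The function name `utf8_seq_len` is made up for this example, and it only inspects the lead byte; it does not validate the continuation bytes that follow.

```c
#include <stddef.h>

/* Hypothetical helper: returns the length in bytes (1..4) of the UTF-8
   sequence that starts with `lead`, or 0 if `lead` cannot start a sequence
   (a continuation byte or a byte that never appears in valid UTF-8). */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)
        return 1;                      /* 0xxxxxxx: ASCII */
    if (lead >= 0xC2 && lead <= 0xDF)
        return 2;                      /* 110xxxxx, skipping overlong 0xC0/0xC1 */
    if ((lead & 0xF0) == 0xE0)
        return 3;                      /* 1110xxxx */
    if (lead >= 0xF0 && lead <= 0xF4)
        return 4;                      /* 11110xxx, up to U+10FFFF */
    return 0;                          /* 0x80..0xBF, 0xC0, 0xC1, 0xF5..0xFF */
}
```

A lexer could call this on the first byte in its buffer and then read exactly that many bytes, instead of always topping the buffer up to 4.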

Luminary・发光体

#3 · 2020-06-16 04:37

Since flex is the issue here, you might want to take a look at zlex.

Rolldiameter

#4 · 2020-06-16 04:52

Bison yes, flex no. The one time I needed a Bison parser to work with UTF-8 encoded files, I ended up writing my own yylex function.

Edit: To help, I used a lot of the Unicode operations available in GLib (there's a gunichar type and some file/string manipulation functions that I found useful).
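
For illustration, here is a minimal sketch of what a hand-written yylex could look like if all you want is to pass bytes through. It assumes the input is a NUL-terminated string already in memory; `set_input` and `input_cursor` are hypothetical names for this example, not part of Bison. The key point is the cast to unsigned char, which keeps bytes of 0x80 and above positive so they are not mistaken for end of input.

```c
#include <stddef.h>

/* Hypothetical input state for this sketch; a real lexer would more likely
   read from a FILE* or a length-delimited buffer. */
static const char *input_cursor = "";

void set_input(const char *text)
{
    input_cursor = text;
}

/* Byte-level yylex: returns each byte as a value in 1..255, and 0 at the end
   of the input. Without the cast, a plain `char` that is signed would turn
   bytes >= 0x80 into negative ints, which the generated parser's
   `yychar <= YYEOF` check treats as end of input. */
int yylex(void)
{
    if (*input_cursor == '\0')
        return 0;                           /* 0 is what Bison expects for EOF */
    return (unsigned char)*input_cursor++;
}
```

This only addresses the negative-value problem. How the grammar consumes the high bytes is a separate decision; one option is to have yylex collect a whole multi-byte sequence and hand it to the parser as a single token through yylval.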
