Implementation of string literal concatenation in

AFAIK, this question applies equally to C and C++

Step 6 of the "translation phases" specified in the C standard (5.1.1.2 in the draft C99 standard) states that adjacent string literals have to be concatenated into a single literal. I.e.

printf("helloworld.c" ": %d: Hello "
       "world\n", 10);

is equivalent (syntactically) to:

printf("helloworld.c: %d: Hello world\n", 10);
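A minimal sketch of how that equivalence can be checked. The names `split_form`, `joined_form` and `same_literal` are made up for illustration, and `static_assert` assumes a C11 or later compiler.

```c
#include <assert.h>   /* static_assert (C11) */
#include <string.h>

/* The format string from the question, split across three literals. */
const char split_form[] = "helloworld.c" ": %d: Hello "
                          "world\n";

/* The same text written as a single literal. */
const char joined_form[] = "helloworld.c: %d: Hello world\n";

/* Both arrays already have the same size at translation time. */
static_assert(sizeof split_form == sizeof joined_form,
              "adjacent literals should form one literal");

/* Returns 1 when the two arrays hold identical bytes, terminator included. */
int same_literal(void)
{
    return memcmp(split_form, joined_form, sizeof joined_form) == 0;
}
```

Calling `same_literal()` from a test program should return 1 on a conforming implementation, whichever part of the toolchain did the joining.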

However, the standard doesn't seem to specify which part of the toolchain has to handle this: the preprocessor (cpp) or the compiler itself. Some online research tells me that this job is generally expected to be performed by the preprocessor (source #1, source #2, and there are more), which makes sense.

However, running cpp on Linux shows that cpp doesn't do it:

eliben@eliben-desktop:~/test$ cat cpptest.c 
int a = 5;

"string 1" "string 2"
"string 3"

eliben@eliben-desktop:~/test$ cpp cpptest.c 
# 1 "cpptest.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "cpptest.c"
int a = 5;

"string 1" "string 2"
"string 3"

So, my question is: where should this feature of the language be handled, in the preprocessor or the compiler itself?

Perhaps there's no single good answer. Heuristic answers based on experience, known compilers, and general good engineering practice will be appreciated.

P.S. If you're wondering why I care about this... I'm trying to figure out whether my Python-based C parser should handle string literal concatenation (which it doesn't do at the moment), or leave it to cpp, which it assumes runs before it.

Tags: c++ c c-preprocessor string-literals

5 Answers

等我变得足够好

#2 · 2020-03-17 09:27

I would handle it in the token-scanning part of the parser, that is, in the compiler. It seems more logical. The preprocessor does not need to know the "structure" of the language, and in fact it usually ignores it, which is why macros can generate uncompilable code. It handles nothing more than the directives specifically addressed to it (# ...) and their "consequences" (such as a #define x h, which makes the preprocessor replace every x token with h).
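The token-scanning approach described above could be sketched roughly as follows. This is only an illustration in C, not how any particular compiler does it: `struct token`, `TOK_STRING`, `TOK_OTHER`, `dup_text` and `merge_adjacent_strings` are all hypothetical names. It assumes the scanner has already stripped the quotes and decoded the escapes, and that no literal contains an embedded NUL.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token kinds: string literals versus everything else. */
enum tok_kind { TOK_STRING, TOK_OTHER };

struct token {
    enum tok_kind kind;
    char *text;   /* heap-allocated, NUL-terminated; for strings, the decoded contents */
};

/* Helper for building tokens: returns a heap copy of s, or NULL on failure. */
char *dup_text(const char *s)
{
    char *copy = malloc(strlen(s) + 1);
    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}

/* Merges each run of adjacent TOK_STRING tokens into one token, in place.
 * Returns the new token count, or (size_t)-1 if an allocation fails.
 * Entries at or past the returned count are stale and must not be used. */
size_t merge_adjacent_strings(struct token *toks, size_t n)
{
    size_t out = 0;   /* number of tokens kept so far */

    assert(toks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (out > 0 && toks[i].kind == TOK_STRING
                    && toks[out - 1].kind == TOK_STRING) {
            /* Append this literal to the previously kept string token. */
            size_t a = strlen(toks[out - 1].text);
            size_t b = strlen(toks[i].text);
            char *joined = realloc(toks[out - 1].text, a + b + 1);
            if (joined == NULL)
                return (size_t)-1;
            memcpy(joined + a, toks[i].text, b + 1);   /* copies the NUL too */
            toks[out - 1].text = joined;
            free(toks[i].text);
            toks[i].text = NULL;
        } else {
            toks[out++] = toks[i];   /* keep the token, compacting the array */
        }
    }
    return out;
}
```

Running this over the three string tokens from the cpptest.c example in the question would leave a single token holding `string 1string 2string 3`. The same loop carries over directly to a parser written in another language.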


叼着烟拽天下

#3 · 2020-03-17 09:28

The standard doesn't specify a preprocessor vs. a compiler; it just specifies the phases of translation you already noted. Traditionally, phases 1 through 4 were in the preprocessor, phases 5 through 7 in the compiler, and phase 8 in the linker, but none of that is required by the standard.


别忘想泡老子

#4 · 2020-03-17 09:28

Unless the preprocessor is specified to handle this, it's safe to assume it's the compiler's job.

Edit:

Your "I.e." link at the beginning of the post answers the question:

Adjacent string literals are concatenated at compile time; this allows long strings to be split over multiple lines, and also allows string literals resulting from C preprocessor defines and macros to be appended to strings at compile time...
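As a small illustration of the macro case mentioned in that quote, the sketch below builds one string from a macro that expands to a literal and from a stringified number. `APP_NAME`, `BUILD`, `STR`, `STR_`, `banner` and `get_banner` are hypothetical names chosen for the example, and `static_assert` assumes C11 or later.

```c
#include <assert.h>   /* static_assert (C11) */
#include <string.h>

#define APP_NAME "helloworld"   /* expands to a string literal */
#define BUILD 42                /* expands to a number */

/* Two-level stringification so that BUILD is expanded before # is applied. */
#define STR_(x) #x
#define STR(x) STR_(x)

/* After macro expansion this is "helloworld" " build " "42" "\n",
 * four adjacent literals that are then joined into one. */
const char banner[] = APP_NAME " build " STR(BUILD) "\n";

/* The joined size is known at translation time. */
static_assert(sizeof banner == sizeof "helloworld build 42\n",
              "macro-produced literals should be joined");

/* Returns the assembled banner text. */
const char *get_banner(void)
{
    return banner;
}
```

The preprocessor only has to produce the separate literals here; the joining can happen in any later stage.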


小情绪 Triste *

#5 · 2020-03-17 09:32

There are tricky rules for how string literal concatenation interacts with escape sequences. Suppose you have

const char x1[] = "a\15" "4";
const char y1[] = "a\154";
const char x2[] = "a\r4";
const char y2[] = "al";

then x1 and x2 must wind up equal according to strcmp, and the same for y1 and y2. (This is what Heath is getting at in quoting the translation steps - escape conversion happens before string constant concatenation.) There's also a requirement that if any of the string constants in a concatenation group has an L or U prefix, you get a wide or Unicode string. Put it all together and it winds up being significantly more convenient to do this work as part of the "compiler" rather than the "preprocessor."
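The four declarations above can be turned into a self-checking sketch. The function name `escapes_resolved_before_joining` is made up for illustration. The `strcmp` comparisons assume an ASCII-compatible execution character set, where octal 15 is carriage return and octal 154 is the letter l; the size checks do not depend on that assumption. `static_assert` needs C11 or later.

```c
#include <assert.h>   /* static_assert (C11) */
#include <string.h>

const char x1[] = "a\15" "4";   /* \15 ends at the literal boundary, then '4' follows */
const char y1[] = "a\154";      /* a single octal escape, \154 */
const char x2[] = "a\r4";
const char y2[] = "al";

/* x1 holds 'a', the \15 character, '4' and the terminator;
 * y1 holds 'a', the \154 character and the terminator. */
static_assert(sizeof x1 == 4, "x1 should hold three characters plus NUL");
static_assert(sizeof y1 == 3, "y1 should hold two characters plus NUL");

/* Returns 1 when each pair matches and the two pairs differ from each other. */
int escapes_resolved_before_joining(void)
{
    return strcmp(x1, x2) == 0
        && strcmp(y1, y2) == 0
        && strcmp(x1, y1) != 0;
}
```

A tool that pasted the raw source text of `"a\15"` and `"4"` together before decoding escapes would produce `"a\154"` instead, which is exactly the mistake the ordering of the phases rules out.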


够拽才男人

#6 · 2020-03-17 09:41

In the ANSI C standard, this detail is covered in section 5.1.1.2, item (6):

5.1.1.2 Translation phases
...

4. Preprocessing directives are executed and macro invocations are expanded. ...

5. Each source character set member and escape sequence in character constants and string literals is converted to a member of the execution character set.

6. Adjacent character string literal tokens are concatenated and adjacent wide string literal tokens are concatenated.

The standard does not define that the implementation must use a pre-processor and compiler, per se.

Step 4 is clearly a preprocessor responsibility.

Step 5 requires that the "execution character set" be known. This information is also required by the compiler. It is easier to port the compiler to a new platform if the preprocessor does not contain platform dependencies, so the tendency is to implement step 5, and thus step 6, in the compiler.
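The quoted phase 6 mentions character and wide string literals separately. As a hedged illustration of how the two interact, the sketch below assumes a C99 or later compiler, where a narrow literal next to a wide one yields a wide string (`static_assert` additionally needs C11). The names `mixed`, `plain` and `mixed_is_wide` are invented for the example.

```c
#include <assert.h>   /* static_assert (C11) */
#include <wchar.h>

/* A wide literal followed by a narrow one: the result is a wide string. */
const wchar_t mixed[] = L"wide " "narrow";

/* The same text written as a single wide literal. */
const wchar_t plain[] = L"wide narrow";

/* Both arrays have the same number of wchar_t elements. */
static_assert(sizeof mixed == sizeof plain,
              "mixed concatenation should produce a wide string");

/* Returns 1 when the two wide strings compare equal. */
int mixed_is_wide(void)
{
    return wcscmp(mixed, plain) == 0;
}
```

Getting this right requires knowing the width and encoding of `wchar_t` on the target, which is the kind of platform knowledge the answer suggests keeping in the compiler.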
