How does pointer comparison work in C? Is it ok to

Posted 2020-02-10 01:55

In K&R (The C Programming Language 2nd Edition) chapter 5 I read the following:

First, pointers may be compared under certain circumstances. If p and q point to members of the same array, then relations like ==, !=, <, >=, etc. work properly.

Which seems to imply that only pointers pointing to the same array can be compared.

However when I tried this code

    char t = 't';
    char *pt = &t;
    char x = 'x';
    char *px = &x;

    printf("%d\n", pt > px);

1 is printed to the screen.

First of all, I thought I would get undefined behavior or some type of error, because pt and px aren't pointing to the same array (at least in my understanding).

Also is pt > px because both pointers are pointing to variables stored on the stack, and the stack grows down, so the memory address of t is greater than that of x? Which is why pt > px is true?

I get more confused when malloc is brought in. Also in K&R in chapter 8.7 the following is written:

There is still one assumption, however, that pointers to different blocks returned by sbrk can be meaningfully compared. This is not guaranteed by the standard which permits pointer comparisons only within an array. Thus this version of malloc is portable only among machines for which the general pointer comparison is meaningful.

I had no issue comparing pointers that pointed to space malloced on the heap to pointers that pointed to stack variables.

For example, the following code worked fine, with 1 being printed:

    char t = 't';
    char *pt = &t;
    char *px = malloc(10);
    strcpy(px, pt);
    printf("%d\n", pt > px);

Based on my experiments with my compiler, I'm being led to think that any pointer can be compared with any other pointer, regardless of where they individually point. Moreover, I think pointer arithmetic between two pointers is fine, no matter where they individually point because the arithmetic is just using the memory addresses the pointers store.

Still, I am confused by what I am reading in K&R.

The reason I'm asking is because my prof. actually made it an exam question. He gave the following code:

    struct A {
        char *p0;
        char *p1;
    };

    int main(int argc, char **argv) {
        char a = 0;
        char *b = "W";
        char c[] = { 'L', 'O', 'L', 0 };

        struct A p[3];
        p[0].p0 = &a;
        p[1].p0 = b;
        p[2].p0 = c;

        for (int i = 0; i < 3; i++) {
            p[i].p1 = malloc(10);
            strcpy(p[i].p1, p[i].p0);
        }
    }

What do these evaluate to:

  1. p[0].p0 < p[0].p1
  2. p[1].p0 < p[1].p1
  3. p[2].p0 < p[2].p1

The answer is 0, 1, and 0.

(My professor does include the disclaimer on the exam that the questions are for a Ubuntu Linux 16.04, 64-bit version programming environment)

(editor's note: if SO allowed more tags, that last part would warrant a few more, if the point of the question / class was specifically low-level OS implementation details rather than portable C.)

7 answers
唯我独甜
#2 · 2020-02-10 02:34

What A Provocative Question!

Even cursory scanning of the responses and comments in this thread will reveal how emotive your seemingly simple and straight forward query turns out to be.

It should not be surprising.

Inarguably, misunderstandings around the concept and use of pointers represent a predominant cause of serious failures in programming in general.

Recognition of this reality is readily evident in the ubiquity of languages designed specifically to address, and preferably to avoid the challenges pointers introduce altogether. Think C++ and other derivatives of C, Java and its relations, Python and other scripts -- merely as the more prominent and prevalent ones, and more or less ordered in severity of dealing with the issue.

Developing a deeper understanding of the underlying principles must therefore be pertinent to everyone who aspires to excellence in programming -- especially at the systems level.

I imagine this is precisely what your teacher means to demonstrate.

And the nature of C makes it a convenient vehicle for this exploration. Less clearly than assembly -- though perhaps more readily comprehensible -- and still far more explicitly than languages based on deeper abstraction of the execution environment.

Designed to facilitate deterministic translation of the programmer’s intent into instructions that machines can comprehend, C is a system level language. While classified as high-level, it really belongs in a ‘medium’ category; but since none such exists, the ‘system’ designation has to suffice.

This characteristic is largely responsible for making it a language of choice for device drivers, operating system code, and embedded implementations. Furthermore, a deservedly favoured alternative in applications where optimal efficiency is paramount; where that means the difference between survival and extinction, and therefore is a necessity as opposed to a luxury. In such instances, the attractive convenience of portability loses all its allure, and opting for the lack-lustre performance of the least common denominator becomes an unthinkably detrimental option.

What makes C -- and some of its derivatives -- quite special, is that it allows its users complete control -- when that is what they desire -- without imposing the related responsibilities upon them when they do not. Nevertheless, it never offers more than the thinnest of insulations from the machine, wherefore proper use demands exacting comprehension of the concept of pointers.

In essence, the answer to your question is sublimely simple and satisfyingly sweet -- in confirmation of your suspicions. Provided, however, that one attaches the requisite significance to every concept in this statement:

  • The acts of examining, comparing and manipulating pointers are always and necessarily valid, while the conclusions derived from the result depend on the validity of the values contained, and thus need not be.

The former is both invariably safe and potentially proper, while the latter can only ever be proper when it has been established as safe. Surprisingly -- to some -- so establishing the validity of the latter depends on and demands the former.

Of course, part of the confusion arises from the effect of the recursion inherently present within the principle of a pointer -- and the challenges posed in differentiating content from address.

You have quite correctly surmised,

I'm being led to think that any pointer can be compared with any other pointer, regardless of where they individually point. Moreover, I think pointer arithmetic between two pointers is fine, no matter where they individually point because the arithmetic is just using the memory addresses the pointers store.

And several contributors have affirmed: pointers are just numbers. Sometimes something closer to complex numbers, but still no more than numbers.

The amusing acrimony in which this contention has been received here reveals more about human nature than programming, but remains worthy of note and elaboration. Perhaps we will do so later...

As one comment begins to hint; all this confusion and consternation derives from the need to discern what is valid from what is safe, but that is an oversimplification. We must also distinguish what is functional and what is reliable, what is practical and what may be proper, and further still: what is proper in a particular circumstance from what may be proper in a more general sense. Not to mention; the difference between conformity and propriety.

Toward that end, we first need to appreciate precisely what a pointer is.

  • You have demonstrated a firm grip on the concept, and like some others may find these illustrations patronizingly simplistic, but the level of confusion evident here demands such simplicity in clarification.

As several have pointed out: the term pointer is merely a special name for what is simply an index, and thus nothing more than any other number.

This should already be self-evident in consideration of the fact that all contemporary mainstream computers are binary machines that necessarily work exclusively with and on numbers. Quantum computing may change that, but that is highly unlikely, and it has not come of age.

Technically, as you have noted, pointers are more accurately addresses; an obvious insight that naturally introduces the rewarding analogy of correlating them with the ‘addresses’ of houses, or plots on a street.

  • In a flat memory model: the entire system memory is organized in a single, linear sequence: all houses in the city lie on the same road, and every house is uniquely identified by its number alone. Delightfully simple.

  • In segmented schemes: a hierarchical organization of numbered roads is introduced above that of numbered houses so that composite addresses are required.

    • Some implementations are still more convoluted, and the totality of distinct ‘roads’ need not sum to a contiguous sequence, but none of that changes anything about the underlying.
    • We are necessarily able to decompose every such hierarchical link back into a flat organization. The more complex the organization, the more hoops we will have to hop through in order to do so, but it must be possible. Indeed, this also applies to ‘real mode’ on x86.
    • Otherwise the mapping of links to locations would not be bijective, as reliable execution -- at the system level -- demands that it MUST be.
      • multiple addresses must not map to singular memory locations, and
      • singular addresses must never map to multiple memory locations.

Bringing us to the further twist that turns the conundrum into such a fascinatingly complicated tangle. Above, it was expedient to suggest that pointers are addresses, for the sake of simplicity and clarity. Of course, this is not correct. A pointer is not an address; a pointer is a reference to an address, it contains an address. Like the envelope sports a reference to the house. Contemplating this may lead you to glimpse what was meant with the suggestion of recursion contained in the concept. Still; we have only so many words, and talking about the addresses of references to addresses and such, soon stalls most brains at an invalid op-code exception. And for the most part, intent is readily garnered from context, so let us return to the street.

Postal workers in this imaginary city of ours are much like the ones we find in the ‘real’ world. No one is likely to suffer a stroke when you talk or enquire about an invalid address, but every last one will balk when you ask them to act on that information.

Suppose there are only 20 houses on our singular street. Further pretend that some misguided, or dyslexic soul has directed a letter, a very important one, to number 71. Now, we can ask our carrier Frank, whether there is such an address, and he will simply and calmly report: no. We can even expect him to estimate how far outside the street this location would lie if it did exist: roughly 2.5 times further than the end. None of this will cause him any exasperation. However, if we were to ask him to deliver this letter, or to pick up an item from that place, he is likely to be quite frank about his displeasure, and refusal to comply.

Pointers are just addresses, and addresses are just numbers.

Verify the output of the following:

void foo( void *p ) {
   printf("%p\t%zu\t%d\n", p, (size_t)p, p == (void *)(size_t)p);
}

Call it on as many pointers as you like, valid or not. Please do post your findings if it fails on your platform, or your (contemporary) compiler complains.

Now, because pointers are simply numbers, it is inevitably valid to compare them. In one sense this is precisely what your teacher is demonstrating. All of the following statements are perfectly valid -- and proper! -- C, and when compiled will run without encountering problems, even though neither pointer need be initialized and the values they contain therefore may be undefined:

  • We are only calculating result explicitly for the sake of clarity, and printing it to force the compiler to compute what would otherwise be redundant, dead code.
void foo( size_t *a, size_t *b ) {
   size_t result;
   result = (size_t)a;
   printf("%zu\n", result);
   result = a == b;
   printf("%zu\n", result);
   result = a < b;
   printf("%zu\n", result);
   result = a - b;
   printf("%zu\n", result);
}

Of course, the program is ill-formed when either a or b is undefined (read: not properly initialized) at the point of testing, but that is utterly irrelevant to this part of our discussion. These snippets, as too the following statements, are guaranteed -- by the ‘standard’ -- to compile and run flawlessly, notwithstanding the IN-validity of any pointer involved.

Problems only arise when an invalid pointer is dereferenced. When we ask Frank to pick up or deliver at the invalid, non-existent address.

Given any arbitrary pointer:

int *p;

While this statement must compile and run:

printf("%p", (void *)p);

... as must this:

size_t foo( int *p ) { return (size_t)p; }

... the following two, in stark contrast, will still readily compile, but fail in execution unless the pointer is valid -- by which we here merely mean that it references an address to which the present application has been granted access:

printf("%d", *p);
size_t foo( int *p ) { return *p; }

How subtle the change? The distinction lies in the difference between the value of the pointer -- which is the address, and the value of the contents: of the house at that number. No problem arises until the pointer is dereferenced; until an attempt is made to access the address it links to. In trying to deliver or pick up the package beyond the stretch of the road...

By extension, the same principle necessarily applies to more complex examples, including the aforementioned need to establish the requisite validity:

int* validate( int *p, int *head, int *tail ) { 
    return p >= head && p <= tail ? p : NULL; 
}

Relational comparison and arithmetic offer identical utility to testing equivalence, and are equivalently valid -- in principle. However, what the results of such computation would signify, is a different matter entirely -- and precisely the issue addressed by the quotations you included.

In C, an array is a contiguous buffer, an uninterrupted linear series of memory locations. Comparison and arithmetic applied to pointers that reference locations within such a singular series are naturally, and obviously meaningful in relation both to each other, and to this ‘array’ (which is simply identified by the base). Precisely the same applies to every block allocated through malloc, or sbrk. Because these relationships are implicit, the compiler is able to establish valid relationships between them, and therefore can be confident that calculations will provide the answers anticipated.

Performing similar gymnastics on pointers that reference distinct blocks or arrays does not offer any such inherent and apparent utility. The more so since whatever relation exists at one moment may be invalidated by a reallocation that follows, wherein that relation is highly likely to change, or even be inverted. In such instances the compiler is unable to obtain the necessary information to establish the confidence it had in the previous situation.

You, however, as the programmer, may have such knowledge! And in some instances are obliged to exploit that.

There ARE, therefore, circumstances in which EVEN THIS is entirely VALID and perfectly PROPER.

In fact, that is exactly what malloc itself has to do internally when time comes to try merging reclaimed blocks -- on the vast majority of architectures. The same is true for the operating system allocator, like that behind sbrk; if more obviously, frequently, on more disparate entities, more critically -- and relevant also on platforms where this malloc may not be. And how many of those are not written in C?

The validity, security and success of an action is inevitably the consequence of the level of insight upon which it is premised and applied.

In the quotes you have offered, Kernighan and Ritchie are addressing a closely related, but nonetheless separate issue. They are defining the limitations of the language, and explaining how you may exploit the capabilities of the compiler to protect you by at least detecting potentially erroneous constructs. They are describing the lengths the mechanism is able -- is designed -- to go to in order to assist you in your programming task. The compiler is your servant, you are the master. A wise master, however, is one that is intimately familiar with the capabilities of his various servants.

Within this context, undefined behaviour serves to indicate potential danger and the possibility of harm; not to imply imminent, irreversible doom, or the end of the world as we know it. It simply means that we -- ‘meaning the compiler’ -- are not able to make any conjecture about what this thing may be, or represent and for this reason we choose to wash our hands of the matter. We will not be held accountable for any misadventure that may result from the use, or mis-use of this facility.

In effect, it simply says: ‘Beyond this point, cowboy: you are on your own...’

Your professor is seeking to demonstrate the finer nuances to you.

Notice what great care they have taken in crafting their example; and how brittle it still is. By taking the address of a, in

p[0].p0 = &a;

the compiler is coerced into allocating actual storage for the variable, rather than placing it in a register. It being an automatic variable, however, the programmer has no control over where that is assigned, and so is unable to make any valid conjecture about what would follow it. Which is why a must be set equal to zero for the code to work as expected.

Merely changing this line:

char a = 0;

to this:

char a = 1;  // or ANY other value than 0

causes the behaviour of the program to become undefined. At minimum, the first answer will now be 1; but the problem is far more sinister.

Now the code is inviting of disaster.

While still perfectly valid and even conforming to the standard, it now is ill-formed and although sure to compile, may fail in execution on various grounds. For now there are multiple problems -- none of which the compiler is able to recognize.

strcpy will start at the address of a, and proceed beyond this to consume -- and transfer -- byte after byte, until it encounters a null.

The p1 pointer has been initialized to a block of exactly 10 bytes.

  • If a happens to be placed at the end of a block and the process has no access to what follows, the very next read -- of p0[1] -- will elicit a segfault. This scenario is unlikely on the x86 architecture, but possible.

  • If the area beyond the address of a is accessible, no read error will occur, but the program still is not saved from misfortune.

  • If a zero byte happens to occur within the ten starting at the address of a, it may still survive, for then strcpy will stop and at least we will not suffer a write violation.

  • If it is not faulted for reading amiss, but no zero byte occurs in this span of 10, strcpy will continue and attempt to write beyond the block allocated by malloc.

    • If this area is not owned by the process, the segfault should immediately be triggered.

    • The still more disastrous -- and subtle --- situation arises when the following block is owned by the process, for then the error cannot be detected, no signal can be raised, and so it may ‘appear’ still to ‘work’, while it actually will be overwriting other data, your allocator’s management structures, or even code (in certain operating environments).

This is why pointer related bugs can be so hard to track. Imagine these lines buried deep within thousands of lines of intricately related code, that someone else has written, and you are directed to delve through.

Nevertheless, the program must still compile, for it remains perfectly valid and standard conformant C.

These kinds of errors, no standard and no compiler can protect the unwary against. I imagine that is exactly what they are intending to teach you.

Paranoid people constantly seek to change the nature of C to dispose of these problematic possibilities and so save us from ourselves; but that is disingenuous. This is the responsibility we are obliged to accept when we choose to pursue the power and obtain the liberty that more direct and comprehensive control of the machine offers us. Promoters and pursuers of perfection in performance will never accept anything less.

Portability and the generality it represents is a fundamentally separate consideration and all that the standard seeks to address:

This document specifies the form and establishes the interpretation of programs expressed in the programming language C. Its purpose is to promote portability, reliability, maintainability, and efficient execution of C language programs on a variety of computing systems.

Which is why it is perfectly proper to keep it distinct from the definition and technical specification of the language itself. Contrary to what many seem to believe generality is antithetical to exceptional and exemplary.

To conclude:

  • Examining and manipulating pointers themselves is invariably valid and often fruitful. Interpretation of the results, may, or may not be meaningful, but calamity is never invited until the pointer is dereferenced; until an attempt is made to access the address linked to.

Were this not true, programming as we know it -- and love it -- would not have been possible.

chillily
#3 · 2020-02-10 02:37

It's simple: Comparing pointers does not make sense, as memory locations for objects are never guaranteed to be in the same order as you declared them. The exception is arrays. &array[0] is lower than &array[1]. That's what K&R points out. In practice struct member addresses are also in the order you declare them in my experience. No guarantees on that.... Another exception is if you compare pointers for equality. When one pointer is equal to another you know it's pointing to the same object. Whatever it is. Bad exam question if you ask me. Depending on Ubuntu Linux 16.04, 64-bit version programming environment for an exam question? Really?

老娘就宠你
#4 · 2020-02-10 02:38

Pointers are just integers, like everything else in a computer. You absolutely can compare them with < and > and produce results without causing a program to crash. That said, the standard does not guarantee that those results have any meaning outside of array comparisons.

In your example of stack allocated variables, the compiler is free to allocate those variables to registers or stack memory addresses, and in any order it chooses. Comparisons such as < and > therefore won't be consistent across compilers or architectures. However, == and != aren't so restricted; comparing pointers for equality is a valid and useful operation.

等我变得足够好
#5 · 2020-02-10 02:42

The primary issue with comparing pointers to two distinct arrays of the same type is that the arrays themselves need not be placed in a particular relative positioning--one could end up before or after the other.

First of all, I thought I would get undefined behavior or some type of error, because pt and px aren't pointing to the same array (at least in my understanding).

No, the result is dependent on implementation and other unpredictable factors.

Also is pt>px because both pointers are pointing to variables stored on the stack, and the stack grows down, so the memory address of t is greater than that of x? Which is why pt>px is true?

There isn't necessarily a stack. When it exists, it need not grow down. It could grow up. It could be non-contiguous in some bizarre way.

Moreover, I think pointer arithmetic between two pointers is fine, no matter where they individually point because the arithmetic is just using the memory addresses the pointers store.

Let's look at the C specification, §6.5.8 on page 85 which discusses relational operators (i.e. the comparison operators you're using). Note that this does not apply to direct != or == comparison.

When two pointers are compared, the result depends on the relative locations in the address space of the objects pointed to. ... If the objects pointed to are members of the same aggregate object, ... pointers to array elements with larger subscript values compare greater than pointers to elements of the same array with lower subscript values.

In all other cases, the behavior is undefined.

The last sentence is important. While I cut down some unrelated cases to save space, there's one case that's important to us: two arrays, not part of the same struct/aggregate object [1], and we're comparing pointers to those two arrays. This is undefined behavior.

While your compiler just inserted some sort of CMP (compare) machine instruction which numerically compares the pointers, and you got lucky here, UB is a pretty dangerous beast. Literally anything can happen--your compiler could optimize out the whole function including visible side effects. It could spawn nasal demons.

[1] Pointers into two different arrays that are part of the same struct can be compared, since this falls under the clause where the two arrays are part of the same aggregate object (the struct).

Juvenile、少年°
#6 · 2020-02-10 02:44

Then asked what

p[0].p0 < p[0].p1
p[1].p0 < p[1].p1
p[2].p0 < p[2].p1

Evaluate to. The answer is 0, 1, and 0.

These questions reduce to:

  1. Is the heap above or below the stack.
  2. Is the heap above or below the string literal section of the program.
  3. same as [1].

And the answer to all three is "implementation defined". Your prof's questions are bogus; they have based them on the traditional unix layout:

<empty>
text
rodata
rwdata
bss
< empty, used for heap >
...
stack
kernel

but several modern unices (and alternative systems) do not conform to those traditions. Unless they prefaced the question with "as of 1992", make sure to give a -1 on the eval.

\"骚年 ilove
#7 · 2020-02-10 02:51

On almost any remotely-modern platform, pointers and integers have an isomorphic ordering relation, and pointers to disjoint objects are not interleaved. Most compilers expose this ordering to programmers when optimizations are disabled, but the Standard makes no distinction between platforms that have such an ordering and those that don't and does not require that any implementations expose such an ordering to the programmer even on platforms that would define it. Consequently, some compiler writers perform various kinds of optimizations and "optimizations" based upon an assumption that code will never use relational operators on pointers to different objects.

According to the published Rationale, the authors of the Standard intended that implementations extend the language by specifying how they will behave in situations the Standard characterizes as "Undefined Behavior" (i.e. where the Standard imposes no requirements) when doing so would be useful and practical, but some compiler writers would rather assume programs will never try to benefit from anything beyond what the Standard mandates, than allow programs to usefully exploit behaviors the platforms could support at no extra cost.

I'm not aware of any commercially-designed compilers that do anything weird with pointer comparisons, but as compilers move to the non-commercial LLVM for their back end, they're increasingly likely to process nonsensically code whose behavior had been specified by earlier compilers for their platforms. Such behavior isn't limited to relational operators, but can even affect equality/inequality. For example, even though the Standard specifies that a comparison between a pointer to one object and a "just past" pointer to an immediately-preceding object will compare equal, gcc and LLVM-based compilers are prone to generate nonsensical code if programs perform such comparisons.

As an example of a situation where even equality comparison behaves nonsensically in gcc and clang, consider:

extern int x[],y[];
int test(int i)
{
    int *p = y+i;
    y[0] = 4;
    if (p == x+10)
        *p = 1;
    return y[0];
}

Both clang and gcc will generate code that will always return 4 even if x is ten elements, y immediately follows it, and i is zero resulting in the comparison being true and p[0] being written with the value 1. I think what happens is that one pass of optimization rewrites the function as though *p = 1; were replaced with x[10] = 1;. The latter code would be equivalent if the compiler interpreted *(x+10) as equivalent to *(y+i), but unfortunately a downstream optimization stage recognizes that an access to x[10] would only be defined if x had at least 11 elements, which would make it impossible for that access to affect y.

If compilers can get that "creative" with a pointer equality scenario that is described by the Standard, I would not trust them to refrain from getting even more creative in cases where the Standard doesn't impose requirements.
