Using arrays of character strings: arrays of point

Posted 2019-06-23 23:43

Question:

I've been reading C++ for Dummies lately, and either the title is a misnomer or they didn't count on me. In a section about using arrays of pointers with character strings, they show a function that has me completely stumped, and I don't know where to turn.

char* int2month(int nMonth)
{
//check to see if value is in range
if ((nMonth < 0) || (nMonth > 12))
    return "invalid";

//nMonth is valid - return the name of the month
char* pszMonths[] = {"invalid", "January", "February", "March", "April", "May", "June", 
                     "July", "August", "September", "October", "November", "December"};

return pszMonths[nMonth];
} 

First off (though this is not the main question), I don't understand why the return type is a pointer, or how you can return pszMonths without it going out of scope. I've read about it in this book and online, but I don't get it in this example.

The main question I have is "how does this work?!?!". I don't understand how you can create an array of pointers and actually initialize them. If I remember correctly you can't do this with numeric data types. Is each pointer in the "array of pointers" like an array itself, containing the individual characters which make up the words? This whole thing just boggles my mind.

August 20 - Since there seems to be some confusion among the people trying to help me as to where my confusion actually stems from, I'll try to explain it better. The section of code I am particularly concerned with is the following:

//nMonth is valid - return the name of the month
char* pszMonths[] = {"invalid", "January", "February", "March", "April", "May", "June", 
                 "July", "August", "September", "October", "November", "December"};

I thought that when you made a pointer you could only assign it to another predetermined value. I'm confused that what seems to be an array of pointers (going by the book here) initializes the month names. I did not think pointers could actually initialize values. Is the array dynamically allocating memory? Is "invalid" essentially equivalent to a "new char;" statement or something similar?

I'll try re-reading the posts in case they answered my questions but I just didn't understand the first time around.

Answer 1:

OK, let's take one line at a time.
 

char* int2month(int nMonth)

This line is most probably WRONG, because it says the function returns a pointer to a modifiable char (by convention this will be the first char element of an array). Instead it should say char const* or const char* as the result type. These two specifications mean exactly the same, namely a pointer to a char that you cannot modify.
 

{

This is just the opening brace of the function body. The function body ends at corresponding closing brace.
 

//check to see if value is in range

This is a comment. It is ignored by the compiler.
 

if ((nMonth < 0) || (nMonth > 12))
    return "invalid";

Here the return statement is executed if and only if the condition in the if holds. The purpose is to deal in a predictable way with an incorrect argument value. However, the checking is probably WRONG because it allows both values 0 and 12 as valid, which gives a total of 13 valid values, whereas a calendar year has only 12 months.

By the way, technically, the value specified in the return statement is an array of 8 const char elements, namely the 7 characters plus a null byte at the end. This array is implicitly converted to a pointer to its first element, which is called a type decay. This particular decay, from string literal to pointer to non-const char, is specially supported in C++98 and C++03 in order to be compatible with old C, but is invalid in the upcoming C++0x standard (since published as C++11).

The book should not teach such ugly things; use const for the result type.


 

//nMonth is valid - return the name of the month
char* pszMonths[] = {"invalid", "January", "February", "March", "April", "May", "June", 
                     "July", "August", "September", "October", "November", "December"};

This array initialization again involves that decay. It's an array of pointers. Each pointer is initialized with a string literal, which type-wise is an array, and decays to pointer.

By the way, the "psz" prefix is a monstrosity called Hungarian Notation. It was invented for C programming, supporting the help system in Microsoft's Programmer's Workbench. In modern programming it serves no useful purpose and just makes the simplest code read like gibberish. You really don't want to adopt that.
 

return pszMonths[nMonth];

This indexing has formal Undefined Behavior, also known affectionately as just "UB", if nMonth is the value 12, since there is no array element at index 12. In practice you'll get some gibberish result.

EDIT: Oh, I didn't notice that the author has placed the month name "invalid" at the front, which makes for 13 array elements, so indexing with 12 is fine after all. That's how to obscure code... I didn't notice it because it's very bad and unexpected; the checking for "invalid" is already done higher up in the function.


 

} 

And this is the closing brace of the function body.

Cheers & hth.,



Answer 2:

Perhaps a line-by-line explanation will help.

/* This function takes an int and returns the corresponding month
 0 returns invalid
 1 returns January
 2 returns February
 3 returns March
 ...
 12 returns December
*/
char* int2month(int nMonth)
{
// if nMonth is less than 0 or more than 12, it's an invalid number
if ((nMonth < 0) || (nMonth > 12))
    return "invalid";

// this line creates an array of char* (strings) and fills it with the names of the months
//
char* pszMonths[] = {"invalid",  // index 0
                     "January",  // index 1
                     "February", // index 2
                     "March",    // index 3
                     "April",    // index 4
                     "May",      // index 5
                     "June",     // index 6
                     "July",     // index 7
                     "August",   // index 8
                     "September",// index 9
                     "October",  // index 10
                     "November", // index 11
                     "December"  // index 12
                    };

// use nMonth to index the pszMonths array to return the appropriate month
// if nMonth is 1, returns January because pszMonths[1] is January
// if nMonth is 2, returns February because pszMonths[2] is February
// etc
return pszMonths[nMonth];
} 

The first thing to get out of the way, which you might not know, is that a string literal in your program (stuff with double quotes around it) is really of the char* type [1].

Second thing that you might not have realized is that indexing into an array of char*s (which is char* pszStrings[]) yields a char*, which is a string.

The reason why you can return something from local scope in this instance is because string literals are stored in the program at compile time and do not get destroyed. For instance, this is perfectly fine:

char* blah() { return "blah"; }

And it's almost like doing this [2]:

int blah() { return 5; }

Next, when you have an = {/* stuff */} after an array declaration, that's called an initializer list. If you leave off the size of the array like you're doing, the compiler figures out how big to make the array from the number of elements in the initializer list. So char* pszMonths[] means "an array of char*", and since you have "invalid", "January", "February", etc. in the initializer list and they are char*s [1], you're just initializing your array of char*s with some char*s. And you misremembered about not being able to do this with numeric types: you can do this with any type, numeric types and strings included.

[1] It's not really a char*, it's a char const[x], and you cannot modify that memory like you could through a char*, but that's not important to you right now.

[2] It's not really like that, but if it helps you to think of it that way, feel free until you get better at C++ and can handle the various subtleties without dying.



Answer 3:

What's your expectation on what int2month is supposed to do?

Do you have a mental model of what the memory looks like? Here's my pictorial representation of the memory, for example:

pszMonths =      [   .       ,     .   ,   .    , ...]
                     |             |       |
                     |             |       |
                     V             |       |   
                     "invalid"     |       V
                                   |    "February"
                                   V
                               "January"

pszMonths is an array, which you should already be familiar with. The array's elements are pointers, though. You have to follow the arrows down to their values, which in this case are strings. This kind of indirect representation is necessary: it's not easy to do this with a flat representation, because each month name has its own length.

It's very hard to tell where you're getting stuck without more discussion. You need to say more.

[Edit]

Ok, you've said a little more. It sounds like you need to know a little more about C's program model. When your program compiles, it reduces down to a code part, and a data part.

What's included in the data part? Things like string literals. Each string literal is laid out somewhere in memory. If your compiler is good, and if you use the same literal twice, your compiler won't have two copies, but will reuse them.

Here's a small program to demonstrate.

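#include <inttypes.h>
#include <stdio.h>
int main(void) {
  const char *name1 = "foo";
  const char *name2 = "foo";
  const char *name3 = "bar";

  /* Convert each pointer to uintptr_t so the address prints correctly
     on both 32-bit and 64-bit systems. */
  printf("The address of the string in the data segment is: %" PRIuPTR "\n", (uintptr_t) name1);
  printf("The address of the string in the data segment is: %" PRIuPTR "\n", (uintptr_t) name2);
  printf("The address of the string in the data segment is: %" PRIuPTR "\n", (uintptr_t) name3);
  return 0;
}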

Here's what things look like when I run this program:

$ ./a.out
The address of the string in the data segment is: 134513904
The address of the string in the data segment is: 134513904
The address of the string in the data segment is: 134513908

When you run a C program, the data part of your program, (as well as the code part of your program, of course), gets loaded into memory. Any pointer that refers to a location in data is good, as long as your program continues to run. A pointer to somewhere in the data is valid across function calls, in particular.

Look at the output more closely. name1 and name2 are pointers to the same place in data, because it's the same string literal. Your C compiler is often very good at keeping the data compact and unfragmented, which is why you can see that the bytes for "bar" are stored right up against the bytes for "foo".

(What we're seeing is a low-level detail, and potentially not always the case that the compiler will pack the string literals side-by-side: your compiler has freedom to put the representation of those strings pretty much anywhere. But it's cute to see that it's doing so here.)

As a related note, that's why it's OK for a C program to do something like this:

char* good_function() {
  char* msg = "ok";
  return msg;
}

but not ok to do something like this:

char* bad_function() {
  char msg[] = "uh oh";
  return msg;
}

These two functions have entirely different meanings!

  1. The first tells the compiler: "Store this string in the data segment. When you run this function, give me back the address into the data segment".
  2. The second, bad function here says "When you run this function: make a temporary variable on the stack with enough space to write 'uh oh'. Now pop off the temporary space and return an address into the stack... oh, wait, that address is not pointing anywhere good, it is..."


Answer 4:

In C, the strings are simply sequences of bytes stored in sequential memory locations, byte 0 marking the end of string. For example,

char *s = "abcd";

would result in compiler allocating 2 memory locations: one five bytes long (abcd plus the terminating 0) and one large enough to hold the address of the first one (s). The second location is a pointer variable, the first one is what it points to.

For an array of strings, the compiler again allocates two memory locations. For

char *strings[] = {"abc", "def"};

strings will have two pointers in it, and the other locations will have bytes abc\0def\0. Then the first pointer points at a and second at d.



Answer 5:

This code does not return pszMonths, but it returns one of the pointers contained in pszMonths. These point to string literals, which remain valid even when going out of scope.

One part of this code that is confusing is that it returns a char* rather than a char const*. This means that it is easy to accidentally modify the strings. Attempting to do so would result in undefined behaviour.

Typically string literals are implemented by placing the strings in the data section of the executable. This means that pointers to them always remain valid. When the code in int2month is executed, pszMonths is filled up with pointers, but the underlying data is sitting elsewhere in the executable.

As I said earlier, this code is very unsafe, and doesn't deserve to be enshrined by being published in a book. String literals can be bound to char*, but they are actually made up of char consts. This makes it very easy to accidentally attempt to modify them, which will actually result in undefined behaviour. The only reason that this behaviour exists is to maintain compatibility with C, and it should never be used in new code.



Answer 6:

First of all, let's assume char* can be replaced with string.

So:

string int2month(int nMonth)
{ /* ... */ }

You return a pointer to char because you can't return an array of chars in C or C++.


In this line:

return "invalid";

"invalid" lives in the program's memory. This means it's always there for you. (But it's undefined behaviour if you try to change it directly without using strcpy() to copy it first! [1])


Imagine this:

char* szInvalid = "invalid";
char* szJanuary = "January";
char* szFebruary = "February";

string szMarch = "March";

char* pszMonths[] = {szInvalid, szJanuary, szFebruary, szMarch};

Do you see why it's an array of char*s?


[1] If you do this:

char* szFoo = "invalid";
szFoo[0] = '!'; szFoo[1] = '?';

char* szBar = "invalid"; // This *might* happen: szBar == "!?valid"