Is using an unsigned rather than signed int more l

2019-03-08 00:17发布

站内文章 / C++

76 0

啃猪蹄的小仙女

女 | 书童

私信

可以将文章内容翻译成中文,广告屏蔽插件可能会导致该功能失效(如失效，请关闭广告屏蔽插件后再试):

问题:

In the Google C++ Style Guide, on the topic of "Unsigned Integers", it is suggested that

Because of historical accident, the C++ standard also uses unsigned integers to represent the size of containers - many members of the standards body believe this to be a mistake, but it is effectively impossible to fix at this point. The fact that unsigned arithmetic doesn't model the behavior of a simple integer, but is instead defined by the standard to model modular arithmetic (wrapping around on overflow/underflow), means that a significant class of bugs cannot be diagnosed by the compiler.

What is wrong with modular arithmetic? Isn't that the expected behaviour of an unsigned int?

What kind of bugs (a significant class) does the guide refer to? Overflowing bugs?

Do not use an unsigned type merely to assert that a variable is non-negative.

One reason that I can think of using signed int over unsigned int, is that if it does overflow (to negative), it is easier to detect.

回答1:

Some of the answers here mention the surprising promotion rules between signed and unsigned values, but this seems more like a problem relating to mixing signed and unsigned values, and doesn't necessarily explain why signed is preferred over unsigned, outside of mixing scenarios.

In my experience, outside of mixed comparisons and promotion rules, there are two primary reasons why unsigned values are big bug magnets.

Unsigned values have a discontinuity at zero, the most common value in programming

Both unsigned and signed integers have a discontinuities at their minimum and maximum values, where they wrap around (unsigned) or cause undefined behavior (signed). For unsigned these points are at zero and UINT_MAX. For int they are at INT_MIN and INT_MAX. Typical values of INT_MIN and INT_MAX on system with 4-byte int values are -2^31 and 2^31-1, and on such a system UINT_MAX is typically 2^32-1.

The primary bug-inducing problem with unsigned that doesn't apply to int is that it has a discontinuity at zero. Zero, of course, is a very common value in programs, along with other small values like 1,2,3. It is common to add and subtract small values, especially 1, in various constructs, and if you subtract anything from an unsigned value and it happens to be zero, you just got a massive positive value and an almost certain bug.

Consider code iterates over all values in a vector by index except the last^0.5:

for (size_t i = 0; i < v.size() - 1; i++) { // do something }

This works fine until one day you pass in an empty vector. Instead of doing zero iterations, you get v.size() - 1 == a giant number¹ and you'll do 4 billion iterations and almost have a buffer overflow vulnerability.

You need to write it like this:

for (size_t i = 0; i + 1 < v.size(); i++) { // do something }

So it can be "fixed" in this case, but only by carefully thinking about the unsigned nature of size_t. Sometimes you can't apply the fix above because instead of a constant one you have some variable offset you want to apply, which may be positive or negative: so which "side" of the comparison you need to put it on depends on the signedness - now the code gets really messy.

There is a similar issue with code that tries to iterate down to and including zero. Something like while (index-- > 0) works fine, but the apparently equivalent while (--index >= 0) will never terminate for an unsigned value. Your compiler might warn you when the right hand side is literal zero, but certainly not if it is a value determined at runtime.

Counterpoint

Some might argue that signed values also have two discontinuities, so why pick on unsigned? The difference is that both discontinuities are very (maximally) far away from zero. I really consider this a separate problem of "overflow", both signed and unsigned values may overflow at very large values. In many cases overflow is impossible due to constraints on the possible range of the values, and overflow of many 64-bit values may be physically impossible). Even if possible, the chance of an overflow related bug is often minuscule compared to an "at zero" bug, and overflow occurs for unsigned values too. So unsigned combines the worst of both worlds: potentially overflow with very large magnitude values, and a discontinuity at zero. Signed only has the former.

Many will argue "you lose a bit" with unsigned. This is often true - but not always (if you need to represent differences between unsigned values you'll lose that bit anyways: so many 32-bit things are limited to 2 GiB anyways, or you'll have a weird grey area where say a file can be 4 GiB, but you can't use certain APIs on the second 2 GiB half).

Even in the cases where unsigned buys you a bit: it doesn't buy you much: if you had to support more than 2 billion "things", you'll probably soon have to support more than 4 billion.

Logically, unsigned values are a subset of signed values

Mathematically, unsigned values (non-negative integers) are a subset of signed integers (just called _integers).². Yet signed values naturally pop out of operations solely on unsigned values, such as subtraction. We might say that unsigned values aren't closed under subtraction. The same isn't true of signed values.

Want to find the "delta" between two unsigned indexes into a file? Well you better do the subtraction in the right order, or else you'll get the wrong answer. Of course, you often need a runtime check to determine the right order! When dealing with unsigned values as numbers, you'll often find that (logically) signed values keep appearing anyways, so you might as well start of with signed.

Counterpoint

As mentioned in footnote (2) above, signed values in C++ aren't actually a subset of unsigned values of the same size, so unsigned values can represent the same number of results that signed values can.

True, but the range is less useful. Consider subtraction, and unsigned numbers with a range of 0 to 2N, and signed numbers with a range of -N to N. Arbitrary subtractions result in results in the range -2N to 2N in _both cases, and either type of integer can only represent half of it. Well it turns out that the region centered around zero of -N to N is usually way more useful (contains more actual results in real world code) than the range 0 to 2N. Consider any of typical distribution other than uniform (log, zipfian, normal, whatever) and consider subtracting randomly selected values from that distribution: way more values end up in [-N, N] than [0, 2N] (indeed, resulting distribution is always centered at zero).

64-bit closes the door on many of the reasons to use signed values as numbers

I think the arguments above were already compelling for 32-bit values, but the overflow cases, which affect both signed and unsigned at different thresholds, do occur for 32-bit values, since "2 billion" is a number that can exceeded by many abstract and physical quantities (billions of dollars, billions of nanoseconds, arrays with billions of elements). So if someone is convinced enough by the doubling of the positive range for unsigned values, they can make the case that overflow does matter and it slightly favors unsigned.

Outside of specialized domains 64-bit values largely remove this concern. Signed 64-bit values have an upper range of 9,223,372,036,854,775,807 - more than nine quintillion. That's a lot of nanoseconds (about 292 years worth), and a lot of money. It's also a larger array than any computer is likely to have RAM in a coherent address space for a long time. So maybe 9 quintillion is enough for everybody (for now)?

When to use unsigned values

Note that the style guide doesn't forbid or even necessarily discourage use of unsigned numbers. It concludes with:

Do not use an unsigned type merely to assert that a variable is non-negative.

Indeed, there are good uses for unsigned variables:

When you want to treat an N-bit quantity not as an integer, but simply a "bag of bits". For example, as a bitmask or bitmap, or N boolean values or whatever. This use often goes hand-in-hand with the fixed width types like uint32_t and uint64_t since you often want to know the exact size of the variable. A hint that a particular variable deserves this treatment is that you only operate on it with with the bitwise operators such as ~, |, &, ^, >> and so on, and not with the arithmetic operations such as +, -, *, / etc.

Unsigned is ideal here because the behavior of the bitwise operators is well-defined and standardized. Signed values have several problems, such as undefined and unspecified behavior when shifting, and an unspecified representation.
When you actually want modular arithmetic. Sometimes you actually want 2^N modular arithmetic. In these cases "overflow" is a feature, not a bug. Unsigned values give you what you want here since they are defined to use modular arithmetic. Signed values cannot be (easily, efficiently) used at all since they have an unspecified representation and overflow is undefined.

^0.5 After I wrote this I realized this is nearly identical to Jarod's example, which I hadn't seen - and for good reason, it's a good example!

¹ We're talking about size_t here so usually 2^32-1 on a 32-bit system or 2^64-1 on a 64-bit one.

² In C++ this isn't exactly the case because unsigned values contain more values at the upper end than the corresponding signed type, but the basic problem exists that manipulating unsigned values can result in (logically) signed values, but there is no corresponding issue with signed values (since signed values already include unsigned values).

回答2:

As stated, mixing unsigned and signed might lead to unexpected behaviour (even if well defined).

Suppose you want to iterate over all elements of vector except for the last five, you might wrongly write:

for (int i = 0; i < v.size() - 5; ++i) { foo(v[i]); } // Incorrect
// for (int i = 0; i + 5 < v.size(); ++i) { foo(v[i]); } // Correct

Suppose v.size() < 5, then, as v.size() is unsigned, s.size() - 5 would be a very large number, and so i < v.size() - 5 would be true for a more expected range of value of i. And UB then happens quickly (out of bound access once i >= v.size())

If v.size() would have return signed value, then s.size() - 5 would have been negative, and in above case, condition would be false immediately.

On the other side, index should be between [0; v.size()[ so unsigned makes sense. Signed has also its own issue as UB with overflow or implementation-defined behaviour for right shift of a negative signed number, but less frequent source of bug for iteration.

回答3:

One of the most hair-raising examples of an error is when you MIX signed and unsigned values:

#include <iostream>
int main()  {
    auto qualifier = -1 < 1u ? "makes" : "does not make";
    std::cout << "The world " << qualifier << " sense" << std::endl;
}

The output:

The world does not make sense

Unless you have a trivial application, it's inevitable you'll end up with either dangerous mixes between signed and unsigned values (resulting in runtime errors) or if you crank up warnings and make them compile-time errors, you end up with a lot of static_casts in your code. That's why it's best to strictly use signed integers for types for math or logical comparison. Only use unsigned for bitmasks and types representing bits.

Modeling a type to be unsigned based on the expected domain of the values of your numbers is a Bad Idea. Most numbers are closer to 0 than they are to 2 billion, so with unsigned types, a lot of your values are closer to the edge of the valid range. To make things worse, the final value may be in a known positive range, but while evaluating expressions, intermediate values may underflow and if they are used in intermediate form may be VERY wrong values. Finally, even if your values are expected to always be positive, that doesn't mean that they won't interact with other variables that can be negative, and so you end up with a forced situation of mixing signed and unsigned types, which is the worst place to be.

回答4:

Why is using an unsigned int more likely to cause bugs than using a signed int?

Using an unsigned type is not more likely to cause bugs than using a signed type with certain classes of tasks.

Use the right tool for the job.

What is wrong with modular arithmetic? Isn't that the expected behaviour of an unsigned int?
Why is using an unsigned int more likely to cause bugs than using a signed int?

If the task if well-matched: nothing wrong. No, not more likely.

Security, encryption, and authentication algorithm count on unsigned modular math.

Compression/decompression algorithms too as well as various graphic formats benefit and are less buggy with unsigned math.

Any time bit-wise operators and shifts are used, the unsigned operations do not get messed up with the sign-extension issues of signed math.

Signed integer math has an intuitive look and feel readily understood by all including learners to coding. C/C++ was not targeted originally nor now should be an intro-language. For rapid coding that employs safety nets concerning overflow, other languages are better suited. For lean fast code, C assumes that coders knows what they are doing (they are experienced).

A pitfall of signed math today is the ubiquitous 32-bit int that with so many problems is well wide enough for the common tasks without range checking. This leads to complacency that overflow is not coded against. Instead, for (int i=0; i < n; i++) int len = strlen(s); is viewed as OK because n is assumed < INT_MAX and strings will never be too long, rather than being full ranged protected in the first case or using size_t, unsigned or even long long in the 2nd.

C/C++ developed in an era that included 16-bit as well as 32-bit int and the extra bit an unsigned 16-bit size_t affords was significant. Attention was needed in regard to overflow issues be it int or unsigned.

With 32-bit (or wider) applications of Google on non-16 bit int/unsigned platforms, affords the lack of attention to +/- overflow of int given its ample range. This makes sense for such applications to encourage int over unsigned. Yet int math is not well protected.

The narrow 16-bit int/unsigned concerns apply today with select embedded applications.

Google's guidelines apply well for code they write today. It is not a definitive guideline for the larger wide scope range of C/C++ code.

One reason that I can think of using signed int over unsigned int, is that if it does overflow (to negative), it is easier to detect.

In C/C++, signed int math overflow is undefined behavior and so not certainly easier to detect than defined behavior of unsigned math.

As @Chris Uzdavinis well commented, mixing signed and unsigned is best avoided by all (especially beginners) and otherwise coded carefully when needed.

回答5:

I have some experience with Google's style guide, AKA the Hitchhiker's Guide to Insane Directives from Bad Programmers Who Got into the Company a Long Long Time Ago. This particular guideline is just one example of the dozens of nutty rules in that book.

Errors only occur with unsigned types if you try to do arithmetic with them (see Chris Uzdavinis example above), in other words if you use them as numbers. Unsigned types are not intended to be used to store numeric quantities, they are intended to store counts such as the size of containers, which can never be negative, and they can and should be used for that purpose.

The idea of using arithmetical types (like signed integers) to store container sizes is idiotic. Would you use a double to store the size of a list, too? That there are people at Google storing container sizes using arithmetical types and requiring others to do the same thing says something about the company. One thing I notice about such dictates is that the dumber they are, the more they need to be strict do-it-or-you-are-fired rules because otherwise people with common sense would ignore the rule.

回答6:

Using unsigned types to represent non-negative values...

is more likely to cause bugs involving type promotion, when using signed and unsigned values, as other answer demonstrate and discuss in depth, but
is less likely to cause bugs involving choice of types with domains capable of representing undersirable/disallowed values. In some places you'll assume the value is in the domain, and may get unexpected and potentially hazardous behavior when other value sneak in somehow.

The Google Coding Guidelines puts emphasis on the first kind of consideration. Other guideline sets, such as the C++ Core Guidelines, put more emphasis on the second point. For example, consider Core Guideline I.12:

I.12: Declare a pointer that must not be null as not_null

Reason

To help avoid dereferencing nullptr errors. To improve performance by avoiding redundant checks for nullptr.

Example
int length(const char* p);            // it is not clear whether length(nullptr) is valid
length(nullptr);                      // OK?
int length(not_null<const char*> p);  // better: we can assume that p cannot be nullptr
int length(const char* p);            // we must assume that p can be nullptr
By stating the intent in source, implementers and tools can provide better diagnostics, such as finding some classes of errors through static analysis, and perform optimizations, such as removing branches and null tests.

Of course, you could argue for a non_negative wrapper for integers, which avoids both categories of errors, but that would have its own issues...