I read the memcpy implementation in the http://androidxref.com/4.0.4/xref/bionic/libc/string/bcopy.c find following code is hard to understand, can anybody explain it?
36 /*
37 * sizeof(word) MUST BE A POWER OF TWO
38 * SO THAT wmask BELOW IS ALL ONES
39 */
40 typedef long word; /* "word" used for optimal copy speed */
41
42 #define wsize sizeof(word)
43 #define wmask (wsize - 1)
44
...
/*
78 * Copy forward.
79 */
80 t = (long)src; /* only need low bits */
81 if ((t | (long)dst) & wmask) {
82 /*
83 * Try to align operands. This cannot be done
84 * unless the low bits match.
85 */
86 if ((t ^ (long)dst) & wmask || length < wsize)
87 t = length;
88 else
89 t = wsize - (t & wmask);
What is mean of these bitwise operations? What is their intent?
Just do it step by step:
These check if at least one of
src
anddst
isn't a multiple ofsizeof(long)
.This checks if
src
anddst
are aligned differently w.r.t.sizeof(long)
(IOW, aren't "equal" multiples ofsizeof(long)
) or iflength < sizeof(long)-1
.At the end you receive in
t
how many bytes have to be copied as bytes between unaligned locations, either all (length
), or just enough (less thansizeof(long)
) to reach addresses that are multiple ofsizeof(long)
from where the rest can be copied in units oflong
. And the latter is the speed optimization.To see all that you have to know that an integer, when expressed in binary, is a multiple of a certain power of 2 when its least significant bits below that power of 2 are all zeroes.
Examples:
1002 (410) is a multiple of 1002 (410)
11002 (1210) is a multiple of 1002 (410)
100002 (1610) is a multiple of 1002 (410)
02 (010) is a multiple of 1002 (410)
112 (310) is not a multiple of 1002 (410)
11012 (1310) is not a multiple of 1002 (410)
This is what
& (sizeof(long)-1)
is used for.You also need to know that a value
XORed
with itself gives 0 and when youXOR
different values the result is non-zero. So you can useXOR
for comparison. And that's what(t ^ (long)dst)
is for.The basic idea is to meet alignment constraints: each "word" to be copied one word at a time must be aligned on a "word" boundary.
Some CPUs have this as a fundamental constraint, that loads and stores must occur on a "natural" boundary. On older ARM processors the low bits of the address are actually ignored entirely, so that a load or store of two bytes from an odd-valued address have the same effect as:
for instance. On some other CPUs an unaligned load or store results in a trap (MIPS and SPARC for instance), and still others will just do it, but with a performance penalty (x86).
So, suppose you're copying a large number of bytes (say, 4096 of them) from address 0x12345 to address 0x22345, and that the "word size" is 4 bytes. If we first copy three bytes, the addresses will now be 0x12348 and 0x22348. At this point we can copy just 1023 4-byte words, one word at a time, without tripping over any alignment issues. After that we'll have one remaining byte to copy, because 4096 = 3 + (4 * 1023) + 1.
This all makes the assumption that bytes are all addressed individually, even when loading and storing "words". This assumption is false on some machines: for instance, the old Data General MV10000 CPU would address "words" using "word addresses", which are essentially byte addresses divided by two. (It's thus impossible to address a "word" that spans two bytes: the word at location 0 has byte addresses 0 and 1 but word address 0; the word at location 1 has byte addresses 2 and 3; the word at location 2 has byte addresses 4 and 5; and so on.) On machines like this, you may need to use a different version of bcopy.c.
As @Alex noted, the XOR is just making sure that it is actually possible to align the two addresses. If you're copying from 0x12345 to 0x22345, it is; but if you're copying from 0x12345 to 0x22344, the two addresses will never line up.