Edit: For reference purposes (if anyone stumbles across this question), Igor Ostrovsky wrote a great post about cache misses. It discusses several different issues and shows example numbers. End Edit
I did some testing <long story goes here> and am wondering whether a performance difference is due to memory cache misses. The code below boils the issue down to the critical timing portion: two loops that visit the same memory, first in ascending address order and then in random order.
I ran it on an XP machine (compiled with VS2005: cl /O2) and on a Linux box (gcc -Os). Both produced similar times. The times below are in milliseconds. I believe all loops are running and are not optimized out (otherwise it would run “instantly”).
*** Testing 20000 nodes
Total Ordered Time: 888.822899
Total Random Time: 2155.846268
Do these numbers make sense? Is the difference primarily due to L1 cache misses, or is something else going on as well? Each timed loop makes 20,000^2 = 4*10^8 memory accesses, and the random loop takes about 1267 ms longer than the ordered one, so if every one of those accesses were a cache miss, that is about 3.2 nanoseconds per miss. The XP (P4) machine I tested on is 3.2GHz and I suspect (but don’t know) has a 32KB L1 cache and 512KB L2. With 20,000 entries (80KB), I assume there is not a significant number of L2 misses. So this would be (3.2*10^9 cycles/second) * (3.17*10^-9 seconds/miss) = 10.1 cycles/miss. That seems high to me. Maybe it’s not, or maybe my math is bad. I tried measuring cache misses with VTune, but I got a BSOD. And now I can’t get it to connect to the license server (grrrr).
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <assert.h>

typedef struct stItem
{
    long lData;
    //char acPad[20];
} LIST_NODE;

#if defined( WIN32 )
#include <windows.h>

void StartTimer( LONGLONG *pt1 )
{
    QueryPerformanceCounter( (LARGE_INTEGER*)pt1 );
}

void StopTimer( LONGLONG t1, double *pdMS )
{
    LONGLONG t2, llFreq;
    QueryPerformanceCounter( (LARGE_INTEGER*)&t2 );
    QueryPerformanceFrequency( (LARGE_INTEGER*)&llFreq );
    *pdMS = ((double)( t2 - t1 ) / (double)llFreq) * 1000.0;
}
#else
// doesn't need 64-bit integer in this case, but LONGLONG must exist
// so both branches share the same signatures
typedef long long LONGLONG;

void StartTimer( LONGLONG *pt1 )
{
    // Just use clock(), this test doesn't need higher resolution
    *pt1 = clock();
}

void StopTimer( LONGLONG t1, double *pdMS )
{
    LONGLONG t2 = clock();
    // convert in floating point so the result does not depend on
    // CLOCKS_PER_SEC being a multiple of 1000
    *pdMS = (double)( t2 - t1 ) * 1000.0 / (double)CLOCKS_PER_SEC;
}
#endif

long longrand()
{
#if defined( WIN32 )
    // Stupid cheesy way to make sure it is not just a 16-bit rand value
    return ( rand() << 16 ) | rand();
#else
    return rand();
#endif
}

// get random value in the given range
int randint( int m, int n )
{
    int ret = longrand() % ( n - m + 1 );
    return ret + m;
}

// I think I got this out of Programming Pearls (Bentley).
void ShuffleArray
(
    long *plShuffle, // (O) return array of "randomly" ordered integers
    long lNumItems   // (I) length of array
)
{
    long i;
    long j;
    long t;

    for ( i = 0; i < lNumItems; i++ )
        plShuffle[i] = i;

    for ( i = 0; i < lNumItems; i++ )
    {
        j = randint( i, lNumItems - 1 );
        t = plShuffle[i];
        plShuffle[i] = plShuffle[j];
        plShuffle[j] = t;
    }
}

int main( int argc, char* argv[] )
{
    long *plDataValues;
    LIST_NODE *pstNodes;
    long lNumItems = 20000;
    long i, j;
    LONGLONG t1; // for timing
    double dms;

    if ( argc > 1 && atoi(argv[1]) > 0 )
        lNumItems = atoi( argv[1] );

    // lNumItems is a long, so print it with %ld rather than %u
    printf( "\n\n*** Testing %ld nodes\n", lNumItems );
    srand( (unsigned int)time( 0 ));

    // allocate the nodes as one single chunk of memory
    pstNodes = (LIST_NODE*)malloc( lNumItems * sizeof( LIST_NODE ));
    assert( pstNodes != NULL );

    // Create an array that gives the access order for the nodes
    plDataValues = (long*)malloc( lNumItems * sizeof( long ));
    assert( plDataValues != NULL );

    // Access the data in order
    for ( i = 0; i < lNumItems; i++ )
        plDataValues[i] = i;

    StartTimer( &t1 );
    // Loop through and access the memory a bunch of times
    for ( j = 0; j < lNumItems; j++ )
    {
        for ( i = 0; i < lNumItems; i++ )
        {
            pstNodes[plDataValues[i]].lData = i * j;
        }
    }
    StopTimer( t1, &dms );
    printf( "Total Ordered Time: %f\n", dms );

    // now access the array positions in a "random" order
    ShuffleArray( plDataValues, lNumItems );

    StartTimer( &t1 );
    for ( j = 0; j < lNumItems; j++ )
    {
        for ( i = 0; i < lNumItems; i++ )
        {
            pstNodes[plDataValues[i]].lData = i * j;
        }
    }
    StopTimer( t1, &dms );
    printf( "Total Random Time: %f\n", dms );

    free( plDataValues );
    free( pstNodes );
    return 0;
}

3.2ns for an L1 cache miss is entirely plausible. For comparison, on one particular modern multicore PowerPC CPU, an L1 miss is about 40 cycles -- a little longer for some cores than others, depending on how far they are from the L2 cache (yes really). An L2 miss is at least 600 cycles.
Cache is everything in performance; CPUs are so much faster than memory now that you're really almost optimizing for the memory bus instead of the core.
It's difficult to say anything for sure without a lot more testing, but in my experience that scale of difference definitely can be attributed to the CPU L1 and/or L2 cache, especially in a scenario with randomized access. You could probably make it even worse by ensuring that each access is at least some minimum distance from the last.