Sunday, April 18, 2010

Load-Hit-Store

90% of the time is spent in 10% of the code, so make that 10% the fastest code it can be.


Load-Hit-Store (LHS) is one of those quirky CPU implementation details that can cause significant performance problems in high-level code. It happens when the code writes data to an address 'x' and then tries to load the data from 'x' again too soon.

Normally a memory read (LOAD) and a memory write (STORE) are hidden away in the stages of the pipeline, so these operations cause no stalls. However, if the memory location being read was recently written to by a previous store, the load "hits" a store that is still in flight (HIT) and has to wait for it. It can take as many as 40 cycles before the store completes and the load can proceed.

stfs fr3, 0(r3) // Store the float - takes up to 40 cycles
lwz r9, 0(r3) // Load the word at address r3 into r9
add r9, r1, r9 // Stall: r9 is used before the store operation has finished

There are different ways to generate LHS:

Using member variables, references or pointers as counters in tight loops

Example A:
for( int i = 0; i < 100; i++ )
{
    m_iData++; // As a member variable, it is stored in memory
}

//-----------------------------------------------------------
Example B:
void foo( int & count ) // the variable count is memory bound
{
    for( int i = 0; i < 100; i++ )
    {
        count++; // As a reference, it is read from and written to memory
    }
}

Solution: work on local variables, which the compiler can keep in registers at no penalty, and write the result back once at the end

Example A:

int iData = m_iData;
for( int i = 0; i < 100; i++ )
{
    iData++; // The local variable is stored in a register
}
m_iData = iData;

//-----------------------------------------------------------
Example B:
void foo( int & output )
{
    int count = output;
    for( int i = 0; i < 100; i++ )
    {
        count++; // The local variable is stored in a register
    }
    output = count;
}
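
The same pattern matters most when the loop also reads through a pointer, because the compiler then cannot prove that the pointer does not alias the member. The following is a minimal sketch under that assumption; the class Counter and its methods AddSlow and AddFast are hypothetical names used only for illustration, and the two versions are only equivalent if pValues does not point at m_iData.

```cpp
// Hypothetical class, used only to illustrate the pattern.
class Counter
{
public:
    explicit Counter( int iStart ) : m_iData( iStart ) {}

    // Accumulates straight into the member. pValues might alias m_iData,
    // so the compiler may have to store and reload it on every iteration.
    void AddSlow( const int* pValues, int iCount )
    {
        for( int i = 0; i < iCount; i++ )
        {
            m_iData += pValues[i];
        }
    }

    // Accumulates into a local copy and writes the member back once.
    void AddFast( const int* pValues, int iCount )
    {
        int iData = m_iData;
        for( int i = 0; i < iCount; i++ )
        {
            iData += pValues[i]; // The local can stay in a register
        }
        m_iData = iData;
    }

    int Get() const { return m_iData; }

private:
    int m_iData;
};
```

Whether AddSlow really stalls depends on the compiler and the target CPU, so it is worth checking the generated assembly before rewriting a loop this way.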

Conversion between int and float

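Try to avoid int to float conversions like the one below. On CPUs such as the PowerPC in the example above, a value cannot be moved directly between the integer and floating-point registers, so it is stored to memory and loaded straight back, which is a Load-Hit-Store: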

float fAngle = (float)i * fAngleDelta;

Solution: when a value is needed both as int and as float, it is better to keep duplicated int and float members.

struct ScreenSize
{
    int m_iWidth;
    int m_iHeight;
    float m_fWidth;
    float m_fHeight;
    // Update both, int and float
    inline void SetWidth( int iWidth )
    {
        m_iWidth = iWidth;
        m_fWidth = static_cast<float>( iWidth );
    }
};
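
The same idea can be applied to the loop counter in the fAngle example: keep a float copy of the counter and increment it alongside the int one, so no conversion happens inside the loop. This is only a sketch; BuildAngles is a hypothetical helper, and it assumes iCount stays well below 2^24, the range in which a float still represents every integer exactly.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns { 0 * delta, 1 * delta, 2 * delta, ... }.
std::vector<float> BuildAngles( int iCount, float fAngleDelta )
{
    std::vector<float> angles;
    if( iCount <= 0 )
    {
        return angles;
    }
    angles.reserve( static_cast<std::size_t>( iCount ) );

    float fIndex = 0.0f; // Float mirror of i, so (float)i is never needed
    for( int i = 0; i < iCount; i++ )
    {
        angles.push_back( fIndex * fAngleDelta );
        fIndex += 1.0f;
    }
    return angles;
}
```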

C++ constructors that can be called with just one argument also
act as implicit type conversions. If you pass an int where an
object built from a float is expected, the compiler will silently
add the code to convert the int to float and construct the
object. This will cause a Load-Hit-Store issue. Adding the
explicit keyword to the constructor declaration prevents these
implicit conversions. This forces the calling code to construct
the object by hand, so the conversion is visible at the call site.
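
As a small sketch of what explicit does and does not do, consider the hypothetical class Angle and function HalfOf below (both names are made up for this example). The static_assert checks at compile time that an int no longer converts to an Angle implicitly.

```cpp
#include <type_traits>

// Hypothetical class, used only to illustrate the explicit keyword.
class Angle
{
public:
    explicit Angle( float fRadians ) : m_fRadians( fRadians ) {}
    float Radians() const { return m_fRadians; }

private:
    float m_fRadians;
};

float HalfOf( const Angle& angle )
{
    return angle.Radians() * 0.5f;
}

// With explicit, an int can no longer turn into an Angle behind your back.
static_assert( !std::is_convertible<int, Angle>::value,
               "int must not convert to Angle implicitly" );

// HalfOf( 2 );               // Does not compile: no implicit int -> Angle
// HalfOf( Angle( 2.0f ) );   // OK: no int to float conversion involved
// HalfOf( Angle( 2 ) );      // Still compiles, but the conversion is now
//                            // spelled out where a reviewer can see it
```

Note that explicit does not reject Angle( 2 ); it only guarantees that the construction is written out at the call site.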

Reading and writing the same memory too close together

int CauseLHS( int *ptrA )
{
    int a, b;
    int *ptrB = ptrA; // B and A point to the same address
    *ptrA = 5;        // Write data to address ptrA
    b = *ptrB;        // Read that data back again
                      // (won't be available for 40/80 cycles)

    a = b + 10;       // Stall! The data b isn't available yet
    return a;
}

Solution: this seems like the sort of thing the compiler should notice and fix by simply keeping the content of *ptrA in a register. In general it can't: it is obliged to read memory back through a pointer every time you dereference it, because any other pointer in the function might alias it and have modified the data. The keyword __restrict on a pointer promises the compiler that it has no aliases: nothing else in the function points to that same data. Thus, this keyword helps to avoid LHS.

With __restrict, the compiler knows that if it writes data through a pointer, it doesn't need to read it back into a register later on, because nothing else could have written to that address. Without __restrict, the compiler is forced to reload data through a pointer every time it is used, because another pointer may alias the same address.

This keyword is a promise you make to the compiler. If you break your promise, you can get incorrect results. If pointers pA and pB are both __restrict and pA == pB, you will get mysterious bugs.

int slow( int *a, int *b )
{
    *a = 5;
    *b = 7;
    return *a + *b; // Stall! The compiler doesn't know whether
                    // a==b, so it has to reload *a
                    // before the add
}

int fast( int *__restrict a, int *__restrict b )
{
    *a = 5;
    *b = 7;         // Restrict promises that a!=b
    return *a + *b; // No stall, both values are still in registers
}
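
Because a broken __restrict promise fails silently, one option is to check the promise in debug builds. The sketch below is an assumption about how that could look, not something the compiler requires; RangesAreDisjoint and ScaleInto are hypothetical names, and the check disappears when NDEBUG is defined.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: true when the ranges [pA, pA + iCount) and
// [pB, pB + iCount) do not overlap. std::less_equal gives a total order
// even for pointers into unrelated arrays.
inline bool RangesAreDisjoint( const float* pA, const float* pB, int iCount )
{
    std::less_equal<const float*> lessEqual;
    return lessEqual( pA + iCount, pB ) || lessEqual( pB + iCount, pA );
}

// Writes pSrc[i] * fScale into pDst[i]. The caller promises that the
// two buffers do not overlap.
void ScaleInto( float* __restrict pDst, const float* __restrict pSrc,
                int iCount, float fScale )
{
    assert( RangesAreDisjoint( pDst, pSrc, iCount ) ); // Debug-only check
    for( int i = 0; i < iCount; i++ )
    {
        pDst[i] = pSrc[i] * fScale;
    }
}
```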

Depending on your compiler, there may be no way to mark references as __restrict. In that case, copy the parameters to local variables inside your function, then write the final values back out again at the end, as we saw in the previous solutions.
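
A minimal sketch of that workaround with two reference parameters follows. Accumulate is a hypothetical function, and the local-copy version only matches the naive one when iSum, iMax and the array do not alias each other.

```cpp
// Hypothetical function: adds every value to iSum and tracks the largest
// value seen in iMax. Both outputs are references, so updating them
// directly inside the loop could force a store and reload per iteration.
void Accumulate( const int* pValues, int iCount, int& iSum, int& iMax )
{
    int sum = iSum; // Copy the reference parameters into locals
    int max = iMax;
    for( int i = 0; i < iCount; i++ )
    {
        sum += pValues[i];
        if( pValues[i] > max )
        {
            max = pValues[i];
        }
    }
    iSum = sum;     // Write the final values back once
    iMax = max;
}
```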

