As a follow up to my original question about a small piece of this code I decided to ask a follow up to see if you can do better then what we came up with so far.
The code below iterates over a binary tree (left/right = child/next ). I do believe there is room for one less conditional in here (the down
boolean). The fastest answer wins!
- The
cnt
statement can be multiple statements so lets make sure this appears only once
- The
child()
and next()
member functions are about 30x as slow as the hasChild() and hasNext() operations.
- Keep it iterative <-- dropped this requirement as the recursive solution presented was faster.
- This is C++ code
- visit order of the nodes must stay as they are in the example below. ( hit parents first then the children then the 'next' nodes).
- BaseNodePtr is a boost::shared_ptr as thus assignments are slow, avoid any temporary BaseNodePtr variables.
Currently this code takes 5897ms to visit 62200000 nodes in a test tree, calling this function 200,000 times.
void processTree (BaseNodePtr current, unsigned int & cnt )
{
bool down = true;
while ( true )
{
if ( down )
{
while (true) {
cnt++; // this can/will be multiple statesments
if (!current->hasChild()) break;
current = current->child();
}
}
if ( current->hasNext() )
{
down = true;
current = current->next();
}
else
{
down = false;
current = current->parent();
if (!current)
return; // done.
}
}
}
Why not a recursive solution?
void processTree (const BaseNodePtr ¤t, unsigned int & cnt )
{
cnt++;
if (current->hasChild())
processTree(current->child());
if (current->hasNext())
processTree(current->next());
}
Since shared_ptr
seems to be your bottleneck, why not improve it? Are you using threads? If not, then undefine the symbol BOOST_HAS_THREADS
. The shared_ptr
reference count is guarded by a mutex which is probably the cause of the slow performance.
Why not change your data structure to not use shared_ptr
altogether? Manage the raw pointers yourself? Maybe use scoped_ptr
instead?
For the ultimate speed up what you need to do is order the nodes in memory so that they are stored in a contiguous block in the order that you visit them.
e.g If you have a tree defined as follows.
1
/ \
2 3
/ \ /\
4 5 6 7
/\ / /\
8 9 10 11 12
/ \ \
13 14 15
Then the visit function as described will visit the nodes in the following order
1
2
4
8
13
14
9
5
3
6
10
7
11
12
15
Now if you order the nodes in memory as a contiguous block of 15 allocations and store the nodes in the order demonstrated above then you will generally visiting a node that has "spatial locality". This can improve your cache hits, depending upon the size of your node structure and thus make things run faster.
To create a quick iterative method of visiting all the nodes in a tree only once and with no recursion.
unsigned int g_StackDepth = 0;
BaseNodePtr* g_Stack[MAX_STACK_DEPTH];
void processTree (BaseNodePtr root, unsigned int & cnt )
{
g_Stack[g_StackDepth++] = root;
while( g_StackDepth > 0 )
{
BaseNodePtr curr = g_Stack[--g_StackDepth];
cnt++;
if ( curr->HasNext() )
{
g_Stack[g_StackDepth++] = curr->Next();
}
if ( curr->HasChild() )
{
g_Stack[g_StackDepth++] = curr->Child();
}
}
}
Combined with the above ordering you should get just about the best speed you CAN get, to my knowledge.
Obviously this has limitations as you have to know how big your stack can possibly grow in advance. Though you could get around this by using a std::vector instead. Using a std::vector however would eliminate all the advantages the iterative method above provides, however.
Hope that is some help :)
I HATE when answers dismiss the question with a "don't do that" but here I go...
Say there was a way to remove the down bool... will that really make any REAL difference in execution time? We're talking about a small handful of CPU operations and a few byte extra on the stack.
Focus on making the child() and parent() calls faster if you need speed. Otherwise you're wasting your time (IMOHO).
EDIT:
Maybe walk the tree (w/ this "slow" code) ONCE and build an array of pointers into the tree in the desired order. Use this "index" later.
What I'm saying is I think you're approaching optimization from the wrong angle.
Here is how to have only one recursion call instead of two:
void processTree (const BaseNodePtr ¤t, unsigned int & cnt )
{
for(bool gotNext = true; gotNext; current = current->next()) {
cnt++;
if (current->hasChild())
processTree(current->child());
gotNext = current->hasNext();
}
}
Create a "nextvisit" function, and keep calling that, to simplify the code; next to that, use const references iso value-semantics for the shared-pointers... this may save you valuable shared-ptr copies:
// define the order of visitation in here
BaseNodePtr& next( const BaseNodePtr& p ) {
if( p->hasChild() ) return p->child();
if( p->hasNext() ) return p->next();
BaseNodePtr ancestor = p->parent();
while( ancestor != 0 && !ancestor->hasNext() ) ancestor = ancestor->parent();
return ancestor;
}
void processTree( const BaseNodePtr& p, unsigned int& cnt ) {
while( p != NULL ) {
++cnt;
p = next(p);
}
}
But for readability, clarity, maintainability, ... for god's sake, use recursion. Unless your stack isn't big enough.