Now I've always heard binary search trees are faster to build from randomly selected data than ordered data, simply because ordered data requires explicit rebalancing to keep the tree height at a minimum.
Recently I implemented an immutable treap, a special kind of binary search tree which uses randomization to keep itself relatively balanced. In contrast to what I expected, I found I can consistently build a treap about 2x faster and generally better balanced from ordered data than unordered data -- and I have no idea why.
Here's my treap implementation:
And here's a test program:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;
namespace ConsoleApplication1
{
class Program
{
static Random rnd = new Random();
const int ITERATION_COUNT = 20;
static void Main(string[] args)
{
List<double> rndTimes = new List<double>();
List<double> orderedTimes = new List<double>();
rndTimes.Add(TimeIt(50, RandomInsert));
rndTimes.Add(TimeIt(100, RandomInsert));
rndTimes.Add(TimeIt(200, RandomInsert));
rndTimes.Add(TimeIt(400, RandomInsert));
rndTimes.Add(TimeIt(800, RandomInsert));
rndTimes.Add(TimeIt(1000, RandomInsert));
rndTimes.Add(TimeIt(2000, RandomInsert));
rndTimes.Add(TimeIt(4000, RandomInsert));
rndTimes.Add(TimeIt(8000, RandomInsert));
rndTimes.Add(TimeIt(16000, RandomInsert));
rndTimes.Add(TimeIt(32000, RandomInsert));
rndTimes.Add(TimeIt(64000, RandomInsert));
rndTimes.Add(TimeIt(128000, RandomInsert));
string rndTimesAsString = string.Join("\n", rndTimes.Select(x => x.ToString()).ToArray());
orderedTimes.Add(TimeIt(50, OrderedInsert));
orderedTimes.Add(TimeIt(100, OrderedInsert));
orderedTimes.Add(TimeIt(200, OrderedInsert));
orderedTimes.Add(TimeIt(400, OrderedInsert));
orderedTimes.Add(TimeIt(800, OrderedInsert));
orderedTimes.Add(TimeIt(1000, OrderedInsert));
orderedTimes.Add(TimeIt(2000, OrderedInsert));
orderedTimes.Add(TimeIt(4000, OrderedInsert));
orderedTimes.Add(TimeIt(8000, OrderedInsert));
orderedTimes.Add(TimeIt(16000, OrderedInsert));
orderedTimes.Add(TimeIt(32000, OrderedInsert));
orderedTimes.Add(TimeIt(64000, OrderedInsert));
orderedTimes.Add(TimeIt(128000, OrderedInsert));
string orderedTimesAsString = string.Join("\n", orderedTimes.Select(x => x.ToString()).ToArray());
Console.WriteLine("Done");
}
static double TimeIt(int insertCount, Action<int> f)
{
Console.WriteLine("TimeIt({0}, {1})", insertCount, f.Method.Name);
List<double> times = new List<double>();
for (int i = 0; i < ITERATION_COUNT; i++)
{
Stopwatch sw = Stopwatch.StartNew();
f(insertCount);
sw.Stop();
times.Add(sw.Elapsed.TotalMilliseconds);
}
return times.Average();
}
static void RandomInsert(int insertCount)
{
Treap<double> tree = new Treap<double>((x, y) => x.CompareTo(y));
for (int i = 0; i < insertCount; i++)
{
tree = tree.Insert(rnd.NextDouble());
}
}
static void OrderedInsert(int insertCount)
{
Treap<double> tree = new Treap<double>((x, y) => x.CompareTo(y));
for(int i = 0; i < insertCount; i++)
{
tree = tree.Insert(i + rnd.NextDouble());
}
}
}
}
And here's a chart comparing random and ordered insertion times in milliseconds:
Insertions Random Ordered RandomTime / OrderedTime
50 1.031665 0.261585 3.94
100 0.544345 1.377155 0.4
200 1.268320 0.734570 1.73
400 2.765555 1.639150 1.69
800 6.089700 3.558350 1.71
1000 7.855150 4.704190 1.67
2000 17.852000 12.554065 1.42
4000 40.157340 22.474445 1.79
8000 88.375430 48.364265 1.83
16000 197.524000 109.082200 1.81
32000 459.277050 238.154405 1.93
64000 1055.508875 512.020310 2.06
128000 2481.694230 1107.980425 2.24
I don't see anything in the code which makes ordered input asymptotically faster than unordered input, so I'm at a loss to explain the difference.
Why is it so much faster to build a treap from ordered input than random input?
Try this: database on treap.
http://code.google.com/p/treapdb/
You're only seeing a difference of about 2x. Unless you've tuned the daylights out of this code, that's basically in the noise. Most well-written programs, especially those involving data structure, can easily have more room for improvement than that. Here's an example.
I just ran your code and took a few stackshots. Here's what I saw:
Random Insert:
Ordered Insert:
This is a pretty small number of samples, but it suggests in the case of random input that Make (line 43) is consuming a higher fraction of time. That is this code:
I then took 20 stackshots of the Random Insert code to get a better idea of what it was doing:
This reveals some information.
25% of time is spent in line Make:43 or its callees.
15% of time is spent in that line, not in a recognized routine, in other words, in
new
making a new node.90% of time is spent in lines Insert:64 and 68 (which call Make and heapify.
10% of time is spent in RotateLeft and Right.
15% of time is spent in Heapify or its callees.
I also did a fair amount of single-stepping (at the source level), and came to the suspicion that, since the tree is immutable, it spends a lot of time making new nodes because it doesn't want to change old ones. Then the old ones are garbage collected because nobody refers to them anymore.
This has got to be inefficient.
I'm still not answering your question of why inserting ordered numbers is faster than randomly generated numbers, but it doesn't really surprise me, because the tree is immutable.
I don't think you can expect any performance reasoning about tree algorithms to carry over easily to immutable trees, because the slightest change deep in the tree causes it to be rebuilt on the way back out, at a high cost in
new
-ing and garbage collection.