Now I've always heard that binary search trees are faster to build from randomly selected data than from ordered data, simply because ordered data requires explicit rebalancing to keep the tree height at a minimum.
Recently I implemented an immutable treap, a special kind of binary search tree which uses randomization to keep itself relatively balanced. Contrary to what I expected, I found I can consistently build a treap about 2x faster, and generally better balanced, from ordered data than from unordered data -- and I have no idea why.
Here's my treap implementation:
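(The actual implementation isn't reproduced in this post. For reference while reading the answers, here is a minimal sketch of what an immutable treap with the members they mention - Insert, IsEmpty, RotateLeft, RotateRight, HeapifyLeft, HeapifyRight - could look like. It is only an illustration, not the original code.)

using System;

// Sketch only: an immutable treap exposing the members referenced in the answers below.
class Treap<T>
{
    static readonly Random PriorityGen = new Random();

    readonly Comparison<T> _comparer;
    readonly bool _isEmpty;
    readonly T _value;
    readonly int _priority;
    readonly Treap<T> _left, _right;

    // The public constructor builds an empty treap, as in: new Treap<double>((x, y) => x.CompareTo(y))
    public Treap(Comparison<T> comparer)
    {
        _comparer = comparer;
        _isEmpty = true;
    }

    Treap(Comparison<T> comparer, T value, int priority, Treap<T> left, Treap<T> right)
    {
        _comparer = comparer;
        _value = value;
        _priority = priority;
        _left = left;
        _right = right;
    }

    public bool IsEmpty { get { return _isEmpty; } }

    public Treap<T> Insert(T value)
    {
        if (IsEmpty)
        {
            // A new leaf gets a random priority; the heap order on priorities is
            // what keeps the treap balanced in expectation.
            return new Treap<T>(_comparer, value, PriorityGen.Next(),
                                new Treap<T>(_comparer), new Treap<T>(_comparer));
        }

        if (_comparer(value, _value) < 0)
        {
            // Insert into the left subtree, then restore the heap property if needed.
            var node = new Treap<T>(_comparer, _value, _priority, _left.Insert(value), _right);
            return node.HeapifyLeft();
        }
        else
        {
            var node = new Treap<T>(_comparer, _value, _priority, _left, _right.Insert(value));
            return node.HeapifyRight();
        }
    }

    // Rotate right if the left child's priority beats this node's.
    Treap<T> HeapifyLeft()
    {
        return !_left.IsEmpty && _left._priority > _priority ? RotateRight() : this;
    }

    // Rotate left if the right child's priority beats this node's.
    Treap<T> HeapifyRight()
    {
        return !_right.IsEmpty && _right._priority > _priority ? RotateLeft() : this;
    }

    // The right child becomes the root of this subtree.
    Treap<T> RotateLeft()
    {
        return new Treap<T>(_comparer, _right._value, _right._priority,
            new Treap<T>(_comparer, _value, _priority, _left, _right._left),
            _right._right);
    }

    // The left child becomes the root of this subtree.
    Treap<T> RotateRight()
    {
        return new Treap<T>(_comparer, _left._value, _left._priority,
            _left._left,
            new Treap<T>(_comparer, _value, _priority, _left._right, _right));
    }
}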
And here's a test program:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;

namespace ConsoleApplication1
{
    class Program
    {
        static Random rnd = new Random();
        const int ITERATION_COUNT = 20;

        static void Main(string[] args)
        {
            List<double> rndTimes = new List<double>();
            List<double> orderedTimes = new List<double>();

            rndTimes.Add(TimeIt(50, RandomInsert));
            rndTimes.Add(TimeIt(100, RandomInsert));
            rndTimes.Add(TimeIt(200, RandomInsert));
            rndTimes.Add(TimeIt(400, RandomInsert));
            rndTimes.Add(TimeIt(800, RandomInsert));
            rndTimes.Add(TimeIt(1000, RandomInsert));
            rndTimes.Add(TimeIt(2000, RandomInsert));
            rndTimes.Add(TimeIt(4000, RandomInsert));
            rndTimes.Add(TimeIt(8000, RandomInsert));
            rndTimes.Add(TimeIt(16000, RandomInsert));
            rndTimes.Add(TimeIt(32000, RandomInsert));
            rndTimes.Add(TimeIt(64000, RandomInsert));
            rndTimes.Add(TimeIt(128000, RandomInsert));
            string rndTimesAsString = string.Join("\n", rndTimes.Select(x => x.ToString()).ToArray());

            orderedTimes.Add(TimeIt(50, OrderedInsert));
            orderedTimes.Add(TimeIt(100, OrderedInsert));
            orderedTimes.Add(TimeIt(200, OrderedInsert));
            orderedTimes.Add(TimeIt(400, OrderedInsert));
            orderedTimes.Add(TimeIt(800, OrderedInsert));
            orderedTimes.Add(TimeIt(1000, OrderedInsert));
            orderedTimes.Add(TimeIt(2000, OrderedInsert));
            orderedTimes.Add(TimeIt(4000, OrderedInsert));
            orderedTimes.Add(TimeIt(8000, OrderedInsert));
            orderedTimes.Add(TimeIt(16000, OrderedInsert));
            orderedTimes.Add(TimeIt(32000, OrderedInsert));
            orderedTimes.Add(TimeIt(64000, OrderedInsert));
            orderedTimes.Add(TimeIt(128000, OrderedInsert));
            string orderedTimesAsString = string.Join("\n", orderedTimes.Select(x => x.ToString()).ToArray());

            Console.WriteLine("Done");
        }

        static double TimeIt(int insertCount, Action<int> f)
        {
            Console.WriteLine("TimeIt({0}, {1})", insertCount, f.Method.Name);
            List<double> times = new List<double>();
            for (int i = 0; i < ITERATION_COUNT; i++)
            {
                Stopwatch sw = Stopwatch.StartNew();
                f(insertCount);
                sw.Stop();
                times.Add(sw.Elapsed.TotalMilliseconds);
            }
            return times.Average();
        }

        static void RandomInsert(int insertCount)
        {
            Treap<double> tree = new Treap<double>((x, y) => x.CompareTo(y));
            for (int i = 0; i < insertCount; i++)
            {
                tree = tree.Insert(rnd.NextDouble());
            }
        }

        static void OrderedInsert(int insertCount)
        {
            Treap<double> tree = new Treap<double>((x, y) => x.CompareTo(y));
            for (int i = 0; i < insertCount; i++)
            {
                tree = tree.Insert(i + rnd.NextDouble());
            }
        }
    }
}
And here's a chart comparing random and ordered insertion times in milliseconds:
Insertions    Random (ms)    Ordered (ms)    RandomTime / OrderedTime
        50       1.031665        0.261585                        3.94
       100       0.544345        1.377155                        0.40
       200       1.268320        0.734570                        1.73
       400       2.765555        1.639150                        1.69
       800       6.089700        3.558350                        1.71
      1000       7.855150        4.704190                        1.67
      2000      17.852000       12.554065                        1.42
      4000      40.157340       22.474445                        1.79
      8000      88.375430       48.364265                        1.83
     16000     197.524000      109.082200                        1.81
     32000     459.277050      238.154405                        1.93
     64000    1055.508875      512.020310                        2.06
    128000    2481.694230     1107.980425                        2.24
I don't see anything in the code which makes ordered input asymptotically faster than unordered input, so I'm at a loss to explain the difference.
Why is it so much faster to build a treap from ordered input than random input?
Aaronaught has done a really decent job explaining this.
For these two special cases, I find it easier to think in terms of the insertion path lengths.
For random input, your insertion path goes down to one of the leaves, and the length of the path - thus the number of rotations - is bounded by the height of the tree.
In the sorted case, you walk down the right spine of the treap, and the bound is the length of the spine, which is less than or equal to the height.
Since you rotate nodes along the insertion path, and in this case the insertion path is the spine, the rotations always shorten the spine, which means a shorter insertion path for the next insertion (since that path is again just the spine), and so on.
Edit: for the random case the insertion path is 1.75x longer.
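If you want to check the path lengths yourself, one low-touch way is to wrap the comparison delegate with a counter; assuming Insert makes one comparison per node it visits (the usual shape of a treap insert), the average number of comparisons per insert is the average insertion path length. A sketch that would slot into the question's Program class (MeasurePathLength is a hypothetical helper, not in the original code):

// Count comparisons made during inserts by wrapping the comparer,
// then report the average insertion path length.
static void MeasurePathLength(int insertCount, bool ordered)
{
    long comparisons = 0;
    Treap<double> tree = new Treap<double>((x, y) => { comparisons++; return x.CompareTo(y); });
    for (int i = 0; i < insertCount; i++)
    {
        tree = tree.Insert(ordered ? i + rnd.NextDouble() : rnd.NextDouble());
    }
    Console.WriteLine("{0}: avg insertion path length = {1:F2}",
        ordered ? "ordered" : "random", (double)comparisons / insertCount);
}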
I ran your code, and I think it has to do with the number of rotations. During ordered input, the number of rotations is optimal, and the tree never has to rotate back.
During random input the tree will have to perform more rotations, because it may have to rotate back and forth.
To really find out, I would have to add counters for the number of left and right rotations in each run. You can probably do this yourself.
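For example, the counters could be as simple as this (a sketch only: RotationStats and the increments are hypothetical additions, not part of the original code):

// Counters bumped inside the rotation methods, read back after each run.
static class RotationStats
{
    public static long LeftRotations;
    public static long RightRotations;

    public static void Reset()
    {
        LeftRotations = 0;
        RightRotations = 0;
    }
}

// Inside Treap<T>.RotateLeft():  RotationStats.LeftRotations++;
// Inside Treap<T>.RotateRight(): RotationStats.RightRotations++;

// Then, in the benchmark:
// RotationStats.Reset();
// RandomInsert(10000);
// Console.WriteLine("left={0} right={1}", RotationStats.LeftRotations, RotationStats.RightRotations);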
UPDATE:
I put breakpoints on RotateLeft and RotateRight. During ordered input, RotateRight is never used. During random input, both are hit, and it seems to me that they are used more frequently.
UPDATE 2:
I added some output to the 50-item ordered run (substituting integers for clarity) to learn more:
The ordered items always get added to the right side of the tree, naturally. When the right side gets bigger than the left, a RotateLeft happens; RotateRight never happens. A new top node is selected roughly every time the tree doubles in size. The randomness of the priority value jitters it a little, so in this run it goes 0, 1, 4, 11, 29.
A random run reveals something interesting:
Rotations happen not so much because the tree is unbalanced, but because of the priorities, which are randomly selected. For example, we get 4 rotations at the 13th insertion: we have a tree balanced at 5/7 (which is fine), but end up at 13/0! It would seem that the use of random priorities deserves further investigation. Anyhow, it is plain to see that the random inserts cause a lot more rotations than the ordered inserts.
I added calculation of the standard deviation and changed your test to run at the highest priority (to reduce noise as much as possible). These are the results:
I ran a sampling profiler as well; here are the results (the number of times the program was observed in each method):
Conclusion: the random input has a fairly even distribution between left and right rotations, while the ordered input never rotates right.
Here is my improved "benchmark" program:
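(The full program isn't reproduced here. A sketch of the two changes described - running at high priority and reporting the standard deviation - could look like this, keeping the shape of the question's TimeIt and assuming using System.Threading; in addition to the existing usings:)

static double TimeIt(int insertCount, Action<int> f)
{
    // Reduce scheduling noise: run the process and the timing thread at high priority.
    Process.GetCurrentProcess().PriorityClass = ProcessPriorityClass.High;
    Thread.CurrentThread.Priority = ThreadPriority.Highest;

    List<double> times = new List<double>();
    for (int i = 0; i < ITERATION_COUNT; i++)
    {
        Stopwatch sw = Stopwatch.StartNew();
        f(insertCount);
        sw.Stop();
        times.Add(sw.Elapsed.TotalMilliseconds);
    }

    // Report the standard deviation alongside the mean instead of the mean alone.
    double mean = times.Average();
    double stdDev = Math.Sqrt(times.Sum(t => (t - mean) * (t - mean)) / times.Count);
    Console.WriteLine("{0} x {1}: {2:F3} ms (stddev {3:F3})", f.Method.Name, insertCount, mean, stdDev);
    return mean;
}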
@Guge is right, but there is a little more to it. I am not saying it is the biggest factor in this case, but it is there, and it is hard to do anything about it.
For sorted input, the lookup portion of each insert mostly touches nodes that the previous inserts just touched, so they are still hot in the cache. (This is true in general for balanced trees like AVL trees, red-black trees, B-trees, etc.)
Since inserts start with a lookup, this has an effect on insert/delete performance as well.
Again, I am not claiming that it is the most significant factor in every case. It is there, however, and it will likely result in sorted input always being faster than random input in these data structures.
Self-balancing trees exist to fix the problems associated with non-randomly-distributed data. By definition, they trade away a bit of the best-case performance to vastly improve the worst-case performance associated with unbalanced BSTs, specifically that of sorted input.
You're actually overthinking this problem, because slower insertion of random data vs. ordered data is a characteristic of any balanced tree. Try it on an AVL tree and you'll see the same results.
Cameron had the right idea, removing the priority check to force the worst case. If you do that and instrument your tree so you can see how many rebalances are happening for each insert, it actually becomes very obvious what's going on. When inserting sorted data, the tree always rotates left and the root's right child is always empty. Insertion always results in exactly one rebalance because the insertion node has no children and no recursion occurs. On the other hand, when you run it on the random data, almost immediately you start to see multiple rebalances happening on every insert, as many as 5 or 6 of them in the smallest case (50 inserts), because it's happening on subtrees as well.
With priority checking turned back on, not only are rebalances typically less expensive - more nodes get pushed into the left subtree, which they never leave because no insertions happen there - but they are also less likely to occur. Why? Because in the treap, high-priority nodes float to the top, and the constant left-rotations (not accompanied by right-rotations) start to push all the high-priority nodes into the left subtree as well. The result is that rebalances happen less frequently because of the uneven distribution of priorities.
If you instrument the rebalancing code you'll see that this is true; for both the sorted and the random input, you end up with almost identical numbers of left-rotations, but the random input also produces about the same number of right-rotations, which makes for twice as many in all. This shouldn't be surprising: random input should produce a roughly even split of left and right rotations. You'll also see that there are only about 60-70% as many top-level rebalances for the sorted input, which is perhaps surprising, and again, that's due to the sorted input messing with the natural distribution of priorities.
You can also verify this by inspecting the full tree at the end of an insertion loop. With the random input, priorities tend to decrease fairly linearly by level; with the sorted input, priorities tend to stay very high until you get to one or two levels from the bottom.
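If you want to reproduce that inspection, a small helper like the following would do it; it is written against the hypothetical field names from the treap sketch near the top of this post (so it would live inside the Treap<T> class), not against the original implementation:

// Hypothetical inspection helper: print the average priority at each depth.
// Assumes using System.Collections.Generic; and using System.Linq; in the file.
public void PrintPriorityByLevel()
{
    var sums = new Dictionary<int, long>();
    var counts = new Dictionary<int, int>();
    Collect(this, 0, sums, counts);
    foreach (int depth in sums.Keys.OrderBy(d => d))
    {
        Console.WriteLine("depth {0}: {1} nodes, avg priority {2:F0}",
            depth, counts[depth], (double)sums[depth] / counts[depth]);
    }
}

static void Collect(Treap<T> node, int depth,
                    Dictionary<int, long> sums, Dictionary<int, int> counts)
{
    if (node.IsEmpty)
        return;
    long s; int c;
    sums.TryGetValue(depth, out s);
    counts.TryGetValue(depth, out c);
    sums[depth] = s + node._priority;
    counts[depth] = c + 1;
    Collect(node._left, depth + 1, sums, counts);
    Collect(node._right, depth + 1, sums, counts);
}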
Hopefully I've done a decent job explaining this... let me know if any of it is too vague.
Yes, it's the number of rotations that is causing the extra time. Here's what I did: I changed HeapifyLeft and HeapifyRight so rotations are always done, added a Console.WriteLine after the if in RotateLeft and RotateRight, and added a Console.WriteLine in the IsEmpty part of the Insert method to see what was being inserted.
Output:
Result: 2 times as many rotations on random data.
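For reference, the "rotations are always done" change could look roughly like this when applied to the hypothetical HeapifyLeft/HeapifyRight/RotateLeft/RotateRight from the treap sketch near the top (again, a sketch, not the answerer's actual diff):

// Worst-case experiment: always rotate instead of checking priorities,
// and log every rotation so they can be counted per run.
Treap<T> HeapifyLeft()
{
    return _left.IsEmpty ? this : RotateRight();
}

Treap<T> HeapifyRight()
{
    return _right.IsEmpty ? this : RotateLeft();
}

Treap<T> RotateLeft()
{
    Console.WriteLine("RotateLeft at {0}", _value);    // log every left rotation
    return new Treap<T>(_comparer, _right._value, _right._priority,
        new Treap<T>(_comparer, _value, _priority, _left, _right._left),
        _right._right);
}

Treap<T> RotateRight()
{
    Console.WriteLine("RotateRight at {0}", _value);   // log every right rotation
    return new Treap<T>(_comparer, _left._value, _left._priority,
        _left._left,
        new Treap<T>(_comparer, _value, _priority, _left._right, _right));
}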