EDIT: Wow, many great responses. Yes, I am using this as a fitness function for judging the quality of a sort performed by a genetic algorithm. So cost-of-evaluation is important (i.e., it has to be fast, preferably O(n)
.)
As part of an AI application I am toying with, I'd like to be able to rate a candidate array of integers based on its monotonicity, aka its "sortedness". At the moment, I'm using a heuristic that calculates the longest sorted run, and then divides that by the length of the array:
public double monotonicity(int[] array) {
if (array.length == 0) return 1d;
int longestRun = longestSortedRun(array);
return (double) longestRun / (double) array.length;
}
public int longestSortedRun(int[] array) {
if (array.length == 0) return 0;
int longestRun = 1;
int currentRun = 1;
for (int i = 1; i < array.length; i++) {
if (array[i] >= array[i - 1]) {
currentRun++;
} else {
currentRun = 1;
}
if (currentRun > longestRun) longestRun = currentRun;
}
return longestRun;
}
This is a good start, but it fails to take into account the possibility that there may be "clumps" of sorted sub-sequences. E.g.:
{ 4, 5, 6, 0, 1, 2, 3, 7, 8, 9}
This array is partitioned into three sorted sub-sequences. My algorithm will rate it as only 40% sorted, but intuitively, it should get a higher score than that. Is there a standard algorithm for this sort of thing?
It highly depends on what you're intending to use the measure for, but one easy way to do this is to feed the array into a standard sorting algorithm and measure how many operations (swaps and/or comparisons) need to be done to sort the array.
I have the same problem (monotonicity scoring), and I suggest you to try Longest Increasing Subsequence. The most efficient algorithm run in
O(n log n)
, not so bad.Taking example from the question, the longest increasing sequence of
{4, 5, 6, 0, 1, 2, 3, 7, 8, 9}
is{0, 1, 2, 3, 7, 8, 9}
(length of 7). Maybe it rate better (70%) than your longest-sorted-run algorithm.I expect that the choice of function to use depends very strongly on what you intend to use it for. Based on your question, I would guess that you are using a genetic system to create a sorting program, and this is to be the ranking function. If that is the case, then speed of execution is crucial. Based on that, I bet your longest-sorted-subsequence algorithm would work pretty well. That sounds like it should define fitness pretty well.
This seems like a good candidate for
LevenshteinDamerau–Levenshtein distance - the number of swaps needed to sort the array. This should be proportional to how far each item is from where it should be in a sorted array.Here's a simple ruby algorithm that sums the squares of the distances. It seems a good measure of sortedness - the result gets smaller every time two out-of-order elements are swapped.
I would suggest looking at the Pancake Problem and the reversal distance of the permutations. These algorithms are often used to find the distance between two permutations (the Identity and the permuted string). This distance measure should take into account more clumps of in order values, as well as reversals (monotonically decreasing instead of increasing subsequences). There are also approximations that are polynomial time[PDF].
It really all depends on what the number means and if this distance function makes sense in your context though.