Remove unsorted/outlier elements in a nearly-sorted array

Posted 2019-05-14 06:46

Question:

Given an array like [15, 14, 12, 3, 10, 4, 2, 1], how can I determine which elements are out of order and remove them (the number 3 in this case)? I don't want to sort the list, only detect the outliers and remove them.

Another example:

[13, 12, 4, 9, 8, 6, 7, 3, 2]

I want to be able to remove the values 4 and 7 so that I end up with:

[13, 12, 9, 8, 6, 3, 2]

There's also a problem that arises when you have this scenario:

[15, 13, 12, 7, 10, 5, 4, 3]

You could either remove 7 or 10 to make this array sorted.

In general, the problem I'm trying to solve is this: given a list of numerical readings (some of which could be off by quite a bit), I want the array to include only the values that follow the general trendline, with any outliers removed. I'm just wondering if there is a simple way to do this.

Answer 1:

I would reduce your problem to the longest increasing (decreasing) subsequence problem.

https://en.wikipedia.org/wiki/Longest_increasing_subsequence

Since your sequence is nearly sorted, the longest decreasing subsequence should retain almost every element and drop only the out-of-order ones, so the result should follow the trendline closely.

There exist a number of solutions to it. One of them is presented in the free book "Fundamentals of Computer Programming with C#" by Svetlin Nakov and Veselin Kolev: the problem is on page 257 (exercise 6) and the solution is on page 260.

Taken from the book:

Write a program which finds the maximal sequence of increasing elements in an array arr[n]. The elements need not be placed consecutively. E.g.: {9, 6, 2, 7, 4, 7, 6, 5, 8, 4} -> {2, 4, 6, 8}.

Solution:

We can solve the problem with two nested loops and one more array len[0…n-1]. In len[i] we keep the length of the longest increasing subsequence (not necessarily made of adjacent elements) which starts somewhere in the array (it does not matter where exactly) and ends with the element arr[i]. Therefore len[0] = 1, and len[x] is the maximum of 1 + len[prev] over all prev < x with arr[prev] < arr[x], or 1 if no such prev exists. Following this definition, we can calculate len[0…n-1] with two nested loops: the outer loop iterates through the array from left to right with the loop variable x. The inner loop iterates from the start of the array to position x-1 and searches for the element prev with the maximal value of len[prev] among those where arr[prev] < arr[x]. After the search, we set len[x] to 1 + the biggest len[prev] found, or to 1 if no such value is found.

The described algorithm finds, for each element, the length of the longest ascending subsequence which ends at that element. The biggest of these values is the length of the longest increasing subsequence. If we need the elements themselves, we can start from the element where the sequence ends (at index x), print it, and search for a previous element (prev). By definition prev < x, arr[prev] < arr[x] and len[x] = 1 + len[prev], so we can find prev with a for-loop from 0 to x-1. After that we repeat the same for x = prev. By finding and printing the previous element (prev) for as long as one exists, we obtain the elements which compose the longest sequence in reversed order (from the last to the first).
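As a rough illustration of how this applies to the question, here is a minimal Python sketch of the same two-loop idea with the comparison flipped, so that it finds a longest strictly decreasing subsequence. The function name and the extra prev array (which stores the back-links instead of re-searching for them) are my own choices, not something taken from the book.

```python
def longest_decreasing_subsequence(arr):
    """Return one longest strictly decreasing subsequence of arr (O(n^2) time)."""
    n = len(arr)
    if n == 0:
        return []
    length = [1] * n   # length[x]: longest decreasing subsequence ending at index x
    prev = [-1] * n    # prev[x]: index of the element before arr[x], or -1 if none
    for x in range(n):
        for p in range(x):
            # arr[p] may precede arr[x] only if it is strictly larger.
            if arr[p] > arr[x] and length[p] + 1 > length[x]:
                length[x] = length[p] + 1
                prev[x] = p
    # Start at the end of a longest subsequence and follow the back-links.
    x = max(range(n), key=length.__getitem__)
    result = []
    while x != -1:
        result.append(arr[x])
        x = prev[x]
    result.reverse()
    return result


print(longest_decreasing_subsequence([15, 14, 12, 3, 10, 4, 2, 1]))
# [15, 14, 12, 10, 4, 2, 1]
print(longest_decreasing_subsequence([13, 12, 4, 9, 8, 6, 7, 3, 2]))
# [13, 12, 9, 8, 6, 3, 2]
```

For the ambiguous input [15, 13, 12, 7, 10, 5, 4, 3] this particular sketch happens to keep 7 and drop 10, because ties are resolved in favor of the earlier element; keeping 10 instead would be equally valid.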



Answer 2:

A simple algorithm which has been described by higuaro can help you to generate a correct sequence:

For each element at index i, if a[i] < a[i + 1], we can simply remove the element a[i].

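// a is assumed to be a std::vector<int>
for (std::size_t i = 0; i + 1 < a.size(); ) {
    if (a[i] < a[i + 1]) {
        a.erase(a.begin() + i);   // a[i] breaks the descending order
        if (i > 0) i--;           // re-check the previous element
    } else {
        i++;
    }
}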

However, this approach cannot guarantee that the number of removed elements is minimal. For example, for the sequence [10, 9, 8, 100, 1, 0], removing 100 is optimal, whereas this algorithm removes 8, then 9, then 10.
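To see this concretely, here is a small Python sketch of the removal loop above, run on that counterexample. The function name greedy_remove is made up for illustration, and the bounds checks are my own additions.

```python
def greedy_remove(values):
    """Drop a[i] whenever a[i] < a[i + 1]; not guaranteed to be minimal."""
    a = list(values)  # work on a copy so the caller's list is untouched
    i = 0
    while i + 1 < len(a):
        if a[i] < a[i + 1]:
            del a[i]          # a[i] breaks the descending order
            if i > 0:
                i -= 1        # the previous element has a new right neighbor
        else:
            i += 1
    return a


print(greedy_remove([10, 9, 8, 100, 1, 0]))
# [100, 1, 0]
print(greedy_remove([15, 14, 12, 3, 10, 4, 2, 1]))
# [15, 14, 12, 10, 4, 2, 1]
```

The first call throws away three elements where removing only 100 would have sufficed; the second shows that the simple loop does fine when the outlier is a single value that is too small.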

To find the minimum number of elements to remove, notice that we need the longest decreasing subsequence: everything outside it is what gets removed. This is analogous to the classic longest increasing subsequence problem, whose solution has been described here.
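If the input is large, one possible way to get that minimal removal set is the well-known O(n log n) binary-search variant of longest increasing subsequence, applied to the negated values. The sketch below is only an illustration under that assumption; the function name split_outliers and its return shape are hypothetical, not part of any answer here.

```python
from bisect import bisect_left


def split_outliers(arr):
    """Return (kept, removed), where kept is one longest strictly
    decreasing subsequence of arr and removed is everything else."""
    tails = []      # tails[k]: negated last value of the best run of length k + 1
    tails_idx = []  # index in arr of that last value
    prev = [-1] * len(arr)
    for i, v in enumerate(arr):
        key = -v    # decreasing in arr means increasing in the negated values
        k = bisect_left(tails, key)
        if k == len(tails):
            tails.append(key)
            tails_idx.append(i)
        else:
            tails[k] = key
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    # Walk the back-links from the end of the longest run.
    keep = set()
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        keep.add(i)
        i = prev[i]
    kept = [v for i, v in enumerate(arr) if i in keep]
    removed = [v for i, v in enumerate(arr) if i not in keep]
    return kept, removed


print(split_outliers([10, 9, 8, 100, 1, 0]))
# ([10, 9, 8, 1, 0], [100])
```

When several longest subsequences exist (as in [13, 12, 4, 9, 8, 6, 7, 3, 2], where dropping either 6 or 7 alongside 4 works), this returns just one of them, so the removed values may differ from what you would pick by eye even though their count is minimal.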