What is the best algorithm for removing duplicate values from a list?
I've tried this:
for (int i = 0; i < AuthorCounter - 1; i++)
{
    for (int j = 0; j < AuthorCounter - 1; j++)
    {
        if (i != j)
        {
            if (AuthorGroupNode.Nodes[i].Text == AuthorGroupNode.Nodes[j].Text)
            {
                AuthorGroupNode.Nodes[j].Remove();
                AuthorCounter--;
            }
        }
    }
}
Here, AuthorGroupNode.Nodes is a list of nodes. It removes some of the duplicates, but not all of them. Does anyone have a better solution?
Your current algorithm is O(N²), which will perform quite poorly for a large list.
If space is not an issue, you could keep a HashSet<int> of the hashes of the nodes. Traverse the list once: if a node's hash is already in the HashSet, you know it's a duplicate, so skip it. If the hash is not in the HashSet, add the node to a new list and add its hash to the HashSet.
This runs in O(N), and requires memory for the original list, for a copy of the list minus any duplicates, and for the HashSet. The algorithm is non-destructive.
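That approach can be sketched like this (a minimal illustration, not the asker's actual types: the `Node` class and its `Text` property here are stand-ins for the tree nodes in the question):

```csharp
using System;
using System.Collections.Generic;

// Stand-in for the question's node type.
class Node
{
    public string Text { get; set; }
}

class Program
{
    // Single pass: keep the first node for each hash, skip later ones.
    static List<Node> RemoveDuplicates(List<Node> nodes)
    {
        var seen = new HashSet<int>();   // hashes encountered so far
        var result = new List<Node>();   // copy of the list, minus duplicates
        foreach (var node in nodes)
        {
            // Caveat: two different texts can collide on the same hash code,
            // so a HashSet<string> of the texts themselves would be safer.
            if (seen.Add(node.Text.GetHashCode()))
            {
                result.Add(node);        // first time this hash has appeared
            }
        }
        return result;                   // original list is left untouched
    }

    static void Main()
    {
        var nodes = new List<Node>
        {
            new Node { Text = "A" }, new Node { Text = "B" }, new Node { Text = "A" }
        };
        Console.WriteLine(RemoveDuplicates(nodes).Count); // 2
    }
}
```

`HashSet<T>.Add` returns false when the element is already present, which is what lets the lookup and the insertion happen in one call.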
If you can use Linq, simply do
var distinctList = originalList.Distinct().ToList();
UPDATE
It turns out that's pretty much exactly how Jon Skeet re-implemented Distinct:
public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source)
{
    return source.Distinct(EqualityComparer<TSource>.Default);
}

public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return DistinctImpl(source, comparer ?? EqualityComparer<TSource>.Default);
}

private static IEnumerable<TSource> DistinctImpl<TSource>(
    IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> seenElements = new HashSet<TSource>(comparer);
    foreach (TSource item in source)
    {
        if (seenElements.Add(item))
        {
            yield return item;
        }
    }
}
https://codeblog.jonskeet.uk/2010/12/30/reimplementing-linq-to-objects-part-14-distinct/
This works a treat:
var xs = new[]
{
    2, 3, 2, 4, 3, 3, 5, 6,
};

var ys = xs
    .ToLookup(z => z, z => z)
    .Select(x => x.First());
For your code it would look something like this:
var nodes = AuthorGroupNode.Nodes
    .ToLookup(z => z.Text, z => z)
    .Select(x => x.First())
    .ToArray();
Can't be much simpler than that. :-)
Piggybacking off of Eric J.'s answer: you'll want to implement an EqualityComparer<T> to have complete control over how distinct items are identified.
class Program
{
    static void Main(string[] args)
    {
        var list = new List<SampleClass>();
        // add some items
        var distinctItems = list.Distinct(new SampleClass());
    }
}

public class SampleClass : EqualityComparer<SampleClass>
{
    public string Text { get; set; }

    public override bool Equals(SampleClass x, SampleClass y)
    {
        if (ReferenceEquals(x, y)) return true; // also covers both being null
        if (x == null || y == null) return false;
        return x.Text == y.Text;
    }

    public override int GetHashCode(SampleClass obj)
    {
        if (obj == null || obj.Text == null) return 0;
        return obj.Text.GetHashCode();
    }
}
More info: http://msdn.microsoft.com/en-us/library/bb338049
You never check the last element of the list; your second for loop needs to be changed to this to work:
for (int j = 0; j < AuthorCounter; j++)
You are also checking each pair of nodes twice: first when i = 0 and j = 1, then again when i = 1 and j = 0. There's no need to start j at or before i. When i = 0, the inner loop removes all duplicates of that element, so you know AuthorGroupNode.Nodes[0] is unique; the next time through the outer loop you can be sure AuthorGroupNode.Nodes[1] is unique, and so on. You can therefore start with j equal to i + 1 and drop the check for i == j.

Also, when you remove the node at j, j still advances to the next index, which skips the node right after the one you removed. So either decrement j after a removal, or only increment j when you don't remove a node:
for (int j = i + 1; j < AuthorCounter;)
{
    if (AuthorGroupNode.Nodes[i].Text == AuthorGroupNode.Nodes[j].Text)
    {
        AuthorGroupNode.Nodes[j].Remove();
        AuthorCounter--;
    }
    else
    {
        j++;
    }
}
You say that works but isn't perfect, so I'm assuming you're not using a standard List and that your nodes handle their own removal from the list via the Remove() method.
If the list is sorted by the field you're comparing, you could remove the inner for loop altogether and remove any duplicates of your current element until you find a different one:
for (int i = 0; i < AuthorCounter - 1;)
{
    if (AuthorGroupNode.Nodes[i].Text == AuthorGroupNode.Nodes[i + 1].Text)
    {
        AuthorGroupNode.Nodes[i].Remove();
        AuthorCounter--;
    }
    else
    {
        i++;
    }
}