Finding the difference between two lists of string

2019-02-25 04:04发布

I'm pretty sure this is a duplicate, but I have tried everything, and I still cannot seem to get the differences. I have two lists of strings: listA and listB. I'm trying to find items in listA that are not in B.

Example: listA: "1", "2", "4", "7" listB: "2", "4" The output I want is: "1", "7"

Here is a for loop and lambda expression that I tried, but these take a really long time:

//these two approaches take too long for huge lists

    foreach (var item in listA)
            {
                if (!listB.Contains(item))
                    diff.Add(id);
            }

    diff = listA.Where(id => !listB.Contains(id)).ToList();

//these don't give me the right differences

    listA.Except(listB).ToList();

    var set = new HashSet<string>(listA);
    set.SymmetricExceptWith(listB);

3条回答
在下西门庆
2楼-- · 2019-02-25 04:33
listA.Except(listB).ToList();

should give the correct answer, but

set.SymmetricExceptWith(listB);

should not. SymmetricExcept will give the items in listA not in listB plus the items in ListB not in ListA.

查看更多
乱世女痞
3楼-- · 2019-02-25 04:43

All code you posted should work fine so error is in another place anyway you write "these take a really long time" then I suppose you have a performance issue.

Let's do a very quick and dirty comparison (you know to do a good performance test is a long process, self-promotion: benchmark has been done with this free tool). Assumptions:

  • Lists are unordered.
  • There may be duplicates in our inputs but we don't want duplicates in result.
  • Second list is always a subset of first list (assumed because you're using SymmetricExceptWith and if not then its result is pretty different compared to Except). If it was a mistake just ignore tests for SymmetricExceptWith.

Two lists of 20,000 random items (test repeated 100 times then averaged, release mode).

Method                  Time [ms]
Contains *1                  49.4
Contains *2                  49.0
Except                        5.9
SymmetricExceptWith *3        4.1
SymmetricExceptWith *4        2.5

Notes:

1 Loop with foreach
2 Loop with for
3 Hashset creation measured
4 Hashset creation not measured. I included this for reference but if you don't have first list as Hashset you can't ignore creation time.

We see Contains() method is pretty slow so we can drop it in bigger tests (anyway I checked and its performance won't become better or even comparable). Let's see what will happen for 1,000,000 items list.

Method                        Time [ms]
Except                            244.4
SymmetricExceptWith               259.0

Let's try to make it parallel (please note that for this test I'm using a old Core 2 Duo 2 GHz):

Method                        Time [ms]
Except                            244.4
SymmetricExceptWith               259.0
Except (parallel partitions)      301.8
SymmetricExceptWith (p. p.)       382.6
Except (AsParallel)               274.4

Parallel performance are worse and LINQ Except is best option now. Let's see how it works on a better CPU (Xeon 2.8 GHz, quad core). Also note that with such big amount of data cache size won't affect testing too much.

Method                        Time [ms]
Except                            127.4
SymmetricExceptWith               149.2
Except (parallel partitions)      208.0
SymmetricExceptWith (p. p.)       170.0
Except (AsParallel)                80.2

To summarize: for relatively small lists SymmetricExceptWith() will perform better, for big lists Except() is always better. If you're targeting a modern multi-core CPU then parallel implementation will scale much better. In code:

var c = a.Except(b).ToList();
var c = a.AsParallel().Except(b.AsParallel()).ToList();

Please note that if you don't need List<string> as result and IEnumerable<string> is enough then performance will greatly increase (and difference with parallel execution will be higher).

Of course those two lines of code are not optimal and can be greatly increase (and if it's really performance critical you may pick ParallelEnumerable.Except() implementation as starting point for your own specific highly optimized routine).

查看更多
兄弟一词,经得起流年.
4楼-- · 2019-02-25 04:49

Use LINQ's Except method:

listA.Except(listB).ToList();
查看更多
登录 后发表回答