
PLINQ on large lists taking enormous time

Posted 2019-07-26 16:18

Question:

I have two in-memory lists, plays and consumers, with about 15 million and 3 million objects respectively.

The following are a few of the queries I'm running:

consumersn = consumers.AsParallel()
                      .Where(w => plays.Any(x => x.consumerid == w.consumerid))
                      .ToList();


List<string> consumerids = plays.AsParallel()
                                .Where(w => w.playyear == group_period.year 
                                         && w.playmonth == group_period.month 
                                         && w.sixteentile == group_period.group)
                                .Select(c => c.consumerid)
                                .ToList();


int groupcount = plays.AsParallel()
                      .Where(w => w.playyear == period.playyear 
                               && w.playmonth == period.playmonth 
                               && w.sixteentile == group 
                               && consumerids.Any(x => x == w.consumerid))
                      .Count();

I'm using a 16-core machine with 32 GB of RAM. In spite of this, the first query took around 20 hours to run.

Am I doing something wrong?

All help is sincerely appreciated.

Thanks

Answer 1:

The first LINQ query is very inefficient; parallelization can only help you so much.

Explanation: When you write consumers.Where(w => plays.Any(x => x.consumerid == w.consumerid)), it means that, for every object in consumers, you potentially iterate over the whole plays list looking for a matching play. In the worst case that is 3 million consumers times 15 million plays = 45 trillion comparisons. Even spread across 16 cores, that is about 2.8 trillion comparisons per core.

So the first step is to group all plays by consumerid and cache the result in an appropriate data structure:

var playsByConsumerIds = plays.ToLookup(x => x.consumerid, StringComparer.Ordinal);

Then, your first request becomes:

consumersn = consumers.Where(w => playsByConsumerIds.Contains(w.consumerid)).ToList();

This query should be much faster, even without any parallelization.
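
To make this concrete, here is a minimal, self-contained sketch of the lookup-based version. The Play and Consumer classes and the LookupDemo name are hypothetical stand-ins, since the question does not show the real types; consumerid is assumed to be a string, as the StringComparer.Ordinal argument above implies.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-ins for the real types, which the question does not show.
public class Play { public string consumerid; }
public class Consumer { public string consumerid; }

public static class LookupDemo
{
    // Returns the consumers that have at least one play.
    public static List<Consumer> ConsumersWithPlays(List<Consumer> consumers, List<Play> plays)
    {
        // A single pass over plays builds a hash-based index keyed by consumerid.
        ILookup<string, Play> playsByConsumerId =
            plays.ToLookup(p => p.consumerid, StringComparer.Ordinal);

        // Each Contains call is a hash lookup rather than a scan over plays.
        return consumers.Where(c => playsByConsumerId.Contains(c.consumerid)).ToList();
    }

    public static void Main()
    {
        var plays = new List<Play>
        {
            new Play { consumerid = "a" },
            new Play { consumerid = "a" },
            new Play { consumerid = "c" },
        };
        var consumers = new List<Consumer>
        {
            new Consumer { consumerid = "a" },
            new Consumer { consumerid = "b" },
            new Consumer { consumerid = "c" },
        };

        List<Consumer> result = ConsumersWithPlays(consumers, plays);
        Console.WriteLine(string.Join(",", result.Select(c => c.consumerid))); // prints a,c
    }
}
```

The work is now roughly one pass over plays plus one pass over consumers, instead of one scan of plays per consumer.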

I cannot fix the other queries because I don't see exactly what you are doing with group_period, but I would suggest using GroupBy or ToLookup to create all groups in a single pass.
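
As a rough sketch of that suggestion: assuming playyear, playmonth, and sixteentile are int fields on each play (the question does not show their types), a single ToLookup over a composite key builds every group at once. The Play class and the GroupingDemo and GroupByPeriod names are hypothetical, and the sample values are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in; the field types are assumptions.
public class Play
{
    public string consumerid;
    public int playyear;
    public int playmonth;
    public int sixteentile;
}

public static class GroupingDemo
{
    // Builds every (year, month, sixteentile) group in one pass over plays.
    // Tuple<int, int, int> compares by value, so it works as a composite key.
    public static ILookup<Tuple<int, int, int>, Play> GroupByPeriod(IEnumerable<Play> plays)
    {
        return plays.ToLookup(p => Tuple.Create(p.playyear, p.playmonth, p.sixteentile));
    }

    public static void Main()
    {
        var plays = new List<Play>
        {
            new Play { consumerid = "a", playyear = 2019, playmonth = 1, sixteentile = 3 },
            new Play { consumerid = "b", playyear = 2019, playmonth = 1, sixteentile = 3 },
            new Play { consumerid = "a", playyear = 2019, playmonth = 2, sixteentile = 3 },
        };

        var groups = GroupByPeriod(plays);

        // Counterpart of the second query for one group: an index access, not a full scan.
        List<string> consumerids = groups[Tuple.Create(2019, 1, 3)]
            .Select(p => p.consumerid)
            .ToList();

        Console.WriteLine(string.Join(",", consumerids));            // prints a,b
        Console.WriteLine(groups[Tuple.Create(2020, 1, 1)].Count()); // prints 0 (missing key gives an empty group)
    }
}
```

Once the lookup exists, fetching the plays for any period and group no longer requires another pass over all 15 million plays.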



Answer 2:

The first query took 20 hours to run because plays.Any(x => x.consumerid == w.consumerid) needs to walk through the entire list of 15,000,000 plays each time the consumerid is not there.
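
A small sketch that counts how many elements Any inspects shows the difference between a hit and a miss. The AnyCostDemo and ComparisonsFor names and the four-element array are made up for illustration.

```csharp
using System;
using System.Linq;

public static class AnyCostDemo
{
    // Counts how many elements Any inspects while searching ids for target.
    public static long ComparisonsFor(string[] ids, string target)
    {
        long comparisons = 0;
        ids.Any(id => { comparisons++; return id == target; });
        return comparisons;
    }

    public static void Main()
    {
        string[] ids = { "a", "b", "c", "d" };
        Console.WriteLine(ComparisonsFor(ids, "a")); // prints 1: Any stops at the first match
        Console.WriteLine(ComparisonsFor(ids, "z")); // prints 4: a miss inspects every element
    }
}
```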

You can speed this up by constructing a hash set of all consumer IDs in plays, like this:

var consumerIdsInPlays = new HashSet<string>(plays.Select(p => p.consumerid));

Now your first query can be rewritten to use an O(1) lookup per consumer:

consumersn = consumers
    .AsParallel()
    .Where(w => consumerIdsInPlays.Contains(w.consumerid))
    .ToList();
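
The same idea seems to apply to the third query in the question, where consumerids.Any(...) scans a List<string> once per play. Below is a minimal sketch under that assumption; the Play class and the HashSetCountDemo and CountPlaysByConsumers names are hypothetical, and the year, month, and sixteentile filters are left out to keep it short.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the real type, which the question does not show.
public class Play { public string consumerid; }

public static class HashSetCountDemo
{
    // Counts the plays whose consumerid appears in consumerids.
    public static int CountPlaysByConsumers(List<Play> plays, List<string> consumerids)
    {
        // Built once, before the parallel query starts, and only read inside it.
        // This sketch assumes nothing else modifies the set in the meantime.
        var idSet = new HashSet<string>(consumerids, StringComparer.Ordinal);

        return plays.AsParallel().Count(p => idSet.Contains(p.consumerid));
    }

    public static void Main()
    {
        var plays = new List<Play>
        {
            new Play { consumerid = "a" },
            new Play { consumerid = "b" },
            new Play { consumerid = "a" },
            new Play { consumerid = "c" },
        };
        var consumerids = new List<string> { "a", "c", "x" };

        Console.WriteLine(CountPlaysByConsumers(plays, consumerids)); // prints 3
    }
}
```

With the per-play scan gone, it is worth measuring whether AsParallel() still pays off, since the remaining work per element is a single hash lookup.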


Tags: c# c#-4.0 plinq