Performance: .Join vs .Contains - Linq to Entities

2019-04-26 14:09发布

问题:

I am using Linq to entities to query the database to get the list of int for further processing. I have two ways to get the list as below:

First is:

List<int> lstBizIds = new List<int>() { 1, 2, 3, 4, 5 };
List<int> lstProjectIds = context.Projects.Where(x => lstBizIds.Contains(x.businessId)).Select(x => x.projectId).ToList();

Second is:

List<int> lstBizIds = new List<int>() { 1, 2, 3, 4, 5 };
List<int> lstProjectIds = context.Projects.Join(lstBizIds, p => p.businessId, u => u, (p, u) => p.projectId).ToList();

Now my question is which one of the methods above is better performance wise? Also does it affect the performance if the first list i.e. lstBizIds grows in size? Suggest me other ways of implementation as well if that are performance reducing.

回答1:

You should go with Contains, because EF can produce a more efficient query.

This would be the SQL join:

SELECT Id
FROM Projects
INNER JOIN (VALUES (1), (2), (3), (4), (5)) AS Data(Item) ON Projects.UserId = Data.Item

This would be the SQL Contains:

SELECT Id
FROM Projects
WHERE UserId IN (1, 2, 3, 4, 5, 6)

IN is more efficient than JOIN because the DBMS can stop looking after the first match of the IN; the JOIN always finishes, even after the the first match.

You might also want to check which queries are actually sent to the DB. You always have to compare the SQL, not the LINQ code (obviously).



回答2:

Performing a join is quite efficient because Where condition actually performs a cartesian product of all the tables, then filters the rows that satisfy the condition. This means the Where condition is evaluated for each combination of rows (n1 * n2 * n3 * n4)

The Join operator takes the rows from the first tables, then takes only the rows with a matching key from the second table, then only the rows with a matching key from the third table, and so on. Secondly, contains would work in an iterative manner making it slower than join



回答3:


I choose the first,because it does not increase the computer's memory.
if you use both array to compare the conditions, choose from the second.



回答4:

I just spent quite some time trying to find, what caused a stack overflow error in a program with a few simple LINQ queries accessing a mid size database.

for ICollection with ~10k elements on one side, and sql table on the other, a single change from "join" to "Contains" fixed the stack overflow error.

it seems that notwithstanding comparative performance, Contains is a safer choice.