Lets assume I have 3 tables in my Sql Serer 2008 database:
CREATE TABLE [dbo].[Properties](
[PropertyId] [int] NOT NULL,
[PropertyName] [nvarchar](50) NOT NULL
)
CREATE TABLE [dbo].[Entities](
[EntityId] [int] NOT NULL,
[EntityName] [nvarchar](50) NOT NULL
)
CREATE TABLE [dbo].[PropertyValues](
[EntityId] [int] NOT NULL,
[PropertyId] [int] NOT NULL,
[PropertyValue] [int] NOT NULL
)
- Table Properties contains possible set of Properties which values can set up configured for business objects.
- Table Entities contains business objects which are configured from app.
- Table 3 contains selected Property values for business objects. Each business object can contain its own set of properties (i.e. "Property1" can be configured for first object but not configured for the second one).
My task is to find business objects which are exactly same as given object (ones which have exactly same set of properties with exactly same values). Performance is critical.
Any suggestions?
[ADDED]
For example there is an entry in Entities table with EntityId = 1. In PropertyValues table there are 3 row which are related to this entry:
EntityId PropertyId PropertyValue
1 4 Val4
1 5 Val5
1 6 Val6
The requirement is to find other entries in Entity table which have 3 related rows in PropertyValues table and these rows contain the same data as rows for EntityId = 1 (besides of EntityId column)
[ADDED]
Please, see my new question: Best approach to store data which attributes can vary
[BOUNTY1]
Thanks for all. The answers were very helpful. My task is complicated a little bit (but this complication can be useful in performance purposes). Please, see the details below:
- The new table named EntityTypes is added
- EntityTypeId column has been added into Entities and Properties tables
Now, there are several types of entities. Each entity has it's own set of properties.
Is it possible to increase performance using this information?
[BOUNTY2]
There is the second complication:
- IsDeleted column is added to Property table
- PropertyValues table can have values for Properties which already deleted from database. Entities which have such properties are considered invalid.
- Some entities don't have values for each property of EntityType set. These entities also are considered as invalid.
The question is: How do I can write a script which will select all Entities and additional column IsValid for them.
;with cteSource as
(
select PropertyId,
PropertyValue
from PropertyValues
where EntityId = @EntityID
)
select PV.EntityId
from PropertyValues as PV
inner join cteSource as S
on PV.PropertyId = S.PropertyId and
PV.PropertyValue = S.PropertyValue and
PV.EntityId <> @EntityID
group by PV.EntityId
having count(*) = (select count(*)
from cteSource) and
count(*) = (select count(*)
from PropertyValues as PV1
where PV1.EntityId = PV.EntityId)
For your addition you can add this where clause:
where -- exlude entities with deleted properties
PV.EntityID not in (select PV2.EntityID
from Properties as P
inner join PropertyValues as PV2
on P.PropertyID = PV2.PropertyID
where P.IsDeleted = 1)
-- exclude entities with missing EntityType
and PV.EntityID not in (select E.EntityID
from Entities as E
where E.EntityType is null)
Edit:
If you want to test the query against some sample data you can do so here:
http://data.stackexchange.com/stackoverflow/q/110243/matching-properties
My task is to find business objects which are exactly same as given object (ones which have exactly same set of properties with exactly same values). Performance is critical.
Approach might vary depending on the average number of properties objects will typically have, a few versus dozens.
Assuming that objects have a varying number of properties:
I would start with a composite non-unique index on the dyad (PropertyValues.PropertyId, PropertyValues.PropertyValue) for select-performance.
Then, given an entity ID, I would select its propertid, propertyvalue pairs into a cursor.
[EDIT:
Not sure whether (entityid, propertyid) is unique in your system or if you are allowing multiple instances of the same property id for an entity, e.g. FavoriteColors:
entityid propertyid property value
1 17 blue
1 17 dark blue
1 17 sky blue
1 17 ultramarine
You would also need either a non-unique index on the monad (PropertyValues.entityid) or a composite index on (PropertyValues.entityid,PropertyValues.propertyid); the composite index would be unique if you wanted to prevent the same propertyid from being associated with an entity more than once.
If a property can occur multiple times, you should probably have a CanBeMultivalued flag in your Properties table. You should have a unique index on the triad (entityid, propertyid, propertyvalue) if you wanted to prevent this:
entityid propertyid property value
1 17 blue
1 17 blue
If you have this triad indexed, you would not need (entityid) index or the (entityid, propertyid) composite index in the PropertyValues table.
[/EDIT]
Then I would create a temp table to store matching entity ids.
Then I would iterate my cursor above to grab the given entity's propertyid, propertyvalue pairs, one pair at a time, and issue a select statement with each iteration:
insert into temp
select entityid from PropertyValues
where propertyid = mycursor.propertyid and propertyvalue = mycursor.propertyvalue
At the end of the loop, you have a non-distinct set of entityids in your temp table for all entities that had at least one of the properties in common with the given object. But the ones you want must have all properties in common.
Since you know how many properties the given object has, you can do the following to fetch only those entities that have all of the properties in common with the given object:
select entityid from temp
group by entityid having count(entityid) = {the number of properties in the given object}
ADDENDUM:
After the first property-value pair of the given object is used to select all potential matches, your temp table would not be missing any possible matches; rather it would contain entityids that were not perfect matches, which must be discarded in some manner, either by being ignored (by your group by having... clause) or by being explicitly removed from the temp table.
Also, after the first iteration of the loop, you could explore the possibility that an inner join between the temp table and the PropertyValues table might offer some performance gain:
select entityid from propertvalues
>> inner join temp on temp.entityid = propertyvalues.entityid <<
where propertyid = mycursor.propertyid and propertyvalue = mycursor.propertyvalue
And you might also try removing entityids from temp after the first iteration:
delete from temp
where not exists
(
select entityid from propertyvalues
inner join temp on temp.entityid = propertyvalues.entityid
where propertyid = mycursor.propertyid and propertyvalue = mycursor.propertyvalue
)
Alternatively, it would be possible to optimize this looping approach further if you stored some metadata about property-frequency. Optimally, when looking for matches for a given entity, you'd want to begin with the least frequently occuring property-value pair. You could order the given object's property-value pairs by ascending frequency, so that in your loop you'd be looking for the rarest one first. That would reduce the set of potential matches to its smallest possible size on the first iteration of the loop.
Of course, if temp were empty at any time after the given object's first property-value pair was used to look for matches, you would know that there are no matches for your given object, because you have found a property-value that no other entity possesses, and you could exit the loop and return a null set.
One way to look at this is if I have all base ball cards you have then we don't have the same baseball card as I may have more. But if you also have all the baseball cards that I have then we have exactly the same baseball cards. This is a little more complex as we are looking by team. By team could count the match count, my count, and your count and compare those 3 counts but that is 3 joins. This solution is 2 joins and I think it would be faster than the 3 join option.
To me the bonus questions did not make sense. There as a change to a table but that table name did not match any of the tables. Need a full table description for those bonus questions.
Below is the 2 join option:
select [m1].[IDa] as [EntityId1], [m1].[IDb] as [EntityId2]
from
( select [PV1].[EntityId] as [IDa], [PV2].[EntityId] as [IDb]
from [PropertyValue] as [PV1]
left outer join [PropertyValue] as [PV2]
on [PV2].[EntityId] <> [PV1].[EntityId]
and [PV2].[PropertyId] = [PV1].[PropertyId]
and [PV2].[PropertyValue] = [PV1].[PropertyValue]
group by [PV1].[EntityId], [PV2].[EntityId]
having count(*) = count([PV2].[EntityId])
) as [m1]
join
( select [PV1].[EntityId] as [IDa], [PV2].[EntityId] as [IDb]
from [PropertyValue] as [PV1]
right outer join [PropertyValue] as [PV2]
on [PV2].[EntityId] <> [PV1].[EntityId]
and [PV2].[PropertyId] = [PV1].[PropertyId]
and [PV2].[PropertyValue] = [PV1].[PropertyValue]
group by [PV1].[EntityId], [PV2].[EntityId]
having count(*) = count([PV1].[EntityId]))
) as [m2]
on [m1].[IDa] = [m2].[IDa] and [m1].[IDb] = [m2].[IDb]
Below is the 3 join count based option:
select [m1].[IDa] as [EntityId1], [m1].[IDb] as [EntityId2]
from
( select [PV1].[EntityId] as [IDa], [PV2].[EntityId] as [IDb], COUNT(*) as [count]
from [PropertyValue] as [PV1]
join [PropertyValue] as [PV2]
on [PV2].[EntityId] <> [PV1].[EntityId]
and [PV2].[PropertyId] = [PV1].[PropertyId]
and [PV2].[PropertyValue] = [PV1].[PropertyValue]
group by [PV1].[EntityId], [PV2].[EntityId]
) as [m1]
join
( select [PV1].[EntityId] as [IDa], COUNT(*) as [count]
from [PropertyValue] as [PV1]
group by [PV1].[EntityId]
having count(*) = count([PV1].[sID]))
) as [m2]
on [m1].[IDa] = [m2].[IDa] and [m1].[count] = [m2].[count]
join
( select [PV2].[EntityId] as [IDb], COUNT(*) as [count]
from [PropertyValue] as [PV2]
group by [PV2].[EntityId]
) as [m3]
on [m1].[IDb] = [m3].[IDb] and [m1].[count] = [m3].[count]
My task is to find business objects which are exactly same as given
object (ones which have exactly same set of properties with exactly
same values).
if the "given objec"t is described as e.g. #PropertyValues, so the query would be:
create table #PropertyValues(
[PropertyId] [int] NOT NULL,
[PropertyValue] [int] NOT NULL
)
insert #PropertyValues
select
3, 3 -- e.g.
declare
@cnt int
select @cnt = count(*) from #PropertyValues
select
EntityId
from
PropertyValues pv
left join #PropertyValues t on t.PropertyId = pv.PropertyId and t.PropertyValue = pv.PropertyValue
group by
EntityId
having
count(t.PropertyId) = @cnt
and count(pv.PropertyId) = @cnt
drop table #PropertyValues
But if performance is so much critical, you can create special indexed field on table Entities, e.g. EntityIndex varchar(8000), which will be filled by trigger on PropertyValues table as convert(char(10), PropertyId) + convert(char(10), PropertyValue) (for all properties of entity, sorted!). So it will be possible to do very fast seek by this field.
I think this is just a simple self-join:
select P2.EntityID,E.EntityName
from PropertyValues P1
inner join PropertyValues P2
on P1.PropertyID = P2.PropertyID
and P1.PropertyValue = P2.PropertyValue
inner join Entity E
on P2.EntityID = E.EntityID
where P1.EntityId = 1
and P2.EntityId <> 1
group by P2.EntityID, E.EntityName