T-SQL finding of exactly same values in referenced

2019-07-23 04:36发布

问题:

Lets assume I have 3 tables in my Sql Serer 2008 database:

CREATE TABLE [dbo].[Properties](
    [PropertyId] [int] NOT NULL,
    [PropertyName] [nvarchar](50) NOT NULL
)

CREATE TABLE [dbo].[Entities](
    [EntityId] [int] NOT NULL,
    [EntityName] [nvarchar](50) NOT NULL
)    

CREATE TABLE [dbo].[PropertyValues](
    [EntityId] [int] NOT NULL,
    [PropertyId] [int] NOT NULL,
    [PropertyValue] [int] NOT NULL
)
  1. Table Properties contains possible set of Properties which values can set up configured for business objects.
  2. Table Entities contains business objects which are configured from app.
  3. Table 3 contains selected Property values for business objects. Each business object can contain its own set of properties (i.e. "Property1" can be configured for first object but not configured for the second one).

My task is to find business objects which are exactly same as given object (ones which have exactly same set of properties with exactly same values). Performance is critical.

Any suggestions?

[ADDED] For example there is an entry in Entities table with EntityId = 1. In PropertyValues table there are 3 row which are related to this entry:

 EntityId  PropertyId  PropertyValue
 1         4           Val4
 1         5           Val5
 1         6           Val6

The requirement is to find other entries in Entity table which have 3 related rows in PropertyValues table and these rows contain the same data as rows for EntityId = 1 (besides of EntityId column)

[ADDED] Please, see my new question: Best approach to store data which attributes can vary

[BOUNTY1] Thanks for all. The answers were very helpful. My task is complicated a little bit (but this complication can be useful in performance purposes). Please, see the details below:

  • The new table named EntityTypes is added
  • EntityTypeId column has been added into Entities and Properties tables
  • Now, there are several types of entities. Each entity has it's own set of properties.

    Is it possible to increase performance using this information?

[BOUNTY2] There is the second complication:

  • IsDeleted column is added to Property table
  • PropertyValues table can have values for Properties which already deleted from database. Entities which have such properties are considered invalid.
  • Some entities don't have values for each property of EntityType set. These entities also are considered as invalid.

The question is: How do I can write a script which will select all Entities and additional column IsValid for them.

回答1:

;with cteSource as
(
  select PropertyId,
         PropertyValue
  from PropertyValues
  where EntityId = @EntityID
)
select PV.EntityId
from PropertyValues as PV
  inner join cteSource as S  
    on PV.PropertyId = S.PropertyId and
       PV.PropertyValue = S.PropertyValue and
       PV.EntityId <> @EntityID
group by PV.EntityId
having count(*) = (select count(*)
                   from cteSource) and
       count(*) = (select count(*)
                   from PropertyValues as PV1
                   where PV1.EntityId = PV.EntityId)

For your addition you can add this where clause:

where -- exlude entities with deleted properties
      PV.EntityID not in (select PV2.EntityID
                          from Properties as P
                            inner join PropertyValues as PV2
                              on P.PropertyID = PV2.PropertyID
                          where P.IsDeleted = 1)
      -- exclude entities with missing EntityType                     
  and PV.EntityID not in (select E.EntityID
                          from Entities as E
                          where E.EntityType is null) 

Edit:

If you want to test the query against some sample data you can do so here: http://data.stackexchange.com/stackoverflow/q/110243/matching-properties



回答2:

My task is to find business objects which are exactly same as given object (ones which have exactly same set of properties with exactly same values). Performance is critical.

Approach might vary depending on the average number of properties objects will typically have, a few versus dozens.

Assuming that objects have a varying number of properties:

I would start with a composite non-unique index on the dyad (PropertyValues.PropertyId, PropertyValues.PropertyValue) for select-performance.

Then, given an entity ID, I would select its propertid, propertyvalue pairs into a cursor.

[EDIT: Not sure whether (entityid, propertyid) is unique in your system or if you are allowing multiple instances of the same property id for an entity, e.g. FavoriteColors:

              entityid  propertyid property value
                1            17     blue
                1            17     dark blue
                1            17     sky blue
                1            17     ultramarine

You would also need either a non-unique index on the monad (PropertyValues.entityid) or a composite index on (PropertyValues.entityid,PropertyValues.propertyid); the composite index would be unique if you wanted to prevent the same propertyid from being associated with an entity more than once.

If a property can occur multiple times, you should probably have a CanBeMultivalued flag in your Properties table. You should have a unique index on the triad (entityid, propertyid, propertyvalue) if you wanted to prevent this:

        entityid  propertyid property value
                1            17     blue
                1            17     blue

If you have this triad indexed, you would not need (entityid) index or the (entityid, propertyid) composite index in the PropertyValues table.

[/EDIT]

Then I would create a temp table to store matching entity ids.

Then I would iterate my cursor above to grab the given entity's propertyid, propertyvalue pairs, one pair at a time, and issue a select statement with each iteration:

      insert into temp
      select entityid from PropertyValues
      where propertyid = mycursor.propertyid and propertyvalue = mycursor.propertyvalue

At the end of the loop, you have a non-distinct set of entityids in your temp table for all entities that had at least one of the properties in common with the given object. But the ones you want must have all properties in common.

Since you know how many properties the given object has, you can do the following to fetch only those entities that have all of the properties in common with the given object:

       select entityid from temp
       group by entityid having count(entityid) =  {the number of properties in the given object}

ADDENDUM:

After the first property-value pair of the given object is used to select all potential matches, your temp table would not be missing any possible matches; rather it would contain entityids that were not perfect matches, which must be discarded in some manner, either by being ignored (by your group by having... clause) or by being explicitly removed from the temp table.

Also, after the first iteration of the loop, you could explore the possibility that an inner join between the temp table and the PropertyValues table might offer some performance gain:

        select entityid from propertvalues
        >> inner join temp on temp.entityid = propertyvalues.entityid <<
        where propertyid = mycursor.propertyid and propertyvalue = mycursor.propertyvalue

And you might also try removing entityids from temp after the first iteration:

        delete from temp
        where not exists
        (
         select entityid from propertyvalues
         inner join temp on temp.entityid = propertyvalues.entityid
         where propertyid = mycursor.propertyid and propertyvalue = mycursor.propertyvalue
        )

Alternatively, it would be possible to optimize this looping approach further if you stored some metadata about property-frequency. Optimally, when looking for matches for a given entity, you'd want to begin with the least frequently occuring property-value pair. You could order the given object's property-value pairs by ascending frequency, so that in your loop you'd be looking for the rarest one first. That would reduce the set of potential matches to its smallest possible size on the first iteration of the loop.

Of course, if temp were empty at any time after the given object's first property-value pair was used to look for matches, you would know that there are no matches for your given object, because you have found a property-value that no other entity possesses, and you could exit the loop and return a null set.



回答3:

One way to look at this is if I have all base ball cards you have then we don't have the same baseball card as I may have more. But if you also have all the baseball cards that I have then we have exactly the same baseball cards. This is a little more complex as we are looking by team. By team could count the match count, my count, and your count and compare those 3 counts but that is 3 joins. This solution is 2 joins and I think it would be faster than the 3 join option.

To me the bonus questions did not make sense. There as a change to a table but that table name did not match any of the tables. Need a full table description for those bonus questions.

Below is the 2 join option:

    select [m1].[IDa] as [EntityId1], [m1].[IDb] as [EntityId2]
    from
    (   select [PV1].[EntityId] as [IDa], [PV2].[EntityId] as [IDb]
        from [PropertyValue] as [PV1] 
        left outer join [PropertyValue] as [PV2] 
            on  [PV2].[EntityId] <> [PV1].[EntityId]
            and [PV2].[PropertyId] = [PV1].[PropertyId]
            and [PV2].[PropertyValue] = [PV1].[PropertyValue] 
        group by [PV1].[EntityId], [PV2].[EntityId]
        having count(*) = count([PV2].[EntityId])
    ) as [m1]
    join 
    (   select [PV1].[EntityId] as [IDa], [PV2].[EntityId] as [IDb]
        from [PropertyValue] as [PV1] 
        right outer join [PropertyValue] as [PV2] 
            on  [PV2].[EntityId] <> [PV1].[EntityId]
            and [PV2].[PropertyId] = [PV1].[PropertyId]
            and [PV2].[PropertyValue] = [PV1].[PropertyValue] 
        group by [PV1].[EntityId], [PV2].[EntityId]
        having count(*) = count([PV1].[EntityId]))
    ) as [m2]
    on [m1].[IDa] = [m2].[IDa] and [m1].[IDb] = [m2].[IDb] 

Below is the 3 join count based option:

    select [m1].[IDa] as [EntityId1], [m1].[IDb] as [EntityId2]
    from
    (   select [PV1].[EntityId] as [IDa], [PV2].[EntityId] as [IDb], COUNT(*) as [count]
        from [PropertyValue] as [PV1] 
        join [PropertyValue] as [PV2] 
            on  [PV2].[EntityId] <> [PV1].[EntityId]
            and [PV2].[PropertyId] = [PV1].[PropertyId]
            and [PV2].[PropertyValue] = [PV1].[PropertyValue] 
        group by [PV1].[EntityId], [PV2].[EntityId]
    )   as [m1]
    join 
    (   select [PV1].[EntityId] as [IDa], COUNT(*) as [count]
        from [PropertyValue] as [PV1] 
        group by [PV1].[EntityId]
        having count(*) = count([PV1].[sID]))
    )   as [m2]
    on [m1].[IDa] = [m2].[IDa] and [m1].[count] = [m2].[count]
    join 
    (   select [PV2].[EntityId] as [IDb], COUNT(*) as [count]
        from [PropertyValue] as [PV2] 
        group by [PV2].[EntityId]
    )   as [m3]
    on [m1].[IDb] = [m3].[IDb] and [m1].[count] = [m3].[count]


回答4:

My task is to find business objects which are exactly same as given object (ones which have exactly same set of properties with exactly same values).

if the "given objec"t is described as e.g. #PropertyValues, so the query would be:

create table #PropertyValues(
    [PropertyId] [int] NOT NULL,
    [PropertyValue] [int] NOT NULL
)

insert #PropertyValues
select
    3, 3 -- e.g.

declare
    @cnt int

select @cnt = count(*) from #PropertyValues

select 
    EntityId
from 
    PropertyValues pv
    left join  #PropertyValues t on t.PropertyId = pv.PropertyId and t.PropertyValue = pv.PropertyValue
group by
    EntityId
having 
    count(t.PropertyId) = @cnt
    and count(pv.PropertyId) = @cnt

drop table #PropertyValues

But if performance is so much critical, you can create special indexed field on table Entities, e.g. EntityIndex varchar(8000), which will be filled by trigger on PropertyValues table as convert(char(10), PropertyId) + convert(char(10), PropertyValue) (for all properties of entity, sorted!). So it will be possible to do very fast seek by this field.



回答5:

I think this is just a simple self-join:

select P2.EntityID,E.EntityName
from PropertyValues P1
inner join PropertyValues P2
on   P1.PropertyID    = P2.PropertyID
and  P1.PropertyValue = P2.PropertyValue
inner join Entity E
on P2.EntityID = E.EntityID
where P1.EntityId = 1
and   P2.EntityId <> 1
group by P2.EntityID, E.EntityName