Consider searching a table that contains apartment rental information: a client using the interface selects a number of criteria that are represented as bit fields in the DB, for instance:
- AllowsPets
- HasParking
- HasDeck
- ModernKitchen
etc.
We are facing a situation where each new client of our software has additional fields they want to allow their end users to search on. The number of bit fields could reach into the hundreds.
I have three approaches that I'm considering and hoping for input and/or a different approach.
Current approach: add more bit fields; SQL queries are built dynamically and executed using EXEC, e.g.:

```sql
SET @SQL = @SQL + 'l.[NumUnits],'
...
EXEC(@SQL)
```
1. Continue to add more bit fields (a table with 300 columns?).
2. Represent the data as a series of bits in one field. I'm unclear on whether this approach will work; consider the four sample bit fields I offered above. The field could look like this: 1011, which would indicate false for HasParking but true for all the others. What I'm unclear on is how you would structure a query where you don't care whether a given bit is true or false, for instance 1?11, where the person searching needs bits 1, 3 and 4 to be true but doesn't care whether HasParking is true or false.
3. Move to an attribute-based approach where you have a table 'AttributeTypeID' and a table 'PropertyAttributes', which joins the PropertyID to the AttributeTypeID; new bit fields are simply a row in the AttributeTypeID table.
4. Some other approach? Is this a well-known SQL design pattern?
Thanks for any help
KM - EDIT PER COMMENTS:
The attribute table has a few other columns in it and is called ListingAttributes:

```sql
CREATE TABLE [dbo].[ListingAttributes](
    [ListingID]      [bigint]      NOT NULL,
    [AttributeID]    [int]         IDENTITY(1,1) NOT NULL,
    [AttributeType]  [smallint]    NOT NULL,
    [BoardID]        [int]         NOT NULL,
    [ListingMLS]     [varchar](30) NOT NULL,
    [PropertyTypeID] [char](3)     NOT NULL,
    [StatusID]       [varchar](2)  NOT NULL,
    PRIMARY KEY CLUSTERED ([AttributeID] ASC)
        WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
              ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
) ON [PRIMARY]
```

```sql
;WITH GetMatchingAttributes AS
(
    SELECT ListingID, COUNT(AttributeID) AS CountOfMatches
    FROM ListingAttributes
    WHERE BoardID = 1
      AND StatusID IN ('A') AND
      --PropertyTypeID in (select * from @PropertyType) --AND
      AttributeType IN (2,3,6)
    GROUP BY ListingID
    HAVING COUNT(AttributeID) = (3)
)
SELECT COUNT(l.ListingID)
FROM Listing l
INNER JOIN GetMatchingAttributes m ON l.ListingID = m.ListingID
-- where
-- StatusID IN (select * from @Status)
--AND
--PropertyTypeID in (select * from @PropertyType)
```

Execution plan (SHOWPLAN output, abridged):

```text
SELECT
 |--Compute Scalar(DEFINE:([Expr1006]=CONVERT_IMPLICIT(int,[Expr1012],0)))
      |--Stream Aggregate(DEFINE:([Expr1012]=Count(*)))
           |--Filter(WHERE:([Expr1005]=(3)))
                |--Compute Scalar(DEFINE:([Expr1005]=CONVERT_IMPLICIT(int,[Expr1011],0)))
                     |--Stream Aggregate(GROUP BY:([dbo].[ListingAttributes].[ListingID]) DEFINE:([Expr1011]=Count(*)))
                          |--Index Seek(OBJECT:([dbo].[ListingAttributes].[_dta_index_ListingAttributes_BoardID_ListingID__AttributeType_PropertyTypeID_StatusID_6_7]),
                                        SEEK:([dbo].[ListingAttributes].[BoardID]=(1)),
                                        WHERE:([dbo].[ListingAttributes].[StatusID]='A'
                                               AND ([dbo].[ListingAttributes].[AttributeType]=(2)
                                                    OR [dbo].[ListingAttributes].[AttributeType]=(3)
                                                    OR [dbo].[ListingAttributes].[AttributeType]=(6))) ORDERED FORWARD)

(7 row(s) affected)
```
I suggest you go with the attribute-based approach, known as the Entity-Attribute-Value model. It's probably the only approach that will scale the way you need.
You could also have two searches, basic and advanced. You keep the attributes for the basic search in one table, and all the advanced attributes in another table. This way at least the basic search will stay fast as the number of attributes grows over time.
I've walked down this path a few times trying to store health status markers!
When I first started (in 2000?) I tried a character-position approach (your #2) and found that it quickly became pretty unwieldy as I wrestled with the same questions over and over: "which position held 'Allows Pets' again?" or, worse yet, "how long is this string now? / which position am I on?" Can you work around this problem by developing objects to manage things for you? Well, yes, to an extent. But I really didn't appreciate how much extra work it cost compared to having the field identities managed for me by the database.
The second time around, I used an attribute/value pair approach similar to your solution #3. This basically worked and, for specialty needs, I still generate attribute/value pairs using a PIVOT. Also, my background is in AI and we used attribute/value pairs all the time in mechanical theorem proving so this was very natural for me.
However, there is a huge problem with this approach: pulling any one fact out ("Show me the apartments that allow pets") is easy but pulling all of the records meeting multiple constraints quickly gets very, very ugly (see my example below).
**So...** I ended up adding fields to a table. I understand the theoretical reasons that Jon, 'Unknown' and 'New In Town' give for preferring other approaches, and at one point I'd have agreed with them. But experience is a pretty harsh teacher...
A Couple More Things
First, I disagree that adding more bit fields is a nightmare of maintenance - at least compared with a character-bit approach (your #2). That is, having a distinct field for each attribute ensures that there is no 'management' necessary to figure out which slot belongs to which attribute.
Second, having 300 fields isn't really the problem - any decent database can do that without problem.
Third, your real issue and the source of pain is really the matter of dynamically generating your queries. If you are like me, this question is really all about "Do I really have to have this massive, grody and inelegant chain of "IF" statements to construct a query?"
The answer, unfortunately, is Yes. All three of the approaches you suggest will still boil down to a chain of IF statements.
In a database bit-field approach, you'll end up with a series of IF statements where each of your columns has to be appended, like so:
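For instance (a sketch, not the exact code from my system; the @AllowsPets-style parameters stand in for whatever the search page passes, with NULL meaning "no preference"):

```sql
-- Stand-ins for stored-procedure parameters; NULL = user has no preference.
DECLARE @AllowsPets bit = 1, @HasParking bit = NULL, @HasDeck bit = 1, @ModernKitchen bit = NULL;

DECLARE @SQL nvarchar(max) = 'SELECT l.ListingID FROM Listing l WHERE 1 = 1';

-- one IF per searchable bit column
IF @AllowsPets IS NOT NULL
    SET @SQL = @SQL + ' AND l.AllowsPets = '    + CAST(@AllowsPets    AS char(1));
IF @HasParking IS NOT NULL
    SET @SQL = @SQL + ' AND l.HasParking = '    + CAST(@HasParking    AS char(1));
IF @HasDeck IS NOT NULL
    SET @SQL = @SQL + ' AND l.HasDeck = '       + CAST(@HasDeck       AS char(1));
IF @ModernKitchen IS NOT NULL
    SET @SQL = @SQL + ' AND l.ModernKitchen = ' + CAST(@ModernKitchen AS char(1));
-- ...and so on, once for every one of the (potentially hundreds of) bit fields

EXEC (@SQL);
```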
In a character-position approach, you'll do the same thing, but your appends will add "0", "1" or "_" to your SQL. You'll also, of course, run into the maintenance issue I discussed above of remembering which position is which (enums help but don't completely solve the problem).
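The equivalent sketch just builds a LIKE pattern instead (assuming the flags live in a single PropertyFlags column, with position 1 = AllowsPets, position 2 = HasParking, and so on):

```sql
DECLARE @AllowsPets bit = 1, @HasParking bit = NULL, @HasDeck bit = 1, @ModernKitchen bit = NULL;
DECLARE @Pattern varchar(300) = '', @SQL nvarchar(max);

-- the order of these appends IS the column mapping; get it wrong and you search the wrong flag
SET @Pattern = @Pattern + COALESCE(CAST(@AllowsPets    AS char(1)), '_');
SET @Pattern = @Pattern + COALESCE(CAST(@HasParking    AS char(1)), '_');
SET @Pattern = @Pattern + COALESCE(CAST(@HasDeck       AS char(1)), '_');
SET @Pattern = @Pattern + COALESCE(CAST(@ModernKitchen AS char(1)), '_');
-- ...one append per position, in exactly the right order

SET @SQL = 'SELECT l.ListingID FROM Listing l WHERE l.PropertyFlags LIKE ''' + @Pattern + '%''';
EXEC (@SQL);
```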
As mentioned above, the attribute-value approach is actually the worst. You'll have to either create a nasty chain of sub-queries (which surely will cause a stack overflow of some sort with 300 clauses) or you need an IF-THEN chain like this:
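For illustration (made-up AttributeType IDs, run against the ListingAttributes table from the question), the sub-query chain nests one level per required attribute:

```sql
SELECT ListingID
FROM ListingAttributes
WHERE AttributeType = 2                              -- e.g. 'Allows pets'
  AND ListingID IN (SELECT ListingID
                    FROM ListingAttributes
                    WHERE AttributeType = 3          -- e.g. 'Has deck'
                      AND ListingID IN (SELECT ListingID
                                        FROM ListingAttributes
                                        WHERE AttributeType = 6));  -- ...and so on, attribute by attribute
-- The IF-THEN alternative is no prettier: one IF block per possible attribute,
-- each inserting into or pruning a #Matches temp table.
```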
Now, you may be able to optimize this a bit so you run fewer queries (a PIVOT, sets of subqueries or using the UNION operator) but the fact is that this gets VERY expensive compared to the single query that you can use (but have to build) using the other approaches.
Thus, this is a painful kind of problem no matter what approach you take - there really is no magic that helps you to avoid it. But, having been there before, I would absolutely recommend approach #1.
Update: If you are really focused on pulling straight criteria matches ("All Apartments That Have A, B and C") and don't need other kinds of queries (like "...Sum(AllowsPets), Sum(AllowsChildren)..." or "...(AllowsPets=1) OR (AllowsChildren=1)..."), then the more I look at KM's answer, the more I like it. It is very clever and looks likely to be acceptably fast.
something like this may work for you:
define tables:
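(A minimal sketch; the #Apartments / #AttributeTypes / #Attributes names match the edits further down, but the exact columns are illustrative.)

```sql
CREATE TABLE #Apartments
(
    ApartmentID int         NOT NULL PRIMARY KEY,
    Name        varchar(50) NOT NULL,
    Status      char(1)     NOT NULL      -- example of a "super attribute" kept on the main table
);

CREATE TABLE #AttributeTypes
(
    AttributeTypeID int         NOT NULL PRIMARY KEY,
    Description     varchar(50) NOT NULL  -- 'Pets allowed', 'Has parking', ...
);

CREATE TABLE #Attributes
(
    ApartmentID     int NOT NULL,
    AttributeTypeID int NOT NULL,
    PRIMARY KEY (ApartmentID, AttributeTypeID)
);
```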
insert sample data:
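(Again illustrative; it includes the "Pets allowed" / "No pets" pair discussed in the edits below.)

```sql
INSERT INTO #Apartments VALUES (1, 'Maple Court 1A', 'A');
INSERT INTO #Apartments VALUES (2, 'Maple Court 2B', 'A');
INSERT INTO #Apartments VALUES (3, 'Oak Street 3C',  'R');   -- not active, should never match

INSERT INTO #AttributeTypes VALUES (1, 'Pets allowed');
INSERT INTO #AttributeTypes VALUES (2, 'Has parking');
INSERT INTO #AttributeTypes VALUES (3, 'Has deck');
INSERT INTO #AttributeTypes VALUES (4, 'Modern kitchen');
INSERT INTO #AttributeTypes VALUES (5, 'No pets');

INSERT INTO #Attributes VALUES (1, 1);   -- apartment 1: pets allowed
INSERT INTO #Attributes VALUES (1, 2);   -- apartment 1: has parking
INSERT INTO #Attributes VALUES (1, 3);   -- apartment 1: has deck
INSERT INTO #Attributes VALUES (2, 2);   -- apartment 2: has parking
INSERT INTO #Attributes VALUES (2, 5);   -- apartment 2: no pets
INSERT INTO #Attributes VALUES (3, 1);   -- apartment 3: pets allowed
```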
sample search query:
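(A sketch of the shape of the query; the attribute IDs here come from the CSV string '1,2,3', i.e. pets allowed, has parking and has deck.)

```sql
;WITH GetMatchingAttributes AS
(
    SELECT ApartmentID, COUNT(*) AS CountOfMatches
    FROM #Attributes
    WHERE AttributeTypeID IN (1, 2, 3)   -- the IDs from the CSV string
    GROUP BY ApartmentID
    HAVING COUNT(*) = 3                  -- must match every requested attribute
)
SELECT a.ApartmentID, a.Name
FROM #Apartments a
INNER JOIN GetMatchingAttributes m ON a.ApartmentID = m.ApartmentID
WHERE a.Status = 'A';
```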
OUTPUT:
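With the sketch data above, that returns:

```text
ApartmentID Name
----------- --------------
1           Maple Court 1A

(1 row(s) affected)
```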
In the search query above, I just included a CSV string of attribute IDs to search for. In reality, you could create a Search stored procedure where you pass in a CSV parameter containing the IDs to search on. You can look at this answer to learn about loop-free splitting of that CSV string into a table which you can join to. This would mean not needing any dynamic SQL.
EDIT based on the many comments:
If you add a few columns to the #AttributeTypes table, you can dynamically build the search page. Here are a few suggestions:
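For example (column names here are just illustrative), something along these lines is enough to drive the page layout:

```sql
ALTER TABLE #AttributeTypes ADD
    DisplayCaption varchar(50) NULL,   -- text shown next to the checkbox/radio button
    DisplayOrder   int         NULL,   -- sort order on the search page
    GroupID        int         NULL;   -- group id for mutually exclusive options (see #AttributesGroups below)
```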
You could make all the fields checkboxes, or add another table called #AttributesGroups and group some of them together with radio buttons. For example, since "Pets Allowed" and "No Pets" are exclusive, add a "Pets" row to the #AttributesGroups table. The application would group the attributes in the interface. Attributes in groups work the same as regular ungrouped attributes: just collect the selected IDs and pass them in to the search procedure. However, for each group the application needs to include a "no preference" radio button and default it on. That option has no attribute ID and is not passed in, since you don't want to consider the attribute at all.
In my example, I show a "super attribute" that lives in the #Apartments table: "Status". You should only consider major attributes for this table. If you start using these, you may want to alter the CTE to be FROM #Apartments, filtering on those fields, and then join to #Attributes. However, you will run into the issue of dynamic search conditions, so read this article by Erland Sommarskog.
EDIT on latest comments:
Here is code for handling a list of excluded attributes:
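(A sketch against the tables above; in a real procedure the include and exclude lists would each come from a split CSV parameter.)

```sql
;WITH GetMatchingAttributes AS
(
    SELECT ApartmentID, COUNT(*) AS CountOfMatches
    FROM #Attributes
    WHERE AttributeTypeID IN (2)          -- attributes that MUST be present, e.g. 'Has parking'
    GROUP BY ApartmentID
    HAVING COUNT(*) = 1                   -- = number of required attributes
)
SELECT a.ApartmentID, a.Name
FROM #Apartments a
INNER JOIN GetMatchingAttributes m ON a.ApartmentID = m.ApartmentID
WHERE a.Status = 'A'
  AND NOT EXISTS                          -- none of the excluded attributes may be present
      (SELECT 1
       FROM #Attributes x
       WHERE x.ApartmentID = a.ApartmentID
         AND x.AttributeTypeID IN (1));   -- e.g. exclude 'Pets allowed'
```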
I don't think I would go this way though. I'd go with the approach I outlined in my previous EDIT above. When include/exclude of an attribute is necessary, I'd just add an attribute for each: "Pets allowed" and "No Pets".
I updated the sample data from the original post to show this.
Run the original query with:
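Using the sketch tables above, that just means passing the ID for "Pets allowed" or "No Pets" along with everything else, for example "Has parking" plus "No pets" (2 and 5):

```sql
;WITH GetMatchingAttributes AS
(
    SELECT ApartmentID, COUNT(*) AS CountOfMatches
    FROM #Attributes
    WHERE AttributeTypeID IN (2, 5)   -- 'Has parking' + 'No pets'
    GROUP BY ApartmentID
    HAVING COUNT(*) = 2
)
SELECT a.ApartmentID, a.Name
FROM #Apartments a
INNER JOIN GetMatchingAttributes m ON a.ApartmentID = m.ApartmentID
WHERE a.Status = 'A';
```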
I think this is the better approach. When combined with the grouping idea and dynamically built search page described in the last edit, I think this would be better and would run faster.
I've never tested this, but what if you were to create a varchar(256) field that stored all of your flags as one long string of 0s and 1s?
For example, the four sample fields above (AllowsPets = 1, HasParking = 0, HasDeck = 1, ModernKitchen = 1)
would be: '1011'
and if you were looking for something that AllowsPets and HasDeck, then the search query would look something like this:
WHERE PropertyFlags LIKE '1_1_' (the underscore represents a single wildcard character in the like clause)
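For instance (a minimal sketch, assuming the flags live in a PropertyFlags column on the listing table, in the order AllowsPets, HasParking, HasDeck, ModernKitchen, ...):

```sql
SELECT l.ListingID
FROM Listing l
WHERE l.PropertyFlags LIKE '1_1_%';   -- pets and deck required; trailing % covers any flags added later
```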
This would solve your issue of adding additional search columns in the future, but I'm not sure how it would do performance-wise.
Has anyone out there tried anything similar to this?
Create a table that stores attributes or search columns based on the apartment. Definitely do not keep adding more bit field columns: nightmare maintenance and nightmare coding. And please, definitely don't dynamically generate WHERE statements and run them with EXEC.