It's accepted that searching a table on an int column is faster than on a string column (say varchar).
However, if I have a Shirt table with a Color column, would it be more performant to create a Color table with the primary key on that table being the foreign key on the Shirt table? Would the join negate the performance advantage of having the value in the Color column on Shirt being an int instead of a string value such as "Green" when searching for green Shirts?
Compared to the other operations being performed, it is unlikely that there is much performance difference between the two approaches. If you have only a handful of colors (up to a few hundred), the color table fits on a single page in most databases. An index on the color would make look up quite fast and not incur any I/O activity (after the first run to load the page).
A string comparison depends on the database, but it does involve a function and reading the data from the page. So, it is not free. Different databases, of course, might have different performance characteristics for a string function.
Where it should be stored should be a function of your application. Say you have an application where the color is going to be presented to the user. You might, one day, want to show the name of the color in Spanish, Swahili, or Chinese. If so, having a separate table makes such internationalization much easier. More prosaicly, you might want to prevent "Grene" from being entered, if so, having such a table makes a selection list easier.
On the other hand, if performance is your only concern, it doesn't make a different. In other cases, it is actually possible for a lookup table to be faster than a denormalized table. This occurs when the strings are long, increasing the length of every record in a larger table. Larger tables mean more pages, which take longer to load into memory.
If I understand correctly, you are asking which of these two queries would be faster:
SELECT * FROM shirt where color = 'Green'
vs
SELECT shirt.* FROM shirt s INNER JOIN colors c
ON s.colorid = c.colorid
WHERE c.color = 'Green'
It depends a little bit on the database (well ... maybe a lot depending on if it optimizes correctly, which most if not all should), but the lookup in the color table should be negligible and then the remaining execution could use the integer lookup value and should be faster. The bulk of the processing ultimately would be equivalent to SELECT * from shirt WHERE colorid=N
. However, I suspect that you would not notice a difference in speed unless the table was quite large. The decision should probably be based on which design makes the most sense (probably the normalized one).
Beyond performance, creating a separate Color table makes your design better normalized. So, some day in the future, when someone decides that "Dark Blue" should now be called "Navy Blue", you update 1 row in your Color table vs. updating many rows in your Shirt table.
The DBMS has the opportunity to optimize indicies where there is a limited number of values. How to tell sQL to do this though, I don't know. It might figure it out.
start a data warehouse if reporting performance is a serious issue..
As Joe points out, you want the database as normalized as possible. If you have a separate reporting function, which has the possibility of causing performance problems, you should run a periodic transform (or put rules in place to do build in realtime) a 2nd read-only schema. The first is OLTP and the 2nd is OLAP ('data warehouse'); these are important concepts to have in place if you are going to get serious about your data.
If it's important enought to know, test it.
If nobody gives you an answer, the best way to do it is to test on your own.
(1) make 2 databases
(2) each with a test of your 2 tables
(3) On database just joins on the string 'color', and uses that for an FK; the other joins by int ('colorID')
Fill each with 2 million dummy rows. Run a multiple queries on each, timing 1st run and average runs.
Use an instance on your dev machine to take the network out of the picture.
You should also start and stop the instance before each type of test; stuff will stay in memory intentionally so SQL can deliver it faster, but likely, this will skew your test results from real-world operation - where it might not be in memory or cached anymore.
It depends on the query optimizer really. Your color table will be very small, so probably based on the database statistics and the query plans it would probably loaded completely in memory, so you not only end up negating the performance cost of the join it may actually be faster. This obviously depends on the dbms you're using, but several dbms can take hints as to treat a table in a special way.
Another +1 for the Color table is that if you need to change the color name, you only need 1 update as opposed to changing the string value for each occurrence.