I need to find a way to efficiently process a large amount of data in PHP/MySQL. Here's the situation:
I have a database table with, say, one million records. Based on user input from PHP, I need to rank all of those one million records according to a non-trivial calculation step, so that I can pick the top scoring items. My problem is that this scales very poorly from a memory usage standpoint, particularly at the sorting step, if I rearrange the data into columns and use array_multisort.
Alternative methods I can think of are:
- Doing the calculations in PHP and reinserting the data with the scores into a temporary table, retrieving the highest scoring items using a SELECT ... ORDER BY score ... LIMIT query
- Doing the calculations in PHP and outputting the data with the scores into a CSV file, then calling the command-line sort utility, then reading in the top X number of lines
- Doing the calculations in MySQL using a stored procedure and retrieving the top X number of items, as in option 1. My concern with this is whether the DB is well suited for the number crunching this would involve
This has got to be a fairly common problem for things like search engines. Scalability is the number one priority, but performance has to be pretty good too. Is one of these approaches best, or is there some other great option that I'm not even considering?
Doing it in a complex ORDER BY is the best out of all the options you mentioned, if it's possible. But for one million records you will still encounter problems.
Sounds like you are doing things the hard way, trying to keep all your data in one place and order it on the fly. Is there no way to do scoring beforehand, even if it's a few scores you combine at request time for a custom sort key?
How complex are the user's queries? If you are trying to do fulltext search you should get software specialized for that task.
Assuming your dataset is too large to store in memory.... If you only need the top n items, you can keep only the top results in memory as you page through the 1 million rows. This would also work with the temporary table idea of yours, writing the top records from each batch.
Another option would be to write a user defined function:
http://dev.mysql.com/doc/refman/5.1/en/adding-functions.html
Why not do part or all of your calculation when you store the row. That way you only have to do it once and you have lots of time to do it in.