I have a large amount of data that I need to sort in a specific way based on a search query, but I'm not sure of the best approach to take.
The data I'm trying to sort is a list of courses, grouped by schools. Each course is taught by one school. Each school may belong to any number of "partnerships", which represents a relationship between a number of schools. A user can search for any number of courses by the name of the course.
I need to sort the data as follows:
Courses are grouped by school, with 10 schools appearing per page.
Schools that can provide every course that the user has searched for should appear first in the list.
After these results, schools that belong to a partnership that can accommodate all of the courses the user searched for should appear next to each other.
Here is an example:
- A teaches History, French and English courses.
- B teaches French and Mathematics.
- C teaches History.
- B and C are in a partnership together.
D teaches History.
The user searches for "History" and "French".
A should appear first in the results, with its History and French courses, as it can provide both of the courses the user is looking for.
B, followed by C appears next, with the relevant course it teaches listed after it, as the partnership can provide both of the user's required courses.
D Appears next as it only provides 1 relevant course.
The data is stored in an Microsoft SQL Server database across a few tables. Here is a simplified schema:
Course:
- int id
- varchar name
- int schoolId
School:
- int id
- varchar name
Partnership:
- int id
- varchar partnershipName
SchoolPartnership:
- int id
- int schoolId
- int partnershipId
There are over 100000 courses and around 300 schools. I don't know of a way to sort the courses as specified in SQL, which I think is my biggest problem. I only need to display 10 results per page, but as I can't do the sorting in the SQL query, I have to extract the entire result set and sort it manually in PHP before I can cut the result set down to 10 results.
I'm currently extracting the data I need in a single query with multiple joins using Doctrine 2, hydrating the results as an array. Then the plan is to manipulate this big array of records in PHP to get it into the correct order. Due to the size of this array, I'm worried that this sorting process is going to be very slow, so I'm looking for advice on how to make this quicker, either by:
- Handling the sorting in the SQL query.
- Suggesting how an algorithm such as the one described could be implemented in a search engine such as Solr (I have a little experience of the basics of this, but not performing complex sorting).
- Suggestions on how best to perform the sorting in PHP, if the other two options are not viable.
EDIT:
I've made some good progress on this, thanks (particularly @Neil). I've opened a separate question up ( Groupwise MAX() on a subquery ), which contains some of my progress so far.
This is a valid email address.
To find schools by the number of matching courses is simple:
If this was all you needed, you could
ORDER BY schoolCount DESC
to get them in the order you want.To find partnerships with matching courses, you first need to find the partnerships that have the course at at least one school:
Note that
DISTINCT
is needed because we don't care how many schools in the partnership have that course. If you don't haveDISTINCT
then you could use a subselect instead:You can then use the first and last query above as subselects in a join with SchoolPartnership to order the schools in descending order of partnershipMatches and schoolMatches. (Note that I assume that all schools are in a partnership of at least one school.) I think the final query will look like this:
We had similar problem with site's pages. We created special denormilized search table with all params to perform search with no sub queries or joins. All data was duplicated, so when something changes we update all denormalizar data. We used background tasks for syncronizing data, so search results could be not actual for some small time.
May be it seems complicated, but it only way if your data and request will grow up.