Custom sorting algorithm for a large amount of dat

2019-08-10 07:35发布

问题:

I have a large amount of data that I need to sort in a specific way based on a search query, but I'm not sure of the best approach to take.

The data I'm trying to sort is a list of courses, grouped by schools. Each course is taught by one school. Each school may belong to any number of "partnerships", which represents a relationship between a number of schools. A user can search for any number of courses by the name of the course.

I need to sort the data as follows:

  • Courses are grouped by school, with 10 schools appearing per page.

  • Schools that can provide every course that the user has searched for should appear first in the list.

  • After these results, schools that belong to a partnership that can accommodate all of the courses the user searched for should appear next to each other.

Here is an example:

  • A teaches History, French and English courses.
  • B teaches French and Mathematics.
  • C teaches History.
  • B and C are in a partnership together.
  • D teaches History.

  • The user searches for "History" and "French".

  • A should appear first in the results, with its History and French courses, as it can provide both of the courses the user is looking for.

  • B, followed by C appears next, with the relevant course it teaches listed after it, as the partnership can provide both of the user's required courses.

  • D Appears next as it only provides 1 relevant course.

The data is stored in an Microsoft SQL Server database across a few tables. Here is a simplified schema:

Course:

  • int id
  • varchar name
  • int schoolId

School:

  • int id
  • varchar name

Partnership:

  • int id
  • varchar partnershipName

SchoolPartnership:

  • int id
  • int schoolId
  • int partnershipId

There are over 100000 courses and around 300 schools. I don't know of a way to sort the courses as specified in SQL, which I think is my biggest problem. I only need to display 10 results per page, but as I can't do the sorting in the SQL query, I have to extract the entire result set and sort it manually in PHP before I can cut the result set down to 10 results.

I'm currently extracting the data I need in a single query with multiple joins using Doctrine 2, hydrating the results as an array. Then the plan is to manipulate this big array of records in PHP to get it into the correct order. Due to the size of this array, I'm worried that this sorting process is going to be very slow, so I'm looking for advice on how to make this quicker, either by:

  • Handling the sorting in the SQL query.
  • Suggesting how an algorithm such as the one described could be implemented in a search engine such as Solr (I have a little experience of the basics of this, but not performing complex sorting).
  • Suggestions on how best to perform the sorting in PHP, if the other two options are not viable.

EDIT:

I've made some good progress on this, thanks (particularly @Neil). I've opened a separate question up ( Groupwise MAX() on a subquery ), which contains some of my progress so far.

回答1:

To find schools by the number of matching courses is simple:

SELECT schoolId, COUNT(*) AS schoolCount
  FROM Courses
  WHERE name IN ('History', 'French')
  GROUP BY schoolId

If this was all you needed, you could ORDER BY schoolCount DESC to get them in the order you want.

To find partnerships with matching courses, you first need to find the partnerships that have the course at at least one school:

SELECT partnershipId, COUNT(DISTINCT name) AS partnershipCount
  FROM SchoolPartnership
  INNER JOIN Courses ON Course.schoolId = SchoolPartnership.schoolId
  WHERE name IN ('History', 'French')
  GROUP BY partnershipId

Note that DISTINCT is needed because we don't care how many schools in the partnership have that course. If you don't have DISTINCT then you could use a subselect instead:

SELECT partnershipId, COUNT(*) AS partnershipCount
  FROM (
    SELECT DISTINCT partnershipId, name
      FROM SchoolPartnership
      INNER JOIN Courses ON Course.schoolId = SchoolPartnership.schoolId
      WHERE name IN ('History', 'French'))
  GROUP BY partnershipId

You can then use the first and last query above as subselects in a join with SchoolPartnership to order the schools in descending order of partnershipMatches and schoolMatches. (Note that I assume that all schools are in a partnership of at least one school.) I think the final query will look like this:

SELECT SchoolMatches.schoolID
  FROM (
    SELECT schoolId, COUNT(*) AS schoolCount
      FROM Courses
      WHERE name IN ('History', 'French')
      GROUP BY schoolId
  ) SchoolMatches
  JOIN SchoolPartnership ON SchoolMatches.schoolID = SchoolPartnership.schoolID
  JOIN (
    SELECT partnershipId, COUNT(DISTINCT name) AS partnershipCount
      FROM SchoolPartnership
      INNER JOIN Courses ON Course.schoolId = SchoolPartnership.schoolId
      WHERE name IN ('History', 'French')
      GROUP BY partnershipId
   ) PartnershipMatches ON SchoolPartnership.schoolId = PartnershipMatches.schoolId
   ORDER BY PartnershipMatches.partnershipCount DESC, SchoolMatches.SchoolCount DESC


回答2:

We had similar problem with site's pages. We created special denormilized search table with all params to perform search with no sub queries or joins. All data was duplicated, so when something changes we update all denormalizar data. We used background tasks for syncronizing data, so search results could be not actual for some small time.

May be it seems complicated, but it only way if your data and request will grow up.



回答3:

filter_var('sgamgee@example.com', FILTER_VALIDATE_EMAIL); // Returns "sgamgee@example.com"

This is a valid email address.