I have a cities table which looks like this.
|id| Name |
|1 | Paris |
|2 | London |
|3 | New York|
I have a tags table which looks like this.
|id| tag |
|1 | Europe |
|2 | North America |
|3 | River |
and a cities_tags table:
|id| city_id | tag_id |
|1 | 1 | 1 |
|2 | 1 | 3 |
|3 | 2 | 1 |
|4 | 2 | 3 |
|5 | 3 | 2 |
|6 | 3 | 3 |
How do I calculate which are the most closely related city? For example. If I were looking at city 1 (Paris), the results should be: London (2), New York (3)
I have found the Jaccard index but I'm unsure as how best to implement this.
This query is without any fancy functions or even sub queries. It is fast. Just make sure cities.id, cities_tags.id, cities_tags.city_id and cities_tags.tag_id have an index.
The queries returns a result containing: city1, city2 and the count of how many tags city1 and city2 have in common.
Change
!=
into>
to avoid each city to be returned twice. Meaning a city will then no longer appears once in the first column as well as once in the second column.Change the two
left join
intoinner join
if you don't want to see the city combinations that have no tag matches.This query is statically referring to
city_id=1
, so you'll have to make that a variable in both thewhere tag_id in
clause, and thenot city_id in
clause.If I understood the Jaccard index properly, then it also returns that value ordered by the 'most closely related'. The results in our example look like this:
Edit
With a better understanding of how to implement Jaccard Index:
After reading a bit more on wikipedia about the Jaccard Index, I've come up with a better way implement a query for our example dataset. Essentially, we will be comparing our chosen city with each other city in the list independently, and using the count of common tags divided by the count of distinct total tags selected between the two cities.
The query is a bit lengthy, and I don't know how well it will scale, but it does implement a true Jaccard Index, as requested in the question. Here are the results with the new query:
Edited again to add comments to the query, and take into account when the current city's tags are a subset of the chosen city's tags
You question about How do I calculate which are the most closely related city? For example. If I were looking at city 1 (Paris), the results should be: London (2), New York (3) and based on your provided data set there is only one thing to relate that is the common tags between the cities so the cities which shares the common tags would be the closest one below is the subquery which finds the cities (other than which is provided to find its closest cities) that shares the common tags
Working
I assume you will input one of the city id or name to find their closest one in my case "Paris" has the id one
It will find all the tags id which paris has then
It will fetch all the cities except paris that has the some same tags that paris also has
Here is your Fiddle
While reading about the Jaccard similarity/index found some stuff to understand about the what actualy the terms is lets take this example we have two sets A & B
Now move towards your scenario
Here is the query so far which calcluates the perfect jaccard index you can see the below fiddle example
In above query i have the i have derived the result set to two subselects in order get my custom calculated aliases
You can add the filter in above query not to calculate the similarity with itself
So the result shows Paris is closely related to London and then related to New York
Jaccard Similarity Fiddle
Too late, but I think that none of answers are fully correct. I got the best part of each one and put all together to make my own answer:
(q.sets + q.parisset) AS
andunion
(q.sets - q.parisset) AS
is very wrong.intersect
My answer:
cities
table like this.cities_tag
table like this.With this sample data, Florence have a full matches with Paris, New York matches one tag, São Paulo have no tags matches and London matches two tags and have another one. I think the Jaccard Index of this sample is:
My query is like this:
SQL Fidde
Could this be a push in the right direction?
What I tried was this:
// Find the citynames:
Get city.names (SUBQUERY) as matchCount FROM cities WHERE matchCount >0
// the subquery:
select the amount of tags cities have which (SUBSUBQUERY) also has
// the subsubquery
select the id of the tags the original name has