-->

How are Reddit and Hacker News ranking algorithms

2020-05-11 10:57发布

问题:

I've been looking at ranking algorithms recently, specifically those used by Reddit and Hacker News. The algorithms themselves are simple enough, but I don't quite understand how they are used.

One thing that I could do is implement the algorithm straight in SQL, so that every time a user goes to a page displaying ranked posts, something like this would run:

SELECT thing1, thing2 FROM table
ORDER BY ranking_algorithm DESC
LIMIT page*20, 20

There are several similar questions on SO, but the only answer given is to put the ranking algorithm inside the SQL query. Then the thread dies...

Putting the algorithm in the SQL query fine on a smaller scale, but what if the website has a large number of users and a very large number of posts? That means that the every time any user opens a page that displays ranked posts, that query will be run. That can't be very efficient.

Now, Reddit and Hacker News don't run their ranking algorithms in as SQL queries, but in python and ark respectively. So how and when exactly are they used?

One possible solution is to take all the relevant info from every post and store it in some data structure on the web server. Then rank and sort this data structure.

Every time someone opens a page that shows the ranked posts, you just go to the data structure, retrieve the correct range of posts, and display them.

Then every half hour or so, you retrieve the most up to date information from the server, rank it, sort it, and update the data structure.

Other less expensive queries, such as retrieving and displaying all the info from a specific post, or displaying the newest posts (as opposed to the best scored) could be done in SQL every time the relevant page is opened.

The advantage is that your database is being hit (for the expensive ranking query) only once every half hour. The disadvantage is that you need to have a duplicate of a large chunk of your database.

回答1:

Reddit uses Pyrex, the sort algorithm is a Python C extension to improve performance.

So, you can do the same in SQL when the record is updated, pex: when is up or down voted.

The pseudocode you must to translate to your SQL engine syntax:

function hot(ups, downs, date){
    score = ups - downs;
    order = log(max(abs(score), 1), 10);
    if (score>0){
        sign = 1;
    } else {
        if (score<0){
            sign = -1;
        } else {
            sign = 0;
        }
    }
    td = date - datetime(1970,1,1);
    seconds = td.days * 86400 + td.seconds + (float(td.microseconds) / 1000000) - 1134028003;

    return round(order + sign * seconds / 45000, 7);
}

So you must to store in the post table the ups, downs, date and the hot function result. And then you can make a sort in the hot column.

You can see the Reddit source code here: http://code.reddit.com/



回答2:

I implemented an SQL version of Reddit's ranking algorithm for a video aggregator like so:

SELECT id, title
FROM videos
ORDER BY 
    LOG10(ABS(cached_votes_total) + 1) * SIGN(cached_votes_total)   
    + (UNIX_TIMESTAMP(created_at) / 300000) DESC
LIMIT 50

cached_votes_total is updated by a trigger whenever a new vote is cast. It runs fast enough on our current site, but I am planning on adding a ranking value column and updating it with the same trigger as the cached_votes_total column. After that optimization, it should be fast enough for most any size site.

edit: More information at Reddit Hotness Algorithm in SQL