Scalable story promotion

I had some thoughts on scalable story promotion for the open source Pligg system, and I thought I would share them here.

I’ve been thinking about the best way to handle promoting a story/article from a queued status onto the main page and I’ve had a few thoughts I wanted to share with everyone.

The current scheme for promotion is very simple: a story is promoted once its number of votes passes a threshold defined in the config file (and the story is fresher than X days). While this works for very low-volume sites, it doesn’t scale well. On a site with a large number of users, and one would assume a proportionately large number of votes, a fixed threshold breaks down.

Our ideal system would be able to quickly determine that a story is of high value and promote it to the homepage based on the frequency of votes as compared to a large sample population. Using some basic statistics we can determine if a story is an “above average performer” and promote it quickly.

In order to accomplish this we need to take into consideration several variables, including:

1) Number of stories submitted in a given time period
2) The average number of votes over a given time period for all “active” stories in the queue
3) The standard deviation of votes over a given time period for all “active” stories in the queue
4) The target number of stories to be promoted to the home page in a given time period

For the purposes of explaining my ideas I am going to define the time period as one day. We need to calculate and store the answers to some of these questions on a regular basis. Ideally we create a new DB table that stores this information for easy lookup. This way we can also track the information statistically and show trends. I plan on implementing a cron job that runs daily, calculates the required values, and stores them as follows.

1) Calculating the number of stories submitted is trivial. The SQL query I am using is:

select count(*) from links where link_status = 'queued' and date(link_date) = '2006-04-16';

2&3) Calculating the average and standard deviation can be done with the following query:

select avg(link_votes) as average, stddev(link_votes) as stdev from links where link_status = 'queued' and date(link_date) = '2006-04-16';

4) This can be set in the config.php file and is up to the site administrator.
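
As a rough illustration of what the daily job would compute, here is a minimal sketch in Python (used purely for illustration; it is not Pligg's PHP). The function name daily_stats and the sample vote counts are made up. It uses the population standard deviation, which is what MySQL's stddev() returns.

```python
from statistics import mean, pstdev


def daily_stats(votes):
    """Return (story_count, average, stddev) for one day's queued stories.

    `votes` holds one vote count per queued story (hypothetical input).
    """
    if not votes:
        # Assumption: an empty queue yields zeros rather than an error.
        return 0, 0.0, 0.0
    # pstdev is the population standard deviation, matching MySQL's stddev().
    return len(votes), mean(votes), pstdev(votes)


# Hypothetical vote counts for one day's queue.
sample_votes = [2, 4, 4, 4, 5, 5, 7, 9]
count, average, stddev = daily_stats(sample_votes)
print(count, average, stddev)
```

The three values are what the daily cron job would write to the new statistics table.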

These values will be used to “predict” the future for our new stories. Each story will have a new variable that stores a floating point number. This number is the number of standard deviations above or below the mean (average). We need an additional check running on a much more frequent interval (I plan on using every 5 minutes) to update the items in the database with their new “score” and promote them once they pass a specific threshold.

The calculation uses the following variables:

$score = //Z score that indicates whether a story is above or below the mean
$numofvotes = //The number of votes that a given story has received
$stddev = //Standard deviation for the given time period as calculated above
$average = //Average number of votes for the given time period as calculated above
$numberofstories = //The number of stories submitted in the given time period as calculated above
$desirednumofstories = //Setting from config.php

The cut-off value is a “score” threshold, essentially a percentile that a story must fall into in order to be promoted. The score itself is calculated as follows:

$score = ($numofvotes - $average)/$stddev;
or in SQL

select link_id, ((link_votes - $average)/$stddev) as score from links where link_status = 'queued' and date(link_date) = '2006-04-16';
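
One detail the formula leaves open is what happens when the standard deviation is zero, for example when every queued story has the same number of votes. A minimal Python sketch (illustrative only; the name z_score and the zero-spread fallback are my assumptions, not part of the plan above):

```python
def z_score(num_votes, average, stddev):
    """Number of standard deviations a story sits above or below the mean."""
    if stddev == 0:
        # Assumption: with no spread in the data, treat every story as average
        # instead of dividing by zero.
        return 0.0
    return (num_votes - average) / stddev


# Hypothetical values: the day's average is 5 votes with a stddev of 2.
print(z_score(9, 5, 2))
print(z_score(3, 5, 2))
```

A positive score means the story is outperforming the queue, a negative score means it is lagging.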

Once we have this score we need to decide if this story needs to be promoted or not. This is done by first calculating where in the rank order this story is likely to fall for the day using some basic probability statistics.

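$rank = round($numberofstories/(1+exp(-1.7*$score)),1);

Here 1/(1+exp(-1.7*$score)) is a logistic approximation to the normal cumulative distribution, so $rank estimates how many of the day’s stories this one outranks. Now we just check whether that rank falls above our threshold and promote the story accordingly.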

$desiredrank = $numberofstories - $desirednumofstories;
if ($rank >= $desiredrank) {
    // Update story and set to promoted
}
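
To sanity-check the whole decision end to end, here is a small Python sketch (again illustrative, not Pligg code). The function names and the sample figures (100 stories, 10 desired promotions, average 5, stddev 2) are invented for the example.

```python
import math


def estimated_rank(score, number_of_stories):
    """Estimate how many stories this one outranks, from its z-score.

    Uses the logistic approximation to the normal CDF from the post.
    """
    return round(number_of_stories / (1 + math.exp(-1.7 * score)), 1)


def should_promote(num_votes, average, stddev,
                   number_of_stories, desired_num_of_stories):
    """Return True if the story's estimated rank clears the cut-off."""
    # Assumption: a zero stddev is treated as an average score of 0.
    score = (num_votes - average) / stddev if stddev else 0.0
    rank = estimated_rank(score, number_of_stories)
    desired_rank = number_of_stories - desired_num_of_stories
    return rank >= desired_rank


# Hypothetical day: 100 stories, promote roughly the top 10.
for votes in (5, 7, 8):
    print(votes, should_promote(votes, 5, 2, 100, 10))
```

With these made-up numbers the cut-off rank is 90. A story needs a score of roughly 1.3 or higher to clear it, so 8 votes is enough and 7 is not.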

This method should work well assuming that traffic is fairly stable from day to day. Since we are using the previous day’s data to predict the current day’s volume, a sudden traffic spike will cause more stories to be promoted than desired. This can be mitigated on larger-volume websites by decreasing the “given period of time” from a day to something shorter. Additionally, this whole idea could be retooled to calculate these variables per category, since some sites might have much higher traffic, and therefore more votes, in one category than another.
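
The per-category variant could be sketched like this in Python (illustrative only; the function name, the (category, votes) input shape, and the sample data are assumptions of mine):

```python
from collections import defaultdict
from statistics import mean, pstdev


def stats_by_category(stories):
    """Map each category to (story_count, average, stddev).

    `stories` is an iterable of (category, votes) pairs (hypothetical shape).
    """
    grouped = defaultdict(list)
    for category, votes in stories:
        grouped[category].append(votes)
    # Population standard deviation, to match MySQL's stddev().
    return {
        category: (len(votes), mean(votes), pstdev(votes))
        for category, votes in grouped.items()
    }


# Hypothetical queue with a busy and a quiet category.
queue = [("tech", 2), ("tech", 4), ("sports", 1), ("sports", 1)]
per_category = stats_by_category(queue)
print(per_category)
```

Each story would then be scored against the average and standard deviation of its own category rather than the site-wide figures.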

I’m currently working on implementing the above for a site that I plan on launching in the near future, and I would love feedback and criticism before I do.

2 Comments

  1. Jamiie
    February 7, 2008 at 8:47 am #

    Nice read. Good approach, however, have you released any source code for this? :)
