Posts Tagged ‘digg’

The death of the local newspaper?

Tuesday, November 27th, 2007

I consider myself fairly well informed. I read a number of different publications to stay up to date with current events, the latest technology, or even just a bit of celebrity gossip. I’m a busy guy with a lot going on, and I don’t have a bunch of time to just sit around reading different websites. I, like many others, rely heavily on RSS to get the most out of my online leisure time. I use the fantastic Google Reader application to aggregate the feeds that interest me into a single, easy-to-sort-through interface. I’m subscribed to several national and international news feeds like the New York Times, Washington Post, Wall Street Journal, and the BBC. I’ve got feeds for a couple of the social news websites like Digg and Reddit. I’ve got a few feeds for Google News searches on topics that interest me. Lastly, I’ve got a few blogs and other miscellaneous feeds. I can quickly scan the headlines and read an article if it is of actual interest to me. All of this keeps me fairly well informed on what is going on in the nation and in the world. That is the problem….

You see, a lot goes on that is relevant to me, and that I would be very interested in knowing, but I’m completely clueless about it. The world around me, around where I live, isn’t well represented online. I want to be able to access my local news just like I access the rest of my news. I want to be informed about what is going on without spending unnecessary time on it.

I live just outside a moderately sized city in Northern Colorado called Longmont. We have a local newspaper called The Daily Times Call that covers local current events. The newspaper represents Longmont as well as several of the smaller communities around Longmont, like the one I live in, called Firestone. According to the Wikipedia article on Longmont, as of 2005 it has about 76k residents and 26k households. If you include the surrounding communities, my best guess would be that the newspaper could potentially reach as many as 40k homes. I have no clue how many subscribers they have, but it would certainly be a small subset of that. The Times Call has always been a good newspaper. I’ve been a subscriber on and off over the years. I even delivered papers for them when I was much younger. My problem with the Times Call is that my options for actually getting the news from them are fairly limited. Today, it really breaks down to either of the following:

1) Subscribe to the dead tree version of the newspaper. There are lots of reasons why this isn’t ideal for me, and I would guess for a lot of folks like me. The print version is a huge waste of paper. It takes significantly longer to sort through the articles. Using an RSS reader I can glance over 250 stories and read the ones of interest before I could even get through the first section of the print version. The print version is largely ad supported, which just adds more heft to its size. Most important, however, is that the print version isn’t always available when I want to read the news. I frequently read the news at work, at home, or on my mobile device. It is really just a matter of whenever I can grab a free minute.

2) Read the news on their website. The downsides of this are that their page has a relatively poor user interface and is loaded down heavily with advertisements. The biggest downside is that I have to remember to go check it. Google Reader is routine for me; it’s my source of news and information. Getting the local news from the Times Call website requires the extra step of loading up a separate page and trying to make heads or tails of the articles they have on their site.

Over the years I’ve done both methods. I gave up on the dead tree version a couple years ago in preference to their website. Up until just earlier this year they didn’t even publish most of their local stories on their website. The only thing up there would be the top couple of lead articles. These problems aren’t unique to the Times Call either. I’m sure there are some exceptions out there, but when I did a casual survey of several other local newspapers throughout Colorado I found a similar experience.

If I could dream up a solution to these problems it would be content created by individual journalists, paid journalists, and amateur bloggers alike. The content would be well organized and tagged not only for category or type but also for geography. A social network, or digg/reddit like approach would be used to identify popular stories for the masses but that content wouldn’t drown out the local information that might not have as much of a mass appeal.

Until my news and information utopia exists I need to come up with a real interim solution. I’ve contacted the Times Call on several occasions asking, begging, for RSS on their website. My emails seem to have fallen on deaf ears, as I have never received a response. I assume that they are tied to an old and outdated business model and are afraid to move into the modern age. They keep tight control over the ways in which their content is viewed so they can pump the pages full of ads. I respect that; I understand that is currently their way of making the web profitable. I also understand that this isn’t the business model of the future. Content like theirs is only valuable if they have an audience to read it. Increasingly, folks are turning to other methods to become informed. Technologies like RSS are disruptive, game changing. They put users in control, not the publishers. The Times Call is the best there is for covering news about Longmont, Colorado, but the last place I would turn for news about the war in Iraq. They don’t have the resources to provide quality coverage of national and international topics of interest. Each news source has its place, and as those niches are carved out each publication will have its own following.

The Times Call, and local newspapers like it all across the country, need to do what they do best: provide great coverage about what is going on in our communities. They need to provide that coverage in ways that are accessible to everyone, from folks like my Dad, to whom I don’t think I could even explain what an RSS feed is, let alone get him to use one, to folks like me and many of my peers and friends, who want to make the most of our busy lives but still be informed about the communities we live in. If the local newspapers don’t adopt a different business model for the web, they will continue to see their subscriber base shrink. I’m happy to pay online subscription fees for access to quality content. I know nobody works for free and someone has to pay the bills. Let me pay the Times Call $10 a month for access to their RSS feed, hopefully advertisement free. If that doesn’t work, just publish the article title and a synopsis and force the user to access your website to read an article they are interested in. At least this way I know what is on your site, and if I want to read it I’ll be subjected to all of your advertisements.

Since neither of these solutions has happened so far, I’ve decided to take matters into my own hands. I created an application that harvests the article contents from the Times Call website and then redistributes them in RSS format. It took me all of a couple hours to put this together and test it. It is working great, and several of my friends and I are now using it. While this doesn’t help most people out there, if you happen to live in or around Longmont and want to access the Times Call in RSS format you can get the feed at http://vallery.net/timescall.xml.
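As a rough illustration of the redistribution half of such an application, here is a minimal Python sketch that turns a list of already-harvested articles into an RSS 2.0 document. It is not the code behind the feed above: the `build_rss` function, the article fields, and the example.com URLs are all made up for the example, and the harvesting step is left out because it depends entirely on the markup of the site being read.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, articles):
    """Build an RSS 2.0 document (as a string) from harvested articles.

    `articles` is a list of dicts with 'title', 'link' and 'summary' keys,
    a hypothetical shape chosen for this sketch.
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires these three channel elements.
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description

    for article in articles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = article["title"]
        ET.SubElement(item, "link").text = article["link"]
        ET.SubElement(item, "description").text = article["summary"]

    return ET.tostring(rss, encoding="unicode")


# Placeholder data standing in for whatever the harvester found.
feed_xml = build_rss(
    "Example local news",
    "http://example.com/",
    "Headlines harvested from a local news site",
    [
        {
            "title": "City council meets",
            "link": "http://example.com/council",
            "summary": "A short synopsis.",
        }
    ],
)
```

Publishing only the title and a synopsis in each item, as suggested earlier, would just mean putting the synopsis in the `summary` field and pointing `link` back at the newspaper's own page.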

Happy reading!

Optimizing Wordpress and LAMP to survive the Digg effect

Monday, March 26th, 2007

Anyone who has used Digg for any amount of time has certainly come across a dead link. Invariably you click on the comment link hoping someone has posted a mirror of the content. You head over to duggmirror or coral cache to view the content. During the peak time right after an article hits the Digg homepage, a given site might receive as many as 100 page loads per minute. If this page is hosted on Wordpress, or another MySQL-based blog platform, combined with shared hosting, it’s a recipe for a crash. Your site will go down in flames and you won’t have a chance of recovering until the load goes back to a lower level.

Recently I had an article hit the front page of Digg. I was lucky because I had already done a number of things to optimize my server so that I survived. My page never went down, and was very responsive for the entire duration of getting dugg. My site is hosted on Wordpress on a server that I own and manage. The server hosts about 15 different domains, none of which are exceptionally high traffic sites. The server isn’t anything special either. It’s a single processor P4 3.0GHz with 512MB of memory running CentOS 4.4, Apache 2.0, PHP5, MySQL4. I do have webalizer running on my Apache logs and I have the output of those posted at http://vallery.net/stats/ if you would like to see what the Digg effect can do.

My load average never exceeded 0.15 during the Digg. Load average is a measure of how busy a Unix-based system is: roughly, the average number of processes running or waiting to run. Multiply the number of processors by 1, and you get the maximum load average that you can reasonably sustain. If the load average exceeds this number, then your system is over-utilized. In my case, my max server load is 1, given that this is a single-processor system. This means I was only using about 15% of my system resources during the maximum load that was generated from the Digg.
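The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the rule of thumb, not a monitoring tool; the function names are made up for the example, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def load_ratio(load_average, cpu_count):
    """Return the load average as a fraction of what the CPUs can sustain."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def current_load_ratio():
    """Read the 1-minute load average on a Unix system and normalize it."""
    one_minute, _five_minute, _fifteen_minute = os.getloadavg()
    return load_ratio(one_minute, os.cpu_count() or 1)


# The numbers from this post: a 0.15 load average on a single processor.
ratio = load_ratio(0.15, 1)
```

A ratio above 1.0 means more work is queued than the processors can keep up with.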

In order to optimize my system I have done the following things:

1) I’ve installed the Wordpress 2.0 plugin wp-cache. This is something that every Wordpress blog should have installed. Especially on a shared hosting environment, it will dramatically increase your ability to handle high traffic. The plugin generates the HTML for a given page and then saves it in a cache file. When someone accesses your page, instead of fetching the content from the database it uses the already generated HTML in the cache to send to the browser. This eliminates a number of fetches to the DB and dramatically speeds up page load times.
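The idea behind a page cache can be sketched in Python as follows. This is not wp-cache's code: the plugin is written in PHP and keeps its cache in files, while this sketch keeps pages in memory, and the `PageCache` class and `fake_render` function are invented purely to show the principle.

```python
import time


class PageCache:
    """Keep rendered HTML per URL and reuse it until it expires."""

    def __init__(self, ttl_seconds=3600):
        self.ttl_seconds = ttl_seconds
        self._pages = {}  # url -> (expires_at, html)

    def get(self, url, render):
        """Return cached HTML for `url`, calling `render(url)` only on a miss."""
        now = time.time()
        cached = self._pages.get(url)
        if cached is not None and cached[0] > now:
            return cached[1]
        html = render(url)  # the expensive part: database queries and templating
        self._pages[url] = (now + self.ttl_seconds, html)
        return html


render_calls = []


def fake_render(url):
    # Stand-in for the real page build; records how often it is called.
    render_calls.append(url)
    return "<html>" + url + "</html>"


cache = PageCache(ttl_seconds=60)
first = cache.get("/my-post/", fake_render)
second = cache.get("/my-post/", fake_render)  # served from the cache
```

However many visitors request the same page within the expiry window, the database is only hit once.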

2) Optimize Apache/MySQL to handle the expected number of database queries. The most common error you receive when viewing a Wordpress page that has been owned by the Digg effect is a database connection failure or timeout. When connecting to a MySQL database you can either have a persistent connection, or generate a new connection for each request. There are a number of schools of thought as to which method is better, but generally speaking using persistent connections uses more memory, so for me using a new connection for each request makes more sense. A lot of the PHP blog platforms, including Wordpress, use these non-persistent connections as well. The rub comes in the maximum number of said connections. MySQL can be configured in your my.cnf file with the maximum number of connections that can be opened to the database. If each instance of Apache (which in turn represents concurrent visitors to your page) has its own connection, then you can quickly exceed this maximum number of connections. Apache similarly has a configuration option that dictates the maximum number of threads that can be running concurrently. This number is specified in the httpd.conf file. If the Apache number is larger than the MySQL number (which is the case in most default configurations), then when you have more Apache threads running than MySQL is capable of handling, you get our nice “maximum number of connections” error message.

In /etc/my.cnf you want to set the variable “max_connections”. I recommend something reasonably high like 250.
In your httpd.conf file for Apache you want to set the variable “MaxClients”. This should typically be the same number that you selected for MySQL.
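A quick way to check for this mismatch is to compare the two settings. The following Python sketch does that on the text of the two config files. The helper names and the sample values are illustrative only, and it assumes each setting appears once as a plain `name value` or `name = value` line; a real httpd.conf may define MaxClients separately for each MPM.

```python
import re


def read_setting(config_text, name):
    """Return the last integer value assigned to `name`, or None if absent.

    Accepts both 'name = 250' (my.cnf style) and 'Name 250' (httpd.conf
    style). Lines starting with '#' do not match, so comments are ignored.
    """
    pattern = re.compile(
        r"^\s*" + re.escape(name) + r"\s*=?\s*(\d+)\s*$", re.IGNORECASE
    )
    value = None
    for line in config_text.splitlines():
        match = pattern.match(line)
        if match:
            value = int(match.group(1))
    return value


def apache_can_exhaust_mysql(httpd_conf, my_cnf):
    """True when Apache may open more connections than MySQL will accept."""
    max_clients = read_setting(httpd_conf, "MaxClients")
    max_connections = read_setting(my_cnf, "max_connections")
    if max_clients is None or max_connections is None:
        raise ValueError("setting not found")
    return max_clients > max_connections


# Illustrative file contents, not taken from any particular server.
httpd_conf = """
# MaxClients 10
MaxClients 256
"""
my_cnf = """
[mysqld]
max_connections = 100
"""
at_risk = apache_can_exhaust_mysql(httpd_conf, my_cnf)
```

With both values set to 250 as recommended above, the check comes back false.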

3) Set reasonable Apache timeouts. This means that no individual thread/connection can monopolize the system and bring all the other requests to their knees. This protects you from rogue “edge” cases.

In your httpd.conf file set “Timeout” to a low number like 30 (measured in seconds) and “KeepAliveTimeout” to something like 3.

4) If your site makes it on to Digg, use real-time monitoring tools to measure your server’s health. There are a number of command line tools available to help in this regard. The first I would recommend is the tried and true “top” command. This will display the processes running on your machine along with their associated memory and CPU usage. Keep an eye on it and make sure things aren’t getting out of hand. The second tool that I use is called “tcptrack”. It will need to be installed on your machine, but once it is, it will give you a real-time view of your incoming connections and bandwidth usage.

Happy Digging!

Scalable story promotion

Monday, March 26th, 2007

I had some thoughts on the idea of scalable story promotion for the open source Pligg system. I thought I would share them here.

I’ve been thinking about the best way to handle promoting a story/article from a queued status onto the main page and I’ve had a few thoughts I wanted to share with everyone.

The current scheme for promotion is very simple: a story is promoted once its number of votes passes a threshold defined in the config file (and the story is fresher than X days). While this will work for very low volume sites, it doesn’t exactly scale well. On a site with a large number of users, and one would assume a proportionately large number of votes, this breaks down.

Our ideal system would be able to quickly determine that a story is of high value and promote it to the homepage based on the frequency of votes as compared to a large sample population. Using some basic statistics we can determine if a story is an “above average performer” and promote it quickly.

In order to accomplish this we need to take into consideration several variables, including:

1) Number of stories submitted in a given time period
2) The average number of votes over a given time period for all “active” stories in the queue
3) The standard deviation of votes over a given time period for all “active” stories in the queue
4) The target number of stories to be promoted to the home page in a given time period

For the purposes of explaining my ideas I am going to define the time period as one day. We need to calculate and store the answers to some of these questions on a regular basis. Ideally we create a new DB table that stores this information for easy lookup. This way we can also track the information statistically and show trends. I plan on implementing a cron job that runs daily, calculates the required values, and stores them as follows.

1) Calculating the number of stories submitted is trivial. The SQL query I am using is:

select count(*) from links where link_status = 'queued' and date(link_date) = '2006-04-16';

2&3) Calculating the average and standard deviation can be done with the following query:

select avg(link_votes) as 'average', stddev(link_votes) as 'stdev' from links where link_status = 'queued' and date(link_date) = '2006-04-16';

4) This can be set in the config.php file and is up to the site administrator.
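To make the average and standard deviation from steps 2 and 3 concrete, here is a small Python sketch that computes the same two numbers from a list of vote counts. The vote counts are invented for the example. It uses the population standard deviation because MySQL's stddev() is a population standard deviation (it is a synonym for stddev_pop()).

```python
import statistics


def vote_stats(vote_counts):
    """Return (average, population standard deviation) of the vote counts."""
    if not vote_counts:
        raise ValueError("need at least one story")
    return statistics.mean(vote_counts), statistics.pstdev(vote_counts)


# Hypothetical vote counts for eight queued stories submitted on one day.
average, stddev = vote_stats([2, 4, 4, 4, 5, 5, 7, 9])
```

For these eight stories the average is 5 votes and the standard deviation is 2 votes.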

These values will be used to “predict” the future for our new stories. Each story will have a new variable that stores a floating point number. This number is the number of standard deviations above or below the mean (average). We need an additional check running on a much more frequent interval (I plan on using every 5 minutes) to update the items in the database with their new “score” and promote them once they pass a specific threshold.

The calculations use the following variables:

$score = //Z score that indicates if a story is above/below the mean
$numofvotes = //The number of votes that a given story has received
$stddev = //Standard deviation for the given time period as calculated above
$average = //Average number of votes for the given time period as calculated above
$numberofstories = //The number of stories submitted in the given time period as calculated above
$desirednumofstories = //Setting from config.php

The cut-off value determines a “score” threshold, or essentially a percentile that a story must fall into in order to be promoted. The score itself I plan on calculating as follows:

$score = ($numofvotes - $average)/$stddev;

or in SQL:

select link_id, ((link_votes - $average)/$stddev) from links where link_status = 'queued' and date(link_date) = '2006-04-16';

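Once we have this score we need to decide if this story needs to be promoted or not. This is done by first calculating where in the rank order this story is likely to fall for the day using some basic probability statistics. The 1/(1+exp(-1.7*$score)) term below is a logistic approximation of the normal cumulative distribution, i.e. the fraction of stories expected to score lower than this one.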

$rank = round($numberofstories/(1+exp(-1.7*$score)),1);

Now we just check and see if that rank falls above our threshold and promote the story accordingly.

$desiredrank = $numberofstories - $desirednumofstories;
if ($rank >= $desiredrank) {
// Update story and set to promoted
}
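Putting the pieces together, the whole promotion check can be sketched in Python like this. It follows the formulas above, but the function names are chosen for the example and the guard against a zero standard deviation is an addition that is not part of the plan described here.

```python
import math


def z_score(num_votes, average, stddev):
    """Number of standard deviations a story's votes sit from the mean."""
    if stddev == 0:
        return 0.0  # all stories have the same votes; treat this one as average
    return (num_votes - average) / stddev


def predicted_rank(score, number_of_stories):
    """Estimate how many of the period's stories this one outperforms."""
    return round(number_of_stories / (1 + math.exp(-1.7 * score)), 1)


def should_promote(num_votes, average, stddev, number_of_stories, desired_stories):
    """True when the predicted rank lands in the top `desired_stories`."""
    rank = predicted_rank(z_score(num_votes, average, stddev), number_of_stories)
    return rank >= number_of_stories - desired_stories


# Hypothetical day: 200 submissions, 5 votes on average with a standard
# deviation of 2, and a target of 10 promoted stories.
promote_strong = should_promote(12, 5, 2, 200, 10)
promote_modest = should_promote(7, 5, 2, 200, 10)
```

With these numbers a story needs a predicted rank of at least 190 out of 200. The story with 12 votes is 3.5 standard deviations above the mean and clears that bar, while the one with 7 votes (one standard deviation above) does not.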

This method should work well assuming that traffic is fairly stable from day to day. Since we are using the previous day’s data to predict the current day’s volumes, a sudden traffic spike will mean that a larger number of stories will be promoted than desired. This can be mitigated on larger volume websites by decreasing the “given period of time” from a day to something shorter. Additionally, this whole idea could be retooled to calculate these variables on a per-category level. Some sites might have much higher traffic, and therefore more votes, for one category than another.

I’m currently working on implementing the above for a site that I plan on launching in the near future, and I would love feedback and criticism before I do it.