Archive for the ‘Web Development’ Category

Debug your outbound POST and GET requests

May 17th, 2007 jvallery 1 comment

I work a lot with different types of web services. When I’m building an application that has to post data to a remote service, I find it can be difficult to pin down where the problems are. I can’t always see an exact copy of the HTTP request that I am sending, and therefore how the remote service sees my call. So I created a simple little app that, when called, returns exactly what it was sent. You can pass in variables in a POST or GET, and it will just spit them right back at you along with whatever HTTP headers were sent by your client.

If you point your browser to http://vallery.net/postback/index.php you can see it in action. It will report back exactly how your browser is identifying itself, including any cookies you might have received from my WordPress blog!

Now the next time you are writing an application and you want to debug your outbound posts, just send them over to the above URL and it will respond with exactly what it received.
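As a rough illustration of sending a test request to the echo service, here is a minimal Python sketch. Only the URL comes from this post; the function names, the sample parameters and the choice of Python's standard urllib are assumptions made for the example, and it assumes the URL is reachable when you run it.

```python
import urllib.parse
import urllib.request

# The echo URL from the post; everything else here is illustrative.
ECHO_URL = "http://vallery.net/postback/index.php"


def build_debug_post(params, url=ECHO_URL):
    """Build (but do not send) a form-encoded POST aimed at the echo service."""
    body = urllib.parse.urlencode(params).encode("ascii")
    request = urllib.request.Request(url, data=body, method="POST")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


def send_and_show(request, timeout=10):
    """Send the request and return whatever text the service responds with."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling send_and_show(build_debug_post({"name": "test", "id": 42})) should return the service's view of the request, which you can compare against what you intended to send.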

Pretty cool!

Categories: Web Development

Optimizing WordPress and LAMP to survive the Digg effect

March 26th, 2007 jvallery 3 comments

Anyone who has used Digg for any amount of time has certainly come across a dead link. Invariably you click on the comment link hoping someone has posted a mirror of the content, then head over to duggmirror or Coral Cache to view it. During the peak right after an article hits the Digg homepage, a given site might receive as many as 100 page loads per minute. If the page is hosted on WordPress or another MySQL-based blog platform, especially on shared hosting, it’s a recipe for a crash. Your site will go down in flames and you won’t have a chance of recovering until the load drops back to a lower level.

Recently I had an article hit the front page of Digg. I was lucky because I had already done a number of things to optimize my server so that I survived. My page never went down, and was very responsive for the entire duration of getting dugg. My site is hosted on WordPress on a server that I own and manage. The server hosts about 15 different domains, none of which are exceptionally high traffic sites. The server isn’t anything special either. It’s a single processor P4 3.0GHz with 512MB of memory running CentOS 4.4, Apache 2.0, PHP5, MySQL4. I do have webalizer running on my Apache logs and I have the output of those posted at http://vallery.net/stats/ if you would like to see what the Digg effect can do.

My load average never exceeded 0.15 during the Digg. Load average is a measure of how busy a Unix-based system is. As a rule of thumb, the maximum load average you can reasonably sustain equals the number of processors (1.0 per processor). If the load average exceeds this number, then your system is overutilized. In my case the maximum is 1, given that this is a single-processor system, so I was using only about 15% of my system’s capacity at the peak load generated by the Digg.
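The rule of thumb above is simple arithmetic, and can be sketched in Python as follows. The function names are made up for this example, and current_utilization only works on Unix-like systems where os.getloadavg is available.

```python
import os


def load_utilization(load_average, processors):
    """Return the load average as a fraction of the sustainable maximum (1.0 per processor)."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return load_average / processors


def current_utilization():
    """Read the 1-minute load average and scale it by the number of CPUs."""
    one_minute, _five, _fifteen = os.getloadavg()
    return load_utilization(one_minute, os.cpu_count() or 1)
```

For my server, load_utilization(0.15, 1) gives 0.15, which is the 15% figure quoted above. A value above 1.0 would mean the machine is being asked to do more than it can keep up with.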

In order to optimize my system I have done the following things:

1) I’ve installed the WordPress 2.0 plugin wp-cache. This is something that every WordPress blog should have installed. Especially on a shared hosting environment, it will dramatically increase your ability to handle high traffic. The plugin generates the HTML for a given page and then saves it in a cache file. When someone accesses your page, instead of fetching the content from the database it uses the already generated HTML in the cache to send to the browser. This eliminates a number of fetches to the DB and dramatically speeds up page load times.

2) Optimize Apache/MySQL to handle the expected number of database queries. The most common error you see when viewing a WordPress page that has been owned by the Digg effect is a failed database connection or a timeout. When connecting to a MySQL database you can either use a persistent connection or open a new connection for each request. There are a number of schools of thought as to which method is better, but generally speaking persistent connections use more memory, so for me opening a new connection for each request makes more sense. A lot of the PHP blog platforms, including WordPress, use these non-persistent connections as well.

The rub comes in the maximum number of said connections. MySQL can be configured in your my.cnf file with the maximum number of connections that can be opened to the database. If each instance of Apache (which in turn represents a concurrent visitor to your page) has its own connection, then you can quickly exceed this maximum. Apache similarly has a configuration option, specified in the httpd.conf file, that dictates the maximum number of processes or threads that can be running concurrently. If the Apache number is larger than the MySQL number (which is the case in most default configurations), then as soon as you have more Apache threads running than MySQL is capable of handling, you get our nice “maximum number of connections” error message.

In /etc/my.cnf you want to set the variable “max_connections”. I recommend something reasonably high like 250.
In your httpd.conf file for Apache you want to set the variable “MaxClients”. This should typically be the same number that you selected for MySQL.

3) Set reasonable Apache timeouts. This ensures that no individual thread/connection can monopolize the system and bring all the other requests to their knees. It protects you from rogue “edge” cases.

In your httpd.conf file set “Timeout” to a low number like 30 (measured in seconds) and “KeepAliveTimeout” to something like 3.

4) If your site makes it onto Digg, use real-time monitoring tools to measure your server’s health. There are a number of command line tools available to help in this regard. The first I would recommend is the tried and true “top” command. This will display the processes running on your machine along with their associated memory and CPU usage. Keep an eye on it and make sure things aren’t getting out of hand. The second tool that I use is called “Tcptrack”. Tcptrack will need to be installed on your machine, but once it is, it will give you a real-time view of your incoming connections and bandwidth usage.
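Going back to step 2, the relationship between “MaxClients” and “max_connections” can be checked with a short script. The following Python sketch is only an illustration: the function names are invented, it expects the contents of the two config files as strings, and it assumes each setting sits on its own line with no trailing comment. Because httpd.conf can set MaxClients in more than one block, the sketch compares against the largest value it finds.

```python
import re


def find_setting(text, name, separator):
    """Return every integer assigned to `name` in a config text, skipping commented lines."""
    pattern = re.compile(
        r"^\s*" + re.escape(name) + separator + r"(\d+)\s*$",
        re.IGNORECASE | re.MULTILINE,
    )
    return [int(value) for value in pattern.findall(text)]


def connection_headroom(my_cnf_text, httpd_conf_text):
    """Return max_connections minus the largest MaxClients.

    A negative result means Apache can open more connections than MySQL accepts.
    """
    max_connections = find_setting(my_cnf_text, "max_connections", r"\s*=\s*")
    max_clients = find_setting(httpd_conf_text, "MaxClients", r"\s+")
    if not max_connections or not max_clients:
        raise ValueError("could not find both settings")
    return max_connections[-1] - max(max_clients)
```

With max_connections = 250 in my.cnf and MaxClients 256 in httpd.conf, connection_headroom returns -6, meaning six Apache clients could be left without a database connection under full load.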

Happy Digging!

Categories: Web Development

Scalable story promotion

March 26th, 2007 jvallery 1 comment

I had some thoughts on the idea of scalable story promotion for the open source Pligg system. I thought I would share them here.

I’ve been thinking about the best way to handle promoting a story/article from a queued status onto the main page and I’ve had a few thoughts I wanted to share with everyone.

The current scheme for promotion is very simple: a story is promoted once its number of votes passes a threshold defined in the config file (and the story is fresher than X days). While this works for very low-volume sites, it doesn’t scale well. On a site with a large number of users, and presumably a proportionately large number of votes, it breaks down.

Our ideal system would be able to quickly determine that a story is of high value and promote it to the homepage based on the frequency of votes as compared to a large sample population. Using some basic statistics we can determine if a story is an “above average performer” and promote it quickly.

In order to accomplish this we need to take into consideration several variables, including:

1) Number of stories submitted in a given time period
2) The average number of votes over a given time period for all “active” stories in the queue
3) The standard deviation of votes over a given time period for all “active” stories in the queue
4) The target number of stories to be promoted to the home page in a given time period

For the purposes of explaining my ideas I am going to define the time period as one day. We need to calculate and store the answers to some of these questions on a regular basis. Ideally we create a new DB table that stores this information for easy lookup. This way we can also track the information statistically and show trends. I plan on implementing a cron job that runs daily, calculates the required values, and stores them as follows.

1) Calculating the number of stories submitted is trivial. The SQL query I am using is:

select count(*) from links where link_status = 'queued' and date(link_date) = '2006-04-16';

2 & 3) Calculating the average and standard deviation can be done with the following query:

select avg(link_votes) as average, stddev(link_votes) as stdev from links where link_status = 'queued' and date(link_date) = '2006-04-16';

4) This can be set in the config.php file and is up to the site administrator.
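To make the daily statistics concrete, here is a small Python sketch of the same calculation done outside the database. The function name and the sample vote counts are invented for illustration. It uses the population standard deviation, since that is what MySQL’s stddev() returns.

```python
import statistics


def daily_vote_stats(vote_counts):
    """Summarize one day's queued stories: story count, mean votes, population std dev."""
    story_count = len(vote_counts)
    if story_count == 0:
        # No stories in the period, so there is nothing to average.
        return {"stories": 0, "average": 0.0, "stddev": 0.0}
    return {
        "stories": story_count,
        "average": statistics.fmean(vote_counts),
        "stddev": statistics.pstdev(vote_counts),
    }
```

For example, eight stories with 2, 4, 4, 4, 5, 5, 7 and 9 votes have an average of 5 votes and a standard deviation of 2.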

These values will be used to “predict” the future for our new stories. Each story will have a new variable that stores a floating point number. This number is the number of standard deviations above or below the mean (average). We need an additional check running on a much more frequent interval (I plan on using every 5 minutes) to update the items in the database with their new “score” and promote them once they pass a specific threshold.

The calculation uses the following variables:

$score = //Z score that indicates whether a story is above/below the mean
$numofvotes = //The number of votes that a given story has received
$stddev = //Standard deviation for the given time period as calculated above
$average = //Average number of votes for the given time period as calculated above
$numberofstories = //The number of stories submitted in the given time period as calculated above
$desirednumofstories = //Setting from config.php

The cut-off value determines a “score” threshold, or essentially a percentile that a story must fall into in order to be promoted. I plan on calculating this as follows:

$score = ($numofvotes - $average) / $stddev;
or, in SQL:

select link_id, ((link_votes - $average) / $stddev) as score from links where link_status = 'queued' and date(link_date) = '2006-04-16';

Once we have this score we need to decide if this story needs to be promoted or not. This is done by first calculating where in the rank order this story is likely to fall for the day using some basic probability statistics.

$rank = round($numberofstories / (1 + exp(-1.7 * $score)), 1);
Now we just check and see if that rank falls above our threshold and promote the story accordingly.

$desiredrank = $numberofstories - $desirednumofstories;
if ($rank >= $desiredrank) {
    // Update story and set to promoted
}
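Putting the pieces together, the promotion check can be sketched end to end in Python. This is only an illustration of the formulas above, not the final Pligg code: the function names are invented, and returning a score of 0 when the standard deviation is 0 is an assumption I have added to avoid dividing by zero. The 1/(1 + exp(-1.7 * score)) term is a logistic curve, which is a common approximation of the cumulative normal distribution, so multiplying it by the number of stories estimates how many stories the current one outranks.

```python
import math


def z_score(votes, average, stddev):
    """Number of standard deviations a story's votes sit above or below the mean."""
    if stddev == 0:
        # Assumption for this sketch: with no spread, treat every story as average.
        return 0.0
    return (votes - average) / stddev


def estimated_rank(score, number_of_stories):
    """Estimate how many of the day's stories this one outranks (logistic approximation)."""
    return round(number_of_stories / (1 + math.exp(-1.7 * score)), 1)


def should_promote(votes, average, stddev, number_of_stories, desired_stories):
    """Promote when the estimated rank reaches the top `desired_stories` slots."""
    rank = estimated_rank(z_score(votes, average, stddev), number_of_stories)
    return rank >= number_of_stories - desired_stories
```

With 100 stories a day, an average of 5 votes, a standard deviation of 2 and a target of 10 promoted stories, a story with 9 votes (two standard deviations above the mean) is promoted, while a story with exactly 5 votes gets an estimated rank of 50 and is not.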

This method should work well assuming that traffic is fairly stable from day to day. Since we are using the previous day’s data to predict the current day’s volumes, a sudden traffic spike will mean that a larger number of stories is promoted than desired. This can be mitigated on higher-volume websites by decreasing the “given period of time” from a day to something shorter. Additionally, this whole idea could be retooled to calculate these variables on a per-category level, since some sites might have much higher traffic, and therefore more votes, in one category than another.

I’m currently working on implementing the above for a site that I plan on launching in the near future, and I would love feedback and criticism before I do it.

Categories: Web Development