Archive | Web Development

And we’re back… on WordPress

So my fairly short-lived experiment with migrating my blog to SharePoint is over. As much as I love SharePoint, it really isn’t a great platform for public blogging, at least using the out-of-the-box blog template.

There are some major limitations that caused me to switch back:

 

  • Because the solution runs in the sandbox, I can’t make any outbound HTTP requests from my custom code. This means that even if I wanted to write a custom comment handler to use a spam filter like Akismet, I can’t. The result was that I was getting hundreds of spam comments on my blog.
  • There is no URL rewrite support, so all of the URLs on the new blog were simply “Post.aspx?ID=XX”. This is crap for search engine optimization.
  • The comment capability doesn’t allow anonymous users to specify their name and email address. This meant that all comments were anonymous.

Overall it was a fun experiment and I learned a lot about sandbox solutions and online hosting. I think I’ll stick to WordPress, at least for now, as I have considerably more flexibility with it.

For historical purposes though I still have the Visual Studio project available that I used to create the sandbox solution here:  http://vallery.net/wp-content/uploads/2012/01/ValleryBlog.zip


Blog migration to SharePoint

 

You can find the source code for this entry at:  http://vallery.net/wp-content/uploads/2012/01/ValleryBlog.zip

So I decided to migrate my blog from WordPress to SharePoint. The motivation came primarily from wanting to “eat my own dogfood”. Given that most of the work I do and the posts on my blog have to do with SharePoint, I thought it would be appropriate to actually host my blog with SharePoint.

I decided to seek out a hosting provider which specializes in “SharePoint in the cloud” given that my blog is high enough traffic and important enough to me that hosting at my house isn’t an option.   I first turned to Microsoft’s Office 365 service given that I’m already hosting my personal email with them.    I was disappointed to learn that you currently cannot host a public facing website with anonymous access.   They require the site to be secured which wouldn’t work for a blog.

I emailed a couple of the other community hosting options and got a very nice reply from FPWeb. One of the services they offer gives you a single site collection on SharePoint Foundation which you can optionally give anonymous access to. The site is limited in terms of which SharePoint features are available, but they did offer the blog template. If you’re not familiar with the differences, check out this great guide which details the plans that are available. The site provided uses SharePoint host-named site collections, which allow me to have a fully qualified domain name (http://blog.vallery.net/) point to a single site collection on their web application.

For my use case there were a few limitations that bit me:

  • Deployed custom solutions must fit into a “Sandboxed Solution”. This is true for any of the non-dedicated online SharePoint options though, so it is not an issue with FPWeb specifically. This means that many of the things I’m used to doing for my customers’ on-premises installations are not possible in the cloud. Microsoft has a good list of the limitations of sandboxed solutions that is worth reading. The things I missed the most are:
    • Ability to create custom delegate controls. This made doing a custom search box a bit more difficult and less flexible than I’d like.
    • Ability to integrate with external systems. My current WordPress-based blog will post to Twitter automatically when I write a new article, but given that no external HTTP requests are allowed I can’t write an event receiver to do that for me.
    • I have to deploy all style assets to the SharePoint “Style Library” because I can’t write to the file system. This could potentially have a negative impact on performance because now they live in the database.
    • I wanted to create a few custom controls to handle some of the style elements. For example, the banner on the site isn’t as flexible as I’d like. It should really pull the title, URL, and description from SPContext.Current.Web. I had to hardcode them into the masterpage, which means my solution isn’t portable.
  • Since FPWeb only provides SharePoint Foundation it makes branding more difficult.   There are a number of capabilities found in the enterprise publishing features that I typically use with my customers which were unavailable to me. 
    • In a custom masterpage you can use the SPUrl object to find the URL of the site collection or site in order to reference your style library (to include CSS, JS, and images on the page). This is not available if you don’t have enterprise, so I had to hard code the path to the Style Library into the masterpage. On my development box I created a blog site collection in a managed path off the root, which made the path different from my site here at FPWeb (which is the root site collection). This meant I had to make a quick change every time I switched from my dev box to pushing out the solution to FPWeb.
  • Search is limited given that it is only SharePoint Foundation.  This is fine because I only need a site scoped search anyways.   The trouble of course is that it turns out the control that is used is very different.   I styled the Enterprise version since that is what I had on my dev box.   Once I deployed to FPWeb I discovered that none of my styles worked.   I had to go back and refactor for the foundation search box.   It wasn’t a huge issue but it took me a while to figure out why my styles weren’t working!

Once I got my solution done with my custom style, I needed to get all my old content off my WordPress blog and onto SharePoint. I had some good conversations with Metalogix at their booth during the SharePoint conference in Anaheim, so I was already familiar with their "Migration Manager for Blogs and Wikis" product. They were kind enough to give me a trial key, which was enough to migrate my posts and comments across from WordPress. The process was painless and just required confirming the source and destination. It was able to capture metadata such as category as well, which was handy.

I did discover a few issues with the way the data was migrated though. I realized that the blog author for all of the posts and comments was listed as me. I guess this makes sense since the author field is tied to a people picker and needs to be a person. None of the comment authors in WordPress are users in my SharePoint environment, obviously. Metalogix handles this by creating a couple of extra columns on the comments list which store the original comment author, name, and email address. I plan on writing a custom XSLT file and view to parse this at some point and place this information onto the page. I’ve also included these fields on the new comment dialogue so I can capture them for future comments. For now though I’m listed as the comment author for all of the comments.

The only issue I had with the Metalogix tool is the moving of images.   This could easily have been my fault as well so I don’t want to blame the tool.   My posts had a number of embedded images which WordPress stored in the wp-content folder.   My expectation would have been that Metalogix would have grabbed those files and put them into SharePoint and updated the references.   It even looked like it attempted to do that however I received a bunch of error messages which indicated failure.   My images are still linking to their original source on my WordPress site (which is still up) so they are not broken.   One of my todo items is to migrate those across by hand and then update my old posts that reference them.     This will give me an opportunity to clean up my old posts anyways which I’ve been meaning to do for a while.

With everything up and running, I put 301 redirects into the .htaccess file on my WordPress server for each post. I pulled out a list of all the posts from the SharePoint list and matched them up to their original URLs (which thankfully Metalogix stores in an additional column for me). With all of these URLs known, I’m able to redirect the deep links to their specific posts. WordPress has the handy SEO-optimized permalinks which SharePoint does not.
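The matching step lends itself to a small script. Here is a minimal sketch in Python, assuming the post list has been exported as pairs of original WordPress path and SharePoint post ID. The function name, the sample path, and the "/Lists/Posts/" prefix are my assumptions for illustration; only the "Post.aspx?ID=XX" pattern and the use of 301 redirects come from the migration itself.

```python
def redirect_rules(mapping, new_base):
    """Build mod_alias 'Redirect 301' lines for an .htaccess file.

    mapping:  iterable of (old_path, post_id) pairs, where old_path is the
              original WordPress permalink path and post_id is the ID of the
              matching item in the SharePoint posts list.
    new_base: scheme and host of the new blog, without a trailing slash.
    """
    rules = []
    for old_path, post_id in mapping:
        # Assumed location of the post page on the SharePoint side.
        target = f"{new_base}/Lists/Posts/Post.aspx?ID={post_id}"
        rules.append(f"Redirect 301 {old_path} {target}")
    return rules


# Hypothetical sample data, not taken from the real blog.
for line in redirect_rules([("/2008/05/my-post/", 12)], "http://blog.vallery.net"):
    print(line)
```

Each printed line can be pasted into the .htaccess file as-is, one per post.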

Overall I’m very satisfied with both FPWeb and Metalogix.  Without either of them this project would have been significantly more difficult.   If you can accept the limitations of hosting your SharePoint site in a shared environment then I highly recommend FPWeb.

In case anyone is interested in the actual development work that was involved with the branding in all this I’ve posted my Visual Studio project.  In total I spent about 15 hours on the entire project over my Thanksgiving weekend. You can download the project at http://vallery.net/wp-content/uploads/2012/01/ValleryBlog.zip.


Debug your outbound POST and GET requests

I work a lot with different types of web services. I find when I’m building an application that has to post data off to a remote service that it can be difficult to debug where the problems are. I can’t always see an exact copy of the HTTP request that I am sending, and therefore how the remote service sees my call. I created a simple little app that when called will return exactly what it was sent. You can pass in variables in a POST or GET, and it will just spit them right back at you along with whatever HTTP headers were sent by your client.

If you point your browser over to http://vallery.net/postback/index.php you can see it in action. It will report back to you exactly how your browser is identifying itself, including any cookies you might have received from my WordPress blog!

Now the next time you are writing an application and you want to debug your outbound posts, just send them over to the above URL and it will respond with exactly what it received.
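If you would rather run something similar yourself, the idea fits in a few lines. This is not the code behind the URL above (that one is a PHP page); it is a minimal sketch of the same idea as a Python WSGI app, and the name echo_app and the output format are my own choices.

```python
from urllib.parse import parse_qsl


def echo_app(environ, start_response):
    """WSGI app that reports back the method, headers and variables it was sent."""
    method = environ.get("REQUEST_METHOD", "GET")

    # Variables passed on the query string.
    query = parse_qsl(environ.get("QUERY_STRING", ""), keep_blank_values=True)

    # Variables passed in a form-encoded POST body.
    length = int(environ.get("CONTENT_LENGTH") or 0)
    body = environ["wsgi.input"].read(length).decode("utf-8", "replace") if length else ""
    form = parse_qsl(body, keep_blank_values=True) if method == "POST" else []

    # WSGI exposes most request headers as HTTP_* keys, so turn them back
    # into their usual spelling (HTTP_USER_AGENT -> User-Agent).
    headers = sorted(
        (key[5:].replace("_", "-").title(), value)
        for key, value in environ.items()
        if key.startswith("HTTP_")
    )

    lines = [f"Method: {method}"]
    lines += [f"Header {name}: {value}" for name, value in headers]
    lines += [f"GET {name} = {value}" for name, value in query]
    lines += [f"POST {name} = {value}" for name, value in form]
    payload = "\n".join(lines).encode("utf-8")

    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]
```

To try it locally, the standard library's wsgiref.simple_server.make_server("", 8000, echo_app).serve_forever() should be enough to serve it on port 8000.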

Pretty cool!


Optimizing WordPress and LAMP to survive the Digg effect

Anyone who has used Digg for any amount of time has certainly come across a dead link. Invariably you click on the comment link hoping someone has posted a mirror of the content. You head over to duggmirror or coral cache to view the content. During the peak time right after an article hits the Digg homepage, a given site might receive as many as 100 page loads per minute. If this page is hosted on WordPress, or another MySQL-based blog platform, combined with shared hosting, it’s a recipe for a crash. Your site will go down in flames and you won’t have a chance of recovering until the load goes back to a lower level.

Recently I had an article hit the front page of Digg. I was lucky because I had already done a number of things to optimize my server so that I survived. My page never went down, and was very responsive for the entire duration of getting dugg. My site is hosted on WordPress on a server that I own and manage. The server hosts about 15 different domains, none of which are exceptionally high traffic sites. The server isn’t anything special either. It’s a single processor P4 3.0GHz with 512MB of memory running CentOS 4.4, Apache 2.0, PHP5, MySQL4. I do have webalizer running on my Apache logs and I have the output of those posted at http://vallery.net/stats/ if you would like to see what the Digg effect can do.

My load average never exceeded 0.15 during the Digg. Load average is a measure of how busy a Unix-based system is: roughly, the average number of processes running or waiting to run. The number of processors indicates the maximum load average that you can reasonably sustain. If the load average exceeds this number, then your system is overutilized. In my case, my max server load is 1, given that this is a single-processor system. This means I was only using about 15% of my system resources during the maximum load generated by the Digg.
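The arithmetic is simple enough to script. A minimal sketch in Python, assuming a Unix system (os.getloadavg is not available on Windows); the function names are my own:

```python
import os


def load_ratio(load_average, cpu_count):
    """Load average per processor; values above 1.0 suggest an overloaded box."""
    return load_average / cpu_count


def current_load_ratio():
    """Read the 1-minute load average (Unix only) and normalise it by CPU count."""
    one_minute, _five_minute, _fifteen_minute = os.getloadavg()
    return load_ratio(one_minute, os.cpu_count() or 1)


# The figures from this post: a load average of 0.15 on a single processor.
print(f"{load_ratio(0.15, 1):.0%}")
```

With the numbers above this prints 15%, which matches the rough estimate in the text.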

In order to optimize my system I have done the following things:

1) I’ve installed the WordPress 2.0 plugin wp-cache. This is something that every WordPress blog should have installed. Especially on a shared hosting environment, it will dramatically increase your ability to handle high traffic. The plugin generates the HTML for a given page and then saves it in a cache file. When someone accesses your page, instead of fetching the content from the database it uses the already generated HTML in the cache to send to the browser. This eliminates a number of fetches to the DB and dramatically speeds up page load times.

2) Optimize Apache/MySQL to handle the expected number of database queries. The most common error you receive when viewing a WordPress page that has been owned by the Digg effect is a database connection failure or timeout. When connecting to a MySQL database you can either have a persistent connection, or generate a new connection for each request. There are a number of schools of thought as to which method is better, but generally speaking, using persistent connections utilizes more memory, so for me using a new connection for each request makes more sense. A lot of the PHP blog templates, including WordPress, use these non-persistent connections as well. The rub comes in the maximum number of said connections. MySQL can be configured in your my.cnf file with the maximum number of connections that can be created to the database. If each instance of Apache (which in turn represents concurrent visitors to your page) has its own connection, then you can quickly exceed this maximum number of connections. Apache similarly has a configuration option that dictates the maximum number of threads that can be concurrently running. This number is specified in the httpd.conf file. If the Apache number is larger than the MySQL number (which is the case in most default configurations), then when you have more Apache threads running than MySQL is capable of handling, you get our nice “maximum number of connections” error message.

In /etc/my.cnf you want to set the variable “max_connections”. I recommend something reasonably high like 250.
In your httpd.conf file for Apache you want to set the variable “MaxClients”. This should typically be the same number that you selected for MySQL.

3) Set reasonable Apache timeouts. This means that no individual thread/connection can monopolize the system and bring all the other requests to their knees. This protects you from rogue “edge” cases.

In your httpd.conf file set “Timeout” to a low number like 30 (measured in seconds) and “KeepAliveTimeout” to something like 3.

4) If your site makes it on to Digg, use real-time monitoring tools to measure your server’s health. There are a number of command line tools available to help in this regard. The first I would recommend is the tried and true “top” command. This will display the processes running on your machine along with their associated memory and CPU usage. Keep an eye on it and make sure things aren’t getting out of hand. The second tool that I use is called “tcptrack”. It will need to be installed on your machine, but once it is, it will give you a real-time view of your incoming connections and bandwidth usage.
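The mismatch described in step 2 is easy to check for before you get dugg. Below is a rough sketch in Python that pulls “MaxClients” out of httpd.conf text and “max_connections” out of my.cnf text and compares them. The helper names are mine, and the parsing is a deliberate simplification: it takes the first uncommented occurrence of each setting and ignores things like multiple MPM sections or included files.

```python
import re


def read_limit(config_text, name):
    """Return the integer value of a `name = N` or `name N` setting, or None."""
    pattern = rf"^\s*{re.escape(name)}\s*=?\s*(\d+)\s*$"
    match = re.search(pattern, config_text, flags=re.MULTILINE | re.IGNORECASE)
    return int(match.group(1)) if match else None


def apache_can_exhaust_mysql(httpd_conf, my_cnf):
    """True if Apache may open more connections than MySQL will accept.

    Returns None when either setting cannot be found in the given text.
    """
    max_clients = read_limit(httpd_conf, "MaxClients")
    max_connections = read_limit(my_cnf, "max_connections")
    if max_clients is None or max_connections is None:
        return None
    return max_clients > max_connections


# Hypothetical config snippets for illustration.
httpd_conf = "Timeout 30\nKeepAliveTimeout 3\nMaxClients 250\n"
my_cnf = "[mysqld]\nmax_connections = 250\n"
print(apache_can_exhaust_mysql(httpd_conf, my_cnf))
```

In practice you would read the two files from disk (for example /etc/my.cnf and wherever your httpd.conf lives) and pass their contents in.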

Happy Digging!


Scalable story promotion

I had some thoughts on the idea of scalable story promotion for the open source Pligg system. I thought I would share them here.

I’ve been thinking about the best way to handle promoting a story/article from a queued status onto the main page and I’ve had a few thoughts I wanted to share with everyone.

The current scheme for promotion is very simple: a story is promoted once its number of votes passes a threshold defined in the config file (and the story is fresher than X days). While this will work for very low volume sites, it doesn’t exactly scale well. In a site with a large number of users, and one would assume a proportionately large number of votes, this breaks down.

Our ideal system would be able to quickly determine that a story is of high value and promote it to the homepage based on the frequency of votes as compared to a large sample population. Using some basic statistics we can determine if a story is an “above average performer” and promote it quickly.

In order to accomplish this we need to take into consideration several variables, including:

1) Number of stories submitted in a given time period
2) The average number of votes over a given time period for all “active” stories in the queue
3) The standard deviation of votes over a given time period for all “active” stories in the queue
4) The target number of stories to be promoted to the home page in a given time period

For the purposes of explaining my ideas I am going to define the time period as one day. We need to calculate and store the answers to some of these questions on a regular basis. Ideally we create a new DB table that stores this information for easy lookup. This way we can also track information statistically and show trends. I plan on implementing a cron job that runs daily, calculates the required values, and stores them as follows.

1) Calculating the number of stories submitted is trivial. The SQL query I am using is:

select count(*) from links where link_status = 'queued' and date(link_date) = '2006-04-16';

2&3) Calculating the average and standard deviation can be done with the following query:

select avg(link_votes) as 'average', stddev(link_votes) as 'stdev' from links where link_status = 'queued' and date(link_date) = '2006-04-16';

4) This can be set in the config.php file and is up to the site administrator.
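For anyone who wants to sanity check the numbers outside the database, here is a minimal sketch of the same daily figures in Python. It assumes you already have the vote counts of the day's queued stories as a list; the function name and sample data are made up for illustration. MySQL's stddev() is the population standard deviation, so the sketch uses pstdev rather than stdev.

```python
from statistics import mean, pstdev


def daily_stats(vote_counts):
    """Summarise one day's queued stories: how many, average votes, std deviation."""
    if not vote_counts:
        # No stories were submitted, so there is nothing to average.
        return {"stories": 0, "average": None, "stddev": None}
    return {
        "stories": len(vote_counts),
        "average": mean(vote_counts),
        "stddev": pstdev(vote_counts),  # population std dev, like MySQL stddev()
    }


# Hypothetical vote counts for eight queued stories.
print(daily_stats([2, 4, 4, 4, 5, 5, 7, 9]))
```

A daily cron job could compute the same three values and write them into the stats table described above.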

These values will be used to “predict” the future for our new stories. Each story will have a new variable that stores a floating point number. This number is the number of standard deviations above or below the mean (average). We need an additional check running on a much more frequent interval (I plan on using every 5 minutes) to update the items in the database with their new “score” and promote them once they pass a specific threshold.

The calculation uses the following variables:

$score = //Z score that indicates whether a story is above/below the mean
$numofvotes = //The number of votes that a given story has received
$stddev = //Standard deviation for the given time period as calculated above
$average = //Average number of votes for the given time period as calculated above
$numberofstories = //The number of stories submitted in the given time period as calculated above
$desirednumofstories = //Setting from config.php

The cut-off value determines a “score” threshold, or essentially a percentile that a story must fall into in order to be promoted. I plan on calculating this as follows:

$score = ($numofvotes - $average)/$stddev;
or in SQL:

select link_id, ((link_votes - $avg)/$stddev) from links where link_status = 'queued' and date(link_date) = '2006-04-16';
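As a quick worked example of the score, here is the same formula as a small Python sketch. One thing the formula does not cover is a standard deviation of zero (every story has the same number of votes); treating that case as a score of 0 is my assumption, not part of the plan above.

```python
def z_score(votes, average, stddev):
    """Number of standard deviations a story's votes sit above or below the mean."""
    if stddev == 0:
        # Assumption: with no spread in the votes, call every story average.
        return 0.0
    return (votes - average) / stddev


# A story with 9 votes on a day that averaged 5 votes with a std deviation of 2.
print(z_score(9, 5, 2))  # 2.0, i.e. two standard deviations above the mean
```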

Once we have this score we need to decide if this story needs to be promoted or not. This is done by first calculating where in the rank order this story is likely to fall for the day using some basic probability statistics.

$rank = round($numberofstories/(1+exp(-1.7*$score)),1);
Now we just check whether that rank falls above our threshold and promote the story accordingly.

$desiredrank = $numberofstories - $desirednumofstories;
if ($rank >= $desiredrank) {
    // Update story and set to promoted
}
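The 1/(1+exp(-1.7*score)) term is a logistic curve, which with a scale of about 1.7 is a common approximation of the cumulative normal distribution. In other words it estimates the fraction of stories that this story is beating, and multiplying by the number of stories turns that into a rank. Here is the same check as a Python sketch so it can be tried with a few numbers; the function names are mine and the sample figures are made up.

```python
import math


def estimated_rank(score, number_of_stories):
    """Estimate how many of the day's stories this story is ahead of.

    The logistic term approximates the normal CDF evaluated at the z-score.
    """
    return round(number_of_stories / (1 + math.exp(-1.7 * score)), 1)


def should_promote(score, number_of_stories, desired_number_of_stories):
    """True if the estimated rank puts the story among the top desired stories."""
    desired_rank = number_of_stories - desired_number_of_stories
    return estimated_rank(score, number_of_stories) >= desired_rank


# Hypothetical day: 100 stories expected, 10 front page slots.
print(estimated_rank(0, 100))        # 50.0: an exactly average story sits mid-pack
print(should_promote(0, 100, 10))    # False
print(should_promote(2.0, 100, 10))  # True
```

An average story lands in the middle of the pack and stays queued, while one two standard deviations above the mean clears the cut-off of 90 comfortably.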

This method should work well assuming that traffic is fairly stable from day to day. Since we are using the previous day’s data to predict the current day’s volumes, a sudden traffic spike will mean that a larger number of stories will be promoted than desired. This can be mitigated on larger volume websites by decreasing the “given period of time” from a day to something shorter. Additionally, this whole idea could be retooled to calculate these variables on a per-category level. Some sites might have much higher traffic, and therefore more votes, for one category than another.

I’m currently working on implementing the above for a site that I plan on launching in the near future, and I would love feedback and criticism before I do it.
