Archive | May, 2012
What is Web 3.0? A review of the ICWSM

What is Web 3.0?

Just when you started to get used to the idea of Web 2.0, the trend setters on the web start tossing around the term Web 3.0. Can’t they just let us get used to one version before pushing another one on us? Fortunately Web 3.0 is in its infancy, and nobody is certain of what it really means at this point. The direction of the internet and its users is a fickle thing. It’s hard to predict what they are going to want and use. The next big thing is all about the right idea, in the right place, at the right time.

I’ve been attending the International Conference on Weblogs and Social Media (ICWSM) for the last couple of days and listening to some of these visionaries talk. While there certainly is no consensus about what the future holds, there are definitely some key trends to take note of. I decided to go this year because it happened to be local for me. Normally I wouldn’t have taken the time off work or incurred the travel expenses just to go to a conference like this, but by some bizarre twist of fate they decided to hold it in my home town. Given my personal work in blog research, going was a no-brainer.

The topics presented at the conference were fairly diverse, ranging from sentiment analysis to gender bias in blogs. There were three invited speakers and a number of academics presenting published papers. I got far more out of the invited speakers than the papers, although a few of the papers sparked my interest. The three invited speakers were Danah Boyd (a prominent blogger), Andrew Tomkins (a researcher at Yahoo!), and Evan Williams (founder of Twitter). Here are some of the highlights from my notes.

Danah Boyd – “MySpace is *my* space”

In the talk she gave us a brief history of social networking starting with Friendster and went on to discuss a number of specifics about MySpace.

  • 93% of American teenagers have access to the Internet.
  • 55% of online American teens aged 12-18 have a profile on a social network site (or at least that many are willing to admit it in front of their parents).
  • 91% use it to talk to friends while very few use it to talk to strangers.
  • The number one spot in the “top 8” is dependent on culture. Some cultures dictate that it should be your significant other, some your best same-sex friend, and others say family members like your cousins.
  • 66% of teenagers use MySpace in a private fashion in order to avoid marketeers and adults (parents).
  • Teenagers think something is fishy on a page without ads. They suspect at some point they will be asked to pay for something if they aren’t being force fed advertisement.
  • Gadgetry has broken the gender barrier. More teen girls have game consoles, iPods, and trendy cell phones now.
  • If parents are flipping out, they know it’s going to be fun!

In addition to the key points about MySpace, Danah also discussed the direction that she sees social networking going. The future to her is about mobile devices. We already see the prevalence of SMS in Asian cultures, and it is slowly becoming a phenomenon here in the US with the introduction of sites like Twitter. The problem is how this move will take place given the current atmosphere in the mobile market. The major cell phone carriers control every piece of software on their phones and every byte of data that goes over their network. There is no self-interest for them to open these platforms and networks up for development. They want to be in control so they can utilize the technology to make as much money as possible. Because of the competitive nature, there is not likely to be any major cooperation between the carriers, which will stunt social networking’s growth.

Andrew Tomkins – “Social Media, Storage, and Data Analysis”

Andrew is a PhD member of the Yahoo! Research team. His talk focused on several different topics, including the evolution of search, and Flickr.

Interesting things being done with search:

  • Moving away from the “10 results” model and devoting more real estate to targeted results.
  • Search is adding in shortcuts to specific verticals. For example, integrating weather and movie times right into the search results.
  • Google Co-Op allows content providers to integrate their results in your Google search.
  • The current approach that is used is naive, based mostly on regular expressions and filtering for target words.


Is search solved?

  • Search is really good at finding content on relatively open and static pages. Search does not integrate results where content might be buried in forums or other social network sites.
  • The data sits behind walled gardens and is generally unavailable. There is lots of money still to be made in search, but it will require a large capital investment and structured deals with the owners of the content to let search inside their walls.
  • For example, the Yahoo! Answers data is fully crawlable but the several billion posts in Yahoo! Groups are locked up from the outside.

Andrew also showed off a demo of a timeline-based view of Flickr tags. This can be seen over at Taglines. He also mentioned that they are developing a “game” that users can “play” that will actually generate metadata and tags for Flickr images.

Evan Williams – “The Evolution of a Social Media Platform: Twitter”

Evan is the founder of Blogger, Odeo, and Twitter. He is one of the leading entrepreneurs in social media and brings a unique perspective to the conference. Most of his presentation talked about Twitter and the things they are doing. Twitter started just 9 months ago, and went through a complete revamp in November of 2006. Over the last few months the number of “tweets” being sent into Twitter has doubled month over month.

  • 2/3rds of the inbound tweets come from the web and IM, the remaining 1/3rd come from SMS
  • They are currently working on an API that will soon be released
  • While they don’t have hard data, their informal research indicates that the core demographic of Twitter is 30-year-old web geeks, not teenagers as you might expect.

Evan also presented a quote from Lisa Reichelt about ambient intimacy that describes Twitter.

Ambient intimacy is about being able to keep in touch with people with a level of regularity and intimacy that you wouldn’t usually have access to, because time and space conspire to make it impossible. Flickr lets me see what friends are eating for lunch, how they’ve redecorated their bedroom, their latest haircut. Twitter tells me when they’re hungry, what technology is currently frustrating them, who they’re having drinks with tonight.

There was also a quick demo of Twittervision. This is a site developed by a 3rd party that integrates Google maps with Twitter. The result is that users who send a location code with their tweet (l:Boulder, CO for example) will be shown on the map display.

Summary

While the future is far from certain, one thing is for sure: the future of social media and “web 3.0” will be focused on two key areas, mobility and search. Developers need to come up with better ways to get at the information you need and make it simple to do from mobile devices. While some predict the death of sites like Twitter, I think they are ground-breakers in their field. When blogging becomes a commodity that is approachable to anyone, and all of that data is well organized and searchable, that is when we can say that web 3.0 has arrived.

Optimizing WordPress and LAMP to survive the Digg effect

Anyone who has used Digg for any amount of time has certainly come across a dead link. Invariably you click on the comment link hoping someone has posted a mirror of the content, then head over to duggmirror or Coral Cache to view it. During the peak time right after an article hits the Digg homepage, a given site might receive as many as 100 page loads per minute. If the page is hosted on WordPress or another MySQL-based blog platform, and on shared hosting as well, it’s a recipe for a crash. Your site will go down in flames and you won’t have a chance of recovering until the load drops back to a lower level.

Recently I had an article hit the front page of Digg. I was lucky because I had already done a number of things to optimize my server so that I survived. My page never went down, and was very responsive for the entire duration of getting dugg. My site is hosted on WordPress on a server that I own and manage. The server hosts about 15 different domains, none of which are exceptionally high traffic sites. The server isn’t anything special either. It’s a single processor P4 3.0GHz with 512MB of memory running CentOS 4.4, Apache 2.0, PHP5, MySQL4. I do have webalizer running on my Apache logs and I have the output of those posted at http://vallery.net/stats/ if you would like to see what the Digg effect can do.

I never exceeded a load average of 0.15 during the Digg. Load average is a measure of how busy a Unix-based system is: roughly, the average number of processes running or waiting to run. As a rule of thumb, the maximum load average you can reasonably sustain equals the number of processors. If the load average exceeds this number, then your system is overutilized. In my case the maximum is 1, since this is a single-processor system. This means I was using only about 15% of my system’s capacity at the peak load generated by the Digg.
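The rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not something I ran on the server at the time; the function names are made up for the example, and `os.getloadavg()` is available on Unix-like systems only.

```python
import os

def load_per_cpu(load_average, cpu_count):
    """Return the load average as a fraction of the sustainable maximum."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count

def current_load_per_cpu():
    """Read the 1-minute load average and scale it by the processor count."""
    one_minute, _five, _fifteen = os.getloadavg()  # Unix only
    cpus = os.cpu_count() or 1                     # cpu_count() may return None
    return load_per_cpu(one_minute, cpus)
```

With my numbers, `load_per_cpu(0.15, 1)` gives 0.15, which is the 15% figure quoted above. A value above 1.0 would mean the machine has more work queued than it can keep up with.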

In order to optimize my system I have done the following things:

1) I’ve installed the WordPress 2.0 plugin wp-cache. This is something that every WordPress blog should have installed. Especially on a shared hosting environment, it will dramatically increase your ability to handle high traffic. The plugin generates the HTML for a given page and then saves it in a cache file. When someone accesses your page, instead of fetching the content from the database it uses the already generated HTML in the cache to send to the browser. This eliminates a number of fetches to the DB and dramatically speeds up page load times.
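To make the idea concrete, here is a minimal Python sketch of page caching. It is not wp-cache’s actual code (that plugin is PHP and writes cache files to disk); the class name, the in-memory dictionary, and the expiry setting are assumptions chosen to keep the example short.

```python
import time

class PageCache:
    """Keep rendered HTML so repeat requests skip the database."""

    def __init__(self, render, max_age_seconds=3600):
        self._render = render            # expensive function: url -> HTML
        self._max_age = max_age_seconds  # how long a cached page stays valid
        self._pages = {}                 # url -> (html, time stored)

    def get(self, url, now=None):
        now = time.time() if now is None else now
        cached = self._pages.get(url)
        if cached is not None and now - cached[1] < self._max_age:
            return cached[0]             # cache hit: no database work
        html = self._render(url)         # cache miss: build the page once
        self._pages[url] = (html, now)
        return html
```

The first visitor to a page pays the cost of the database queries; everyone who arrives before the entry expires gets the stored HTML. During a traffic spike nearly every request is for the same page, which is why this helps so much.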

2) Optimize Apache/MySQL to handle the expected number of database queries. The most common error you see on a WordPress page that has been taken down by the Digg effect is a failed database connection or a timeout. When connecting to a MySQL database you can either keep a persistent connection or create a new connection for each request. There are several schools of thought as to which method is better, but generally speaking persistent connections use more memory, so for me a new connection per request makes more sense. A lot of PHP blog platforms, including WordPress, use these non-persistent connections as well.

The rub comes in the maximum number of connections. MySQL’s my.cnf file sets the maximum number of connections that can be opened to the database. If each Apache process (which in turn represents a concurrent visitor to your page) has its own connection, you can quickly exceed this maximum. Apache similarly has a configuration option, set in httpd.conf, that dictates the maximum number of requests it will serve concurrently. If the Apache number is larger than the MySQL number (which is the case in most default configurations), then as soon as more Apache processes are running than MySQL is willing to accept, you get our nice “maximum number of connections” error message.

In /etc/my.cnf you want to set the variable “max_connections”. I recommend something reasonably high like 250.
In your httpd.conf file for Apache you want to set the variable “MaxClients”. This should typically be the same number that you selected for MySQL.
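A quick way to check that the two settings agree is a small script like the Python sketch below. The function names are hypothetical, and the parsing is deliberately naive: it takes the first uncommented occurrence of each setting and assumes every Apache process holds at most one MySQL connection.

```python
import re

def read_setting(text, name):
    """Return the integer value of `name` from config text, or None if absent."""
    # Accepts "max_connections = 250" (my.cnf) and "MaxClients 250" (httpd.conf).
    # Only the first uncommented occurrence is used.
    pattern = rf"^\s*{re.escape(name)}\s*=?\s*(\d+)\s*$"
    match = re.search(pattern, text, flags=re.MULTILINE)
    return int(match.group(1)) if match else None

def apache_can_exhaust_mysql(httpd_conf, my_cnf):
    """True when Apache may open more connections than MySQL allows."""
    max_clients = read_setting(httpd_conf, "MaxClients")
    max_connections = read_setting(my_cnf, "max_connections")
    if max_clients is None or max_connections is None:
        raise ValueError("both MaxClients and max_connections must be set")
    return max_clients > max_connections
```

Pass in the contents of the two files as strings. A result of `True` means a traffic spike can push Apache past the MySQL limit and produce the connection error described above.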

3) Set reasonable Apache timeouts. This means that no individual thread/connection can monopolize the system and bring all the other requests to their knees. This protects you from rogue “edge” cases.

In your httpd.conf file set “Timeout” to a low number like 30 (measured in seconds) and “KeepAliveTimeout” to something like 3.

4) If your site makes it onto Digg, use real-time monitoring tools to measure your server’s health. There are a number of command-line tools available to help in this regard. The first I would recommend is the tried and true “top” command. This will display the processes running on your machine along with their associated memory and CPU usage. Keep an eye on it and make sure things aren’t getting out of hand. The second tool that I use is called “Tcptrack”. Tcptrack will need to be installed on your machine, but once it is, it will give you a real-time view of your incoming connections and bandwidth usage.

Happy Digging!

Scalable story promotion

I had some thoughts on the idea of scalable story promotion for the open source Pligg system, and I thought I would share them here.

I’ve been thinking about the best way to handle promoting a story/article from a queued status onto the main page and I’ve had a few thoughts I wanted to share with everyone.

The current scheme for promotion is very simple: a story is promoted once its number of votes passes a threshold defined in the config file (and the story is fresher than X days). While this will work for very low volume sites, it doesn’t exactly scale well. On a site with a large number of users, and one would assume a proportionately large number of votes, this breaks down.

Our ideal system would be able to quickly determine that a story is of high value and promote it to the homepage based on the frequency of votes as compared to a large sample population. Using some basic statistics we can determine if a story is an “above average performer” and promote it quickly.

In order to accomplish this we need to take into consideration several variables, including:

1) Number of stories submitted in a given time period
2) The average number of votes over a given time period for all “active” stories in the queue
3) The standard deviation of votes over a given time period for all “active” stories in the queue
4) The target number of stories to be promoted to the home page in a given time period

For the purposes of explaining my ideas I am going to define the time period as one day. We need to calculate and store the answers to some of these questions on a regular basis. Ideally we create a new DB table that stores this information for easy lookup. This way we can also track the information statistically and show trends. I plan on implementing a cron job that runs daily, calculates the required values, and stores them as follows.

1) Calculating the number of stories submitted is trivial. The SQL query I am using is:

select count(*) from links where link_status = 'queued' and date(link_date) = '2006-04-16';

2&3) Calculating the average and standard deviation can be done with the following query:

select avg(link_votes) as 'average', stddev(link_votes) as 'stdev' from links where link_status = 'queued' and date(link_date) = '2006-04-16';
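For anyone who wants to check the numbers outside the database, the same two statistics can be sketched in Python. The function name is made up for the example. It uses the population standard deviation, which is what MySQL’s STDDEV() returns.

```python
import statistics

def vote_statistics(vote_counts):
    """Return (average, standard deviation) for a day's queued stories."""
    if not vote_counts:
        raise ValueError("need at least one story")
    average = statistics.mean(vote_counts)
    # pstdev is the population standard deviation, matching MySQL's STDDEV().
    stddev = statistics.pstdev(vote_counts)
    return average, stddev
```

For example, eight stories with 2, 4, 4, 4, 5, 5, 7 and 9 votes have an average of 5 votes and a standard deviation of 2.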

4) This can be set in the config.php file and up to the site administrator.

These values will be used to “predict” the future for our new stories. Each story will have a new variable that stores a floating point number. This number is the number of standard deviations above or below the mean (average). We need an additional check running on a much more frequent interval (I plan on using every 5 minutes) to update the items in the database with their new “score” and promote them once they pass a specific threshold.

The calculation uses the following variables:

// $score: Z score that indicates whether a story is above or below the mean
// $numofvotes: the number of votes that a given story has received
// $stddev: standard deviation for the given time period as calculated above
// $average: average number of votes for the given time period as calculated above
// $numberofstories: the number of stories submitted in the given time period as calculated above
// $desirednumofstories: setting from config.php

Each story’s score is its Z score, and the cut-off value is a “score” threshold, or essentially a percentile, that a story must reach in order to be promoted. I plan on calculating the score as follows:

$score = ($numofvotes - $average) / $stddev;
or in SQL

select link_id, ((link_votes - $average) / $stddev) from links where link_status = 'queued' and date(link_date) = '2006-04-16';

Once we have this score we need to decide if this story needs to be promoted or not. This is done by first calculating where in the rank order this story is likely to fall for the day using some basic probability statistics.

$rank = round($numberofstories / (1 + exp(-1.7 * $score)), 1);
Now we just check and see if that rank falls above our threshold and promote the story accordingly.

$desiredrank = $numberofstories - $desirednumofstories;
if ($rank >= $desiredrank) {
    // Update story and set to promoted
}
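To see how the pieces fit together, here is a Python sketch of the whole check. It is an illustration of the formulas above rather than Pligg code. The function names, the handling of a zero standard deviation, and the clamp on the score are my own assumptions.

```python
import math

def estimated_rank(votes, average, stddev, number_of_stories):
    """Estimate how many of the day's stories this one outperforms."""
    if stddev == 0:
        # No spread in the data: treat every story as average (an assumption).
        return round(number_of_stories / 2, 1)
    score = (votes - average) / stddev     # Z score
    score = max(-40.0, min(40.0, score))   # keep exp() within floating point range
    # Logistic approximation of the normal CDF, as in the formula above.
    return round(number_of_stories / (1 + math.exp(-1.7 * score)), 1)

def should_promote(votes, average, stddev, number_of_stories, desired_stories):
    """True when the estimated rank reaches the promotion cut-off."""
    desired_rank = number_of_stories - desired_stories
    rank = estimated_rank(votes, average, stddev, number_of_stories)
    return rank >= desired_rank
```

Take a day with 100 stories, an average of 5 votes, a standard deviation of 2, and a target of 10 promoted stories, so the cut-off rank is 90. A story with 7 votes (one standard deviation above the mean) gets an estimated rank of about 84.6 and stays in the queue. A story with 9 votes (two standard deviations above) gets about 96.8 and is promoted.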

This method should work well assuming that traffic is fairly stable from day to day. Since we are using the previous day’s data to predict the current day’s volumes, a sudden traffic spike will mean that a larger number of stories is promoted than desired. This can be mitigated on larger volume websites by decreasing the “given period of time” from a day to something shorter. Additionally, this whole idea could be retooled to calculate these variables on a per-category level. Some sites might have much higher traffic, and therefore more votes, for one category than another.

I’m currently working on implementing the above for a site that I plan on launching in the near future, and I would love feedback and criticism before I do it.

Setting up an automated workflow to convert files for Apple TV on OS X

With the arrival of my Apple TV yesterday I needed a solution to get my Xvid/WMV/DivX files converted and imported into iTunes so that I can watch them. I already have QuickTime Pro, which with the recent release added the ability to “Export to Apple TV”. Since I have a lot of files, and no desire to sit around and convert them one at a time, it seemed like a perfect job for Automator. I figured someone out there had to have done something similar, so I did a bit of Google searching and found the required Automator actions. Using those actions combined with the sample workflow that comes with them, it is trivial to set up a workflow that will convert to the Apple TV format and then import the file into your iTunes library. With a slight modification you can set it up as a plug-in and attach it to a folder action. Now I have a simple drop folder on my desktop that launches QuickTime Pro, converts the file to an Apple TV viewable format, imports the file into iTunes, and cleans up after itself.

Here is a simple step-by-step guide to walk you through what I did:

1) Install xvid, divx, and wmv codecs.

These can be found here:

Divx
Xvid
WMV

2) Install the automator actions for compressing and importing into iTunes.

Download Quick Time Compression Actions and Workflow

3) Once installed you will have a directory on your desktop called “QuickTime Compression Workflow resources”. In this folder you will find a sample automator workflow called “Convert videos and add to iTunes”. Open this workflow in automator.

4) Delete the first step of the workflow which is “Ask for Finder Items”. Instead of being prompted for which items to convert, we want to setup a folder action that will automatically convert the files dropped in our folder.

5) Add a new first step to the workflow called “Get Selected Finder Items”. This action can be found under the “Finder” application.

6) Under the “Compress QuickTime Using Most Recent Settings” step change “Choose directory for converted files” to the desktop (or any other temporary folder you want to use).

7) Create a new folder on your desktop. This will be your drop folder, so call it something relevant. I called mine “Convert to AppleTV”.

8) Back in Automator, click on File and choose “Save as plug-in”, then choose “Folder Actions” from the “plug-in for” drop down. Give the plug-in the same name as your folder. Select your newly created folder for the “Attached to folder” option. Click save.

9) Since the script will convert whatever file you drop in your conversion folder using the last settings you used in QuickTime you’ll need to launch QuickTime with a test file and then choose “Export” from the file menu. Assuming you have the most recent version of QuickTime Pro you should have an option “Export Movie to Apple TV”.

10) That’s it!!! Now just close out of everything and drop your files into your new folder and watch as they are converted and imported to iTunes. It works great to leave your Mac on and then drop a bunch of files in the folder before you go to bed. When you get back to your Mac in the morning everything should be all ready to go.

To find out more about folder actions, check this page out:

Folder Actions
