Content Clouds

ken's picture

Web 2.0 is here, and with it Tag Clouds. In this article I look at improving Tag Clouds.

What are Tag Clouds? Tag Clouds are a simple visualisation depicting the key tags used in a webpage (or website).

What are Tags? As Web 2.0 arrived, increasing amounts of user-generated content emerged - wikis, blogs, discussion forums, photo and video exchanges. Traditionally content is classified by the publisher or a cataloguer - if I write a book, it will be assigned a subject classification (a Dewey Decimal number, for example) according to the relevant field of the book. Libraries can then position my book in an appropriate place, so it can be easily found alongside related texts. This is a Taxonomy. Since the emergence of Web 2.0, this method has largely given way on the web to Folksonomy. Ignoring the problem of quality checking, with masses of user-generated content, who's going to be responsible for classifying it? Web 2.0's answer - we all are! As we create content we tag it, and sometimes people come along behind us and also tag it.

This leads to a community consensus describing the content - 'democracy' in motion. If enough people call something a spade, then a spade it will be. The intuitive simplicity of this approach seems attractive - especially with millions of willing participants unwittingly becoming librarians. The downside, in my view, is how easily abusers can stack the ballot box. In the unmoderated free world of Web 2.0's internet, there is a plethora of users happy to sell their vote, and happy to fake tag an article or picture. If enough people say this article is about 'fish', then it is forever about 'fish' - regardless of the true content. Perhaps I am overestimating the extent of the problem, but if there were a better way of classifying articles then why not use it?

Regular visitors to this site will have noticed the new 'cloud' on the right. It looks like a tag cloud, but it isn't - it's actually a content cloud, specific to each article. Follow the links to search the rest of the site for other related articles. Creating the content cloud is automatic, based entirely on the content of the article - without needing a community to tag it. We can still tag articles, but the content cloud provides an alternative means of classification. We have, however, kept the simple 'cloud' visualisation, just generating it differently. So how is it done?

Well, firstly it depends on word frequencies - based upon the assumption that the words used frequently in a text reflect the contents of the text. (You can't describe the stock exchange by using words like 'stoat' or 'badger'!) Word frequencies are not enough though, as clearly words like 'the' and 'a' are likely to be common in all articles. Here some clever stats come into play, with an algorithm selecting the words that are significantly overused in the article. This is done by comparing the relative frequency of each word in the article with the relative frequency of the same word in the British National Corpus - a 100,000,000 word sample of standard English, which is taken to reflect the expected frequency of each word. The words can then be ranked by a log-likelihood score, which measures how surprising it is for a word to appear as frequently as it did - and a content cloud is generated from the top-ranked words.
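
To make the ranking step concrete, here is a minimal sketch in Python. It is not the code running on this site: it assumes the usual two-corpus log-likelihood (G2) statistic for comparing an observed count with the count expected from a reference corpus, and the names log_likelihood, content_cloud, ref_counts and ref_total are hypothetical, chosen for illustration. The reference counts would come from a corpus frequency list such as one built from the British National Corpus.

```python
import math
import re
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 score for a word seen `a` times in an article of `c` words
    and `b` times in a reference corpus of `d` words."""
    # Counts we would expect if the word were equally common in both texts.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def content_cloud(text, ref_counts, ref_total, top_n=10):
    """Return the top_n overused words as (word, score) pairs, best first.

    ref_counts maps word -> count in the reference corpus (assumed input);
    ref_total is the total number of words in that corpus.
    """
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    scored = []
    for word, a in counts.items():
        b = ref_counts.get(word, 0)
        # Keep only words that are relatively MORE frequent in the article
        # than in the reference corpus (overused, not underused).
        if a / total <= b / ref_total:
            continue
        scored.append((word, log_likelihood(a, b, total, ref_total)))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]
```

Calling content_cloud(article_text, ref_counts, ref_total) gives the words for the cloud, highest score first; the scores could then be mapped onto font sizes. A word that is just as common in the article as in the reference corpus scores zero, which is why 'the' and 'a' tend to drop out on articles of a reasonable length.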

There is still quite a bit of work to do to finish this project, but for now, it's a clever implementation - I think!


Comments

Implementing our own clouds

So this content cloud, is that just some script we can copy and paste into our websites? Where'd you get it?

I'm also kind of curious whether you can specify where it reads for content. It would be nice if you can have little clouds for sections of a page.

---Ralph van den Berg
visit RalphvandenBerg.com

As yet no - it's not a little script... We are thinking about making it into a module for Drupal, which could then easily be added to other sites, but right now it still needs some tweaking. The biggest potential problem is the database it runs off - as it compares frequencies with the British National Corpus, it needs a large database of word frequencies drawn from the corpus's 100 million words - the database has over 250,000 rows! Passing that database around might take a fair amount of bandwidth on many sites...

Where'd I get it? Well, I coded it! ;) Due thanks go to rburns for his great input, though! The ideas behind it really stem from a discussion some time back (with tobyonline), and have slowly been implemented since.

As for specifying regions, right now no... But this is just the first implementation, which I plan to improve. Actually the results improve with more content - so sections of a page might not work so well, but sections of the site would work well - the next step will be to create sitewide content clouds, and also content clouds for specific menu sections. Imho there is great potential for the project, and your input and thoughts are very welcome.

I have a small offline version which has some more flexibility - pm me if you want to see it.

Ken
