Enriching a List of URLs with Google Page Rank

Dealing with a large body of web resources can be daunting. You make a list of hundreds of blogs, but how do you share or recall those resources later? You must somehow organize your list. Many people do this with tags, but this is not necessarily the best option. Manual organization is also tedious, so tools for enriching data automatically come in handy. The relevance of different resources also changes over time: what we originally tagged as “breakthrough” may become insignificant.

Last week I saw a friend who had recently started a new job and wanted my opinion about current and future technological trends. I wanted to give him links to the thousands of resources that I have been accumulating over the years, but organized in such a way that he would not have to view them one at a time. This triggered an avalanche of ideas about how to enrich lists of links. My first thought was to rank my list of sites about venture capital and data science using Google Page Rank. I also considered adding the number of tweets, likes, and “+1”s for each site, but these are generally awarded for individual articles, not whole sites. I ended up adding the Google Page Rank using the pagerank project.
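To make the idea concrete, here is a minimal Python sketch of the enrichment step: reduce each link to its site, attach a rank, and sort so the highest-ranked sites come first. It is not the code from the pagerank project. The names `site_of` and `enrich_with_rank` are hypothetical, and the `lookup` argument stands in for whatever rank source you have (the sketch assumes it takes a host name and returns a number, or `None` when the rank is unknown).

```python
from typing import Callable, Iterable, List, Optional, Tuple
from urllib.parse import urlparse


def site_of(url: str) -> str:
    """Reduce a URL to its lowercase host, since the rank is attached per site."""
    # Bare hosts such as "example.org" need a scheme for urlparse to find the host.
    parsed = urlparse(url if "://" in url else "http://" + url)
    return parsed.netloc.lower()


def enrich_with_rank(
    urls: Iterable[str],
    lookup: Callable[[str], Optional[int]],  # hypothetical rank source
) -> List[Tuple[str, Optional[int]]]:
    """Return (site, rank) pairs, highest rank first, unknown ranks last."""
    rows = []
    seen = set()
    for url in urls:
        site = site_of(url)
        if site in seen:
            continue  # several links to the same site count once
        seen.add(site)
        rows.append((site, lookup(site)))
    # Sort key: known ranks before unknown, then descending rank, then name.
    rows.sort(key=lambda row: (row[1] is None, -(row[1] or 0), row[0]))
    return rows
```

Passing the lookup in as a function keeps the sorting logic independent of any particular rank service, so the same list can later be enriched with a different signal.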

The most interesting ideas to explore, though, are in another direction: how to boost items that are in the long tail. The best music may not make the Top 40, and so remains invisible. Algorithms better at recognizing value in the long tail would revolutionise the economy.

The code is available on GitHub. Two examples of the output are available in data-science-bundle and venture-capital-bundle.

See Also

  1. Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website
  2. Exporting StackOverflow users blogs to Excel
  3. Data Science Resources

Data Science Resources

Big Data, Big Trend

Big Data is an important megatrend:

Companies such as Google, Facebook, Twitter and LinkedIn are using their vast stores of information to discover things that are far from obvious and may even challenge our common sense. Some initiatives, like Recorded Future or StreamBase, try to predict the future, while events such as a plane crash were first reported on Twitter. One of the funniest blogs about big data is the one from OKCupid, which mines information about relationship matching and can discover connections between orgasms and exercise.

Sharing Data Science Blogs

The following OPML file contains 228 data science blogs: Data Science Blogs in OPML format.
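If you want to process such a file rather than import it into a feed reader, a short Python sketch like the one below pulls the blog entries out of the OPML. It assumes the file follows the usual OPML convention of `outline` elements carrying `xmlUrl` (feed address), `htmlUrl` (site address) and `title` or `text` attributes. The function name `feeds_from_opml` is made up for this example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional


def feeds_from_opml(opml_text: str) -> List[Dict[str, Optional[str]]]:
    """Extract one record per feed from an OPML document given as text."""
    root = ET.fromstring(opml_text)
    feeds = []
    # iter() walks nested outlines too, so feeds grouped in folders are found.
    for outline in root.iter("outline"):
        feed_url = outline.get("xmlUrl")
        if not feed_url:
            continue  # a folder or category outline, not a feed
        feeds.append(
            {
                "title": outline.get("title") or outline.get("text") or "",
                "feed": feed_url,
                "site": outline.get("htmlUrl"),  # may be missing
            }
        )
    return feeds
```

The `site` values are what you would feed into a ranking step, while the `feed` values are what a reader subscribes to.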