Enriching a List of URLs with Google Page Rank

Dealing with a large body of web resources can be daunting. You make a list of hundreds of blogs, but how do you share or recall those resources later? You must somehow organize your list. Many people do this with tags, but this is not necessarily the best option. Manual organization is also tedious, so tools for enriching data automatically came in handy. The relevance of different resources changes over time. What we originally tagged as “breakthrough” may come insignificant.

Last week I saw a friend who had recently started a new job and wanted my opinion about current and future technological trends. I wanted to give him links to thousands of resources that I have been accumulating over the years, but organized in such a way that he would not have to view them one at a time. This triggered an avalanche of ideas about how to enrich lists of links. My first thought was to rank my list of sites about venture capital and data science using Google Page Rank. I also considered adding the number of tweets, likes, and “+1” for each site but these are generally awarded for individual articles, not whole sites. I ended up adding the Google Page Rank with project pagerank.

The most interesting ideas to explore, though, are in another direction: how to boost items that are in the long tail. The best music may not make the Top 40, and so remains invisible. Algorithms better at recognizing value in the long tail would revolutionise the economy.

The code is available on github. Two examples of the output are available on data-science-bundle and venture-capital-bundle.

See Also

  1. Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website
  2. Exporting StackOverflow users blogs to Excel
  3. Data Science Resources