Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website

Searching for Sales Leads

The best definition of “marketing” I have read is by Dave Kellogg in the presentation What the CEO Really Thinks of Marketing (And 5 Things You Can Do About It). He says that marketing exists to make sales easier. For example, the search for sales opportunities can be optimized if we pay attention to what our prospects and current customers share on social media. Good corporate blogs include insightful information about the company’s aims. The first step in this direction is to discover what web resources a specific company has available. The discovery process is easier for companies than for individuals: individuals use a variety of aliases and alternative identities on the web, while companies with good communication strategies link to all of their web resources from their primary sites.


We offer a script that retrieves the web resources connected to any company’s URL. With this tool you will no longer waste time searching for this information manually. Companies and people usually have a number of associated sites: blogs; LinkedIn accounts; Twitter accounts; Facebook pages; and videos and photos on specialized sites such as YouTube, Vimeo, Flickr, or Picasa. Because links to these resources are not always on the home page, the script crawls the site recursively, down to a limited depth, to locate them. Large companies such as IBM or Dell have multiple accounts associated with different areas. IBM, for example, has separate Twitter accounts for its research divisions and for important corporate news.

Usage <input.yaml> <output.yaml>

Look at data-science-organizations.yaml for an example.
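
The input is a YAML list. Judging from the fields the worker reads later (`key` and `url`), each entry needs at least those two. The following is a minimal sketch of that shape as Python data, roughly what loading the YAML file would return. The two organizations and the `validate_entries` helper are hypothetical and are not part of the original script.

```python
# Hypothetical input entries; the names and URLs are placeholders.
# Equivalent YAML:
# - key: Example Org
#   url: http://www.example.org/
# - key: Example Labs
#   url: http://labs.example.com/
entries = [
    {"key": "Example Org", "url": "http://www.example.org/"},
    {"key": "Example Labs", "url": "http://labs.example.com/"},
]

def validate_entries(entries):
    """Keep only the entries that carry a non-empty 'key' and 'url'."""
    valid = []
    for entry in entries:
        if isinstance(entry, dict) and entry.get("key") and entry.get("url"):
            valid.append(entry)
    return valid

print(len(validate_entries(entries)), "usable entries")
```

The `key` becomes the top-level name under which the discovered feeds and accounts are grouped in the output file.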


Requirements

  1. Python 2.7 (or a later 2.x release)
  2. lxml.html
  4. PyYAML


This code is available on GitHub.


import argparse
import sys
from focused_web_crawler import FocusedWebCrawler
import logging
import code
import yaml
from constraint import Constraint

def main():
   logger = logging.getLogger('data_big_bang.focused_web_crawler')
   ap = argparse.ArgumentParser(description='Discover web resources associated with a site.')
   ap.add_argument('input', metavar='input.yaml', type=str, nargs=1, help ='YAML file indicating the sites to crawl.')
   ap.add_argument('output', metavar='output.yaml', type=str, nargs=1, help ='YAML file with the web resources discovered.')

   args = ap.parse_args()

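   # Each input entry is a mapping with at least 'key' and 'url'
   with open(args.input[0], "rt") as f:
      sites = yaml.safe_load(f)

   fwc = FocusedWebCrawler()

   # Seed the queue: every site starts with a fresh constraint (depth 0)
   for e in sites:
      e.update({'constraint': Constraint()})
      fwc.queue.put(e)

   # Run the crawler thread and wait until it has finished
   fwc.start()
   fwc.join()

   with open(args.output[0], "wt") as s:
      yaml.dump(fwc.collection, s, default_flow_style = False)

if __name__ == '__main__':
   main()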

# focused_web_crawler.py
from threading import Thread, Lock
from worker import Worker
from Queue import Queue
import logging

class FocusedWebCrawler(Thread):
   NWORKERS = 10
   def __init__(self, nworkers = NWORKERS):
      Thread.__init__(self)
      self.nworkers = nworkers
      #self.queue = DualQueue()
      self.queue = Queue()
      self.visited_urls = set()
      self.mutex = Lock() # protects visited_urls
      self.workers = []
      self.logger = logging.getLogger('data_big_bang.focused_web_crawler')
      sh = logging.StreamHandler()
      formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
      sh.setFormatter(formatter)
      self.logger.setLevel(logging.INFO)
      self.logger.addHandler(sh)
      self.collection = {}
      self.collection_mutex = Lock() # protects collection

   def run(self):
      self.logger.info('Focused Web Crawler launched')

      self.logger.info('Starting workers')
      for i in xrange(self.nworkers):
         worker = Worker(self.queue, self.visited_urls, self.mutex, self.collection, self.collection_mutex)
         self.workers.append(worker)
         worker.start()

      self.queue.join() # Wait until all items are consumed

      for i in xrange(self.nworkers): # send a 'None signal' to finish workers
         self.queue.put(None)

      self.queue.join() # Wait until all workers are notified

#     for worker in self.workers:
#        worker.join()

      self.logger.info('Finished workers')
      self.logger.info('Focused Web Crawler finished')

# worker.py
from threading import Thread
from fetcher import fetch
from evaluator import get_all_links, get_all_feeds
from collector import collect
from urllib2 import HTTPError
import logging

class Worker(Thread):
   def __init__(self, queue, visited_urls, mutex, collection, collection_mutex):
      Thread.__init__(self)
      self.queue = queue
      self.visited_urls = visited_urls
      self.mutex = mutex
      self.collection = collection
      self.collection_mutex = collection_mutex
      self.logger = logging.getLogger('data_big_bang.focused_web_crawler')

   def run(self):
      item = self.queue.get()

      while item != None: # None is the signal to finish
         try:
            url = item['url']
            key = item['key']
            constraint = item['constraint']
            data = fetch(url)

            if data == None:
               self.logger.info('Not fetched: %s because type != text/html', url)
            else:
               links = get_all_links(data, base = url)
               feeds = get_all_feeds(data, base = url)
               interesting = collect(links)

               if interesting:
                  # The collection is shared by all workers
                  with self.collection_mutex:
                     if key not in self.collection:
                        self.collection[key] = {'feeds':{}}

                     if feeds:
                        for feed in feeds:
                           self.collection[key]['feeds'][feed['href']] = feed['type']

                     for service, accounts in interesting.items():
                        if service not in self.collection[key]:
                           self.collection[key][service]  = {}

                        for a,u in accounts.items():
                           self.collection[key][service][a] = {'url': u, 'depth':constraint.depth}

               for l in links:
                  new_constraint = constraint.inherit(url, l)
                  if new_constraint == None:
                     continue # outside the domain or too deep

                  # Enqueue each URL only once
                  with self.mutex:
                     if l not in self.visited_urls:
                        self.queue.put({'url':l, 'key':key, 'constraint': new_constraint})
                        self.visited_urls.add(l)

         except HTTPError:
            self.logger.info('HTTPError exception on url: %s', url)

         self.queue.task_done()

         item = self.queue.get()

      self.queue.task_done() # task_done on None
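
The crawler and its workers coordinate through a shared queue: the crawler waits until every queued item has been processed, then sends one `None` per worker as a signal to stop. The following is a minimal sketch of that pattern in isolation. It is written for Python 3 (unlike the Python 2 listings), and the `run_pool` function and its `handle` callback are hypothetical names used only for illustration.

```python
import queue
import threading

def run_pool(items, handle, nworkers=4):
    """Process items with worker threads; stop each worker with a None sentinel."""
    q = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = q.get()
            try:
                if item is None:       # sentinel: no more work for this worker
                    return
                value = handle(item)
                with results_lock:     # the results list is shared between threads
                    results.append(value)
            finally:
                q.task_done()          # balance every get() with a task_done()

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads:
        t.start()

    for item in items:
        q.put(item)
    q.join()                           # wait until every item has been processed

    for _ in threads:
        q.put(None)                    # one sentinel per worker
    for t in threads:
        t.join()
    return results

print(sorted(run_pool(["a", "bb", "ccc"], len)))
```

In the crawler the workers also put newly found links on the same queue. That is compatible with `join()`, because a worker enqueues the new links before it calls `task_done()` for the page it is working on.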

# fetcher.py
import urllib2
import logging

def fetch(uri):
   # Return the page body, or None when the resource is not served as text/html
   fetch.logger.info('Fetching: %s', uri)

   h = urllib2.urlopen(uri)
   if h.headers.type == 'text/html':
      data = h.read()
   else:
      data = None

   return data

fetch.logger = logging.getLogger('data_big_bang.focused_web_crawler')
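
The fetcher only keeps responses served as `text/html`, so images, PDFs, and other files are never parsed. For readers on Python 3, the same check can be sketched with `urllib.request`. The `fetch_html` name is hypothetical, and the `data:` URLs in the usage lines are only there so that the example runs without network access.

```python
import urllib.request

def fetch_html(uri):
    """Return the response body as bytes if it is served as text/html, else None."""
    with urllib.request.urlopen(uri) as response:
        if response.headers.get_content_type() == "text/html":
            return response.read()
        return None

# data: URLs carry their own media type, which makes the check easy to try offline
html = fetch_html("data:text/html,%3Cp%3Ehi%3C/p%3E")
text = fetch_html("data:text/plain,hi")
print(html, text)
```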

# evaluator.py
import lxml.html
import urlparse

def get_all_links(page, base = ''):
   # Absolute URLs of every <a href="..."> on the page
   doc = lxml.html.fromstring(page)
   links = [urlparse.urljoin(base, a.attrib['href']) for a in doc.xpath('//a') if 'href' in a.attrib]

   return links

def get_all_feeds(page, base = ''):
   # Atom and RSS feeds advertised through <link> elements
   doc = lxml.html.fromstring(page)
   feed_types = ['application/atom+xml', 'application/rss+xml']

   # Skip <link> elements without an href instead of failing on them
   feeds = [{'href': urlparse.urljoin(base, x.attrib['href']), 'type': x.attrib['type']}
            for x in doc.xpath('//link')
            if 'href' in x.attrib and x.attrib.get('type') in feed_types]

   return feeds
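
To see what the two extraction functions produce without installing lxml, here is a rough Python 3 sketch of the same idea built on the standard library's `html.parser`. It swaps in a different parser from the one the script uses, and the `LinkAndFeedParser` class and `extract` function are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = ("application/atom+xml", "application/rss+xml")

class LinkAndFeedParser(HTMLParser):
    """Collect absolute link URLs from <a> and feed URLs from <link>."""
    def __init__(self, base=""):
        super().__init__()
        self.base = base
        self.links = []
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if not href:
            return                     # nothing to resolve
        if tag == "a":
            self.links.append(urljoin(self.base, href))
        elif tag == "link" and attrs.get("type") in FEED_TYPES:
            self.feeds.append({"href": urljoin(self.base, href), "type": attrs["type"]})

def extract(page, base=""):
    parser = LinkAndFeedParser(base)
    parser.feed(page)
    parser.close()
    return parser.links, parser.feeds

page = ('<html><head><link rel="alternate" type="application/rss+xml" href="/feed/"></head>'
        '<body><a href="/about">About</a><a name="top">no href</a></body></html>')
links, feeds = extract(page, base="http://www.example.com/")
print(links, feeds)
```

Relative URLs are resolved against the page's own address, which is why the crawler passes the fetched URL as `base`.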

# constraint.py
from parse_domain import parse_domain

class Constraint:
   DEPTH = 1 # maximum depth of a page that may still be expanded
   def __init__(self):
      self.depth = 0

   def inherit(self, base_url, url):
      # Return the constraint for a link found on base_url,
      # or None if the link must not be crawled
      base_domain = parse_domain(base_url, 2)
      domain = parse_domain(url, 2)

      if base_domain != domain: # stay inside the site's own domain
         return None

      if self.depth >= Constraint.DEPTH: # only crawl two levels
         return None

      new_constraint = Constraint()
      new_constraint.depth = self.depth + 1

      return new_constraint
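
The constraint answers one question for each link: should the crawler follow it? A link is followed only if it stays on the same domain and the depth limit has not been reached. The following Python 3 sketch shows that decision on its own. `naive_domain` is a hypothetical stand-in for `parse_domain`, whose exact behavior is assumed here: it keeps the last two labels of the hostname, which is wrong for suffixes such as `.co.uk`.

```python
from urllib.parse import urlparse

MAX_DEPTH = 1  # same value as Constraint.DEPTH in the listing

def naive_domain(url, levels=2):
    """Hypothetical stand-in for parse_domain: last `levels` labels of the hostname."""
    hostname = urlparse(url).hostname or ""
    return ".".join(hostname.split(".")[-levels:])

def next_depth(depth, base_url, url):
    """Return the depth for `url`, or None if the link should not be followed."""
    if naive_domain(base_url) != naive_domain(url):
        return None            # the link leaves the site's domain
    if depth >= MAX_DEPTH:
        return None            # depth limit reached
    return depth + 1

print(next_depth(0, "http://www.example.com/", "http://blog.example.com/post"))
print(next_depth(1, "http://blog.example.com/post", "http://www.example.com/team"))
print(next_depth(0, "http://www.example.com/", "http://twitter.com/example"))
```

Links that leave the domain are not followed, but they are not lost: the worker passes every link of a fetched page to `collect` before it applies the constraint, so a Twitter or Facebook link found on the company site is still recorded.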

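# collector.py
import urlparse
import re

# Assumption: the host part of this pattern is not given; the prefix below is a
# reconstruction for twitter.com that also accepts the old "#!/" profile URLs.
twitter = re.compile(r'^https?://twitter\.com/(#!/)?(?P<account>[a-zA-Z0-9_]{1,15})$')

def collect(urls):
   # Map each supported service to the accounts found among the given URLs
   collection = {'twitter':{}}
   for url in urls :
      up = urlparse.urlparse(url)
      hostname = up.hostname

      if hostname == None:
         continue

      # Branches with an empty hostname are placeholders for other services
      if hostname == '':
         pass
      elif hostname == 'twitter.com': # assumed hostname, see the pattern above
         m = twitter.match(url)

         if m:
            gs = m.groupdict()
            if 'account' in gs:
               if gs['account'] != 'share': # 'share' is not an account, although its #!/share page says that the account is suspended
                  collection['twitter'][gs['account']] = url
      elif hostname == '':
         pass
      elif hostname == '':
         pass
      elif hostname == '':
         pass
      elif hostname == '':
         pass
      elif hostname == '':
         pass
      elif hostname[-9:] == '':
         pass

   return collection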
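
The collector recognizes an account by matching each link against a pattern for the service's profile URLs. The following Python 3 sketch shows the idea for Twitter only. The pattern is an assumption (it also accepts `www.` and a trailing slash), the `twitter_accounts` function is a hypothetical name, and the account in the usage lines is a placeholder.

```python
import re

# Assumed pattern for profile URLs, including the old "#!/" form
TWITTER_ACCOUNT = re.compile(
    r"^https?://(?:www\.)?twitter\.com/(?:#!/)?(?P<account>[A-Za-z0-9_]{1,15})/?$"
)
NOT_ACCOUNTS = {"share"}  # path that looks like an account but is not one

def twitter_accounts(urls):
    """Map account name to profile URL for every URL that looks like a profile."""
    accounts = {}
    for url in urls:
        m = TWITTER_ACCOUNT.match(url)
        if m and m.group("account").lower() not in NOT_ACCOUNTS:
            accounts[m.group("account")] = url
    return accounts

found = twitter_accounts([
    "http://twitter.com/#!/example_co",
    "https://twitter.com/share",
    "http://twitter.com/example_co/status/1",
    "http://www.example.com/about",
])
print(found)
```

Anchoring the pattern at both ends keeps deeper paths, such as individual tweets, from being mistaken for accounts.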

Further Work

This process can be integrated with a variety of CRM and business intelligence applications such as Salesforce, Microsoft Dynamics, and SAP. These applications provide APIs to retrieve company URLs, which you can then crawl with our script.

The discovery process is just the first step in studying your prospective customers and generating leads. Once you have stored the sources of company information, you can apply machine learning tools to them to search for more opportunities.

See Also

  1. Enriching a List of URLs with Google Page Rank
  2. Integrating Google Analytics into your Company Loop with a Microsoft Excel Add-on


  1. Sales process
  2. Sales process engineering
  3. Microsoft Dynamics API
  4. Salesforce API
  5. SAP API
  6. SugarCRM Developer Zone

Automatically Tracking Events with Google Analytics, jQuery and jsUri

Pragmatic Code

Google Analytics can track user events on a web page. Since multiple interactions can take place on a single page, tracking code is essential to log them, and it is also needed to track clicks on links to external sites. This article shows a code snippet which automates the insertion of that tracking code: instead of adding it manually one tag at a time, we bind it to the click event automatically. We opted not to use the plugins for libraries such as jQuery or for applications such as WordPress, so as to keep full control over the process.

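We use jsUri to parse URIs. It is the most robust library we found for the job, and a library is needed because JavaScript implementations sadly do not include a URI parsing function (only a trick, described in the first link at the end of this article).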

This is how we implemented it on our Data Big Bang blog to track clicks to other sites:

<!-- Inside <head> -->
<script type='text/javascript' src=''></script>
<script type='text/javascript' src=''></script>

<!-- After <body> -->
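<script type="text/javascript">
	// Track click on hyperlinks to external sites
	$(document).ready(function() {
		$('a').click(function(event) {
			var target = event.target;
			var uri = new Uri(target);
			// Assumption: the hostnames compared here are not given;
			// the page's own hostname stands in for the blog's domain.
			if(uri.host() != '' && uri.host() != window.location.hostname) {
				//alert('Match!'); // Only for debugging
				_gaq.push(['_trackEvent', 'UI', 'Click', target.toString(), 0, true]);
			}
		});
	});
</script>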

This is how we configure WordPress to load the libraries automatically, by editing functions.php in the theme folder:

if( !is_admin()) {
	wp_deregister_script('jquery');
#	Avoid retrieving the jQuery libs from Google's CDN, since Google domains can be blocked in countries like China.
#	wp_register_script('jquery', (""), false, "1.7.0");
	wp_register_script('jquery', (''), false, '1.7.0');
	wp_enqueue_script('jquery');

	wp_register_script('jsuri', (""), false, "1.1.1");
	wp_enqueue_script('jsuri');
}

This is how we implemented it on our secure coupon codes generator site to track clicks on a rich web application.

<!-- Inside <head> -->
<script type='text/javascript' src=''></script>
<script type='text/javascript' src=''></script>	

<!-- After <body> -->
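<script type="text/javascript">
$(document).ready(function() {
	// Add Event Trackers
	$('a').click(function(event) {
		var target = event.target;
		var uri = new Uri(target.href);

		// Assumption: the hostname compared here is not given;
		// the page's own hostname stands in for the site's domain.
		if(uri.host() == window.location.hostname) {
			//alert('match link');

			_gaq.push(['_trackEvent', 'UI', 'Click', target.href, 0, true]);
		}
	});

	$('button').click(function(event) {
		var target = event.target;
		//alert('match button');

		_gaq.push(['_trackEvent', 'UI', 'Click', target.innerText, 0, true]);
	});
});
</script>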


  1. How do I parse a URL into hostname and path in javascript?
  2. Event Tracking Guide
  3. Is Google’s CDN for jQuery available in China?