Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website

Searching for Sales Leads

The best definition of “marketing” I have read is by Dave Kellog in the What the CEO Really Thinks of Marketing (And 5 Things You Can Do About It) presentation. He says that marketing exists to make sales easier. For example, the process of searching for sales opportunities can be optimized if we pay attention to what our prospectives and current customers are sharing on different social media. Good corporate blogs include insightful information about the company’s aims. The first step in this direction is to discover what web resources a specific company has available. The discovery process is easier for companies than for individuals. Individuals uses a variety of aliases and alternative identities on the web. while companies with good communication strategies provide links to all of their web resources on their primary sites.

Discovery

We offer a script which retrieves web resources connected to any company’s URL. With this tool you will no longer waste time manually searching for this useful information. Companies and people usually have a number of associated sites: blogs; LinkedIn accounts; Twitter accounts; Facebook pages; and videos and photos on specialized sites such as YouTube, Vimeo, Flickr, or Picassa. A recursive level of page crawling is needed to retrieve the location of associated resources. Large companies such as IBM or Dell have multiple accounts associated with different areas. IBM has different Twitter accounts for their research divisions and for the important corporate news.

Usage

fwc.py <input.yaml> <output.yaml>

Look at data-science-organizations.yaml for an example.

Prerequisites

Python 2.7 (or greater 2.x series)
lxml.html
parse_domain.py
PyYAML

Script

This code is available at github.

fwc.py

#!/usr/bin/python2.7

import argparse
import sys
from focused_web_crawler import FocusedWebCrawler
import logging
import code
import yaml
from constraint import Constraint

def main():
   logger = logging.getLogger('data_big_bang.focused_web_crawler')
   ap = argparse.ArgumentParser(description='Discover web resources associated with a site.')
   ap.add_argument('input', metavar='input.yaml', type=str, nargs=1, help ='YAML file indicating the sites to crawl.')
   ap.add_argument('output', metavar='output.yaml', type=str, nargs=1, help ='YAML file with the web resources discovered.')

   args = ap.parse_args()

   input = yaml.load(open(args.input[0], "rt"))

   fwc = FocusedWebCrawler()

   for e in input:
      e.update({'constraint': Constraint()})
      fwc.queue.put(e)

   fwc.start()
   fwc.join()

   with open(args.output[0], "wt") as s:
      yaml.dump(fwc.collection, s, default_flow_style = False)

if __name__ == '__main__':
   main()

focused-web-crawler.py

from threading import Thread, Lock
from worker import Worker
from Queue import Queue
import logging

class FocusedWebCrawler(Thread):
   NWORKERS = 10
   def __init__(self, nworkers = NWORKERS):
      Thread.__init__(self)
      self.nworkers = nworkers
      #self.queue = DualQueue()
      self.queue = Queue()
      self.visited_urls = set()
      self.mutex = Lock()
      self.workers = []
      self.logger = logging.getLogger('data_big_bang.focused_web_crawler')
      sh = logging.StreamHandler()
      formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
      sh.setFormatter(formatter)
      self.logger.addHandler(sh)
      self.logger.setLevel(logging.INFO)
      self.collection = {}
      self.collection_mutex = Lock()

   def run(self):
      self.logger.info('Focused Web Crawler launched')
      self.logger.info('Starting workers')
      for i in xrange(self.nworkers):
         worker = Worker(self.queue, self.visited_urls, self.mutex, self.collection, self.collection_mutex)
         self.workers.append(worker)
         worker.start()

      self.queue.join() # Wait until all items are consumed

      for i in xrange(self.nworkers): # send a 'None signal' to finish workers
         self.queue.put(None)

      self.queue.join() # Wait until all workers are notified

#     for worker in self.workers:
#        worker.join()

      self.logger.info('Finished workers')
      self.logger.info('Focused Web Crawler finished')

worker.py

from threading import Thread
from fetcher import fetch
from evaluator import get_all_links, get_all_feeds
from collector import collect
from urllib2 import HTTPError
import logging

class Worker(Thread):
   def __init__(self, queue, visited_urls, mutex, collection, collection_mutex):
      Thread.__init__(self)
      self.queue = queue
      self.visited_urls = visited_urls
      self.mutex = mutex
      self.collection = collection
      self.collection_mutex = collection_mutex
      self.logger = logging.getLogger('data_big_bang.focused_web_crawler')

   def run(self):
      item = self.queue.get()

      while item != None:
         try:
            url = item['url']
            key = item['key']
            constraint = item['constraint']
            data = fetch(url)

            if data == None:
               self.logger.info('Not fetched: %s because type != text/html', url)
            else:
               links = get_all_links(data, base = url)
               feeds = get_all_feeds(data, base = url)
               interesting = collect(links)

               if interesting:
                  self.collection_mutex.acquire()
                  if key not in self.collection:
                     self.collection[key] = {'feeds':{}}

                  if feeds:
                     for feed in feeds:
                        self.collection[key]['feeds'][feed['href']] = feed['type']

                  for service, accounts in interesting.items():
                     if service not in self.collection[key]:
                        self.collection[key][service]  = {}

                     for a,u in accounts.items():
                        self.collection[key][service][a] = {'url': u, 'depth':constraint.depth}
                  self.collection_mutex.release()

               for l in links:
                  new_constraint = constraint.inherit(url, l)
                  if new_constraint == None:
                     continue

                  self.mutex.acquire()
                  if l not in self.visited_urls:
                     self.queue.put({'url':l, 'key':key, 'constraint': new_constraint})
                     self.visited_urls.add(l)
                  self.mutex.release()

         except HTTPError:
            self.logger.info('HTTPError exception on url: %s', url)

         self.queue.task_done()

         item = self.queue.get()

      self.queue.task_done() # task_done on None

fetcher.py

import urllib2
import logging

def fetch(uri):
   fetch.logger.info('Fetching: %s', uri)
   #logger = logging.getLogger('data_big_bang.focused_web_crawler')
   print uri

   h = urllib2.urlopen(uri)
   if h.headers.type == 'text/html':
      data = h.read()
   else:
      data = None

   return data

fetch.logger = logging.getLogger('data_big_bang.focused_web_crawler')

evaluator.py

import lxml.html
import urlparse

def get_all_links(page, base = ''):
   doc = lxml.html.fromstring(page)
   links = map(lambda x: urlparse.urljoin(base, x.attrib['href']), filter(lambda x: 'href' in x.attrib, doc.xpath('//a')))

   return links

def get_all_feeds(page, base = ''):
   doc = lxml.html.fromstring(page)

   feeds = map(lambda x: {'href':urlparse.urljoin(base, x.attrib['href']),'type':x.attrib['type']}, filter(lambda x: 'type' in x.attrib and x.attrib['type'] in ['application/atom+xml', 'application/rss+xml'], doc.xpath('//link')))

   return feeds

constraint.py

import urlparse
from parse_domain import parse_domain

class Constraint:
   DEPTH = 1
   def __init__(self):
      self.depth = 0

   def inherit(self, base_url, url):
      base_up = urlparse.urlparse(base_url)
      up = urlparse.urlparse(url)

      base_domain = parse_domain(base_url, 2)
      domain = parse_domain(url, 2)

      if base_domain != domain:
         return None

      if self.depth >= Constraint.DEPTH: # only crawl two levels
         return None
      else:
         new_constraint = Constraint()
         new_constraint.depth = self.depth + 1

         return new_constraint

collector.py

import urlparse
import re

twitter = re.compile('^http://twitter.com/(#!/)?(?P[a-zA-Z0-9_]{1,15})$')

def collect(urls):
   collection = {'twitter':{}}
   for url in urls :
      up = urlparse.urlparse(url)
      hostname = up.hostname

      if hostname == None:
         continue

      if hostname == 'www.facebook.com':
         pass
      elif hostname == 'twitter.com':
         m = twitter.match(url)

         if m:
            gs = m.groupdict()
            if 'account' in gs:
               if gs['account'] != 'share': # this is not an account, although http://twitter.com/#!/share says that this account is suspended.
                  collection['twitter'][gs['account']] = url
      elif hostname == 'www.linkedin.com':
         pass
      elif hostname == 'plus.google.com':
         pass
      elif hostname == 'www.slideshare.net':
         pass
      elif hostname == 'www.youtube.com':
         pass
      elif hostname == 'www.flickr.com':
         pass
      elif hostname[-9:] == '.xing.com':
         pass
      else:
         continue

   return collection

Further Work

This process can be integrated with a variety of CRM and business intelligence processes like Salesforce, Microsoft Dynamics, and SAP. These applications provide APIs to retrieve company URLs which you can crawl with our script.

The discovery process is just the first step in studying your prospective customers and generating leads. Once you have stored the sources of company information it is possible to apply machine learning tools to search for more opportunities.

Resources

Helping Search Engines to Find Content in the Invisible Web

Discovering Hidden Web Resources

Search engines and social networks are digital telescopes. It is extremely difficult and time consuming to find web resources outside of their lens. It’s a search craft. Our intuition knows that there are interesting invisible information but we can’t touch it.

IMDB contains a lot of information about users but the site only offers sharing as a collateral feature. If we search on Google we can’t find all the users sharing their movie rankings. At the time of writing of this article the query: site:imdb.com inurl:”user/*/ratings” was returning a few results on Google. How we can help people, through search engines, to find more web resources? This article shows the first 10 million results of the Distributed Scraping With Multiple Tor Circuits process. In a short time Google will index this article and include these new resources so everyone can find them.

In the meantime you have the great honor to see web resources that are invisible for search engines. These page contains the first 10 million of IMDB users sharing their movie’s ratings. We have included a script below to get their ratings taking advantage of the comma separated value export offered by IMDB.

Python Code for Exporting IMDB Ratings in Comma Separated Values

get-user-ratings.py

#!/usr/bin/python2.7

import pymongo
import urllib2

MONGODB_HOSTNAME = 'localhost'

HTML = """
<html>
<body>
{0}
</body>
</html>
"""

EXPORT_URL = "http://www.imdb.com/list/export?list_id=ratings&author_id={0}"

def main():
   conn = pymongo.Connection(MONGODB_HOSTNAME, 27017)
   db = conn.scraping
   coll = db.imdb.ratings

   items = coll.find({'last_response':200})

   links = ""

   i = 0
   for item in items:
      url = item['url']
      index = 'ur{0:07}'.format(item['index'])
      filename = 'ur{0}.csv'.format(item['index'])
      links += "<a href='{0}'>{1}</a><br>".format(url, index)

     with open(filename, "wt") as h:
        h.write(urllib2.urlopen(EXPORT_URL.format(index)).read())

   print HTML.format(links)

if __name__ == '__main__':
   main()

Resources

Photo taken by gari.baldi

Running Console Applications with Invisible Windows

Hiding Console Application Windows

With the simple free source code and executable application below you can launch and run multiple console applications using cmd.exe without displaying the console window. This code eliminates the need to build a service, and is a useful complement to the Distributed Scraping With Multiple Tor Circuits article for those running Tor under the Microsoft Windows operating system. It includes a batch file to enable you to run multiple Tor proxies. Other companies charge for a similar application.

Application

You can download the application hideconsole.exe here.

Code

This is the simple code. If you want to download the entire Visual Studio 2010 project or fork it please use the github project.

// hideconsole.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include "windows.h"
#include <string>
#include <iostream>
#include <stdio.h>

namespace std
{
#ifdef _UNICODE
	typedef wstring tstring;
	#define tout std::wcout
#else
	typedef string tstring;
	#define tout std::cout
#endif
}
using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
	if(argc == 1) {
		//wprintf(_T("Usage: %s <cmd> [<parameter1>..<parametern>]"));
		tout << "Usage: " << argv[0] << " <cmd> [<parameter1>..<parametern>]" << endl;
		return 0;
	}

	tstring file = argv[1];
	tstring parameters;

	for(int i=2; i < argc; i++) {
		if( i != 2)
			parameters.append(_T(" "));
		parameters.append(argv[i]);
	}

	ShellExecute(NULL, _T("open"), file.c_str(), parameters.c_str(), NULL, SW_HIDE);
	tout << "Running cmd = " << file << endl;
	tout << "Arguments = " << parameters << endl;

	return 0;
}

Batch for Running Multiple Tor Instances

mytor.bat

@echo off
echo %3%
IF EXIST data\tor%3 GOTO DATASUBDIREXISTS
	mkdir data\tor%3
:DATASUBDIREXISTS
.\hideconsole.exe c:\windows\system32\cmd.exe /k """"""c:\Program Files (x86)\tor\tor.exe""" --RunAsDaemon 1 --CookieAuthentication 0 --HashedControlPassword """" --SocksListenAddress 192.168.0.178 --ControlPort %2 --PidFile tor%3.pid --SocksPort %1 --DataDirectory data\tor%3""""

start-multiple-tor.bat

@echo off
setlocal enabledelayedexpansion

set /A base_socks_port=9050
set /A base_control_port=8118
set /A idx=0

IF EXIST data GOTO DATAEXISTS
mkdir data

:DATAEXISTS

FOR /L %%i IN (1,1,80) DO (
	call .\mytor !base_socks_port! !base_control_port! !idx!
	set /A base_socks_port+=1
	set /A base_control_port+=1
	set /A idx+=1
	rem echo !idx!
)

Note

The multiple quotes that you see in the batch file are necessary since we are escaping quotes three times.

Resources

Image by william

Data Big Bang Blog

Creativity and Problem Solving for Data Science (whatever it may mean…) | An experimental spin-off from Nektra Advanced Computing

Menu

Monthly Archives: January 2012

Helping Search Engines to Find Content in the Invisible Web

Discovering Hidden Web Resources

Python Code for Exporting IMDB Ratings in Comma Separated Values

get-user-ratings.py

Resources

Running Console Applications with Invisible Windows

Hiding Console Application Windows

Application

Code

Batch for Running Multiple Tor Instances

mytor.bat

start-multiple-tor.bat

Note

Resources

Menu

Searching for Sales Leads

Discovery

Usage

Prerequisites

Script

fwc.py

focused-web-crawler.py

worker.py

fetcher.py

evaluator.py

constraint.py

collector.py

Further Work

See Also

Resources

Discovering Hidden Web Resources

Python Code for Exporting IMDB Ratings in Comma Separated Values

get-user-ratings.py

Resources

Hiding Console Application Windows

Application

Code

Batch for Running Multiple Tor Instances

mytor.bat

start-multiple-tor.bat

Note

Resources