Precise Scraping with Google Chrome

Developers often search the vast corpus of scraping tools for one that can simulate a full browser. The search is unnecessary: a full browser with extension capabilities is itself a great scraping tool. Among extensions, Google Chrome’s are by far the easiest to develop, while Mozilla’s APIs are less restrictive. Google offers a second way to control Chrome: the Debugger protocol. Unfortunately, the Debugger protocol is pretty slow.

The Google Chrome extension API is an excellent choice for writing an up-to-date scraper that uses a full browser with the latest HTML5 features and performance improvements. In a previous article, we described how to scrape the Microsoft TechNet App-V forum. Now we focus on VMware’s ThinApp forum. In this case, we develop a Google Chrome extension instead of a Python script.

Procedure

You will need Google Chrome, Python 2.7, and lxml.html
Download the code from GitHub
Install the Google Chrome extension
Enter the VMware ThinApp: Discussion Forum
The scraper starts automatically
Once it stops, go to the Google Chrome console and copy and paste the results, in JSON format, into the thinapp.json file
Run the thinapp_parser.py to generate the thinapp.csv file with the results
Open the thinapp.csv file with a spreadsheet
To rank the results, add a column which divides the number of views by the number of days.
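
The ranking in the last step can also be scripted instead of done in a spreadsheet. The following is a minimal sketch in Python 3 (the original scripts target Python 2.7). The column names title, views, and created are assumptions made for illustration, not necessarily the ones thinapp_parser.py writes, and the sample rows are made up.

```python
import csv
import io
from datetime import date

def rank_threads(rows, today):
    """Return rows sorted by views per day, highest first."""
    ranked = []
    for row in rows:
        created = date.fromisoformat(row["created"])
        # Count at least one day so a thread created today does not divide by zero.
        age_days = max((today - created).days, 1)
        ranked.append({**row, "views_per_day": int(row["views"]) / age_days})
    ranked.sort(key=lambda r: r["views_per_day"], reverse=True)
    return ranked

# Hypothetical sample data standing in for thinapp.csv.
SAMPLE = """title,views,created
Thread A,1000,2013-01-01
Thread B,300,2013-03-01
"""

if __name__ == "__main__":
    rows = list(csv.DictReader(io.StringIO(SAMPLE)))
    for r in rank_threads(rows, today=date(2013, 3, 11)):
        print(r["title"], round(r["views_per_day"], 2))
```

Thread B has fewer total views than Thread A but is much younger, so it ranks first. That is the point of dividing by age.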

Our Results: Top Twenty Threads

Registry Isolation…
Thinapp Internet Explorer 10
Process (ifrun60.exe) remains active (Taskmanager) after closing thinapp under windows7 (xp works)
Google Chrome browser
File association not passing file to thinapp package
Adobe CS3 Design Premium and FlexNET woes…
How to thinapp Office 2010?
Size limit of .dat file?
ThinApp Citrix Receiver 3.2
Visio 2010 Thinapp – Licensing issue
Thinapp Google Chrome
Thinapp IE7 running on Windows 7
Adobe CS 6
Failed to open, find, or create Sandbox directory
Microsoft Project and Office issues
No thinapp in thinapp factory + unable to create workpool
IE8 Thinapp crashing with IE 10 installed natively
ThinApp MS project and MS Visio 2010
Difference between ESXi and vSphere and VMware view ??
ThinAPP with AppSense

Acknowledgments

Matias Palomera from Nektra Advanced Computing wrote the code.

Notes

This approach can be used successfully to scrape JavaScript- and AJAX-heavy sites
Instead of copying the JSON data from the Chrome console, you can use the FileSystem API to write the results to a file
You can also write the CSV directly from Chrome instead of using an extra script

Web Scraping for Semi-automatic Market Research

It is easy to scrape the Microsoft TechNet Forums (look at the XML output here: http://social.technet.microsoft.com/Forums/en-US/mdopappv/threads?outputAs=xml) and normalize the resulting information to get a better idea of each thread’s rank based on views and initial publication date. Knowing how issues are ranked can help a company choose what to focus on.

This code was used to scrape Microsoft TechNet’s forums. In the example below we scraped the App-V forum, since App-V is one of the leaders of the application virtualization market, along with VMware ThinApp and Symantec Workspace Virtualization.

These are the top ten threads for the App-V forum:

“Exception has been thrown by the target of an invocation”
Office 2010 KMS activation Error: 0xC004F074
App-V 5 Hotfix 1
Outlook 2010 Search Not Working
Java 1.6 update 17 for Kronos (webapp)
Word 2010 There was a problem sending the command to the program
Utility to quickly install/remove App-V packages
SAP GUI 7.1
The dreaded: “The Application Virtualization Client could not launch the application”
Sequencing Chrome with 4.6 SP1 on Windows 7 x64

The results show how frequently customers have issues virtualizing Microsoft Office, Key Management Services (KMS), SAP, and Java. App-V competitors like Symantec Workspace Virtualization and VMware ThinApp have similar problems. Researching markets this way gives you a good idea of areas where you can contribute solutions.

The scraper stores all the information in a SQLite database. The database can be exported to a UTF-8 CSV file using the csv_App-V.py script. We imported the file into Microsoft Excel and then normalized the ranking of the threads: we divided the number of views by the age of the thread, so threads with more views per day rank higher. The scraper can be used on any Microsoft forum on Social TechNet. Try it out on your favorite forum.
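
The export step can be sketched as follows. This is not the csv_App-V.py script itself: the table name threads, its columns, and the sample rows are assumptions made for illustration, and the sketch uses Python 3 rather than the Python 2.7 of the original code.

```python
import csv
import io
import sqlite3

def export_threads(conn, out):
    """Write every row of the (hypothetical) threads table to out as CSV."""
    cursor = conn.execute(
        "SELECT title, views, created FROM threads ORDER BY views DESC"
    )
    writer = csv.writer(out)
    # Build the header row from the column names reported by the cursor.
    writer.writerow([col[0] for col in cursor.description])
    writer.writerows(cursor)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE threads (title TEXT, views INTEGER, created TEXT)")
    conn.executemany(
        "INSERT INTO threads VALUES (?, ?, ?)",
        [("Thread A", 1200, "2013-01-15"), ("Thread B", 800, "2012-11-02")],
    )
    # For a real file, use: open("App-V.csv", "w", encoding="utf-8", newline="")
    buffer = io.StringIO()
    export_threads(conn, buffer)
    print(buffer.getvalue())
```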

Code

Prerequisites: lxml.html

The code is available in the microsoft-technet-forums-scraping repository on GitHub. It was written by Matias Palomera from Nektra Advanced Computing, who received valuable support from Victor Gonzalez.

Usage

Run scrapper-App-V.py
Then run csv_App-V.py
The results are available in the App-V.csv file

Acknowledgments

Matias Palomera from Nektra Advanced Computing wrote the code.

Notes

This code is single-threaded. You can take a look at our discovering web resources code to optimize it with multithreading.
Microsoft has given scrapers a special gift: it is possible to use the outputAs variable in the URL to get the structured information as XML instead of parsing HTML web pages.
Our articles Distributed Scraping With Multiple Tor Circuits and Running Your Own Anonymous Rotating Proxies show how to implement your own rotating proxy infrastructure with Tor.
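
The outputAs trick mentioned in the notes above can be sketched as a small URL helper. This is an illustrative Python 3 function, not part of the published scraper; the name as_xml_url is made up here, and the structure of the returned XML is not covered.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def as_xml_url(forum_url):
    """Return forum_url with the outputAs=xml query parameter set."""
    parts = urlsplit(forum_url)
    # Keep any query parameters already present and add or override outputAs.
    query = dict(parse_qsl(parts.query))
    query["outputAs"] = "xml"
    return urlunsplit(parts._replace(query=urlencode(query)))

if __name__ == "__main__":
    print(as_xml_url(
        "http://social.technet.microsoft.com/Forums/en-US/mdopappv/threads"
    ))
```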

If you liked this article, you might also like:

Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website

Searching for Sales Leads

The best definition of “marketing” I have read is by Dave Kellogg in the presentation What the CEO Really Thinks of Marketing (And 5 Things You Can Do About It). He says that marketing exists to make sales easier. For example, the search for sales opportunities can be optimized if we pay attention to what our prospects and current customers share on social media. Good corporate blogs offer insight into a company’s aims. The first step in this direction is to discover which web resources a specific company has available. The discovery process is easier for companies than for individuals: individuals use a variety of aliases and alternative identities on the web, while companies with good communication strategies link to all of their web resources from their primary sites.

Discovery

We offer a script which retrieves web resources connected to any company’s URL. With this tool you will no longer waste time manually searching for this useful information. Companies and people usually have a number of associated sites: blogs; LinkedIn accounts; Twitter accounts; Facebook pages; and videos and photos on specialized sites such as YouTube, Vimeo, Flickr, or Picasa. The script crawls one level of links beyond the starting page to locate the associated resources. Large companies such as IBM or Dell have multiple accounts associated with different areas. IBM, for example, has separate Twitter accounts for its research divisions and for important corporate news.

Usage

fwc.py <input.yaml> <output.yaml>

Look at data-science-organizations.yaml for an example.
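
Judging from the code in the Script section, fwc.py expects the input file to contain a list of entries, each with a url to crawl and a key under which the results are grouped. The sketch below, in Python 3 with the standard library only, checks that shape on already-parsed data. The function name check_entries and the sample entry are assumptions; data-science-organizations.yaml remains the authoritative example.

```python
def check_entries(entries):
    """Raise ValueError unless entries is a list of dicts with 'url' and 'key'."""
    if not isinstance(entries, list):
        raise ValueError("expected a list of entries")
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError("entry %d is not a mapping" % i)
        missing = {"url", "key"} - set(entry)
        if missing:
            raise ValueError("entry %d lacks %s" % (i, ", ".join(sorted(missing))))
    return entries

if __name__ == "__main__":
    # Hypothetical entry; in YAML this would read:
    #   - key: Example Org
    #     url: http://www.example.org/
    check_entries([{"key": "Example Org", "url": "http://www.example.org/"}])
    print("input looks valid")
```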

Prerequisites

Python 2.7 (or greater 2.x series)
lxml.html
parse_domain.py
PyYAML

Script

This code is available on GitHub.

fwc.py

#!/usr/bin/python2.7

import argparse
import logging
import yaml
from focused_web_crawler import FocusedWebCrawler
from constraint import Constraint

def main():
   logger = logging.getLogger('data_big_bang.focused_web_crawler')
   ap = argparse.ArgumentParser(description='Discover web resources associated with a site.')
   ap.add_argument('input', metavar='input.yaml', type=str, nargs=1, help ='YAML file indicating the sites to crawl.')
   ap.add_argument('output', metavar='output.yaml', type=str, nargs=1, help ='YAML file with the web resources discovered.')

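   args = ap.parse_args()

   # safe_load refuses arbitrary object tags, which plain yaml.load would construct
   with open(args.input[0], "rt") as f:
      entries = yaml.safe_load(f)

   fwc = FocusedWebCrawler()

   # Every seed entry starts with a fresh depth-0 constraint
   for e in entries:
      e.update({'constraint': Constraint()})
      fwc.queue.put(e)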

   fwc.start()
   fwc.join()

   with open(args.output[0], "wt") as s:
      yaml.dump(fwc.collection, s, default_flow_style = False)

if __name__ == '__main__':
   main()

focused_web_crawler.py

from threading import Thread, Lock
from worker import Worker
from Queue import Queue
import logging

class FocusedWebCrawler(Thread):
   NWORKERS = 10
   def __init__(self, nworkers = NWORKERS):
      Thread.__init__(self)
      self.nworkers = nworkers
      #self.queue = DualQueue()
      self.queue = Queue()
      self.visited_urls = set()
      self.mutex = Lock()
      self.workers = []
      self.logger = logging.getLogger('data_big_bang.focused_web_crawler')
      sh = logging.StreamHandler()
      formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
      sh.setFormatter(formatter)
      self.logger.addHandler(sh)
      self.logger.setLevel(logging.INFO)
      self.collection = {}
      self.collection_mutex = Lock()

   def run(self):
      self.logger.info('Focused Web Crawler launched')
      self.logger.info('Starting workers')
      for i in xrange(self.nworkers):
         worker = Worker(self.queue, self.visited_urls, self.mutex, self.collection, self.collection_mutex)
         self.workers.append(worker)
         worker.start()

      self.queue.join() # Wait until all items are consumed

      for i in xrange(self.nworkers): # send a 'None signal' to finish workers
         self.queue.put(None)

      self.queue.join() # Wait until all workers are notified

#     for worker in self.workers:
#        worker.join()

      self.logger.info('Finished workers')
      self.logger.info('Focused Web Crawler finished')

worker.py

from threading import Thread
from fetcher import fetch
from evaluator import get_all_links, get_all_feeds
from collector import collect
from urllib2 import HTTPError
import logging

class Worker(Thread):
   def __init__(self, queue, visited_urls, mutex, collection, collection_mutex):
      Thread.__init__(self)
      self.queue = queue
      self.visited_urls = visited_urls
      self.mutex = mutex
      self.collection = collection
      self.collection_mutex = collection_mutex
      self.logger = logging.getLogger('data_big_bang.focused_web_crawler')

   def run(self):
      item = self.queue.get()

      while item != None:
         try:
            url = item['url']
            key = item['key']
            constraint = item['constraint']
            data = fetch(url)

            if data == None:
               self.logger.info('Not fetched: %s because type != text/html', url)
            else:
               links = get_all_links(data, base = url)
               feeds = get_all_feeds(data, base = url)
               interesting = collect(links)

               if interesting:
                  self.collection_mutex.acquire()
                  if key not in self.collection:
                     self.collection[key] = {'feeds':{}}

                  if feeds:
                     for feed in feeds:
                        self.collection[key]['feeds'][feed['href']] = feed['type']

                  for service, accounts in interesting.items():
                     if service not in self.collection[key]:
                        self.collection[key][service]  = {}

                     for a,u in accounts.items():
                        self.collection[key][service][a] = {'url': u, 'depth':constraint.depth}
                  self.collection_mutex.release()

               for l in links:
                  new_constraint = constraint.inherit(url, l)
                  if new_constraint == None:
                     continue

                  self.mutex.acquire()
                  if l not in self.visited_urls:
                     self.queue.put({'url':l, 'key':key, 'constraint': new_constraint})
                     self.visited_urls.add(l)
                  self.mutex.release()

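         except HTTPError:
            self.logger.info('HTTPError exception on url: %s', url)
         except Exception:
            # Any other failure (e.g. URLError) must not kill the thread:
            # a dead worker never calls task_done() and queue.join() hangs.
            self.logger.exception('Unexpected error on item: %r', item)
         finally:
            self.queue.task_done()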

         item = self.queue.get()

      self.queue.task_done() # task_done on None
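
The crawler and worker classes above coordinate through a shared queue: the crawler waits for the queue to drain, then sends one None per worker to shut the pool down. The following Python 3 sketch isolates that pattern with the standard library only. The names crawl_all and handle are made up for illustration, and unlike the real workers this version does not enqueue newly discovered links.

```python
import queue
import threading

def crawl_all(items, handle, nworkers=4):
    """Process items with a pool of threads, then stop the workers with None sentinels."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = q.get()
            try:
                if item is None:  # sentinel: no more work
                    return
                value = handle(item)
                with lock:
                    results.append(value)
            except Exception:
                pass  # a real crawler would log the failure here
            finally:
                # Always acknowledge the item, or q.join() would block forever.
                q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(nworkers)]
    for t in threads:
        t.start()
    for item in items:
        q.put(item)
    q.join()  # wait until every real item is processed
    for _ in threads:
        q.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    print(sorted(crawl_all(range(5), lambda n: n * n)))
```

Calling task_done() in a finally clause is the important design choice: every get() is matched by exactly one task_done(), even when handling an item fails.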

fetcher.py

import urllib2
import logging

def fetch(uri):
   fetch.logger.info('Fetching: %s', uri)

   h = urllib2.urlopen(uri)
   try:
      # Only HTML pages are parsed; any other content type is skipped
      if h.headers.type == 'text/html':
         data = h.read()
      else:
         data = None
   finally:
      h.close() # release the connection even if read() fails

   return data

fetch.logger = logging.getLogger('data_big_bang.focused_web_crawler')
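
For readers on Python 3, where urllib2 no longer exists, the same fetch logic might look like the sketch below. It is an adaptation under that assumption, not the published code; the timeout parameter is an addition, and the data: URL in the usage example is there only to keep the demonstration offline.

```python
import urllib.request

def fetch(uri, timeout=10):
    """Return the response body as bytes if it is text/html, else None."""
    with urllib.request.urlopen(uri, timeout=timeout) as response:
        # get_content_type() drops parameters such as "; charset=utf-8".
        if response.headers.get_content_type() == "text/html":
            return response.read()
        return None

if __name__ == "__main__":
    print(fetch("data:text/html,hello"))
    print(fetch("data:text/plain,hello"))
```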

evaluator.py

import lxml.html
import urlparse

def get_all_links(page, base = ''):
   doc = lxml.html.fromstring(page)
   links = map(lambda x: urlparse.urljoin(base, x.attrib['href']), filter(lambda x: 'href' in x.attrib, doc.xpath('//a')))

   return links

def get_all_feeds(page, base = ''):
   doc = lxml.html.fromstring(page)

   feeds = map(lambda x: {'href':urlparse.urljoin(base, x.attrib['href']),'type':x.attrib['type']}, filter(lambda x: 'type' in x.attrib and x.attrib['type'] in ['application/atom+xml', 'application/rss+xml'], doc.xpath('//link')))

   return feeds
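
The two extraction functions above depend on lxml. As a rough illustration of the same idea with only the Python 3 standard library, the sketch below swaps in html.parser. It is a different technique from the XPath queries used by the original, and the names LinkAndFeedParser and extract are made up here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = ("application/atom+xml", "application/rss+xml")

class LinkAndFeedParser(HTMLParser):
    """Collect absolute <a href> links and Atom/RSS <link> feeds."""

    def __init__(self, base=""):
        super().__init__()
        self.base = base
        self.links = []
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        href = attrs.get("href")
        if href is None:
            return  # anchors and links without href carry no URL
        if tag == "a":
            self.links.append(urljoin(self.base, href))
        elif tag == "link" and attrs.get("type") in FEED_TYPES:
            self.feeds.append(
                {"href": urljoin(self.base, href), "type": attrs["type"]}
            )

def extract(page, base=""):
    """Return (links, feeds) found in the HTML string page."""
    parser = LinkAndFeedParser(base)
    parser.feed(page)
    return parser.links, parser.feeds

if __name__ == "__main__":
    html = ('<link type="application/rss+xml" href="/feed">'
            '<a href="about.html">About</a>')
    print(extract(html, base="http://www.example.org/blog/"))
```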

constraint.py

import urlparse
from parse_domain import parse_domain

class Constraint:
   DEPTH = 1
   def __init__(self):
      self.depth = 0

   def inherit(self, base_url, url):
      base_up = urlparse.urlparse(base_url)
      up = urlparse.urlparse(url)

      base_domain = parse_domain(base_url, 2)
      domain = parse_domain(url, 2)

      if base_domain != domain:
         return None

      if self.depth >= Constraint.DEPTH: # only crawl two levels
         return None
      else:
         new_constraint = Constraint()
         new_constraint.depth = self.depth + 1

         return new_constraint
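
The rule encoded by Constraint.inherit can be restated as a plain function: a link is followed only if it stays on the same domain and the current page is not already at the maximum depth. The Python 3 sketch below assumes that parse_domain(url, 2) returns the last two labels of the hostname and replaces it with a crude stand-in, naive_domain, which gives wrong answers for suffixes such as co.uk. Both function names are hypothetical.

```python
from urllib.parse import urlsplit

MAX_DEPTH = 1  # same value as Constraint.DEPTH above

def naive_domain(url, levels=2):
    """Last `levels` labels of the hostname; a crude stand-in for parse_domain."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-levels:])

def next_depth(base_url, url, depth):
    """Return the depth for url, or None if it must not be crawled."""
    if naive_domain(base_url) != naive_domain(url):
        return None  # the link leaves the site
    if depth >= MAX_DEPTH:
        return None  # already at the deepest allowed level
    return depth + 1

if __name__ == "__main__":
    print(next_depth("http://www.example.com/", "http://blog.example.com/post", 0))
    print(next_depth("http://www.example.com/", "http://twitter.com/example", 0))
```

With MAX_DEPTH set to 1, the seed page (depth 0) and the pages it links to (depth 1) are fetched, which matches the "two levels" comment in the original class.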

collector.py

import urlparse
import re

twitter = re.compile(r'^http://twitter\.com/(#!/)?(?P<account>[a-zA-Z0-9_]{1,15})$')

def collect(urls):
   collection = {'twitter':{}}
   for url in urls :
      up = urlparse.urlparse(url)
      hostname = up.hostname

      if hostname == None:
         continue

      if hostname == 'www.facebook.com':
         pass
      elif hostname == 'twitter.com':
         m = twitter.match(url)

         if m:
            gs = m.groupdict()
            if 'account' in gs:
               if gs['account'] != 'share': # this is not an account, although http://twitter.com/#!/share says that this account is suspended.
                  collection['twitter'][gs['account']] = url
      elif hostname == 'www.linkedin.com':
         pass
      elif hostname == 'plus.google.com':
         pass
      elif hostname == 'www.slideshare.net':
         pass
      elif hostname == 'www.youtube.com':
         pass
      elif hostname == 'www.flickr.com':
         pass
      elif hostname[-9:] == '.xing.com':
         pass
      else:
         continue

   return collection
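
To see what the Twitter pattern in collector.py accepts, the matching step can be tried in isolation. The sketch below is written for Python 3 and wraps the pattern in a helper named twitter_account, which is not part of the original module; the account name in the usage example is made up.

```python
import re

TWITTER_ACCOUNT = re.compile(
    r'^http://twitter\.com/(#!/)?(?P<account>[a-zA-Z0-9_]{1,15})$'
)

def twitter_account(url):
    """Return the account name in a Twitter profile URL, or None."""
    m = TWITTER_ACCOUNT.match(url)
    # 'share' is the sharing endpoint, not an account, so it is excluded.
    if m is None or m.group('account') == 'share':
        return None
    return m.group('account')

if __name__ == "__main__":
    for url in ("http://twitter.com/#!/example",
                "http://twitter.com/share",
                "http://twitter.com/example/status/1"):
        print(url, twitter_account(url))
```

The pattern matches only http URLs that point directly at a profile, so https links and links to individual tweets are ignored. Extending it to https would be a one-character change (https?), but that goes beyond what the original code does.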

Further Work

This process can be integrated with a variety of CRM and business intelligence systems such as Salesforce, Microsoft Dynamics, and SAP. These applications provide APIs to retrieve company URLs, which you can crawl with our script.

The discovery process is just the first step in studying your prospective customers and generating leads. Once you have stored the sources of company information it is possible to apply machine learning tools to search for more opportunities.

Data Big Bang Blog

Creativity and Problem Solving for Data Science (whatever it may mean…) | An experimental spin-off from Nektra Advanced Computing
