Using Queues in Web Crawling and Analysis Infrastructure

Message-oriented middleware (MOM) is a key technology for implementing a custom pipeline for analyzing unstructured data. The pipeline that goes from crawling web pages to part-of-speech (PoS) tagging and beyond is long. It requires a variety of processes implemented in several programming languages and running on different operating systems. For example, boilerpipe is an excellent Java library for extracting main text content, while PoS tagging libraries such as NLTK and FreeLing are available from Python.

One might be tempted to integrate the different technologies using web services, but web services alone have many weak points. If the pipeline has ten processes and, for example, the last one fails, the intermediate results can be lost unless they are persisted. A higher-level mechanism must be in place to resume pipeline processing. MOMs persist each message until a consumer acknowledges that it has finished processing it.
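The acknowledgement idea can be sketched in a few lines of JavaScript. This is an in-memory toy, not a real MOM: the `AckQueue` class and its method names are hypothetical, and a real broker would also write messages to disk so they survive a restart.

```javascript
// Toy in-memory queue with acknowledgements (illustration only; AckQueue and
// its methods are hypothetical names, and nothing is persisted to disk here).
class AckQueue {
  constructor() {
    this.ready = [];            // messages waiting for a consumer
    this.unacked = new Map();   // delivered but not yet acknowledged
    this.nextId = 1;
  }

  publish(body) {
    this.ready.push({ id: this.nextId++, body });
  }

  // Hand the oldest message to a consumer, but keep a copy until it is acked.
  receive() {
    const message = this.ready.shift();
    if (message === undefined) return null;
    this.unacked.set(message.id, message);
    return message;
  }

  // The consumer finished its work: the message can be forgotten.
  ack(id) {
    return this.unacked.delete(id);
  }

  // The consumer failed or disappeared: put the message back at the front.
  requeue(id) {
    const message = this.unacked.get(id);
    if (message === undefined) return false;
    this.unacked.delete(id);
    this.ready.unshift(message);
    return true;
  }
}

const queue = new AckQueue();
queue.publish('http://example.com/page-1');

const first = queue.receive();   // a worker takes the message...
queue.requeue(first.id);         // ...and fails before finishing

const retry = queue.receive();   // the same message is delivered again
queue.ack(retry.id);             // this time the work completes
console.log(retry.body, queue.ready.length, queue.unacked.size);
```

The point of the sketch is that a message leaves the system only on `ack`, so a failed stage never silently drops work.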

There are many MOMs to choose from, including commercial products and free open source ones. Some features are present in almost all of them, while others are not. Contention management is an important feature if, as is likely, messages are produced much faster than they are consumed. For example, a web crawler can fetch web pages at an incredibly high speed, while processes like content extraction take longer. Running a message queue without contention management under these circumstances will eventually exhaust the machine’s memory.
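One simple form of contention management is to bound the queue and refuse new messages when it is full. The sketch below assumes that strategy; `BoundedQueue` and `maxSize` are hypothetical names, and real MOMs may instead throttle producers or page messages to disk.

```javascript
// Bounded queue sketch: refuse new messages once a limit is reached so a fast
// producer cannot exhaust memory. BoundedQueue and maxSize are hypothetical.
class BoundedQueue {
  constructor(maxSize) {
    this.maxSize = maxSize;
    this.items = [];
  }

  // Returns false instead of growing without limit; the producer must slow
  // down (for example, pause crawling) and retry later.
  offer(item) {
    if (this.items.length >= this.maxSize) return false;
    this.items.push(item);
    return true;
  }

  // Returns the oldest item, or null when the queue is empty.
  poll() {
    return this.items.length > 0 ? this.items.shift() : null;
  }
}

const pages = new BoundedQueue(3);
let accepted = 0;
let rejected = 0;

// The crawler produces five pages before the extractor consumes any.
for (let i = 1; i <= 5; i++) {
  if (pages.offer('page-' + i)) accepted++; else rejected++;
}
console.log(accepted, rejected);

const done = pages.poll();                 // the extractor finishes one page
const acceptedAfterPoll = pages.offer('page-6');
console.log(done, acceptedAfterPoll);
```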

While MOMs are important for uniting heterogeneous technologies, each process must also know which queue to consume its input from and which queue to write its output to for the next phase. A new wave of frameworks such as NServiceBus, Resque, Celery, and Octobot has emerged to handle this.
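The wiring problem can be made concrete with a small routing table. This is only a sketch of the idea, not how any of the frameworks above work: the stage names, queue names, and `validatePipeline` are all made up for illustration.

```javascript
// Hypothetical routing table: each stage names the queue it reads from and the
// queue it writes to. Queue and stage names are invented for illustration.
const stages = [
  { name: 'crawl',   input: null,        output: 'raw-html' },
  { name: 'extract', input: 'raw-html',  output: 'main-text' },
  { name: 'tag',     input: 'main-text', output: 'tagged-text' },
];

// Check that every stage consumes a queue that some earlier stage produces.
function validatePipeline(pipeline) {
  const produced = new Set();
  for (const stage of pipeline) {
    if (stage.input !== null && !produced.has(stage.input)) {
      throw new Error('Stage "' + stage.name + '" reads from unknown queue "' + stage.input + '"');
    }
    produced.add(stage.output);
  }
  return true;
}

console.log(validatePipeline(stages));
```

Keeping this table in one place means a stage can be rewritten in another language without the other stages noticing, as long as it keeps reading and writing the same queues.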

In conclusion, MOMs help connect heterogeneous technologies and add robustness, and they are very useful when processing unstructured information, as in text analysis. Many MOMs are available, but no single one has a complete feature set. However, some of the missing features can be supplied by frameworks such as NServiceBus, Resque, Celery, and Octobot.

See Also

  1. Esoteric Queue Scheduling Disciplines
  2. Persisting Native Python Queues
  3. Adding Acknowledgement Semantics to a Persistent Queue

Resources

  1. Message Queues vs Web Services
  2. Message Queue Evaluation Notes
  3. The Hadoop Map-Reduce Capacity Scheduler
  4. Contention Management in the WSA
  5. Message Queuing Architectures

Integrating Google Analytics into your Company Loop with a Microsoft Excel Add-on

 

Introduction

Google Analytics and AdWords are essential marketing and sales tools. They can be integrated with the ubiquitous Microsoft Excel through the Google Data API. Data Big Bang’s Nicolas Papagna has developed an Excel add-on which can be downloaded here. This add-on lets Excel users quickly retrieve Google Analytics data using the available Google Analytics metrics and dimensions, and the results can be sorted by the user’s criteria. One advantage of our solution is that Excel accesses the Google Analytics API directly instead of going through a Data Big Bang server. Other solutions need access to your information, which exposes your private data to third parties.

Installation and Usage

  1. Download GoogleAnalyticsToExcel.AddInSetup_1.0.20.0.exe.
  2. Install it.
  3. Run Microsoft Excel.
  4. Configure your Google credentials by clicking on “Settings” under the “Google Analytics to Excel Addin” ribbon tab.
  5. Customize your query and retrieve your Google Analytics data by clicking the “Query Google Analytics” button.

Development Notes

Data Big Bang’s research team has also developed an OData web service that can be consumed by applications such as PowerPivot, Tableau and LINQPad. This web service doesn’t require any add-ons. Unfortunately, neither PowerPivot nor Tableau offers a query builder for interacting with OData providers, so users must know how to craft the OData query URL themselves. The most interesting part of this project was developing a Google Data Protocol to Open Data Protocol .NET class that offers an IQueryable interface to convert LINQ queries to GData. LINQ queries add a lot of expressive power beyond GData.
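To give an idea of what crafting an OData query URL by hand involves, here is a minimal sketch. The service root, the `Visits` entity set, and the property names are hypothetical and do not describe the actual Data Big Bang service; `$select`, `$filter`, `$orderby` and `$top` are standard OData system query options.

```javascript
// Build an OData query URL by hand. The service root, entity set and property
// names are hypothetical; the $-prefixed names are standard OData options.
function buildODataUrl(serviceRoot, entitySet, options) {
  const parts = [];
  for (const name of ['$select', '$filter', '$orderby', '$top']) {
    if (options[name] !== undefined) {
      // Percent-encode the value so spaces and commas are safe in the URL.
      parts.push(name + '=' + encodeURIComponent(String(options[name])));
    }
  }
  const query = parts.length > 0 ? '?' + parts.join('&') : '';
  // Drop any trailing slashes from the root before appending the entity set.
  return serviceRoot.replace(/\/+$/, '') + '/' + entitySet + query;
}

const url = buildODataUrl('https://example.com/odata/', 'Visits', {
  $select: 'Country,Visits',
  $filter: "Country eq 'Argentina'",
  $orderby: 'Visits desc',
  $top: 10,
});
console.log(url);
```

A URL built this way can be pasted into a tool that accepts an OData feed address.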

See Also

  1. Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website
  2. Integrating Dropbox with Microsoft Outlook
  3. Exporting StackOverflow users blogs to Excel Hyperlinks

Automatically Tracking Events with Google Analytics, jQuery and jsUri

Pragmatic Code

Google Analytics can track user events on a web page. This article shows a code snippet which automates the insertion of tracking code. Instead of adding tracking codes manually one tag at a time, we bind the code to the click event automatically. We opt not to use plugins for libraries such as jQuery or for applications such as WordPress, so as to have full control over the process. Since multiple interactions can take place on a single page, it is essential to add tracking codes to log user interactions. Tracking codes are also needed to track clicks on links to external sites.

jsUri is the most robust library for parsing URIs, since JavaScript implementations sadly do not include a parsing function (only a trick is available).

This is how we implemented it on our Data Big Bang blog to track clicks to other sites:

<!-- Inside <head> -->
<script type='text/javascript' src='http://www.databigbang.com/js/jquery-1.7.min.js?ver=1.7.0'></script>
<script type='text/javascript' src='http://www.databigbang.com/js/jsuri-1.1.1.min.js?ver=1.1.1'></script>

<!-- After <body> -->
<script type="text/javascript">
	// Track click on hyperlinks to external sites
	$(document).ready(function() {
		$('a').click(function() {
			// "this" is the <a> element the handler is bound to;
			// event.target could be a child element such as an <img>
			var href = this.href;
			var uri = new Uri(href);
			if(uri.host() !== 'www.databigbang.com' && uri.host() !== 'blog.databigbang.com') {
				//alert('Match!'); // Only for debugging
				_gaq.push(['_trackEvent', 'UI', 'Click', href, 0, true]);
			}
		});
	});
</script>

This is how we configure WordPress to load the libraries automatically: edit functions.php in the theme folder.

if( !is_admin()) {
	wp_deregister_script('jquery');

#	Avoid retrieving jquery libs from ajax.googleapis.com since Google domains can be blocked in countries like China.
#	wp_register_script('jquery', ("http://ajax.googleapis.com/ajax/libs/jquery/1/jquery.min.js"), false, "1.7.0");
	wp_register_script('jquery', ('http://www.databigbang.com/js/jquery-1.7.min.js'), false, '1.7.0');

	wp_enqueue_script('jquery');

	wp_deregister_script('jsuri');
	wp_register_script('jsuri', ("http://www.databigbang.com/js/jsuri-1.1.1.min.js"), false, "1.1.1");
	wp_enqueue_script('jsuri');
}

This is how we implemented it on our secure coupon codes generator site to track clicks in a rich web application:

<!-- Inside <head> -->
<script type='text/javascript' src='http://www.databigbang.com/js/jquery-1.7.min.js?ver=1.7.0'></script>
<script type='text/javascript' src='http://www.databigbang.com/js/jsuri-1.1.1.min.js?ver=1.1.1'></script>	

<!-- After <body> -->
<script type="text/javascript">
	$(document).ready(function() {
		// Add event trackers
		$('a').click(function() {
			// "this" is the <a> element the handler is bound to
			var href = this.href;
			var uri = new Uri(href);

			if(uri.host() === 'www.securecouponcodes.com') {
				//alert('match link');
				_gaq.push(['_trackEvent', 'UI', 'Click', href, 0, true]);
			}
		});

		$('button').click(function() {
			//alert('match button');
			// jQuery's text() also works in browsers that lack innerText
			_gaq.push(['_trackEvent', 'UI', 'Click', $(this).text(), 0, true]);
		});
	});
</script>
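The host checks in the snippets above rely on jsUri. Current browsers and Node.js also ship a built-in `URL` class, so the same check can be sketched without a library. This is an alternative under that assumption, not what the snippets above use; `isExternalLink` and `ownHosts` are hypothetical names.

```javascript
// Decide whether a link points outside our own sites using the built-in URL
// class (an alternative to jsUri, not what the original snippets use).
// isExternalLink and ownHosts are hypothetical names.
function isExternalLink(href, ownHosts) {
  let host;
  try {
    host = new URL(href).hostname;
  } catch (error) {
    return false; // not an absolute URL, so treat it as internal
  }
  return ownHosts.indexOf(host) === -1;
}

const ownHosts = ['www.databigbang.com', 'blog.databigbang.com'];
console.log(isExternalLink('http://www.databigbang.com/about', ownHosts));
console.log(isExternalLink('http://example.org/', ownHosts));
```

Inside a click handler, the `href` property of the anchor element is already absolute, so it can be passed to such a helper directly.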

Resources

  1. How do I parse a URL into hostname and path in javascript?
  2. Event Tracking Guide
  3. Is Google’s CDN for jQuery available in China?