September 3, 2011

Articles Summary

This page lists all Data Big Bang blog articles, grouped by subject.

IR

Articles grouped by information retrieval stage, along with other data science topics.

Fetching

  1. Distributed Scraping With Multiple Tor Circuits
  2. Running Your Own Anonymous Rotating Proxies

Cleaning/Tidying

  1. HTML Cleaners and Tidiers

Parsing

Handling of Active Content

  1. Web Scraping Ajax and Javascript Sites

Main Content Extraction

  1. Extraction of Main Text Content Using the Google Reader NoAPI
  2. Voice Recognition + Content Extraction + TTS = Innovative Web Browsing

Language Identification

  1. Language Identification for Text Mining and NLP

Security

  1. Automated Browserless OAuth Authentication for Twitter
  2. The Python POPO’s Way to Integrate PayPal Instant Payment Notification

APIs and NoAPIs

  1. Google Search NoAPI
  2. Exporting StackOverflow users blogs to Excel Hyperlinks
  3. Extraction of Main Text Content Using the Google Reader NoAPI
  4. Integrating Google Analytics into your Company Loop with a Microsoft Excel Add-on

Policies and Data Issues

  1. Scraping vs Antiscraping
  2. The Data Portability Fact Sheet

Entrepreneurship

  1. Ideas and Execution Magic Chart
  2. Ideas: Egont, A Web Orchestration Language
  3. Egont Part II

Marketing and Sales

  1. Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website

Plugins

  1. Integrating Google Analytics into your Company Loop with a Microsoft Excel Add-on

Big Data Stack

  1. Using Queues in Web Crawling and Analysis Infrastructure
  2. Persisting Native Python Queues
  3. Adding Acknowledgement Semantics to a Persistent Queue
  4. Esoteric Queue Scheduling Disciplines

Tools

  1. Running Microsoft Windows Console Applications with Invisible Windows

Announcements

Resources

  1. Data Science Resources