Scraping vs Antiscraping

Introduction

It’s not possible to jump into the subject of scrapers without confronting antiscraping techniques. The reverse is also true: if you want to develop good antiscraping techniques, you must think like a scraper developer. Similarly, real hackers need knowledge of security technologies, while a good security system benefits from simulated attacks. This kind of “game dynamics” also applies to security algorithms. For example, one of the best known public-key encryption algorithms, RSA, was invented by Ron Rivest, Adi Shamir and Leonard Adleman. Ron and Adi invented new algorithms and Adleman was in charge of breaking them. They eventually came up with RSA [1].

Antiscraping Measures and How to Pass Them

A preliminary chart:

Antiscraping technique: The site only allows crawling by known search engine bots.
Scraping technique: Access the pages through the search engine’s cache instead.

Antiscraping technique: The site doesn’t allow the same IP to access many pages in a short period of time.
Scraping technique: Rotate requests through Tor, a set of proxies, or a crawling service like 80legs (a minimal sketch follows this list).

Antiscraping technique: The site shows a captcha when it detects heavy crawling.
Scraping technique: Use anti-captcha techniques, or services like Mechanical Turk where real people provide the answer. Another alternative is to use the audio version of the captcha and apply speech recognition that tolerates the added noise.

Antiscraping technique: The site relies on JavaScript to render its content.
Scraping technique: Use a JavaScript-enabled crawler.
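To make the IP-rate-limit countermeasure more concrete, here is a minimal sketch of rotating requests across a proxy pool with Python’s requests library. The proxy addresses (including the local Tor SOCKS port) are placeholders and not part of the original article; a production crawler would add retries, error handling and politeness delays.

```python
import itertools
import time

import requests

# Placeholder proxy pool: a local Tor SOCKS proxy plus two hypothetical HTTP proxies.
# Replace these with your own endpoints; the socks5h scheme needs the "requests[socks]" extra.
PROXIES = [
    "socks5h://127.0.0.1:9050",       # Tor (DNS is resolved through the circuit)
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

proxy_cycle = itertools.cycle(PROXIES)

def fetch(url):
    """Fetch a URL, sending each request through the next proxy in the pool."""
    proxy = next(proxy_cycle)
    response = requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "Mozilla/5.0 (compatible; example-crawler)"},
        timeout=30,
    )
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    for page in range(1, 4):
        html = fetch("http://example.com/listing?page=%d" % page)
        print(len(html), "bytes")
        time.sleep(2)  # keep the per-IP request rate low anyway
```

The same structure works with a pool made entirely of Tor circuits or with proxies rented from a commercial provider; only the PROXIES list changes.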

Many antiscraping measures are annoying for regular visitors. For example, if you’re a “search engine junkie” you’ll quickly find that Google shows you a captcha because it thinks you are a bot.

Digression

I believe the web should follow an MVC (Model View Controller) style pattern, where you can access the data (the model) independently of how you interact with it. This would enable stronger connections between different sites. Linked Data is one such initiative, but there are others. Data Portability and APIs are a step towards this pattern, but when you use the APIs of large sites you realize that they impose a lot of limits. Starting a whole business based on third-party APIs is very risky: you only have to look at the past to see many changes in API features and policies. Facebook, Google and Twitter are good examples. API providers are afraid of losing control of their sites and the profits they generate. We need new business models which can get around this problem and benefit both API providers and consumers; in particular, business models that are not based solely on advertising. One common approach is to charge for the use of the API. There are other models, like the one followed by the Guardian, which distributes its ads via its API. APIs carrying advertising is a promising concept. We hope that more creative people will come up with new models for a better MVC web.

See Also

  1. Running Your Own Anonymous Rotating Proxies
  2. Distributed Scraping With Multiple Tor Circuits

References

  1. Leonard Adleman Interview

Further reading

  1. Captcha Recognition
  2. OCR Research Team
  3. Data Scraping with YQL and jQuery
  4. API Conference
  5. Google Calls Out Facebook’s Data Hypocrisy, Blocks Gmail Import
  6. Google Search NoAPI
  7. Kayak Search API is no longer supported
  8. The Guardian Open Platform
  9. Twitter Slashes API Rate Limits In Half Across The Board To Deal With Capacity Issues
  10. Facebook, you are doing it wrong
  11. Cubeduel Goes Viral Too Quickly, Stumbles Over LinkedIn API Limits
  12. Keyword Exchange Market
  13. A Union for Mechanical Turk Workers?
  14. The Long Tail Of Business Models
  15. Scraping, cleaning, and selling big data
  16. Detecting ‘stealth’ web-crawlers

Photo: Glykais Gyula fencing against Oreste Puliti. [Source]