It’s not possible to jump into the subject of scrapers without confronting antiscraping techniques. The reverse is also true: if you want to develop good antiscraping techniques, you must think like a scraper developer. Similarly, real hackers need knowledge of security technologies, while a good security system benefits from simulated attacks. This kind of “game dynamics” also applies to security algorithms. For example, one of the best known public-key encryption algorithms, RSA, was invented by Ron Rivest, Adi Shamir and Leonard Adleman. Rivest and Shamir invented new algorithms, and Adleman was in charge of breaking them. They eventually came up with RSA.
Antiscraping Measures and How to Pass Them
A preliminary chart:

| Antiscraping measure | How to pass it |
| --- | --- |
| The site only allows crawling by known search engine bots. | The scraper can access the search engine’s cache instead. |
| The site doesn’t allow the same IP to access many pages in a short period of time. | Use Tor, a set of proxies, or a crawling service like 80legs. |
| The site shows a captcha if it’s crawled in a massive way. | Use anti-captcha techniques, or services like Mechanical Turk where real people supply the answers. Another alternative is to listen to the audio captcha and use noise-robust voice recognition. |
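The two countermeasures against per-IP rate limiting can be sketched in code: rotate requests across a pool of proxies and keep each proxy under the site’s request rate. This is a minimal illustration, not a production crawler; the proxy addresses and the two-second interval are made-up values.

```python
import itertools
import time

# Hypothetical proxy pool; in practice these would be Tor circuits
# or addresses from a proxy or crawling service.
PROXIES = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]

def proxy_cycle(proxies):
    """Yield proxies in round-robin order so no single IP
    issues a long run of consecutive requests."""
    return itertools.cycle(proxies)

class Throttle:
    """Track the last request time per proxy and compute how long to
    wait so each IP stays under the site's rate limit."""

    def __init__(self, min_interval):
        self.min_interval = min_interval  # seconds between requests per proxy
        self.last_seen = {}               # proxy -> time its slot is used until

    def wait(self, proxy, now=None):
        """Return the delay (seconds) to sleep before using `proxy` again."""
        now = time.monotonic() if now is None else now
        last = self.last_seen.get(proxy)
        delay = 0.0 if last is None else max(0.0, self.min_interval - (now - last))
        # Reserve the slot: the proxy is considered used at now + delay.
        self.last_seen[proxy] = now + delay
        return delay
```

A crawler loop would then take the next proxy from `proxy_cycle(PROXIES)`, call `throttle.wait(proxy)`, sleep for the returned delay, and only then issue the HTTP request through that proxy.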
Many antiscraping measures are annoying for visitors. For example, if you’re a “search engine junkie” you’ll quickly find that Google shows you a captcha because it thinks you are a bot.
I believe the web should follow an MVC (Model-View-Controller) pattern, where you can access the data (the model) independently of how you interact with it. This would enable stronger connections between different sites. Linked Data is one such initiative, but there are others. Data Portability and APIs are a step towards this pattern, but when you use the APIs of large sites you realize they impose a lot of limits. Starting a whole business based on third-party APIs is very risky: you only have to look at the past to see how often API features and policies change. Facebook, Google and Twitter are good examples. API providers are afraid of losing control of their sites and the profits they generate.

We need new business models that get around this problem and benefit both API providers and consumers, and they cannot be based on advertising alone. One common approach is to charge for use of the API. There are other models, like that of the Guardian, which distributes its ads via its API. APIs carrying advertising is a promising concept. We hope that more creative people will come up with new models for a better MVC web.
- Captcha Recognition
- OCR Research Team
- Data Scraping with YQL and jQuery
- API Conference
- Google Calls Out Facebook’s Data Hypocrisy, Blocks Gmail Import
- Google Search NoAPI
- Kayak Search API is no longer supported
- The Guardian Open Platform
- Twitter Slashes API Rate Limits In Half Across The Board To Deal With Capacity Issues
- Facebook, you are doing it wrong
- Cubeduel Goes Viral Too Quickly, Stumbles Over LinkedIn API Limits
- Keyword Exchange Market
- A Union for Mechanical Turk Workers?
- The Long Tail Of Business Models
- Scraping, cleaning, and selling big data
- Detecting ‘stealth’ web-crawlers
Photo: Glykais Gyula fencing against Oreste Puliti. [Source]