Introduction
It’s not possible to jump into the subject of scrapers without confronting antiscraping techniques. The reverse is also true: if you want to develop good antiscraping techniques, you must think like a scraper developer. Similarly, real hackers need knowledge of security technologies, while a good security system benefits from simulated attacks. This kind of “game dynamics” also applies to security algorithms. For example, one of the best-known public-key encryption algorithms, RSA, was invented by Ron Rivest, Adi Shamir and Leonard Adleman. Rivest and Shamir invented new algorithms and Adleman was in charge of breaking them. They eventually came up with RSA.
Antiscraping Measures and How to Pass Them
A preliminary chart:
| Antiscraping technique | Scraping countermeasure |
| --- | --- |
| The site only allows crawling by known search engine bots. | The scraper can access the search engine’s cached copy of the pages instead. |
| The site doesn’t allow the same IP to access many pages in a short period of time. | Route requests through Tor, a pool of proxies, or a crawling service like 80legs (see the proxy-rotation sketch after this table). |
| The site shows a captcha if it’s crawled in a massive way. | Use captcha-breaking techniques, or services like Mechanical Turk where real people supply the answers. Another alternative is to take the audio version of the captcha and run noise-tolerant speech recognition on it. |
| The site uses JavaScript. | Use a JavaScript-enabled crawler. |
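To make the proxy-rotation countermeasure concrete, here is a minimal sketch in Python using the `requests` library. The proxy addresses and target URL are placeholders for illustration; a real pool would come from a proxy provider, or you could point a single entry at a local Tor SOCKS endpoint instead.

```python
import itertools
import time

import requests

# Placeholder proxies -- substitute your own pool, or a local Tor
# endpoint such as "socks5h://127.0.0.1:9050" (requires requests[socks]).
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

proxy_cycle = itertools.cycle(PROXIES)

def fetch(url):
    """Fetch a URL, switching to the next proxy on every request."""
    proxy = next(proxy_cycle)
    response = requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    for page in range(1, 4):
        html = fetch("http://example.com/listing?page=%d" % page)
        print(len(html))
        time.sleep(2)  # spacing out requests also keeps your profile low
```

Because each request leaves from a different IP, the per-IP rate limits in the table above are far less likely to trip.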
Many antiscraping measures are annoying for visitors. For example, if you’re a “search engine junkie” you’ll find out pretty quickly that Google shows you a captcha because it suspects you are a bot.
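To see why honest heavy users get flagged, here is a sketch of the kind of per-IP rate limiter a site might run; the window and threshold are made-up numbers, not any real site’s logic. The limiter only sees request counts, so from its point of view a “search engine junkie” and a bot are indistinguishable.

```python
import time
from collections import defaultdict, deque

# Made-up thresholds; real systems tune these per endpoint.
WINDOW_SECONDS = 60
MAX_REQUESTS = 100

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def should_challenge(ip):
    """Return True when an IP exceeds the limit and should get a captcha."""
    now = time.time()
    hits = _hits[ip]
    hits.append(now)
    # Discard requests that have fallen out of the sliding window.
    while hits and hits[0] < now - WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_REQUESTS
```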
Digression
I believe the web should follow an MVC (Model View Controller) style pattern, where you can access the data (the model) independently of how you interact with it. This would enable stronger connections between different sites. Linked Data is one such initiative, but there are others. Data Portability and APIs are a step towards this pattern, but when you use APIs from large sites you realize they impose a lot of limits. Building a whole business on top of third-party APIs is very risky: you only have to look at the past to see how often API features and policies change. Facebook, Google and Twitter are good examples. API providers are afraid of losing control of their sites and the profits they generate. We need new business models that get around this problem and benefit both API providers and consumers, and not only models based on advertising. One common approach is to charge for use of the API. There are other models, like the Guardian’s, which distributes its ads via its API. APIs carrying advertising is a promising concept. We hope that more creative people will come up with new models for a better MVC web.
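As a small illustration of living with those limits as an API consumer, here is a sketch of a client that backs off when the provider rate-limits it. The endpoint URL is hypothetical, and the handling assumes the common convention of an HTTP 429 response with an optional Retry-After header given in seconds.

```python
import time

import requests

def call_api(url, max_retries=5):
    """GET a JSON endpoint, backing off when the provider rate-limits us."""
    delay = 1.0
    for _ in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code == 429:  # Too Many Requests
            # Honour Retry-After if the server sends one (assumed to be
            # in seconds); otherwise fall back to exponential backoff.
            wait = float(response.headers.get("Retry-After", delay))
            time.sleep(wait)
            delay *= 2
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("still rate-limited after %d attempts" % max_retries)

# Hypothetical endpoint; real providers document their own limits.
data = call_api("https://api.example.com/v1/items")
```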
Further reading
- Captcha Recognition
- OCR Research Team
- Data Scraping with YQL and jQuery
- API Conference
- Google Calls Out Facebook’s Data Hypocrisy, Blocks Gmail Import
- Google Search NoAPI
- Kayak Search API is no longer supported
- The Guardian Open Platform
- Twitter Slashes API Rate Limits In Half Across The Board To Deal With Capacity Issues
- Facebook, you are doing it wrong
- Cubeduel Goes Viral Too Quickly, Stumbles Over LinkedIn API Limits
- Keyword Exchange Market
- A Union for Mechanical Turk Workers?
- The Long Tail Of Business Models
- Scraping, cleaning, and selling big data
- Detecting ‘stealth’ web-crawlers
Photo: Glykais Gyula fencing against Oreste Puliti. [Source]
Nice article. I believe if more sites released quality APIs and charged a low price with reasonable licensing, they wouldn’t need to worry so much about scraping. But making quality APIs can be a lot harder than throwing up a basic CRUD website. Like you showed, scrapers will figure out how to get the data, and the only way to stop it is to provide an easier way.
I agree that making a quality API is an extra effort, but most of the sites being scraped have enough resources to take it seriously, or to see it as an additional revenue source. For example, sites related to jobs or real estate could benefit from building an API.