Introduction
It’s not possible to jump into the subject of scrapers without confronting antiscraping techniques. The reverse is also true: if you want to develop good antiscraping techniques, you must think like a scraper developer. Similarly, real hackers need knowledge of security technologies, while a good security system benefits from simulated attacks. This kind of “game dynamics” also applies to security algorithms. For example, one of the best-known public-key encryption algorithms, RSA, was invented by Ron Rivest, Adi Shamir and Leonard Adleman. Rivest and Shamir invented new algorithms and Adleman was in charge of breaking them. They eventually came up with RSA.
Antiscraping Measures and How to Pass Them
A preliminary chart:
| Antiscraping technique | Scraping countermeasure |
| --- | --- |
| The site only allows crawling by known search engine bots. | The scraper can access the search engine’s cached copy of the pages instead. |
| The site doesn’t allow the same IP to access many pages in a short period of time. | Route requests through Tor, a pool of proxies, or a crawling service like 80legs (see the proxy-rotation sketch after this table). |
| The site shows a captcha if it’s crawled in a massive way. | Use captcha-breaking techniques, or services like Mechanical Turk where real people supply the answers. Another alternative is to take the audio version of the captcha and run noise-tolerant speech recognition on it. |
| The site uses JavaScript. | Use a JavaScript-enabled crawler. |
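To make the proxy-rotation countermeasure concrete, here is a minimal sketch in Python using the `requests` library. The proxy addresses and target URL are placeholders for illustration; a real pool would come from a proxy provider, or you could point a single entry at a local Tor SOCKS endpoint instead.

```python
import itertools
import time

import requests

# Placeholder proxies -- substitute your own pool, or a local Tor
# endpoint such as "socks5h://127.0.0.1:9050" (requires requests[socks]).
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]

proxy_cycle = itertools.cycle(PROXIES)

def fetch(url):
    """Fetch a URL, switching to the next proxy on every request."""
    proxy = next(proxy_cycle)
    response = requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    for page in range(1, 4):
        html = fetch("http://example.com/listing?page=%d" % page)
        print(len(html))
        time.sleep(2)  # spacing out requests also keeps your profile low
```

Because each request leaves from a different IP, the per-IP rate limits in the table above are far less likely to trip.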
Many antiscraping measures are annoying for visitors. For example, if you’re a “search engine junkie” you’ll find out pretty quickly that Google shows you a captcha because it suspects you are a bot.
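To see why honest heavy users get flagged, here is a sketch of the kind of per-IP rate limiter a site might run; the window and threshold are made-up numbers, not any real site’s logic. The limiter only sees request counts, so from its point of view a “search engine junkie” and a bot are indistinguishable.

```python
import time
from collections import defaultdict, deque

# Made-up thresholds; real systems tune these per endpoint.
WINDOW_SECONDS = 60
MAX_REQUESTS = 100

_hits = defaultdict(deque)  # ip -> timestamps of recent requests

def should_challenge(ip):
    """Return True when an IP exceeds the limit and should get a captcha."""
    now = time.time()
    hits = _hits[ip]
    hits.append(now)
    # Discard requests that have fallen out of the sliding window.
    while hits and hits[0] < now - WINDOW_SECONDS:
        hits.popleft()
    return len(hits) > MAX_REQUESTS
```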
Digression
I believe the web should follow an MVC (Model View Controller) style pattern, where you can access the data (the model) independently of how you interact with it. This would enable stronger connections between different sites. Linked Data is one such initiative, but there are others. Data Portability and APIs are a step towards this pattern, but when you use APIs from large sites you realize they impose a lot of limits. Building a whole business on top of third-party APIs is very risky: you only have to look at the past to see how often API features and policies change. Facebook, Google and Twitter are good examples. API providers are afraid of losing control of their sites and the profits they generate. We need new business models that get around this problem and benefit both API providers and consumers, and not only models based on advertising. One common approach is to charge for use of the API. There are other models, like the Guardian’s, which distributes its ads via its API. APIs carrying advertising is a promising concept. We hope that more creative people will come up with new models for a better MVC web.
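As a small illustration of living with those limits as an API consumer, here is a sketch of a client that backs off when the provider rate-limits it. The endpoint URL is hypothetical, and the handling assumes the common convention of an HTTP 429 response with an optional Retry-After header given in seconds.

```python
import time

import requests

def call_api(url, max_retries=5):
    """GET a JSON endpoint, backing off when the provider rate-limits us."""
    delay = 1.0
    for _ in range(max_retries):
        response = requests.get(url, timeout=10)
        if response.status_code == 429:  # Too Many Requests
            # Honour Retry-After if the server sends one (assumed to be
            # in seconds); otherwise fall back to exponential backoff.
            wait = float(response.headers.get("Retry-After", delay))
            time.sleep(wait)
            delay *= 2
            continue
        response.raise_for_status()
        return response.json()
    raise RuntimeError("still rate-limited after %d attempts" % max_retries)

# Hypothetical endpoint; real providers document their own limits.
data = call_api("https://api.example.com/v1/items")
```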
Further reading
- Captcha Recognition
- OCR Research Team
- Data Scraping with YQL and jQuery
- API Conference
- Google Calls Out Facebook’s Data Hypocrisy, Blocks Gmail Import
- Google Search NoAPI
- Kayak Search API is no longer supported
- The Guardian Open Platform
- Twitter Slashes API Rate Limits In Half Across The Board To Deal With Capacity Issues
- Facebook, you are doing it wrong
- Cubeduel Goes Viral Too Quickly, Stumbles Over LinkedIn API Limits
- Keyword Exchange Market
- A Union for Mechanical Turk Workers?
- The Long Tail Of Business Models
- Scraping, cleaning, and selling big data
- Detecting ‘stealth’ web-crawlers
Photo: Glykais Gyula fencing against Oreste Puliti. [Source]
Nice article. I believe if more sites released quality APIs and charged a low price with reasonable licensing, they wouldn’t need to worry so much about scraping. But making quality APIs can be a lot harder than throwing up a basic CRUD website. Like you showed, scrapers will figure out how to get the data, and the only way to stop it is to provide an easier way.
I agree that making a quality API is an extra effort, but most of the sites being scraped have enough resources to take it seriously, or to see it as an additional revenue source. For example, sites related to jobs or real estate could benefit from building an API.