open | Data Big Bang Blog

Introduction

It’s not possible to jump into the subject of scrapers without confronting antiscraping techniques. The reverse is also true: if you want to develop good antiscraping techniques you must think like a scraper developer. Similarly, real hackers needs knowledge of security technologies while a good security system benefits from simulated attacks. This kind of “game dynamics” also applies to security algorithms. For example one of the best known public encryption algorithms, RSA, was invented by Ron Rivest, Adi Shamir and Leonard Adleman. Ron and Adi invented new algorithms and Adelman was in charge of breaking them. They eventually came up with RSA¹.

Antiscraping Measures and How to Pass Them

A preliminary chart:

Antiscraping techniques	Scraping techniques
The site only enables crawling by a known search engine bot.	The scraper can access the search engine cache.
The site doesn’t allow the same IP to access a lot of pages in a short period of time.	Use Tor, a set of proxies, or a crawling service like 80legs.
The site shows a captcha if it’s crawled in a massive way.	Use anti-captcha techniques or services like Mechanical Turk where real people can give the answer. Another alternative is to listen to the captcha and use voice recognition with noise.
The site uses javascript.	Use a javascript enabled crawler.

Many antiscraping measures are annoying for visitors. For example if you’re a “search engine junkie” you’ll find pretty quickly that Google shows you a captcha thinking that you are a bot.

Digression

I believe the web should follow a MVC (Model View Controller) type pattern where you can access the data (the model) independently of how you interact with it. This would enable stronger connections between different sites. Linked Data is one of such initiative, but there are others. Data Portability and APIs are a step towards this pattern, but when you are using APIs from large sites you realize that they’ve put a lot of limits. Starting a whole business based on third party APIs is very risky. You only have to look at the past to see a lot of changes on API features and policies. Facebook, Google and Twitter are good examples. API providers are afraid of losing control of their sites and the profits they generate. We need new business models which can get around this problem and benefit both API providers and consumers. In this sense should be created new business models not only based on advertising. One common approach is to charge for the use of the API. There are other models like that followed by the Guardian, which distribute their ads via their API. APIs carrying advertising is a promising concept. We hope that more creative people will came up with new models for a better MVC web.

References

Leonard Adleman Interview

Data Big Bang Blog

Creativity and Problem Solving for Data Science (whatever it may mean…) | An experimental spin-off from Nektra Advanced Computing

Menu

Tag Archives: open

Scraping vs Antiscraping

Introduction

Antiscraping Measures and How to Pass Them

Digression

See Also

References

Further reading