Challenging Google’s Search Engine

Google is the undisputed search leader, with 88% market share in the US [1]. It is not only ahead of its competitors in the quality of its search results, in infrastructure worthy of science fiction, and in computer science research; another of its strengths is how quickly it puts its own research into production.

How can Google be dethroned? Sure, there are other search engines, and newcomers like Blekko and Duck Duck Go make headlines from time to time. However, when you look more closely at those other search engines, you find that they cannot seriously compete with Google.

Benchmarking Search Engines

A search for “reverse engineering” on Blekko returns about a hundred thousand results, while the same search on Google returns approximately two million. If it is so difficult for Blekko to compete at the crawling level, imagine what happens in the rest of the search engine pipeline. Just looking at Google’s search quality reports tells you that PageRank was only the catalyst for much more sophisticated algorithms.

Duck Duck Go manages to attract a geeky audience with features like putting privacy first. If we search for “reverse engineering” on Duck Duck Go, the results seem wacky: the second result is http://reverseengineeringinc.com/, a content-poor site that merely has the right domain name.

Google appears to be in a league by itself. It currently seems unlikely that they could lose significant market share due to an engineering weakness. In order to outdo Google, we must think holistically and try to guess how the web as a whole will evolve over the next ten years.

A Holistic Approach

Duck Duck Go has created a two-level search engine for sites like Wikipedia and YouTube, and it offers the DuckDuckGo Instant Answer API to incorporate third-party search engines. To take advantage of DDG and other two-tiered search engines, sites will have to improve their local search. Currently, if you use the local site search on Stack Overflow, for example, the results are of much lower quality than the same query on Google restricted to stackoverflow.com. When each site understands its own data better than Google does, its internal search results will surpass Google’s. Google will no doubt continue to provide better global results, but two-tiered search would decentralize the effort of improving ranking algorithms. It is important to note that this solution does not need to be distributed: sites can share their local indexes and ranking algorithms with the routing search engine, as in the sketch below.
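
To make the idea concrete, here is a minimal sketch of such a routing layer. The per-site endpoints and the response format are hypothetical placeholders, not any existing API; a real implementation would call each site’s own search service and merge the results.

```python
# Minimal sketch of a two-tiered ("routing") search engine.
# The endpoints and JSON shape below are hypothetical placeholders.
import requests

# Hypothetical local-search endpoints exposed by participating sites.
LOCAL_SEARCH_ENDPOINTS = {
    "stackoverflow.com": "https://stackoverflow.example/api/search",
    "en.wikipedia.org": "https://wikipedia.example/api/search",
}

def route_query(query, sites, global_search):
    """Ask each site's local index first; fall back to the global engine."""
    results = []
    for site in sites:
        endpoint = LOCAL_SEARCH_ENDPOINTS.get(site)
        if endpoint is None:
            continue
        response = requests.get(endpoint, params={"q": query}, timeout=10)
        if response.ok:
            # Each site ranks its own documents with its own algorithm.
            results.extend(response.json().get("results", []))
    if not results:
        # No local index answered, so use the first-tier (global) engine.
        results = global_search(query)
    return results
```

The point is the division of labor: each site ranks its own documents with its own algorithm, while the routing layer only decides which local indexes to ask.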

The fact that a small number of sites receive the majority of Internet traffic means that optimizing the top sites for two-tiered search would make a big difference.

Notes

  1. Search Engine Market Share

Additional Resources

  1. To Break Google’s Monopoly on Search, Make Its Index Public
  2. Can a small search engine take on Google?
  3. Google’s Weakness, AltaVista’s Strength [2002]
  4. Is the Internet driving competition or market monopolization?
  5. Media and Internet concentration in Canada

See Also

  1. Reverse Engineering and The Cloud
  2. Egont, A Web Orchestration Language
  3. Helping Search Engines to Find Content in the Invisible Web
  4. The Data Portability Fact Sheet

The Data Portability Fact Sheet

Introduction

Parallego has been announced on TechCrunch after a stealth period as the latest social network that will challenge Facebook and Google Plus. Their investors include big names like Sequoia Capital, Andreessen Horowitz and Union Square Ventures, and they have top angels like Ron Conway. They really love developers, so they offer an API to show their commitment to openness.

Parallego doesn’t really exist, but announcements like this are a staple of startup and web entrepreneurship news. These companies emphasize their love for developers and claim to be open because they provide APIs. The truth is that when you test their APIs, you usually find a number of problems:

  1. You can read the information but cannot write or modify it.
  2. You have access to certain information but other information is unavailable.
  3. The API rate limit is low, so you can only make a few calls and must then wait a certain period of time to continue (see the backoff sketch after this list).
  4. You cannot make parallel requests in a multiprocess or multithreaded application.
  5. There is no way to quickly pay for the service and get better access. The Google API Console is a step in that direction, but many important Google “NoAPIs” are unavailable there.
  6. Some OAuth2 implementations do not work with existing development libraries.
  7. The service says it welcomes new applications, but this is not the case for new UIs and mobile clients. See Twitter to Devs: Don’t Make Twitter Clients… Or Else [mashable.com]
  8. You cannot even export your own information. The time you have spent adding content to this service is lost once you leave it.
  9. There is no love for developers: the forums are filled with questions and there are no official answers. See Rate limit with billing enabled [google.com] and Graph API rate limit? [facebook.com]
  10. The company often changes its policies. The web mashup that you built seven months ago, which attracted thousands of users, is now useless because the new API revision does not give you the data you need for some specific features. See Should facebook pay compensation for deprecated API calls and changes [facebook.com]
  11. Old content is removed without warning.
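
Rate limiting in particular (problem 3 above) shapes how clients have to be written. Below is a minimal sketch of the usual workaround, exponential backoff. The endpoint is hypothetical, and providers differ in how they signal the limit (HTTP 429, custom error bodies, Retry-After headers).

```python
# Minimal sketch: retrying a rate-limited API call with exponential backoff.
# The URL passed in is hypothetical; adapt the status handling per provider.
import time
import requests

def call_with_backoff(url, params, max_retries=5):
    delay = 1.0  # seconds; doubles after each rate-limited attempt
    for attempt in range(max_retries):
        response = requests.get(url, params=params, timeout=10)
        if response.status_code != 429:  # 429 Too Many Requests
            response.raise_for_status()
            return response.json()
        # Honor the Retry-After header when the provider sends one.
        retry_after = response.headers.get("Retry-After")
        wait = float(retry_after) if retry_after else delay
        time.sleep(wait)
        delay *= 2
    raise RuntimeError("still rate limited after %d retries" % max_retries)
```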

After a while, you begin to have doubts, close your eyes, and rethink the word “Open”. It starts to seem somewhat meaningless. If you are older, you may remember that Microsoft was accused of being closed, but you may also remember that in the worst case you could reverse engineer its software and access all the internals yourself. That required advanced knowledge of tools like IDA Pro, OllyDbg, and WinDbg, of course, but it was possible. You cannot reverse engineer the cloud. You can scrape the information, but that is time consuming both in development effort and in running time.
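
To give an idea of why scraping is costly, here is a minimal sketch that pages through a hypothetical public profile using plain HTTP requests and the standard-library HTML parser. A real site adds logins, JavaScript-rendered pages, and throttling on top of this, which is where most of the development and running time goes.

```python
# Minimal scraping sketch; the URL and the markup it expects are hypothetical.
import time
import requests
from html.parser import HTMLParser

class ItemTitleParser(HTMLParser):
    """Collect the text of <h2 class="title"> elements."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles.append(data.strip())

def scrape_profile(base_url, pages=10, delay=1.0):
    titles = []
    for page in range(1, pages + 1):
        response = requests.get(base_url, params={"page": page}, timeout=10)
        if response.status_code == 404:
            break  # ran out of pages
        parser = ItemTitleParser()
        parser.feed(response.text)
        titles.extend(parser.titles)
        time.sleep(delay)  # be polite; also what makes scraping slow
    return titles
```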

And while “Open” is repeated in every announcement from high-profile web companies, your brain no longer registers the word, just as you no longer see any of the ads on Google because your brain has built its own AdBlock extension.

Data Portability Classification

For all of the above reasons, we think the best initiative toward transparency is adding a fact sheet to every service, so we can compare them and know how “open” they really are. WikiMatrix is a good example of how such comparisons could be made.

Marco Paol from DBB has been informally collecting information about some web services and has put it in a public spreadsheet, Data Portability Comparison.
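
One way to make such fact sheets comparable across services, in the spirit of WikiMatrix, is to keep them machine readable. The field names and example values below are only an illustrative sketch keyed to the problem list above, not a finalized schema.

```python
# Illustrative sketch of a machine-readable data portability fact sheet.
# Field names and example values are hypothetical, not a finalized schema.
fact_sheet = {
    "service": "Parallego",                # the fictional social network above
    "api_read": True,                      # problem 1: can you read your data?
    "api_write": False,                    # ... and write or modify it?
    "full_data_coverage": False,           # problem 2: is all information exposed?
    "rate_limit_per_hour": 350,            # problem 3: how many calls are allowed?
    "parallel_requests": False,            # problem 4
    "paid_tier_available": False,          # problem 5
    "oauth2_works_with_libraries": False,  # problem 6
    "third_party_clients_allowed": False,  # problem 7
    "full_export": False,                  # problem 8: can you leave with your data?
    "official_developer_support": False,   # problem 9
    "api_deprecation_policy": "none published",  # problems 10 and 11
}

def openness_score(sheet):
    """Crude score: count the boolean fields that favor the developer."""
    return sum(1 for value in sheet.values() if value is True)

print(openness_score(fact_sheet))
```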

Please feel free to send us clarifications, suggestions, and fixes.

Resources

  1. Open Data and Linked Data [wikipedia.org]
  2. DataPortability project [wikipedia.org]
  3. Small data [smalldata.org]
  4. The open data manual [opendatamanual.org]
  5. Is It Open Data?
  6. Open Data mailing lists [okfn.org]
  7. Synaptic/Web
  8. Open Knowledge Foundation Blog
  9. The Friend of a Friend (FOAF) project
  10. theinfo.org: Community for Getting, Processing, and Visualizing Large Data Sets
  11. Plagiarism Today
  12. PeopleBrowsr’s case against Twitter heads back to state court after federal court ruling
  13. Archive Team archivists