Today, one company, Google, controls almost all of the world’s access to information on the Internet. Its monopoly on search means that for billions of people, the gateway to knowledge and products and their exploration of the web are in the hands of a single company. Most agree that this lack of competition in search is bad for individuals, societies, and democracy.
Unbeknownst to many, one of the biggest obstacles to competition in search is the lack of crawl neutrality. The only way to build an independent search engine and compete fairly against Big Tech is to first crawl the Internet efficiently and effectively. However, the web is an actively hostile environment for start-up search engine crawlers: most websites allow only Google’s crawler and discriminate against other search engine crawlers such as Neeva’s.
This important and often overlooked issue does enormous damage: it prevents start-up search engines like Neeva from offering users real alternatives, which reduces competition in search. Just as the Internet needed net neutrality, today it needs crawl neutrality. Without a change in policy and behavior, competitors in search will keep fighting with one hand tied behind our backs.
Let’s start from the beginning. Building a comprehensive index of the web is a prerequisite for competition in search. In other words, the first step in building the Neeva search engine is to “download the Internet” with Neeva’s crawler, Neevabot.
Here is where the problem begins. For the most part, websites give only Google’s and Bing’s crawlers unrestricted access while discriminating against other crawlers like Neeva’s. Either these sites disallow every other crawler in their robots.txt files, or (more commonly) they say nothing in robots.txt but return errors instead of content to other crawlers. The intent may be to filter out malicious actors, but the result is to throw the baby out with the bathwater. And you can’t serve search results if you can’t crawl the web.
This forces startups to spend a tremendous amount of time and resources on workarounds. For example, Neeva applies a policy to “crawl a site as long as the robots.txt file allows GoogleBot and does not specifically block Neevabot”. Even with a workaround like this, parts of the web that contain useful search results remain inaccessible to many search engines.
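To make the quoted policy concrete, here is a minimal sketch in Python using the standard library’s `urllib.robotparser`. It is an illustration under stated assumptions, not Neeva’s actual implementation: the function names `has_explicit_group` and `may_crawl` are hypothetical, and “specifically block” is read as “a robots.txt group that names Neevabot and disallows the URL”.

```python
from urllib.robotparser import RobotFileParser


def has_explicit_group(robots_txt: str, agent: str) -> bool:
    """Return True if any User-agent line names `agent` (hypothetical helper)."""
    agent = agent.lower()
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        if field.strip().lower() == "user-agent" and agent in value.strip().lower():
            return True
    return False


def may_crawl(robots_txt: str, url: str,
              own_agent: str = "Neevabot",
              reference_agent: str = "Googlebot") -> bool:
    """Assumed reading of the policy: crawl if Googlebot is allowed
    and our own crawler is not specifically blocked."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    googlebot_allowed = parser.can_fetch(reference_agent, url)
    specifically_blocked = (
        has_explicit_group(robots_txt, own_agent)
        and not parser.can_fetch(own_agent, url)
    )
    return googlebot_allowed and not specifically_blocked


# A site that admits only Googlebot and shuts out everyone else by default.
GOOGLE_ONLY = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

print(may_crawl(GOOGLE_ONLY, "https://example.com/page"))
```

Under this reading, a site that allows only Googlebot and says nothing about Neevabot is crawled, while a site that names Neevabot in a disallowing group is left alone.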
As a second example, many websites allow a non-Google crawler in their robots.txt file but block it in other ways, either by returning various errors (503s, 429s, …) or by rate limiting. To crawl these sites, one has to deploy workarounds such as “obfuscation by crawling with a bank of periodically rotated proxy IP addresses.” Legitimate search engines like Neeva hate having to deploy adversarial workarounds like this.
These roadblocks are often intended for malicious bots, but they have the effect of stifling legitimate search competition. At Neeva, we put a lot of effort into building a well-behaved crawler that respects rate limits and crawls at the lowest rate needed to build a great search engine. Meanwhile, Google has carte blanche. It crawls 50 billion web pages per day. It visits every web page once every three days, and it taxes network bandwidth on all websites. This is the monopoly tax on the Internet.
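As an illustration of what “respecting rate limits” can mean in practice, the sketch below picks the pause before the next request to a host from the last response. It is an assumed design, not a description of Neevabot: the function name `next_delay` and its parameters are hypothetical, and only the delay-in-seconds form of the `Retry-After` header is handled (the header may also carry an HTTP date, which falls through to exponential backoff here).

```python
from typing import Optional


def next_delay(status: int, retry_after: Optional[str], current_delay: float,
               base_delay: float = 1.0, max_delay: float = 3600.0) -> float:
    """Return the seconds to wait before the next request to the same host."""
    if status in (429, 503):
        # The server asked us to slow down. Prefer its explicit Retry-After
        # value when it is a plain number of seconds.
        if retry_after is not None and retry_after.strip().isdigit():
            return min(float(retry_after.strip()), max_delay)
        # Otherwise double the current pause (exponential backoff), capped.
        return min(max(current_delay, base_delay) * 2, max_delay)
    # Any other response: ease back toward the base delay, never below it.
    return max(base_delay, current_delay / 2)


delay = 1.0
for status, header in [(200, None), (429, None), (503, "120"), (200, None)]:
    delay = next_delay(status, header, delay)
    print(status, delay)
```

A crawler built this way slows down as soon as a site signals overload and speeds up again only gradually, which is the opposite of rotating IP addresses to dodge the limit.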
For the lucky crawlers among us, a group of helpful professionals, webmasters, and well-meaning publishers can help get your bot whitelisted. Thanks to them, Neeva’s crawl now reaches hundreds of millions of pages per day and is on track to reach billions of pages per day soon. However, this still requires identifying the right individuals to talk to at these companies, sending cold emails and making cold calls, and hoping for goodwill from webmasters on webmaster email aliases that are usually ignored. It is a temporary fix that does not scale.
Getting permission to crawl shouldn’t be about who you know. There should be a level playing field for anyone who competes and follows the rules. Google has a monopoly on search, so websites and webmasters face an impossible choice: either they let Google crawl them, or they don’t appear prominently in Google results. As a result, Google’s search monopoly leads the Internet at large to reinforce that monopoly by giving Googlebot preferential access.
The Internet should not be allowed to discriminate between search engine crawlers based on their identity. Neeva’s crawler is capable of crawling the web at the speed and depth that Google’s does. There are no technical limitations; only anti-competitive market forces make fair competition more difficult. And if it is too much extra work for webmasters to distinguish the bad bots that slow down their websites from legitimate search engines, then those given free rein, like Googlebot, should be asked to share their data with responsible actors.
Regulators and policymakers need to step in if they care about competition in search. The market needs crawl neutrality, similar to net neutrality.
Vivek Raghunathan is the co-founder of Neeva, an ad-free private search engine. Asim Shankar is the chief technology officer at Neeva.