Common Crawl is awesome. I wonder how complex it would be to run a Google-like frontend on top of it, and how good the results would be after a couple days of hacking…
I wondered why it wasn’t obvious to this person that it would be very complex to do that, and that the result would not be great. I realized that many people who search Google in 2013 are too young to remember the early days of web search (I’ll use Google as an example for this post, replace with your favorite search engine if you’d like). There was one time when building a decent search engine was relatively simple:
- crawl a few million pages (in 1997, that would have included pretty much every interesting web page out there).
- create an index in which every word points to all the web pages containing it, just like one at the end of a textbook.
- Write a cgi script to parse a query from a search box into separate words, find the words in the index, compute the intersection of all the pages, create a list of (at most) 10 results.
- Render those results as ten blue links on a white page. Include the search box in that page in case the user wasn’t satisfied with the (probably crappy) results we just provided.
Fast forward 16 years. Today’s search engines are no longer about the web. They are about reading people’s minds, and finding answers to questions we cannot even articulate properly. A naive search engine built using the Common Crawl data on top of 1990s technology would feel clunky and tedious, like riding a horse for a ten-mile commute. If you don’t believe me, just go to Google (either on your computer or your phone) and start typing a few characters. Pay attention to all the things that happen:
- Google fixes spelling mistakes as you type. Every keystroke is sent back to a server that tries to help you as quickly as possible. What does this person want? How much information do I have so far? Should I suggest searches, or should I display results? What device is this person using, and what does the screen look like?
- Google knows who you are. On Google, you’re searching your own personal universe. If you and I start typing a first name (J-o-h), odds are it will show us people we know named John, along with their pictures. They probably won’t be the same people.
- The web is only a small fraction of your information needs. It’s been a long time since search engines restricted themselves to finding web pages. Today they will do math for you, find real-time flight information, tell you about the local weather, show you when the next bus will come by your stop. If you type an artist or a song title, you’ll probably see a video or two. Back in 2007 Google called this Universal Search, but the concept had been around for years. In 2000-2001 we were already working on blending different types of results at Inktomi.
A mainstream search engine acts a bit like a very diligent psychic. It has performed a background check on you. It knows more about you than you can probably imagine, including patterns you are not even aware of. It uses that information plus a number of large data sources (of which the web is just one) to guess what you want. To make matters more complicated, the usual input methods don’t let you articulate your desires as if you were talking to your personal search genie:
“Hey Mysterious Finder of Things, I’m looking for an article that I read sometime last week. There was a phrase that was really funny, it had something to do with George Carlin. Or maybe Louis CK, I’m not sure.”
Instead, you’d go to Google and start typing something like “George Carlin article.” Perhaps you’d do it directly from your browser bar without even noticing that you’re searching Google. As you type, you might see the url and title of the article you read 6 days ago. If you don’t, you could actually pick a suggested query, submit the search and refine the results. How often do you do that? If you see really old results, you could tell Google to restrict the search to articles from the past week. How do you think Google knows that an article is from the past week, or from the past year? It’s not as easy as it seems; Google knows when it first saw an article. What if the article was created ten years ago but only made visible to Google last week?
I could keep going, and make this into a book about the immense amount of work that has taken search to the current state of the art. I’ll stop here. The point of this post was to show how even tech-savvy people take for granted the extremely complex mechanisms that power a search engine. It may be possible to make some investors believe that you can build a usable alternative to Google in a year, but anybody who calls him/herself a hacker should know better.