I just saw this post on Hacker News, and I thought I’d write my answer here. Disclaimer: I don’t know if I am (or ever was) a hardcore back-end developer. I did make money for a long time doing web-scale development work for a bunch of companies.
I started coding in the 80s. No web, no connectivity for the most part, just simple games, random programs, hardware drivers, random utilities. Basic, Pascal, Assembly, C.
Fast forward to 1997. I’m doing my master’s at Carnegie Mellon, enjoying the most awesome network connection. I download my first mp3, a novelty back then. It’s not even illegal yet; that will come later. A few months later the first mp3 search engines start to pop up. I think I can create one that’s better, and it becomes my personal summer project in 1998. After six weeks of nonstop hacking, I have 2look4.com running (check out the old snapshot; the domain hasn’t been mine since 2001).
So what was 2look4? A few things.
- A CGI program written in C running under Apache that would process queries and call system("/bin/grep") over a file containing ftp links to mp3 files.
- An ftp spider called by a cron job that would go through a list of known sites, list the mp3 files there and guess if the site had “unlimited access” or if it required an upload/download ratio.
One day I put it out there and told some friends about it. The first day it had 100 queries. A week later, 10,000. My linear search with grep held up pretty well for the first few days, but as soon as I started getting concurrent queries it slowed down to a crawl. That’s the first time I saw a web-scale problem. Over the course of three nights I wrote a very simple inverted indexer and searcher to replace grep. It had no tf-idf or word positions. All you could do was an AND search (intersection between the sets of documents containing words), but it reduced the complexity from linear to almost logarithmic (binary search, basically). This was enough for the load of the server to drop and for the machine (a Pentium server) to run pretty much unattended for the next year. It would peak at around 200k queries per day sometime in 1999, before Napster came into the picture and killed ftp servers that shared mp3 files.
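For illustration, here is a minimal Python sketch of the kind of index described above: no tf-idf, no word positions, only an AND search done as an intersection of document sets, with each word found by binary search over a sorted term list. It is not the original 2look4 code. The function names (`tokenize`, `build_index`, `lookup`, `search_and`), the tokenization rule, and the sample ftp links are all assumptions made up for this example.

```python
import re
from bisect import bisect_left


def tokenize(text):
    """Split text into lowercase alphanumeric words (an assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(links):
    """Build a sorted term list plus one list of document ids per term."""
    postings = {}
    for doc_id, link in enumerate(links):
        # set() so each document appears at most once per word
        for word in set(tokenize(link)):
            postings.setdefault(word, []).append(doc_id)
    terms = sorted(postings)
    return terms, [postings[term] for term in terms]


def lookup(index, word):
    """Find the document ids for one word by binary search over the terms."""
    terms, lists = index
    i = bisect_left(terms, word)
    if i < len(terms) and terms[i] == word:
        return lists[i]
    return []


def search_and(index, query):
    """Return ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    result = None
    # Intersect the shortest lists first so the working set shrinks quickly.
    for doc_ids in sorted((lookup(index, w) for w in words), key=len):
        result = set(doc_ids) if result is None else result & set(doc_ids)
        if not result:
            return []
    return sorted(result)


# Hypothetical sample data, only for demonstration.
LINKS = [
    "ftp://example.org/music/artist_one-blue_song.mp3",
    "ftp://example.org/music/artist_two-blue_moon.mp3",
    "ftp://example.net/pub/artist_one-red_song.mp3",
]

if __name__ == "__main__":
    index = build_index(LINKS)
    for doc_id in search_and(index, "blue song"):
        print(LINKS[doc_id])
```

The point of the design is that a query no longer touches every line of the link file: each word costs one binary search, and the remaining work is proportional to the number of documents that actually contain the words.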
In the fall of ’98 I showed 2look4 to a former Carnegie Mellon professor who worked at Inktomi. Long story short, Inktomi acqui-hired me and my puny search engine software (not the site, which earned me some nice cash on the side from ads). I joined the crawling and indexing team and battled Google fiercely for the next four years and ultimately lost. We did a lot of cool stuff though. We grew from indexing 30M documents to 500M in a couple of years, for example.
To answer the title question: I don’t think you “become a hardcore back-end developer”; you simply dip your toes in the water of terabytes and millions of requests per day. Try not to commit any algorithms whose order of complexity doesn’t make sense for what you are doing. Pierce through abstraction layers and try to understand what the hardware is doing. Are we killing the network? The CPU? The disks? Do lots of back-of-the-envelope calculations. Write prototypes. Use load generator tools. Instrument code. Optimize carefully and only when necessary. Ask others who know more. It never ends, because technology changes fast. Game-changing technologies become affordable (cloud computing, SSDs, insane amounts of memory). It’s very humbling, because you think your system is as fast as it can be until someone comes along and makes it 10x faster.
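As an example of a back-of-the-envelope calculation, the short Python sketch below turns a daily query count into queries per second and compares the worst-case number of comparisons for a linear scan against a binary search. Only the 200k queries per day figure comes from the story above; the peak factor and the one million entries are made-up assumptions, and the helper names are hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(queries_per_day):
    """Average queries per second if load were spread evenly over a day."""
    return queries_per_day / SECONDS_PER_DAY


def peak_qps(queries_per_day, peak_factor):
    """Rough peak rate, given an assumed ratio of peak to average load."""
    return average_qps(queries_per_day) * peak_factor


def linear_comparisons(entries):
    """Worst-case comparisons for scanning every entry once."""
    return entries


def binary_search_comparisons(entries):
    """Approximate worst-case comparisons for binary search on sorted data."""
    return math.ceil(math.log2(entries))


if __name__ == "__main__":
    daily = 200_000          # peak daily load mentioned in the post
    assumed_peak_factor = 5  # assumption: busy hours run at 5x the average
    assumed_entries = 1_000_000  # assumption: size of the searched list

    print(f"average: {average_qps(daily):.1f} queries/s")
    print(f"assumed peak: {peak_qps(daily, assumed_peak_factor):.1f} queries/s")
    print(f"linear scan: {linear_comparisons(assumed_entries)} comparisons")
    print(f"binary search: {binary_search_comparisons(assumed_entries)} comparisons")
```

Numbers like these are rough by design. They are meant to show whether an approach is off by orders of magnitude before any code gets optimized.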
If you are a software developer, try to work for a while on a web-scale problem. To me, it never gets old.