Have You Received Bitcoin Spam?

Over the past few days, someone has been sending tiny amounts of bitcoin (one satoshi) to thousands of addresses. Those transactions are unlikely to be confirmed, but they do show up on blockchain.info for now. Here’s one example:

[Screenshot: one of the one-satoshi spam transactions on blockchain.info]

The satoshis come from two vanity addresses: 1Enjoy… and 1Sochi…

If you click on the first address as I write this, you will see that its most recent transaction contains a link with the text “Play and win BTC.”

[Screenshot: the “Play and win BTC” link on the address page]

The link leads to what looks like a bitcoin gambling site, bitwars dot org. I’m not linking to it for obvious reasons.

Some googling shows this has been done before. Here’s a discussion on Stack Exchange. If you have received these tiny amounts and wondered what they are about, now you know. You can safely ignore them. Just like life, spam will always find a way.

Dogecoin and the Appeal of Small Numbers

Dogecoin is a unique phenomenon in the fascinating world of cryptocurrencies. It’s barely six weeks old, and as I write this post its network has more computing power than any other cryptocurrency except for Bitcoin. It made headlines this weekend when its community raised enough money to send the Jamaican bobsled team to the Sochi Winter Olympics.

From a technical standpoint, Dogecoin is essentially a branded clone of Litecoin (the second cryptocurrency in terms of total market value). Without a doubt one of the most important factors contributing to Dogecoin’s popularity is its community. The Dogecoin subreddit has almost 40k users right now. The front page usually has a good mix of humor, good will, finance, and technology. Check it out if you haven’t already.

There’s another more subtle factor that I believe plays in Dogecoin’s favor: its tiny value. One DOGE is worth about $0.0015 right now. In other words, one dollar buys you about 600-700 DOGE. Contrast that with Bitcoin: $1 is about 0.001 BTC. This puts Bitcoin and Dogecoin in two completely different mental buckets for most people. One BTC is comparable to an ounce of gold. The press reinforces this idea, and many people view Bitcoin as a digital store of value. The daily transaction volume of BTC is about 0.2 percent of the total bitcoins in existence, which means that BTC does not circulate very much yet.

Contrast this with Dogecoin, for which the daily transaction volume is close to 15%. Where does that money go? Perhaps the most common usage of DOGE is to give online tips. Compare the activity of Reddit’s bitcointip and dogetipbot, and you’ll see the latter is much more active. What would you prefer as a tip, 100 DOGE or 0.000002 BTC? Both are almost meaningless in terms of monetary value, but receiving 100 units of a coin does feel better. It’s also easier to give tips; you don’t have to think much about tipping someone 10, 25 or 100 DOGE. With BTC you either have to choose a dollar amount, or be very careful with the number of zeroes.

The reason a DOGE is worth so little is the total supply of coins. The Bitcoin software has an embedded constant called MAX_MONEY. For Bitcoin it’s set to 21 million, which means that if Bitcoin takes over as a world currency it will be impossible for most people to ever own one. Litecoin is only slightly better, at 84 million. For DOGE, it’s one hundred billion (perhaps more, yet to be decided). This makes it unlikely that one DOGE will be worth $1 any time soon (or ever). It’s easy and fun to exchange $20 for 10k DOGE and give a fraction of them to strangers on the internet. Anyone can still mine hundreds of dogecoins per day with a desktop computer, and not feel very attached to them. Being a “slumdoge millionaire” is still affordable to many.
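
To see how supply alone drives unit price, here’s a quick back-of-the-envelope sketch in Clojure (the supplies are the ones above; the $1 billion market cap is an arbitrary round number, purely for illustration):

(def supplies {"BTC" 21e6, "LTC" 84e6, "DOGE" 100e9}) ; max coins, from above

(defn unit-price [market-cap coin]
  (/ market-cap (supplies coin)))

(unit-price 1e9 "BTC")  ; => ~$47.62 per coin at a $1 billion market cap
(unit-price 1e9 "DOGE") ; => $0.01 per coin at the very same market cap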

In a world where people get a kick out of likes or retweets, Dogetips take it up a notch. A Dogetip is an upvote that you can use, internet karma points that are actually worth something. So fun, very value.

Image credit: /u/binxalot; this person deserves tips. Of course I accept them too 🙂

DHpZsQCDKq9WbqyqfetMcGq87pFZfkwLBh

The Harsh, Logical World of Cryptocurrencies

There’s a subtle point that escapes most people buying bitcoin these days: Bitcoins are not exactly something you own. Rather, they are something you know.

Let’s say you buy a bitcoin, and you transfer it to an address. The world now knows that there is one BTC there. This is public information. What the world doesn’t know is the key to use that money. Only the person who knows that key (you and only you, hopefully) can spend it. If you lose that key, nobody will ever be able to spend that bitcoin, and it will stay locked forever.

How is this possible? Why can’t we recover the key somehow? Because it’s like finding an infinitesimally small needle in an absurdly gigantic haystack. There is a very large number of possible bitcoin keys, comparable to the number of atoms in the universe (it’s a 77-digit number). Your bitcoin address is derived from that key, but not in a way that can be reversed. A very rough analogy would be: if you know John, you know his height is 5’10”. If I tell you that someone’s height is 5’10”, you have no idea who that person is.

This analogy doesn’t take us very far; if being 5’10” were the only requirement to spend John’s money, it would be easy to find someone with that height. Now imagine we are traveling around the universe and measuring the diameters of atoms. One day, and for no particular reason, they all expand to random sizes. Some remain tiny, others are as large as a cow, a planet or a galaxy. Your job is to find an atom whose diameter is exactly 176,891,292,523,293.23412 miles. Good luck.

When you install a bitcoin wallet on your computer it generates a unique, virtually unrepeatable key for you. This key is 256 random bits, and you can derive your address from it. The second you transfer significant money to that address, you have a problem. You need to make sure that:
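
For the technically inclined, here’s a minimal Clojure sketch of that generation step (illustrative only; a real wallet also derives the address from the key via elliptic-curve math and hashing, which I’m skipping):

(import '(java.security SecureRandom))

;; A private key is just 32 cryptographically random bytes (256 bits).
(defn random-key []
  (let [bs (byte-array 32)]
    (.nextBytes (SecureRandom.) bs)
    (apply str (map #(format "%02x" %) bs))))

(random-key)
;; => 64 hex characters, e.g. "9f3c...": one needle in a 2^256 haystack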

a) You never lose your key.

b) Nobody else can ever know your key.

You can see how this is difficult. If you wrote down your key, someone could find it and spend your money. If that happened, you wouldn’t know until you checked the blockchain transactions involving that address. Same thing if you backed it up. What if someone finds your backup drive? What if you upload it to Dropbox, and a disgruntled employee reads your files? On the other hand, you cannot afford not to back up your key. You wouldn’t want to be like the guy whose hard drive containing millions of dollars’ worth of BTC is buried somewhere in a landfill.

So how do we solve this problem? Suppose you want to store a life-changing amount of bitcoin somewhere (why you’d want to do that is an interesting question). The best solution I’ve found so far involves two-factor authentication. You can use an algorithm like BIP38 to generate a passphrase-encrypted key, which you can then print and/or store somewhere offline. You may want to keep a few paper / digital copies of this key in different locations (why not, it’s cheap). Still, there are issues with this approach:

– You have to know what you’re doing. For example, you have no reason to be online to generate a bitcoin key/address. Technically you don’t even need a computer; you could generate a bitcoin key by rolling a die 100 times, writing down the numbers (3, 6, 3, 2, 1 …) and performing some mathematical transformations with pen and paper (sketched below). That would be a bit too paranoid, though. It’s easier (and less error-prone) to use a live read-only Linux distribution on a computer that’s never been connected to the internet. You’d need to install a trusted, open-source program to generate a BIP38-encrypted paper wallet (you’d copy it from a portable drive, of course). There are several implementations out there. This is my favorite, although it requires a graphical interface. I’ve also written my own command-line paper wallet generator and BIP38 encrypt/decrypt utility (although I urge you not to trust my hasty implementations very much).
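
Here’s what the dice idea looks like in code, as a minimal Clojure sketch (my own illustration, not taken from any of the tools above): treat the rolls as digits of a base-6 number.

(defn rolls->key
  "Interpret a seq of die rolls (1-6) as a base-6 number. 100 rolls
   carry ~258 bits, so reduce mod 2^256 at the end (a real implementation
   would also reduce mod the secp256k1 curve order)."
  [rolls]
  (.mod (biginteger (reduce (fn [acc r] (+' (*' acc 6) (dec r))) 0 rolls))
        (.pow (biginteger 2) 256)))

(format "%064x" (rolls->key (repeatedly 100 #(inc (rand-int 6)))))
;; => your key as 64 hex digits (use real dice, not rand-int!)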

– If you use BIP38, now you have two things you can’t lose: your encrypted key and your password. Losing either means that your money is gone forever. And if you can remember your password easily, it’s probably not good enough. Relevant XKCD: “Password Strength.”

Even though the BIP38 algorithm is pretty hard to brute-force, a determined adversary who possessed your encrypted key could try billions of combinations in a relatively small amount of time. Therefore, choosing a password for an encrypted key that controls millions of dollars worth of bitcoin is not a trivial matter.

Are there any better alternatives? Probably not. Some people are partial to the idea of a Brain Wallet, which works as follows: you pick a very complicated passphrase that you can remember, and use a mathematical function to convert it into a 256-bit bitcoin key. The “wallet” therefore exists only in your brain. The problem with this approach is that you lose significant entropy by picking something you can remember, which makes an attacker’s job easier (see this Reddit thread about a guy who used an “obscure poem in Afrikaans” as a passphrase).
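
Mechanically, a brain wallet is trivial, and that’s exactly the problem. Here’s a sketch, assuming the common construction of hashing the passphrase with SHA-256:

(import '(java.security MessageDigest))

;; The entire "wallet": key = SHA-256(passphrase). Anyone who guesses the
;; phrase regenerates the key, so attackers hash entire dictionaries,
;; song lyrics and, yes, obscure poems in Afrikaans.
(defn brain-wallet-key [passphrase]
  (apply str
         (map #(format "%02x" %)
              (.digest (MessageDigest/getInstance "SHA-256")
                       (.getBytes passphrase "UTF-8")))))

(brain-wallet-key "an obscure poem in Afrikaans")
;; => the same 64 hex digits every time, for anyone who picks this phrase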

Could we go the opposite way perhaps? Could we generate a random key and then come up with a mnemonic that encodes 256 bits of information in a way that most people can remember? I gave the problem a little bit of thought this morning and came up with a few ideas, none of which convince me fully:

– Memorize your key as a decimal number. It’s “only” 77 digits, and you could use the techniques described in Moonwalking with Einstein. I’m sure I could remember 77 digits for an hour or two. Next year? Forget it 🙂

– Use the 256 bits to generate a fictional character. For example, the first bit could be man / woman. The next six bits could be age (say 18 to 82). Favorite color, height, weight, country of origin, occupation, etc. I made a reasonably long list of attributes that I thought I might remember about someone I really cared about, and it was hard to get past 100 bits.

– Generate meaningful text, perhaps a poem or a story, using arbitrary rules, e.g.:

  • A random word from [ A/The/My/Your/One/His/Her/Our ] -> 3 bits
  • A color name picked from a list of 16 -> 4 bits
  • A nationality from a list of 128 countries -> 7 bits
  • A type of animal from a list of the most memorable ones, maybe 64 of them -> 6 bits

For example, the first 20 bits might encode into “your blue Zimbabwean otter,” and that would be the start of a story generated by other rules that consumed the remaining 236 bits. It would be mandatory to lay out all your choices beforehand, and strictly adhere to what the 256 bits of your random key had dictated. The structure could yield something more memorable than the on-the-fly narratives “memory athletes” use for short-term recall. Even though it’s clearly possible to do this, I wouldn’t trust my memory to precisely recall such a story for years. It’s still a fun thought experiment.
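
Here’s a small Clojure sketch of that last scheme (the word lists are mine and purely illustrative; a real version would need enough rules to consume all 256 bits):

(def determiners ["A" "The" "My" "Your" "One" "His" "Her" "Our"]) ; 8 -> 3 bits
(def colors ["red" "blue" "green" "black" "white" "gray" "pink" "brown"
             "orange" "purple" "cyan" "gold" "silver" "beige" "teal" "maroon"]) ; 16 -> 4 bits
(def animals ["otter" "tiger" "eagle" "whale" "cobra" "moose" "lemur" "bison"]) ; 8 -> 3 bits

(defn take-bits
  "Consume n bits from key (a BigInteger); returns [index remaining-key]."
  [key n]
  [(.intValue (.and key (biginteger (dec (bit-shift-left 1 n)))))
   (.shiftRight key n)])

(defn key->phrase [key]
  (let [[d key] (take-bits key 3)
        [c key] (take-bits key 4)
        [a key] (take-bits key 3)]
    ;; ...further rules would consume the remaining 246 bits
    (str (determiners d) " " (colors c) " " (animals a))))

(key->phrase (biginteger 123456789))
;; => "His green eagle" (10 bits down, 246 to go)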

If there is a point to this post, it’s that using bitcoin to safely store large amounts of money long-term is still impractical for most people. There’s a reason bitcoin wallets are not called vaults. Unless you really know what you’re doing, don’t use them to store more money than you would carry in your pocket.

By the way, the key to the address linked in the second paragraph of this post is the number 1. You can see that people send tiny amounts of money to it frequently, perhaps to test out new software. Want to catch a digital dime once in a while? 🙂

If you found this post useful, tips are welcome.

1EmwBbfgH7BPMoCpcFzyzgAN9Ya7jm8L1Z

Hacker News discussion of this post.

Markov Chains in Clojure, Part 2 – Scaling Up

Last month I posted a very simple Markov Chain generator that takes one paragraph of text and produces gibberish. The one issue is that it doesn’t scale, for two reasons:

1) It reads all the text into memory at once. That’s fine if you’re processing a few paragraphs, but for large amounts of text it’s better to process one line at a time.

2) The word lists contain repeated words. These lists would take much less space if they had word counts instead. For example:

{"He" ("is" "is"), :start ("He" "She" "He" "She")}

Could be

{"He" {"is" 2}, :start {"He" 2, "She" 2)}

Obviously it’s not much of a gain for this simple example. If we were processing a book, though, we might have hundreds of sentences that started with He or She. The second structure would be a couple of orders of magnitude smaller.

Let’s start with the second problem. Here’s an elegant way to traverse the pairs of words and build the structure with counts, based on a Stack Overflow discussion (reconstructed below; the helper name add-transition is mine):
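
(defn add-transition
  "Bump the count for the pair word1 -> word2; fnil supplies the 0
   when the pair has never been seen before."
  [counts [w1 w2]]
  (update-in counts [w1 w2] (fnil inc 0)))

(reduce add-transition {} [["He" "is"] ["He" "is"] ["She" "is"]])
;; => {"He" {"is" 2}, "She" {"is" 1}}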

Notice the usefulness of fnil; it makes the code more concise. More importantly, the reduction consumes its input one pair at a time, so if we read the file lazily we can process the input stream without reading it all into memory first. How would you do it?

Below is a snippet of code based on another Stack Overflow question (again a reconstruction; line-seq does the lazy heavy lifting). By the way, isn’t Stack Overflow awesome? There was a time when we had to code without it, or without internet access. Those were the bad old times 🙂
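
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Stream the file line by line, wrap each line's words in :start/:end
;; markers, and fold consecutive pairs into the counts map. Nothing ever
;; holds the whole file in memory.
(defn file->chain [path]
  (with-open [rdr (io/reader path)]
    (->> (line-seq rdr)
         (mapcat #(concat [:start] (str/split % #"\s+") [:end]))
         (partition 2 1)
         (remove #(= % [:end :start])) ; skip pairs that straddle lines
         (reduce add-transition {}))))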

Notice that I sandwiched the words in each sentence between :start and :end markers because I want to know which words are good starting and ending points for generated sentences. Without those markers, the transform function would see one long stream of words that might as well be a single sentence. As anecdotal evidence that this approach is more memory-efficient: running the JVM on my MacBook against a text file with 50k English words takes 20 MB less memory than the original program.

Now that we have a suitable structure, how do we pick words at random, but with probability proportional to the number of times each word occurs after a given one? For example, if I have {“He” {“is” 15, “was” 5}} I want to pick “is” 3/4 of the time. This was an interview question that we asked frequently at Inktomi in the late 90s. The simplest solution is to pick a random number between zero and the total count of word instances, and see which word’s slice of that range it falls into.

In the above example, we have two slices:

0 -14 -> "is"

15 - 19 -> "was"

so choosing a random number between 0 and 19 and then checking which slice it belongs to would yield “is” 75% of the time and “was” the remaining 25%. Here’s a nice implementation from Rich Hickey (quoted from memory, so any mistakes are mine):
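
(defn wrand
  "Given a vector of slice sizes, returns the index of a slice picked
   with probability proportional to its size."
  [slices]
  (let [total (reduce + slices)
        r (rand total)]
    (loop [i 0, sum 0]
      (if (< r (+ (slices i) sum))
        i
        (recur (inc i) (+ (slices i) sum))))))

(wrand [15 5]) ;; => 0 ("is") about 75% of the time, 1 ("was") the rest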

And that’s pretty much it. The entire code is here. I ran it against this collection of Hacker News headlines from this post and generated a few of my own:

  • “Microsoft’s Decline of Employee #1, Leaving Github”
  • “Y Combinator’s First iOS 6 Countdown”
  • “Congress Is the HipHop VM”
  • “Welcome to Get From Russia in Javascript charts and Numpy”

Where do we go from here? We could change the code to pick only sentences of a certain length, or to make chains of n-grams (instead of single words) to create more plausible text. Have fun!

Fun with Markov Chains and Clojure

Markov chains are a simple concept. Start with a state (e.g. a word, a location on a map, a musical note) and move to another one at random. Repeat until you have a chain of states that you can use for something. The key is that you don’t choose randomly from all possible states, only from those that have some probability of following the current state.

For example, suppose you know that a sentence starts with the word “She.” It’s reasonable to bet on the next word being “is” or “was.” It’s possible (but less likely) to see less common words such as “asserted” or “obliterated.” Most likely your text doesn’t contain instances of phrases like “She chemistry” or “She absolutism.” If that’s the case, those words have zero probability of being chosen.

Let’s take the following text as an example:

He is a boy. She is a girl. He is young. She is younger.

Here is a list of all the transitions we see in the sentences from this text:

{"She" ("is" "is"), "a" ("boy." "girl."), 
"is" ("a" "a" "young." "younger."),
"He" ("is" "is"), "*START*" ("He" "She" "He" "She")}

Note that the word “is” has four words that it could transition to. Two of them are “a” so if we start generating random sentences, “is” will be followed by “a” half the time. Let’s choose a word that follows the *START* state (nothing) and continue until we find a word with a period. Here are some sentences we could see:

"He is younger." "She is a girl." "She is a boy." "He is a boy."

Ok, time to automate this with some Clojure code, lightly reconstructed below (the full code is on Github). Or skip to the “generate gibberish” button if you’re not a programmer:
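
(require '[clojure.string :as str])

(defn build-chain
  "Map every word (and the *START* marker) to the list of words
   that follow it in the text."
  [text]
  (loop [m {}, prev "*START*", [w & ws] (str/split text #"\s+")]
    (if w
      (recur (update m prev conj w)
             (if (str/ends-with? w ".") "*START*" w)
             ws)
      m)))

(defn generate-sentence
  "Walk the chain from *START* until we hit a word ending in a period."
  [chain]
  (loop [w (rand-nth (chain "*START*")), sentence [w]]
    (if (str/ends-with? w ".")
      (str/join " " sentence)
      (let [nxt (rand-nth (chain w))]
        (recur nxt (conj sentence nxt))))))

(generate-sentence (build-chain "He is a boy. She is a girl. He is young. She is younger."))
;; => e.g. "She is a boy."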

The first function generates a map of transitions like the one shown above. The second one chooses words from the map until it finds a word that ends in a period. It’s easy to compile this into JavaScript via ClojureScript; I’ve done that, so check out the full code on Github. You can test it below: click the “generate gibberish” button and check the output. The input box contains Blowin’ in the Hamlet by Bob Shakespeare. Try replacing that with a few paragraphs of your own.



Side note: the code above is the simplest algorithm I came up with for generating a Markov chain out of a few paragraphs, but it doesn’t scale for large amounts of text. If you try running it against a book, it will run out of memory. In a follow-up post I’ll show how to put together a solution that scales better. It will be more about Clojure than about Markov Chains.

Software Development Estimates, Where Do I Start?

For some reason many people discuss the problem of estimating software development timeframes without properly understanding the issue. There is a famous answer on Quora that exemplifies this. Lots of people like that story, even though it’s inaccurate and misguided. “Software development” is such a huge endeavor that it doesn’t even make sense to talk about estimates without an understanding of the kinds of problems software can solve. To put this in context, let’s forget software for a second and look at a few tangible problems of different magnitudes.

  • You are a medical researcher. A new disease makes headlines. It’s a virus. It seems to be spreading through sexual contact. How long is it going to take you to find a cure?
  • It’s 1907, and you’ve just built the first airplane. The government wants you to build a spaceship to fly to the moon. How long will it take?
  • You are in charge of a construction company that has built hundreds of buildings in your metropolitan area. I want you to build me a twenty-story apartment complex very similar to the one you just finished on the other side of town. When will it be done?
  • You own a chair factory that produces 1000 chairs per month. I need 3500 chairs. Can you make them in a month?

The four questions above are radically different in nature. The first two involve significant unknowns, and require scientific or technological breakthroughs. The other two, not so much. Meta-question: does software development look like the first or the second kind? Another meta-question: what kind of software development are we talking about?

Let’s focus on the construction company. People have been constructing buildings for centuries. There’s relatively little variation in the effort and costs of putting up vanilla high-rises (we’re not talking about Burj Khalifa). Of course there is uncertainty: economic conditions could change, suppliers could go out of business. The new mayor could have a personal vendetta against your company because your big brother bullied him in school. All those things have happened before, perhaps in combination. Let’s say you can give me an estimate of 15 to 24 months with 98% confidence based on past data. Sounds good to me.

If buildings were software you could break out a template, customize it a bit, install it on my plot of land in the cloud, the end. There are companies doing this for software, for example Bitnami (disclosure: I’m an investor). The process is so quick that you don’t even need to ask for a time estimate. You just see it happen in real time.

Let’s imagine that software, like physical objects, could not be cloned at almost zero cost. Most software developers would be like monks copying manuscripts. If you have been handwriting pages for long enough, you can confidently tell me that it will take you at least 15 years to produce a high-quality copy of the entire Bible (it can be done 4x faster today, though I’m not sure about the quality). You could get sick, or suffer interruptions. However, the number of absolute hours you’ll need is well known. There is a type of software development that works like this: porting old applications to a new language (say COBOL to Java back in the day).

Of course, the more rewarding problems in software development look nothing like this. I enjoy trying to solve problems that nobody has solved before. The malleability of software makes it easy to explore an open problem. Some problems are deceptive: at first they may look like building a house, and as you discover unknowns they mutate to resemble a quest for an AIDS vaccine. If a problem is solvable with software, it may take weeks or months to come up with an imperfect solution. It will probably take years to build one that’s scalable and robust. The meta-problem is that some problems cannot be solved with software, or at least not yet, or not by me / my team / my company. I might give you an estimate that looks like:

  • less than a month: 30% chance
  • less than a year: 40% chance
  • never: 30% chance

Another kind of software development sits somewhere in the middle, and it may be the one that generates the most software jobs. Usually an organization wants a solution to a problem that has already been solved by others (e.g. building a cluster manager for a social network graph). Even though you don’t have access to the design or the code, you have an idea of what the solutions look like. You don’t know what key issues their builders ran into, how good the teams were, or how lucky they got. Still, you know the problem can be solved by companies that look like yours for a reasonable cost. There is still quite a bit of uncertainty: you can estimate small tasks reasonably well, but you cannot predict which “week-long” task might expand to several months (e.g. it turns out that no open source tool solves X in a way that works for us, so we’ll have to write our own).

The gist of why estimates are hard: every new piece of software is a machine that has never been built before. The process of describing how the machine works is the same as building the machine. The more your machine looks like existing ones, the easier it is to estimate its difficulty. Of course it won’t be exactly the same machine; that can happen with a chair but not with software. On the other hand, you may want to boldly build what no one has built before. In that case, you’ll most likely adjust your scope so that you can build something that makes sense for the timeframes you work with. The solution to the original problem might take iterations over generations. Not necessarily generations of humans, perhaps generations of product versions, teams or even companies. You may set out to put a man on the moon, and your contribution would be the first airplane.

Discuss on Hacker News

Uber for Everything

San Francisco has 800,000 inhabitants. How many cordless drills are there in this city? Probably orders of magnitude more than we actually need. I bought one six months ago and used it just once. It’s now another object in a box filled with stuff. At the time I must have thought that the most convenient way to hang a shelf on the wall was to buy a $22 drill with free Amazon Prime shipping.

What I really needed were the holes, not the drill. Maybe I should put it out in the street and forget that it existed. What if for $22 I could have had a reliable person show up at my place within an hour, drill the holes, and go away? I’d probably do that every time I needed holes (maybe once a year, no idea). I know I could find someone on TaskRabbit to do it, but is it as easy as buying a drill on Amazon?


I know this is a silly example of a First World problem; that’s not the point. What’s interesting to me is how typical households in the developed world contain caches of random objects that we use with varying frequencies. You probably use your toothbrush at least once a day (let’s hope). How about other stuff in your bathroom? What’s in your closet, or in your garage? Perhaps you have a tennis racket that you bought ten years ago when Rafa Nadal was still an unknown. He’s gone through hundreds of rackets since, while yours sat idle next to your mother’s old dining set, on top of a case containing a $300 guitar that you played for a week, inspired by Slash’s performance during a Guns N’ Roses concert you saw. Then the band fell apart, and Axl spent ten years “working” on an album called Chinese Democracy that few people remember. While far from their best work, Chinese Democracy is way better than, say, Liz Phair’s Funstyle. But that’s not important right now, let’s get back to your neglected guitar.

Why did you spend $300 on a guitar? It probably seemed like the best option at the time given the alternatives you had. You probably couldn’t borrow one from a friend, and you thought there was a good chance you’d use it for a long time. It seemed justified. We humans are pretty bad at predicting the future, and sometimes that’s very costly. On a much larger scale, those of us who live in California know how this state embraced the car/freeway combo during the twentieth century. The state was developed during the short window of time when cars and freeways seemed like the solution to all transportation problems. Now we are stuck with an inefficient transportation system, and we need to own our private cars to drive on public freeways.

What if we had to design the United States transportation system from scratch today? With today’s technology, perhaps we’d want public roads and public cars. It might work like this:

You need to go from A to B, so you walk outside. There are a bunch of cars parked within a minute of your doorstep. They are all more or less indistinguishable, like parking meters or traffic lights. You pull out your phone, click on the “car” icon, see the lights of a silver sedan blink. You drive it to B, and you park it somewhere. Your app charges you a toll for the trip. That exact car probably won’t be there when you get back, so you have to take your stuff with you. Perhaps you have a standard robot trunk that fits into all cars. It follows you around when you walk, and inserts itself into the car you drive. Another robot goes around refueling cars. Cars that break down are mysteriously repaired at night. In this imaginary country, owning a car makes as much sense as owning a road.

Of course I’m not suggesting that we build the above system (I’d prefer self-driving Segways). All I’m saying is that we have the technology to do it if we wanted. In fact, let’s forget cars. What other kinds of objects that we own could be replaced by services? I can imagine startups taking advantage of niche opportunities the same way ride-sharing services like Uber and Lyft are disrupting the taxi business. Could the “drill-me-a-hole” app become a billion dollar startup? Perhaps not, but what are the conditions for an object-replaced-by-a-service to be a viable business? Here are a few:

  • Latency: if I need a cab, I don’t want it tomorrow. It’s reasonable to wait ten minutes, but an hour might be too much. For a hole in the wall, I could wait until tomorrow. What about owning a dinner set for 12 people in case we have guests over? I may want to schedule it to show up Friday at 5 pm, as well as a dirty-dish pickup tomorrow after 11.
  • Liability: what if I make my drill available for peer-to-peer rental, and the next person to use it breaks it? What if my drill is used to commit a crime?
  • Liquidity: what if I request a dinner set at 4 pm, but there are none available until tomorrow? What if I want a relatively rare object of which only five exist in San Francisco?
  • Peer-to-peer (“Airbnb for drills”) or centralized (“Zipcar for blenders”)?
  • Cost-effectiveness: could someone put a drill in my hands in the next hour, and pick it up tomorrow morning, for less than it costs Amazon to deliver one in 48 hours (and never get it back)?

There must be lots of things for which new “sharing economy” and “unusual things as a service” startups could figure out the operational details. Uber, Lyft, TaskRabbit, Zipcar, that’s just the beginning. Imagine the free space and extra money you’d have if you could have a vacuum cleaner at your place in 30 minutes and gone an hour later, two extra chairs for the weekend, an air mattress for a week, a barbecue for six hours on Labor Day, a Bigfoot Garden Yeti for… well, never?

Discuss on Hacker News

Why Search Is Hard

Yesterday I was reading this Hacker News thread about the Common Crawl, and one comment caught my attention:

Common Crawl is awesome. I wonder how complex it would be to run a Google-like frontend on top of it, and how good the results would be after a couple days of hacking…

I wondered why it wasn’t obvious to this person that it would be very complex to do that, and that the results would not be great. I realized that many people who search Google in 2013 are too young to remember the early days of web search (I’ll use Google as an example for this post; replace it with your favorite search engine if you’d like). There was a time when building a decent search engine was relatively simple:

  • Crawl a few million pages (in 1997, that would have included pretty much every interesting web page out there).
  • Create an index in which every word points to all the web pages containing it, just like the one at the end of a textbook.
  • Write a CGI script to parse a query from a search box into separate words, find the words in the index, compute the intersection of all the page sets, and create a list of (at most) 10 results.
  • Render those results as ten blue links on a white page. Include the search box on that page in case the user wasn’t satisfied with the (probably crappy) results we just provided. (A toy sketch of the index and query steps follows.)
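
Here’s roughly what that looked like, as a toy Clojure sketch of the indexing and query steps (hypothetical pages, naive tokenization, no ranking whatsoever):

(require '[clojure.string :as str]
         '[clojure.set :as set])

;; Build an inverted index: word -> set of pages containing it.
(defn build-index [pages] ; pages: map of url -> text
  (reduce (fn [idx [url text]]
            (reduce #(update %1 %2 (fnil conj #{}) url)
                    idx
                    (str/split (str/lower-case text) #"\W+")))
          {}
          pages))

;; Split the query, intersect the page sets, return up to 10 links.
(defn search [idx query]
  (->> (str/split (str/lower-case query) #"\W+")
       (map #(get idx % #{}))
       (reduce set/intersection)
       (take 10)))

(search (build-index {"cnn.com"  "news of the day"
                      "espn.com" "sports news of the day"})
        "sports news")
;; => ("espn.com")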

Fast forward 16 years. Today’s search engines are no longer about the web. They are about reading people’s minds, and finding answers to questions we cannot even articulate properly. A naive search engine built using the Common Crawl data on top of 1990s technology would feel clunky and tedious, like riding a horse for a ten-mile commute. If you don’t believe me, just go to Google (either on your computer or your phone) and start typing a few characters. Pay attention to all the things that happen:

  • Google fixes spelling mistakes as you type. Every keystroke is sent back to a server that tries to help you as quickly as possible. What does this person want? How much information do I have so far? Should I suggest searches, or should I display results? What device is this person using, and what does the screen look like?
  • Google knows who you are. On Google, you’re searching your own personal universe. If you and I start typing a first name (J-o-h), odds are it will show us people we know named John, along with their pictures. They probably won’t be the same people.
  • The web is only a small fraction of your information needs. It’s been a long time since search engines restricted themselves to finding web pages. Today they will do math for you, find real-time flight information, tell you about the local weather, show you when the next bus will come by your stop. If you type an artist or a song title, you’ll probably see a video or two. Back in 2007 Google called this Universal Search, but the concept had been around for years. In 2000-2001 we were already working on blending different types of results at Inktomi.

A mainstream search engine acts a bit like a very diligent psychic. It has performed a background check on you. It knows more about you than you can probably imagine, including patterns you are not even aware of. It uses that information plus a number of large data sources (of which the web is just one) to guess what you want. To make matters more complicated, the usual input methods don’t let you articulate your desires as if you were talking to your personal search genie:

“Hey Mysterious Finder of Things, I’m looking for an article that I read sometime last week. There was a phrase that was really funny, it had something to do with George Carlin. Or maybe Louis CK, I’m not sure.”

Instead, you’d go to Google and start typing something like “George Carlin article.” Perhaps you’d do it directly from your browser bar without even noticing that you’re searching Google. As you type, you might see the URL and title of the article you read six days ago. If you don’t, you could pick a suggested query, submit the search, and refine the results. How often do you do that? If you see really old results, you could tell Google to restrict the search to articles from the past week. How do you think Google knows that an article is from the past week, or from the past year? It’s not as easy as it seems; Google knows when it first saw an article. What if the article was created ten years ago but only made visible to Google last week?

I could keep going, and make this into a book about the immense amount of work that has taken search to the current state of the art. I’ll stop here. The point of this post was to show how even tech-savvy people take for granted the extremely complex mechanisms that power a search engine. It may be possible to make some investors believe that you can build a usable alternative to Google in a year, but anybody who calls him/herself a hacker should know better.

Discuss on Hacker News

Open-source Something Often

If you write code for a living, when was the last time you released something as open source? If you can’t remember, I’d hold that against you in an interview. Why? Assuming you take pride in your work (if you don’t… well), open-sourcing your code is an incentive to:

  • Make sure the code is not horrendously embarrassing.
  • Verify that a random person can check it out and make it work.
  • Explain to others why (and maybe how) your code works.

You may be a star coder at some top-notch Silicon Valley darling, and you may work on code that has been working flawlessly for years. Whether you like it or not, the more obscure and undocumented your code, the more the company depends on you. Your peers may quickly review your code every so often (ship it! ship it good!), but that doesn’t mean anybody else knows what’s going on. Some companies like LinkedIn understand this, and make it known that your code could (and probably will) be open-sourced one of these days.

I’m writing this post because this weekend, when I open-sourced my code for generating word clouds from Twitter searches, I was reminded of all three bullet points.

I wrote this stuff while learning Clojure last year. I wanted to create word clouds like the ones on Wordle. I discovered a great Processing-based library called WordCram, and I wrote some code around it to fetch and render text from a variety of sources: Twitter timelines, searches, the app.net global timeline, RSS feeds, etc. Here’s an example of a word cloud for “snowden” from early this morning:

[Word cloud for “snowden”]

I thought it would be really easy to release this as open source, but it turned out to be a fair amount of work for what’s barely over 100 lines of Clojure. Here are some of the issues:

1) When I decided to release the code, I tested it against Twitter and it didn’t work. Why? Because Twitter had deprecated the pagination-based search API that clojure-twitter relied on. I had to replace that library with twitter-api, and adapt my code to work with timelines.

2) I realized that my config.clj file contained my private Twitter app credentials. I had to add it to .gitignore so I wouldn’t commit it by mistake, and I had to do that before adding the repo to Github; otherwise the credentials would be part of the history. I had to create a dummy config.sample file, and explain in my README.md how to obtain and insert your own credentials.
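
For reference, the sample file is just the real one with placeholders, something like this (the key names here are illustrative, not necessarily the ones twitter-api expects):

;; config.sample: copy to config.clj and fill in your own app's values.
(def twitter-creds
  {:consumer-key    "YOUR-CONSUMER-KEY"
   :consumer-secret "YOUR-CONSUMER-SECRET"
   :access-token    "YOUR-ACCESS-TOKEN"
   :access-secret   "YOUR-ACCESS-TOKEN-SECRET"})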

3) I still have to document my code and make it more idiomatic, but I don’t care that much about it because I’m not looking for a job as a coder. I’ll do it at some point, or maybe someone will tell me how much I suck in a pull request 🙂

4) I thought I was done, so I created the Github repo. I did a fresh checkout, and it worked. I tweeted about it, and asked a friend to see if it worked for him. Of course it did not. Why? This is where the fun begins.

I had created the project with leiningen 1.7, and I’d had to set up a local Maven repository because WordCram and the other libraries I needed are not in any public ones. I’d followed the instructions in this post from late 2011. I had since upgraded to leiningen 2.1.3, for which that recipe no longer works. The problem was that the .m2 directory in my home directory already had all the dependencies it needed, so a fresh build did not need to fetch anything from the Maven repo. Sure enough, I moved .m2 aside and everything broke.

A search on Google took me here, which was the start of the solution. That almost worked, except that it created a slightly different repository structure than the one leiningen expected for my four artifacts. I had to recreate the Maven repo with a script that gave a different groupId to each artifact, and then my friend sgrove was able to make the code compile and run.

I still wasn’t done. The first query he tried was somewhat obscure, and a call with 0 results broke my program (not enough arguments to the reduce function).
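
The bug class, distilled (illustrative, not the actual project code): reduce without an initial value calls its function with zero arguments when the collection is empty.

(reduce + [])     ;; => 0, because (+) with no arguments has an identity
(reduce max [])   ;; => ArityException: (max) needs at least one argument
(reduce max 0 []) ;; => 0: supplying an initial value avoids the crash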

Moral of the story: I’ve been coding for 30 years, and look at all the stuff I had to learn in order to open source what’s essentially a puny script. Humbling and inspiring at the same time.

What have you open-sourced for me lately?

Another Silly Startup Analogy

When I was a teenager in Argentina in the 1980s, there was a weekly TV show called “Feliz Domingo” (Happy Sunday). It ran live for nine hours, between one and ten pm. It was a game show for high school students near graduation, who competed in events involving different skills. A typical event would have four or five individual students (or small teams), each representing a 30-40 person group from a graduating class. Students competed on national history trivia, blindfolded obstacle courses, performance art, etc. Some of the events were somewhat weird; it’s not easy to fill up nine hours’ worth of live TV (there were also live musical performances by all sorts of local bands such as Los Fabulosos Cadillacs, but I digress).

One of the coolest events was about memory and diction. You’d get a random trivia category such as Greek Philosophers, and you’d have to name as many as possible in ten seconds, no duplicates. I remember one girl who had memorized 25 items for each of 30 categories, and was capable of intelligibly reciting any 25 items alphabetically in 10 seconds. That talent earned her school one of 20 or so spots in the “final round.”

The final round went like this: there would be a locked “coffer” (El cofre de la felicidad) containing the grand prize: enough cash to send the contestant’s group (40 to 50 people) on a week-long trip to a ski town by the Andes. The host would put the key into a plastic cylinder and mix it with other similar-looking keys that wouldn’t open anything. There would be one key per contestant. Contestants lined up according to what event they’d won. The host would ask the next contestant in line his or her name, school, and number of people in the group. He/she would walk up to the cylinder and pick a random key. There would be five seconds of suspense while the boy or girl jiggled the key as it failed to open the lock. The process would repeat until one lucky contestant picked the right key, and… watch the video below, no subtitles needed.

I probably don’t have to spell out the analogy at this point, but I will anyway. You had to figure out how to participate in the show in order to have a chance to win. I’ll spare you the details, but this wasn’t trivial (be in the right place at the right time). You needed skill in order to get a key (a chance at a good exit). And ultimately you needed dumb luck to pick the right key. If you didn’t get lucky… well, you might have one or two more shots before the end of your graduation year. Feliz Startup!

Hacker News discussion here.