Random Partition of a Sequence

Here’s a mildly interesting puzzle: suppose you have an ordered sequence (e.g. the numbers 1 to 10). How would you randomly divide it into partitions such that any possible partition is equally likely? Your partitions can have anywhere between one and ten elements. One example partition could be [1 2 3 4] [5] [6 7] [8] [9 10].

Before I give you the solution, here are a couple of hints:

  • There are four possible partitions of three elements:
    [1] [2] [3]
    [1 2] [3]
    [1] [2 3]
    [1 2 3]
  • If you were to run your function a large number of times, you would expect each one of those to occur close to 25% of the time.

In Clojure, you could test your random partition function like this:

(frequencies
 (take 100000
       (repeatedly #(random-partition [1 2 3]))))

And if your function worked properly, you’d expect your output to look like:

{((1 2 3)) 24726, ((1) (2) (3)) 25049, 
((1) (2 3)) 25077, ((1 2) (3)) 25148}

A good way to look at this problem is to first calculate the number of possible partitions. If we think of the sequence as a piece of tape that we can cut, we realize that for three elements there are two places to cut. More generally, for N elements there are N-1 cutting points. At every point we can choose to cut or not to cut, which means that there are 2^(N-1) possible partitions.

With this in mind a straightforward solution to the problem is:

  • traverse the sequence from left to right
  • toss a coin at every cutting point
  • leave it alone if it’s heads, cut if it’s tails

Note that you don’t even need to know the length of the sequence in advance. In fact, you could start cutting an infinite sequence and grab the first few random pieces (i.e. stop after a certain number of tails).

Here’s my Clojure implementation of this algorithm:

(defn random-partition [xs]
  (lazy-seq
   (when (seq xs)
     (loop [i 1]
       (if (zero? (rand-int 2))
         (recur (inc i))
         (let [[a b] (split-at i xs)]
           (cons a (random-partition b))))))))

And a much more concise one based on partition-by, courtesy of Gary Fredericks via the #clojure irc channel on Freenode:

(defn random-partition [xs]
  (partition-by (fn[_] (rand-nth [true false])) xs))

You can take 10 random pieces of the natural numbers with:

(take 10 (random-partition (range)))

and expect a result like:

((0) (1) (2) (3) (4 5 6 7) (8 9 10) (11 12) (13 14 15) (16 17) (18 19))

Please do not use this puzzle as an interview question :)

Life, the Universe and Technical Interviews

“There is a small satellite orbiting an unremarkable planet 50 light years from Earth. In spite of its tiny size, this floating cube of metal and silicon carries a load of paramount importance to the infrastructure of our galaxy. Inside its computer resides the Big Developer Matrix. As you can infer from its name it is fairly large, and it contains information about developers. More specifically software developers, because it was created by software developers. Each row of this matrix represents a company, and each column a developer. Every cell contains a binary value, which means either HIRE or DON’T HIRE.

The Big Matrix is rather sparse, and it is widely used by software managers across the Milky Way. In fact, it is the standard tool for hiring decisions pretty much everywhere. There are some fringe planets like Earth who rarely query the Big Matrix, and nobody is quite sure why. One of the maintainers of the software suspects it may have to do with a mismatch between the latency of communications and the average life spans of humans.”

— The author of this post, just now.

Unfortunately for us humans, we have little choice but to conduct technical interviews. We need an answer to the question of whether we want to hire someone or not, and we don’t have the time to really assess someone level of technical competence. Many try to approximate the Big Matrix with a series of progressive Bloom filters: techniques that tell you DON’T HIRE this person or INCONCLUSIVE. FizzBuzz is a good example of a Bloom filter question. I’ve never asked it to anyone and I don’t think I will; if you have to ask that question, you probably should rethink your pre-interview selection process.

One of the things I dislike about FizzBuzz is that it reminds me of customs forms with questions such as “have you ever been a member of the Communist party?” It’s almost insulting, and there is no way to show that you’re a good developer by coding FizzBuzz. There are questions that serve a similar purpose, while at the same time giving the candidate an opportunity to show off some skills. Here’s an example of a question that I like, even though I’ve never asked it:

Write some code that calculates how many numbers under a million (positive integers) have digits that add up to 42.

Why do I like this question? For one, it doesn’t require any specific knowledge besides basic programming techniques and math. Anybody who spent some time programming in some language should know enough to solve that in a few lines. Also, it has a mundane element that you normally encounter in real-life programming: one straightforward way of answering the question involves converting integers to strings and back. Furthermore, it lends itself to elegant functional solutions. Finally, getting the reference is an added bonus for personality fit :)

I asked this question on Twitter a couple of nights ago just for fun, and got some quick responses. The first one, in Haskell:

another one, in Python:

Clojure:

Ruby:

One of my favorites, in C and bash:

One I did in Clojure, without using strings and using recursion:

(count (filter #(= 42 %)
               (map (fn d [n]
                      (if (> 10 n)
                        n
                        (+ (mod n 10) (d (quot n 10)))))
                     (range 1e6))))

One in Forth by @technomancy:

s dup if 10 /mod recurse + then ; : f 0 1000000 0 do i s 42 = if 1 + then loop ;

I couldn’t fit a Java version into a tweet, but someone did it with Java 8:

class S{public static void main(String a[])
{int i=0,j=0;for(;i++<1e6;j+=(i+"").
chars().map(x->x-48).sum()==42?1:0);System.out.print(j);}}

@ejenk reduced the Haskell version to:

length . filter (== 42) . fmap sum . replicateM 6 $ [0..9]

There were a few other good solutions. The shortest were in languages like sed or Wolphram, which allowed the authors to “cheat” somewhat.

Conclusion

I will never get to ask the 42 question in an interview. If I did, I wouldn’t expect a candidate to show off code-colfing skills. I might ask about performance considerations for a larger space, and be pleasantly surprised if the candidate pondered solving the problem  mathematically for a second.

While this particular question may seem very easy in the comfort of your browser and favorite editor, it’s not trivial in the context of an interview. Always remember that technical interviews are very stressful; some of the best candidates I interviewed had sweaty palms upon entering the room. Even the best of us sometimes forget to bring a towel.

Job interviews go both ways (My story with Netscape in ’98)

Note: this is a post from 2011. It used to be on the IndexTank blog which no longer exists, so I’m putting it up here. 

When I interview a candidate here at IndexTank, I take it very seriously and try to be at the top of my game. I have to prove to this person that it’s worth choosing our company over others. Desirable candidates have many choices, so I’m competing with every hot tech company in the Bay Area.

Candidates often forget this fact. I expect them to bombard me with questions about where we are going, what I’m thinking, what it’s like to work hereMaybe even ask me a hard technical question, it’s only fair. I plan for ample interview time for this reason.

It turns out, the majority of the people don’t do this. They simply put on their best face, answer every question and try to sell themselves. Some ask questions everyone should ask, such as the financial situation of the company. Very rarely someone probes me to see if they would like to have me as their boss.

In a previous post I mentioned an interview I had at Netscape in 1998. I can freely talk about that one; Netscape has been effectively dead so long that younger generations may think it was a brand of milk chocolate.

It all started when a got a call from Netscape through a friend who had joined a few months before and recommended me. That was enough for them to fly me over to the Bay Area for an interview a week later. Arrived from Pittsburgh late at night, drove from SFO to a generic hotel in Mountain View where I spent the evening reading about how Netscape was doing. They had just announced that they would be releasing their source code to the public, which was very intriguing to me.

I remember arriving at Netscape, a gigantic campus that used to belong to HP. Maybe they had four thousand employees at the time. I was told that I was interviewing to replace employee #7 (everyone talked about their own employee number) who was leaving because he was fully vested that month (red flag).

At first glance I was impressed with the place. People with purple hair roller-skating around the office with their dogs. It was my first encounter with the “creative minds of Silicon Valley” in their natural environment.

During lunch came another red flag. One manager talked about how they were a bit bummed because Yahoo had just surpassed their market valuation. Both companies were public at the time, and it was right when the Nasdaq was starting to take off like a rocket (which turned out to be the Challenger if you pardon the bad taste of my pun).

What really sealed it for me where the technical interviews. The first guy who interviewed me was a C guru. He spent most of the interview talking about himself and asked me one trivia question: why is EOF in C defined as -1 (btw, this is platform dependent). Look it up if you’re interested. This was a prelude to explain a hideous bug he’d been fighting based on that behavior and how proud he was about it.

Another guy asked me a few “aha” questions which I got (I was really into puzzles and brainteasers at the time). They were unrelated to programming for the most part. The only hard programming question was how to traverse a binary tree with two pointers and without using a stack. Good luck figuring that one out during an interview.

If I ask you that question during an interview and you answer it correctly, there are three possibilities:

  • you’re super smart
  • you knew the answer beforehand and are a good actor
  • you lucked out into the solution and are reasonably smart

Not a lot of information. But most importantly, it says little about whether you’re capable of fixing horrendous bugs in the Mac version of Netscape Navigator (release 4, now with extra bloat!).

I had been a big fan of Netscape until then. I left the building that evening with mixed feelings. Nobody seemed excited about open-sourcing the code, the original people were leaving, the coder gurus projected strange attitudes to a lowly candidate. I didn’t want to work there. It turned out I never had to make a decision because they didn’t call me back. A few months later they were acquired by AOL, who proceeded to gut the company and turn netscape.com into yet-another-portal.

I still thank them for flying me over from Pittsburgh, rental car, hotel and meals paid. That’s how I came to the Bay Area for the first time and thirteen years later I’m still here.

If you liked this post, follow me on Twitter. If you didn’t, thanks for reading this far :)

Why I Built cointipping.com

Problem: owning cryptocurrency is hard. That’s right, just owning it. Most people have no idea where to start. Want to buy bitcoins? Sure, you can sign up for any of the major exchanges. You just need a bank account, a credit card, a phone number. Fill up these forms with black ink and go to the next window. Most people on this planet won’t even bother unless they have a compelling reason. If you own cryptocurrency, have you convinced any friends or family members to even take your crypto-money as a gift? At the very least you’d have to create a “paper wallet” for them, or help them install geeky software that’s easy to screw up (Bitcoin Qt, I’m looking at you).

The idea for cointipping.com came to me a few weeks ago, after having dinner with my friend Sean Grove. He has a habit of insisting on paying for our dinners, and likes to refuse my cash. After sharing some delicious Chinese, I told him I’d send him some Dogecoins for my share of the check. He could refuse my dollar bills, but he’d have to take my crypto-coins.

When I got home, I told Sean to get a Cryptsy account and tried to do an internal transfer of 35k dogecoins. I gave him my referral link, he opened an account, I asked him for his internal trade key (33c21fd675a31442d263dcd923ce974ba71425f1). I made the transfer, and it took about three days for the coins to show up on his Cryptsy account. We never knew exactly why.

I didn’t want to create work for Sean, so the alternative would have been to print out a paper wallet with 35k dogecoins and give it to him. It would look like this:

Dogecoin paper wallet

What would you do if I gave you a piece of paper with that on it? Probably file it in a drawer and forget about it.  Enough is enough, I thought. In the famous words of Samuel L. Jackson, I’d had it with the monkey-fighting bureaucracy in this Monday-to-Friday blockchain. Why couldn’t it be as easy as getting this in front of Sean’s face?

tip-example

Let’s reduce the barrier to entry, shall we? Nobody needs to download software or create a wallet to take your cash. You’ll grab my $20 bill even if you don’t have your wallet with you, or if you can’t look up your bank account number.

So about three weeks ago I went into a coding frenzy and came back to the surface with cointipping.com in its current form. Sure, it’s an MVP. It reeks of Bootstrap, and it lacks some basic usability features (thanks Sean for all your help with the UI, the technical details would make for an interesting post). I try to be security-conscious, and I’m backing up everything hourly. Still, there is no guarantee that the site won’t go up in flames, losing $25 worth of dogecoins in the process (that’s the current amount in the online wallet). Forgive me, I’m just one rusty old hacker.

Still, it serves one purpose: it does make it really easy to put dogecoins in the hands of people who know absolutely nothing about cyptocurrency. Sean can leave his coins on the site and forget about them until it’s worth his time to figure out what to do with them. He could withdraw them whenever, of course. He can always tip other people through the site, some of whom might withdraw them straight to their wallets. Easy breezy beautiful!

Before all my readers rush to deposit their life savings into cointipping.com, let me warn you:

This is a flimsy little site that I developed as a side project. It may get hacked. It may disappear. I respond for all the funds deposited, but I may get hit by a bus. Think of this as some loose change you may carry in your pocket to drop into a tip jar at a coffee shop. Maybe one day it will be a bank, but most likely it won’t. Don’t risk any money that you wouldn’t mind forgetting in the pocket of a shirt that you just took to the laundromat.

With all that out of the way, happy tipping. To the moon, fellow shibes!

Discuss on Hacker News

Do You Know Enough to Securely Own Bitcoins?

[I wrote this draft a week ago and forgot about it, but in the light of the Mt. Gox news I decided to post it because someone might find it useful]

Over the past few months I’ve noticed a significant number of forum posts by people losing crypto coins due to their own mistakes. This happens because many do not understand the basic concept of Bitcoin/Altcoin ownership, and the implications when it comes to protecting coins. I’ll try to explain here.

Suppose you roll a die one hundred times and write down the results:  3, 5, 2, 6, 2, 5, 6, 5, … (90 more numbers) … 4, 2. What you have now is a very unique list of numbers. You could repeat the operation billions of times every second for trillions of years and it is extremely unlikely that you would see the same sequence again. If you hadn’t written it down, it would be lost forever. Nobody would be able to reproduce it.

The number of possible sequences of 100 die rolls is about the same order of magnitude as the number of Bitcoin keys. In fact, the procedure I just described is a secure way to generate a key and its corresponding address [there are fewer addresses than keys, so you could generate a secure key with fewer rolls; let’s ignore that for now]

Here are two basic facts that you must know:

1) If you generate a Bitcoin address securely (as described above, for example), deposit money into it, and forget the key, the money is gone. Poof. Nobody can get it back.

2) If you store that key on your computer, anybody who can access your computer can steal all your coins.

The first fact should be obvious by now. Unless there is a problem with the Bitcoin protocol (in which case Bitcoin would become worthless anyway), there is no way to obtain the key for an address. The second fact is obvious too, but most people don’t realize its implications.

Until a few years ago, professional virus writers could not derive a huge amount of value from controlling a remote computer. They could create a botnet, launch distributed denial-of-service attacks, generate fake ad clicks. “Owning” a random computer wasn’t particularly valuable. That was then.

Right now there are probably thousands of computers on the internet that contain the keys to a large amount of crypto wealth. There are a few well-known formats for storing Bitcoin private keys. It is trivial for a virus writer to include a few lines of code that search for those just in case. You might think the solution is to keep your wallet encrypted, and you would be wrong. If malware is running on your computer, it could easily log your keystrokes when you type the password to your wallet. This happens all the time.

Knowing all this, many people opt to keep their bitcoins in an exchange. However, exchanges are not banks. Your money is not protected by governments. Over the past few weeks Mt. Gox (once the largest Bitcoin exchange) has been in the news because there is a significant change that they are insolvent. If they go out of business, chances are their customers will lose all their funds.

What is the solution, then?

First of all, you should not generate a Bitcoin key / address pair on an online computer. The easiest way to do this is to use a trusted, open source wallet on an offline computer, make a backup of the keys and securely erase the hard drive (or never connect tot he internet again). Perhaps the easiest way to do it is to boot a volatile Ubuntu from a USB drive (and ignore any requests to connect to a network).

Now that you know that your keys are not accessible via the internet, the question is how to store them. This is a more difficult problem because there is a conflict: on one hand you don’t want to have your keys in just one place. It’s best to have a few copies in different locations. On the other hand the more copies you have, the higher the chances that someone might find one. So far the best solution to this problem is BIP38 encryption (warning: highly technical document). The gist: instead of storing copies of the keys in a usable format, you keep them encrypted. If your passphrase is strong enough (e.g. at least four English words truly picked at random), then nobody should be able to decrypt and use your key.

There is still one more problem left: what if you lose or forget the passphrase? What if something happens to you? It does make sense to write down the passphrase, and store it separately from the keys. Perhaps you can give the passphrase to a trusted family member, and leave the location of the keys in your will. At this point there won’t be one solution that works for everyone. Once again, the key point to remember is that if any piece of information needed to restore a Bitcoin key is lost, the bitcoins are lost.

Complicated, isn’t it? My conclusion is that cryptocurrencies have a way to go before they can be as user-friendly as established payment technologies. In the meantime, don’t let this happen to you:

Have You Received Bitcoin Spam?

Over the past few days, someone has been sending tiny amounts of bitcoin (one satoshi) to thousands of addresses. Those transactions are unlikely to be confirmed, but they do show up on blockchain.info for now. Here’s one example:

bitcoinspam

The satoshis come from two vanity addresses: 1Enjoy… and 1Sochi…

If you click on the first address as I write this, you will see that its most recent transaction contains a link with the text “Play and win BTC”

bitcoinspam2

The link leads to what looks like a bitcoin gambling site, bitwars dot org. I’m not linking to it for obvious reasons.

Some googling shows it has been done before. Here’s a discussion on Stackexchange. If you have received these tiny amounts and wondered what they are about, now you know. You can safely ignore them. Just like life, spam will always find a way.

Dogecoin and the Appeal of Small Numbers

Dogecoin is a unique phenomenon in the fascinating world of cryptocurrencies. It’s barely six weeks old, and as I write this post its network has more computing power than any other cryptocurrency except for Bitcoin. It made headlines this weekend when its community raised enough money to send the Jamaican bobsled team to the Sochi Winter Olympics.

From a technical standpoint, Dogecoin is essentially a branded clone of Litecoin (the second cryptocurrency in terms of total market value). Without a doubt one of the most important factors contributing to Dogecoin’s popularity is its community. The Dogecoin subreddit has almost 40k users right now. The front page usually has a good mix of humor, good will, finance, and technology. Check it out if you haven’t already.

There’s another more subtle factor that I believe plays in Dogecoin’s favor: its tiny value. One DOGE is worth about $0.0015 right now. In other words, one dollar buys you about 600-700 DOGE. Contrast that with Bitcoin: $1 is about 0.001 BTC. This puts Bitcoin and Dogecoin in two completely different mental buckets for most people. One BTC is comparable to an ounce of gold. The press reinforces this idea, and many people view Bitcoin as a digital store of value. The daily transaction volume of BTC is about 0.2 percent of the total bitcoins in existence, which means that BTC does not circulate very much yet.

Contrast this with Dogecoin, for which the daily transaction volume is close to 15%. Where does that money go? Perhaps the most common usage of DOGE is to give online tips. Compare the activity of Reddit’s bitcointip and dogetipbot, and you’ll see the latter is much more active. What would you prefer as a tip, 100 DOGE or 0.000002 BTC? Both are almost meaningless in terms of monetary value, but receiving 100 units of a coin does feel better. It’s also easier to give tips; you don’t have to think much about tipping someone 10, 25 or 100 DOGE. With BTC you either have to choose a dollar amount, or be very careful with the number of zeroes.

The reason a DOGE is worth so little is the total supply of coins. The Bitcoin software has an embedded constant called MAX_MONEY. For Bitcoin it’s set to 21 million, which means that if Bitcoin takes over as a world currency it will be impossible for most people to ever own one. Litecoin is only slightly better, at 84 million. For DOGE, it’s one hundred billion (perhaps more, yet to be decided). This makes it unlikely that one DOGE will be worth $1 any time soon (or ever). It’s easy and fun to exchange $20 for 10k DOGE and give a fraction of them to strangers on the internet. Anyone can still mine hundreds of dogecoins per day with a desktop computer, and not feel very attached to them. Being a “slumdoge millionaire” is still affordable to many.

In a world where people get a kick out of likes or retweets, Dogetips take it up a notch. A Dogetip is an upvote that you can use, internet karma points that are actually worth something. So fun, very value.

Image credit: /u/binxalot, this person deserves tips. Of course I accept them too :)

DHpZsQCDKq9WbqyqfetMcGq87pFZfkwLBh

The Harsh, Logical World of Cryptocurrencies

There’s a subtle point that escapes most people buying bitcoin these days: Bitcoins are not exactly something you own. Rather, they are something you know.

Let’s say you you buy a bitcoin, and you transfer it to an address. The world now knows that there is one BTC there. This is public information. What the world doesn’t know is the key to use that money. Only the person who knows that key (you and only you, hopefully) can spend it. If you lose that key, nobody will ever be able to spend that bitcoin, and it will stay locked forever.

How is this possible? Why can’t we recover the key somehow? Because it’s like finding an infinitesimally small needle in an absurdly gigantic haystack. There is a very large number of possible bitcoin keys, comparable to the number of atoms in the universe (it’s a 77-digit number). Your bitcoin address is derived from that key, but not in a way that can be reversed. A very rough analogy would be: if you know John, you know his height is 5’10”. If I tell you that someone’s height is 5’10”, you have no idea who that person is.

This analogy doesn’t take us very far; if being 5’10” were the only requirement to spend John’s money, it would be easy to find someone with that height. Now imagine we are traveling around the universe and measuring the diameters of atoms. One day, and for no particular reason, they all expand into random sizes. Some remain tiny, others are are as large as a cow, a planet or a galaxy. Your job is to find an atom whose diameter is exactly 176,891,292,523,293.23412 miles. Good luck.

When you install a bitcoin wallet on your computer it generates a unique, virtually unrepeatable key for you. This key is 256 random bits, and you can derive your address from it. The second you transfer significant money to that address, you have a problem. You need to make sure that:

a) You never lose your key.

b) Nobody else can ever know your key.

You can see how this is difficult. If you wrote down your key, someone could find it and spend your money. If that happens, you wouldn’t know until you checked the blockchain transactions involving that address. Same thing if you backed it up. What if someone finds your backup drive? What if you upload it to Dropbox, and a disgruntled employee reads your files? On the other hand, you cannot afford to not back up your key. You wouldn’t want to be like the guy whose hard drive containing millions of dollars worth of BTC is buried somewhere in a dumpster.

So how do we solve this problem? Suppose you want to store a life-changing amount of bitcoin somewhere (why you’d want to do that is an interesting question). The best solution I’ve found so far involves two-factor authentication. You can use an algorithm like BIP38 to generate a passphrase-encrypted key, which you can then print and or / store somewhere offline. You may want to have a few paper / digital copies of this key in different locations (why not, it’s cheap). Still, there are issues with this approach:

– You have to know what you’re doing. For example, you have no reason to be online to generate a bitcoin key/address. Technically you don’t even need a computer; you could generate a bitcoin key by rolling a die 100 times, writing down the numbers (3, 6, 3, 2, 1 …) and performing some mathematical transformations with pen and paper. That would be a bit too paranoid, though. It’s easier (and less error prone) to use a live read-only Linux distribution on a computer that’s never been connected to the internet. You’d need to install a trusted, open-source program to generate a BIP38-encrypted paper wallet (you’d copy it from a portable drive, of course). There are several implementations out there. This is my favorite, although it requires a graphical interface. I’ve also written my own command-line paper wallet generator and BIP38 encrypt/decrypt utility (although I urge you to not trust my hasty implementations very much).

– If you use BIP38, now you have two things you can’t lose: your encrypted key and your password. Losing either now means that your money is gone forever. And if you can remember your password easily, it’s probably not good enough. Relevant XKCD:

Even though the BIP38 algorithm is pretty hard to brute-force, a determined adversary who possessed your encrypted key could try billions of combinations in a relatively small amount of time. Therefore, choosing a password for an encrypted key that controls millions of dollars worth of bitcoin is not a trivial matter.

Are there any better alternatives? Probably not. Some people are partial to the idea of a Brain Wallet, which works as follows: you pick a very complicated passphrase that you can remember, and use a mathematical function to convert it into a 256-bit bitcoin key. Therefore, the “wallet” exists only in your brain. The problem with this approach is that you’re losing significant entropy by picking something you can remember, therefore making the job of an attacker easier (see this Reddit thread about a guy who used an “obscure poem in Afrikaans” as a passphrase).

Could we go the opposite way perhaps? Could we generate a random key and then come up with a mnemonic that encodes 256 bits of information in a way that most people can remember? I gave the problem a little bit of thought this morning and came up with a few ideas, none of which convince me fully:

– Memorize your key as a decimal number. It’s “only” 77 digits, and you could use the techniques described in Moonwalking with Einstein. I’m sure I could remember 77 digits for an hour or two. Next year? Forget it :)

– Use the 256 bits to generate a fictional character. For example, the first bit could be man / woman. The next six bits could be age (say 18 to 82). Favorite color, height, weight, country of origin, occupation, etc. I made a reasonably long list of attributes that I thought I might remember about someone I really cared about, and it was hard to get past 100 bits.

– Generate meaningful text, perhaps a poem or a story, using arbitrary rules, e.g.

Random word from [ A/The/My/Your/One/His/Her/Our ] -> 3 bits

[Color name picked from a list of 16] -> 4 bits

[Nationality from 128 countries] -> 7 bits

[Type of animal from the most memorable ones, maybe 6 bits]

For example, the first 20 bits might encode into “your blue Zimbabwean otter,” and  that would be the start of a story generated by other rules that consumed the remaining 236 bits. It would be mandatory to lay out all your choices beforehand, and strictly adhere to what the 256 bits of your random key had dictated. The structure could yield something more memorable than the on-the-fly narratives “memory athletes” use for short-term recall. Even though it’s clearly possible to do this, I wouldn’t trust my memory to precisely remember such a story for years. It’s still a fun thought experiment.

If there is a point to this post, it’s that using bitcoin to safely store large amounts of money long-term is still impractical for most people. There’s a reason bitcoin wallets are not called vaults. Unless you really know what you’re doing, don’t use them to store more money than you would carry in your pocket.

By the way, the key to the address linked in the second paragraph of this post is the number 1. You can see that people send tiny amounts of money to it frequently, perhaps to test out new software. Want to catch a digital dime once in a while? :)

If you found this post useful, tips are welcome.

1EmwBbfgH7BPMoCpcFzyzgAN9Ya7jm8L1Z

Hacker News discussion of this post.

Markov Chains in Clojure, Part 2 – Scaling Up

Last month I posted a very simple Markov Chain generator that takes one paragraph of text and produces gibberish. The one issue is that it doesn’t scale, for two reasons:

1) It reads all the text into memory at once. That’s fine if you’re processing a few paragraphs, but for large amounts of text it’s better to process one line at a time.

2) The word lists contain repeated words. These lists would take much less space if they had word counts instead. For example:

{"He" ("is" "is"), :start ("He" "She" "He" "She")}

Could be

{"He" {"is" 2}, :start {"He" 2, "She" 2)}

Obviously it’s not much of a gain for this simple example. If we were processing book though, we might have hundreds of sentences that started with He or She. The second structure would be a couple orders of magnitude smaller.

Let’s start with the second problem. Here’s an elegant code snippet to traverse the list of words and build the structure with counts from a Stack Overflow discussion:

Notice the usefulness of fnil, it makes the code more concise. More importantly, the above is a lazy function; if we read the file lazily then we can process the input stream without reading it all into memory first. How would you do it?

Below is a snippet of code based on another Stack Overflow question. By the way, isn’t Stack Overflow awesome? There was a time when we had to code without it, or without internet access. Those were the bad old times :)

Notice that I sandwiched the words in a sentence between the :start and :end markers because I want to know what words are good starting and ending points for generated sentences. The transform function sees a long stream of words that could be a single sentence without these markers. As anecdotal evidence that this approach is more memory-efficient, running the JVM on my Macbook against a text file with 50k English words takes up 20Mb less than the original program.

Now that we have a suitable structure, how do we pick words at random but with a probability that’s proportional to the number of times it occurs after a given word?  For example, if I have {“He” {“is” 15, “was” 5}} I want to pick “is” 3/4 of the time. This was an interview question that we asked frequently at Inktomi in the late 90s. The simplest solution is to pick a random number in the range of the number of individual instances of words, and pick the index of the word corresponding to the slice it would take in the total.

In the above example, we have two slices:

0 -14 -> "is"

15 - 19 -> "was"

so choosing a random number between 0 and 19 and then checking what slice it belongs to would yield “is” 75% of the time and “was” the remaining 25%. Here’s a nice implementation from Rick Hickey:

And that’s pretty much it. The entire code is here. I ran it against this collection of Hacker News headlines from this post and generated a few of my own:

  • “Microsoft’s Decline of Employee #1, Leaving Github”
  • “Y Combinator’s First iOS 6 Countdown”
  • “Congress Is the HipHop VM”
  • “Welcome to Get From Russia in Javascript charts and Numpy”

Where do we go from here? We could change the code to pick only sentences of a certain length, or to make chains of n-grams (instead of single words) to create more plausible text. Have fun!

Fun with Markov Chains and Clojure

Markov chains are a simple concept. Start with a state (e.g. a word, a location on a map, a musical note) and move to another one at random. Repeat until you have a chain of states that you can use for something. The key is that you don’t choose randomly from all possible states, only from those that have some probability of following the current state.

For example, suppose you know that a sentence starts with the word “She.” It’s reasonable to bet on the next word being “is” or “was.” It’s possible (but less likely) to see less common words such as “asserted” or “obliterated.” Most likely your text doesn’t contain instances of phrases like “She chemistry” or “She absolutism.” If that’s the case, those words have zero probability of being chosen.

Let’s take the following text as an example:

He is a boy. She is a girl. He is young. She is younger.

Here is a list of all the transitions we see in the sentences from this text:

{"She" ("is" "is"), "a" ("boy." "girl."), 
"is" ("a" "a" "young." "younger."),
"He" ("is" "is"), "*START*" ("He" "She" "He" "She")}

Note that the word “is” has four words that it could transition to. Two of them are “a” so if we start generating random sentences, “is” will be followed by “a” half the time. Let’s choose a word that follows the *START* state (nothing) and continue until we find a word with a period. Here are some sentences we could see:

"He is younger." "She is a girl." "She is a boy." "He is a boy."

Ok, time to automate this with some Clojure code (or skip to the “generate gibberish” button below if you’re not a programmer):

The first function generates a map of transitions like the one shown above. The second one chooses words from the map until it finds a word that ends in a period. It’s easy to compile this into JavaScript via ClojureScript. I’ve done that, check out the full code on Github. You can test it below: click the “generate gibberish” button and check the output. The input box contains Blowin’ in the Hamlet by Bob Shakespeare. Try replacing that with a few paragraphs of your own.


Output:

——-

Side note: the code above is the simplest algorithm I came up with for generating a Markov chain out of a few paragraphs, but it doesn’t scale for large amounts of text. If you try running it against a book, it will run out of memory. In a follow-up post I’ll show how to put together a solution that scales better. It will be more about Clojure than about Markov Chains.