This weekend I fetched 1M random user profiles from Twitter, just because. I figured it would be enough to answer some interesting questions about Twitter. Here’s what I did, along with some conclusions.
As you may know, each Twitter user is assigned a numeric id. These ids started at a very low number and are always increasing. The highest Twitter user id when I started the experiment was around 637M (found by trial and error). I figured there would be gaps in user ids, mostly because of massive deletions of spammer accounts, and a quick sample estimated the gaps to be on the order of 20%. To end up with roughly 1M valid profiles I therefore needed about 1M / 0.8 ids, so I generated 1.25M unique user ids in the range 0-637M and tried to fetch the profile details for them.
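For anyone who wants to reproduce the sampling step, here is a minimal Python sketch. It is not the code used for the experiment; the function name and the seed are made up for illustration, and it simply draws ids uniformly without replacement from the range mentioned above.

```python
import random

MAX_USER_ID = 637_000_000  # highest id observed, per the post
SAMPLE_SIZE = 1_250_000    # 1M target profiles / 0.8 expected hit rate


def sample_user_ids(max_id=MAX_USER_ID, k=SAMPLE_SIZE, seed=None):
    """Return k distinct ids drawn uniformly from 0..max_id (hypothetical helper)."""
    rng = random.Random(seed)
    # Sampling from a range object avoids building a 637M-element list.
    return rng.sample(range(max_id + 1), k)


ids = sample_user_ids(seed=42)
print(len(ids), len(set(ids)))  # both counts are equal: no duplicates
```

Sampling without replacement guarantees unique ids, so no deduplication pass is needed afterwards.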
The Twitter API allows requesting 100 user profiles with one call, so that means I had to issue 12,500 calls. Twitter limits API requests from a given IP address to 150 per hour (in practice sometimes less). I had to use a few addresses to fetch all the data over the weekend (tip that came in handy: some mobile carriers refresh your IP every time you go in and out of airplane mode).
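The arithmetic behind "a few addresses" can be sketched as follows. This is only a back-of-the-envelope helper under the limits quoted above (100 profiles per call, 150 calls per hour per IP); the function names are hypothetical and no real API call is made.

```python
import math

BATCH_SIZE = 100      # profiles per API call, per the post
CALLS_PER_HOUR = 150  # per-IP rate limit, per the post


def batches(ids, size=BATCH_SIZE):
    """Yield consecutive slices of at most `size` ids."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def hours_needed(n_ids, n_addresses=1):
    """Hours to fetch n_ids profiles when the calls are spread over n_addresses IPs."""
    calls = math.ceil(n_ids / BATCH_SIZE)
    return calls / (CALLS_PER_HOUR * n_addresses)


print(math.ceil(1_250_000 / BATCH_SIZE))     # 12500 calls
print(round(hours_needed(1_250_000), 1))     # 83.3 hours on one address
print(round(hours_needed(1_250_000, 2), 1))  # 41.7 hours on two addresses
```

A single address would need more than 83 hours, which does not fit in a 48-hour weekend; two addresses running in parallel bring it down to about 42 hours, assuming the full 150 calls per hour are actually granted.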
After fetching the 12,500 batches I was left with 1,039,556 Twitter profiles, so about 83% of the 1.25M ids I tried belong to an existing account. If that hit rate holds across the whole id range, there must exist approximately 530 million Twitter accounts (*): 83% of 637M. Of course, this number doesn’t say a lot. Let’s look into these accounts in more detail.
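The extrapolation is a simple proportion, and a rough sampling error can be attached to it. The sketch below assumes the ids were drawn uniformly at random and uses the usual binomial standard error; that is an assumption made here for illustration, and it says nothing about systematic effects such as the one discussed in the footnote.

```python
import math


def estimate_accounts(found, tried, max_id):
    """Scale the observed hit rate to the full id range.

    Returns (estimated account count, standard error of that estimate).
    """
    p = found / tried                     # fraction of sampled ids that exist
    se = math.sqrt(p * (1 - p) / tried)   # binomial standard error of p
    return p * max_id, se * max_id


estimate, std_err = estimate_accounts(1_039_556, 1_250_000, 637_000_000)
print(f"{estimate / 1e6:.1f}M accounts, 95% margin {1.96 * std_err / 1e6:.1f}M")
```

With a sample this large the purely statistical uncertainty is well under a million accounts, so any real error in the 530M figure would come from how the ids are distributed, not from the sample size.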
Obligatory chart: signups over time (July of 2012 is incomplete; for some reason I could only fetch accounts created before July 18th).
I left out the Paleozoic era of Twitter (2006 and 2007) because it was visually insignificant compared to the Great Expansion of 2009.
The Tweets and the Tweet-nots
Approximately half of the accounts have tweeted at least once. The other half may be lurkers, or an example of what would happen if domain names were free: lots of parked ones. Still, the number of accounts that have never tweeted seems surprisingly high. Furthermore, 16% of all accounts (over 80 million) have no followers, no friends and no tweets. Hey Twitter, how about releasing some of those to the wild?
The average Twitter user has tweeted 307 times. That’s a total of 163 billion tweets since the dawn of Twitter [this space reserved for a snarky comment about that amount of collective wisdom].
It may be more meaningful to count only users who tweeted at least once. For those 272M accounts, the average is about 600 tweets (163 billion tweets divided by 272M accounts).
The distribution of followers per account is a power law (duh). The most followed account has tens of millions of followers; the median account has 1. The average user follows (or is followed by) 51 people. Of course this average is pretty meaningless, but it means that the Twitter “follow” graph has about 27 billion edges (51 × 530M).
Followers and friends
For all accounts: median followers = 1, median friends = 5 (average is 51 for both).
For the 272M accounts that tweeted at least once: median followers = 4 and median friends = 15 (average is 85 for both).
For the 80M accounts that tweeted in the past month (these are what I’d call active users by the way): median followers = 31 and median friends = 72 (averages, 235 and 188).
In early Twitter times (i.e. 2007), the average user name was 8 characters long. It increased to 9 in mid-2008, and to 10 in 2010. That’s also the current overall average, even though for most months of 2012 new accounts had an average of 11.
Ok, enough numbers. What does this all mean?
To me, the most telling number is the number of people who actually tweet at least once a month. 80M is respectable, but it’s still a tiny fraction of the internet. Of course, the elites of the world are overrepresented on Twitter; it’s a free megaphone for them. They are also the prime audience for many advertisers, but obviously not for all.
Now, here’s an interesting off-the-cuff hypothesis: what if the ratio between existing and active accounts on Twitter were not very different from the one on Facebook? Facebook’s definition of an active user is quite generous: anyone who interacts with the site in any conceivable way. That would mean that even though there are close to a billion “active users” on Facebook, perhaps between 100M and 200M are people who actually spend time posting and consuming content on Facebook.com.
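The hypothesis boils down to one ratio. As a quick check using only the figures quoted in this post (the constant names are made up for the example, and "close to a billion" is rounded to exactly one billion):

```python
TWITTER_TOTAL = 530e6     # estimated existing accounts, from this post
TWITTER_ACTIVE = 80e6     # accounts that tweeted in the past month
FACEBOOK_REPORTED = 1e9   # "close to a billion", rounded for the example

# Share of existing Twitter accounts that are active by the post's definition.
ratio = TWITTER_ACTIVE / TWITTER_TOTAL
print(f"{ratio:.0%}")                              # 15%

# Applying the same share to Facebook's reported figure.
print(f"{ratio * FACEBOOK_REPORTED / 1e6:.0f}M")   # 151M
```

That lands in the middle of the 100M to 200M range, which is only as good as the assumption that the two ratios are comparable.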
Ok, enough speculation. Let me know if there are any other numbers you’d like me to extract from the data, or if you see anything wrong with my methodology. Here’s my dataset if you’d like to run some experiments of your own [update 7/31/2012 12 pm: dataset removed per Twitter’s request]
On a final note: I apologize, but at this point I have to remind you to follow me on Twitter (well, technically Posterous is Twitter too).
(*) Edit: simonw pointed out that because of the Snowflake update, the 530M estimate could be off by a few percentage points. I believe it’s close to noise, but take the figure with a grain of salt (e.g. 500M to 550M). A more accurate experiment would require generating ids differently for the period after October of 2011 because there are much larger gaps in id numbers since then.