Twitter’s Snowflake Migration Problems

It is common knowledge that Twitter is one of the most visited websites and one of the most used web services. Its users send millions of tweets per day. From a technical point of view, Twitter’s numerical tweet IDs are incremented in sequence by millions each day, and the scheme that generates them is reaching its limits. According to Taylor Singletary:

Hi Developers,

It’s no secret that Twitter is growing exponentially. The tweets keep coming with ever increasing velocity, thanks in large part to your great applications.

Twitter has adapted to the increasing number of tweets in ways that have affected you in the past: We moved from 32 bit unsigned integers to 64-bit unsigned integers for status IDs some time ago. You all weathered that storm with ease. The tweetapoclypse was averted, and the tweets kept flowing.

Now we’re reaching the scalability limit of our current tweet ID generation scheme. Unlike the previous tweet ID migrations, the solution to the current issue is significantly different. However, in most cases the new approach we will take will not result in any noticeable differences to you the developer or your users.

We are planning to replace our current sequential tweet ID generation routine with a simple, more scalable solution. IDs will still be 64-bit unsigned integers. However, this new solution is no longer guaranteed to generate sequential IDs.  Instead IDs will be derived based on time: the most significant bits being sourced from a timestamp and the least significant bits will be effectively random.

Please don’t depend on the exact format of the ID. As our infrastructure needs evolve, we might need to tweak the generation algorithm again.

If you’ve been trying to divine meaning from status IDs aside from their role as a primary key, you won’t be able to anymore. Likewise for usage of IDs in mathematical operations — for instance, subtracting two status IDs to determine the number of tweets in between will no longer be possible.

For the majority of applications we think this scheme switch will be a non-event. Before implementing these changes, we’d like to know if your applications currently depend on the sequential nature of IDs. Do you depend on the density of the tweet sequence being constant?  Are you trying to analyze the IDs as anything other than opaque, ordered identifiers? Aside for guaranteed sequential tweet ID ordering, what APIs can we provide you to accomplish your goals?

Taylor Singletary
Developer Advocate, Twitter
http://twitter.com/episod

To solve this problem, Twitter opted for a system that generates unique IDs without a single sequential counter. IDs remain 64-bit unsigned integers, but they are built from timestamps instead of a global sequence number; internally, this project is referred to as Snowflake.

A Snowflake ID is a 64-bit unsigned integer that is composed of:

  • 41 bits for millisecond-precision time (about 69 years)
  • 10 bits for a configured machine identity (1024 machines)
  • 12 bits for a sequence number (4096 per machine per millisecond)
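
The layout above can be sketched as plain bit shifts. This is only an illustration, not Twitter’s actual code: the function names are made up, the field order in the low bits (machine identity above sequence number) is assumed from the order of the list, and the timestamp is assumed to be milliseconds counted from some epoch chosen by the service.

```python
# Field widths come from the list above. Placing the machine identity above
# the sequence number is an assumption made for this illustration.
TIMESTAMP_BITS = 41
MACHINE_BITS = 10
SEQUENCE_BITS = 12

MACHINE_SHIFT = SEQUENCE_BITS
TIMESTAMP_SHIFT = MACHINE_BITS + SEQUENCE_BITS


def compose(timestamp_ms: int, machine_id: int, sequence: int) -> int:
    """Pack the three fields into one integer ID (hypothetical helper)."""
    if not 0 <= timestamp_ms < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp does not fit in 41 bits")
    if not 0 <= machine_id < (1 << MACHINE_BITS):
        raise ValueError("machine id does not fit in 10 bits")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence does not fit in 12 bits")
    return (timestamp_ms << TIMESTAMP_SHIFT) | (machine_id << MACHINE_SHIFT) | sequence


def decompose(snowflake_id: int) -> tuple:
    """Split an ID back into (timestamp_ms, machine_id, sequence)."""
    sequence = snowflake_id & ((1 << SEQUENCE_BITS) - 1)
    machine_id = (snowflake_id >> MACHINE_SHIFT) & ((1 << MACHINE_BITS) - 1)
    timestamp_ms = snowflake_id >> TIMESTAMP_SHIFT
    return timestamp_ms, machine_id, sequence


# How long 41 bits of milliseconds last before wrapping around.
years = (1 << TIMESTAMP_BITS) / (1000 * 60 * 60 * 24 * 365.25)
print(f"{years:.1f} years")

example = compose(timestamp_ms=123456789, machine_id=5, sequence=42)
print(example, decompose(example))
```

Because the timestamp sits in the most significant bits, IDs generated later compare as larger, even though they are no longer consecutive.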

Everything seems planned for a smooth, unnoticeable transition. The major problem is JavaScript, along with some other languages that cannot fully represent 64-bit integers. JavaScript, for instance, stores every number as a double-precision float, so it can only represent integers exactly up to about 53 bits. Other, older languages can probably support even less. Considering that the Twitter API is served primarily in JSON (JavaScript Object Notation; take note of the word JavaScript!), its responses must be completely compatible with JavaScript.
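
The 53-bit limit is easy to reproduce outside a browser. Python’s float uses the same IEEE 754 double format as a JavaScript number, so the small check below (the helper name is made up for this sketch) shows which integers survive a trip through a double:

```python
# Python's float is an IEEE 754 double, the same format JavaScript uses for
# every number, so float() mimics what happens to a large ID in JavaScript.

def survives_double(n: int) -> bool:
    """Return True if n is unchanged after being stored as a 64-bit float."""
    return int(float(n)) == n


tweet_id = 10765432100123456789  # the sample ID used later in this post

print(survives_double(2**53))      # True
print(survives_double(2**53 + 1))  # False
print(survives_double(tweet_id))   # False
```

Anything above 2^53 may silently come back as a neighbouring number, which is exactly what would happen to a tweet ID once it grows past 53 bits.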

The temporary solution is to add redundant data to the JSON objects that the API generates. Instead of returning only the numerical ID of a tweet, the API will also return a string representation of the ID alongside it.

{"id": 10765432100123456789, "id_str": "10765432100123456789"}

Sending the same data in two representations seems redundant and is not advisable in most conditions, considering the increase in the amount of data transferred to clients, especially those using the API on mobile and low-bandwidth devices. It wouldn’t be much of a problem if only the tweet ID had a duplicate representation, but the following fields will also get string representations:

  • id (DM, Saved Search, User, List)
  • in_reply_to_status_id
  • in_reply_to_user_id
  • new_id (Streaming only. Will be removed when Snowflake is enabled)
  • current_user_retweet_id (When include_my_retweet=1 is passed)
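
A client could cope with this by always reading the string twin of each field. The sketch below is only one possible approach: the helper name is made up, and naming the twins by appending `_str` to every field is an assumption extrapolated from the `id_str` example above. Passing `parse_int=float` makes Python’s JSON parser store numbers as doubles, which imitates what a JavaScript client would see.

```python
import json

# Numeric ID fields from the list above. The "_str" suffix for fields other
# than "id" is an assumption based on the "id_str" example.
ID_FIELDS = (
    "id",
    "in_reply_to_status_id",
    "in_reply_to_user_id",
    "new_id",
    "current_user_retweet_id",
)


def string_ids(obj: dict) -> dict:
    """Return {field: ID as a string} for every ID field present in obj."""
    ids = {}
    for field in ID_FIELDS:
        as_str = obj.get(field + "_str")
        if as_str is not None:
            ids[field] = as_str
        elif isinstance(obj.get(field), int):
            # Fallback for older payloads; only exact integers are trusted.
            ids[field] = str(obj[field])
    return ids


payload = (
    '{"id": 10765432100123456789, "id_str": "10765432100123456789", '
    '"in_reply_to_status_id": null, "in_reply_to_status_id_str": null}'
)

# parse_int=float imitates a parser that keeps every number as a double.
lossy = json.loads(payload, parse_int=float)

print(int(lossy["id"]) == 10765432100123456789)  # False: the number was rounded
print(string_ids(lossy))  # {'id': '10765432100123456789'}
```

The numeric field is rounded during parsing, while the string field arrives intact, which is the whole point of shipping both.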

For the meantime, these changes are inevitable. The timeline of the changes is as follows:

  • 22nd October 2010 (Friday): String versions of ID numbers will start appearing in the API responses
  • 4th November 2010 (Thursday): Snowflake will be turned on, but at ~41 bits in length
  • 26th November 2010 (Friday): Status IDs will break 53 bits in length and cease being usable as integers in JavaScript-based languages

Now I wonder what will happen to the applications that are affected by the migration.

The Traditional “Welcome to my Site” Post

I have nothing to do anyway, so why don’t we get on with the customary tradition of making the first post the “Welcome to my Site” post? I’m not so sure how this so-called tradition came about, but I’m guessing it started during the internet boom of the ’90s, when everybody who had an internet connection wanted to create their own website. I’m sure most of you remember how websites looked back then. Just think about Geocities, Tripod and Angelfire. These were the early website hosting providers, offering a few MBs of disk space. I don’t know if there was a consensus on designing webpages in that era, but colorful repeating backgrounds with animated GIFs scattered around the whole page were common features of a homepage, as were big neon-colored headers that almost filled the 800×600 resolution of monitors back then. And don’t forget the classic “Sign my Guestbook” links. Leaderboard advertisement banners loading at the top of the page were the trademark of free web hosting. Static HTML pages were the only way to go, since no free server-side scripting was available. And of course, web browsing wouldn’t have been complete without Microsoft Internet Explorer and Netscape Navigator.

Come to think of it, every time I visited a personal homepage, the main page would almost always include a variation of the phrase “Welcome to my homepage!” When I ponder it, most of the people who made “homepages” back then made them just for the sake of it. They created their personal web space just because it was the fad. Even if they didn’t have anything to share with the world, they still dedicated a page to showing something about themselves.

But after nearly two decades, much has changed. Blogs and social networking sites allow us to serve the same purpose, but without the eye-murdering web design. New concepts have evolved out of the needs that the web brings us. Although a lot has been constantly changing, the “Welcome to my Site” pages still won’t fade that easily, especially for those, like me, who are literary-challenged and have nothing special to share with the world.