It is common knowledge that Twitter is one of the most visited websites and one of the most heavily used web services. Millions of tweets are sent every day by different users. From a technical point of view, Twitter’s numerical tweet IDs are incremented in sequence by millions each day, and the current ID generation scheme is reaching its scalability limit. According to Taylor Singletary:
Hi Developers,
It’s no secret that Twitter is growing exponentially. The tweets keep coming with ever increasing velocity, thanks in large part to your great applications.
Twitter has adapted to the increasing number of tweets in ways that have affected you in the past: We moved from 32 bit unsigned integers to 64-bit unsigned integers for status IDs some time ago. You all weathered that storm with ease. The tweetapoclypse was averted, and the tweets kept flowing.
Now we’re reaching the scalability limit of our current tweet ID generation scheme. Unlike the previous tweet ID migrations, the solution to the current issue is significantly different. However, in most cases the new approach we will take will not result in any noticeable differences to you the developer or your users.
We are planning to replace our current sequential tweet ID generation routine with a simple, more scalable solution. IDs will still be 64-bit unsigned integers. However, this new solution is no longer guaranteed to generate sequential IDs. Instead IDs will be derived based on time: the most significant bits being sourced from a timestamp and the least significant bits will be effectively random.
Please don’t depend on the exact format of the ID. As our infrastructure needs evolve, we might need to tweak the generation algorithm again.
If you’ve been trying to divine meaning from status IDs aside from their role as a primary key, you won’t be able to anymore. Likewise for usage of IDs in mathematical operations — for instance, subtracting two status IDs to determine the number of tweets in between will no longer be possible.
For the majority of applications we think this scheme switch will be a non-event. Before implementing these changes, we’d like to know if your applications currently depend on the sequential nature of IDs. Do you depend on the density of the tweet sequence being constant? Are you trying to analyze the IDs as anything other than opaque, ordered identifiers? Aside for guaranteed sequential tweet ID ordering, what APIs can we provide you to accomplish your goals?
Taylor Singletary
Developer Advocate, Twitter
http://twitter.com/episod
To solve this problem, Twitter opted for a system that generates unique IDs without requiring them to be sequential. The plan is to keep 64-bit unsigned integers but derive them from timestamps instead of a single running counter; the internal project that does this is called Snowflake.
A Snowflake ID is a 64-bit unsigned integer that is composed of the following fields (the remaining high bit is left unused):
- 41 bits for millisecond precision time (69 years)
- 10 bits for a configured machine identity (1024 machines)
- 12 bits for a sequence number (4096 per machine per millisecond)
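The bit layout above can be sketched as plain shift-and-mask arithmetic. This is only an illustration of the 41/10/12 split, not Twitter's actual code: the epoch value `CUSTOM_EPOCH_MS` and the function names `compose_id` and `decompose_id` are assumptions made up for this example.

```python
# Illustrative sketch of the 41/10/12-bit layout; not Twitter's implementation.
# ASSUMPTION: CUSTOM_EPOCH_MS is an arbitrary epoch chosen for this example.
CUSTOM_EPOCH_MS = 1_288_000_000_000

TIMESTAMP_BITS = 41
MACHINE_BITS = 10
SEQUENCE_BITS = 12

MAX_MACHINE = (1 << MACHINE_BITS) - 1      # 1023
MAX_SEQUENCE = (1 << SEQUENCE_BITS) - 1    # 4095


def compose_id(unix_ms: int, machine_id: int, sequence: int) -> int:
    """Pack a timestamp, machine id and sequence number into one integer."""
    elapsed = unix_ms - CUSTOM_EPOCH_MS
    if not 0 <= elapsed < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp does not fit in 41 bits")
    if not 0 <= machine_id <= MAX_MACHINE:
        raise ValueError("machine id does not fit in 10 bits")
    if not 0 <= sequence <= MAX_SEQUENCE:
        raise ValueError("sequence does not fit in 12 bits")
    # Timestamp goes in the most significant bits, so newer IDs sort higher.
    return (
        (elapsed << (MACHINE_BITS + SEQUENCE_BITS))
        | (machine_id << SEQUENCE_BITS)
        | sequence
    )


def decompose_id(snowflake_id: int) -> tuple[int, int, int]:
    """Unpack an ID back into (unix_ms, machine_id, sequence)."""
    sequence = snowflake_id & MAX_SEQUENCE
    machine_id = (snowflake_id >> SEQUENCE_BITS) & MAX_MACHINE
    elapsed = snowflake_id >> (MACHINE_BITS + SEQUENCE_BITS)
    return elapsed + CUSTOM_EPOCH_MS, machine_id, sequence


# How many whole years 41 bits of milliseconds can cover.
YEARS_COVERED = (1 << TIMESTAMP_BITS) // (1000 * 60 * 60 * 24 * 365)
```

Because the timestamp occupies the high bits, an ID generated one millisecond later always compares greater, regardless of which machine produced it. That is why the IDs stay roughly time-ordered even though they are no longer sequential.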
Everything seems planned for a smooth, unnoticeable transition. The major problem is JavaScript and some other languages, which cannot fully represent 64-bit integers. JavaScript, for instance, stores all numbers as double-precision floats and can represent integers exactly only up to 53 bits. Some other languages and older runtimes may support even less. Considering that the Twitter API is served primarily in JSON (JavaScript Object Notation; take note of the word JavaScript!), its output must be fully compatible with JavaScript.
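The 53-bit limit can be reproduced outside a browser. The sketch below uses Python, whose `float` is the same IEEE 754 double that JavaScript uses for numbers on common platforms; the helper name `survives_double` is made up for this example.

```python
# Largest integer a double can hold without gaps between neighbours;
# the same value as JavaScript's Number.MAX_SAFE_INTEGER.
MAX_SAFE_INTEGER = 2**53 - 1


def survives_double(n: int) -> bool:
    """Return True if n is unchanged after a round trip through a 64-bit float."""
    return int(float(n)) == n


tweet_id = 10765432100123456789  # a 64-bit ID, well above 2**53

print(survives_double(MAX_SAFE_INTEGER))  # True
print(survives_double(2**53 + 1))         # False: rounds to 2**53
print(survives_double(tweet_id))          # False: low digits are lost
```

Any client that decodes such an ID into a double silently ends up with a different number, which is the failure mode the API change tries to avoid.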
The temporary solution is to add redundant data to the JSON objects that the API generates. Instead of returning only the numerical ID of a tweet, the API will also return a string representation of the ID alongside it.
{"id": 10765432100123456789, "id_str": "10765432100123456789"}
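A client could take advantage of the string field roughly as follows. This is a minimal sketch: `parse_int=float` is used only to imitate a parser that decodes every number as a double, as JavaScript does, and the helper name `tweet_id` is hypothetical.

```python
import json

payload = '{"id": 10765432100123456789, "id_str": "10765432100123456789"}'

# Imitate a JavaScript-style parser: every JSON integer becomes a double.
js_like = json.loads(payload, parse_int=float)


def tweet_id(obj: dict) -> int:
    """Prefer the exact string field; fall back to the numeric one if absent."""
    if "id_str" in obj:
        return int(obj["id_str"])
    return int(obj["id"])


lossy = int(js_like["id"])   # already rounded by the double conversion
exact = tweet_id(js_like)    # recovered digit for digit from id_str
```

The numeric field has already lost its low digits by the time the application sees it, while the string field carries the ID through untouched.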
Sending the same data in two representations seems redundant and is not advisable in most conditions, considering the increase in the amount of data transferred to clients, especially those using the API on mobile and low-bandwidth devices. It would not be much of a problem if only the tweet ID had a duplicate representation, but the following fields will also get string representations:
- id (DM, Saved Search, User, List)
- in_reply_to_status_id
- in_reply_to_user_id
- new_id (Streaming only. Will be removed when Snowflake is enabled)
- current_user_retweet_id (When include_my_retweet=1 is passed)
For the meantime, these changes are inevitable. The timeline of the changes is as follows:
- 22nd October 2010 (Friday): String versions of ID numbers will start appearing in the API responses
- 4th November 2010 (Thursday): Snowflake will be turned on but at ~41 bits in length
- 26th November 2010 (Friday): Status IDs will break 53 bits in length and cease being usable as integers in JavaScript-based languages
Now I wonder what will happen to the applications that are affected by the migration.