Zeitgeist - the most shared BBC links on Twitter
Zeitgeist is a prototype that highlights the most shared BBC webpages on Twitter: a digest that links people to the hottest BBC pages. The project is part of a larger area of exploration to see how the BBC can use real-time trending data to enrich user experiences. One of our recent projects shows how the artists played on BBC radio are trending on other music services.
We developed Zeitgeist as a simple information source for users and to give our production teams insight into users' interests and behaviours. There are some interesting commercial alternatives available, which are worth checking out, but we had some specific requirements for our prototype.
The system uses a custom-built ingest chain to search for tweets containing a BBC URL. As it runs in real time, these links come and go depending on what Twitter users are talking about. You can see this 'liveness' in one view or take a broader view in another.
Zeitgeist uses the web page's URL and metadata to determine where it comes from and to assign it a category. These categories give links a context for the user and a means of navigating deeper.
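As a rough illustration of how a category could be derived from a URL and in-page metadata, here is a minimal Ruby sketch. The rule table, the category names, the `section` metadata key and the choice to let metadata override the URL are all assumptions made for the example; they are not Zeitgeist's actual rules.

```ruby
require "uri"

# Hypothetical rules: the first path pattern that matches wins.
CATEGORY_RULES = [
  [%r{\A/news(/|\z)},    "News"],
  [%r{\A/sport(/|\z)},   "Sport"],
  [%r{\A/iplayer(/|\z)}, "iPlayer"]
].freeze

# Returns a category name for a page.
# Assumption: in-page metadata, when present, overrides the URL-based guess.
def categorise(url, metadata = {})
  return metadata["section"] if metadata["section"]

  path = URI.parse(url).path.to_s
  rule = CATEGORY_RULES.find { |pattern, _name| pattern.match?(path) }
  rule ? rule.last : "Other"
rescue URI::InvalidURIError
  "Other"
end
```

A call such as `categorise("http://www.bbc.co.uk/news/10600000")` falls through to the URL rules, while a page carrying a `section` value in its metadata would take its category from there.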
The links are ranked by a tweet count (including retweets) for the chosen time period. Each entry details the page title, category, media type, short description and when it was first tweeted. The date of publication is indicated where available as it's not just new links that seem to get picked up on Twitter.
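The ranking described above amounts to counting tweets per link inside a time window and sorting by that count. The following Ruby sketch shows one way to do that; the tweet shape (a hash with `:url` and `:time`), the method name and the tie-break on URL are assumptions for illustration only.

```ruby
# Ranks links by how many tweets mentioned them since a given time.
# Assumption: each tweet is a hash with a resolved :url and a :time, and
# retweets appear as separate entries, so they add to the count.
def rank_links(tweets, since:, limit: 10)
  counts = Hash.new(0)
  tweets.each do |tweet|
    next if tweet[:time] < since
    counts[tweet[:url]] += 1
  end
  # Highest count first; ties are broken alphabetically by URL.
  counts.sort_by { |url, count| [-count, url] }.first(limit)
end
```

The result is a list of `[url, count]` pairs, so a shorter or longer chart is only a matter of changing `limit` and `since`.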
We have a different view for BBC employees (shown below), which allows us to see the tweet history of each page, a full list of tweets, the most retweeted messages, hashtags and keywords. We are unable to show this to everyone as the messages would need to be moderated.
We use the Twitter streaming API to access the Gardenhose sample stream, which provides a subset of the full Twitter message stream at a rate of about 100 messages per second, and to track "BBC" as a keyword. These messages are then fed into a pipeline of processes written in Ruby, connected by queues provided by a fast and reliable messaging server.
These are the stages that each incoming tweet goes through:
- Twitter combines a retweet with its original tweet; these are split so that both messages are delivered to the pipeline
- A tweet from the API contains a lot of extraneous data which needs to be removed, such as the user's page background colour
- Links in the message are extracted and resolved, following redirections and expanding shortened links; bit.ly provide an API for this
- Only tweets containing links to BBC pages are kept. Automatically generated tweets from BBC accounts are filtered out, and some links are also removed as they skew the results
- The tweets that remain are saved to the database
- The link category is determined by its domain and in-page metadata
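The first few stages above can be sketched in Ruby as small functions that each do one job. This is only an illustration under stated assumptions: the tweet is treated as a parsed JSON hash in which a retweet carries its original under `retweeted_status`, the list of kept fields is invented for the example, links are found with a simple regular expression, and redirect resolution, account filtering and the database write are left out.

```ruby
require "uri"

# Stage 1: a retweet embeds its original tweet; emit both as separate messages.
def split_retweet(tweet)
  original = tweet["retweeted_status"]
  return [tweet] unless original

  [tweet.reject { |key, _value| key == "retweeted_status" }, original]
end

# Stage 2: keep only the fields later stages need (hypothetical selection).
KEPT_FIELDS = %w[id text created_at].freeze

def strip_tweet(tweet)
  slim = tweet.select { |key, _value| KEPT_FIELDS.include?(key) }
  slim["screen_name"] = tweet.dig("user", "screen_name")
  slim
end

# Stage 3: pull out anything that looks like a link. Shortened links would
# still need resolving, and trailing punctuation is not trimmed here.
def extract_links(text)
  text.scan(%r{https?://\S+})
end

# Stage 4: keep links whose host is bbc.co.uk or one of its subdomains.
def bbc_link?(url)
  host = URI.parse(url).host.to_s.downcase
  host == "bbc.co.uk" || host.end_with?(".bbc.co.uk")
rescue URI::InvalidURIError
  false
end

# Chains the stages: one output message per BBC link found.
def bbc_links_from(raw_tweet)
  split_retweet(raw_tweet).flat_map do |tweet|
    slim = strip_tweet(tweet)
    extract_links(slim["text"].to_s)
      .select { |url| bbc_link?(url) }
      .map { |url| slim.merge("url" => url) }
  end
end
```

In the real system each stage runs as its own process with a queue in between; chaining them in a single method here simply keeps the example short.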
We split these steps into separate processes for two reasons: it's easier to develop and test a process if it does only one thing; and, more importantly, it allows us to balance different parts of the system depending on load. For example, only one process is required to strip data out of tweets, but ten are needed to resolve URLs. By load balancing this way, we can maintain a steady throughput of messages without any stage becoming overloaded.
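To illustrate the balancing idea, the sketch below runs one worker for a cheap stage and ten for a slow one, joined by queues. It is a stand-in only: it uses Ruby threads and the built-in in-memory `Queue` within a single program, whereas Zeitgeist uses separate processes and a messaging server. The method names, the `:done` sentinel and the fake work done in each stage are assumptions.

```ruby
# Starts `count` workers that pop from `input`, apply the block and push the
# result to `output`. Each worker stops when it pops the :done sentinel.
def start_stage(count, input, output, &work)
  Array.new(count) do
    Thread.new do
      while (message = input.pop) != :done
        output << work.call(message)
      end
    end
  end
end

def run_pipeline(messages, resolver_count: 10)
  raw, stripped, resolved = Queue.new, Queue.new, Queue.new

  # One worker is enough for the cheap stage.
  strippers = start_stage(1, raw, stripped) { |message| message.strip }

  # The slow stage gets more workers so it does not become a bottleneck.
  resolvers = start_stage(resolver_count, stripped, resolved) do |message|
    sleep 0.01 # stand-in for a slow network lookup
    message.upcase
  end

  messages.each { |message| raw << message }
  raw << :done
  strippers.each(&:join)

  # One sentinel per resolver so every worker shuts down.
  resolver_count.times { stripped << :done }
  resolvers.each(&:join)

  Array.new(resolved.size) { resolved.pop }
end
```

Because the resolvers run concurrently, results may arrive in a different order from the input; a ranking that only counts tweets per link does not depend on that order.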
To make Zeitgeist, we have had to handle large data sets at high speed. As a rough guide, the Zeitgeist ingest chain handles about 300,000 tweets an hour, of which 900 contain links, and 500 of those link to the BBC. Finally, short lists work well as there's a steep drop-off in tweets lower down the chart and, as you might expect, the majority of links point to BBC News articles.
Zeitgeist is now up and running for a limited period and we trust that you'll find it an interesting resource. We think a system like this could feed into BBC Search as a ranking algorithm, as an additional real-time feed for News recommendations, or as a 'news on the move' mobile service. In any case it shows how audiences can help shape and prioritise content.
Visit the BBC prototype
Comment number 1.
At 14th Jul 2010, lucas42 wrote: Zeitgeist looks interesting, it reminds me a bit of Shownar, but obviously covers much more than just programmes. Does Zeitgeist make any attempt to match up content that is available at multiple URLs (e.g. /programmes and /iplayer)?
Also, you mentioned using the bit.ly API. What benefits does this provide over sending a http HEAD request to the url and seeing where it redirects to?
Comment number 2.
At 15th Jul 2010, tristanf wrote: Hi @lucas42...
1. It doesn't at the moment, it is just bbc.co.uk URL based but we could add special rules for programmes (PIDs) later.
2. I think that bit.ly prefer you to use the API if you're expanding lots of links, I guess it's more efficient for them. We just use redirection for other shortening services.
Comment number 3.
At 15th Jul 2010, Tom Martin wrote: Hi,
Very interesting ingest chain. What is the first process that's connected to the Gardenhose written in: Ruby, Node.js or something else?
Comment number 4.
At 19th Jul 2010, seanohalpin wrote: Hi Tom,
The entire pipeline is written in Ruby. The process that connects to the Twitter API is a custom client built on existing libraries. We've found this is perfectly capable of handling our use case of up to 150 tweets/second.
I'll be publishing a detailed technical blog post later this week. Watch this space!
Regards,
Sean
Comment number 5.
At 25th Jul 2010, marcdraco wrote: I DON'T CARE!!
I do care that you're always playing with MY money to mess around with these fleeting, COMMERCIAL technologies.
HELL'S TEETH, BBC, WRITE YOUR OWN!
Comment number 6.
At 28th Jul 2010, Phoenix85 wrote: Can't believe they stole a very meaningful and powerful word and used it for something so irrelevant. Ugggh (vomits)
A word that has also been associated with an anti corporation movement