
Sports Refresh: Dynamic Semantic Publishing

Jem Rayfield | 11:00 UK time, Tuesday, 17 April 2012

Hi, I'm Jem Rayfield, and I work as Lead Technical Architect for the News and Knowledge Core Engineering department.

This blog post describes the technology strategy the BBC Future Media department is using to evolve from a relational content model and static publishing framework towards a fully dynamic semantic publishing (DSP) architecture. The DSP architectural approach underpins the recently re-launched and refreshed BBC Sports site and indeed the BBC's Olympics 2012 online content.

DSP uses linked data technology to automate the aggregation, publishing and re-purposing of interrelated content objects according to a domain-modelled information architecture, providing a greatly improved user experience and high levels of user engagement.

The DSP architecture curates and publishes HTML and aggregations based on embedded Linked Data identifiers, ontologies and associated inference.

(RDF, the Resource Description Framework, is based upon the idea of making statements about concepts/resources in the form of subject-predicate-object expressions. These expressions are known as triples in RDF terminology. The subject denotes the resource; the predicate denotes traits or aspects of the resource and expresses a relationship between the subject and the object. For example, to represent the notion "Frank Lampard plays for England" as an RDF triple, the subject is "Frank Lampard", the predicate is "plays for" and the object is "England Squad".)
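As a rough illustration of the triple idea, the sketch below models a graph as a plain Python set of (subject, predicate, object) tuples. The string identifiers and the `objects` helper are assumptions made for this example; a real RDF graph would use URIs and a proper RDF library.

```python
# A triple is a (subject, predicate, object) tuple; a graph is a set of them.
# The identifiers are illustrative strings, not real BBC or LOD URIs.
Triple = tuple[str, str, str]

graph: set[Triple] = {
    ("Frank Lampard", "plays for", "England Squad"),
    ("England Squad", "competes in", "Group C"),
}


def objects(graph, subject, predicate):
    """Return every object linked to `subject` by `predicate`."""
    return {o for s, p, o in graph if s == subject and p == predicate}


print(objects(graph, "Frank Lampard", "plays for"))
```

Because every statement has the same three-part shape, new kinds of relationship can be added without changing any schema.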

RDF semantics improve navigation, content re-use, re-purposing, search engine rankings, journalist determined levels of automation ("edited by exception") and will in future support semantic advertisement placement for audiences outside of the UK. The DSP approach facilitates multi-dimensional entry points and a richer navigation.

BBC News, BBC Sport and a large number of other web sites across the BBC are authored and published using an in-house bespoke content management/production system ("CPS") with an associated static publishing delivery chain. Journalists are able to author stories, manage indices and edit audio/video assets in the CPS and then publish them pre-baked as static assets to the BBC's Apache web server farm. In addition, journalists can edit and manage content in the CPS for distribution to the BBC Mobile and Interactive TV services, and IP-connected TV services. The CPS has been constantly evolving since it was developed to publish the BBC News website, which launched in November 1997, and the latest version (v6) underpins the summer 2010 redesign of the BBC News site that won

The first significant move away from the CPS static publishing model by the BBC's Future Media department was through the creation of the BBC World Cup 2010 website.

From first using the site, the most striking changes are the horizontal navigation and the larger format high-quality video. As you navigate through the site it becomes apparent that the rich ontological domain model provides a far deeper way of exposing BBC content than can be achieved through a traditional content management system with its associated relational model and static publishing solution.

As I wrote here at the time, the BBC World Cup 2010 site featured 700-plus team, group and player pages, which are powered by our high-performance DSP architecture.

Previously, BBC Sport would never have considered creating this number of indices in the CPS, as each index would need an editor to keep it up to date with the latest stories, even where automation rules had been set up. To put this scale of task into perspective, the World Cup site had more index pages than the rest of the BBC Sport site in its entirety.

The DSP architectural approach enables the BBC to support greater breadth and scale, which was previously impossible using a static CMS and associated static publishing chain. DSP allows the BBC to support and underpin the scale and ambition of the recently refreshed BBC Sports site and indeed the Olympics 2012 pages.

The entire football section of the refreshed sports site is orchestrated by automated annotation-powered aggregations. The DSP architecture automatically authors a page for every football team and football competition within the UK in addition to a page for every Olympic athlete (10,000+), team (200+), discipline (400-500) and dozens of venue pages.

The number of automated pages managed by the DSP architecture is now well in excess of ten thousand. This number of pages is simply impossible to manage using a static CMS driven publishing stack.

Since the World Cup the DSP architecture has been augmented with a Big Data scale content store (MarkLogic) for managing rapidly changing statistics, navigation and, in the future, all content objects, thus evolving the architecture completely away from its static publishing roots.

DSP enables the publication of automated metadata and content state driven web pages that require minimal journalist management, as they automatically aggregate and render links to relevant stories and assets.

(Metadata is data about data. In this instance, it provides information about the content of a digital asset. For example, a World Cup story might include metadata that describes which football players are mentioned within the text of the story. The metadata may also describe the team, group or organisation associated with the story.)

The published metadata describes the BBC Sport content at a fairly low level of granularity, enabling rich content relationships and semantic navigation. Querying the published metadata enables the creation of dynamic page aggregations such as Football Team pages or Athlete pages. Published sports stats and navigation are mapped to the ontology, which allows dynamic publication of statistics and navigation against automated indices.

The BBC is evolving its publishing architecture towards a model which will allow all content objects and aggregations to be served and rendered on a dynamic request-by-request basis to support rich navigation, state changes such as event or time and, potentially, personalisation, with the information architecture and page layout reacting to the underlying semantics and meta model.

The remainder of this post will describe how the BBC intends to evolve the static publishing CPS and the semantic annotation and dynamic metadata publication used for the BBC Sport site towards its eventual goal of a fully dynamic semantic publishing architecture.

Static publishing and CPS content management

The CPS has been designed and developed in-house, and so its workflow and process model has evolved to its current form (v6) through continuous iteration and feedback from the BBC journalists who use it. They author and publish content for the product development teams to build the BBC News and Sport websites. When looking at the requirements for the recently redesigned and refreshed News site, the FM department considered evaluating proprietary and open-source solutions in the CMS market for shiny new features.

However, the wonderful and interesting thing about the CPS is that most BBC journalists who use it value it very highly. Compared to my experience with many organisations and their content management systems, it does a pretty decent job.

The CPS client is built using Microsoft .NET 3.5 and takes full advantage of Windows Presentation Foundation (WPF). The following screen shots of the CPS user interface illustrate some of its features.

Editing interface with a story, including embedded video, being edited.

Fig 1a: Screen shot of the CPS story-editing window

An editor, editing a list of stories

Fig 1b: BBC CPS, showing the index editor

Figure 1a depicts a screen shot of the CPS story-editing window. The CPS has a number of tools supporting its story editing functions, such as managing site navigation, associating stories to indices, and search.

As you can see there is a component-based structure to the story content - figure 1a shows a video, an introduction and a quote.

These components are pre-defined, allowing a journalist to drag and drop as desired. It is clear that the UI is not a WYSIWYG editor. The current incarnation of the CPS focuses on content structure rather than presentation or content metadata.

Although the editor is not WYSIWYG, CPS content is available for preview and indeed publication to a number of audience-facing outputs and associated devices. On publication, CPS assets are statically rendered for audience-facing output - flavours include RSS, Atom, High-Web XHTML, JSON, Low-Web XHTML and mobile outputs.

Figure showing how journalists publish into the Content Creation Network, which, via a Delivery Chain, reaches the Audience Facing Network, then the CDN, and then a stick figure of an audience member.

Fig 2: BBC News CPS static publishing

The static CPS delivery architecture (depicted in Fig 2 above) provides a highly scalable and high performance static content object-publishing framework.

The CPS UI utilises a data layer API abstraction which proxies the underlying persistence mechanism (an Oracle relational database). The abstracted relational data model captures and persists stories and media assets as well as site structure and associated page layout.

The CPS UI allows the journalist to author stories, media and site structure for preview, eventual publication and re-publication.

A daemon process, the CPS publisher, subscribes to publication events for processing and delivery.

The CPS publisher contextualises content objects so that they are appropriate for the required audience/platform output. Filtered, contextualised assets are rendered by the CPS publisher as a static file per output type. The CPS publisher uses a Model-View-Controller (MVC) architectural pattern to separate the presentation logic from the domain logic.

Each output representation is made persistent onto a storage area network (SAN). The BBC's home-grown content delivery chain subscribes to SAN changes and publishes each output from a secure content creation network onto a set of head servers accessible to the audience.

Although the CPS relational content model and static publishing mechanism scales and performs well, it has a number of functional limitations. CPS-authored content has a fixed association to manually administered indices, and outputs are fixed in time without any consideration of asset semantics, state changes or semantic metadata. Re-using and re-purposing CPS-authored content to react to different scenarios is very difficult due to the static nature of its output representations. Re-purposing content within a semantic context driven by metadata is impossible without manual journalist management and re-publishing. Complex manual data management inevitably costs time and money and causes data administration headaches.

The CPS relational data model currently has a very simple metadata model capturing basic items such as author, publish date and site section. Extending the CPS relational content model to support a rich metadata model becomes complex. When designing a knowledge domain annotation schema using a relational approach, one can start by trying to create a flat controlled vocabulary, which can be associated to content objects. However, this quickly breaks down, as the semantics are very unclear. Evolving this further, a flat controlled vocabulary can be grouped into vocabulary categories; nevertheless, a restrictive, hierarchical taxonomic annotation schema soon emerges. As concepts need to be shared, this gives rise to vocabulary repetition and ambiguity. A taxonomic hierarchy further evolves into a graph, allowing concepts to be shared and re-used to ensure that semantics are unambiguous and knowledge is concise.

Implementing a categorised controlled vocabulary within a relational database introduces complexity; creating a hierarchy introduces further complexity, and implementing a graph within a relational model takes things past the usable limits of that model. If you then add in requirements for reasoning based on metadata semantics, then relational databases, associated SQL and schemas are no longer applicable solutions and are simply redundant in this problem space.

Dynamic Semantic Annotation Driven Publishing

The primary goals of the BBC World Cup 2010 web site were to promote the quality of the original, authored in-house BBC content in context and to increase its visibility and longevity by improving the breadth and depth of navigational functionality.

Increasing user journeys through the range of content while keeping the audience engaged for longer browser session durations meant that a larger, more complex information architecture was required than that traditionally managed by BBC journalists.

Creating a website navigation for 700+ Player, Team, Group and Match pages posed a problem as the traditional CPS manual content administration processes would not scale. An automated solution was required in order that a small number of journalists could author and surface the content with as light a touch as possible; and automatically aggregate content onto the 700+ pages based on the concepts and semantics contained within the body of the story documents.

Screenshot of BBC Sport World Cup England page

Fig 3: Dynamic RDF automated

The information architecture gave rise to a domain model which included concepts and relationships such as time and location; events and competitions; groups, leagues and divisions; stages and rounds; matches; teams, squads and players; players within squads, teams playing in groups, groups within stages, etc.

Clearly, the sport domain soon gives rise to a fairly complex metadata model. When you then include a model that describes the assets that need to be aggregated with a semantic association to the sport domain, it is quickly apparent that using a relational database is not an appropriate solution. The BBC needed to evolve beyond a relational CPS static architecture.

The DSP architecture and its underlying publishing framework do not author content directly; rather, they publish data about the content - metadata. For the World Cup, the published metadata described the content at a fairly low level of granularity, providing rich content relationships and semantic navigation. By querying this published metadata we were able to create automatic dynamic page aggregations for Teams, Groups and Players.

The foundation of these dynamic aggregations was a rich ontological domain model. The ontology described entity existence, groups and relationships between the things/concepts that describe the World Cup. For example, "Frank Lampard" was part of the "England Squad" and the "England Squad" competed in "Group C" of the "FIFA World Cup 2010".

The ontology model also described journalist-authored assets - stories, blogs, profiles, images, video and statistics - and enabled them to be associated to concepts within the domain model. Thus a story with an "England Squad" concept relationship provides the basis for a dynamic query aggregation for the England Squad page: "All stories tagged with England Squad" (Figure 3). The required domain ontology was broken down into three basic areas: asset, tag and domain ontologies (Figure 4). Together these form a triple, allowing a journalist to apply a triple-set to a static asset, such as associating the concept "Frank Lampard" with a story "Goal re-ignites technology row".

The tagging ontology was kept deliberately simple in order to protect the journalist from the complexities of the underlying domain model. A simple set of asset/domain joining predicates, such as "about" and "mentions", drive the annotation tool UI and workflow, keeping the annotation simple and efficient, without losing any of the power of the associated knowledge model.
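The asset/tag/domain join described above can be sketched as follows. This is a minimal illustration, not the BBC's implementation: the `story:` identifiers are hypothetical, only the "about" and "mentions" predicate names and the story title come from the text, and a real system would run a SPARQL query against the triple store instead of filtering a Python list.

```python
# Each tag joins an asset to a domain concept via a tagging predicate.
# Asset identifiers are hypothetical and exist only for this example.
tags = [
    ("story:goal-re-ignites-technology-row", "about", "Frank Lampard"),
    ("story:england-squad-announced", "about", "England Squad"),
    ("story:capello-press-conference", "mentions", "England Squad"),
]


def assets_tagged_with(tags, concept, predicates=("about", "mentions")):
    """Return assets joined to `concept` by any of the tagging predicates."""
    return sorted(a for a, p, c in tags if c == concept and p in predicates)


# "All stories tagged with England Squad"
print(assets_tagged_with(tags, "England Squad"))
```

Keeping the predicate set this small is what lets the annotation UI stay simple while the domain model behind it grows.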

Ontology model diagram. See bbc.co.uk/ontologies for computer readable interactions between domain and assets.

Fig 4: The Asset (left), Tag (middle) and Domain (right) Ontologies used in the World Cup 2010, simplified for brevity

In addition to a manual selective tagging process, Journalist-authored content is automatically analysed against the domain ontology. A natural language determiner process automatically extracts concepts embedded within a textual representation of a story. The concepts are moderated and, again, selectively applied before publication. Moderated, automated concept analysis improves the depth, breadth and quality of metadata publishing.

The following screen shots describe the process of content annotation.

To the left, a story about Gareth Barry with suggested tags below. To the right; a picture of Gareth Barry.

Fig 5a: A journalist, using the Graffiti tool, applies the sport concept "Gareth Barry" to a story about the footballer

To the left: a story about a crash in Milton Keynes. To the right; Milton Keynes is marked in a map of the UK automatically.

Fig 5b: Annotating a story with the location Milton Keynes in the Graffiti tool

The journalist applies suggested annotations as well as searching for triplestore-indexed concepts.

As you can see, all ontology concepts are linked to Linked Open Data (LOD) identifiers (DBpedia, GeoNames etc.). ("Linked open data" describes a method of exposing, sharing and connecting data via dereferenceable URIs.) This allows a journalist to correctly disambiguate concepts such as football players or geographical locations.

Journalist-published metadata is captured and made persistent for querying using the Resource Description Framework (RDF) metadata representation and triple store (BigOWLIM) technology.

Diagram of data flow from Content Creation Network through to Audience Facing Network

Fig 6: Semantic World Cup 2010 publishing, powered by a triplestore

Figure 6 depicts the dynamic semantic architecture built to publish metadata-driven static asset aggregations. A triple store (RDF metadata database) and SPARQL (RDF query language) approach was chosen over traditional relational database technologies due to the requirements for interpretation of metadata with respect to an ontological domain model.

The high-level goal is that the domain ontology allows for intelligent mapping of journalist assets to concepts and queries.

The chosen triple-store provides reasoning following the forward-chaining model, and thus implicitly inferred statements are automatically derived from the explicitly applied journalist metadata concepts.

For example, if a journalist selects and applies the single concept "Frank Lampard", then the framework infers and applies concepts such as "England Squad", "Group C" and "FIFA World Cup 2010" (as generated triples within the triple store). Thus the semantics of the ontologies, the factual data and the content metadata are taken into account during query evaluation. The triple-store was configured so that it performed reasoning with the semantics of all this data in real time, with hundreds of updates per minute while millions of concurrent requests occur against the same database.
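To make the Frank Lampard example concrete, here is a toy inference loop that keeps applying rules until no new triples appear. The two rules and the `partOf` and `about` predicate names are assumptions chosen for illustration; the actual rule set used by the triple store is not described in this post.

```python
def materialise(triples):
    """Apply two toy rules repeatedly until no new triples are generated."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            if p != "partOf":
                continue
            for s2, p2, o2 in inferred:
                # Rule 1: partOf is transitive.
                if p2 == "partOf" and s2 == o:
                    new.add((s, "partOf", o2))
                # Rule 2: an asset about a part is also about the whole.
                if p2 == "about" and o2 == s:
                    new.add((s2, "about", o))
        if new <= inferred:
            return inferred
        inferred |= new


explicit = {
    ("Frank Lampard", "partOf", "England Squad"),
    ("England Squad", "partOf", "Group C"),
    ("Group C", "partOf", "FIFA World Cup 2010"),
    ("story:1", "about", "Frank Lampard"),  # the only tag the journalist applied
}

concepts = {o for s, p, o in materialise(explicit) if s == "story:1" and p == "about"}
print(sorted(concepts))
```

Because the inferred triples are stored alongside the explicit ones, a query for "everything about Group C" needs no knowledge of squads or players.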

This inference capability makes both the journalist tagging and the triplestore powered SPARQL queries simpler and indeed quicker than a traditional SQL approach. Dynamic aggregations based on inferred statements increase the quality and breadth of content across the site. The RDF triple approach also facilitates agile modelling, whereas traditional relational schema modelling is less flexible and also increases query complexity.

The BBC triple store is deployed multi-data centre in a resilient, clustered, performant and horizontally scalable fashion, allowing future expansion for additional domain ontologies and, if required, linked open data sets.

The triple store is abstracted via a REST service built on a Java web services framework and an associated API specification.

The REST API is accessible via HTTPS with an appropriate certificate.

The API is designed as a generic façade onto the triple-store, allowing RDF data to be re-purposed and re-used pan-BBC. This service orchestrates SPARQL queries and ensures that results are dynamically cached across data centres with a low, one-minute 'time-to-live' (TTL) expiry, using memcached.

All RDF metadata transactions sent to the API for CRUD operations are validated against associated ontologies before any persistence operations are invoked. This validation process ensures that RDF conforms to underlying ontologies and ensures data consistency. The API also performs content transformations between the various flavours of RDF, such as N3 or RDF/XML.

Automated XML sports stats feeds from various sources are delivered and processed by the BBC. These feeds are now also transformed into an RDF representation. The transformation process maps feed-supplier IDs onto corresponding ontology concepts, and thus aligns external provider data with the RDF ontology representation within the triple store. Sports stats for Matches, Teams and Players are aggregated inline and served dynamically from the persistent triple store.
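The ID-mapping step might look something like the sketch below. Everything specific here is assumed for illustration: the XML layout, the supplier IDs, the `goalsScored` predicate and the mapping table are invented, since the post does not describe the real feed formats.

```python
import xml.etree.ElementTree as ET

# Hypothetical supplier feed and ID-to-concept table; neither reflects a real format.
FEED = '<stats><player id="p123" goals="3"/><player id="p999" goals="1"/></stats>'
SUPPLIER_TO_CONCEPT = {"p123": "Frank Lampard"}


def feed_to_triples(xml_text, mapping):
    """Turn player stats into triples, skipping IDs with no known concept."""
    triples = []
    for player in ET.fromstring(xml_text).iter("player"):
        concept = mapping.get(player.get("id"))
        if concept is not None:
            triples.append((concept, "goalsScored", int(player.get("goals"))))
    return triples


print(feed_to_triples(FEED, SUPPLIER_TO_CONCEPT))
```

Once the supplier ID has been swapped for an ontology concept, the statistic can be queried in exactly the same way as journalist-applied metadata.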

Page Rendering

The dynamic aggregation and publishing page-rendering layer is built using a PHP and memcached stack.

The PHP layer requests an RDF representation of a particular concept or concepts from the REST service layer based on the audience's URL request. So if an "England Squad" page request is received by the PHP code, several RDF queries will be invoked over HTTPS to the REST service layer below.

The render layer will then dynamically aggregate several asset types (stories, blogs, feeds, images, profiles and statistics) for a particular concept such as "England Squad". The resultant view and RDF is cached with a low TTL (one minute) at the render layer for subsequent requests from the audience. The PHP layer dynamically renders views based on HTTP headers providing content negotiated HTML and/or RDF for each and every page.
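Header-driven content negotiation of this kind can be sketched in a few lines. The example is written in Python for consistency with the other sketches here, although the render layer itself is PHP; the `negotiate` function is hypothetical and deliberately ignores `q` quality values, which a production implementation would need to honour.

```python
def negotiate(accept_header):
    """Pick a representation from an Accept header, defaulting to HTML."""
    accepted = [part.split(";")[0].strip() for part in accept_header.split(",")]
    for media_type in accepted:
        if media_type == "application/rdf+xml":
            return "rdf"
        if media_type in ("text/html", "*/*"):
            return "html"
    return "html"


print(negotiate("application/rdf+xml"))
print(negotiate("text/html,application/xhtml+xml"))
```

The same URL can therefore serve a browser and a linked-data client, with only the request headers deciding which view is rendered.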

The World Cup made use of existing infrastructure, utilising the significant number of existing static news kit (Apache servers, HTTP load balancers and gateway architecture). All HTTP responses are annotated with appropriately low (one minute) cache expiry headers. This HTTP caching increases the scalability of the platform and also allows caching if demand requires.

The DSP architecture served millions of page requests a day throughout the World Cup with continually changing semantic RDF data. It served an average of a million SPARQL queries per day for the duration of the tournament, with a peak RDF transaction rate of hundreds of player statistics per minute. Cache expiry at all layers within the framework is one minute, enabling a dynamic, rapidly changing domain and statistic-driven user experience.

Sport Refresh and Olympics Dynamic Publishing

The refreshed BBC Sports site is currently served to the audience using a combination of the two architectural approaches previously described: static publishing and DSP. The parts of the Sports site which are published using DSP or static publication are visible to the audience - the flavours of URL show which system publishes the page.

The refreshed BBC Sports site mashes static and dynamic published assets onto statically published pages via a server side include mechanism. This enables the BBC to migrate a proportion of its content onto the DSP architecture in a gradual, phased manner. The end goal is that the static publication chain can be retired.

Assets which are published via the static publication chain are exposed to the audience via URLs which are prefixed with https://www.bbc.co.uk/sport/0/. For example:

Screenshot of new sport home page

Fig 7: The statically published BBC Sport home page (including dynamic navigation and dynamic sport statistics)

The CPS-powered static publishing mechanism is currently used to curate, author, manage and publish BBC sports stories and editorially curated indices such as the main sports index and football index.

These assets are hand crafted, content managed, orchestrated and published by journalists.

When these Sports site pages are statically published they include and combine references to dynamic content. These references, known as server side includes (SSI), are resolved at render time at the Apache web server farm. (SSIs are part of a simple interpreted scripting language which allows content from one or more sources to be combined into a static web page.)

The mainly static pages then combine dynamic content such as statistics and navigation into a single page output for consumption by the audience. A static story combined with dynamic navigation and dynamic statistics would be a good example of this mixed publication chain approach. The cacheable proxied SSI mechanism mashes together the content from the static platform and dynamic platform, allowing a phased migration towards a fully dynamic BBC sports site.
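The include mechanism can be approximated in a few lines. The directive syntax below is the standard Apache form, but the paths, the fragment contents and the `resolve_includes` helper are assumptions for illustration; in production the web server resolves the directive itself and fetches the fragment from the dynamic platform.

```python
import re

# Matches the Apache-style directive: <!--#include virtual="/some/path" -->
INCLUDE = re.compile(r'<!--#include virtual="([^"]+)" -->')


def resolve_includes(page, fetch):
    """Replace each include directive with the fragment returned by `fetch`."""
    return INCLUDE.sub(lambda match: fetch(match.group(1)), page)


# Hypothetical static page and dynamic fragment.
static_page = '<h1>Story</h1><!--#include virtual="/dynamic/nav" -->'
fragments = {"/dynamic/nav": "<nav>Football</nav>"}

print(resolve_includes(static_page, fragments.__getitem__))
```

The static page never changes on disk; only the fragment behind the include path is regenerated, which is what makes the phased migration possible.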

Automated annotation-driven aggregation pages such as Football Team, Olympic Athlete, Olympics Discipline and Olympics Venue are powered using the DSP approach. These pages are fully automated, requiring no journalist content management overhead. They do not contain any static content; they are fully dynamic and contain only references to static content objects such as stories or videos.

Journalists annotate BBC content objects such as a sports story or a video with concepts such as an athlete or a football team. Content objects are then automatically aggregated onto pages published using the newer DSP stack. For example:

  • Chelsea Football Club: All the content objects associated to the concept "Chelsea"

/sport/football/teams/chelsea

  • Tom Daley: All the content objects associated to the concept "Tom Daley"

/sport/olympics/2012/athletes/02025fcb-457d-4a77-8424-f5b8fe49b87f

  • Team GB: All the content objects associated to the concept "Team GB"

/sport/olympics/2012/countries/great-britain

Screenshot of Chelsea FC aggregation page

Fig 8: The Chelsea FC team dynamic BBC Sport page including automated metadata aggregations, dynamic sports stats and dynamic Sport navigation.

The navigation and sports statistics contained on this page are rendered on a request-by-request basis from the underlying XML content store (MarkLogic).

The story, video, comment and analysis assets contained on this page are rendered on a request-by-request basis from the underlying RDF store (BigOWLIM).

The Sport ontology and meta model which power these automated annotation-powered aggregations have now been published and can be re-used under a Creative Commons licence.

The sport Ontology diagram of Chris Hoy

Fig 9: The BBC Sport ontology as applied to Olympics 2012 Track Cycling

As you can see the model defines a simple yet generic sport ontology, which is capable of modelling sports from Football to the Men's Cycle Sprint within the Olympics 2012.

All the DSP-powered pages on the sport site use this ontology model as their foundation. A simple asset model describing assets such as stories and videos linked to the Sport domain representation allows very rich dynamic content object aggregation.

The DSP's natural language processing and concept suggestion tool, which powers the Graffiti annotation tool, is now ontology aware. When additional concepts are added into the triple store (for example a new athlete) these concepts are immediately suggested to the journalist as concepts for annotation. This feedback loop ensures that changes in the ontology instance data are reflected in all components of the DSP architecture.

Flow diagram of how information goes from Graffiti to the data layer via BBC APIs

Fig 10: Ontology aware natural language processing and annotation suggestion

The refreshed BBC Sport site's horizontal navigation is powered by a content model which links ontology concepts to navigation entries.

Because navigation entries are linked to metadata concepts, following a navigation link leads to a page that automatically aggregates the content for that concept.

The underlying navigation data and associated content model are stored within a new addition to the DSP architecture: a highly scaled, high-performance, fault-tolerant Big Data store, namely MarkLogic.

Sports statistics provided by third party suppliers are also now stored as XML content within this queryable content store. The BBC sports site queries these XML fragments, adds value and re-formats the statistics in a form consumable on the sports site.

The content store which currently powers all of the statistics and navigation on the sports site has been scaled to handle ingesting many thousands of content objects per second whilst concurrently supporting many millions of dynamic page renditions and impressions a day. This high-performance content store will allow the BBC Sports site to ingest and render sport statistics including live football scores, live football tables, live Olympics event statistics and results in near real-time whilst rendering this content dynamically using the DSP approach.

The refreshed sport site makes use of this new addition to DSP architecture for pages and content such as Live Scores: Football and Live Premier League Tables, Results, and Fixtures.

The DSP's triple store will be used in a purer sense and will now only be concerned with domain and asset metadata - it will not persist or manage content object data.

This clear separation of concerns makes the DSP persistence mechanism scalable.

Metadata is stored within a persistent RDF store suitable for modelling rich graphs. Content objects are stored within a document store suitable for live ingest and rendering.

A clean domain model, which only contains references to unique content objects, allows the content model to evolve and also allows the content to be stored in a de-coupled fashion. As long as the content has a unique, addressable identifier, the asset->tag->domain RDF model allows the triple store to model extendable real world concepts and lets the content store model raw referenced assets.

The Sport RDF currently maps third party statistic identifiers from the sport ontology concepts into sport content objects. This allows querying across the triple-store and content store for sports statistics related to a sport concept e.g. "The league table for the English Premiership".

League Table for the Premier League

Fig 11: Dynamic Content Store powered sports statistics

Content objects and sports statistics can then be cut up and arranged on a personalised, metadata driven, request-by-request basis.

The Olympics 2012 sports statistics are to be ingested and delivered to the audience using the same content store and dynamic render architecture. Statistics will be supplied from every venue for every event within the Olympics. These statistics will be ingested in near-real time for inclusion on metadata-driven pages and video feeds. This gives the BBC's online Olympics output a very real sense of live.

The triple-store and content store are abstracted and orchestrated by a REST API. The API will continue to support SPARQL and RDF validation, and it will now also support XQuery and XML persistence across both the triple-store and the content store.

This allows a content aggregation to be generated using a combination of SPARQL for domain querying and XQuery for asset selection. All content objects and metadata are made persistent in a transactional manner across both data sources.

The content API "TRiPOD" (Figure 12) makes use of a multi-data centre memcached cluster to store content aggregations and protect the triplestore and content-store from query storms. The API cache is split into a live cache with a typically low cache profile circa one-minute TTL and a second, longer stale cache with an expiry TTL of 72 hours.

Memcache is also used to control SPARQL/XQuery invocation using a memcache-based locking strategy.

If the live cache has expired, a lock is created and a single query invocation thread per data centre is started. Subsequent requests are served from the stale cache until the query responds, refreshing both the stale and live caches. This caching and locking strategy enables the DSP platform to scale to many millions of page requests and associated backend queries a day.
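The live/stale/lock behaviour described above can be sketched roughly as follows. This is an assumption-laden illustration rather than the TRiPOD implementation: `FakeMemcache` is a tiny single-process stand-in for memcached, the key prefixes and the `fetch` function are hypothetical names, and `LOCK_TTL` is an assumed safety value that the post does not specify. Only the one-minute and 72-hour TTLs come from the text.

```python
import time

LIVE_TTL = 60              # circa one minute, as described in the post
STALE_TTL = 72 * 60 * 60   # 72 hours, as described in the post
LOCK_TTL = 30              # assumption: the lock self-expires if a query hangs


class FakeMemcache:
    """Tiny in-process stand-in for memcached (get/set/add/delete with TTLs)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires = entry
        if self._clock() >= expires:
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def add(self, key, value, ttl):
        """Store only if the key is absent, which makes it usable as a lock."""
        if self.get(key) is not None:
            return False
        self.set(key, value, ttl)
        return True

    def delete(self, key):
        self._data.pop(key, None)


def fetch(cache, key, run_query):
    """Serve from live cache, else refresh under a lock, else serve stale."""
    live = cache.get("live:" + key)
    if live is not None:
        return live
    if cache.add("lock:" + key, 1, LOCK_TTL):
        try:
            # Only the lock holder runs the expensive backend query.
            result = run_query()
            cache.set("live:" + key, result, LIVE_TTL)
            cache.set("stale:" + key, result, STALE_TTL)
            return result
        finally:
            cache.delete("lock:" + key)
    # Someone else is refreshing: serve the stale copy (None on a cold start).
    return cache.get("stale:" + key)
```

The design choice worth noting is that an expired live entry never causes more than one backend query per cache cluster: every other request falls through to the stale copy, trading a little freshness for protection of the stores.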

The data layer (BigOWLIM and ContentStore) supports Jacuzzi (Tripod v2), via a content transaction manager. Content is input via CPS, Graffiti, MQ, a REST API and a validation service (using the BigOWLIM content repository) into the same content transaction manager. Content feeds back up via Memcache and Restful API to the Page Abstraction Layer which shows it in the Olympics 2012 and other Sports web pages.

Fig 12: DSP architecture combining SPARQL/XQuery, RDF store, and XML Store

The Future: Fully Dynamic Publishing

Although the BBC Sport architecture enables static asset content aggregation and re-purposing based on dynamic triple-store RDF metadata, it does not currently support dynamic rendering of editorially authored assets.

Assets such as stories are currently published statically, which makes them fixed and immutable.

The refreshed BBC Sports site will eventually require content objects to be cut up, arranged and rendered with respect to state changes and persona.

The ability to render all content object fragments by state, and indeed by metadata concept, will enable the BBC Sport website to offer personalised, event-driven pages with greater flexibility than is currently possible. A re-usable content API which contextualises content objects for device and platform will enable the BBC to create new outputs and open the BBC archive to the public.

The DSP architecture (Figure 6) will now undergo a final evolution - deprecating static, fixed asset publication in favour of dynamic content object renditions.

Content objects will be dynamically rendered on a request-by-request basis rather than 'fixed-in-time' static publication.

Textual content objects such as stories and editorially authored indexes such as the football home page will be made persistent within the schema independent content store.

The content store supports fine-grained XQuery, enabling search, versioning, and access control.

All editorially authored content objects such as stories and manually managed indices will also be stored within the content store.

The content store is horizontally scalable and allows content to be handled in discrete chunks, supporting the cutting up and repurposing of fine-grained content. Each content object within the content store will be modelled as a discrete document with no interrelationships.

Discrete content objects are to be modelled and referenced via the asset ontology RDF within the triple-store.

Triple-store SPARQL is used to locate, query and search for documents by concept providing all the aggregation and inference functionality required.

The content store is used for fast, scalable queryable and searchable access to the raw content object data while the triple-store continues to provide access to asset references and associated domain models.
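As a rough illustration of locating documents by concept, the sketch below follows relationships between concepts so that a query for a team also surfaces assets tagged with its players. It is a deliberately simplified stand-in: a real triple-store would arrive at such results through ontology-driven inference and SPARQL, whereas this uses a plain traversal over in-memory triples. The predicates, identifiers and function names are all hypothetical; only the "Frank Lampard plays for England Squad" statement echoes the example given earlier in the post.

```python
# Illustrative sketch only. A simple traversal over in-memory triples
# stands in for store-side inference. All identifiers are hypothetical.

TRIPLES = [
    ("person:frank-lampard", "playsFor", "team:england-squad"),
    ("team:england-squad", "competesIn", "competition:example-cup"),
    ("asset:story-1", "about", "person:frank-lampard"),
    ("asset:story-2", "about", "team:england-squad"),
    ("asset:story-3", "about", "competition:other-cup"),
]

# Predicates along which relevance is assumed to propagate.
LINKS = {"playsFor", "competesIn"}


def related_concepts(concept):
    """Return the concept plus every concept linked to it, transitively."""
    found = {concept}
    frontier = [concept]
    while frontier:
        current = frontier.pop()
        for s, p, o in TRIPLES:
            if p in LINKS and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def assets_for(concept):
    """Asset references tagged with the concept or anything linked to it."""
    concepts = related_concepts(concept)
    return sorted(s for s, p, o in TRIPLES if p == "about" and o in concepts)
```

Here `assets_for("team:england-squad")` picks up the story tagged only with the player, even though nobody tagged it with the team. The returned values are asset references; the raw documents would then be fetched from the content store by identifier.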

The Graffiti annotation tool UI currently only makes it possible for a journalist to annotate static content objects post-publication; it does not integrate with the CPS UI.

Using the Graffiti API within the CPS UI will soon unify and rationalise the journalist's toolset. Merging the Graffiti UI into the CPS UI will provide a single UI for the journalist, supporting the creation and annotation of documents within a single view.

Real-time concept extraction and suggestion will occur as the journalist authors and then publishes content.

The DSP platform's caching approach is fundamental to enabling a scalable and performant platform. The API memcache strategy is augmented with HTTP caching between the PHP render layer and the API. The PHP layer also makes use of memcache for page module caching; all page fragments are cached at an ESI page assembly layer with corresponding HTTP caching. The site as a whole is also for further scalability and resilience during very large traffic spikes.

Conclusion

A technical architecture that combines a document/content store with a triple-store proves an excellent data and metadata persistence layer for the BBC Sport site and indeed future builds including BBC News mobile.

  • A triple-store provides a concise, accurate and clean implementation methodology for describing domain knowledge models.
  • An RDF graph approach provides ultimate modelling expressivity, with the added advantage of deductive reasoning.
  • SPARQL simplifies domain queries, with the associated underlying RDF schema being more flexible than a corresponding SQL/RDBMS approach.
  • A document/content store provides schema flexibility; schema independent storage; versioning, and search and query facilities across atomic content objects.
  • Combining a model expressed as RDF referencing content objects in a scalable document/content-store provides a persistence layer that uses the best of both technical approaches.

This combination removes the shackles associated with traditional RDBMS approaches.

Using each data store for what it is best at creates a framework that scales and is ultimately flexible.

Replacing a static publishing mechanism with a dynamic request-by-request solution that uses a scalable metadata/data layer will remove the barriers to creativity for BBC journalists, designers and product managers, allowing them to make the very best use of the BBC's content.

Simplifying the authoring approach via metadata annotation opens this content up and increases the reach and value of the BBC's online content.

Finally, combining the triple-store approach with dynamic atomic documents as an architectural foundation simplifies the publication of pan-BBC content as "open linked data" between BBC systems and across the wider linked open data cloud.

Jem Rayfield is a lead architect in BBC Future Media, specifically focusing on News, Sport & Knowledge products.

Comments

  • Comment number 1.

    I'm sure this is all wonderful stuff, but here's the thing: in yesterday's match between Arsenal and Wigan, the three places on the Live Football page showing the score all showed different scores throughout the game.

    Very ontological I thought.

    Russ

  • Comment number 2.

    Absolutely fascinating post. Excellent job.

  • Comment number 3.

    All technical areas of work create their own jargon which is fine for those that work in those areas, but makes them opaque for those outside them. I always told my staff to avoid such jargon in communications to any outside audience which, in their case, was the whole of the NHS. Our technical area was information technology. The above article completely fails the "opaqueness" test for me and that in an area that I thought I knew something about!

    Can somebody please send Jem on a plain English course?

    Ian

  • Comment number 4.

    Are there any plans to publish the Olympics ontology as the BBC has published its Sport ontology? It would be useful to have access to the RDF of sports, disciplines, events, etc. as Linked Open Data.

  • Comment number 5.

    Wow, this is not a blog post, but a lecture. Very interesting.
    I have a first question. You mention early in your post:

    From first using the site, the most striking changes are the horizontal navigation and the larger format high-quality video.
    Can you explain to me what horizontal navigation has to do with the dynamic publishing. Why can that not be done using vertical navigation?
  • Comment number 6.

    @glossmighty (#3):

    Thank you for your comment.

    I subedit blog posts before they go out and make sure they're apt for the blog, so I'll answer your point.

    I do work with bloggers to make their posts as accessible as possible. For example, terms are defined and linked to explanations. But some blog posts will always be more specialist than others.

    For example, Patrick Sinclair's blog post about the Radio 1 home page was aimed at a general audience that's interested in how the BBC uses technology. Jeremy Tarling's blog post about the technical architecture of the new BBC Weather website was aimed at fellow web professionals.

    Jem's blog post is more specialist; it's certainly never going to be for a mass audience.

    I try to make sure that the start of the blog post is a pretty good guide to how technical it is.

    Nick and I do think about how to make posts as readable as possible, and it is good to get feedback on this.

    Looking at other reactions, Jem's fellow specialists seem to find his post extremely useful and interesting.

  • Comment number 7.

    @Russ - an interesting observation. I guess this is not ontology related at all, but is rather a cache issue. As I understand all the statistics are stored as XML in MarkLogic Content Store and a page is a mix of RDF-ized data and XML data - so far so good. I'm puzzled with this massive caching - maybe those 3 places on the Live Football page were getting the match result data with 3 different queries, so when the match result changed not all 3 places got updated at the same time. I agree it is confusing, but it's not that bad, because the TTL is only one minute (still should be fixed though).

    @Dan - I guess they're using the Sports ontology for the Olympics as it is generic enough.

    @Jem - I'm looking forward to your comment on my comment ;)

  • Comment number 8.

    @Nafets and @Russ the caching is a bigger problem on the new sports pages. Since the new BBC Sports system, I often get the football, cricket, or rugby league subhomepage from a week ago (on Firefox with Windows 7) and have to manually refresh.

  • Comment number 9.

    Very interesting story.

  • Comment number 10.

    @Nafets
    Firstly, sorry for the delay in responding. I have been at Scala Days here in London, which was awesome.

    Anyhow...
    You are correct: cache TTLs will affect the statistics on the sport pages.

    "maybe those 3 places on the Live Football page were getting the match result data with 3 different queries, so when the match result changed not all 3 places got updated at the same time"

    Different request times and TTL alignments can give rise to out-of-sync stats. We currently rely on a fairly low set of TTLs to try to minimise this issue. We also have different cache profiles for different stats. However, it is difficult to drop TTLs much lower than 1 minute, as this has a direct correlation with the number of requests which traverse the entire stack. Given the large number of requests we receive, we need to be very careful how we protect our back-end.

    The problem becomes more tricky when you consider that we have 2 data centers which act in a stateless fashion with 2 isolated memcached cache clusters. For example, if your browser's DNS resolution cache TTL expires, you may be load-balanced to another data center, where the cache TTLs may be slightly different to those in the last data center. Thus, again, you may see inconsistencies.

    So... given the current pull model, stateless isolated data center model and minimum TTL restriction, you may from time to time see inconsistencies that will eventually resolve. Not ideal; however, we are working on tuning the cache profiles, so hopefully this will improve.

    Cache eviction is also non-trivial, as we have many layers of caching, from memcached to HTTP caches and indeed ISP and browser caches. The programmatic model for forcing cache evictions becomes complex. At this stage we believe this isn't the best model for consistency.

    So... we are currently investigating a replacement stats delivery method, moving from the current browser pull method towards WebSockets and push delivery... Perhaps the subject of another blog post when/if we get to a position to move forward with this option.

    Also on this

    "I guess they're using the Sports ontology for the Olympics as it is generic enough."

    The current sports ontology is generic enough with a few v.minor modifications which we will be publishing at some point soon.

    @JamesRogers
    "Can you explain to me what horizontal navigation has to do with the dynamic publishing. Why can that not be done using vertical navigation?"

    Horizontal navigation has nothing to do with dynamic publishing; this was purely a product decision which enables more real estate and simpler inline navigation: navigation inline following the domain ontology, rather than complex taxonomical left-hand navigation out of context.


    @Dan

    "Are there any plans to publish the Olympics ontology as the BBC has published its Sport ontology?"

    The sport ontology will be extended with some minor Olympics changes and re-published.
    We also have plans to publish open RDF. However, this may not be available in time for the Olympics. We plan on opening our underlying service API to the public. This API is able to produce content-negotiated RDF. However, we have a number of scaling and throttling issues with opening these APIs at the moment which we need to work through. In addition, an open SPARQL endpoint is high on my wish list. All things we have on our backlog... but not yet with a firm delivery date.

    @glossymighty

    Apologies for the language, I am partial to the odd acronym. I tried to include explanations within the blog to clear things up. Hopefully as Ian mentioned the text is usable by others. I will attempt to make things clearer in following posts...

    Cheers and thanks for your comments!
    Jem

  • Comment number 11.

    You say "The PHP layer dynamically renders views based on HTTP headers providing content negotiated HTML and/or RDF for each and every page." but I don't seem to be able to access any RDF. I've tried using an accept header and also adding an extension to the end of the URL (which /music and /programmes accept).
    Perhaps I've misunderstood - you've said that the underlying service API isn't public yet, but I assume these issues don't affect the RDF view rendered by the PHP layer.

    Also, how do things like "Featured Athletes" and "Featured Countries" fit into this? I assume they're editorially chosen - does this mean they rely on the static publication chain? Or if they're in the dynamic publication side, how do journalists (or whoever chooses) input them; is it through Graffiti?

    Finally, you've explained the difference between /sport and /sport/0 (though I don't understand why this difference should be visible in the url). I've noticed lots of pages still use news.bbc.co.uk/sport1/. Are these just legacy pages which don't use this new system yet, or is there a third publication system which you haven't mentioned here?

  • Comment number 12.

    Jem, this was a fascinating read. Thanks!

  • Comment number 13.

    Since this, and previous of Jem's posts, are enthusiastically read across the World by people in the media sector, @glossymighty, perhaps for the sake of clarity you should explain what 'NHS' means.

  • Comment number 14.

    Any idea why I'm getting March RSS feeds from the BBC Sport Everton page, today, in April?

    The rss feed seems to be broken for the last couple of months.

    feed://newsrss.bbc.co.uk/rss/sportonline_uk_edition/football/teams/e/everton/rss.xml

  • Comment number 15.

    RSS link on the Everton page is /sport/football/teams/everton/rss.xml

  • Comment number 16.

    Jem terrific post and look forward to your updated preso 'Dynamic Semantic Publishing Empowering the BBC Sports Site and the 2012 Olympics' at SemtechBiz SF June 3 - 7

  • Comment number 17.

    @lucas42 The content-negotiated RDF views you mention are part of the road map, but not delivered yet I'm afraid.

    Re: Featured Countries/Athletes: these are chosen editorially, but modeled in the Olympics ontology (oly:oneToWatch) against a sport:SportsDiscipline (such as "Archery") or a sport:MedalCompetition (such as "Men's Synchronised 10m Platform") or a sport:CompetitiveSportingOrganisation (such as "Great Britain & N. Ireland"). There are specific RDF feeds for Ones to Watch which we hope to publish soon.

    re your question about the other url patterns: yes, legacy urls.

  • Comment number 18.

    I am sure this will increase the user experience even more. It never hurts to make something good better!

  • Comment number 19.

    Sorry to say that I've left it months to see if it ever got better, but I'm just a bog standard user who's now not a visitor to the Sport website as it's still awful, colours & content in no way match the previously easily negotiable & user friendly site.
    Money apart, why oh why do the BBC's home pages all have to look the same - yes, uploading becomes simple but viewing has become truly awful?
