Chempound - a Web 2.0-inspired repository for physical science data

Abstract

Chempound is a new generation repository architecture based on RDF, semantic dictionaries and linked data. It has been developed to hold any type of chemical object expressible in CML and is exemplified by crystallographic experiments and computational chemistry calculations. In both examples, the repository can hold more than 50,000 entries, which can be searched through SPARQL endpoints and pre-indexed key fields. The Chempound architecture is general and adaptable to other fields of data-rich science.

Introduction

The need to capture and manage scientific data is now seen as critical by many commentators and funders of scientific research (see Borgman 2011 for a recent review), yet the majority of scientific data is never published. While some areas of the biosciences – e.g. protein structure characterisation (PDB 2011), genome sequencing (Benson et al. 2011) – have a strong culture of open data publication and sharing, many disciplines do not. There are various reasons for this. Data publication takes time and effort that could otherwise be spent carrying out research, with little obvious reward. Many scientists are concerned about what their competitors may be able to deduce about their work from access to their data. And ultimately, there is a widespread feeling that data publication is difficult, and that ‘my data isn't really that useful to other people anyway’. Unfortunately, much of the data that does get published in the chemical sciences is not made available in a usable form. Rich, machine-understandable data ends up in PDF files or binary image formats, or is described only in text. These processes often render it virtually unreadable, even by humans, and almost entirely unindexed and undiscoverable.

Figure 1. Rich chemical data is commonly transformed into virtually unreadable blocks of text, with significant information loss, when it is published at all. Extract from Chen et al. (2011).

In this paper, and the development of Chempound, we have focused on the publication of chemical data, but the issues and outcomes are transferable to other areas of the physical and biological sciences, and beyond.

Domain Intelligence

Attempts have been made to store chemical data in both ePrints (Southampton eCrystals repository) and DSpace (WWMM collection in Cambridge DSpace; SPECTRa Test Repository at Imperial College, London) repositories. These efforts have not found widespread acceptance amongst chemists, since existing repository software is primarily focused on the handling of textual content. While bitstreams for other media types such as images, audio, video and scientific data can be stored, they are only indexed, and hence discoverable, on the basis of a limited range of textual metadata, such as titles, descriptions and associated keywords. This can make it difficult to discover non-textual data in a repository, and contributes to the disengagement of scientists who are used to the much richer user interfaces available in domain-aware information systems.

Sticking with the example of chemistry, most chemists will be familiar with a number of Chemical Information Systems (tools such as Scifinder Scholar or DiscoveryGate are introduced during undergraduate courses). From their experience of such systems, chemists expect to be able to search databases by drawing a full or partial structure, or on the basis of domain-specific properties such as chemical formula or molecular weight – data which cannot be indexed as ‘text’. These information systems are, however, primarily compound registries whose main purpose is to provide a searchable index of molecular structures, rather than a repository for data. There are a few exceptions, such as the Cambridge Structural Database (CSD), which serves as a repository of small molecule crystal structures, and the Spectral Database for Organic Compounds. However, these systems are highly specific to one particular sub-domain, with strict metadata schemas supporting a very limited subset of chemical data, and often require proprietary protocols and software to access. Furthermore, the data they contain may only be available under strict license terms, limiting the uses to which it can be put.

Valid names:
1,3,7-trimethylxanthine
trimethylxanthine
methyltheobromine
7-methyltheophylline
mateine
guaranine
...

Systematic name:
1,3,7-trimethyl-1H-purine-2,6(3H,7H)-dione

You know it as: caffeine

Figure 2. Chemical structures cannot be indexed by a text-based search engine. They can have a huge variety of alternative names, and users may expect to be able to search for particular features of the structure, which are not necessarily encoded in the name. The only way these issues can be addressed is through supporting domain specific indexing techniques.

Linked Data

Over the last few years ‘Linked Data’ (Berners-Lee 2006; Heath and Bizer 2011) has emerged as a standard approach to publishing structured data on the web. In order to generate linked data, URIs are used to identify resources. Ideally these URIs should resolve to provide information about the resource they identify. Statements can be made about a resource using RDF. These statements (termed ‘triples’) essentially consist of the resource to which they relate, and a name-value pair, where the value may be either a primitive (string, number, date etc.) or a more complex data structure. Statements can also be used to record relationships between resources through the use of common identifiers.
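As a rough illustration of the triple model described above, the following Python sketch stores statements as (subject, predicate, object) tuples and looks up what has been said about a resource. The URIs and property names are invented for the example and are not taken from Chempound.

```python
# Each statement is a (subject, predicate, object) triple.
# All URIs below are hypothetical and used only for illustration.
EX = "http://example.org/"

triples = [
    (EX + "structure/1", EX + "vocab/title", "Caffeine crystal structure"),
    (EX + "structure/1", EX + "vocab/cellVolume", 845.2),
    # The value of a statement may itself be a resource, which links the two.
    (EX + "structure/1", EX + "vocab/describes", EX + "molecule/caffeine"),
    (EX + "calculation/7", EX + "vocab/describes", EX + "molecule/caffeine"),
]


def statements_about(subject, graph):
    """Return the name-value pairs recorded for one resource."""
    return {predicate: obj for s, predicate, obj in graph if s == subject}


def resources_linked_to(target, graph):
    """Return every subject that has a statement whose value is `target`."""
    return sorted({s for s, _, obj in graph if obj == target})


print(statements_about(EX + "structure/1", triples))
print(resources_linked_to(EX + "molecule/caffeine", triples))
```

The shared identifier for the molecule is what connects the crystal structure and the calculation; no schema had to be agreed in advance for that link to exist.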

One of the major benefits of adopting an RDF/Linked Data approach to data publication is that it does not require the agreement of strict schemas and controlled vocabularies. The development of such schemas and vocabularies is a difficult process – different individuals and organisations will have completely different ways of thinking about a particular thing. Instead RDF gives publishers the opportunity to mix-and-match existing vocabularies, sharing and reusing parts of them as appropriate, and to introduce new vocabularies where needed. RDF also makes it possible to align vocabularies at a later stage through making assertions about relationships between properties from different vocabularies. This flexibility enables information about arbitrary ‘things’, with different properties and vocabularies, to be collected together in a single graph.
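Aligning vocabularies after the fact can be sketched in a few lines of Python. The example assumes two made-up vocabularies that name a formula property differently, and uses the standard owl:equivalentProperty URI to relate them; the lookup logic is a simplification and not a description of how any particular triple store performs inference.

```python
OWL_EQUIVALENT_PROPERTY = "http://www.w3.org/2002/07/owl#equivalentProperty"

# Two hypothetical vocabularies that name the same concept differently.
A_FORMULA = "http://example.org/vocabA/formula"
B_FORMULA = "http://example.org/vocabB/sumFormula"

graph = [
    ("http://example.org/item/1", A_FORMULA, "C8 H10 N4 O2"),
    ("http://example.org/item/2", B_FORMULA, "C8 H10 N4 O2"),
    # The alignment is just another statement, added after the data.
    (A_FORMULA, OWL_EQUIVALENT_PROPERTY, B_FORMULA),
]


def equivalent_properties(prop, graph):
    """Collect `prop` plus every property asserted equivalent to it."""
    found = {prop}
    changed = True
    while changed:
        changed = False
        for s, p, o in graph:
            if p != OWL_EQUIVALENT_PROPERTY:
                continue
            # Equivalence is symmetric, so follow it in both directions.
            for a, b in ((s, o), (o, s)):
                if a in found and b not in found:
                    found.add(b)
                    changed = True
    return found


def subjects_with(prop, value, graph):
    """Find subjects having `value` for `prop` or any equivalent property."""
    props = equivalent_properties(prop, graph)
    return sorted(s for s, p, o in graph if p in props and o == value)


print(subjects_with(A_FORMULA, "C8 H10 N4 O2", graph))
```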

Figure 3. Collections of statements about resources form graphs of linked data. In this example two crystal structures, a calculation, a synthesis, and data about each of them are linked together through the use of shared identifiers.

Overview of Chempound

As part of two JISC-funded projects (CLaRION and JISC-XYZ), we have prototyped a novel repository system (Chempound) which can support the capture, storage and search of chemical and crystallographic data, with domain intelligence. Chempound has been developed with two major goals in mind: data should be presented to users via a rich, domain aware interface, and should be presented in a manner understandable by machines as well as humans.

Chempound approaches these challenges through a combination of embracing the technologies developed for supporting Linked Data, and presenting a highly modular, extensible architecture.

At its core Chempound contains two components: a triple store containing RDF statements describing the structure of the data held in Chempound, and its associated metadata, and a resource store which holds the actual data files.

Figure 4. Overview of the architecture and components of Chempound.

Machine Understandable

Following the Linked Data paradigm, Chempound is built on RDF. The structure of items and collections is expressed as ORE Resource Maps (Lagoze et al. 2008), using RDF, and metadata about items and collections is stored as RDF. Where data is in a format that Chempound is aware of, an RDF representation is generated, and used to index the data. All of this semantic data is added to a triple store, allowing rich searches using SPARQL (Prud'hommeaux and Seaborne 2008) – the standard RDF query language – and linking of concepts between different items in the repository.

Figure 5(a). Items in Chempound are represented as an ORE Aggregation of resources (files). A crystal structure typically contains the original CIF file, a CML file and full-sized and thumbnail images of the crystallographic unit cell.

Figure 5(b). Similarly, Chempound represents collections as ORE Aggregations of items.
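The aggregation structure of Figure 5 can be sketched as a handful of triples. The ORE and RDF term URIs are the standard ones; the repository URL, item URI and file names are hypothetical, and the sketch omits the Resource Map itself and any metadata.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
ORE = "http://www.openarchives.org/ore/terms/"


def aggregation_triples(aggregation_uri, member_uris):
    """Describe `aggregation_uri` as an ORE Aggregation of its members."""
    triples = [(aggregation_uri, RDF_TYPE, ORE + "Aggregation")]
    for member in member_uris:
        triples.append((aggregation_uri, ORE + "aggregates", member))
    return triples


# Hypothetical item URI and file names, following the crystal structure example.
item = "http://repo.example.org/content/item1"
files = [item + "/" + name for name in
         ("structure.cif", "structure.cml", "cell.png", "cell-thumb.png")]

item_graph = aggregation_triples(item, files)

# A collection is modelled the same way, aggregating items instead of files.
collection = "http://repo.example.org/content/"
collection_graph = aggregation_triples(collection, [item])

for triple in item_graph + collection_graph:
    print(triple)
```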

All of this RDF is exposed to software agents through the HTTP Content Negotiation system. This allows humans and machines to retrieve alternative representations of a data item from the same URI. When a human requests a data item from Chempound, their web browser will automatically add a header saying that it prefers to receive HTML or XHTML. Chempound will recognise this, and will direct the user to the item's HTML splash page. If a software agent requests the same data item, using the same URI, it can include a header saying that it prefers to receive RDF. Again, Chempound will recognise this when it receives the request, and will direct the software agent to a machine-understandable RDF representation of the data item.
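The server-side decision can be sketched as follows. This is a simplified reading of the HTTP Accept header (it ignores wildcards such as */*), and the mapping from media types to representation names is an assumption made for the example rather than Chempound's actual configuration.

```python
def parse_accept(header):
    """Parse an HTTP Accept header into (media_type, q) pairs, best first."""
    choices = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # Default quality when no q parameter is given.
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        choices.append((media_type, q))
    # sorted() is stable, so equal q-values keep the client's order.
    return sorted(choices, key=lambda c: c[1], reverse=True)


# Hypothetical mapping from media type to the representation to serve.
REPRESENTATIONS = {
    "text/html": "index.html",
    "application/xhtml+xml": "index.html",
    "application/rdf+xml": "index.rdf",
}


def choose_representation(accept_header, default="index.html"):
    """Pick the first acceptable representation, or fall back to HTML."""
    for media_type, q in parse_accept(accept_header):
        if q > 0 and media_type in REPRESENTATIONS:
            return REPRESENTATIONS[media_type]
    return default


print(choose_representation("text/html,application/xhtml+xml;q=0.9"))
print(choose_representation("application/rdf+xml"))
```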

Chempound publishes an Atom (Nottingham and Sayre 2005) archive (Nottingham 2007) feed, providing a history of the data items that have been deposited in the repository. Feed readers and other software can poll this in order to monitor a repository's content.
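A client monitoring such a feed might look something like the sketch below, which uses only the Python standard library. The feed content is made up for the example, and a real client would fetch the document over HTTP and follow the archive links between feed pages, which this sketch does not attempt.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# A made-up feed document standing in for a repository's deposit history.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example repository deposits</title>
  <entry>
    <id>http://repo.example.org/content/item2</id>
    <title>Second deposit</title>
    <updated>2011-06-02T10:00:00Z</updated>
  </entry>
  <entry>
    <id>http://repo.example.org/content/item1</id>
    <title>First deposit</title>
    <updated>2011-06-01T09:00:00Z</updated>
  </entry>
</feed>"""


def new_entries(feed_xml, seen_ids):
    """Return (id, title) for feed entries whose id is not in `seen_ids`."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        if entry_id not in seen_ids:
            fresh.append((entry_id, entry.findtext(ATOM + "title")))
    return fresh


already_seen = {"http://repo.example.org/content/item1"}
print(new_entries(SAMPLE_FEED, already_seen))
```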

Domain-Awareness

Chempound provides a modular, plugin-based system allowing the installation of components providing domain intelligence for different types of data. These can be used to provide a domain specific user interface and customised search tools.

In common with traditional repository systems, Chempound presents users with an HTML ‘splash’ page when they navigate to an item in the repository. Chempound defaults to a simple page displaying core metadata such as the item's title, author, description and the list of files making up the item; however, alternative templates can be installed for specific data types. These allow Chempound to select the appropriate data fields to present to users when viewing different types of data, based on the RDF class of the item. For instance, the splash page for a crystallographic structure might include an interactive 3D model of the molecule, along with the contents of fields describing the crystal system, and the quality of the recorded data. Alternatively, a computational chemistry result's splash page may present the ground-state energy of the molecule, and various calculated properties.
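Selecting a template from the RDF class of an item can be pictured as a simple registry lookup. In this sketch the class URIs, template names and function name are all hypothetical; only the rdf:type URI is standard.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical class URIs and template file names registered by plugins.
CRYSTAL = "http://example.org/types#CrystalStructure"
CALCULATION = "http://example.org/types#Calculation"

TEMPLATES = {
    CRYSTAL: "crystal-splash.html",
    CALCULATION: "compchem-splash.html",
}
DEFAULT_TEMPLATE = "default-splash.html"


def select_template(item_uri, graph):
    """Return the template registered for the item's RDF class, if any."""
    for s, p, o in graph:
        if s == item_uri and p == RDF_TYPE and o in TEMPLATES:
            return TEMPLATES[o]
    # Items of unknown type fall back to the core-metadata page.
    return DEFAULT_TEMPLATE


graph = [
    ("http://repo.example.org/content/item1", RDF_TYPE, CRYSTAL),
    ("http://repo.example.org/content/item2", RDF_TYPE, CALCULATION),
    ("http://repo.example.org/content/item3", RDF_TYPE,
     "http://example.org/types#Spectrum"),
]

for uri in sorted({s for s, _, _ in graph}):
    print(uri, select_template(uri, graph))
```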

We have deliberately ignored full-text searching in the development of Chempound since, so long as the repository is exposed on the open web, full-text searches will be supported by the likes of Google and Bing far better than we could hope to. Instead, search in Chempound has focused on the areas that text-based search engines do not support: numbers, ranges, dates and domain-specific data types such as molecular structures and bio-sequences.

Representing data using RDF makes it possible to perform a wide variety of searches over the data. Common queries can be simply constructed using web forms, which are provided by the domain aware plugins.

Figure 6. Crystallography data search form. Custom, domain-aware, search tools can be installed into a Chempound repository.
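One way a search form could be turned into a query is sketched below. The cif:space_group_crystal_system property appears in the example query in this paper; cif:cell_volume and the function names are assumptions for illustration, and the generated SPARQL is not claimed to match what Chempound's plugins produce.

```python
CIF = "http://www.xml-cml.org/dictionary/cif/"


def escape_literal(text):
    """Escape a value for use inside a double-quoted SPARQL string."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def build_query(crystal_system=None, min_volume=None, max_volume=None):
    """Build a SPARQL query from optional form fields."""
    patterns = []
    filters = []
    if crystal_system is not None:
        patterns.append("?cif cif:space_group_crystal_system ?system .")
        filters.append('?system = "%s"' % escape_literal(crystal_system))
    if min_volume is not None or max_volume is not None:
        # cif:cell_volume is an assumed property name.
        patterns.append("?cif cif:cell_volume ?volume .")
        if min_volume is not None:
            filters.append("?volume >= %s" % float(min_volume))
        if max_volume is not None:
            filters.append("?volume <= %s" % float(max_volume))
    if not patterns:
        raise ValueError("at least one search field is required")
    lines = ["PREFIX cif: <%s>" % CIF, "SELECT ?cif WHERE {"]
    lines += ["  " + pattern for pattern in patterns]
    lines.append("  FILTER (%s)" % " && ".join(filters))
    lines.append("}")
    return "\n".join(lines)


print(build_query(crystal_system="orthorhombic", min_volume=100, max_volume=500))
```

Numeric fields are passed through float() and strings are escaped before being placed in the query, so form input cannot change the structure of the generated SPARQL.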

A SPARQL endpoint is provided for advanced users to perform more complex queries, and to enable the development of other services around the repository. The graph structure of RDF makes it possible to generate complex queries crossing between multiple data items, or even types, where common identifiers are available.

For example, find the reported crystal structures in orthorhombic space groups for molecules having a calculated ground state energy greater than -40 kcal/mol:

    PREFIX cif: <http://www.xml-cml.org/dictionary/cif/>
    PREFIX cc: <http://www.xml-cml.org/dictionary/compchem/>
    PREFIX cml: <http://www.xmlcml.org/rdf-schema#>
    SELECT ?cif ?system ?comp ?energy ?inchi {
      ?cif cif:space_group_crystal_system ?system .
      # Bind ?energy so the FILTER can test it; the property name
      # cc:ground_state_energy is assumed for illustration.
      ?comp cc:ground_state_energy ?energy .
      # The shared InChI joins the crystal structure to the calculation.
      ?cif cml:inchi ?inchi .
      ?comp cml:inchi ?inchi .
      FILTER (?system = "orthorhombic" && ?energy > -40.0)
    }

Figure 7. Example SPARQL query linking a crystal structure to computational chemistry calculations. The data items are linked on the basis of shared InChI molecular identifiers.
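The join that this kind of query performs can be mimicked in plain Python to show the logic: records are matched on a shared InChI and then filtered. The records, URIs and energy values below are invented, and a triple store would of course evaluate the query quite differently.

```python
# Invented records standing in for the RDF descriptions of deposited items.
crystals = [
    {"uri": "cif/1", "system": "orthorhombic", "inchi": "InChI=1S/CH4/h1H4"},
    {"uri": "cif/2", "system": "monoclinic", "inchi": "InChI=1S/H2O/h1H2"},
    {"uri": "cif/3", "system": "orthorhombic", "inchi": "InChI=1S/H2O/h1H2"},
]
calculations = [
    {"uri": "calc/1", "energy": -35.0, "inchi": "InChI=1S/CH4/h1H4"},
    {"uri": "calc/2", "energy": -76.0, "inchi": "InChI=1S/H2O/h1H2"},
]


def join_on_inchi(crystals, calculations, system, min_energy):
    """Pair crystal structures with calculations that share an InChI."""
    by_inchi = {}
    for calc in calculations:
        by_inchi.setdefault(calc["inchi"], []).append(calc)
    results = []
    for crystal in crystals:
        if crystal["system"] != system:
            continue
        for calc in by_inchi.get(crystal["inchi"], []):
            if calc["energy"] > min_energy:
                results.append((crystal["uri"], calc["uri"], calc["energy"]))
    return results


print(join_on_inchi(crystals, calculations, "orthorhombic", -40.0))
```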

Finally, index plugins can be registered to receive notification of all new deposits in order to provide domain specific searches, beyond those supported by the triple store's indexing, such as chemical structure indexes or bio-sequence similarity searches.
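The notification mechanism can be pictured as a listener pattern. The class and method names in this sketch are hypothetical and do not reflect Chempound's plugin API; the formula index is a toy stand-in for a real chemical structure index.

```python
class Repository:
    """Minimal stand-in for a repository that notifies index plugins."""

    def __init__(self):
        self._listeners = []
        self.items = {}

    def register_index(self, listener):
        self._listeners.append(listener)

    def deposit(self, uri, properties):
        self.items[uri] = properties
        # Every registered index hears about every new deposit.
        for listener in self._listeners:
            listener.on_deposit(uri, properties)


class FormulaIndex:
    """Toy domain index keyed on a whitespace-normalised formula."""

    def __init__(self):
        self._by_formula = {}

    @staticmethod
    def _normalise(formula):
        return "".join(formula.split())

    def on_deposit(self, uri, properties):
        formula = properties.get("formula")
        if formula is not None:
            key = self._normalise(formula)
            self._by_formula.setdefault(key, []).append(uri)

    def search(self, formula):
        return self._by_formula.get(self._normalise(formula), [])


repository = Repository()
index = FormulaIndex()
repository.register_index(index)
repository.deposit("item/1", {"formula": "C8 H10 N4 O2"})
repository.deposit("item/2", {"title": "No formula here"})
print(index.search("C8H10N4O2"))
```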

Creation and ingest of data for Chempound

Plugins to Chempound are able to programmatically deposit data items; however, the primary means of deposit is via a SWORD2 (Jones and Lewis 2011) endpoint. Deposits consist of a package of files, with associated metadata. On ingest, RDF data and metadata are added to the repository's triple store to facilitate searching.
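A client-side deposit might be prepared along the following lines. The headers are those defined by the SWORD 2.0 profile for a binary deposit, but the collection URL is hypothetical, and whether a given Chempound instance accepts the SimpleZip packaging is an assumption. The sketch builds the request without sending it.

```python
import io
import urllib.request
import zipfile

# Packaging identifier defined by the SWORD 2.0 profile.
SIMPLE_ZIP = "http://purl.org/net/sword/package/SimpleZip"


def build_package(files):
    """Zip a mapping of file name -> bytes into an in-memory package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in sorted(files.items()):
            archive.writestr(name, data)
    return buffer.getvalue()


def build_deposit_request(collection_url, package, filename="deposit.zip"):
    """Prepare (but do not send) a SWORD2 binary deposit request."""
    headers = {
        "Content-Type": "application/zip",
        "Content-Disposition": "attachment; filename=%s" % filename,
        "Packaging": SIMPLE_ZIP,
        "In-Progress": "false",
    }
    return urllib.request.Request(collection_url, data=package,
                                  headers=headers, method="POST")


package = build_package({
    "structure.cif": b"data_example\n",
    "structure.cml": b"<cml/>",
})
# Hypothetical collection URL; urllib.request.urlopen(request) would send it.
request = build_deposit_request("http://repo.example.org/sword/collection",
                                package)
print(request.get_method(), request.full_url, len(package))
```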

We have developed simple command-line tools for the deposit of crystal structures and the output of computational chemistry calculations. These take a legacy format file and transform it into Chemical Markup Language (CML – an XML dialect for chemistry) and RDF, and load the results into Chempound via a SWORD2 deposit. We expect this approach to be extended to other data types as support for them is added to Chempound.

Figure 8. The workflow for the import of chemical data into Chempound. Data stored in a legacy format, such as a Crystallographic Information File (CIF), or a computational chemistry Log File is converted into CML. Common data elements and conventions are identified, and transformed into standard structures – so, for example a molecule is represented in the same manner whether it comes from a crystal structure determination, a computational chemistry calculation or a synthetic report. Finally an RDF representation of the data is generated for indexing purposes, containing the values of scalar properties, and recording the presence/absence of more complex data types.
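The final indexing step of this workflow can be illustrated with a deliberately naive sketch that pulls scalar properties straight out of CIF text and emits triples. It skips the CML stage entirely, ignores loops and multi-line values, and the item URI is hypothetical, so it should be read as an outline of the idea rather than the conversion tools themselves.

```python
CIF_NS = "http://www.xml-cml.org/dictionary/cif/"

# A small made-up CIF fragment.
SAMPLE_CIF = """\
data_example
_space_group_crystal_system orthorhombic
_cell_length_a 7.512(2)
_chemical_formula_sum 'C8 H10 N4 O2'
loop_
_atom_site_label
C1
"""


def parse_value(raw):
    """Turn a CIF value into a float where possible, else a string."""
    raw = raw.strip()
    if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "'\"":
        return raw[1:-1]
    # Drop a trailing standard uncertainty such as the "(2)" in 7.512(2).
    number = raw.split("(")[0]
    try:
        return float(number)
    except ValueError:
        return raw


def scalar_triples(cif_text, item_uri):
    """Emit one triple per single-line `_name value` pair."""
    triples = []
    for line in cif_text.splitlines():
        parts = line.strip().split(None, 1)
        # Loop column headers have a name but no value, so they are skipped.
        if len(parts) == 2 and parts[0].startswith("_"):
            name = parts[0][1:]
            triples.append((item_uri, CIF_NS + name, parse_value(parts[1])))
    return triples


for triple in scalar_triples(SAMPLE_CIF, "http://repo.example.org/content/item1"):
    print(triple)
```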

Current implementations and experience

Chempound repositories have been deployed in two major areas of chemistry:

Computational Chemistry

Several million computational chemistry experiments are produced each year and yet there is currently no mechanism for capturing this data. A novel, bottom-up community project (Quixote) has recently formed to address this issue. Working with the Quixote project we have developed systems for converting the conventional output of computational chemistry calculations (flat ASCII log files) into structured XML with dictionary support. A Chempound repository has been set up to hold this computational chemistry data and to date the results of approximately 7000 calculations have been deposited.

Figure 9. A domain-aware splash page for a computational chemistry calculation. The template embeds a 3D model of the molecule, and reports details of the software and type of calculation performed, along with selected calculated properties.

Crystallography

CrystalEye is an index of crystallographic data, openly accessible as supplemental information accompanying conventional publications. A ‘PubCrawler’ system frequently spiders the journal tables of contents from a number of major chemical publishers and identifies any Crystallographic Information Files (CIF) included as supplemental information. The CIFs are converted into Chemical Markup Language and published on the CrystalEye website.

We have deposited approximately 127,000 crystal structures from CrystalEye into an instance of Chempound, to provide richer search functionality than was previously possible. Growing the repository to this size has identified some performance black spots, due to the large number of triples generated. We are currently exploring caching and pagination options in order to resolve these problems.

Figure 10. A domain-aware splash page for a crystallographic structure. As with the computational chemistry calculation, the template embeds a 3D model of the molecule, but reports a different selection of properties, of interest to crystallographers.

Potential Deployment

Although Chempound can be deployed at institutional level, we expect that it is initially more likely to be deployed at a local – research group, or departmental – level. As webs of linked open data grow, we expect that emerging tools will allow an effective federation of distributed repositories. Because of this, there is much less need for central planning and specialist support. We also expect that different repositories will have different emphases and might be differentiated by type of content (e.g. different chemistry), quality (e.g. validation and coherence) or associated with institutional or international research projects.

Chempound itself can be operated in a distributed manner. Since the files/resources constituting an item in Chempound are identified by URIs, there is no reason why some or all of them cannot be stored outside of Chempound. We envisage the scenario where the raw data files might be stored in a traditional institutional repository, to benefit from the preservation that this entails, while an instance of Chempound acts as a ‘standoff’ or ‘overlay’ repository, providing domain-aware browsing and indexing capabilities over the same data. The current standardised protocol for repository harvesting, OAI-PMH 2.0, only specifies how to retrieve metadata (e.g. title, author list etc.) about an item, and does not specify a standard method for retrieving a list of an item's constituent files/resources. It is hoped that a future version of the protocol will provide this, but until then harvesters/scrapers must be customised for different sources.
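For reference, the metadata harvesting that OAI-PMH 2.0 does standardise amounts to requests such as the ones built below. The verb and argument names come from the protocol; the base URL is hypothetical. Note that the response carries metadata records only, which is the limitation discussed above.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: it replaces the others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


# Hypothetical repository endpoint.
BASE = "http://repo.example.org/oai"
print(list_records_url(BASE))
print(list_records_url(BASE, resumption_token="page 2"))
```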

Figure 11. The Quixote project's vision for computational chemistry data publication. Researchers load the outputs of their calculations and simulations into repositories. Some will have access to institutional repositories (e.g. Daresbury, Cambridge in this figure), or even a personal repository. Researchers who do not have access to a local repository can instead use public repositories hosted in the cloud. A ‘CompEye’ service provides a federated view over all this data, and may monitor the various repositories' Atom feeds to discover new content. Open data published in the literature may be collected by spidering the supporting information of articles, in a similar manner to CrystalEye. Researchers interested in monitoring the latest developments can monitor Atom feeds of new calculations using services such as Google Reader. Chempound can support much of this scenario already, and other aspects are currently under development.

Conclusions

The Chempound architecture is straightforward to extend to any branch of science which has implemented a semantic architecture. This need not be a formal ontology expressed in a language such as OWL 2; in many cases simple dictionaries (expressible in RDF triples) are all that is required. The major current limitations are the scale of large collections and the relatively slow performance of triple-based searches. Some of this can be solved by pre-indexing common queries, but we also expect that RDF systems will continue to improve in performance and that the simple ingest procedure will be sufficient for many organisations.
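Pre-indexing a common numeric query can be sketched as follows: values for one property are sorted once at ingest time, after which a range search is a pair of binary searches rather than a scan over all triples. The class name, property URI and data are hypothetical, and this is one possible approach rather than the indexing Chempound uses.

```python
import bisect


class RangeIndex:
    """Sorted index over the numeric values of a single predicate."""

    def __init__(self, triples, predicate):
        pairs = sorted((o, s) for s, p, o in triples if p == predicate)
        self._values = [value for value, _ in pairs]
        self._subjects = [subject for _, subject in pairs]

    def between(self, low, high):
        """Return subjects whose value lies in the closed range [low, high]."""
        start = bisect.bisect_left(self._values, low)
        end = bisect.bisect_right(self._values, high)
        return self._subjects[start:end]


# Hypothetical property URI and invented values.
VOLUME = "http://example.org/vocab/cellVolume"
triples = [
    ("item/1", VOLUME, 845.2),
    ("item/2", VOLUME, 120.0),
    ("item/3", VOLUME, 501.7),
    ("item/3", "http://example.org/vocab/title", "Not indexed"),
]

index = RangeIndex(triples, VOLUME)
print(index.between(100.0, 600.0))
```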

Repositories seem to be an obvious location for integrating and exploring data, but the primary focus of most existing repositories is growing and preserving their collections, and too little attention is paid to their usability. Our experiences promoting the publication of scientific data and engaging with communities of data producers have shown us that scientists will only deal with domain repositories that are trivial to use and that understand their data. This requires specialist, per-discipline tools. Such tools require a large investment to develop, but if we want data-rich communities to engage with repositories then we need to address their needs.

Repositories can be much more than archives or museums, but their focus needs to change. Enabling reuse is more important than preservation.