Recent years have seen a change in attitude within academia towards research data. It may be characterized as a shift along the spectrum away from data as a disposable commodity or private concern and towards data as a first-class product of research, though attitudes have always varied widely across disciplines (Wilson et al. 2010; Jones 2009). The transition from analogue to digital data, and concurrent advances in information and communication technology, have certainly assisted this shift by making it more practical to access and re-use datasets, but many other drivers can be identified. There is the quality argument, that publishing data drives up the quality of research not just because it makes the results easier to verify and (where applicable) reproduce, but also because it makes the entire research process more transparent and accountable. If researchers know their data will be made public, this provides a strong incentive to ensure all aspects of its generation and processing can be justified (Stodden 2009; RIN and NESTA 2010). There is the opportunity argument, that interdisciplinary studies, metastudies and 'Citizen Science' are providing more opportunities than ever before to re-use and re-process data in novel ways and contexts. Besides which, within disciplines such as astronomy (STScI 2011) and bioinformatics (Edwards et al. 2009), there is ample evidence of significant quantities of research based on secondary uses of data. This argument forms the basis for the economic argument, that by publishing data and allowing its reuse, the return on the investment made to create it is increased. There are, additionally, efficiency savings to be gained from the non-duplication of effort in generating data. Further arguments can be made on how publishing data increases its academic, social and economic impact (Pienta et al. 2010; Piwowar, Day and Fridsma 2007).
Providing access to research data is in itself no small thing. If personal or otherwise sensitive information is involved, the data can only be released if appropriate permissions have been granted, and only then if the conditions of the agreement (e.g. anonymization, embargo periods, restricted access) are met. Equally there may be intellectual property issues to resolve. But if the above benefits are to be realized, mere access is not enough. It must be possible to find the data, whether by means of citations or search services, and, above all, the data must be in a fit state for other researchers to re-use and re-purpose them. This has many implications for how the data are managed throughout the course of the research. For integrability it is helpful if the data are generated and collected using common, respected methods and models. For interoperability it is helpful if they are stored in well-supported file formats. In order to remain explicable they must be fully documented and contextualized. In order to remain part of the research record, they must be preserved at the bit level and adapted over time in response to changes in prevalent software, data models and semantics. This level of management cannot be left to happen by chance: it must be properly planned and resourced.
This being so, research funders are becoming more forthright in their requirements that researchers plan their data sharing and data management activities. In the US, both the National Institutes of Health and, since January 2011, the National Science Foundation require that all applications for funding should include a data management plan (NIH 2003; NSF 2011, Chapter II.C.2.j). In the UK, the Wellcome Trust and five out of the seven Research Councils require some form of data management plan to be submitted as part of research proposals (Jones 2009). In May 2011, the major funder for engineering research, the Engineering and Physical Sciences Research Council (EPSRC), introduced new data management requirements for institutions that receive its funding (EPSRC 2011), but stopped short of a requirement to submit data management plans with funding proposals; this, and the fact that there is no tradition of data sharing within the discipline, means there is little other than generic guidance or experience for researchers to draw on when planning the management of their data. In response to this need, the Engineering Research Information Management (ERIM) Project has developed a data management planning regime for a research centre at the University of Bath, using an approach that may be re-used by other engineering research centres. This project is introduced in greater depth in Section 2.
In the course of the project's work, a technique has been developed for visualizing the context of research data records for the purposes of better management. This paper describes the modelling technique. Section 3 introduces the terminology and metamodel used in the technique, while Section 4 shows how the records and processes used in the research are visualized. Section 5 places this modelling technique in the wider data management context, and finally some conclusions are presented in Section 6.
In 2009, the Joint Information Systems Committee (JISC) of the UK further and higher education funding councils initiated a programme of research entitled Managing Research Data (Down 2010). One of the strands in this programme comprised a set of projects providing detailed case studies of data management planning within different disciplines, with each case study drawn from the research interests of a different major UK funding body. These projects all aimed to illuminate the most pressing data management issues for their disciplines, and illustrate how they might be overcome by producing exemplar data management plans.
The EPSRC as a funding body and engineering as a discipline were represented in this programme strand by the ERIM Project. ERIM was an eighteen-month collaboration between the Innovative Design and Manufacturing Research Centre (IdMRC) at the University of Bath and the UK Digital Curation Centre, specifically its constituent partner at the University of Bath, UKOLN. The project had several broad aims:
- to specify in practical terms how data management may most effectively be performed within the IdMRC's research projects and, by extension, similar projects undertaken elsewhere;
- to explore the opportunities for and barriers to the re-use of engineering information, including the results of research conducted using highly sensitive industrial data and information;
- to identify the information required to ensure that research datasets are re-usable, either by the original researchers or others.
These aims translated into a set of objectives that included:
- a report into the state of the art of the digital curation of research data (Ball 2010);
- a set of six case studies intended to exhibit the gamut of data types, working practices, challenges and re-use opportunities in evidence within engineering research (Howard et al. 2010);
- a data management regime for use within the IdMRC that could be adapted by other similar centres.
The case studies in the second objective were selected from among research activities within the IdMRC itself and within a recent, major research project in which the IdMRC was the lead partner (Ball et al. 2006). In order to achieve the widest possible variety between the six cases eventually chosen, the research activities were classified according to factors such as: whether they relied on pre-existing data; the variety or otherwise of data types used; whether the research described an existing practice or sought a new or improved one; and whether the activity focused on the real world or a simulation of it.
Having selected the case studies, the question arose of how best to analyse them, both to uncover the pressing management issues and to suggest mitigating policies and procedures. Existing models such as the OAIS Reference Model (CCSDS 2002) and the DCC Curation Lifecycle Model (Higgins 2008) did not provide sufficient detail for stages prior to data being ingested into a repository, and while the I2S2 Idealised Scientific Research Activity Lifecycle Model (Patel 2010, p. 24) was more relevant it did not go down to the required level of granularity. The authors therefore embarked on a course of theoretical work to model in detail the trail of data objects left behind by research activities, a model whose validity would be tested in the analysis of the case studies. This theoretical work, quite unexpectedly, became a major output of the ERIM Project.
When characterizing engineering research data during this project, two important findings emerged that are worthy of mention here. The first is that engineering research data are very diverse: from the samples taken, every conceivable type of research data might reasonably be encountered as a result of engineering research. It follows that methods for the more effective management of engineering research data will be usefully applicable to research data in general. The second finding, in hindsight unsurprising, is that to understand the data output of a research activity after the event, such that the data may usefully be used again in some way, requires that the context in which the data were generated be understood too, and that the relationships between individual elements of information (at the file level at least; see below) be known. This implies that it is not just research data that must be preserved after a research activity, but that contextualizing information must be gathered and stored with those data, such that their future use can be assured. Furthermore, in order to ensure the interpretability of research data by those unfamiliar with them, it is necessary to make explicit the associations between research data, and between research data and their explicating information.
It is the realization of the second point that is the chief motivation for the modelling approach developed during the ERIM Project, and recognition of the first point that makes the ERIM theoretical work and modelling approach of interest outside the bounds of engineering research data management alone.
Fundamental to the ERIM model of research activity was a Data Management Terminology, full details of which may be found in Darlington (2011a). In order to clarify discussions on how data, once created, might be used, the following terms were defined.
- Data Use. Using research data for the current research purpose/activity to infer new knowledge about the research subject.
- Data Re-use. Using research data for a research purpose/activity other than that for which it was intended.
- Data Purposing. Making research data available and fit for the current research activity.
- Data Re-purposing. Making existing research data available and fit for a future known research activity.
- Supporting Data Re-use. Managing existing research data such that it will be available for a future unknown research activity. Unfortunately, no apposite verb could be found for this activity, hence the slight circumlocution; however, it overlaps significantly with certain definitions of archiving, preservation and curation.
The data management performed in the course of a project can be seen as primarily serving the needs of data purposing, while the needs of data re-purposing are usually served, if at all, at the point of project completion when working up data for archiving. Supporting data re-use predominantly happens after ingest into a repository. Put into these terms, the vision for the new data management regime was to adapt the processes that served data purposing so that they also served data re-purposing and supporting data re-use. In this way, either the same effort on the part of the researcher would result in more re-usable data, or the same amount of re-usable data could be achieved with less effort. Realizing this vision would entail some analysis of existing processes and the resulting data assets, hence the direction of the theoretical work.
When it came to describing the data associated with research activities, there was some concern over the granularity with which they were modelled (Darlington et al. 2008). The notion of being able to track the provenance of each individual datum in a published dataset back to its point of creation was attractive, and indeed there are efforts in some disciplines to move in this direction (Groth, Gibson and Velterop 2010). It was felt, however, that within the scope of the project a more achievable aim would be to concentrate on data management at the file level and higher. With this in mind, the following terms were defined.
- Information. Any type of knowledge that can be exchanged. In an exchange, it is represented by data.
- Data. Reinterpretable representations of information in a formalized manner suitable for communication, interpretation or processing.
- Data Object. Either a physical object or a digital object containing data.
- Data Record. A data object created, received and maintained as evidence of an activity.
- Data Case. The set of data records associated with some discrete research activity (project, task, experiment, etc.).
The notion of a digital object corresponds roughly with that of a file. As digital (and physical) objects can be nested, in cases where a single file contains multiple data objects, the file can still be treated as a digital object in its own right. The only obstacle to collapsing the distinction between object and file is therefore the case where a single digital object is represented by a series of files.
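For illustration only, this nesting might be sketched as a small data structure (the class and field names here are hypothetical, not part of the ERIM terminology), showing how a single file can contain several data objects while still being treated as a digital object in its own right:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DigitalObject:
    """A digital object: roughly a file, but objects may be nested."""
    name: str
    children: List["DigitalObject"] = field(default_factory=list)
    file_path: Optional[str] = None  # set when the object maps onto one file

# A single spreadsheet file holding two logical data objects:
workbook = DigitalObject(
    name="results.xlsx",
    file_path="results.xlsx",
    children=[DigitalObject("temperature-profiles"),
              DigitalObject("cutting-forces")],
)
```

The awkward remaining case, a single digital object spread across several files, would need a further structure grouping `file_path` values, which is why the distinction cannot be collapsed entirely.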
Key to the entire data management endeavour is the notion of a data record. It implies that the data object has a level of importance within the research activity, and of conscious curatorial effort, exceeding that given to those which merely happen to survive. In order to model and visualize the roles of the various records in a data case, and thereby make decisions on how they should be managed, the authors constructed the following taxonomy (see Figure 1).
- Research Data Record. A record containing data that are descriptive of the research object, whether generated by the current research activity or a previous one. Its role is to provide the basis for inferring new knowledge about the object.
- Context Data Record. A record containing data that support the research activity but do not describe the research object nor are the research object itself. Such records might include a description of the methodology, an explanatory narrative, dictionaries, ontologies, standards documents or environmental data. Where this information has not been provided explicitly to explain or illuminate other data or records, it is referred to instead as an unintentional context data object.
- Associative Data Record. A context data record that makes explicit the associations between other data records or data.
- Research Object Data Record. A record containing data that are themselves the object of the research enquiry. For example, in textual analyses of nineteenth-century newspapers, the newspapers themselves act as Research Object Data Records.
- Experimental Apparatus Data Record. A record containing symbolic representations which are functionally analogous to the physical experimental apparatus familiar in much laboratory-based research. For example, in textual analyses of nineteenth-century newspapers, the source code of the text-mining software and the parameters used to control it would form the content of Experimental Apparatus Data Records.
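For illustration, the taxonomy above might be rendered as a small class hierarchy; the rendering and the instance shown are hypothetical sketches, not part of the ERIM specification:

```python
class DataRecord:
    """A data object kept deliberately as evidence of a research activity."""
    def __init__(self, name, description=""):
        self.name = name
        self.description = description

class ResearchDataRecord(DataRecord):
    """Data descriptive of the research object."""

class ContextDataRecord(DataRecord):
    """Data supporting the activity without describing the research object."""

class AssociativeDataRecord(ContextDataRecord):
    """A context record making associations between other records explicit."""

class ResearchObjectDataRecord(DataRecord):
    """Data that are themselves the object of enquiry."""

class ExperimentalApparatusDataRecord(DataRecord):
    """A symbolic stand-in for physical experimental apparatus."""

# e.g. the digitized newspapers in a text-mining study:
corpus = ResearchObjectDataRecord("newspapers-1850-1899.zip",
                                  "Scanned newspaper corpus under analysis")
```

Note that the associative record type specializes the context record type, mirroring the taxonomy in Figure 1.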
When it came to using this taxonomy in the six case studies, the authors did not always find it easy to decide in which category a particular record should go. When analysing records of what Blessing and Chakrabarti (2009) term prescriptive research, where experimental processes and designs are tested, they found the lines between research data records, research object data records and context data records especially blurred. Nevertheless, they found the process of classifying records useful as a method of appreciating the overall scope and shape of a data case.
Having decided on a terminology for describing the data objects associated with a research activity, two further tasks remained. One was to model the processes by which a data case is developed, and the second was to provide a way of visualizing both data objects and data development processes. For the second task, the desiderata were for the visualization:
- to use an existing standard, so that it could be at least partially understood using just the documentation for that standard;
- to be both expressive enough to model most data cases but rigorous enough to permit comparisons between diagrams;
- to be amenable to automation, both in terms of generating diagrams automatically and translating them to machine readable code/linked data.
After a review of the available standards, the most suitable representation languages were found to be IDEF3 (Mayer et al. 1995) and UML (OMG 2009). While both had strengths and weaknesses, UML was chosen as being more familiar to those in the data management community. As both data objects and development processes were to be modelled, UML activity diagrams were chosen as the primary language to use, extended with UML object diagram notation. Certain aspects of UML notation had to be relaxed to accommodate the fact that the primary focus of the modelling would be on the objects rather than on the processes, and that the models would show actual instances rather than idealized classes. So, for example, as an aid to visual clarity, instance names were not underlined as they would normally be in UML; activity flows were allowed to be incomplete; objects could be included without being an explicit part of a flow; and where multiple flows were shown on a single diagram they were not interpreted as strictly concurrent.
The process of modelling a research activity in this way was given the name Research Activity Information Development (RAID) modelling, and the resulting diagrams were termed RAID diagrams. The intention was and is for one diagram to represent one research activity and hence one data case, though large data cases may require several diagrams.
Digital objects considered in the abstract (i.e. not as part of a file) and physical objects are visualized as simple UML objects. Digital objects that existed as files but were not kept as records are represented as UML objects, with the file name as the instance name and the role as the class name. If the role cannot be determined, the class 'Temporary File' may be used. Properties of the file are given in the lower partition (see Figure 2); to save space, the property keys 'mediaType', 'origin' and 'description' have been abbreviated to their initial letters.
Digital data records are modelled as UML datastores. If the record is manifested as a single file, its file name is given as the instance name, otherwise a brief identifying phrase is used. The role of the record is given as the class name; if the role cannot be determined, the generic role 'Data Record' may be used. Again, properties of the file(s) are given in the lower partition. If the record is not to be packaged as part of the data case being modelled, it is marked with the keyword 'external'.
If an object from which data are generated is not itself a record, it is modelled as a datastore with class 'Source' (see Figure 3).
When it came to enumerating the types of data development processes that occur in the course of research, the authors considered activities from across the spectrum of data purposing, data re-purposing and supporting data re-use, as well as from the perspectives of individual data, data records and entire data cases. The list they constructed was intended to be, but is not claimed to be, comprehensive from the perspective of engineering researchers (Howard et al. 2010). In all, fifteen processes were identified; these are defined below. When it came to visualizing these processes, some compromises had to be made due to the fact that RAID diagrams display the data records that make up a data case, and do not decompose these records into individual data. Thus processes that, strictly speaking, operate at the data level are visualized as linking the records that contain them. In order to clarify where such compromises have been made, the definitions below are accompanied by notes on how they are visualized.
Addition is the act of supplementing data at the data level. On a RAID diagram (see Figure 4), it is shown between two data records when data from within the first record has been added to the data within the second record.
Aggregation is the act of combining similar data from different sources for the purposes of increasing the sample size. A clear example found in the case studies was of questionnaire responses being combined to form a single dataset. On a RAID diagram (see Figure 5), it is shown as taking several records as inputs and outputting a further record when data from within the former records have been aggregated together and stored within the latter record.
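A minimal sketch of such an aggregation, assuming the questionnaire responses arrive as like-structured CSV text (the function name and format choice are illustrative only):

```python
import csv
import io

def aggregate(responses):
    """Combine like-structured responses into one dataset,
    increasing the sample size."""
    combined = []
    for text in responses:
        combined.extend(csv.DictReader(io.StringIO(text)))
    return combined

# Two respondents' answers to the same questionnaire:
r1 = "q1,q2\n4,5\n"
r2 = "q1,q2\n3,2\n"
dataset = aggregate([r1, r2])  # a single record of two rows
```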
Annotation is the act of adding 'information or additional marks formulated on a document for enhancing it with brief and useful explanations' (Evrard and Virbel 1996). In digital terms, this may be done either within the file itself using whatever commenting facilities the format provides ('inline'), or within a separate file that links to the annotated file or sections thereof ('stand-off'). On a RAID diagram (see Figure 6), inline annotation is represented as a UML activity, showing the data object whence the annotations originate if appropriate (examples [a] and [b]). Stand-off annotation is indicated by a UML note attached to the annotated record; the annotation can be placed directly in the RAID diagram if required (c), otherwise the file holding the annotations is represented as a data record within the note.
Association is the act of making explicit the relationship between data, data records and data cases. There are many ways in which this can be done, for example by using a file naming convention, embedding metadata or links, or constructing a RAID diagram. As such, the act of association itself is not appropriate for inclusion in a RAID diagram, as it would in most cases be either too abstract, too broadly applicable, or recursive. There is a convention in RAID diagrams, though, to record arbitrary associations as named object flows where the other action types do not apply (see Figure 7).
Augmentation is the act of adding a data record to a data case. In other words, it refers to a commitment to include a data object in any subsequent packaging of the data as evidence of the research activity. It is a process that occurs at a higher level than that depicted by RAID diagrams, but may be inferred from the inclusion within a RAID diagram of data objects marked as records (as opposed to transient data objects or data sources) without the 'external' keyword (see Figure 3, above).
Collation is the act of giving order to data assembled from different sources, and as such usually occurs in conjunction with aggregation. While it may seem a trivial operation to record, it can destroy potentially useful clues about data provenance, and therefore deserves some management attention. On a RAID diagram (see Figure 8), an act of collation that takes place at or close to the point at which a data record is created is represented by an action outputting the record (example [a]). Acts of collation that occur later or repeatedly are indicated by an action both inputting and outputting the same data record.
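The provenance hazard of collation can be shown in a small sketch: once rows from different sources are ordered together, the origin of each row is lost unless it is carried along explicitly (the file names here are hypothetical):

```python
# Rows assembled from two sources, each tagged with its origin:
rows = [("site-b.csv", 3.1), ("site-a.csv", 2.7), ("site-b.csv", 1.9)]

# Collating by value while keeping a provenance column preserves
# the clue that the sort would otherwise destroy:
collated = sorted(rows, key=lambda r: r[1])
```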
Collection is the act of acquiring or bringing together pre-existing data. On a RAID diagram (see Figure 9), it is shown between a data source and a data record containing (nothing but) pre-existing data acquired from the source. Since the definition of collection here applies equally to the use of pre-existing records and the aggregation/extraction of pre-existing data into a new record, on a RAID diagram the latter case may be represented either as a single 'collect' action or multiple 'collect' actions followed by an aggregation/extraction.
Deletion is the act of expunging or obliterating data. It can happen either by the destruction of an entire record, or by removing data from within a record. The latter is distinct from refinement (see below) in that the data are removed entirely rather than tidied or cleaned through the selective removal of individual data. Deletion is also distinct from neglect, that is, treating a data object as transient rather than as a record. On a RAID diagram (see Figure 10), if a deletion action is followed by a flow final node, this indicates that the preceding record has been destroyed; if the deletion is followed by a record, this indicates some data from the preceding record has been removed to create the following record. In-place deletion of data, where the changes are saved over the previous version instead of into a new version, should be explained in an annotation on the 'delete' action.
The term derivation is used in data contexts to describe many different kinds of transformation; for example, NASA use it for combining datasets, normalizing data to standard grids and creating different visualizations (CDMC 1986; EOSDP 1986). In the ERIM terminology, derivation is used only of processes that deliberately create new data from existing data, such as logical inference, extrapolation, or interpolation. On a RAID diagram (see Figure 11) derivation is represented as an action taking one or more records and optionally functions as input, and outputting a further record. The interpretation of this is that the function(s) have been used in conjunction with the data in the former record(s) to produce the data in the latter record.
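As an illustration of derivation in this narrower sense, the sketch below creates a new datum from an existing record by linear interpolation (the function name is hypothetical):

```python
def derive_by_interpolation(xs, ys, x):
    """Derivation: deliberately create a new datum from existing data
    by linear interpolation between neighbouring points."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x lies outside the range of the existing data")

# A new value derived midway between two recorded points:
value = derive_by_interpolation([0, 10], [0, 100], 5)
```

On a RAID diagram the interpolation function itself would appear as an optional second input to the 'derive' action.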
Duplication is simply the act of making an identical copy of a file. Since such an operation is usually a small, transient step in larger process (e.g. circulating a questionnaire), it is not usually included in RAID diagrams. If visualizing a data case prior to packaging for archiving, it might be useful to indicate duplicate records on a RAID diagram so they may be reviewed as candidates for deletion, in which case a 'duplicate' action can be used (see Figure 12).
Extraction is the creation of a new record from portions of the data in one or more existing records. On a RAID diagram (see Figure 13), it is shown between two data records when data from within the first record has been copied or removed and stored within the second record.
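A sketch of extraction at the data level, assuming tabular records (field names illustrative):

```python
def extract(record, fields):
    """Create a new record from selected portions of an existing one."""
    return [{k: row[k] for k in fields} for row in record]

full = [{"time": 0.1, "temp": 20.4, "note": "warm-up"},
        {"time": 0.2, "temp": 21.0, "note": ""}]

# A new record containing only the measurement columns:
subset = extract(full, ["time", "temp"])
```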
Generation is the act of creating new data by acting on or interacting with a research subject. On a RAID diagram (see Figure 14), it is represented by an action taking as inputs one or more sources (research subjects, experimental apparatus) and outputting one or more data records containing the new data. Records that explain the methodology may be indicated as annotations on the 'generate' action.
Migration is the transfer of digital information from one format to another with the intention of preserving the full information content. In the case studies it was shown to occur most frequently to allow the data to be processed by a different piece of software. On a RAID diagram (see Figure 15), it is shown between two records when the second record is the result of changing the file format of the first record. If information is deliberately lost in the migration, it should be represented as a combined migration and deletion process.
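A minimal sketch of a lossless migration, here from CSV to JSON (the format pairing is an illustrative assumption):

```python
import csv
import io
import json

def migrate_csv_to_json(csv_text):
    """Change file format while preserving the full information content."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)

migrated = migrate_csv_to_json("x,y\n1,2\n")
```

A migration that discarded columns in the process would, per the definition above, be shown as a combined migration and deletion.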
Population is the act of entering data from a record into an existing framework such as a knowledge base, database, pro-forma or modelling system. On a RAID diagram (see Figure 16), it is shown instead of an 'add' action when the second record is a database, knowledge base or some other form of record where the structure is driven by existing systems and frameworks rather than arising naturally from the contents.
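What distinguishes population from addition is that the target structure is fixed by an existing system rather than arising from the data. A sketch using an in-memory database (schema and values hypothetical):

```python
import sqlite3

# The table schema is imposed by the existing framework:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (run INTEGER, temp REAL)")

# Populating the framework from a data record:
record = [(1, 20.4), (2, 21.0)]
conn.executemany("INSERT INTO runs VALUES (?, ?)", record)
```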
Refinement is the act of re-expressing data in a different way or according to a different data model. It covers both reversible processes such as unit conversion and the correction of systematic errors, and irreversible processes such as precision reduction and the removal of outlying data points, noise, duplicates, and so on. Refinement differs from derivation in that it emphasizes correcting and simplifying existing data, rather than creating new data from them.
On a RAID diagram (see Figure 4), refinement is shown between two data records when data within the second record is a refined version of data within the first record. In cases of reversible refinement, it can sometimes be of benefit (in terms of storage and curatorial effort) to delete the earlier record while providing sufficient documentation to allow it to be recreated from the later record. Depending on the nature of the documentation, it should be added to the RAID diagram either as an input to or an annotation on the 'refine' action.
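A sketch of a reversible refinement, here a unit conversion, together with the documented inverse that would justify deleting the earlier record (function names hypothetical):

```python
def refine_c_to_k(temps_c):
    """Reversible refinement: re-express Celsius readings in kelvin."""
    return [t + 273.15 for t in temps_c]

def unrefine_k_to_c(temps_k):
    """Documented inverse: sufficient to recreate the earlier record."""
    return [t - 273.15 for t in temps_k]

original = [20.0, 25.5]
refined = refine_c_to_k(original)
# Rounding guards against floating-point residue in the round trip:
recreated = [round(t, 6) for t in unrefine_k_to_c(refined)]
```

An irreversible refinement, such as removing outliers, would by contrast require the earlier record (or the removed data) to be kept if full provenance is wanted.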
Identifying the processes to which data are subject is important not only for understanding the data related to individual research activities, but also because a number of side-effects, particularly at the data level, are associated with such processes. The side-effects identified by the authors are not only a logical consequence of data development activities, but can also be identified and demonstrated in specific instances of development. These side-effects are: information loss, information gain, function loss, function gain and state loss (Howard et al. 2010).
By understanding and recording the processes through which instances of data are developed, appropriate management interventions can be made to minimize the impact that these side-effects might otherwise have on data re-use and re-purposing.
The above modelling technique was developed iteratively in the course of its application to the six ERIM case studies. The use of named flows to represent arbitrary associations, for example, came about in response to finding important inter-relationships between records that could not be expressed succinctly in process terms. In addition to the visual grammar already described, the authors found it helpful to group records and activities into partitions representing avenues and stages of research within the research activity. For this, the authors adopted UML swimlane notation.
Figure 18 shows a RAID diagram mapping out a subset of the data records produced in the course of doctoral research into cryogenic machining. Swimlanes have been used to separate out two branches of the research, the first concerning the temperature profiles of a material (Set 1) and the second the material's cutting profiles (Set 2), with the latter further divided into the theoretical stage and the experimental stage.
Prior to the case study being performed, this research activity had only one data record – the thesis reporting the results of the research – and no intentional context data objects. It would have been impossible for any future researcher to validate the research findings, let alone build directly on the work. Performing RAID modelling retrospectively on the research improved matters in at least two regards. First, in considering the value of each of the remaining data objects to the data case, the researcher identified a large number of objects that ought to be treated as data records, that is, preserved as evidence of the research. Second, the modelling process yielded an intentional context data record – the RAID diagram itself – that would clarify for any future user of the data how each of the depicted records contributed to the research conclusions. The modelling technique also highlighted where gaps exist in the data case, either because information was not recorded at the time (as with some parts of the experimental procedure in this case) or because it was overwritten or deleted in the course of processing. Had the researcher been consciously building a data case from the beginning, these deficiencies might well have been avoided.
The authors estimate the cost of applying the RAID modelling technique retrospectively to this case study, to this limited extent, to be of the order of £800. A fuller analysis would have taken much longer and therefore cost more. Ideally, researchers who adopt the technique when planning, carrying out and archiving their research should spend the same amount of time on it overall, or less. The authors acknowledge that this would be difficult if not impossible to achieve while expecting researchers to perform the modelling manually. They have therefore started to explore the feasibility of a software tool with the ability to collect associative and other contextual data semi-automatically by monitoring the computer-based activity of a researcher (Darlington 2011b).
The RAID modelling technique alone does not make a data management strategy. Indeed, deriving maximum benefit from the technique implies a sophisticated research environment that can collect associational information largely automatically, establish bidirectional links between RAID diagrams and the records they depict, update these links when transferring records from working directories to archival packages, and so on. Moving towards that state implies some discipline with regard to file name and directory structure conventions, version control and in-file metadata. There are also many data management issues that the technique does not address at all: interoperability with other data being produced within the discipline in terms of file formats, data models, terminology and methodology; data quality assurance; data security and access; and licensing. This wider set of issues was addressed by the ERIM Project through a general data management plan for all IdMRC projects (Darlington et al. 2010a) and a template for individual projects' data management plans. The plan and template were derived from a set of data management principles (Darlington et al. 2010b) in conjunction with existing guidance, most notably the specification underlying the DCC's DMP Online tool (Donnelly and Jones 2010). The template calls for each project to keep a data record manifest, and the place of RAID modelling within the plan is as the preferred method of completing this manifest.
The motivation behind the project data record manifest is both to help researchers keep track of their own data, and to help future researchers understand the data, assess their suitability and re-use them for new research. It also satisfies the principle that data management plans should provide a specification of contextualizing methods in order to support re-purposing. In particular, the relationships recorded in the manifest between data records satisfy some users' requirements for provenance information.
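By way of illustration, a manifest entry of this kind might be represented as follows. The field names and the traversal function are this sketch's own invention, not taken from the ERIM template; the point is that once inter-record relationships are captured explicitly, provenance chains can be traced mechanically:

```python
# A minimal data record manifest; field names are illustrative only.
manifest = [
    {
        "record": "cutting-profile-expt.csv",
        "description": "Experimental cutting profiles (Set 2)",
        "derived_from": ["machining-run-log.txt"],
        "associations": {"validates": "cutting-model.mdl"},
        "preserve": True,
    },
    {
        "record": "machining-run-log.txt",
        "description": "Raw instrument log from machining runs",
        "derived_from": [],
        "associations": {},
        "preserve": True,
    },
]

def provenance_chain(manifest, record):
    """Trace a record back through its `derived_from` links,
    returning the record followed by all its ancestors."""
    by_name = {entry["record"]: entry for entry in manifest}
    chain, frontier = [], [record]
    while frontier:
        name = frontier.pop()
        chain.append(name)
        frontier.extend(by_name.get(name, {}).get("derived_from", []))
    return chain
```

For example, `provenance_chain(manifest, "cutting-profile-expt.csv")` walks back to the raw instrument log, which is precisely the kind of question a future re-user with provenance requirements would ask of the manifest.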
The RAID modelling technique provides a method for visualizing in context the data records produced in the course of a research activity, emphasizing the development processes that lead from one data record to another. It aims to produce easily understandable diagrams, using UML activity diagrams as a base, for inclusion in data management plans. Such diagrams simplify the task of tracing the provenance of data in the final result, and of prioritizing data records for preservation.
While the technique has been developed and validated using six case studies of real-world engineering research, it has only been used retrospectively so far. Further work is needed to confirm the benefits of using the technique in the course of research, for the primary researcher as well as potential re-users. Furthermore, it is recognized that manually modelling a data case using this method may be burdensome for researchers, especially where the research involves a large number of small data records, and that the technique will need to be at least partially automated if it is to gain acceptance. To this end, a set of use cases and functional requirements for a RAID associative tool were drawn up. Additional funding was sought and secured from the JISC for a continuation project to develop this tool, while also expanding the IdMRC data management regime to the entire Department of Mechanical Engineering. This project is entitled REDm-MED (Research Data Management for Mechanical Engineering Departments) and runs for six months from November 2011.