REDDNET and Digital Preservation in the Open Cloud:
Research at Texas Tech University Libraries on Long-Term Archival Storage

James Brewer
Texas Tech University Library
jim.brewer@ttu.edu

Tracy Popp
Texas Tech University Library
tracy.popp@ttu.edu

Joy Perrin
Texas Tech University Library
joy.perrin@ttu.edu

Abstract

In the realm of digital data, vendor-supplied cloud systems still leave the user with responsibility for curating digital data. Some of the very tasks users thought they were delegating to the cloud vendor may turn out to be the users' own requirements after all; for example, cloud vendors most often require that users maintain archival copies. Beyond the better-known vendor cloud model, we examine curation in two other models: inhouse clouds, and what we call "open" clouds, which are neither inhouse clouds nor vendor-supported clouds. In open clouds, users come aboard as participants or partners, for example by being invited to take part in development or in hosting hardware. In open cloud systems users can develop their own software and data management, control access, and purchase their own hardware while running securely in the cloud environment. Doing so still requires working within the rules of the cloud system, but in some open cloud systems those restrictions and limitations can be worked around easily with surprisingly little loss of freedom. It is in this context that REDDnet (Research and Education Data Depot network) is presented as the place where the Texas Tech University (TTU) Libraries have been conducting research on long-term digital archival storage. The REDDnet network by year's end will be at 1.2 petabytes (PB), with an additional 1.4 PB for a related project (Compact Muon Solenoid Heavy Ion [CMS-HI]); additionally there are over 200 TB of tape storage. These numbers exclude any disk space which TTU will be purchasing during the year. National Science Foundation (NSF) funding covering REDDnet and CMS-HI was in excess of $850,000, with $850,000 earmarked toward REDDnet. In the terminology we used above, REDDnet is an open cloud system that invited TTU Libraries to participate, which means that we run software that fits the REDDnet structure. We are beginning the final design of our system and moving into the first stages of construction, and we have decided to purchase one-half PB of disk storage in the initial phase. The concerns, deliberations and testing are presented here along with our initial approach.

1. Introduction

This paper is divided into four sections. First, we compare vendor cloud solutions to inhouse and open cloud solutions along specific points. Next, we review the background and historical uses of REDDnet and look at what we believe is an exciting expansion of REDDnet beyond its initial design as a network primarily for physicists. In the third section we report on our testing with REDDnet and the tools available for it. We end by discussing the strategies and goals we developed for our continued involvement in REDDnet.

We also touch in a general way upon cost issues when using clouds, and point to sources for more specific evaluation methods. In the coming year, REDDnet will have a number of changes intended to assist the physics community. Those changes will be available to us, becoming a benefit to our library community. It is anticipated that a future article will address our progress in our digital archival preservation efforts.

We do not address Digital Access Management Systems, techniques of digitization, or metadata creation. We focus entirely on the archival preservation of digital data in the cloud.

2. Key responsibilities for cloud systems: vendor, inhouse and open clouds

At least for the near future, current technology for digital curation leaves the owner of data in a role that is never finished. Information professionals such as Reagan Moore have described the dilemma of this task as "communication with the future" (2008), referring to those inhabitants thousands of years from now who will see us as part of their ancient history. Numerous perils surround this communications channel to the future. We foresee the forces of nature (weather damage and flooding, seismic events), criminal acts (terrorism and vandalism, viruses), and our everyday media, which are themselves unstable and can wear out or change spontaneously. Beyond these difficult concerns are complex issues such as achieving cost savings by switching to new storage devices and the conversion tasks that follow. Over time we generally add to our existing data, so the collections never grow smaller.

Security of library data also takes on a new aspect as concerns increase. Electronic theses and dissertations, which can contain patent information or information restricted to a sponsoring organization, are now the object of exploits aimed at getting hold of the data.

New file formats are also spreading quickly: note in particular the eReader marketplace with its abundance of new formats. On top of everything else, digital objects can be controlled by Digital Rights Management (DRM) software, which means that ownership of the files is not necessarily the same as access. For the most part, cloud solutions do not address these threats and issues. There are many more issues, some often exposed and some never seen. Figure 1 below shows a number of these types of issues and summarizes the issues that prompted us to look at open clouds.

Comparison of vendor cloud systems to inhouse and open cloud systems.

| Issue | Vendor | Inhouse | Open | Comment |
| --- | --- | --- | --- | --- |
| Responsibility for data loss | May still be the user's responsibility | User's responsibility | User's responsibility | From Amazon's Web Services agreement: "WE AND OUR AFFILIATES OR LICENSORS WILL NOT BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR EXEMPLARY DAMAGES (INCLUDING DAMAGES FOR LOSS OF PROFITS, GOODWILL, USE, OR DATA), EVEN IF A PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES." [1] |
| Make archival copies | May often be the user's responsibility | User's responsibility | User's responsibility | Amazon recommends cloud customers make frequent archives of data. |
| Speed of repair of system damage | Not in the user's hands | User's responsibility | Not in the user's hands | |
| Data storage methods | Not in the user's hands | User's responsibility | User's responsibility | |
| Day-to-day running of the cloud | Vendor's responsibility | User's responsibility | Infrastructure hardware: cloud owner. Cloud owner's software: cloud owner. User hardware: user. User's software: user. | |
| Reports on health of the system | Vendor's responsibility | User's responsibility | User's responsibility | |
| Periodic data recovery fire drills | What does the vendor say on this issue? | User's responsibility | User's responsibility | |
| Data security | Does the vendor claim responsibility? | User's responsibility | User's responsibility | Some vendors do not claim responsibility for this. |
| Disaster recovery | Does the vendor claim responsibility? What about your archival copies? | User's responsibility | User's responsibility | |

Figure 1. Data handling issues for vendor, inhouse, or open cloud systems: issues that may be overlooked (Amazon Web Services, 2011).

Let's look at a practical example. We have seen an explosion of CODECs (coding/decoding software and hardware) which can compress or encrypt data, and our audio and video digital objects make significant use of them. An archival preservation system needs to know the existence and location of each file, its file format, and its video or audio CODECs (video files usually have both a video and an audio CODEC, and in some cases multiple CODECs in a single file). By locating these files in the preservation system and recording their formats and CODECs, it will be easier to convert them if necessary; at some point in the more distant future, a conversion will certainly come about. Users who need to access a digital object will then know what types of CODECs, if any, are required, and how to obtain them.
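To make the inventory idea concrete, the sketch below shows one way such a record might be gathered. It assumes the FFmpeg ffprobe utility is installed and on the PATH; the record layout (path, checksum, container, CODEC list) and the example file name are our own illustrations and are not part of REDDnet, L-STORE, or any of our production systems.

```python
# Sketch: gather a preservation inventory record for one media file.
# Assumes the FFmpeg "ffprobe" utility is installed and on the PATH.
# The record layout is our own illustration, not part of REDDnet/L-STORE.
import hashlib
import json
import subprocess
from pathlib import Path

def inventory_entry(path: Path) -> dict:
    """Return the file's location, checksum, container format, and CODECs."""
    probe = subprocess.run(
        ["ffprobe", "-v", "quiet", "-print_format", "json",
         "-show_format", "-show_streams", str(path)],
        capture_output=True, text=True, check=True)
    info = json.loads(probe.stdout)
    return {
        "path": str(path.resolve()),
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "container": info["format"].get("format_name"),
        "codecs": [f"{s.get('codec_type')}:{s.get('codec_name')}"
                   for s in info.get("streams", [])],
    }

if __name__ == "__main__":
    # Hypothetical example file; prints a JSON record for the inventory.
    print(json.dumps(inventory_entry(Path("lecture_video.mp4")), indent=2))
```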

Institutions change repository systems over time, in response to perceived needs that can be met by other software or to take advantage of new directions in software development. A successful change depends upon having complete access to your data in a timely way. At least for this issue, discussions with a cloud vendor need to look into the future and question the availability of your data: how would the vendor of a cloud system return your data to you if you wished to move to a new system? By having archival preservation copies of your source electronic documents and your metadata files, you are in a much stronger position to undergo change. This consideration became the most important one for us as we examined cloud options.
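As an illustration of what we mean by keeping documents and their metadata together, the sketch below stages a document next to its metadata file with a simple checksum manifest, loosely in the spirit of the BagIt convention. The file names, directory layout, and manifest format are hypothetical and are not the export format of DSpace, CONTENTdm, or L-STORE.

```python
# Sketch: stage a source document and its metadata file side by side,
# with a checksum manifest, loosely in the spirit of BagIt. The layout
# is our own illustration, not a DSpace/CONTENTdm/L-STORE export format.
import hashlib
import shutil
from pathlib import Path

def stage_object(document: Path, metadata: Path, archive_root: Path) -> Path:
    """Copy document + metadata into one folder and write a manifest."""
    target = archive_root / document.stem
    target.mkdir(parents=True, exist_ok=True)
    copies = [Path(shutil.copy2(document, target)),
              Path(shutil.copy2(metadata, target))]
    with (target / "manifest-sha256.txt").open("w") as manifest:
        for copy in copies:
            digest = hashlib.sha256(copy.read_bytes()).hexdigest()
            manifest.write(f"{digest}  {copy.name}\n")
    return target

# Hypothetical usage:
# stage_object(Path("etd-0042.pdf"), Path("etd-0042.xml"), Path("staging"))
```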

3. Focus on REDDnet: history, structure, and choosing this architecture

At Texas Tech University Libraries we have been using DSpace and CONTENTdm for a number of years, and our modeling for an archival system was challenged by our experience with these two products. Separation of documents from their metadata has been an issue for us as we began to look at alternate strategies for working with digital files. Long-term maintenance and flexibility with our data are hampered when the data are represented in system-specific formats which require the whole system to be in production in order to unload an ingested document. If the actual document is ingested and stored separately from its metadata, it becomes difficult to reunite the two, as might be needed in a system reload or migration. We see that as a growing limitation of our two current production systems for digital content management.

From our perspective, the best systems offer a modular design in which parts can be put together and switched out with ease. Tasks such as ingest, display, edit/maintenance, and backup are entirely separate and independent modules, allowing any building block to be replaced. In our planning we focused in particular on archiving, which should not depend on another program's design, a vendor's method, or traditional institutional practice.

Factors for success in integrating into an open cloud system can range over a wide set of issues, but having experienced practitioners locally on campus will be a significant advantage. We were fortunate to be able to cooperate with our High Performance Computing Center (HPCC) in broad, general discussions about REDDnet before we started. We were introduced to REDDnet's principal architects over the phone and had a chance to explore topics with them. We had a technical contact addressing our questions and dealing with concerns whenever we had them. This face-to-face mode continued all the way through our setup of the REDDnet depots (servers) and gave us an opportunity to work directly with the hardware in our hands. Our environment supported discussion of "what if" questions and working through the answers, so that our designs could be weighed before we began. HPCC had worked with other cloud systems before REDDnet and therefore was able to give us a much broader view. Whether you are working with a single-institution academic, multi-institution academic, or commercial system, prior experience and support for design and concept can speed your progress.

The TTU Libraries' relationship with HPCC allowed us access to two 20 TB REDDnet depots (40 TB total). Eventually, as we move to construction mode, we plan to purchase one-half petabyte, consisting of five depots of 106 TB each. In the initial phase, however, space is not an immediate consideration for our investigations, since 40 TB will carry us far. When we first contacted REDDnet, we saw a usage pattern unexpected in comparison to our library practice: most data was temporary in nature, to be discarded after analysis. As the REDDnet team described the network's past use by physicists:

“[REDDnet’s]... mission ...[was] to provide “working storage” to help manage the logistics of moving and staging large amounts of data in the wide area network, e.g. among collaborating researchers who are either trying to move data from one collaborator (person or institution) to another or who want share large data sets for limited periods of time (ranging from a few hours to a few months) while they work on it. REDDnet is not designed or intended to be a replacement for reliable archival or long term personal storage. Although the REDDnet software stack does support reliable long-term archival storage to both disk and tape.” (Tackett, 2011)

In other words, REDDnet was not providing traditional data center backups for its users; it was their responsibility to archive their own data. When TTU Libraries was invited to participate by HPCC and the Vanderbilt REDDnet team (J. Brewer, personal communication, October 2010), the thinking on REDDnet had already changed. Other library users and other projects were also invited to consider using REDDnet for permanent digital storage. This reverses, in part, the original goal for REDDnet, which was to provide short-term parking (e.g., a few months) for large, temporary data sets originating from collider data. The network's availability originally allowed joint participation by a number of researchers, primarily working in physics, to analyze large data sets; when the analysis was completed, the data was discarded to make way for a new dataset, a policy which has since been revised.

TTU goals for the archival system we are building are the following:

To gain a foothold in working with REDDnet, we first ran two types of file transfer tests:

  1. Perform a large volume of data transfers (20 TB) using larger files.
  2. Perform a high number of smaller file transfers (12,000 files) in multiple subdirectories of varying sizes with deep nesting levels (a sketch for generating such a test tree follows below).
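For readers who want to reproduce the flavor of the second test, the sketch below generates a deeply nested tree of many small files of varying sizes. The counts, depths, and file sizes are arbitrary placeholders, not the exact parameters of our tests.

```python
# Sketch: build a deeply nested tree of many small files to exercise
# many-small-file transfers. Counts, depths, and sizes are arbitrary
# placeholders, not the exact parameters of our REDDnet tests.
import os
import random
from pathlib import Path

def build_test_tree(root: Path, n_files: int = 12_000,
                    max_depth: int = 8, max_kib: int = 64) -> None:
    random.seed(42)  # reproducible layout for repeated test runs
    for i in range(n_files):
        depth = random.randint(1, max_depth)
        parts = [f"dir{random.randint(0, 9)}" for _ in range(depth)]
        subdir = root.joinpath(*parts)
        subdir.mkdir(parents=True, exist_ok=True)
        size = random.randint(1, max_kib) * 1024  # 1 KiB to 64 KiB
        (subdir / f"file_{i:05d}.bin").write_bytes(os.urandom(size))

if __name__ == "__main__":
    build_test_tree(Path("smallfile_test_tree"))
```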

Our primary tool was L-STORE (Lstore, 2010) (Logistical Storage), which runs as a Java-based Linux client. For testing purposes a basic knowledge of Linux works well. We also installed the L-STORE Windows web client, which is a quick method of accessing REDDnet to provide a graphical view of the network. Figure 2 below shows the Linux console program. Figure 3 follows and shows the Windows web interface.

 

Figure 2. L-STORE console showing L-STORE commands (Lstore, 2010).

REDDnet depots by nature are not visible on the Internet; particular depots are known only to specific L-STORE servers. It is through L-STORE that depot data can be accessed, and this design provides a high degree of security at the outset. This is the only aspect of REDDnet security we will address in this paper, since the topic is substantial enough to stand on its own.

Figure 3. Windows REDDnet/L-STORE web interface (Lstore, 2010).

 

One feature of the design of REDDnet is multiserver striping of data across depots (i.e., a portion of the data from a single file sits on multiple servers; the file gets written faster because slices of it go to multiple servers at the same time). Just as this feature can provide great efficiency on a local system, it can improve throughput for large data sets on REDDnet. A 2009 presentation, "REDDnet for Emergency Response Data Distribution," presented some of the main features of multiserver striping (Moore, 2011a):

This system has allowed transfer rates of 3.3 GBytes/sec (Lstore, 2010), with higher rates projected for the future. For a more visual sense of REDDnet's structure, see Figure 4, which shows the REDDnet Americas map and European nodes.
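To show the striping idea in miniature, the sketch below splits a file into fixed-size slices and writes them to several storage targets in parallel. The target paths and slice size are hypothetical, and in REDDnet the striping is handled by the L-STORE software stack, so this is only an illustration of the concept, not of how the network is actually driven.

```python
# Sketch of the striping concept: split a file into fixed-size slices
# and write them to several storage targets in parallel. The targets
# and slice size are hypothetical; real REDDnet striping is handled by
# the L-STORE software stack, not by client code like this.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SLICE_SIZE = 8 * 1024 * 1024  # 8 MiB per slice (arbitrary choice)

def stripe_file(source: Path, targets: list) -> None:
    """Round-robin the file's slices across the target directories."""
    def write_slice(index: int, data: bytes) -> None:
        depot = targets[index % len(targets)]
        depot.mkdir(parents=True, exist_ok=True)
        (depot / f"{source.name}.slice{index:04d}").write_bytes(data)

    with source.open("rb") as f, ThreadPoolExecutor(len(targets)) as pool:
        index = 0
        while chunk := f.read(SLICE_SIZE):
            pool.submit(write_slice, index, chunk)
            index += 1

# Hypothetical mount points standing in for depots:
# stripe_file(Path("dataset.tar"),
#             [Path("/mnt/depot1"), Path("/mnt/depot2"), Path("/mnt/depot3")])
```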

Figure 4. REDDnet Americas map with an additional section showing CERN in Europe (Moore, 2011a).

4. Weighing REDDnet for suitability in library digital archiving

As an industry, Information Technology (IT) is not always successful in disaster recovery (DR). In Symantec's October 2010 DR study, which was based on interviews with IT decision makers at 1,700 large enterprises, the failure rate on recovery tests was 30 percent (Fegreus, 2011). According to this study, reasons for the failure have to do with untried or untested data recovery and systems "too complex" for reliable data restoration. If nothing else, this information suggests that actual checks for successful data backup and file restoration must be undertaken at regular intervals. Does the cloud vendor guarantee doing that? Or on your inhouse system, do you take care of that? Development of disaster recovery methods in some systems takes a back seat to other demands.

With these goals in mind, we began investigating strategies to provide robust archival backup, including multiple copies, error checking and comparison, and reporting tools that can verify file status at any time. Since we will include many born-digital objects whose persistence over time requires archival backup, our data needs are clearly different from those of REDDnet's main users. Because "backup" is not always a clear and distinct term and may imply a variety of different strategies depending on the environment and purpose, it is important to address the varying types of backup.
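The sketch below illustrates the kind of error checking and reporting we have in mind: current checksums of archived copies are compared against a stored manifest, and any missing or altered file is reported. The manifest format and the paths are hypothetical placeholders.

```python
# Sketch: verify archived copies against a stored checksum manifest and
# report any missing or altered files. The manifest format (one
# "<sha256>  <relative path>" per line) and the paths are hypothetical.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            digest.update(block)
    return digest.hexdigest()

def verify_archive(archive_root: Path, manifest: Path) -> list:
    """Return a report of problems; an empty list means all files verified."""
    problems = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(None, 1)
        target = archive_root / rel_path
        if not target.exists():
            problems.append(f"MISSING  {rel_path}")
        elif sha256_of(target) != expected:
            problems.append(f"ALTERED  {rel_path}")
    return problems

# Hypothetical usage:
# for issue in verify_archive(Path("/archive"), Path("/archive/manifest-sha256.txt")):
#     print(issue)
```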

We note the following backup strategies, which can be used together or alone.

Some aspects of Archival Backup are:

Our approach can be seen, in effect, as two systems: one providing file management system backup, and a related system that stores and protects the original document files and their metadata files. By implementing five REDDnet depots on the network, we will achieve high redundancy of data, and the impact increases because we expect these depots to reside in separate locations; the geographical dispersal of REDDnet allows that as a choice.
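As a rough illustration of the redundancy goal, the sketch below places copies of one archival object on several depot locations and logs where each copy landed. The five mount-point paths stand in for geographically separate depots; in practice L-STORE manages replication, so this is only a sketch of the intent, not our production mechanism.

```python
# Sketch: place copies of one archival object on several depot locations
# and log where each copy landed. The mount points stand in for
# geographically separate depots; in practice L-STORE manages the
# replication, so this only illustrates the redundancy goal.
import shutil
from datetime import datetime, timezone
from pathlib import Path

DEPOTS = [Path(f"/mnt/depot{i}") for i in range(1, 6)]  # five placeholder depots

def replicate(obj: Path, depots=DEPOTS, log: Path = Path("placement.log")) -> None:
    with log.open("a") as out:
        for depot in depots:
            dest_dir = depot / "archive"
            dest_dir.mkdir(parents=True, exist_ok=True)
            dest = shutil.copy2(obj, dest_dir / obj.name)
            stamp = datetime.now(timezone.utc).isoformat()
            out.write(f"{stamp}\t{obj.name}\t{dest}\n")

# Hypothetical usage: replicate(Path("etd-0042.pdf"))
```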

A strong attraction of REDDnet is its relatively long history in the cloud world. REDDnet was launched in 2006 (ACCRE, 2011a), and from that time forward it has been expanding and attracting new participants.

"The Research and Education Data Depot network (REDDnet) team at Vanderbilt has been selected as a 2010 Internet2 IDEA award winner. REDDnet was selected based on its innovative and important solution, including the Data Logistics Toolkit, for large distributed storage facilities for data intensive collaboration among the nation's researchers and educators in a wide variety of application areas. REDDnet is an NSF-funded infrastructure project that provides a large distributed storage facility for data-intensive collaboration among the nation's researchers and educators in a wide variety of application areas including Vanderbilt's involvement in the LSST telescope project." (Stassun, 2011).

For a more complete overview, see "A Strategy for Campus Bridging for Data Logistics" (Moore, 2011b) and "REDDnet: Enabling Data Intensive Science in the Wide Area" (REDDnet, 2009).

5. Using REDDnet: tools, methods, distant objectives

The REDDnet system is looking to deploy more than 1.2 PB of distributed storage and 200 terabytes of tape (Lstore, 2007) this year, plus an additional 1.4 PB of storage used in a closely related project. Organizations such as the already-mentioned CMS-HI, CERN (European Laboratory for Particle Physics), the Large Synoptic Survey Telescope (LSST), and Oak Ridge National Laboratory (ORNL) are playing a role in the use and development of REDDnet capabilities. Vanderbilt University is supplying significant project direction with NSF funding and funding from the Vanderbilt Center for the Americas. Principal collaborators are Vanderbilt University, University of Tennessee, Stephen F. Austin State University, Nevoa Networks, North Carolina State University, University of Delaware, Universidade de São Paulo, Universidade do Estado do Rio de Janeiro, University of Michigan, University of Florida, Fermilab, Caltech, and AMPATH (Pathway of the Americas) (ACCRE, 2011b). This is not a complete list of participants or collaborators.

ACCRE (Advanced Computing Center for Education & Research) at Vanderbilt University has developed L-STORE (Logistical Storage) (ACCRE, 2011c), which is used as a client for accessing REDDnet depots. From the L-STORE wiki:

"L-Store provides a flexible logistical storage framework for distributed, scalable, and secure access to data for a wide spectrum of users. L-Store is planned to be used on the REDDnet infrastructure. It is designed to provide: virtually unlimited scalability in both raw storage and associated file system metadata; a decentralized management system; security; fault tolerant metadata support; user controlled replication and striping of data on a file and directory level; scalable performance in both raw data movement and metadata queries; a virtual file system interface in both a web and command line form; and support for the concept of geographical locations for data migration to facilitate quicker access (ACCRE, 2011c)."

L-STORE user commands are available from a Linux console, or users can work through a Windows web interface; a Java library of commands is also available. Readers should note that this software is still in the development phase. Currently it provides functions such as:

As noted earlier, a web interface is available that provides access similar to these commands.

Goals for our development include all of the following areas:

By providing our own archival backup services, we avoid a number of risk situations by tackling them ourselves. The issues we avoid are:

We instead inherit the following responsibilities as a result:

In our discussions with campus HPCC we spoke about a number of concerns that come about when working with a cloud vendor. These are:

HPCC recommends the Three E's:

Many users operate under the assumption that cloud computing will be more cost effective in the long run. We will need to uncover solid evidence for our particular case, especially because new studies and research on this topic continue to appear as it matures. At the recent USENIX HotCloud 2011 Workshop on Hot Topics in Cloud Computing (Jackson, 2011), Byung Chul Tak et al. presented a paper (Chul Tak et al. 2011) which looked at costs for customers using Amazon EC2 and Microsoft Azure. One of the findings showed that

"For small workloads, the servers procured for in-house provisioning end up having significantly more capacity than needed (and they remain under-utilized) since they are the lowest granularity servers available in market today. On the other hand, cloud can offer instances matching the small workload needs (due to the statistical multiplexing and virtualization it employs). For medium workload intensity, cloud-based options are cost-effective only if the application needs to be supported for 2-3 years [emphasis added], and become expensive for longer lasting scenarios [emphasis added]. These workload intensities are able to utilize well provisioned servers making in-house procurement cost-effective." (Chul Tak et al. 2011)

Further, according to Byung Chul Tak et al.,

"Even if we assume the performance/$ offered by the cloud improves with time (say, an instance of given capacity becomes cheaper over time), cloud-based provisioning still remains expensive in the long run since data capacity and transfer costs contribute to the costs more significantly than in-house." [emphasis added] (Chul Tak et al. 2011)

Our interpretation of these points is that a cost analysis needs to be done to support a move to the cloud. One final point in the article is that "using the cloud need not preclude a continued use of in-house infrastructure. The most cost-effective approach for an organization might, in fact, involve a combination of cloud and in-house resources rather than choosing one over the other" (Chul Tak et al. 2011). Applying these ideas to our scenario, the "application" refers to the archival preservation system.
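To show the shape of such a cost analysis, the toy comparison below contrasts cloud storage and transfer charges with amortized in-house hardware and upkeep over several time horizons, in the spirit of the Tak et al. analysis. Every number in it (prices, hardware cost, upkeep, workload) is a hypothetical placeholder to be replaced with real quotes; only the structure of the calculation is the point.

```python
# Toy cost comparison in the spirit of the Tak et al. analysis.
# Every figure below is a hypothetical placeholder (prices, hardware,
# upkeep, workload); only the shape of the calculation matters.

def cloud_cost(tb_stored, tb_transferred_per_month, months,
               storage_per_gb_month=0.10, transfer_per_gb=0.09):
    gb_stored = tb_stored * 1024
    gb_moved = tb_transferred_per_month * 1024
    return months * (gb_stored * storage_per_gb_month + gb_moved * transfer_per_gb)

def inhouse_cost(months, hardware=60_000, yearly_upkeep=12_000):
    # Hardware bought up front, plus staff/power/maintenance per year.
    return hardware + yearly_upkeep * (months / 12)

if __name__ == "__main__":
    # 500 TB roughly matches the half petabyte discussed in this paper.
    for years in (1, 3, 5, 10):
        months = years * 12
        print(f"{years:2d} yr   cloud ~ ${cloud_cost(500, 5, months):>12,.0f}   "
              f"in-house ~ ${inhouse_cost(months):>12,.0f}")
```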

Standards for digital preservation and cloud standards need to emerge as well. Because in so many cloud scenarios we are dealing with a vendor, and a vendor's continued presence cannot be guaranteed, this area takes on a significance beyond the normal details of preservation work. Approximately 75% of last year's cloud vendors are out of business (J. Brewer, personal communication with Dr. Alan Sill of the HPCC at Texas Tech University, March 2011). Libraries have always had concerns about major vendors, such as the vendors of online catalog software: without a backup elsewhere to capture the work libraries have put into maintaining their MARC data, there is real cause for worry. Are the same conditions and details being taken into account when libraries approach cloud vendors? This may be one area where inhouse solutions show their strength over commercial and semi-commercial efforts.

In presenting this paper, our goal has been to review the cloud factors that are present whether you are working with a cloud vendor, an inhouse system, or an open cloud. In most cases some of the same work needs to be done: responsibilities for archiving exist on both sides, and costs need to be weighed based on the mix of tasks you are implementing. We at TTU firmly believe we will gain sufficient flexibility from our decisions to justify proceeding partly on our own with a mix of services from an open cloud.

6. Bibliography

REDDnet Related:

Cloud Computing Related:

Digital Preservation Related:

 

7. Acknowledgements

We would like to thank Donald Dyal, Paul Sheldon, Alan Sill, Alan Tackett, and Robert Sweet for their help and insights by sharing ideas and being part of a long series of discussions.

8. References

9. Notes

  1. http://aws.amazon.com/agreement/, 11. Limitations and Liability. This section reads:

    "WE AND OUR AFFILIATES OR LICENSORS WILL NOT BE LIABLE TO YOU FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, CONSEQUENTIAL OR EXEMPLARY DAMAGES (INCLUDING DAMAGES FOR LOSS OF PROFITS, GOODWILL, USE, OR DATA), EVEN IF A PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. FURTHER, NEITHER WE NOR ANY OF OUR AFFILIATES OR LICENSORS WILL BE RESPONSIBLE FOR ANY COMPENSATION, REIMBURSEMENT, OR DAMAGES ARISING IN CONNECTION WITH: (A) YOUR INABILITY TO USE THE SERVICES, INCLUDING AS A RESULT OF ANY (I) TERMINATION OR SUSPENSION OF THIS AGREEMENT OR YOUR USE OF OR ACCESS TO THE SERVICE OFFERINGS, (II) OUR DISCONTINUATION OF ANY OR ALL OF THE SERVICE OFFERINGS, OR, (III) WITHOUT LIMITING ANY OBLIGATIONS UNDER THE SLAS, ANY UNANTICIPATED OR UNSCHEDULED DOWNTIME OF ALL OR A PORTION OF THE SERVICES FOR ANY REASON, INCLUDING AS A RESULT OF POWER OUTAGES, SYSTEM FAILURES OR OTHER INTERRUPTIONS; (B) THE COST OF PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; (c) ANY INVESTMENTS, EXPENDITURES, OR COMMITMENTS BY YOU IN CONNECTION WITH THIS AGREEMENT OR YOUR USE OF OR ACCESS TO THE SERVICE OFFERINGS; OR (D) ANY UNAUTHORIZED ACCESS TO, ALTERATION OF, OR THE DELETION, DESTRUCTION, DAMAGE, LOSS OR FAILURE TO STORE ANY OF YOUR CONTENT OR OTHER DATA. IN ANY CASE, OUR AND OUR AFFILIATES' AND LICENSORS' AGGREGATE LIABILITY UNDER THIS AGREEMENT WILL BE LIMITED TO THE AMOUNT YOU ACTUALLY PAY US UNDER THIS AGREEMENT FOR THE SERVICE THAT GAVE RISE TO THE CLAIM DURING THE 12 MONTHS PRECEDING THE CLAIM."

  2. Tape data can fail after it is created, and so can hard disk data. The advantage of the hard disk is that it can be seen online and tested constantly or periodically in a relatively inexpensive environment. Tape systems that would be equally flexible would require robotic arms, which would exceed our budget limits. Moving to campus Central IT would further reduce our flexibility. Current disk pricing trends favor our method.
  3. Hot backups present other types of problems, since there is no other file to compare them to when verifying checksums; they are generated on the fly as a temporary file to allow backup to run while system users carry out transactions. Hot backups need to be processed against transaction logs to bring a system current.