1. Introduction
The Texas Digital Library (TDL) is a consortium of higher education institutions in Texas that provides shared services in support of research and teaching. The TDL began in 2005 as a partnership between four of the state's largest ARL universities: Texas A&M University, Texas Tech University, the University of Houston, and the University of Texas at Austin. Currently, the consortium has 15 members, representing large and small institutions from every region of the state.
The goal of the TDL is to use a shared services model to provide cost-effective, collaborative solutions to the challenges of digital storage, publication, and preservation of research, scholarship, and teaching materials. Among the services the TDL provides its members are: hosted digital repositories; hosted scholarly publishing tools; a learning objects repository; a "Preservation Network" to secure multiple copies of digital items at geographically distributed nodes; training, technical support, and opportunities for professional interaction; and, electronic thesis and dissertation (ETD) management software and infrastructure (Vireo).
In this paper, we start by describing the services TDL offers and why (and how) we wanted to use cloud-based storage and compute services. We then consider the evolution of our cloud services, from the initial deployment up to the present. We consider some of the conclusions we have drawn about how useful the cloud has been for us, and describe some of the ongoing and future work in this area.
2. Background
In this section, we provide more details about the services TDL provides, why we wanted to use the cloud, and our initial strategy for getting there.
2.1 TDL Services
TDL provides several services to its members. Some of these are called institutional services — TDL runs customized versions of these services for each institutional member. The institutional services provided by TDL are hosted institutional repositories (DSpace, see (Duraspace 2011a)) and ETD management systems (Vireo, see (Nürnberg et al. 2010)).
Other services are called faculty communication services — TDL runs a fixed number of these to be shared among all members. The faculty communication services provided by TDL are: Open Conference Systems (PKP 2011a), Open Journal Systems (PKP 2011b), WordPress (2011) for blogs and websites, MediaWiki (2011) for wikis, the Faculty Directory, and a staging area of the TDL Preservation Network (TDL 2011a).
Finally, TDL hosts a small number of other services for project partners, including the Texas Learning Object Repository (TDL 2011b), as well as a number of special purpose DSpace repositories.
All of the aforementioned services had associated production and "labs" or sandbox instances. Together with supporting software and services, this constituted 77 service instances.
2.2 Initial architecture
Before the move of services to the cloud, the TDL ran 15 Sun compute servers (2 Sunfire 240s, 2 Sunfire 245s, 3 Sunfire 490s, 2 X 4150s, 2 T5120s and 4 T5220s); 6 storage servers (3 NetApp filer 3020s, 2 Sun StorageTek 6140s, 1 Sun SL500); and 8 pieces of networking support hardware (2 Cisco Catalyst 3750s, 2 Citrix Netscalers, 2 Brocade Silkworm 300Es and 2 Avocent Cyclades ACSs).
Each service was run within a separate "zone" on Sun servers running the Solaris operating system. A Solaris zone is a type of virtual machine (Oracle 2011).
2.3 Motivations For Using the Cloud
TDL hardware was physically located at the University of Texas at Austin. Recently, a new data center was opened on campus. The old data center in which TDL equipment was located was to be decommissioned. To prepare for the move of our hardware to the new data center, we designed and implemented a disaster recovery plan (DRP) that would allow us to continue to provide our services if the equipment move were unsuccessful.
The primary reason for our initial work using cloud services was as a DRP for this data center move. The centerpiece of this DRP was the duplication of our services in the Amazon EC2 cloud (Amazon 2011b). This is an extension of the idea of having an off-site backup for data: in the cloud duplication case, we also have an off-site backup of the services themselves.
There were, however, two secondary motivations for the cloud duplication effort. Firstly, the elastic nature of cloud services (both compute and storage) would allow us to pay only for what we actually use, as opposed to using our own hardware, which incurs cost regardless of utilization. Elasticity was seen as potentially very useful. The compute and storage demands of our members are very "bursty". For example, members may want to upload tens or hundreds of gigabytes of data all at once into their TDL-hosted repository. Another example of this "burstiness" might be the load on the Vireo ETD management system, which has expected peak utilization shortly before deadlines for thesis and dissertation submission. Buying hardware to accommodate these bursts is problematic. Provisioning large amounts of storage via hardware takes significant time (to navigate budgetary processes of our host institution); provisioning significant compute resources for short periods via hardware leaves these resources underutilized for long periods of time. Cloud storage and compute resources were seen as potential answers to these problems.
Secondly, we envisioned a potential personnel savings. Owning hardware requires one to staff competencies (e.g., network architecture expertise) that we could effectively outsource by using cloud services. Even if staffing these competencies was still necessary, the resources committed to those skills could potentially be reduced.
2.4 Strategy
We chose to duplicate customer-facing services first. (The TDL runs a number of internal services to support, for example, software development across our distributed programming team.) The services to be moved were grouped into categories based on their dependencies on third-party software (e.g., LAMP stack, DSpace/Tomcat stack, etc.). For each category of software, a service was duplicated by hand. The experience of doing this allowed our team to write a set of utilities to duplicate other instances of the same category to the cloud. We required that all duplicated services be backed by scripts to do the duplication so that any procedures we used would be reproducible.
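The grouping step described above can be sketched roughly as follows. This is a minimal illustration, not the utilities we actually wrote: the inventory, the service-to-stack assignments, and the function names are hypothetical, and only the two stack categories named in the text are used.

```python
from collections import defaultdict

# Hypothetical inventory: service name -> third-party stack it depends on.
# The assignments below are illustrative assumptions, not the real TDL inventory.
SERVICES = {
    "wordpress": "lamp",
    "mediawiki": "lamp",
    "ojs": "lamp",
    "dspace-member-a": "dspace-tomcat",
    "dspace-member-b": "dspace-tomcat",
}


def group_by_stack(services):
    """Group service names by the stack category they depend on."""
    groups = defaultdict(list)
    for name, stack in sorted(services.items()):
        groups[stack].append(name)
    return dict(groups)


def plan_duplication(services):
    """For each category, pick one service to duplicate by hand;
    the remaining services in that category are left to the scripts."""
    plan = {}
    for stack, names in group_by_stack(services).items():
        plan[stack] = {"by_hand": names[0], "scripted": names[1:]}
    return plan


if __name__ == "__main__":
    for stack, steps in plan_duplication(SERVICES).items():
        print(stack, steps)
```

Choosing the hand-duplicated service alphabetically is arbitrary here; in practice one would pick the instance that is simplest or best understood.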
We also took the opportunity to simplify much of the service architecture during the duplication. That is, the duplicated service suite was not an exact clone of the service suite running on the hardware. In some cases, different services running in the same Solaris zone on the TDL hardware were duplicated in separate cloud virtual machines (VMs); in others, services running in separate zones were combined into the same EC2 VM.
3. Transition to Cloud
In this section, we consider separately the different steps in our experience using the cloud to provision digital library services, specifically: at the time of the aforementioned data center move; our initial progress immediately after the move; and, our current architecture.
3.1 DRP for the Data Center Move
As stated above, our intent was to duplicate customer-facing services to the cloud first so that we could recover from any problems we might encounter during the data center move. We succeeded in duplicating most customer-facing services; however, we were unable to duplicate all infrastructure services to the cloud. Specifically, all production instances of institutional services and most instances of faculty communication services were duplicated on EC2.
Immediately after the duplication effort, the TDL was running 38 EC2 VMs. Non-duplicated services ran on 5 of our compute servers (1 Sunfire 240, 3 Sunfire 490s, 1 T5220), 4 storage servers (3 NetApp filer 3020s, 1 Sun StorageTek 6140), and most networking support (all but the 2 Cisco Catalyst 3750s).
In fact, there were problems with the installation of the networking infrastructure to our hardware in the new data center that left many machines either partially or fully offline after the move. Invoking the DRP and switching over services to the AWS copies, however, was only partially successful. Trying to manage some services on AWS (e.g., OJS) and some on the hardware (e.g., mail) was more problematic than expected.
The key lesson we learned from this phase of our cloud experience was the need for better clean-room testing of the cloud duplicated services. Our services were not instrumented to be duplicated, which made it difficult to find and track all of the service dependencies.
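One way to surface this kind of problem before a switch-over is to record service dependencies explicitly and check which of them are missing from the cloud copy. The sketch below assumes a hand-maintained dependency map; the map contents (including the assumption that OJS depends on mail) and the function names are hypothetical and serve only to illustrate the check.

```python
# Hypothetical dependency map: service -> services it needs at run time.
DEPENDS_ON = {
    "ojs": ["mail", "database"],
    "wordpress": ["database"],
    "database": [],
    "mail": ["dns"],
    "dns": [],
}


def transitive_dependencies(service, depends_on):
    """Return every service reachable from `service` via dependency edges."""
    seen = set()
    stack = list(depends_on.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def missing_in_cloud(duplicated, depends_on):
    """Map each duplicated service to the dependencies that were not duplicated."""
    duplicated = set(duplicated)
    report = {}
    for service in sorted(duplicated):
        missing = transitive_dependencies(service, depends_on) - duplicated
        if missing:
            report[service] = sorted(missing)
    return report


if __name__ == "__main__":
    print(missing_in_cloud({"ojs", "wordpress", "database"}, DEPENDS_ON))
```

An empty report would suggest that the duplicated set is self-contained with respect to the documented dependencies; it says nothing about dependencies that were never recorded.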
3.2 Reaction to the Data Center Move
After the difficulties we encountered invoking our DRP, we made two decisions:
- duplicate our entire service suite on EC2; and,
- better document service interdependencies.
Immediately after the partial failure of the DRP, we split our system administration efforts along two fronts. The first worked to resolve the difficulties encountered during the move (i.e., restore primary copies of services on our hardware). The second worked to duplicate missing services on the cloud to ensure the duplicate copies of services were fully functional. A by-product of this second effort was the improved service interdependency documentation mentioned above.
The effort to restore primary hardware copies finished about one week after the data center move finished. We were able to consolidate our primary copies onto the reduced hardware set used immediately after the move for non-duplicated services. We continued to run our 38 EC2 VMs as backups.
The key lesson we learned from this phase was the dramatically lower cost of the cloud versions of our services relative to their primary hardware copies. The elasticity of our cloud duplicates allowed us to provision exactly the required amount of storage, compared to our hardware copies, which incurred the fixed amortized cost of storage that was partially unused. Conversely, the hardware copies of our services could not respond to large requests for additional storage in a timely manner, since buying new hardware, setting it up, and provisioning it takes several months. We were able, however, to provision large amounts of Amazon S3 storage essentially instantly.
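The cost argument can be made concrete with a small calculation. All figures below are hypothetical and are not our actual costs or Amazon's prices; the same unit price is deliberately assumed for both models so that the difference comes only from unused capacity.

```python
def fixed_cost_cents(provisioned_gb, cents_per_gb_month, months):
    """Cost of owned storage: the full provisioned capacity is paid for every month."""
    return provisioned_gb * cents_per_gb_month * months


def elastic_cost_cents(used_gb_by_month, cents_per_gb_month):
    """Cost of pay-as-you-go storage: only the capacity used each month is paid for."""
    return sum(used * cents_per_gb_month for used in used_gb_by_month)


if __name__ == "__main__":
    # Hypothetical usage with one bursty month (900 GB).
    usage = [200, 250, 900, 300]
    # Owned hardware must be sized for the burst, here 1000 GB.
    print(fixed_cost_cents(1000, 10, len(usage)))   # 40000
    print(elastic_cost_cents(usage, 10))            # 16500
```

Under these assumptions the owned hardware costs more than twice as much over the four months, because 1000 GB is paid for every month while the burst lasts only one.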
3.3 Current Status
In the months since the data center move, we continued to duplicate services onto EC2. As these duplicated services proved to be as robust as our local copies (or even more so), we began to decommission our local copies of services and run exclusively on EC2. We finally decommissioned our last piece of hardware about six months after the move was completed. The full duplication of all of our services resulted in 48 EC2 VMs running our various services.
As part of our move to decommission our hardware, our development team began to use cloud provisioned servers for development. This includes using Elastic Bamboo (Atlassian 2011) for builds and Elastic Beanstalk (Amazon 2011a) for development and staging deploys.
4. Lessons Learned
In this section, we revisit the lessons we described in our original paper.
With regard to elastic, just-in-time capacity for both storage and computation, as before, we see this as a clear advantage. Our continued experience since our original paper was published has reinforced our initial perception of this as one of the most significant advantages of the cloud model.
With regard to lower personnel costs, we initially said that the jury is still out. There has been a learning curve with AWS. Additionally, running two copies of every service (on both our own hardware and AWS) has meant more short-term workload for our system administration and production teams. Our experience since then has allowed us to categorize this as a modest advantage. We are dedicating fewer resources to system administration than pre-cloud, but staff turnover has meant we are now trying to recruit system administrators with cloud experience, which is still not the norm. We expect that, long-term, this will continue to be a modest advantage.
With regard to disaster recovery (the original motivation for the cloud duplication), we originally said that the news was mixed, since the switch-over of services to the AWS copies was only partially successful. With more perspective, we see that we underestimated the size of the task of duplicating services on the cloud, since we failed to take into account complex and undocumented service interdependencies. With a more realistic assessment of this task, we believe the cloud can provide a good solution for digital library organizations in search of a DRP.
Finally, an additional lesson we have come to learn is the difficulty some organizations may have in budgeting cloud services. Specifically, shifting costs for service provision from capital to operating expenditures may seem daunting, novel or unpredictable, especially since these costs are not fixed up front. We have struggled to find a way for our members to budget (e.g., additional storage) in a more predictable way than the "pay-as-you-go" model provided by Amazon.
5. Future Work
Overall, our experience with deploying services on the web has been positive. We are continuing to examine further service offerings on AWS. For example, we now offer an AWS machine image (AMI) with a pre-installed copy of our open source Vireo ETD workflow management system, for institutions that would like to evaluate or run Vireo without having to build or install it (see the Vireo site at SourceForge). The ability to deploy clones of services easily has allowed us to experiment with bringing up new versions of services to allow more options for our members (e.g., different versions of OCS for different conferences).
DuraCloud (Duraspace 2011b) represents another avenue for future work at the TDL. DuraCloud provides an abstraction layer over concrete cloud providers. We view DuraCloud as another way to experiment with providing cloud storage both directly (as raw storage space) and indirectly (behind our services) in a more flexible way. There are significant differences, however, between the DuraCloud approach to service provision and our requirements for most of our institutional and faculty communication services. Our investigations into how to apply DuraCloud usefully in our environment continue.