Web browsers are used to access cloud storage and online applications. Is special knowledge needed on the part of end users and systems engineers to exploit them to the fullest? Scientific applications usually require significant resources; however, not all scientists have access to sufficient high-end computing systems. The emerging technologies under investigation include processing technologies such as graphical processing units (GPUs), frameworks such as MapReduce and Hadoop, and platforms such as grids and clouds. One group [3] is investigating the applicability of GPUs in astronomy by studying performance improvements for many types of applications, including input/output (I/O)- and compute-intensive applications. Two publications [7,8] detail the impact of this business model on end users of commercial and academic clouds.

We chose three workflow applications because their usage of computational resources is very different. All application executables and input files were stored in the Lustre file system. Storage cost consists of the cost to store VM images in S3 and the cost of storing input data in EBS. The fixed monthly cost of storing input data for the three applications is shown in table 5. Processing costs do not vary widely with machine, so there is no reason to choose anything other than the most powerful machines.

Table 5. Monthly storage cost for three workflows.

— Virtualization overhead on AmEC2 is generally small, but most evident for CPU-bound applications.

Table 8 shows the results of processing 210 000 Kepler time-series datasets on AmEC2 using 128 cores (16 nodes) of the c1.xlarge instance type (Runs 1 and 2) and of processing the same datasets on the NSF TeraGrid using 128 cores (8 nodes) from the Ranger cluster (Run 3). The result shows that, for relatively small computations, commercial clouds provide good performance at a reasonable cost.

Performance of periodograms on three different clouds.

Table 9 shows the locations and available resources of five clusters at four FutureGrid sites across the US in November 2010 (IU, Indiana University; UofC, University of Chicago; UCSD, University of California San Diego; UFI, University of Florida).
User requirements are becoming more and more complex. We report here the results of investigations of the applicability of commercial cloud computing to scientific computing, with an emphasis on astronomy, including investigations of what types of applications can be run cheaply and efficiently on the cloud, and an example of an application well suited to the cloud: processing a large dataset to create a new science product. In detail, the goals of the study were to:
— understand the performance of three workflow applications with different I/O, memory and CPU usage on a commercial cloud;
— compare the performance of the cloud with that of a high-performance cluster equipped with a high-performance network and a parallel file system; and
— analyse the costs associated with running workflows on a commercial cloud.

Pegasus offers two major benefits in performing the studies itemized in the introduction.
— Mapper (Pegasus mapper): generates an executable workflow based on an abstract workflow provided by the user or workflow composition system.
In particular, we used the FutureGrid and Magellan academic clouds. One example is Magellan, deployed at the US Department of Energy's National Energy Research Scientific Computing Center with Eucalyptus technologies (http://open.eucalyptus.com/), which are aimed at creating private clouds.

Traditional grids and clusters use network or parallel file systems. The performance of the different workflows does, however, depend on the architecture of the storage system used and on the way in which the workflow application itself uses and stores files, both of which govern how efficiently data are communicated between workflow tasks. This is particularly the case for I/O-bound applications, whose performance benefits greatly from the availability of parallel file systems. For Broadband, a memory-bound application, the picture is quite different: the processing advantage of the parallel file system disappears, and abe.lustre offers only slightly better performance than abe.local. By contrast, Epigenome shows much less variation than Montage because it is strongly CPU bound. Similar results apply to Epigenome: the machine offering the best performance, c1.xlarge, is the second cheapest machine. Input data were stored for the long term on elastic block store (EBS) volumes, but transferred to local disks for processing. PVFS likely performs poorly because the small-file optimization that is part of the current release had not been incorporated at the time of the experiment.

These studies cite the example of hosting the 12 TB volume of the 2MASS survey, which would cost US$12 000 per year if stored on S3, the same cost as the outright purchase of a disk farm, inclusive of hardware purchase, support, and facility and energy costs, for 3 years. Given that scientists will almost certainly need to transfer products out of the cloud, transfer costs may prove prohibitively expensive for high-volume products.

Periodograms identify the significance of periodic signals present in a time-series dataset, such as those arising from transiting planets and from stellar variability. The periodogram code is strongly CPU bound, as it spends 90 per cent of its runtime processing data, and the datasets are small, so the transfer and storage costs are not excessive [13]. Our initial experiments used subsets of the publicly released Kepler datasets.
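To make the periodogram computation concrete, the following is a minimal sketch of a normalized Lomb–Scargle-style periodogram for an unevenly sampled time series. It is an illustration only: it is not the Plavchan algorithm or the Exoplanet Archive's production C code described in this paper, and all function and variable names are ours.

```python
import numpy as np

def lomb_scargle(t, y, freqs):
    """Normalized Lomb-Scargle power at each trial frequency.

    t, y  : unevenly sampled observation times and measurements
    freqs : trial frequencies (cycles per unit time)
    """
    y = y - np.mean(y)                 # remove the mean flux
    var = np.var(y)
    power = np.empty(len(freqs))
    for i, f in enumerate(freqs):
        w = 2.0 * np.pi * f
        # time offset tau that decouples the sine and cosine terms
        tau = np.arctan2(np.sum(np.sin(2 * w * t)),
                         np.sum(np.cos(2 * w * t))) / (2 * w)
        c = np.cos(w * (t - tau))
        s = np.sin(w * (t - tau))
        power[i] = 0.5 / var * (np.dot(y, c) ** 2 / np.dot(c, c) +
                                np.dot(y, s) ** 2 / np.dot(s, s))
    return power

# Example: recover a 2.5-day period from noisy, irregularly sampled data.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 90.0, 500))          # days
y = np.sin(2 * np.pi * t / 2.5) + 0.5 * rng.normal(size=500)
freqs = np.linspace(0.05, 2.0, 2000)              # cycles per day
best_period = 1.0 / freqs[np.argmax(lomb_scargle(t, y, freqs))]
print(f"best period ~ {best_period:.2f} days")
```

Each Kepler light curve is processed independently in this way, which is what makes the atlas computation so naturally parallel across cloud nodes.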
What are the overheads and hidden costs in using these technologies? What demands do they place on applications? What kind of tools will allow users to provision resources and run their jobs? Cloud computing is a method of running application software and storing related data in central computer systems, and of providing customers or other users with access to them through the Internet. Astronomers generally take advantage of a cloud environment to provide the infrastructure to build and run parallel applications; that is, they use it as what has come to be called 'Infrastructure as a Service'. As a rule, cloud providers make available to end users root access to instances of virtual machines (VMs) running an operating system of the user's choice, but they offer no system administration support beyond ensuring that the VM instances function.

The group investigating GPUs [3] is finding that what they call 'arithmetically intensive' applications run most effectively on GPUs, and they cite examples such as radio-telescope signal correlation and machine learning that run 100 times faster than on central processing unit (CPU)-based platforms.

A number of such tools are under development, and the investigations reported here used two of them: Wrangler [9] and the Pegasus Workflow Management System [10]. Montage is maintained by the NASA/IPAC Infrared Science Archive.

We will refer to these instances by their AmEC2 name throughout the paper. The c1.xlarge type is nearly equivalent to abe.local and delivered nearly equivalent performance (within 8%), which indicates that virtualization overhead does not seriously degrade performance. The m1.xlarge type has double the memory of the other machine types, and the extra memory is used by the Linux kernel for the file-system buffer cache to reduce the amount of time the application spends waiting for I/O.

In general, GlusterFS delivered good performance for all the applications tested and seemed to perform well with both a large number of small files and a large number of clients. S3 is at a disadvantage, especially for workflows with many files, because Amazon charges a fee per S3 transaction. The cost of the protocol used by Condor to communicate between the submit host and the workers is not included, but it is estimated to be much less than US$0.01 per workflow.

The FutureGrid testbed includes a geographically distributed set of heterogeneous computing systems, a data management system and a dedicated network.

FutureGrid available Nimbus and Eucalyptus cores in November 2010.

Performance and costs associated with the execution of periodograms of the Kepler datasets on Amazon and the NSF TeraGrid.

This work was supported in part by the National Science Foundation under grants nos 0910812 (FutureGrid) and OCI-0943725 (CorralWMS).

— AmEC2 offers no cost benefits over locally hosted storage, and is generally more expensive, but eliminates local maintenance and energy costs, and offers high-quality storage products.
Providers generally charge for all operations, including processing, transfer of input data into the cloud and transfer of data out of the cloud, storage of data, disk operations and storage of VM images and applications.
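As a rough illustration of this itemized charging model, the sketch below totals the main charge categories for a single workflow run. The rates and usage figures are placeholders chosen for illustration; they are not Amazon's actual prices or the values measured in this study.

```python
# Illustrative itemized cloud cost model (placeholder rates, not real AWS prices).

RATES = {
    "cpu_hour":         0.68,     # US$ per instance-hour (hypothetical)
    "transfer_in_gb":   0.10,     # US$ per GB transferred into the cloud
    "transfer_out_gb":  0.15,     # US$ per GB transferred out of the cloud
    "storage_gb_month": 0.15,     # US$ per GB-month of persistent storage
    "request":          0.00001,  # US$ per storage request (e.g. per S3 transaction)
}

def workflow_cost(instance_hours, gb_in, gb_out, gb_stored_months, requests):
    """Sum the itemized charges for one workflow execution."""
    return (instance_hours     * RATES["cpu_hour"]
            + gb_in            * RATES["transfer_in_gb"]
            + gb_out           * RATES["transfer_out_gb"]
            + gb_stored_months * RATES["storage_gb_month"]
            + requests         * RATES["request"])

# Example: an I/O-heavy run in which data transfer dominates the bill.
cost = workflow_cost(instance_hours=16, gb_in=4.3, gb_out=7.9,
                     gb_stored_months=5.0, requests=30_000)
print(f"estimated cost: US${cost:.2f}")
```

The point of the sketch is that compute time is only one term in the sum; for I/O-bound workflows such as Montage the transfer and request terms can dominate.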
Configuration of these instances, installation and testing of applications, deployment of tools for managing and monitoring their performance, and general systems administration are the responsibility of the end user. Wrangler then provisions and configures the VMs according to their dependencies, and monitors them until they are no longer needed.

The astronomical community is collaborating with computer scientists in investigating how emerging technologies can support the next generation of what has come to be called data-driven astronomical computing [2]. The scientific goal for our experiments was to calculate an atlas of periodograms for the time-series datasets released by the Kepler mission (http://kepler.nasa.gov/), which uses high-precision photometry to search for exoplanets transiting stars in a 105 deg² area in Cygnus.

We created a single workflow for each application to be used throughout the study. See Deelman. Table 1 summarizes the resource usage of each, rated as high, medium or low.

Epigenome (CPU bound).

Summary of processing resources on Amazon EC2.

Table 7. File systems investigated on Amazon EC2.

The runtimes in hours for the Montage, Broadband and Epigenome workflows on the Amazon EC2 cloud and on Abe. The legend identifies the processor instances listed in tables 3 and 4.

To have an unbiased comparison of the performance of workflows on AmEC2 and Abe, all the experiments presented here were conducted on single nodes, using the local disk on both EC2 and Abe, and the parallel file system on Abe. We used the Eucalyptus and Nimbus technologies to manage and configure resources, and to constrain our resource usage to roughly a quarter of the available resources in order to leave resources available for other users.

Abe.local's performance is only 1 per cent better than c1.xlarge, so virtualization overhead is essentially negligible. If there is less memory, some cores must sit idle to prevent the system from running out of memory or swapping. NFS performed surprisingly well in cases where there were few clients or where the I/O requirements of the application were low. NFS was at a disadvantage compared with the other systems because it used an extra, dedicated node to host the file system; overloading a compute node to run the NFS server did not significantly reduce the cost. While data transfer costs for Epigenome and Broadband are small, for Montage they are larger than the processing and storage costs using the most cost-effective resource type.

The walltime measures the end-to-end workflow execution, while the cumulative duration is the sum of the execution times of all the tasks in the workflow.
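The distinction between walltime and cumulative duration can be expressed directly: cumulative duration sums the per-task runtimes, walltime is the end-to-end elapsed time, and their ratio gives the effective parallelism achieved. Below is a minimal sketch with made-up task records; the field and task names are ours, not Pegasus or Condor log formats.

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    name: str
    start: float   # seconds since workflow start
    end: float

def summarize(tasks):
    """Return (walltime, cumulative_duration, effective_parallelism)."""
    walltime = max(t.end for t in tasks) - min(t.start for t in tasks)
    cumulative = sum(t.end - t.start for t in tasks)
    return walltime, cumulative, cumulative / walltime

# Example: three tasks, two of which overlap on different cores.
tasks = [
    TaskRecord("project_1", 0.0, 120.0),
    TaskRecord("project_2", 0.0, 110.0),
    TaskRecord("coadd",     120.0, 200.0),
]
wall, cum, par = summarize(tasks)
print(f"walltime={wall:.0f}s cumulative={cum:.0f}s parallelism={par:.2f}")
```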
Here, we summarize the important results and the experimental details needed to properly interpret them. Our investigations used the periodogram service at the National Aeronautics and Space Administration's Exoplanet Archive [13], which is supported by the NASA Exoplanet Science Institute at the Infrared Processing and Analysis Center, operated by the California Institute of Technology in coordination with the Jet Propulsion Laboratory (JPL). These periodograms executed the Plavchan algorithm [13], the most computationally intensive algorithm implemented by the periodogram code. Runs 1 and 2 used two computationally similar algorithms, whereas Run 3 used an algorithm that was considerably more computationally intensive than those used in Runs 1 and 2. The nodes on the TeraGrid and Amazon were comparable in terms of CPU type, speed and memory.

Broadband (http://scec.usc.edu/research/cme/) generates and compares synthetic seismograms for several sources (earthquake scenarios) and sites (geographical locations). On some instance types, fewer cores can be used than are nominally available: m1.small has only a 50 per cent share of one core, and only one of the cores on c1.medium can be used because of memory limitations.

Summary of processing resources on the Abe high-performance cluster.

Variation with the number of cores of the runtime and data-sharing costs for the Epigenome workflow for the data storage options identified in table 7.

The authors of [7] and the United States Department of Energy Advanced Scientific Computing Research Program [8] point out that this activity can incur considerable business costs, and that these costs must be taken into account when deciding whether to use a cloud platform. However, when computations grow larger, the costs of computing become significant. While costs will change with time, this paper shows that any such study must account for itemized charges for resource usage, data transfer and storage. A thorough cost–benefit analysis, of the kind described here, should always be carried out in deciding whether to use a commercial cloud for running workflow applications, and end users should repeat this analysis every time price changes are announced. Nevertheless, the cloud is clearly a powerful and cost-effective tool for CPU- and memory-bound applications, especially if one-time bulk processing is required and the data volumes involved are modest.

— The resources offered by AmEC2 are generally less powerful than those available in HPCs and generally do not offer the same performance.

Pegasus has been developed over several years. Wrangler is a service that automates the deployment of complex, distributed applications on infrastructure clouds. Wrangler users describe their deployments using a simple extensible markup language (XML) format, which specifies the type and quantity of VMs to provision, the dependencies between the VMs and the configuration settings to apply to each VM.
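To give a flavour of such a deployment description, the sketch below builds a small XML document specifying VM types, counts, dependencies and configuration scripts. The element and attribute names are invented for illustration; they are not Wrangler's actual schema (see [9] for that).

```python
import xml.etree.ElementTree as ET

# Hypothetical deployment: one file-server VM that must come up before
# four worker VMs that mount it. Element/attribute names are illustrative only.
deployment = ET.Element("deployment")

server = ET.SubElement(deployment, "node", name="file-server",
                       count="1", instance_type="c1.xlarge")
ET.SubElement(server, "configure", script="setup_nfs_server.sh")

workers = ET.SubElement(deployment, "node", name="worker",
                        count="4", instance_type="c1.xlarge")
ET.SubElement(workers, "depends", on="file-server")   # provisioning order
ET.SubElement(workers, "configure", script="mount_nfs_and_start_condor.sh")

print(ET.tostring(deployment, encoding="unicode"))
```

A description of this kind is what allows the same virtual cluster to be torn down and recreated repeatably across different cloud providers.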
Porting applications to run on different environments, along with installation of dependent toolkits or libraries, is the end user's responsibility. The potential applications of cloud computing are many: financial applications, health care services, business enterprises and many others. Cloud research tools provide a platform for new avenues of scientific research by providing fast access to bare-metal resources.

We ran experiments on AmEC2 (http://aws.amazon.com/ec2/) and the National Center for Supercomputing Applications Abe high-performance cluster (http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/Intel64Cluster/). We measured and compared the total execution time of the workflows on these resources and their input/output needs, and quantified the costs. In table 2, input is the amount of input data to the workflow, output is the amount of output data and logs refers to the amount of logging data that is recorded for workflow tasks and transferred back to the submit host.

The costs of transferring data into and out of the Amazon EC2 cloud.

The periodogram code is written in C for performance, and supports three algorithms that find periodicities according to their shape and according to their underlying data sampling rates.

While academic clouds cannot yet offer the range of services offered by AmEC2, their performance on the one product generated so far is comparable to that of AmEC2, and, when these clouds are fully developed, they may offer an excellent alternative to commercial clouds.

— Execution engine (DAGMan): executes the tasks defined by the workflow in order of their dependencies.
The other benefit is that Pegasus manages data on behalf of the user: it infers the required data transfers, registers data into catalogues and captures performance information, while maintaining a common user interface for workflow submission.
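Executing tasks 'in order of their dependencies' amounts to a topological ordering of the workflow's directed acyclic graph. The sketch below is a generic illustration of that idea, not DAGMan or the Pegasus API, and the toy Montage-like task names are chosen by us for the example.

```python
from collections import defaultdict, deque

def topological_order(deps):
    """deps maps task -> set of tasks it depends on; returns a runnable order."""
    indegree = {task: len(parents) for task, parents in deps.items()}
    children = defaultdict(list)
    for task, parents in deps.items():
        for parent in parents:
            children[parent].append(task)
    ready = deque(task for task, n in indegree.items() if n == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in children[task]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(deps):
        raise ValueError("workflow contains a cycle")
    return order

# Toy Montage-like workflow: reproject two images, fit their overlap, co-add.
deps = {
    "mProject_1": set(),
    "mProject_2": set(),
    "mDiffFit":   {"mProject_1", "mProject_2"},
    "mAdd":       {"mDiffFit"},
}
print(topological_order(deps))
```

In a real run, the engine also overlaps independent tasks across the provisioned cores rather than executing the order serially.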
Theme Issue 'e-Science–towards the cloud: infrastructures, applications and research', compiled and edited by Paul Townend, Jie Xu and Jim Austin: The application of cloud computing to scientific workflows: a study of cost and performance. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.

— Does a commercial cloud offer performance advantages over a high-performance cluster in running workflow applications?
— What are the costs of running workflows on commercial clouds?
— Do academic cloud platforms offer any performance advantages over commercial clouds?

Another example of an academic cloud is the FutureGrid testbed (https://portal.futuregrid.org/about), designed to investigate computer science challenges related to cloud computing systems, such as authentication and authorization, interface design, and the optimization of grid- and cloud-enabled scientific applications [13]. Table 10 shows the characteristics of the various cloud deployments and the results of the computations.

The glide-ins contact a Condor central manager controlled by the user, where they can be used to execute the user's jobs on the remote resources.

While the AmEC2 instances are not prohibitively slow, the processing times on abe.lustre are nevertheless nearly a factor of three shorter than on the fastest AmEC2 machines. The challenge in the cloud is how to reproduce the performance of these file systems, or to replace them with storage systems of equivalent performance. Since the completion of this study, AmEC2 has begun to offer high-performance options, and repeating this experiment with them would be valuable.

Table 2. Data transfer sizes per workflow on Amazon EC2.

Variation with the number of cores of the runtime and data-sharing costs for the Broadband workflow for the data storage options identified in table 7.

The storage configurations investigated in table 7 include a centralized node that acts as a file server for a group of servers (NFS); non-uniform file access (NUFA), in which writes to new files always go to the local disk; and distribute, in which files are distributed among nodes (the latter two are GlusterFS configurations).

References cited in this study include: Analysing astronomy algorithms for GPUs and beyond; Astronomical image processing with Hadoop; Scientific workflow applications on Amazon EC2; Debunking some common misconceptions of science in the cloud; Automating application deployment in infrastructure clouds; Pegasus: a framework for mapping complex scientific workflows onto distributed systems; Data sharing options for scientific workflows on Amazon EC2; Experiences with resource provisioning for scientific workflows using Corral; The application of cloud computing to astronomy: a study of cost and performance; Design of the FutureGrid experiment management framework; http://www.ncsa.illinois.edu/UserInfo/Resources/Hardware/Intel64Cluster/; http://queue.acm.org/detail.cfm?id=2047483; http://datasys.cs.iit.edu/events/ScienceCloud2011/; http://science.energy.gov/~/media/ascr/pdf/program-documents/docs/Magellan_Final_Report.pdf.
We provisioned 48 cores each on Amazon EC2, FutureGrid and Magellan, and used the resources to compute periodograms for 33 000 Kepler datasets. We can see that the performance on the three clouds is comparable, achieving a speedup of approximately 43 on 48 cores. The best performance was achieved on the m1.xlarge resource. The cost of running this workflow on Amazon is approximately US$31, with US$2 in data transfer costs. FutureGrid supports VM-based environments, as well as native operating systems, for experiments aimed at minimizing overheads and maximizing performance.

Archives of the future must instead offer processing and analysis of massive volumes of data on distributed high-performance technologies and platforms, such as grids and the cloud. Pipelines used to create scientific datasets from raw and calibration data obtained from satellite or ground-based sensors are the best-known examples of workflow applications.

Montage (I/O bound).

The Epigenome workflow is CPU bound because it spends 99 per cent of its runtime in the CPU and only 1 per cent on I/O and other activities. Broadband generates a large number of small files, which is most likely why PVFS performs poorly. S3 produced good performance for one application, possibly owing to the use of caching in our implementation of the S3 client. S3 performs relatively well because the workflow reuses many files, and this improves the effectiveness of the S3 client cache. Figure 3 shows that, for Montage, the variation in performance can be more than a factor of three for a given number of nodes. Figure 2 shows the resource cost for the workflows whose performances were given in figure 1.

The Mapper can also restructure the workflow to optimize performance and add transformations for data management and provenance information generation. Such a study is, however, a major undertaking and outside the scope of this paper.

Under AmEC2's current cost structure, long-term storage of data is prohibitively expensive. The 32-bit image used for the experiments in this study was 773 MB, compressed, and the 64-bit image was 729 MB, compressed, for a total fixed cost of US$0.22 per month.
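The fixed image-storage charge can be reproduced with simple arithmetic. Assuming an S3 rate of roughly US$0.15 per GB-month (the approximate published rate at the time; treat the exact figure as an assumption here), the two compressed images quoted above give approximately the US$0.22 per month quoted in the text.

```python
# Monthly S3 cost of storing the two compressed VM images.
S3_RATE_PER_GB_MONTH = 0.15          # US$/GB-month, assumed rate

images_mb = [773, 729]               # 32-bit and 64-bit images, compressed
total_gb = sum(images_mb) / 1024.0   # ~1.47 GB
monthly_cost = total_gb * S3_RATE_PER_GB_MONTH
print(f"{total_gb:.2f} GB -> US${monthly_cost:.2f} per month")   # ~US$0.22
```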
The Mapper finds the appropriate software, data and computational resources required for workflow execution. The VM images were all stored on AmEC2's object-based storage system, called S3. AmEC2 generally charges higher rates as the processor speed, number of cores and size of memory increase, as shown by the last column in table 3. The commodity AmEC2 hardware evaluated here cannot match the performance of HPC systems for I/O-bound applications, but as AmEC2 offers more high-performance options, their cost and performance should be investigated.
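Because hourly rates rise with instance capability while runtimes usually fall, the cheapest instance per hour is not necessarily the cheapest per workflow run. The sketch below compares cost per run for a few instance types; the runtimes and rates are hypothetical placeholders, not the measured values in tables 3 and 4.

```python
# Hypothetical runtimes (hours) and hourly rates (US$) for one workflow.
candidates = {
    "m1.small":  {"runtime_h": 21.0, "rate": 0.10},
    "c1.medium": {"runtime_h": 7.5,  "rate": 0.20},
    "m1.xlarge": {"runtime_h": 4.2,  "rate": 0.80},
    "c1.xlarge": {"runtime_h": 2.1,  "rate": 0.68},
}

def cost_per_run(entry):
    """Total on-demand charge for one end-to-end execution."""
    return entry["runtime_h"] * entry["rate"]

for name, entry in sorted(candidates.items(), key=lambda kv: cost_per_run(kv[1])):
    print(f"{name:10s} US${cost_per_run(entry):5.2f} per run")
```

With numbers of this shape, a fast instance such as c1.xlarge can undercut the nominally cheapest machine once the shorter runtime is taken into account, which is consistent with the cost–performance pattern reported above.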