A NOVEL ADAPTIVE CHECKPOINTING METHOD BASED ON INFORMATION OBTAINED FROM WORKFLOW STRUCTURE

Eszter Kail; Péter Kacsuk; Miklós Kozlovszky

doi:10.7494/csci.2016.17.3.387

Authors

Eszter Kail, Obuda University, John von Neumann Faculty of Informatics, 1034 Bécsi str. 96/b., Budapest, Hungary
Péter Kacsuk, University of Westminster, 115 New Cavendish Street, London, United Kingdom; MTA SZTAKI, 1518 Budapest, Hungary
Miklós Kozlovszky, Obuda University, John von Neumann Faculty of Informatics, Biotech Lab, 1034 Bécsi str. 96/b., Budapest, Hungary; MTA SZTAKI, 1518 Budapest, Hungary


Keywords:

scientific workflow, checkpoint, dynamic execution

Abstract

Scientific workflows are data- and compute-intensive; thus, they may run for days or even weeks on parallel and distributed infrastructures such as grids, supercomputers, and clouds. On these high-performance computing infrastructures, the number of failures that can arise during scientific-workflow enactment can be high, so the use of fault-tolerance techniques is unavoidable. The most frequently used fault-tolerance technique is taking checkpoints from time to time; when a failure is detected, the last consistent state is restored. One of the most critical factors affecting the effectiveness of a checkpointing method is the checkpointing interval. In this work, we propose a Static (Wsb) and an Adaptive (AWsb) Workflow Structure Based checkpointing algorithm. Our results showed that, compared to the optimal checkpointing strategy, the static algorithm may decrease the checkpointing overhead by as much as 33% without affecting the total processing time of workflow execution. The adaptive algorithm may further decrease this overhead while keeping the overall processing time at its necessary minimum.
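
To illustrate why the checkpointing interval matters, the sketch below computes a fixed interval using Young's well-known first-order approximation, sqrt(2 × checkpoint cost × mean time between failures), and the checkpointing overhead that such an interval implies for a single task. This is only a generic baseline under stated assumptions, not the Wsb or AWsb algorithm, and the abstract does not say which strategy it treats as optimal. The function names, parameters, and example values are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    checkpoint_cost_s: time needed to write one checkpoint (assumed constant).
    mtbf_s: mean time between failures of the resource (assumed known).
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_overhead(work_s: float, interval_s: float,
                        checkpoint_cost_s: float) -> float:
    """Total time spent writing checkpoints in a failure-free run.

    Assumes one checkpoint after every full interval of work and no
    checkpoint at the very end of the task.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    n_checkpoints = max(math.ceil(work_s / interval_s) - 1, 0)
    return n_checkpoints * checkpoint_cost_s


if __name__ == "__main__":
    # Hypothetical values: 2 s per checkpoint, 100 s MTBF, 100 s of work.
    interval = young_interval(2.0, 100.0)
    overhead = checkpoint_overhead(100.0, interval, 2.0)
    print(f"interval = {interval:.1f} s, overhead = {overhead:.1f} s")
```

A shorter interval reduces the work lost on failure but raises the overhead returned by `checkpoint_overhead`; a longer interval does the opposite. Approaches that adapt the interval aim to cut this overhead where a failure would not delay the overall workflow.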

