A cluster federation is a union of clusters and is heterogeneous. Each cluster contains a certain number of processes. An application running in such a computing environment is divided into communicating modules so that these modules can run on different clusters. To achieve fault-tolerance different clusters may employ different check pointing schemes. For example, some may use coordinated schemes, while some other may use communication-induced schemes. It may complicate the recovery process. In this paper, we have addressed the complex problem of recovery for cluster computing environment. The proposed approach handles both inter cluster orphan and lost messages unlike the existing works in this area. We first propose an algorithm to determine a recovery line so that there does not exist any inter cluster orphan message between any pair of the cluster level check points belonging to the recovery line. The main feature of the proposed algorithm is that it can be executed simultaneously by all clusters in the cluster federation. Next we apply the sender-based message logging idea to effectively handle all inter cluster lost messages to ensure correctness of computation.
All Science Journal Classification (ASJC) codes
- Computer Science(all)