Distributed Computing: Distributed System Checkpoints, Their Types, Levels, and Tools
Contents
- 1 Distributed Computing: Checkpoints in Distributed Systems
- 2 What is Checkpointing in Distributed Systems?
- 3 Types of Checkpoints in Distributed Systems
- 4 Coordinated Checkpointing
- 5 Uncoordinated Checkpointing
- 6 Communication-Induced Checkpointing
- 7 Application-Level Checkpointing
- 8 Checkpoint Levels in Distributed Systems
- 9 Tools for Checkpointing in Distributed Systems
- 10 Conclusion
Distributed Computing: Checkpoints in Distributed Systems
What is Checkpointing in Distributed Systems?
Checkpointing is a fault-tolerance mechanism in distributed computing that periodically saves the state of running processes to stable storage. If a failure occurs, the system restarts from the last checkpoint instead of starting from scratch, so only the work done since that checkpoint is lost.
Key Idea:
Saves system state at intervals.
Reduces computation loss during failures.
Speeds up recovery in distributed systems.
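The idea above can be sketched for a single process as follows. This is a minimal illustration, not a production design: the function names, the JSON file format, and the `fail_at` parameter (used only to simulate a crash) are assumptions made for the example.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state atomically: temp file first, then rename over the old checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)

def load_checkpoint(path):
    """Return the saved state, or a fresh initial state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_i": 0, "total": 0}

def run(path, n, interval, fail_at=None):
    """Sum 0..n-1, checkpointing every `interval` steps. `fail_at` simulates a crash."""
    state = load_checkpoint(path)
    for i in range(state["next_i"], n):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += i
        state["next_i"] = i + 1
        if state["next_i"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If `run` crashes partway through, calling it again with the same path resumes from the last saved step rather than from zero. Writing to a temporary file and renaming it means a crash during the save itself cannot corrupt the previous checkpoint.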
Types of Checkpoints in Distributed Systems
Coordinated Checkpointing
Definition: All nodes in the system synchronize and save their states together.
Ensures consistency (no orphan or lost messages).
Produces a consistent global snapshot of the system.
Slower due to coordination overhead.
Examples:
- Two-phase coordinated checkpointing (for example, the Koo-Toueg protocol)
- Chandy-Lamport snapshot algorithm
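A coordinated round can be sketched as a two-phase exchange: every process first takes a tentative checkpoint, and the round is committed only if all of them succeed. The sketch below is an in-memory simulation under simplifying assumptions. The `Process` class and its `healthy` flag are hypothetical, and it ignores in-flight messages, which a real protocol must either block or record.

```python
class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state          # live application state
        self.tentative = None       # checkpoint taken but not yet committed
        self.committed = None       # last checkpoint agreed on by all processes
        self.healthy = True         # hypothetical flag used to simulate a failed node

    def take_tentative(self):
        """Phase 1: snapshot local state; return False if this node cannot."""
        if not self.healthy:
            return False
        self.tentative = dict(self.state)
        return True

    def commit(self):
        """Phase 2 (success): promote the tentative checkpoint."""
        self.committed, self.tentative = self.tentative, None

    def abort(self):
        """Phase 2 (failure): discard the tentative checkpoint."""
        self.tentative = None

def coordinated_checkpoint(processes):
    """Return True if every process committed a new checkpoint together."""
    acks = [p.take_tentative() for p in processes]
    if all(acks):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.abort()
    return False
```

Because a round either commits everywhere or nowhere, the committed checkpoints always belong to the same round, which is what makes recovery simple. The cost is that every process waits on the slowest participant.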
Uncoordinated Checkpointing
Definition: Each process takes checkpoints independently without synchronization.
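Lower overhead during normal operation, since no coordination is required.
Risk of cascading rollbacks (domino effect): rolling one process back can force others to roll back too, possibly all the way to the initial state.
Example:
- Each process saving its own state on a local timer.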
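The domino effect can be made concrete with a small rollback calculation. The sketch below assumes a simplified model in which all processes restart from their latest checkpoints, and each message is described by the checkpoint interval in which it was sent and received. The function name and the tuple layout are illustrative choices.

```python
def recovery_line(latest, messages):
    """
    latest:   dict process -> index of its most recent checkpoint (0 = initial state).
    messages: list of (sender, send_interval, receiver, recv_interval), where
              interval k means "after checkpoint k, before checkpoint k+1".
    Returns dict process -> checkpoint index each process must restart from.
    """
    line = dict(latest)
    changed = True
    while changed:
        changed = False
        for sender, s_int, receiver, r_int in messages:
            send_undone = s_int >= line[sender]    # rollback erased the send
            recv_kept = r_int < line[receiver]     # receiver still remembers it
            if send_undone and recv_kept:
                # Orphan message: roll the receiver back far enough to undo the receive.
                line[receiver] = r_int
                changed = True
    return line

# Two processes whose messages zigzag across every checkpoint.
zigzag = [
    ("Q", 0, "P", 0),
    ("P", 1, "Q", 0),
    ("Q", 1, "P", 1),
    ("P", 2, "Q", 1),
]
print(recovery_line({"P": 2, "Q": 2}, zigzag))
```

With this message pattern each rollback creates a new orphan message on the other side, so both processes end up back at checkpoint 0 even though each had taken two checkpoints.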
Communication-Induced Checkpointing
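Definition: A hybrid approach in which processes checkpoint independently, but control information piggybacked on application messages can force an extra checkpoint before a message is delivered.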
Prevents inconsistent states.
Avoids the domino effect from uncoordinated checkpointing.
Higher overhead, since every message carries tracking information and forced checkpoints can be frequent.
Example:
- Index-based protocols, in which each message carries the sender's checkpoint index.
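One way to picture message-triggered checkpoints is an index-based scheme: each process numbers its checkpoints, attaches the current number to every outgoing message, and takes a forced checkpoint before delivering a message that carries a higher number than its own. The sketch below is a simplified in-memory illustration of that rule. The class and method names are assumptions, and a list of delivered messages stands in for real application state.

```python
class CicProcess:
    def __init__(self, name):
        self.name = name
        self.index = 0                 # index of the latest local checkpoint
        self.state = []                # delivered messages stand in for application state
        self.checkpoints = {0: []}     # index -> saved copy of state

    def _save(self, index):
        self.index = index
        self.checkpoints[index] = list(self.state)

    def basic_checkpoint(self):
        """Independent checkpoint taken on the process's own schedule."""
        self._save(self.index + 1)

    def send(self, payload):
        """Piggyback the current checkpoint index on the outgoing message."""
        return {"payload": payload, "index": self.index}

    def receive(self, message):
        """Take a forced checkpoint before delivery if the sender is ahead."""
        forced = message["index"] > self.index
        if forced:
            self._save(message["index"])
        self.state.append(message["payload"])
        return forced
```

The forced checkpoint is taken before the message is delivered, so the saved state never contains a message sent after the sender's checkpoint with the same index. That ordering is what keeps same-index checkpoints from containing orphan messages.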
Application-Level Checkpointing
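Definition: The application itself decides what state to save and when, rather than relying on the operating system or runtime to capture it.
Allows customized checkpointing in applications.
Often smaller and faster than system-level checkpoints because only essential data is saved, which makes it common in high-performance computing (HPC).
Requires developer implementation.
Example:
- MPI (Message Passing Interface) applications that write their own restart files.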
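The main benefit of doing this in the application is choosing what to save. In the hypothetical sketch below, a simulation saves only its step counter and its values, and skips a scratch buffer that is recomputed on every step. The `Simulation` class and its update rule are invented for illustration.

```python
import json

class Simulation:
    def __init__(self, size):
        self.step = 0
        self.values = [float(i) for i in range(size)]
        self.scratch = [0.0] * size    # derived buffer, rebuilt on every step

    def advance(self):
        self.scratch = [v * 0.5 for v in self.values]
        self.values = [v + s for v, s in zip(self.values, self.scratch)]
        self.step += 1

    def to_checkpoint(self):
        """Serialize only the essential state; the scratch buffer is left out."""
        return json.dumps({"step": self.step, "values": self.values})

    @classmethod
    def from_checkpoint(cls, blob):
        """Rebuild a simulation from a checkpoint string."""
        data = json.loads(blob)
        sim = cls(len(data["values"]))
        sim.step = data["step"]
        sim.values = data["values"]
        return sim
```

A restored simulation continues exactly where the original left off, because everything omitted from the checkpoint can be recomputed from what was saved.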
Checkpoint Levels in Distributed Systems
Process-Level Checkpointing: Saves state of individual processes.
System-Level Checkpointing: Saves the state of a whole machine, including the operating system (for example, a virtual machine snapshot).
Application-Level Checkpointing: Saves the state at the application level.
Tools for Checkpointing in Distributed Systems
1. DMTCP (Distributed MultiThreaded Checkpointing)
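- Transparent user-level checkpointing; no changes to application code or the kernel are needed.
- Supports multithreaded and distributed applications, including MPI programs.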
2. CRIU (Checkpoint/Restore in Userspace)
- Process-level checkpointing for Linux.
- Dumps the state of a running process tree to image files and later restores it to resume execution.
3. BLCR (Berkeley Lab Checkpoint/Restart)
- Kernel-level checkpointing for HPC systems, implemented as a Linux kernel module.
- Works with MPI applications.
4. Hadoop Checkpointing
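- Used in HDFS (Hadoop Distributed File System): a checkpoint merges the NameNode's edit log into the fsimage file, which protects file system metadata and shortens NameNode restart time.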
Conclusion
Checkpointing reduces the impact of failures by allowing a system to recover from a saved state. The right type and tool depend on the system's requirements for speed, reliability, and overhead.