DIZNR INTERNATIONAL

Distributed Computing: Distributed System Checkpoints, Their Types, Checkpoint Levels, and Tools

https://www.gyanodhan.com/video/7A2.%20Computer%20Science/Distributed%20Computing/301.%20Day%2007%20Part%2004%20Distributed%20System%20Checkpoints%20and%20it%27s%20type%20checkpoint%20levels%20its%20%20toolsdistributed.mp4

Distributed Computing: Checkpoints in Distributed Systems

 What is Checkpointing in Distributed Systems?

Checkpointing is a fault-tolerance mechanism in distributed computing that periodically saves the system state. If a failure occurs, the system can restart from the last checkpoint instead of starting from scratch.

Key Idea:
Saves system state at intervals.
Reduces computation loss during failures.
Speeds up recovery in distributed systems.
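
To make the key idea concrete, here is a minimal single-process sketch. It is an illustration under stated assumptions, not a production mechanism: the function `run_job`, its parameters, and the in-memory "stable storage" are all hypothetical names chosen for this example. The job fails at step 250 while checkpointing every 100 steps, so only the 50 steps since the last checkpoint have to be redone.

```python
import copy


def run_job(total_steps, interval, fail_at=None, checkpoint=None):
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps.

    Returns (final_state, last_checkpoint). final_state is None if the
    simulated failure at step `fail_at` was hit.
    """
    state = copy.deepcopy(checkpoint) if checkpoint else {"step": 0, "acc": 0}
    saved = copy.deepcopy(state)          # stands in for stable storage
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            return None, saved            # crash: the in-memory state is lost
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            saved = copy.deepcopy(state)  # periodic checkpoint
    return state, saved


# First run crashes at step 250; the last checkpoint was taken at step 200.
result, ckpt = run_job(1000, interval=100, fail_at=250)
print(result, ckpt["step"])               # None 200

# Restart from the checkpoint instead of from step 0.
result, _ = run_job(1000, interval=100, checkpoint=ckpt)
print(result["acc"] == sum(range(1000)))  # True
```

A shorter interval loses less work per failure but spends more time saving state, which is the trade-off behind every scheme described in the following sections.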

 Types of Checkpoints in Distributed Systems

Coordinated Checkpointing

Definition: All nodes in the system synchronize and save their states together.

Ensures consistency (no orphan or lost messages).
Used to take global snapshots.
Slower due to coordination overhead.

Example:
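
One common way to coordinate is a blocking two-phase round: a coordinator asks every process to take a tentative checkpoint, and the checkpoints become permanent only if all processes acknowledge. The sketch below simulates this in a single Python program. The class `Process`, the helper `transfer`, and the bank-balance state are hypothetical, and messages are assumed to be delivered instantly, so no message is in flight while a round is running.

```python
class Process:
    """A process holding a balance; checkpoints go through two phases."""

    def __init__(self, pid, balance):
        self.pid = pid
        self.balance = balance
        self.tentative = None    # phase-1 checkpoint, not yet committed
        self.permanent = None    # last committed checkpoint
        self.blocked = False     # True while a checkpoint round is in progress

    def prepare(self):
        self.blocked = True
        self.tentative = self.balance
        return True              # acknowledgement to the coordinator

    def commit(self):
        self.permanent, self.tentative = self.tentative, None
        self.blocked = False

    def abort(self):
        self.tentative = None
        self.blocked = False


def transfer(src, dst, amount):
    """Application message, delivered instantly; refused during a round."""
    if src.blocked or dst.blocked:
        return False
    src.balance -= amount
    dst.balance += amount
    return True


def coordinated_checkpoint(processes):
    """Phase 1: tentative checkpoints everywhere. Phase 2: commit or abort."""
    acks = [p.prepare() for p in processes]
    if all(acks):
        for p in processes:
            p.commit()
        return True
    for p in processes:
        p.abort()
    return False


procs = [Process("A", 100), Process("B", 50), Process("C", 0)]
transfer(procs[0], procs[1], 30)
coordinated_checkpoint(procs)
transfer(procs[1], procs[2], 20)       # happens after the checkpoint
print([p.permanent for p in procs])    # [70, 80, 0]
```

The saved balances add up to the same total (150) as the live system, which is what a consistent global state means in this toy model. The blocking during a round is the coordination overhead mentioned above.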

 Uncoordinated Checkpointing

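Definition: Each process takes checkpoints independently, without synchronizing with the others.

Faster during normal operation, because no coordination is required.
Risk of cascading rollbacks (the domino effect): rolling one process back can invalidate checkpoints on other processes, in the worst case all the way back to the initial state.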

Example:
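
The domino effect can be shown with a small recovery-line search. This is a simplified sketch, assuming FIFO channels (so message counts are enough to detect orphans) and assuming checkpoint 0 of every process is the initial state with no messages sent or received. The function `recovery_line` and the `history` layout are hypothetical. A process is rolled back whenever its checkpoint records a message that the sender's checkpoint has not yet sent.

```python
def recovery_line(checkpoints):
    """Return {pid: checkpoint index} for the latest consistent set.

    checkpoints: {pid: [ckpt0, ckpt1, ...]}, oldest first. Each checkpoint
    is {"sent": {peer: count}, "recv": {peer: count}}.
    """
    line = {pid: len(cps) - 1 for pid, cps in checkpoints.items()}
    changed = True
    while changed:
        changed = False
        for receiver in checkpoints:
            r_ckpt = checkpoints[receiver][line[receiver]]
            for sender, received in r_ckpt["recv"].items():
                s_ckpt = checkpoints[sender][line[sender]]
                if received > s_ckpt["sent"].get(receiver, 0):
                    line[receiver] -= 1   # orphan message: roll receiver back
                    changed = True
                    break
    return line


# Q sends m1 to P, P checkpoints, P sends m2 to Q, Q checkpoints,
# Q sends m3 to P, P checkpoints again.
history = {
    "P": [
        {"sent": {}, "recv": {}},                # initial state
        {"sent": {}, "recv": {"Q": 1}},          # after receiving m1
        {"sent": {"Q": 1}, "recv": {"Q": 2}},    # after sending m2, receiving m3
    ],
    "Q": [
        {"sent": {}, "recv": {}},                # initial state
        {"sent": {"P": 1}, "recv": {"P": 1}},    # after sending m1, receiving m2
    ],
}
print(recovery_line(history))  # {'P': 0, 'Q': 0}
```

Every checkpoint either process took turns out to be unusable: each rollback creates a new orphan message on the other side, so both processes end up at their initial state.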

 Communication-Induced Checkpointing

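Definition: A hybrid approach in which processes take checkpoints independently but are also forced to take extra checkpoints based on information piggybacked on application messages.

Prevents inconsistent states.
Avoids the domino effect of uncoordinated checkpointing.
Higher overhead due to message tracking and forced checkpoints.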

Example:
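
One family of such protocols is index-based: every process numbers its checkpoints, attaches the current number to each outgoing message, and takes a forced checkpoint before delivering any message that carries a higher number than its own. The sketch below is a simplified illustration of that idea, not a complete protocol. The class `CICProcess` and its integer state are hypothetical.

```python
class CICProcess:
    """Process that piggybacks its checkpoint index on every message."""

    def __init__(self, pid):
        self.pid = pid
        self.index = 0              # index of the latest checkpoint
        self.state = 0              # application state: sum of payloads
        self.checkpoints = {0: 0}   # checkpoint index -> saved state
        self.forced = 0             # number of forced checkpoints taken

    def basic_checkpoint(self):
        """Independent (local) checkpoint, taken whenever the process likes."""
        self.index += 1
        self.checkpoints[self.index] = self.state

    def send(self, payload):
        return (self.index, payload)   # piggyback the sender's index

    def receive(self, message):
        msg_index, payload = message
        if msg_index > self.index:
            # Forced checkpoint BEFORE delivery, so the checkpoint with this
            # index does not contain a message sent after the sender's
            # checkpoint with the same index.
            self.index = msg_index
            self.checkpoints[self.index] = self.state
            self.forced += 1
        self.state += payload


p, q = CICProcess("P"), CICProcess("Q")
q.receive(p.send(5))      # same index: no forced checkpoint
p.basic_checkpoint()      # P moves to index 1
q.receive(p.send(7))      # index 1 > 0: Q checkpoints before delivery
print(q.checkpoints, q.state, q.forced)  # {0: 0, 1: 5} 12 1
```

Checkpoints that share an index line up without orphan messages, which is why rollback does not cascade. The cost is the extra field on every message and the forced checkpoints themselves.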

 Application-Level Checkpointing

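Definition: Checkpoints are written by the application code itself rather than taken transparently by the operating system or runtime.

Allows customized checkpointing: the application saves only the data it needs in order to resume.
Efficient for high-performance computing (HPC), because checkpoints are smaller than full process images.
Requires developer implementation.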

Example:
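
A minimal sketch of application-level checkpointing in Python follows. The application decides that two values (`next` and `total`) are all it needs to resume, and writes them to a file every few iterations. The function names, the file name `job.ckpt`, and the `crash_at` parameter used to simulate a failure are all hypothetical. Writing to a temporary file and renaming it with `os.replace` keeps a crash in the middle of a save from corrupting the previous checkpoint.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())      # make sure the bytes reach the disk
    os.replace(tmp_path, path)    # old checkpoint stays valid until this point


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def sum_of_squares(path, limit, every=10, crash_at=None):
    """Compute 0^2 + 1^2 + ... + (limit-1)^2, resuming from `path` if present."""
    state = load_checkpoint(path, {"next": 0, "total": 0})
    while state["next"] < limit:
        if state["next"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["next"] ** 2
        state["next"] += 1
        if state["next"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "job.ckpt")
    try:
        sum_of_squares(ckpt, 50, crash_at=25)   # first run fails at i = 25
    except RuntimeError:
        pass
    resumed_from = load_checkpoint(ckpt, None)["next"]
    total = sum_of_squares(ckpt, 50)            # second run resumes from the file
print(resumed_from, total)  # 20 40425
```

The second run starts at iteration 20 rather than 0 and still produces the correct result. The price is that the developer has to write and maintain the save and restore logic.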

 Checkpoint Levels in Distributed Systems

Process-Level Checkpointing: Saves state of individual processes.
System-Level Checkpointing: Saves the entire OS state.
Application-Level Checkpointing: Saves the state at the application level.

 Tools for Checkpointing in Distributed Systems

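1. DMTCP (Distributed MultiThreaded Checkpointing): transparently checkpoints multithreaded and distributed applications in user space, without kernel modifications.

2. CRIU (Checkpoint/Restore in Userspace): a Linux tool that freezes a running process tree and restores it later; container runtimes use it for checkpointing and live migration.

3. BLCR (Berkeley Lab Checkpoint/Restart): a kernel-module-based checkpoint/restart system for Linux, used mainly with MPI jobs on HPC clusters.

4. Hadoop Checkpointing: in HDFS, the Secondary NameNode or Checkpoint node periodically merges the edit log into the fsimage so that the NameNode can restart quickly.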

 Conclusion

Checkpointing reduces the impact of failures by letting a system recover from a saved state instead of restarting from scratch. The right type and tool depend on the system's requirements for speed, reliability, and overhead.
