Distributed Computing: Distributed System Checkpoints, Their Types, Levels, and Tools

Diznr International

2 months ago

https://www.gyanodhan.com/video/7A2.%20Computer%20Science/Distributed%20Computing/301.%20Day%2007%20Part%2004%20Distributed%20System%20Checkpoints%20and%20it%27s%20type%20checkpoint%20levels%20its%20%20toolsdistributed.mp4

Contents

1 Distributed Computing: Checkpoints in Distributed Systems
2 What is Checkpointing in Distributed Systems?
3 Types of Checkpoints in Distributed Systems
4 Coordinated Checkpointing
5 Uncoordinated Checkpointing
6 Communication-Induced Checkpointing
7 Application-Level Checkpointing
8 Checkpoint Levels in Distributed Systems
9 Tools for Checkpointing in Distributed Systems
10 Conclusion

Distributed Computing: Checkpoints in Distributed Systems

What is Checkpointing in Distributed Systems?

Checkpointing is a fault-tolerance mechanism in distributed computing that periodically saves the system state to stable storage. If a failure occurs, the system rolls back to the most recent checkpoint and resumes from there instead of starting from scratch.

Key Idea:
Saves system state at intervals.
Reduces computation loss during failures.
Speeds up recovery in distributed systems.
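
The idea above can be sketched for a single process as follows. This is a minimal illustration, not a production design: the function names (save_checkpoint, load_checkpoint, run), the JSON file format, and the crash_at parameter used to simulate a failure are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the target.

    The rename keeps the previous checkpoint intact if we crash mid-write.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"next_i": 0, "total": 0}


def run(path, n, every=10, crash_at=None):
    """Sum 0..n-1, checkpointing every `every` steps.

    `crash_at` is a hypothetical hook that simulates a failure at step i.
    """
    state = load_checkpoint(path)
    for i in range(state["next_i"], n):
        if crash_at is not None and i == crash_at:
            raise RuntimeError("simulated failure")
        state["total"] += i
        state["next_i"] = i + 1
        if state["next_i"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If run is interrupted at step 57 with every=10, the next call resumes from step 50 (the last checkpoint), so only seven steps of work are repeated rather than all 57.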

Types of Checkpoints in Distributed Systems

Coordinated Checkpointing

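Definition: All processes in the system synchronize and save their states together, so that the saved checkpoints form a consistent global state.

Ensures consistency: no orphan messages, and in-transit messages are accounted for.
Used to take global snapshots.
Slower due to coordination overhead (extra control messages, and blocking in some protocols).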

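Examples:

Koo-Toueg protocol (blocking, with a two-phase structure similar to Two-Phase Commit, which is itself a transaction commit protocol rather than a checkpointing algorithm)
Chandy-Lamport snapshot algorithm (non-blocking, uses marker messages over FIFO channels)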
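
To make the marker-based snapshot concrete, here is a small single-threaded simulation in the spirit of the Chandy-Lamport algorithm. It is a sketch under stated assumptions: channels are reliable and FIFO, process state is a single account balance, and the Process, Network, snapshot_total, and demo names are invented for this example.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, pid, balance, network):
        self.pid = pid
        self.balance = balance
        self.network = network
        self.recorded_state = None   # local state captured by the snapshot
        self.channel_state = {}      # sender pid -> messages caught in transit
        self.recording = set()       # incoming channels still being recorded

    def send_money(self, dst, amount):
        self.balance -= amount
        self.network.send(self.pid, dst, amount)

    def start_snapshot(self):
        # Record local state, then send a marker on every outgoing channel.
        self.recorded_state = self.balance
        others = [p for p in self.network.processes if p != self.pid]
        self.recording = set(others)
        self.channel_state = {p: [] for p in others}
        for p in others:
            self.network.send(self.pid, p, MARKER)

    def receive(self, src, msg):
        if msg == MARKER:
            if self.recorded_state is None:
                self.start_snapshot()      # first marker seen: record now
            self.recording.discard(src)    # stop recording channel src -> self
        else:
            self.balance += msg
            if self.recorded_state is not None and src in self.recording:
                self.channel_state[src].append(msg)  # message was in transit


class Network:
    def __init__(self):
        self.processes = {}
        self.channels = {}   # (src, dst) -> FIFO queue

    def add(self, pid, balance):
        self.processes[pid] = Process(pid, balance, self)

    def send(self, src, dst, msg):
        self.channels.setdefault((src, dst), deque()).append(msg)

    def deliver_all(self):
        # Deliver one message per channel per pass until all queues are empty.
        progressed = True
        while progressed:
            progressed = False
            for (src, dst), queue in list(self.channels.items()):
                if queue:
                    self.processes[dst].receive(src, queue.popleft())
                    progressed = True


def snapshot_total(net):
    """Money in the snapshot: recorded balances plus recorded in-transit messages."""
    return sum(
        p.recorded_state + sum(sum(msgs) for msgs in p.channel_state.values())
        for p in net.processes.values()
    )


def demo():
    net = Network()
    net.add("A", 100)
    net.add("B", 50)
    net.processes["A"].send_money("B", 30)   # in transit when B snapshots
    net.processes["B"].start_snapshot()
    net.processes["B"].send_money("A", 5)    # sent after B's marker
    net.deliver_all()
    return net
```

In demo, the transfer of 30 is in flight when B records its state, so it is captured as channel state rather than lost. The snapshot therefore accounts for the full 150 units even though no process ever paused.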

Uncoordinated Checkpointing

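Definition: Each process takes checkpoints independently, without synchronizing with other processes.

Faster at checkpoint time, since no coordination is required.
Risk of cascading rollbacks (the domino effect), because independently taken checkpoints may not form a consistent global state.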

Example:

Each process saving its own state on a local timer, with no checkpoint-related messages exchanged.
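
The domino effect can be shown with a short rollback-propagation sketch. The model is an assumption made for illustration: every process numbers its local events 1, 2, 3, ..., a checkpoint at position c preserves all events up to c, every process has an initial checkpoint at position 0, and recovery_line is a hypothetical helper name.

```python
def recovery_line(checkpoints, messages, failed):
    """Compute where each process must restart after `failed` crashes.

    checkpoints: {pid: sorted checkpoint positions, starting with 0}
    messages:    list of (sender, send_pos, receiver, recv_pos)
    Returns {pid: restart position}; infinity means "no rollback needed".
    """
    line = {pid: float("inf") for pid in checkpoints}
    line[failed] = checkpoints[failed][-1]   # restart from latest checkpoint
    changed = True
    while changed:
        changed = False
        for sender, send_pos, receiver, recv_pos in messages:
            # Orphan message: its send was undone, but its receipt is still
            # part of the receiver's state. The receiver must roll back too.
            if send_pos > line[sender] and recv_pos <= line[receiver]:
                line[receiver] = max(
                    c for c in checkpoints[receiver] if c < recv_pos
                )
                changed = True
    return line


# Two processes whose messages criss-cross between their checkpoints.
checkpoints = {"P": [0, 2, 5], "Q": [0, 3, 6]}
messages = [
    ("Q", 1, "P", 1),
    ("P", 3, "Q", 2),
    ("Q", 4, "P", 4),
    ("P", 6, "Q", 5),
]
result = recovery_line(checkpoints, messages, failed="P")
```

In this scenario each rollback orphans another message, so both processes fall all the way back to their initial state (position 0) even though each had taken two later checkpoints.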

Communication-Induced Checkpointing

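Definition: A hybrid approach in which processes take checkpoints independently but are also forced to take extra checkpoints, triggered by information piggybacked on application messages.

Prevents inconsistent states.
Avoids the domino effect from uncoordinated checkpointing.
Higher overhead, since every message carries tracking data and forced checkpoints can be frequent.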

Example:

Index-based protocols such as BCS (Briatico, Ciuffoletti, Simoncini), where each message carries the sender's checkpoint index.
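
A minimal sketch of an index-based rule, in the style usually attributed to Briatico, Ciuffoletti, and Simoncini, is shown below. The class name CicProcess, the message layout, and the integer state are assumptions for illustration. The rule is that every message carries the sender's checkpoint index, and a receiver that sees a higher index than its own takes a forced checkpoint before processing the message.

```python
class CicProcess:
    def __init__(self, pid):
        self.pid = pid
        self.index = 0    # current checkpoint index
        self.state = 0    # toy application state
        self.checkpoints = [(0, 0, "initial")]   # (index, state, kind)

    def basic_checkpoint(self):
        """Independent (local) checkpoint: advance the index and save."""
        self.index += 1
        self.checkpoints.append((self.index, self.state, "basic"))

    def send(self, payload):
        """Piggyback the current checkpoint index on the message."""
        return {"payload": payload, "index": self.index}

    def receive(self, message):
        # Forced checkpoint BEFORE delivery if the sender is in a later
        # checkpoint interval; this keeps same-index checkpoints consistent.
        if message["index"] > self.index:
            self.index = message["index"]
            self.checkpoints.append((self.index, self.state, "forced"))
        self.state += message["payload"]


p, q = CicProcess("p"), CicProcess("q")
p.basic_checkpoint()       # p moves to index 1
q.receive(p.send(5))       # q is forced to checkpoint, then applies 5
q.receive(p.send(7))       # same index: no extra checkpoint
```

Because q saves its state before applying the first message, the checkpoints with index 1 on p and q do not contain a message that was received but never sent.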

Application-Level Checkpointing

Definition: The application itself decides what state to save and when, rather than relying on the operating system or runtime.

Allows checkpointing to be customized for each application.
Efficient for high-performance computing (HPC), since only essential data is saved.
Requires developer implementation.

Example:

An MPI (Message Passing Interface) application that writes its own solver state to disk at regular intervals.
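
The sketch below illustrates the application-level idea for a toy iterative solver. The HeatSolver class and its to_checkpoint and from_checkpoint methods are hypothetical names, and JSON is an arbitrary choice of format. The point is that the application saves only the step counter and the grid, and rebuilds its scratch buffer on restore instead of saving it.

```python
import json


class HeatSolver:
    def __init__(self, size):
        self.step = 0
        self.grid = [0.0] * size
        self.grid[0] = 100.0             # fixed hot boundary
        self._scratch = [0.0] * size     # work buffer, cheap to rebuild

    def advance(self):
        """One relaxation step: each interior cell becomes its neighbors' mean."""
        n = len(self.grid)
        for i in range(1, n - 1):
            self._scratch[i] = (self.grid[i - 1] + self.grid[i + 1]) / 2
        self._scratch[0] = self.grid[0]
        self._scratch[-1] = self.grid[-1]
        self.grid, self._scratch = self._scratch, self.grid
        self.step += 1

    def to_checkpoint(self):
        """Serialize only the essential state (not the scratch buffer)."""
        return json.dumps({"step": self.step, "grid": self.grid})

    @classmethod
    def from_checkpoint(cls, blob):
        """Rebuild a solver from a checkpoint; scratch space is recreated."""
        data = json.loads(blob)
        solver = cls(len(data["grid"]))
        solver.step = data["step"]
        solver.grid = data["grid"]
        return solver


original = HeatSolver(8)
for _ in range(5):
    original.advance()
blob = original.to_checkpoint()          # checkpoint after step 5
for _ in range(5):
    original.advance()

restored = HeatSolver.from_checkpoint(blob)
for _ in range(5):
    restored.advance()                   # redo steps 6-10 from the checkpoint
```

The restored solver reaches the same grid as the uninterrupted one, which is the property a real application-level checkpoint needs to preserve.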

Checkpoint Levels in Distributed Systems

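Process-Level Checkpointing: Saves the state of individual processes (memory, registers, open files).
System-Level Checkpointing: Performed by the operating system or a virtualization layer, transparently to the application; captures the full process or machine image.
Application-Level Checkpointing: Performed by the application's own code, which saves only the data it needs to resume.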

Tools for Checkpointing in Distributed Systems

1. DMTCP (Distributed MultiThreaded Checkpointing)

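Transparent user-level checkpointing; applications do not need to be modified.
Supports multi-threaded and distributed applications, including parallel computing systems.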

2. CRIU (Checkpoint/Restore in Userspace)

Process-level checkpointing for Linux.
Saves the state of a running process to disk and later resumes execution from it.

3. BLCR (Berkeley Lab Checkpoint/Restart)

Kernel-level checkpointing for HPC systems.
Works with MPI applications.

4. Hadoop Checkpointing

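Used in HDFS (Hadoop Distributed File System) for fault tolerance: a checkpoint merges the NameNode's edit log into its namespace image (fsimage), which keeps restart and recovery time bounded.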
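
In HDFS, the NameNode keeps its namespace as an image (fsimage) plus a log of later edits, and a checkpoint folds the log into a new image. The following sketch shows only that merge idea. It does not use any Hadoop API: the set-of-paths image, the (operation, path) edit tuples, and the function names are simplifying assumptions.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to an in-memory namespace (a set of paths)."""
    operation, path = edit
    if operation == "create":
        namespace.add(path)
    elif operation == "delete":
        namespace.discard(path)
    else:
        raise ValueError(f"unknown operation: {operation}")


def checkpoint(fsimage, edit_log):
    """Merge the edit log into a new image and return it with an empty log."""
    new_image = set(fsimage)          # leave the old image untouched
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []


image = {"/data/a.txt"}
log = [("create", "/data/b.txt"), ("delete", "/data/a.txt")]
image, log = checkpoint(image, log)
```

After the merge, a restart only has to load the new image and replay edits logged since the checkpoint, rather than the whole history.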

Conclusion

Checkpointing reduces the impact of failures by allowing recovery from saved states. The choice of checkpointing type and tool depends on the system's requirements for speed, reliability, and overhead.
