DIZNR INTERNATIONAL

Distributed Computing: Distributed System Checkpoints, Their Types, Checkpoint Levels, and Tools


https://www.gyanodhan.com/video/7A2.%20Computer%20Science/Distributed%20Computing/301.%20Day%2007%20Part%2004%20Distributed%20System%20Checkpoints%20and%20it%27s%20type%20checkpoint%20levels%20its%20%20toolsdistributed.mp4

Distributed Computing: Checkpoints in Distributed Systems

What is Checkpointing in Distributed Systems?

Checkpointing is a fault-tolerance mechanism in distributed computing that periodically saves the system state. If a failure occurs, the system can restart from the last checkpoint instead of starting from scratch.

Key Idea:
 Saves system state at intervals.
 Reduces computation loss during failures.
 Speeds up recovery in distributed systems.
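
The idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the pickle file format, and the `crash_at` parameter used to simulate a failure are all assumptions made for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temp file, then rename it over the old checkpoint.

    The rename means a crash during the write does not corrupt the
    previous checkpoint.
    """
    dir_name = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, crash_at=None):
    """Sum 0..total_steps-1, saving a checkpoint every `interval` steps."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["total"]
```

If `run` fails partway through, calling it again with the same path resumes from the last saved step instead of step 0. The work done since that checkpoint is repeated, which is the trade-off controlled by `interval`.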

Types of Checkpoints in Distributed Systems

Coordinated Checkpointing

Definition: All nodes in the system synchronize before saving their states, so the saved checkpoints together form a consistent global state.

Ensures consistency (no orphan or lost messages).
Used to take global snapshots.
Slower due to coordination overhead, since nodes may have to pause until every checkpoint is saved.

Example:
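A rough sketch of a two-phase coordinated checkpoint, simulated inside one Python program. The `Process` class, its `healthy` flag, and the `coordinated_checkpoint` function are hypothetical names for this illustration. A real protocol would exchange network messages and would also have to deal with messages in transit between the two phases, which this sketch leaves out.

```python
import copy


class Process:
    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.tentative = None   # checkpoint taken but not yet committed
        self.committed = None   # last checkpoint that all processes agreed on
        self.healthy = True     # set to False to simulate a process that cannot checkpoint

    def take_tentative(self):
        if not self.healthy:
            return False
        self.tentative = copy.deepcopy(self.state)
        return True

    def commit(self):
        self.committed, self.tentative = self.tentative, None

    def abort(self):
        self.tentative = None

    def rollback(self):
        # Assumes at least one checkpoint has been committed.
        self.state = copy.deepcopy(self.committed)


def coordinated_checkpoint(processes):
    """Phase 1: every process takes a tentative checkpoint.
    Phase 2: commit everywhere only if phase 1 succeeded everywhere."""
    results = [p.take_tentative() for p in processes]  # ask all, no short-circuit
    ok = all(results)
    for p in processes:
        if ok:
            p.commit()
        else:
            p.abort()
    return ok
```

Because a checkpoint is committed only when every process has saved one, the committed checkpoints always belong to the same round, at the cost of waiting for the slowest process.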

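Uncoordinated Checkpointing

Definition: Each process takes checkpoints independently without synchronization.

Faster during normal operation, because no coordination is required.
Risk of cascading rollbacks (domino effect): rolling back one process can force others to roll back as well, in the worst case all the way to the initial state.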

Example:
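The domino effect can be illustrated with a small rollback calculation. This is a simplified model with assumed conventions: checkpoint 0 is the initial state, "interval i" means the time after checkpoint i and before checkpoint i+1, and `recovery_line` is a hypothetical helper. It starts from every process's latest checkpoint and rolls receivers back until no orphan message remains, that is, no message whose receipt is kept while its send has been undone.

```python
def recovery_line(latest, messages):
    """Find the most recent consistent set of checkpoints.

    latest:   dict mapping process name -> index of its latest checkpoint
    messages: list of (sender, send_interval, receiver, recv_interval)
    Returns a dict mapping process name -> checkpoint index to restore.
    """
    line = dict(latest)
    changed = True
    while changed:
        changed = False
        for sender, s_int, receiver, r_int in messages:
            send_undone = s_int >= line[sender]    # sent after sender's restored checkpoint
            recv_kept = r_int < line[receiver]     # received before receiver's restored checkpoint
            if send_undone and recv_kept:
                # Orphan message: roll the receiver back to before the receipt.
                line[receiver] = r_int
                changed = True
    return line


# Two processes with checkpoints taken at unlucky times relative to their messages.
messages = [
    ("Q", 0, "P", 0),
    ("P", 1, "Q", 0),
    ("Q", 1, "P", 1),
    ("P", 2, "Q", 1),
]
print(recovery_line({"P": 2, "Q": 2}, messages))  # {'P': 0, 'Q': 0}
```

In this run each rollback exposes a new orphan message on the other side, so both processes end up at their initial state even though each had saved two checkpoints.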

Communication-Induced Checkpointing

Definition: A hybrid approach in which processes take checkpoints independently but are also forced to take extra checkpoints based on information piggybacked on the messages they exchange.

Prevents inconsistent states.
Avoids the domino effect of uncoordinated checkpointing.
Higher overhead due to message tracking and forced checkpoints.

Example:
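One common way to do this is an index-based scheme, sketched below under simplifying assumptions. Each process numbers its checkpoints and attaches its current number to every message it sends. A receiver that sees a higher number takes a forced checkpoint before delivering the message. The class `CICProcess` and its method names are hypothetical, and messages are plain dictionaries rather than real network traffic.

```python
class CICProcess:
    def __init__(self, name):
        self.name = name
        self.index = 0               # index of the latest checkpoint
        self.state = []              # delivered payloads, standing in for application state
        self.checkpoints = {0: []}   # checkpoint index -> saved copy of the state

    def _save(self, index):
        self.index = index
        self.checkpoints[index] = list(self.state)

    def basic_checkpoint(self):
        """Checkpoint taken on the process's own schedule."""
        self._save(self.index + 1)

    def send(self, payload):
        """Piggyback the current checkpoint index on the outgoing message."""
        return {"payload": payload, "index": self.index}

    def receive(self, message):
        """Deliver a message, taking a forced checkpoint first if the
        sender is already in a later checkpoint interval."""
        forced = message["index"] > self.index
        if forced:
            self._save(message["index"])
        self.state.append(message["payload"])
        return forced


p, q = CICProcess("p"), CICProcess("q")
p.basic_checkpoint()            # p moves to index 1
forced = q.receive(p.send("a"))
print(forced, q.index)          # True 1
```

The forced checkpoint is saved before the message is delivered, so q's checkpoint 1 does not contain a message that p sent after its own checkpoint 1. In this example, that keeps the two checkpoints with index 1 free of orphan messages.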

Application-Level Checkpointing

Definition: The application itself decides what state to save and when, rather than relying on the operating system or runtime.

Allows customized checkpointing in applications.
Efficient for high-performance computing (HPC), because only the data needed for restart is saved.
Requires developer implementation.

Example:
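A small sketch of an application that chooses what goes into its checkpoint. The `RunningMean` class, its `to_checkpoint` and `from_checkpoint` methods, and the JSON file format are assumptions made for illustration. The derived `_cache` is deliberately left out of the checkpoint because it can be rebuilt.

```python
import json


class RunningMean:
    """Toy computation: running mean over a stream of numbers."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self._cache = {}  # derived data, cheap to rebuild, so not checkpointed

    def update(self, value):
        self.count += 1
        self.total += value
        self._cache["mean"] = self.total / self.count

    def mean(self):
        return self.total / self.count if self.count else 0.0

    def to_checkpoint(self):
        """Return only the fields needed to resume the computation."""
        return {"count": self.count, "total": self.total}

    @classmethod
    def from_checkpoint(cls, data):
        obj = cls()
        obj.count = data["count"]
        obj.total = data["total"]
        return obj


def save(obj, path):
    with open(path, "w") as f:
        json.dump(obj.to_checkpoint(), f)


def load(path):
    with open(path) as f:
        return RunningMean.from_checkpoint(json.load(f))
```

The checkpoint here is two numbers, however large the process's memory image is. That is the main efficiency argument for application-level checkpointing, and the hand-written save and restore code is its cost.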

Checkpoint Levels in Distributed Systems

Process-Level Checkpointing: Saves the state of individual processes.
System-Level Checkpointing: Saves the state of the whole machine, such as an OS or VM snapshot.
Application-Level Checkpointing: Saves only the state that the application chooses to save.

Tools for Checkpointing in Distributed Systems

1. DMTCP (Distributed MultiThreaded Checkpointing)

2. CRIU (Checkpoint/Restore in Userspace)

3. BLCR (Berkeley Lab Checkpoint/Restart)

4. Hadoop Checkpointing

Conclusion

Checkpointing limits the impact of failures by allowing a system to recover from a saved state. The choice of checkpointing type and tool depends on the system's requirements for speed, reliability, and overhead.


Distributed Computing involves multiple computer systems working together to achieve a common goal. One key challenge in such systems is ensuring fault tolerance, which is where checkpoints come in.


What is a Checkpoint in Distributed Systems?

A checkpoint is a saved state of a process or the entire system at a specific point in time. If a failure occurs, the system can roll back to the last checkpoint rather than starting over.


Purpose of Checkpointing


Types of Checkpoints in Distributed Systems

1. Local Checkpoints

2. Global Checkpoints


Checkpointing Levels

| Level | Description | Use Case |
| --- | --- | --- |
| Application-level | App explicitly saves state | Custom control, efficient for app-specific logic |
| Library-level | Uses a library (like DMTCP) | Transparent to app, often used in HPC |
| System-level | OS or VM-level snapshots | No modification to app, broader but heavier |
| Hardware-level | Hardware saves memory states | Fastest, but rare and hardware-dependent |

Types of Checkpointing Techniques

1. Coordinated Checkpointing

2. Uncoordinated Checkpointing

3. Communication-Induced Checkpointing


Tools for Checkpointing in Distributed Systems

| Tool/Library | Description |
| --- | --- |
| BLCR (Berkeley Lab Checkpoint/Restart) | Kernel-level checkpointing for Linux |
| CRIU (Checkpoint/Restore In Userspace) | Linux tool to freeze running apps and store state |
| DMTCP (Distributed MultiThreaded CheckPointing) | Transparent user-level checkpointing for distributed and multi-threaded apps |
| OpenMPI Checkpointing | MPI-based applications using BLCR for fault tolerance |
| LAM/MPI | Supports checkpoint/restart via coordination in MPI apps |
| Docker Checkpoint/Restore | Uses CRIU under the hood to checkpoint containers |

Challenges in Checkpointing


Summary

Checkpointing is a critical technique in distributed systems for ensuring reliability and resilience. Depending on the application’s complexity and requirements, different checkpointing strategies and tools can be used.


