AI Dev Tools
Continuous Checkpointing in Orbax and MaxText: Halves Checkpoint Intervals, Saves Hours of Lost Work on TPU Failures
On a dual-slice v5p-128 TPU cluster training Llama 3.1 70B, continuous checkpointing cut the median (P50) checkpoint interval from 100 steps to under 50 without hurting goodput. Here's why this asynchronous approach matters for large-scale LLM training.
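To make the idea concrete, here is a minimal sketch of how frequent asynchronous checkpointing can be configured with Orbax's CheckpointManager. The checkpoint directory, step counts, and the init_train_state/train_step functions are illustrative placeholders, and MaxText drives this through its own training loop and config rather than this exact code.

```python
import orbax.checkpoint as ocp

# Hypothetical paths and step counts, for illustration only.
CHECKPOINT_DIR = "gs://my-bucket/llama3-70b/checkpoints"
TOTAL_STEPS = 10_000

# Save every 50 steps; recent Orbax versions perform saves asynchronously,
# so the write overlaps with subsequent training steps instead of blocking them.
options = ocp.CheckpointManagerOptions(save_interval_steps=50, max_to_keep=5)
manager = ocp.CheckpointManager(CHECKPOINT_DIR, options=options)

state = init_train_state()  # placeholder: your model/optimizer state pytree

for step in range(TOTAL_STEPS):
    state = train_step(state)  # placeholder training step
    # save() only writes on multiples of save_interval_steps.
    manager.save(step, args=ocp.args.StandardSave(state))

# Block until all in-flight async saves have landed before exiting.
manager.wait_until_finished()
```

The key knob is save_interval_steps: halving it (e.g. 100 to 50) halves the maximum amount of training lost to a failure, and because the save is asynchronous the extra frequency costs little step time.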