🤖 AI Dev Tools

Continuous Checkpointing in Orbax and MaxText: Halves Checkpoint Gaps, Saves Hours on TPU Failures

On a dual-slice v5p-128 TPU cluster training Llama 3.1 70B, continuous checkpointing slashed the P50 checkpoint interval from 100 steps to under 50 without tanking goodput. Here's how this async technique changes the failure-recovery math of large-scale LLM training.

Figure: benchmark of continuous vs. fixed checkpoint intervals on a v5p-128 TPU cluster training Llama 3.1 70B.
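For a concrete picture of the mechanism, here is a minimal sketch of an asynchronously checkpointed JAX training loop built on Orbax's CheckpointManager. The directory, cadence, and toy state are illustrative assumptions, not values from the benchmark above.

```python
import numpy as np
import orbax.checkpoint as ocp

# Minimal sketch: frequent, non-blocking checkpointing with Orbax.
# With enable_async_checkpointing, save() snapshots the state and returns
# control to the training loop while the write finishes in the background.
options = ocp.CheckpointManagerOptions(
    save_interval_steps=50,           # illustrative: half a fixed 100-step cadence
    max_to_keep=3,                    # prune older checkpoints to bound storage
    enable_async_checkpointing=True,
)
mngr = ocp.CheckpointManager('/tmp/ckpts', options=options)

state = {'step': 0, 'params': np.zeros(4)}  # toy stand-in for a real train state
for step in range(200):
    state['step'] = step                    # toy stand-in for a real train step
    mngr.save(step, args=ocp.args.StandardSave(state))  # no-op on non-save steps

mngr.wait_until_finished()  # flush any pending async write before exiting
```

Because the blocking cost of each save is mostly the device-to-host copy, the interval can shrink without a matching hit to step time, which is the core of the goodput argument above.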

⚡ Key Takeaways

  • Continuous checkpointing halves the P50 checkpoint interval on v5p TPUs, cutting the work lost to each failure.
  • Async saves sidestep DCN bottlenecks in multi-slice runs, scaling better than fixed schedules.
  • Orbax's policy flexibility, from minimum save intervals to custom preservation rules, fits any training scale (see the sketch after this list).
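The third takeaway maps onto Orbax's CheckpointManagerOptions. A hedged sketch follows, assuming a recent orbax-checkpoint that exposes save-decision and preservation policies under ocp.checkpoint_managers; the names ContinuousCheckpointingPolicy, minimum_interval_secs, and LatestN follow the feature this article describes, so verify them against your installed version.

```python
import orbax.checkpoint as ocp

# Hedged sketch, assuming an orbax-checkpoint release that ships these policy
# classes under ocp.checkpoint_managers (check your version's API).
options = ocp.CheckpointManagerOptions(
    # When to save: as soon as the previous save completes, but no more often
    # than once every minimum_interval_secs, so cadence tracks wall-clock time
    # rather than a fixed step count.
    save_decision_policy=ocp.checkpoint_managers.ContinuousCheckpointingPolicy(
        minimum_interval_secs=180,
    ),
    # What to keep: an independent preservation policy, here the newest five.
    preservation_policy=ocp.checkpoint_managers.LatestN(n=5),
    enable_async_checkpointing=True,
)
mngr = ocp.CheckpointManager('/tmp/ckpts', options=options)
```

Splitting "when to save" from "what to keep" is what lets the same manager serve a small debugging run and a multi-slice production job with only a policy swap.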
Published by theAIcatchup

Originally reported by Google Developers Blog
