Rust Game Server Leaks 1.2MB/Sec: Config Fixes

1.2MB per second. That's not a typo. A Rust game server developed a memory leak, proving even the compiler can't save you from your own bad decisions.

Key Takeaways

  • Rust's borrow checker doesn't prevent memory leaks; a queue that only ever grows is perfectly safe code, and Tokio's unbounded channels make it easy to write one.
  • Tokio's mpsc channels have no default capacity: mpsc::channel takes an explicit size, and unbounded_channel has no limit at all. Under load, an unbounded queue becomes a severe leak.
  • Implementing explicit backpressure mechanisms (like Semaphores) is crucial for managing resources and preventing OOM errors in asynchronous systems.

1.2MB a second. Not a typo. Veltrix’s game server was hemorrhaging memory. /proc/self/status showed resident memory climbing without pause. No panics. No stack traces. Just… gone. Vanished RAM. And you know what’s worse? They were using Rust. With Tokio. The language of safety. The borrow checker’s supposed to be your digital guardian angel. Turns out, even angels get blindfolded by user error.

Here’s the punchline: the game loop looked innocent. while let Some(player) = next_player() { handle_player(player); }. Innocent, right? Wrong. next_player wasn’t some magical function. It was a tokio::sync::mpsc receiver. And its channel? Supposedly tuned to 1024. Except it wasn’t: the channel was unbounded. Tokio has no default capacity (mpsc::channel takes an explicit size; only unbounded_channel has none). Unbounded. So every message, every player interaction, just sat there. Buffering. Forever.
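To make the bounded/unbounded difference concrete, here is a minimal sketch. It uses the standard library's channels so it runs without dependencies; that substitution is an assumption for illustration, not the server's actual code. Tokio's mpsc::unbounded_channel and mpsc::channel(n) with try_send behave analogously: the first accepts everything, the second reports Full once n messages are waiting. The function names are hypothetical.

```rust
use std::sync::mpsc::{channel, sync_channel, TrySendError};

/// Sends `n` messages into an unbounded channel that nobody drains
/// and returns how many were accepted.
fn fill_unbounded(n: usize) -> usize {
    // `_rx` keeps the receiver alive but never reads from it.
    let (tx, _rx) = channel::<u64>();
    // Every send succeeds: the queue just grows.
    (0..n).filter(|&i| tx.send(i as u64).is_ok()).count()
}

/// Same producer, but against a bounded channel with `capacity` slots.
/// Returns (accepted, rejected).
fn fill_bounded(n: usize, capacity: usize) -> (usize, usize) {
    let (tx, _rx) = sync_channel::<u64>(capacity);
    let mut accepted = 0;
    let mut rejected = 0;
    for i in 0..n {
        match tx.try_send(i as u64) {
            Ok(()) => accepted += 1,
            // The queue is full: the producer finds out immediately.
            Err(TrySendError::Full(_)) => rejected += 1,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    (accepted, rejected)
}

fn main() {
    println!("unbounded accepted: {}", fill_unbounded(4096));
    let (ok, full) = fill_bounded(4096, 1024);
    println!("bounded accepted: {}, rejected: {}", ok, full);
}
```

With nobody draining the queue, the unbounded channel swallows all 4,096 messages without complaint, while the bounded one stops at its capacity and tells the producer about the rest. That rejection is the signal the unbounded version never sends.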

The moment of truth came with tokio-console. 4,096 pending move requests. A small struct, sure. But thousands of them, never drained, eat into your precious gigabytes fast. Must be a runtime problem, they thought. Let’s swap Tokio’s scheduler. Multi-threaded to current-thread. Fewer threads, fewer allocations. It nudged things down by 15%. The leak? Still there. Stubborn. Like a bad habit.

Attempt two: explicit limits. let (tx, rx) = tokio::sync::mpsc::channel(128);. Naive. Production traffic hit with 500 concurrent players. Backpressure. Timeouts. Players got kicked because the pipe was full. Then they tried 1024. Seemed reasonable. It worked for a week. Then the memory climbed again. The real culprit wasn’t the buffer size. It was the lifetime.

Every message held a String for a session token. Player disconnects? Drop the sender. But the channel’s buffer? It still held every message already queued. Leaking session tokens. With every disconnect. You see the pattern yet? Every message in the channel was wrapped in an Arc<Message>. Drop the sender, and the Arc still kept the message alive. Until it was processed. With 10k players per match, that’s 10k strings just… waiting. In limbo. For no good reason.
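The lifetime issue is easy to reproduce in isolation. The sketch below again uses the standard library's channel as a stand-in (an assumption; Tokio's mpsc likewise hands over buffered messages after every sender is gone). MoveRequest and token_lifetime are hypothetical names. A Weak pointer acts as a probe to check whether the message is still allocated.

```rust
use std::sync::mpsc::channel;
use std::sync::{Arc, Weak};

// Hypothetical message type: a move request carrying a session token.
struct MoveRequest {
    session_token: String,
}

/// Queues one message, drops the sender, and reports whether the message
/// is still alive (a) while it sits in the buffer and (b) after the
/// receiver has consumed and dropped it.
fn token_lifetime() -> (bool, bool) {
    let (tx, rx) = channel::<Arc<MoveRequest>>();
    let msg = Arc::new(MoveRequest {
        session_token: "token-123".to_string(),
    });
    let probe: Weak<MoveRequest> = Arc::downgrade(&msg);

    tx.send(msg).unwrap();
    drop(tx); // the player disconnects

    // The buffer still owns the message, so the token is still allocated.
    let alive_while_queued = probe.upgrade().is_some();

    // Draining the queue is what finally frees it.
    while let Ok(m) = rx.recv() {
        let _ = m.session_token.len(); // stand-in for real handling
    }
    let alive_after_drain = probe.upgrade().is_some();

    (alive_while_queued, alive_after_drain)
}

fn main() {
    let (queued, drained) = token_lifetime();
    println!("alive while queued: {}, alive after drain: {}", queued, drained);
}
```

Dropping the sender changes nothing for messages already in flight. Only the consumer catching up, or the receiver itself being dropped, releases them.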

So what was the fix? Not just fiddling with numbers. It was about ownership. They kept tokio::sync::mpsc::unbounded_channel rather than going back to a bounded channel. But they bolted on a backpressure layer. Using Semaphore.

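use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

// Runs inside an async fn that returns a Result (needed for `?`).
// At most 1024 handlers may be in flight at once.
let sem = Arc::new(Semaphore::new(1024));
let (tx, mut rx) = mpsc::unbounded_channel();
while let Some(player) = rx.recv().await {
    // Waits here until a permit is free.
    let permit = sem.clone().acquire_owned().await?;
    tokio::spawn(async move {
        handle_player(player).await;
        drop(permit); // Release capacity
    });
}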

No more unbounded growth. The semaphore capped concurrency at 1,024 in-flight handlers. Rust’s ownership ensured each message, session token included, was dropped the moment its handler finished, just before the permit was released. Elegant. Almost. But it came with a price tag. Explicit backpressure. Players got Service Unavailable when the semaphore choked.

Telemetry became essential. You had to see the backpressure.

if sem.available_permits() == 0 {
    metrics::counter!("backpressure_rejects").increment(1);
}
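Rejecting a request, rather than making it wait, needs a non-blocking acquire. In Tokio that is Semaphore::try_acquire_owned, which returns an error when no permit is free. The sketch below is a hand-rolled, std-only stand-in for that pattern, with the reject counter built in; Limiter, Permit and simulate are hypothetical names, and none of this is Veltrix's code.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::Arc;

/// Hand-rolled stand-in for a semaphore with a non-blocking acquire.
struct Limiter {
    available: AtomicUsize,
    rejects: AtomicU64,
}

/// Returned by a successful acquire; gives the slot back when dropped.
struct Permit {
    limiter: Arc<Limiter>,
}

impl Limiter {
    fn new(permits: usize) -> Arc<Self> {
        Arc::new(Limiter {
            available: AtomicUsize::new(permits),
            rejects: AtomicU64::new(0),
        })
    }

    /// Takes a permit if one is free; otherwise counts a reject and returns None.
    fn try_acquire(self: &Arc<Self>) -> Option<Permit> {
        let took = self
            .available
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| n.checked_sub(1))
            .is_ok();
        if took {
            Some(Permit { limiter: Arc::clone(self) })
        } else {
            self.rejects.fetch_add(1, Ordering::Relaxed);
            None
        }
    }

    fn rejects(&self) -> u64 {
        self.rejects.load(Ordering::Relaxed)
    }
}

impl Drop for Permit {
    fn drop(&mut self) {
        self.limiter.available.fetch_add(1, Ordering::Release);
    }
}

/// Simulates `requests` arrivals with room for `capacity` in-flight handlers,
/// none of which finish. Returns (admitted, rejected).
fn simulate(requests: usize, capacity: usize) -> (usize, u64) {
    let limiter = Limiter::new(capacity);
    let mut in_flight = Vec::new();
    for _ in 0..requests {
        if let Some(permit) = limiter.try_acquire() {
            in_flight.push(permit); // holding the permit keeps the slot taken
        }
    }
    (in_flight.len(), limiter.rejects())
}

fn main() {
    let (admitted, rejected) = simulate(1500, 1024);
    println!("admitted: {}, rejected: {}", admitted, rejected);
}
```

Counting at the point of refusal records every rejected request. Polling for zero available permits only records that the limiter was saturated at the moment of the check.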

And the results? Memory stabilized. The 4.2GB RSS before? Down to 1.8GB. Channel latency p99 dropped from 12ms to 8ms. Backpressure rejections? They went from zero to 23 per minute at peak. A trade-off, but a necessary one for stability. A 24-hour load test with 50k players. RSS topped out at 2.1GB. jemalloc reported zero leaks after shutdown. The semaphore added 4ms to p99 latency when full. Worth it.

The Real Blame Game

Here’s the bitter pill: tokio-console should have been day one. Weeks wasted chasing ghosts. Had they instrumented the server with the console-subscriber crate and attached tokio-console from the start, the piling messages would have been obvious. Immediate.

And defaults? Never trust them when player data is involved. Tokio’s mpsc::channel has no default size; you pass the capacity yourself, and unbounded_channel has no limit at all. Unbounded. Tokio’s timers have millisecond resolution. Jitter. Under load. Every default needs a side-eye. Especially in production.

Rust isn’t a config silver bullet. The compiler stops use-after-free bugs and data races. It does not stop leaks: as far as Rust is concerned, a queue that grows forever is perfectly safe code. Whether a buffer inside Tokio is bounded? That’s your problem. Configuration is an ownership problem. Not a runtime one. And never, ever assume your game loop is safe just because it’s written in Rust. It’s not.

Metric                         Before   After
Allocated heap (RSS)           4.2GB    1.8GB
GC cycles (if we’d used GC)    N/A      0
Channel latency p99            12ms     8ms
Backpressure rejections        0        23 per minute at peak

Why Did This Happen?

It’s simple, really. They trusted defaults, and they trusted the language to prevent every memory issue. Rust’s borrow checker is phenomenal at preventing data races and use-after-free bugs. It has nothing to say about how many messages an asynchronous runtime like Tokio is buffering on your behalf, or about how long an Arc keeps shared data alive. An unbounded tokio::sync::mpsc channel is a queue where messages can pile up indefinitely, and each queued Arc keeps its message (and the session token inside it) alive until the last reference is gone, even after the original sender has been dropped. So disconnecting a player didn’t clean up their session token: the channel’s buffer still held a reference to it.

What’s the Takeaway for Developers?

The main takeaway is that even with powerful languages like Rust, rigorous configuration and deep understanding of the tools you’re using are paramount. Rust provides strong compile-time guarantees, but it’s not a panacea for all software engineering challenges. Developers must remain skeptical of defaults, especially in high-throughput systems. Understanding how asynchronous runtimes manage state, how shared ownership constructs like Arc behave, and implementing explicit backpressure mechanisms are critical. The Veltrix incident is a stark reminder that the compiler can’t save you from a poorly architected or misconfigured system. Observability tools like tokio-console are not optional; they are essential for debugging complex, asynchronous applications.


Written by Sam O'Brien

Programming language and ecosystem reporter. Tracks releases, package managers, and developer community shifts.

Originally reported by dev.to
