Databases & Backend

Cloudflare Billing Slowdown: ClickHouse Bottleneck Revealed

A seemingly routine database change plunged Cloudflare's billing pipeline into chaos. The culprit? A hidden ClickHouse bottleneck no one saw coming.


Key Takeaways

  • A change to ClickHouse's partitioning key, intended to enable per-namespace retention, inadvertently caused lock contention in query planning.
  • Standard performance metrics (I/O, memory, rows scanned) all looked normal, leaving the real bottleneck invisible to routine monitoring.
  • Cloudflare engineers had to develop custom patches for ClickHouse to resolve the critical billing pipeline slowdown.

The great ClickHouse grind.

That’s what happened. Cloudflare, a company that practically built the internet’s backbone, watched its billing jobs grind to a halt. And the villain? Not some exotic microservice failure, but a database change: a humble column added to a partitioning key. The irony is thick enough to spread on toast.

Here’s the thing: these weren’t just any jobs. These were the ones that make the company money. The ones that send out bills. When they slow down, empires crumble. Or at least, they start sweating profusely.

The Tale of the Additive Column

Cloudflare, bless their hearts, use ClickHouse. A lot. Over a hundred petabytes. They built this thing called “Ready-Analytics” to simplify data streaming for internal teams. Simple concept: one massive table, data distinguished by namespace. Clever. It’s also got a primary key that looks like (namespace, indexID, timestamp). Standard stuff, really.
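
For the uninitiated, here’s roughly what such a shared, multi-tenant table looks like in ClickHouse DDL. This is a minimal sketch with invented names and types, not Cloudflare’s actual schema:

```sql
-- Hypothetical sketch of a shared multi-tenant table in this style.
CREATE TABLE ready_analytics
(
    namespace  LowCardinality(String),  -- which internal team owns the row
    indexID    UInt64,
    timestamp  DateTime,
    payload    String                   -- stand-in for the real columns
)
ENGINE = MergeTree
PARTITION BY toDate(timestamp)             -- the original day-based partitioning
ORDER BY (namespace, indexID, timestamp);  -- the primary key from the article
```

One table, every team’s data, a namespace column to keep tenants apart. Elegant, until it isn’t.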

But the retention policy was a relic. A single, inflexible 31-day limit. Some teams needed years. Others, days. So, they decided to get fancy. Per-namespace retention. The obvious move was to change the partitioning key from (day) to (namespace, day). Seemed logical. Every query is filtered by namespace, so more parts shouldn’t matter, right? Famous last words.

We made a key assumption: since every query is filtered by a specific namespace, the number of parts read by any single query shouldn’t change.
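
ClickHouse won’t let you change a table’s partitioning key in place, so a change like this means standing up a new table. Sketched against the hypothetical schema above, the before-and-after is a single line:

```sql
-- Hypothetical 'after' table; the only meaningful change is PARTITION BY.
CREATE TABLE ready_analytics_v2
(
    namespace  LowCardinality(String),
    indexID    UInt64,
    timestamp  DateTime,
    payload    String
)
ENGINE = MergeTree
PARTITION BY (namespace, toDate(timestamp))  -- was: toDate(timestamp)
ORDER BY (namespace, indexID, timestamp);
```

Far more partitions in total, but, per the assumption above, no single query should read more of them.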

Ah, assumptions. The bedrock of every spectacular tech failure. The change was supposed to be elegant, a graceful solution to a nagging problem. It allowed per-namespace TTLs, tidier storage management, and more predictable cluster utilization. They even used ClickHouse’s Merge table feature to migrate. Smooth sailing, they thought.
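
Two pieces of that plan can be sketched concretely, with the usual caveat: these are illustrative statements against the hypothetical tables above, not Cloudflare’s migration scripts. A Merge table gives readers one logical view over old and new tables, and per-namespace retention becomes a partition drop:

```sql
-- A Merge table reads from every table matching the regexp, letting
-- queries span old and new tables during the migration.
CREATE TABLE ready_analytics_all AS ready_analytics
ENGINE = Merge(currentDatabase(), '^ready_analytics(_v2)?$');

-- With namespace in the partitioning key, retention is a cheap,
-- per-tenant partition drop (values are made up).
ALTER TABLE ready_analytics_v2
    DROP PARTITION tuple('team_dns', toDate('2024-01-15'));
```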

Then the billing jobs started getting slow. Really slow. And what did they check? The usual suspects. I/O, memory, rows scanned, parts read. All normal. Pristine, even. The system was fine, according to the metrics. Except it wasn’t. It was dying a slow, painful death.
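
If you want to play along at home, this is roughly the triage the article describes: pull recent slow queries from ClickHouse’s query log and eyeball the usual suspects. Generic introspection, with an illustrative filter:

```sql
-- Compare wall time against the standard work metrics for recent queries.
SELECT
    query_duration_ms,
    read_rows,
    read_bytes,
    memory_usage,
    ProfileEvents['SelectedParts'] AS parts_read
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query ILIKE '%billing%'   -- hypothetical way to find the jobs
ORDER BY event_time DESC
LIMIT 20;
```

The trap: every one of those columns measures work done. None of them measures time spent waiting.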

The Hidden Villain: Lock Contention

It took days. Days of staring at dashboards, muttering incantations, and probably questioning every life choice that led to this moment. The issue wasn’t data volume. It wasn’t query complexity. It was lock contention in query planning. Query planning! Something nobody even bothered to monitor because, frankly, it was never a problem. It’s the digital equivalent of finding out your perfectly healthy appendix is trying to kill you.
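
The lesson for your own clusters: per-query metrics count work, not waiting. One generic way to catch waiting is to sample live thread stacks, which ClickHouse exposes in a system table. This is standard introspection, a sketch rather than Cloudflare’s actual diagnosis procedure:

```sql
-- Requires permission to symbolize addresses.
SET allow_introspection_functions = 1;

-- Group running threads by identical stack; a pile of threads parked
-- on the same lock acquisition is the smoking gun metrics won't show.
SELECT
    count() AS threads,
    arrayStringConcat(
        arrayMap(x -> demangle(addressToSymbol(x)), trace), '\n'
    ) AS stack
FROM system.stack_trace
WHERE query_id != ''
GROUP BY trace
ORDER BY threads DESC
LIMIT 5;
```

If dozens of threads share a stack stuck somewhere inside query analysis, you’ve found the kind of invisible bottleneck this story is about.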

This isn’t just a story about a slow database. It’s a story about the hidden complexities lurking in the most fundamental systems. It’s about how a seemingly innocent change can expose long-standing architectural quirks. It’s a stark reminder that even the most sophisticated infrastructure can have blind spots. Especially when you’re dealing with petabytes of data.

Cloudflare’s fix? Patches. They wrote their own patches to ClickHouse. Because of course they did. This isn’t the first time a company has had to patch an open-source tool it relies on. But it’s a testament to their engineering chops. And a warning to everyone else. Assume nothing. Watch everything. Especially the things you think are fine.

This whole debacle brings to mind the early days of relational databases. When complex indexes were touted as the silver bullet, only for applications to grind to a halt due to index maintenance overhead or subtle locking issues. We keep reinventing the wheel, but the friction points often remain the same.

Why Does This Matter for Developers?

So, what’s the takeaway for the average developer staring into the abyss of a slow query? First, trust your gut, but verify with data. When things feel wrong, they probably are. Don’t get fixated on the obvious metrics if the overall system performance is degrading. Second, understand your dependencies. Cloudflare is a massive operation, but even their billing relied on a database component that had a fundamental, undocumented weakness. Know your tools. Know their limitations. And be prepared to dig deeper than you ever thought necessary.

This incident highlights the ever-present tension between feature velocity and system stability. Adding per-namespace retention is a good thing, a necessary business requirement. But the implementation, while seemingly sound on paper, had an unforeseen consequence. It’s a balancing act that organizations constantly perform, and sometimes, they stumble.



Frequently Asked Questions

What is ClickHouse?

ClickHouse is an open-source analytical database management system. It’s designed for high-performance online analytical processing (OLAP) queries on large datasets.

Will this affect Cloudflare’s customers?

While the billing pipeline was affected, it’s unlikely to have directly impacted Cloudflare’s end-users in terms of service availability. The issue was internal to their billing operations.

How did Cloudflare fix the ClickHouse bottleneck?

Cloudflare engineers identified lock contention in ClickHouse’s query planning as the root cause. They then developed and applied custom patches to the ClickHouse software to resolve this specific bottleneck.

Written by Jordan Kim

Cloud and infrastructure correspondent. Covers Kubernetes, DevOps tooling, and platform engineering.



Originally reported by Cloudflare Blog
