
Building a Poor Man's WAL on Azure Storage

So, picture this. You've built a slick web beacon service. It tracks user clicks, email opens, and page views at thousands of events per second. Every event gets pushed to an Azure Storage Queue, your downstream ETL cluster picks it up, processes it, and dumps it into your data warehouse. The whole thing is beautiful - a perfect little conveyor belt of data flowing smoothly from point A to point B.

Life is good. The events are flowing.

Then comes the day every engineer dreads: "The ETL cluster is down."

No problem, you think. You'll restart it, maybe tweak some configs, and be back up in 20 minutes. Production incidents happen.

And that's when you notice the Queue is empty.

The Problem with Ephemeral Queues

Our perfect little pipeline had a fatal flaw. Azure Storage Queue messages have a time-to-live (TTL) - if a message sits in the queue too long without being consumed, it expires and just... disappears. Poof. Gone forever.

So when our ETL cluster went down for a few hours (dependency issues, classic), all those in-transit messages timed out. Hundreds of thousands of events, tracking real user behavior, clicks on emails, critical product analytics - all vanished into the void.

And there was absolutely nothing we could do about it.

Well, almost nothing.

The Backup Plan That Wasn't

Technically, we did have a contingency plan. We were logging every event to Datadog, and we had built a tool that could pull those logs and reprocess them. It wasn't pretty, but it was supposed to be our safety net for exactly this scenario.

The problem? We weren't about to keep months of high-volume logs in Datadog. That would cost more than our entire Hyperscaler bill. So we had configured Datadog to archive logs to cold storage after a few days.

You can see where this is going.

When disaster struck and we fired up our log rehydration tool, we discovered that the cold storage connection had been silently broken for months. No alerts. No warnings. Just... nothing in the archive. Our backup plan had been quietly failing in the background, and we had no idea until the exact moment we needed it most.

So yeah. No backups. No replay. No way to recover. The data was just... gone.

This is the classic problem with ephemeral message queues: they're fast, but they're also forgetful. Once a message is consumed (or expires), it's done. And if your backup plan has been broken for months without anyone noticing, you'd better hope it wasn't anything important.

Spoiler: it's always something important.

"Just Increase the TTL!"

Now, I know what you're thinking. "Why not just increase the message TTL? Make it 7 days instead of 7 hours. Problem solved!"

And yeah, you're not wrong. We could do that. Azure Storage Queues let you crank the message TTL way up. Set it and forget it, right?
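
To be fair, bumping the TTL really is a one-liner. A minimal sketch with the Python SDK (azure-storage-queue) - the queue name and payload here are made up:

from datetime import timedelta
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string("<connection-string>", queue_name="beacon-events")

# Hold messages for 7 days instead of the few hours we had configured.
queue.send_message('{"event": "click"}', time_to_live=int(timedelta(days=7).total_seconds()))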

Except that doesn't actually solve the problem. It just kicks the can down the road.

What happens if the ETL cluster is down for 8 days? Or what if we discover a bug in our processing logic from last month and need to reprocess events from five weeks ago? Or what if we need to replay events for a specific customer for audit purposes?

A longer TTL helps with short outages, sure. But it doesn't give us the ability to go back in time. It doesn't give us a durable audit trail. And it definitely doesn't help when the real issue isn't "the queue timed out" but "we need to reprocess historical data."

We needed something more permanent. Something that would let us replay events on demand, not just hope the queue holds them long enough.

The Obvious Path vs. The Path I Took

The textbook answer here is to use a durable message broker with persistence. Something like Kafka, RabbitMQ with disk-backed queues, or even Event Hubs with capture enabled. When you publish a message, it gets written to disk, and you can replay it days, weeks, or even months later if you need to.

It's a great pattern. It's robust, battle-tested, and exactly what these systems were built for.

But... it also means replacing our entire ingestion stack. Migrating from Azure Storage Queues to Kafka means rewriting both the producer (our web beacon service) and the consumer (our ETL cluster). It means adding Kafka to our infrastructure, managing brokers, tuning retention policies, monitoring disk usage, and explaining to our CFO why we're suddenly paying for three more servers.

For our use case, it felt like overkill. I kept wondering: couldn't we just... keep the data around in case we need it later?

It turns out, we can. And Azure Blob Storage makes it surprisingly cheap.

The "Just Write It Twice" Approach

The core insight was simple: what if we wrote every event to two places?

  1. Azure Storage Queue (fast, ephemeral) → downstream ETL picks it up in real-time
  2. Azure Append Blobs (durable, cheap) → sits there quietly, waiting for the day we need it

This is called a dual write. When an event comes in, we immediately push it to the queue (so the ETL cluster gets it right away), and also append it to a blob file (so we have a durable copy in case something goes wrong).

If this sounds familiar, it's because it's basically a Write Ahead Log (WAL) - the same pattern databases like PostgreSQL and MySQL use for crash recovery. Before making any changes, they write to a durable append-only log. If something goes wrong, they replay the log to recover. We're doing the exact same thing, just for event streams instead of database transactions.

Now, eagle-eyed distributed systems nerds will notice: "What if the process crashes between the blob write and the queue push?" Good catch. The answer is: we're okay with that. If we crash after writing to the blob but before pushing to the queue, the ETL misses the event right now, but we can replay it from the blob later. This is classic at-least-once delivery semantics - we'd rather replay an event twice than lose it forever.
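
In code, the dual write is not much more than two client calls. Here's a minimal sketch with the Python Azure SDKs (azure-storage-blob and azure-storage-queue) - the container and queue names are made up, and it skips the batching and compression we'll get to below:

import json
from azure.storage.blob import BlobServiceClient
from azure.storage.queue import QueueClient

# Illustrative setup - connection handling and error paths are simplified.
blob_service = BlobServiceClient.from_connection_string("<connection-string>")
queue = QueueClient.from_connection_string("<connection-string>", queue_name="beacon-events")

def handle_event(event: dict, blob_name: str) -> None:
    payload = json.dumps(event)

    # Durable path first: append to today's write-ahead log blob.
    wal = blob_service.get_blob_client(container="relay-wal", blob=blob_name)
    if not wal.exists():
        wal.create_append_blob()
    wal.append_block(payload.encode("utf-8") + b"\n")

    # Real-time path second: push to the queue for the ETL cluster.
    # Crash between these two steps and the event only exists in the WAL,
    # where replay picks it up later - at-least-once, never zero.
    queue.send_message(payload)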

The append blob acts like a time machine. If the ETL cluster goes down and messages time out, we can just replay the events from the blob. We're not changing any downstream systems. We're not migrating to Kafka. We're just adding a safety net.

And here's the kicker: storing events in append blobs for 6 months costs us about $100 per month. That's less than a single EC2 instance. For peace of mind that we'll never lose critical data again, it's a no-brainer.

How It Works

The workflow now looks like this:

  1. A user clicks a tracking link. The request hits our Relay service.
  2. Relay parses and validates the event (URL, session ID, customer ID, etc.).
  3. Relay enriches the event with server-side fields (timestamp, User-Agent, etc.).
  4. Relay does a dual write:
    • Real-time path: Push JSON event to Azure Storage Queue → ETL cluster consumes it immediately
    • Durable path: Add event to an in-memory buffer → batch append to Azure Blob Storage
  5. Relay responds to the user (with a 1x1 transparent GIF - the classic tracking pixel).

We batch about 3,000 events per append to stay well under Azure's 50,000-block limit per append blob. Events are stored in a compressed binary format (CBOR + zstd) to keep costs down - each event is about 200 bytes compressed, compared to roughly 500 bytes as JSON.
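
Here's roughly what that buffered append looks like (a sketch, not our production code - the batch size matches what we run, but the 4-byte length prefix is just one convenient way to make the blob splittable back into batches at replay time):

import struct

import cbor2
import zstandard
from azure.storage.blob import BlobClient

BATCH_SIZE = 3000  # events per append block

class EventBuffer:
    """Buffers events and appends them to the WAL blob in compressed batches."""

    def __init__(self, blob_client: BlobClient):
        self.blob_client = blob_client
        self.events: list[dict] = []

    def add(self, event: dict) -> None:
        self.events.append(event)
        if len(self.events) >= BATCH_SIZE:
            self.flush()

    def flush(self) -> None:
        if not self.events:
            return
        # One CBOR-encoded, zstd-compressed batch per append block, prefixed
        # with its length so replay can walk the blocks back out of the blob.
        payload = zstandard.ZstdCompressor(level=3).compress(cbor2.dumps(self.events))
        self.blob_client.append_block(struct.pack("<I", len(payload)) + payload)
        self.events.clear()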

Blobs are organized by date and hostname:

events/2026-01-10/node-a-0001.bin
events/2026-01-10/node-a-0002.bin
events/2026-01-10/node-b-0001.bin

This makes cleanup easy (just delete old date prefixes) and ensures zero write contention across scaled instances (each pod writes to its own blob). No distributed locking. No Azure Blob Leases. No coordination headaches. Just append to your own file and move on with your life.
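
The naming convention is nothing fancy - date, hostname, per-process sequence counter - and cleanup is a prefix delete. A sketch, with an illustrative container name:

import socket
from datetime import datetime, timezone

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string("<connection-string>", container_name="relay-wal")

def wal_blob_name(sequence: int) -> str:
    # e.g. events/2026-01-10/node-a-0001.bin
    today = datetime.now(timezone.utc).strftime("%Y-%m-%d")
    return f"events/{today}/{socket.gethostname()}-{sequence:04d}.bin"

def delete_day(date: str) -> None:
    # Cleanup really is just "delete everything under an old date prefix".
    for blob in container.list_blobs(name_starts_with=f"events/{date}/"):
        container.delete_blob(blob.name)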

The "Oops, Can We Have That Back?" CLI

The magic happens when disaster strikes. We built a simple CLI command to replay events from blobs:

relay replay --from-date 2026-01-10 --to-date 2026-01-10

This command:

  1. Lists all blobs for the specified date range
  2. Downloads and decompresses them (in deterministic order: date → hostname → sequence)
  3. Pushes every event back to the Azure Storage Queue (tagged with replayed=true)
  4. The ETL cluster consumes them like normal events and reprocesses everything

We can replay a single day (about 100 million events) in about 15 minutes. It's not instant, but for recovering from a multi-hour outage, it's more than fast enough.

And because the blobs are immutable append-only files, we can replay the same date range multiple times if needed (though we try not to, because downstream deduplication isn't trivial).
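
For the curious, the replay loop itself isn't much more code than the writer. A rough sketch that mirrors the buffer sketch above (same made-up container and queue names; error handling, parallelism, and progress reporting omitted):

import json
import struct

import cbor2
import zstandard
from azure.storage.blob import ContainerClient
from azure.storage.queue import QueueClient

container = ContainerClient.from_connection_string("<connection-string>", container_name="relay-wal")
queue = QueueClient.from_connection_string("<connection-string>", queue_name="beacon-events")

def iter_batches(data: bytes):
    # Walk the length-prefixed blocks the event buffer appended.
    offset = 0
    while offset < len(data):
        (size,) = struct.unpack_from("<I", data, offset)
        chunk = data[offset + 4 : offset + 4 + size]
        yield cbor2.loads(zstandard.ZstdDecompressor().decompress(chunk))
        offset += 4 + size

def replay(date: str) -> None:
    # 1. List every blob for the date; sorting gives the deterministic
    #    hostname -> sequence order.
    names = sorted(b.name for b in container.list_blobs(name_starts_with=f"events/{date}/"))
    for name in names:
        # 2. Download and decode, batch by batch.
        data = container.get_blob_client(name).download_blob().readall()
        for batch in iter_batches(data):
            for event in batch:
                # 3. Tag and push back onto the same queue the ETL already consumes.
                event["replayed"] = True
                queue.send_message(json.dumps(event))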

Why This Works for Us

This solution isn't perfect. The binary format is a bit harder to debug than plain JSON. We had to write a custom CLI tool to handle replay. And you have to be disciplined about not running overlapping replay jobs.

But here's what we got in return:

  • Never lose data again - If the ETL cluster goes down, we can replay from blobs
  • Zero downstream changes - The ETL cluster still consumes from the same queue
  • Minimal cost - ~$100/month for 6 months of retention (vs. thousands for Kafka)
  • Simple architecture - No new infrastructure to manage, just blobs
  • Auditability - We have a durable record of every event for forensics

Sometimes the simplest solution isn't the textbook one. Sometimes it's just writing to two places instead of one. And sometimes that's all you need to sleep soundly at night, knowing your data is safe.

Hopefully. 🤞