Distributed Event Platform
Kafka, services, and a long love–hate relationship with rebalancing.
The first version of this passed every test and fell over in production the first time a Kafka consumer group rebalanced under load. That bug taught me more about distributed systems than the previous year of reading. Until you’ve watched a partition assignment flap during a deploy, you don’t really know what your code is promising.
What ended up working was a combination of unsexy things: idempotent consumers with explicit deduplication, transactions for the bits where exactly-once actually mattered, and a lot of patience around partitioning. The fancy library features helped less than I expected. Boring discipline helped more: naming topics carefully, versioning schemas, and treating the dead-letter queue as a real thing somebody had to look at.
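To make "idempotent consumer with explicit deduplication" concrete, here is a minimal sketch, not the code from the project. It assumes every event carries a unique `event_id` set by the producer (a hypothetical field name), and it uses an in-memory set where a real system would need a durable store that survives restarts.

```python
from dataclasses import dataclass, field
from typing import Callable, Set


@dataclass(frozen=True)
class Event:
    event_id: str  # hypothetical: a unique ID the producer attaches to each event
    payload: dict


@dataclass
class IdempotentConsumer:
    apply: Callable[[Event], None]               # the side effect that should run once per event
    seen: Set[str] = field(default_factory=set)  # stand-in for a durable dedup store

    def handle(self, event: Event) -> bool:
        """Apply the event unless its ID was already processed. Returns True if applied."""
        if event.event_id in self.seen:
            return False  # a redelivery, e.g. after a rebalance: skip it
        self.apply(event)
        self.seen.add(event.event_id)
        return True


applied = []
consumer = IdempotentConsumer(apply=lambda e: applied.append(e.payload["amount"]))

# "a1" is delivered twice, as it might be when offsets were not committed before a rebalance.
deliveries = [
    Event("a1", {"amount": 10}),
    Event("a2", {"amount": 5}),
    Event("a1", {"amount": 10}),
]
results = [consumer.handle(e) for e in deliveries]
```

The sketch glosses over the hard part: if the process dies between running the side effect and recording the ID, the event is applied again on redelivery. Closing that gap means either committing both in one transaction or making the side effect itself safe to repeat.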
I made a deliberate choice early on to push as much logic as possible into stateless consumers and treat Kafka itself as the database for state. That sounds heretical and is occasionally inconvenient, but it meant the operational story was simple: lose a consumer, replay from the offset, you’re back. The teams who tried to keep large in-memory state on the consumer side ran into the most pain.
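The "lose a consumer, replay from the offset" story can be sketched in a few lines. This is a toy model, not a Kafka client: one partition is a plain Python list, the list index stands in for the offset, and the `(account, delta)` event shape is made up for illustration.

```python
from typing import Dict, List, Tuple

# One partition modeled as an append-only list; the list index plays the role of the offset.
Log = List[Tuple[str, int]]  # (account, delta), a hypothetical event shape


def replay(log: Log, from_offset: int, state: Dict[str, int]) -> Tuple[Dict[str, int], int]:
    """Fold events from from_offset onward into a copy of state; return (state, next_offset)."""
    rebuilt = dict(state)
    for account, delta in log[from_offset:]:
        rebuilt[account] = rebuilt.get(account, 0) + delta
    return rebuilt, len(log)


log: Log = [("alice", 10), ("bob", 5), ("alice", -3), ("bob", 2)]

# A consumer processed offsets 0 and 1, committed offset 2, then died.
checkpoint_state, committed_offset = replay(log[:2], 0, {})

# Its replacement picks up from the committed offset with the checkpointed state.
recovered, next_offset = replay(log, committed_offset, checkpoint_state)

# Replaying the whole log from offset 0 gives the same answer.
from_scratch, _ = replay(log, 0, {})
```

The point of the design is the last comparison: because state is a pure function of the log, a replacement consumer needs nothing from the one that died.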
Monitoring was where the real value lived. Consumer lag is the metric that tells you whether your pipeline is healthy in the only sense that matters: are events being processed as fast as they’re produced? Everything else is a leading or lagging indicator of that one number. I wish I’d built that dashboard before the actual processing logic. It would’ve saved me three weeks.
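For reference, lag is just arithmetic: per partition, the log end offset minus the group's committed offset. A minimal sketch follows, with made-up offsets. In practice the two inputs would come from your client library or broker tooling, and treating a partition with no committed offset as lagging by the whole log is a simplification I'm assuming here.

```python
from typing import Dict


def consumer_lag(end_offsets: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Per-partition lag: log end offset minus committed offset, floored at zero."""
    # Assumption: a partition with no committed offset counts as lagging by the whole log.
    return {p: max(end - committed.get(p, 0), 0) for p, end in end_offsets.items()}


# Illustrative numbers only, keyed by partition.
end_offsets = {0: 1200, 1: 980, 2: 1500}
committed = {0: 1200, 1: 900, 2: 1100}

lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())
worst_partition = max(lag, key=lag.get)
```

Total lag is the number for the dashboard; the worst partition is the one to check first when it climbs, since a single hot or stuck partition can hide behind a healthy-looking average.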
I’m still cautious about reaching for Kafka. The day-one demo is great. The day-365 reality involves consumer lag dashboards, schema migrations, and at least one Slack thread per quarter that starts with “so the dead-letter queue is doing a thing.” It earns its keep when the workload genuinely needs it; when it doesn’t, a queue is a queue, and you should use the boringest one available.