
Kafka remove duplicate messages

Update: Confluent has renamed KSQL to ksqlDB.

It’s that Kafka Summit time of year again. Some of the talks are good, and some you have to sift through in order to figure out what’s best for you and your organization. I’ll briefly state my opinions and then go through them and the technical reasons in more depth. I recommend my clients not use Kafka Streams because it lacks checkpointing. Kafka Streams also lacks a true shuffle sort and only approximates one. KSQL sits on top of Kafka Streams, so it inherits all of these problems and then some.


Saying Kafka is a database comes with so many caveats that I don’t have time to address all of them in this post. Kafka is a great messaging system, but saying it is a database is a gross overstatement. Unless you’ve really studied and understood Kafka, you won’t be able to understand these differences. I find that talks and rebuttals like this don’t really separate out opinions from facts, so I’m going to try to separate my opinions from the facts. I feel like Confluent’s slide content should have *, †, ‡, §, ‖, ¶ after every statement so that you can look up all of the caveats they’re glossing over. It’s a fact that Kafka Streams – and by inheritance KSQL – lacks checkpointing.


That sounds esoteric, like I’m being pedantic, right? Checkpointing is fundamental to operating distributed systems. You can replay any message that was sent through Kafka, and message processing can be stateless or stateful. For stateless processing, you just receive a message and then process it. As soon as you get stateful, everything changes. Now you have to deal with storing the state, and storing state means having to recover from errors while maintaining that state.
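To make the distinction concrete, here is a minimal Kafka Streams sketch; the topic names and serde configuration are my own illustration, not from the original post. The mapValues branch is stateless, while the count() branch keeps a local state store that Kafka Streams backs with an internal changelog topic.

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class StatelessVsStateful {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stateless-vs-stateful-demo");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> events = builder.stream("events");

            // Stateless: each record is transformed on its own. A restarted node
            // resumes from its committed consumer offset; there is nothing to rebuild.
            events.mapValues(v -> v.toUpperCase()).to("events-uppercased");

            // Stateful: the per-key count lives in a local state store backed by an
            // internal changelog topic. After a total failure, that changelog has to
            // be replayed in full before this instance can process new records.
            KTable<String, Long> countsPerKey = events.groupByKey().count();

            new KafkaStreams(builder.build(), props).start();
        }
    }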

For Kafka Streams, they say no problem: we have all of the messages needed to reconstruct the state. They even say that you can use a compacted topic to keep the messages stored in Kafka limited to roughly one per key. Solved, right? No, because ~1 message per key can still be a massive amount of state. If you have 100 billion keys, you will have 100 billion+ messages in the state topic, because all state changes are put into the state change topic. And if you have a growing number of keys, the size of your state will gradually increase too.
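For reference, this is roughly how a compacted topic is declared with Kafka’s AdminClient; the topic name, partition count, and replication factor here are hypothetical. Compaction eventually retains about one record per key, but recovery still means replaying one record for every key you have ever seen.

    import java.util.Map;
    import java.util.Properties;
    import java.util.Set;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;
    import org.apache.kafka.common.config.TopicConfig;

    public class CreateCompactedTopic {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");

            try (AdminClient admin = AdminClient.create(props)) {
                // cleanup.policy=compact retains the latest record per key instead
                // of deleting by age, so the topic converges toward ~1 record per key.
                NewTopic changelog = new NewTopic("counts-changelog", 12, (short) 3)
                        .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG,
                                        TopicConfig.CLEANUP_POLICY_COMPACT));
                admin.createTopics(Set.of(changelog)).all().get();
            }
        }
    }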

The operational manifestation of this is that if a node dies, all of those messages have to be replayed from the topic and inserted into the database. It’s only once all of these mutations are done that the processing can start again. In a disaster scenario – or a human error scenario – where all machines running the Kafka Streams job die or are killed, all nodes will have to replay all state mutation messages before a single message can be processed. This replaying of state mutation messages could translate into hours of downtime. Some potential users of Kafka Streams have told me they calculated this scenario out to be 4+ hours of downtime. Now you’re 4+ hours behind and still have to process all of the messages that accrued over that time just to get back to the current time.
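The 4+ hour figure was those users’ own calculation. The numbers below are hypothetical and purely illustrative, but they show the back-of-envelope shape of such an estimate: restore time is roughly the changelog size divided by the aggregate replay rate.

    public class RestoreEstimate {
        public static void main(String[] args) {
            // Hypothetical inputs, not figures from the post.
            long changelogRecords = 100_000_000_000L; // ~1 record per key after compaction
            long replayRatePerSec = 7_000_000L;       // aggregate replay rate across all nodes

            double hours = changelogRecords / (double) replayRatePerSec / 3600.0;
            System.out.printf("Estimated restore time: %.1f hours%n", hours); // ~4.0 hours
        }
    }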

For this reason, databases and processing frameworks implement checkpointing (in Flink this is called taking snapshots). This is where the entire state at that point in time is written out to durable storage (S3/HDFS). When there is a massive error, the program will start up, read the previous checkpoint, replay any messages after the checkpoint (usually in the 1000s), and start processing again.
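As a sketch of that alternative, this is what enabling checkpoints to durable storage looks like in Flink’s Java API (assuming Flink 1.13+; the S3 bucket URI is a placeholder):

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointedJob {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Snapshot all operator state every 60 seconds with exactly-once semantics.
            env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);

            // Write snapshots to durable storage so recovery reads the latest
            // checkpoint instead of replaying the full history of state mutations.
            env.getCheckpointConfig().setCheckpointStorage("s3://my-bucket/checkpoints");

            // ... define sources and operators here, then:
            // env.execute("checkpointed-job");
        }
    }

On recovery, Flink loads the latest snapshot and replays only the records that arrived after it, rather than the entire state change history.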








