Comparing Kafka And RabbitMQ For Event Processing

Pratik Kale
13 min read · Nov 8, 2020

This article will help you if:

  • You have basic knowledge of Kafka and RabbitMQ.
  • You know Kafka and RabbitMQ are leading event processing options but are unsure which one to pick for your use case.
  • You have a general understanding of both and want to quickly see the differences between them.

Stream Processing vs Message Queue

Kafka

  • Kafka is an event streaming platform. Producers publish messages to a topic, and consumers listening to that topic consume them.
  • After a consumer group in Kafka consumes a message, the message still remains in the Kafka log until it expires. Hence it is possible to replay the message until its expiry, even after a consumer has consumed it.
  • This makes it very different from traditional FIFO queues, as a message is never actually dequeued after it is consumed.
  • I like to think of Kafka (just as an analogy) as a database with each row as a message, topics as different columns, and an expiry column containing the timestamp after which the message expires. A minimal sketch of this produce/consume/replay flow follows.
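
A minimal sketch of this flow using the kafka-python client, assuming a local broker on localhost:9092 and a topic named "events" (both assumptions for illustration):

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish a message to the "events" topic (broker address assumed).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"order-created")
producer.flush()

# Consuming the message does NOT remove it from the topic; a consumer group
# starting with auto_offset_reset="earliest" can replay everything that is
# still within the retention period.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="billing-group",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.value)  # the message stays in the log for other groups to replay
    break
```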

RabbitMQ

  • RabbitMQ is a traditional message queue.
  • Producers produce messages and send them to an exchange, which routes them to one or more queues based on a routing key.
  • Consumers bind to a queue by name and start consuming messages from it.
  • Once a message is consumed (and acknowledged) it is gone from the queue and cannot be replayed. A minimal sketch follows.
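
For contrast, a minimal sketch with the pika client, assuming RabbitMQ on localhost and a queue named "task_queue" (both assumptions):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="task_queue")

# Publish via the default exchange; the routing key is the queue name.
channel.basic_publish(exchange="", routing_key="task_queue", body=b"order-created")

# Once the consumer acknowledges the message it is removed from the queue
# and cannot be replayed.
def handle(ch, method, properties, body):
    print(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="task_queue", on_message_callback=handle)
channel.start_consuming()
```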

Consuming same message in different ways

Kafka

  • Kafka uses topics to associate producers with consumers.
  • Kafka also provides consumer groups, which allow the same message to be consumed by different sets of consumers.
  • If you have a use case where the same message needs to be processed in different ways, consumer groups make this very easy; see the sketch below.
  • Read here to know more about consumer groups and topics.
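
A short sketch (kafka-python; broker address and group names are assumptions) of two consumer groups receiving the same messages from one topic:

```python
from kafka import KafkaConsumer

# Each consumer group tracks its own offsets, so every message on "events"
# is delivered to both groups; within a group it goes to only one consumer.
logs_consumer = KafkaConsumer(
    "events", bootstrap_servers="localhost:9092", group_id="logs-processors"
)
metrics_consumer = KafkaConsumer(
    "events", bootstrap_servers="localhost:9092", group_id="metrics-processors"
)
```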

RabbitMQ

  • RabbitMQ uses queues to associate producers with consumers.
  • In RabbitMQ every message reaches a queue via an exchange.
  • If the same message needs to be consumed in different ways, we need to create a dedicated queue for each purpose (e.g. two queues hold the same message, but the first is consumed for log processing and the second for metrics processing). We also need a fanout exchange to publish the same message to all of those queues.
  • Each set of consumers can then bind to its dedicated queue and process the message in its own way.
  • So in the RabbitMQ world, what we call a topic is essentially a separate queue.
  • This makes it a little harder to consume the same message in different ways: you have to create multiple queues, publish the same message to each, and bind consumers to each queue, instead of just creating multiple consumer groups bound to the same topic as in Kafka.
  • Producing the same message to different queues via a fanout exchange is explained here, and sketched below.
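
A sketch of the equivalent setup with pika: a fanout exchange copies the same message into two dedicated queues (all names are assumptions):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# One fanout exchange, two purpose-specific queues bound to it.
channel.exchange_declare(exchange="events", exchange_type="fanout")
channel.queue_declare(queue="logs_queue")
channel.queue_declare(queue="metrics_queue")
channel.queue_bind(exchange="events", queue="logs_queue")
channel.queue_bind(exchange="events", queue="metrics_queue")

# A single publish results in a copy of the message in each queue,
# so each set of consumers can process it in its own way.
channel.basic_publish(exchange="events", routing_key="", body=b"order-created")
```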

Scalability (Consumer Processing)

Kafka

  • Kafka uses partitions to scale. Every topic has partitions, and messages are distributed across partitions (round robin by default, or by message key).
  • Every consumer listening to a topic is assigned one or more partitions. If we have 4 partitions and 4 consumers, each listening to one partition, we can process 4 messages in parallel at a time.
  • In Kafka we can scale horizontally by just adding more partitions and consumers; see the sketch below.
  • Read here to know more about partitions and consumers. You can also read this blog to understand partitioning better.
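
A sketch of creating a topic with 4 partitions via the kafka-python admin client (broker address and topic name assumed); 4 consumers in one group would then each own one partition:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
# 4 partitions -> up to 4 consumers in a group can process messages in parallel.
admin.create_topics([NewTopic(name="events", num_partitions=4, replication_factor=1)])
```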

RabbitMQ

  • In RabbitMQ there is no concept of a partition. Each message goes to a queue, and consumers fetch messages based on the prefetch count set on each consumer.
  • Hence, to increase consumer throughput we can just add more consumers.
  • The more consumers there are, the faster messages are dequeued from the head of the queue.
  • Read here to know more about consuming messages in RabbitMQ using multiple consumers.

Scalability (Broker)

Kafka

  • Kafka brokers scale well horizontally. We can have multiple brokers, and partitions are distributed across them.
  • If throughput increases we can just add more brokers and partitions.
  • This blog explains very well how partitions are divided across brokers.

RabbitMQ

  • Queues can be distributed across nodes. However, the queues (if HA is enabled) are mirrored (mirrored queues) or replicated (quorum queues) to other nodes.
  • Only one node, the designated master for the queue, processes messages; the others are standbys.
  • Although we can try to distribute queues evenly across nodes, this is not guaranteed, as uncertain events (like network failures) can trigger re-election and change the queue's master.
  • Hence I feel RabbitMQ is designed more for vertical scaling: if throughput increases for a queue, we may get better performance by increasing the RAM, disk, and CPU speed of that queue's master node.

Fault Tolerance and HA

Kafka

  • Kafka uses replica partitions, distributed across nodes, for HA.
  • If the leader node for a partition goes down, another node holding a replica of that partition takes over and becomes the new leader.
  • This blog explains the replication process.

RabbitMQ

  • RabbitMQ uses mirrored queues or quorum queues for HA.
  • With mirrored queues, every queue is mirrored (replicated) to another node. If the master node for a queue goes down, a replica node becomes the new master.
  • Quorum queues use Raft to replicate data to the disks of the other nodes serving the queue. If the leader node for a queue goes down, a new leader is elected via a Raft election.
  • Quorum queues were released recently in version 3.8 and are preferred over mirrored queues if data safety is of utmost importance. You can read more here.
  • Quorum queues are built for data safety and durability. As data is replicated across nodes, quorum queues need more RAM and disk to perform efficiently, and sizing a quorum queue requires analyzing the size of the messages you are going to produce. This blog gives a good idea of the replication happening in quorum and classic mirrored queues.

Large Messages

Kafka

  • Kafka has a default message.max.bytes of 1 MB, which can be overridden in the Kafka server configuration. The broker will reject messages larger than message.max.bytes.
  • Although large messages (>10 MB) can be pushed to Kafka brokers and consumed, Kafka is optimized for messages under 1 MB. This document from Cloudera provides some benchmarks where they found Kafka performs best with smaller messages.
  • It is thus advisable to keep large messages in a separate data or object store and only push a reference to the data (usually an id identifying it in the store) onto the topic; a sketch of this pattern follows.
  • Consumers can then extract the id from the message and use it to query or download the large payload.
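
A hedged sketch of this reference-passing pattern (the object-store client is hypothetical; broker and topic names are assumptions):

```python
import json
import uuid
from kafka import KafkaProducer

def publish_large_payload(object_store, payload: bytes) -> None:
    # Store the large payload outside the broker...
    object_id = str(uuid.uuid4())
    object_store.put(object_id, payload)  # hypothetical object-store API
    # ...and publish only a small reference to it.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", json.dumps({"object_id": object_id}).encode())
    producer.flush()
```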

RabbitMQ

  • There is no hard restriction on the size of a message sent to RabbitMQ.
  • However, best practice is to keep queues lightweight, especially when operating in HA mode, as queue data may be replicated to multiple nodes. In quorum queues all data gets replicated to every Raft member of the queue, and large messages can fill up RAM and disk quickly, decreasing the queue's throughput.
  • RabbitMQ has a memory high watermark; if memory use reaches this threshold it blocks publishers until memory is freed.
  • This blog explains how write amplification can cause the disk of a quorum queue to fill up, halting all the nodes in your RabbitMQ cluster.
  • If you cannot scale vertically to store large messages in the queue, and your throughput is high enough to overwhelm the disk and RAM of your node, it is best to store the message in a data or object store and pass only a reference to it (as in the Kafka sketch above).
  • Consumers can then extract the id from the message and use it to query or download the large payload.

Large Processing time

Kafka

  • Messages with long processing times can be a little tricky in the Kafka ecosystem.
  • Since Kafka 0.10.0, processing time is controlled by max.poll.interval.ms, and there is a separate heartbeat thread whose timeout is controlled by session.timeout.ms.
  • If a consumer does not poll within max.poll.interval.ms, a rebalance is triggered: the consumer is marked as down and its partition is assigned to another consumer.
  • However, for a rebalance to go through, all consumers need to have finished processing their current messages.
  • So if a message takes a long time to process, Kafka has to wait for all consumers to respond (which might take a long time if they are processing slow messages) before the rebalance completes.
  • This can make rebalances very slow. If you have consumers taking hours to process a message, a rebalance can theoretically take hours. This Stack Overflow post provides some context on long-running processes and how they affect rebalances.
  • Also, if a message is taking a long time on one consumer, all the other messages in that consumer's partition are starved until the long-running message is processed. This can turn into a livelock (imagine the consumer gets stuck in while(true)), hence the need for max.poll.interval.ms to detect such a situation and trigger a rebalance.
  • Another tricky part is choosing an optimal value for max.poll.interval.ms, which controls how long a message may take to process. If you keep it very large, it might take a long time to detect livelocks or deadlocks in consumers and rebalance intervals can grow; if you keep it very small, you might trigger frequent rebalances whenever certain messages take long to process. It can therefore be hard to come up with a foolproof value for this timeout. A configuration sketch follows.
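
A configuration sketch with kafka-python showing the two timeouts discussed above (values are illustrative, not recommendations):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="slow-processors",
    session_timeout_ms=10_000,       # heartbeat-based liveness check
    max_poll_interval_ms=600_000,    # max time between poll() calls before a rebalance
)
```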

RabbitMQ

  • Here's where RabbitMQ shines. RabbitMQ does not have any processing timeout; a separate heartbeat thread manages the heartbeat timeout. According to the official RabbitMQ tutorials, it is fine even if the consumer takes a very, very long time to process a message:

“There aren’t any message timeouts; RabbitMQ will redeliver the message when the consumer dies. It’s fine even if processing a message takes a very, very long time.”

  • As there is no concept of partitions and rebalances, even if a message is taking a long time on one consumer, other consumers will keep fetching messages from the head of the queue. Hence, as long as we have enough consumers, no messages will be starved (unless all consumers are busy processing messages, in which case you need to add consumers to match the throughput).
  • You also do not need to manage a processing timeout, as no such timeout exists in RabbitMQ.

Producer Confirms

Kafka

  • Kafka uses ZooKeeper to manage its brokers, but message confirmations come from the brokers themselves.
  • When a message is published, the broker acknowledges it back to the producer; with acks=all the acknowledgment is only sent after the message has been replicated to the in-sync replicas, so the producer knows the message is safe.

RabbitMQ

  • In RabbitMQ you can use publisher confirms to make sure a message is safely stored in the queue.
  • This is needed because various network-related incidents can cause a message to be lost even if the producer sends it successfully to the RabbitMQ exchange. Below is an extract on how publisher confirms work with RabbitMQ's quorum queues and why they matter; a code sketch follows the quote. You can read more here.

Publishers should use publisher confirms as this is how clients can interact with the quorum queue consensus system. Publisher confirms will only be issued once a published message has been successfully replicated to a quorum of nodes and is considered “safe” within the context of the system
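
A sketch of publisher confirms with pika (queue name and payload are assumptions): the publish counts as successful only once the broker confirms it, which for a quorum queue means it has been replicated to a quorum of nodes.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="orders", durable=True)
channel.confirm_delivery()  # enable publisher confirms on this channel

try:
    channel.basic_publish(
        exchange="", routing_key="orders", body=b"order-created", mandatory=True
    )
    print("message confirmed by the broker")
except pika.exceptions.UnroutableError:
    print("message could not be routed to any queue")
except pika.exceptions.NackError:
    print("message was nacked by the broker")
```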

Additional Infrastructure

Kafka

  • Kafka needs ZooKeeper to manage its nodes, which introduces an additional dependency. (The Kafka project has indicated that it will remove the ZooKeeper dependency in future versions. Click here to know more.)

RabbitMQ

  • RabbitMQ is standalone and does not need any additional infrastructure to manage the nodes in its cluster.

Documentation

  • Both Kafka and RabbitMQ have extensive documentation and also a huge community support and following.

Rolling deployments for long running consumers

Kafka

  • Rolling deployments are nowadays the default way of deploying code, where new artifacts (images, manifests, WARs) are pushed to the system as a rolling upgrade. This ensures there is no downtime while deploying. To know more about rolling deployments you can click here.
  • Rolling deployments of consumers may trigger rebalances as new consumers come up and partitions are reassigned one by one.
  • For long-running processes, rolling deployments can therefore be time consuming.
  • As consumers come up one by one, they may pick up messages and start processing them.
  • However, every consumer deployment triggers a rebalance (Kafka assigns partitions as it discovers consumers), and the rebalance has to wait for all other consumers to respond, some of which may still be processing messages.
  • This makes rolling deployments slow for long-running processes, as each rolling update triggers rebalances that can be lengthy.
  • If your processes are not long running, rebalances will be fast and will not noticeably impact deployment time.

RabbitMQ

  • Rolling deployments are not a problem, as there is no concept of partitions or rebalances in RabbitMQ.
  • As new consumers become available, they simply start consuming from the head of the queue.

Memory, Disk in HA mode

Kafka

  • As messages are distributed across nodes and partitions in Kafka, it can scale horizontally if more memory or disk is required.
  • Kafka also has a retention period for every message, after which the message is marked for deletion; this helps manage memory and disk space.

RabbitMQ

  • When quorum queues are used, RabbitMQ keeps all messages in memory and on disk, and all messages are replicated across all the nodes serving the queue.

Quorum queues typically require more resources (disk and RAM) than classic mirrored queues. To enable fast election of a new leader and recovery, data safety as well as good throughput characteristics all members in a quorum queue “cluster” keep all messages in the queue in memory and on disk.

  • This can make it hard to scale horizontally if throughput or message size increases.
  • RabbitMQ also performs best in HA mode when all messages fit in RAM. The emptier the queue, the better its throughput and performance.
  • This architecture makes sense for RabbitMQ, as it is designed to operate as a traditional queue where messages are dequeued constantly.
  • This blog gives you some best practices and explains the importance of keeping queues short. However, queue sizes and cluster sizing vary based on individual use cases.
  • RabbitMQ also has a TTL for every message (currently not supported for quorum queues, but it might be added in the future), after which the message is marked for deletion.

Consumer Acknowledgments & Retries

Kafka

  • We can configure Kafka to not mark a message as consumed (i.e. not advance the offset) by disabling auto commit. This allows consumers to mark a message as consumed manually (using commit sync or commit async).
  • This gives us the ability to retry a message if an infrastructure or network call fails while consuming it. It is a very useful feature: we effectively get automatic retries as long as we do not commit the offset until the message is properly processed.
  • Kafka will redeliver the same message and not advance the committed offset until it is committed.
  • If auto commit is disabled, it is up to the consumer to manage retries and commits. For example, a consumer can commit if an error cannot be fixed by retrying (e.g. a code bug), but skip the commit for errors a retry can fix (e.g. socket or connection timeouts) so that the same message is replayed and the retry resolves the issue. A sketch of this flow follows.
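
A sketch of this manual-commit flow with kafka-python (process and TransientError are hypothetical placeholders for your business logic):

```python
from kafka import KafkaConsumer, TopicPartition

class TransientError(Exception):
    """Hypothetical error type for failures a retry can fix."""

def process(payload: bytes) -> None:
    """Hypothetical business logic."""

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="order-processors",
    enable_auto_commit=False,
)

for record in consumer:
    try:
        process(record.value)
        consumer.commit()       # mark as consumed only after success
    except TransientError:
        # Do not commit; seek back so the same message is polled and retried.
        consumer.seek(TopicPartition(record.topic, record.partition), record.offset)
    except Exception:
        consumer.commit()       # non-retryable error: skip the poison message
```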

RabbitMQ

  • By default (in auto-ack mode) RabbitMQ works in a fire-and-forget fashion: once a message is delivered it is considered consumed and cannot be processed again.
  • However, we can set auto ack to false and acknowledge the message manually (similar to the commit sync described above). We also have the option to requeue a message if it cannot be acknowledged.
  • RabbitMQ will try to place a re-queued message at, or near, the head of the queue. Because it goes near the head and not to the back, a message re-queued due to a failure does not suffer from starvation (RabbitMQ makes a best effort to keep it near the head).
  • We can also track how many times a message has been delivered via the x-delivery-count header, which counts how many times the message has been retried. Currently this is only available on quorum queues.
  • Auto ack, manual ack, and re-queuing are described here.
  • x-delivery-count can be looked up here; a sketch combining both follows.
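
A sketch combining manual acknowledgments, re-queueing, and x-delivery-count with pika (queue name, handler, and retry limit are assumptions; the header exists only on quorum queues):

```python
import pika

def process(payload: bytes) -> None:
    """Hypothetical business logic."""

def handle(ch, method, properties, body):
    deliveries = (properties.headers or {}).get("x-delivery-count", 0)
    try:
        process(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)   # success: remove from the queue
    except Exception:
        # requeue=True puts the message back near the head of the queue;
        # after too many attempts, drop it (or dead-letter it) instead.
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=deliveries < 5)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_consume(queue="orders", on_message_callback=handle, auto_ack=False)
channel.start_consuming()
```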

Fair Dispatch

Kafka

  • In Kafka, messages are split into partitions. If the consumer attached to a partition is slow, messages in that partition are processed more slowly than those in other partitions.
  • This can cause messages that arrived early to be processed later simply because they were sent to a slow-moving partition (unfair dispatch).

RabbitMQ

  • RabbitMQ also prefetches messages, so we can have the same unfair-dispatch issue: if a lightweight message is prefetched behind long-running messages, it may be starved until they finish.
  • However, if we keep the prefetch count at 1, we can guarantee fair dispatch; see the sketch below.
  • The "Fair dispatch" section in this RabbitMQ tutorial explains it best.
  • If every consumer picks up one message and only fetches the next after consuming and acknowledging the current one, no message will be unfairly starved.
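
A sketch of fair dispatch with pika: prefetch_count=1 means a consumer is not handed a new message until it has acknowledged the current one (queue name and work function are assumptions):

```python
import pika

def do_work(payload: bytes) -> None:
    """Hypothetical long-running work."""

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_qos(prefetch_count=1)  # deliver at most one unacknowledged message

def handle(ch, method, properties, body):
    do_work(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)   # only now is the next message sent

channel.basic_consume(queue="task_queue", on_message_callback=handle, auto_ack=False)
channel.start_consuming()
```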

Message Order

Kafka

  • Messages within a partition are ordered, but there is no ordering guarantee across partitions.
  • We can guarantee total ordering by using just one partition. Messages are then consumed in arrival order (like a FIFO queue), as there is just one consumer and one partition.
  • We can still run more consumers than partitions in this mode; the extra consumers simply act as standbys in case the consumer serving the only partition fails.

RabbitMQ

  • Ordering cannot be guaranteed unless the prefetch count is 1 and only one consumer is attached to the queue (a very restrictive mode of working, to be used only if strict ordering is required). We also cannot have standbys, as adding another consumer would immediately start dequeuing messages from the queue.
  • Also, if we retry by re-queuing a message using manual acknowledgments, there is no guarantee it will be placed exactly at the head of the queue (although RabbitMQ makes a best effort). Hence a failed, re-queued message may be processed out of order.

Complex Routing

Kafka

  • Routing in Kafka is managed using topics.
  • Producers produce messages to a topic and consumers consume from that topic.
  • Complex, pattern-based routing is not available in Kafka.

RabbitMQ

  • Routing in RabbitMQ is managed using exchanges and routing keys.
  • Hence very complex routing is possible in RabbitMQ.
  • One of RabbitMQ's headline features is flexible routing. You can read more about it here.
  • This documentation also provides a detailed explanation of message routing.
  • A topic exchange can route messages to various queues based on wildcard patterns, as shown here and sketched after the quote below.

The routing pattern follows the same rules as the routing key with the addition that * matches a single word, and # matches zero or more words. Thus the routing pattern *.stock.# matches the routing keys usd.stock and eur.stock.db but not stock.nasdaq.
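
A sketch of this topic-exchange routing with pika, using the pattern from the quote (exchange and queue names are assumptions):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="stocks", exchange_type="topic")
channel.queue_declare(queue="stock_updates")
# "*" matches exactly one word, "#" matches zero or more words.
channel.queue_bind(exchange="stocks", queue="stock_updates", routing_key="*.stock.#")

channel.basic_publish(exchange="stocks", routing_key="usd.stock", body=b"...")     # routed
channel.basic_publish(exchange="stocks", routing_key="eur.stock.db", body=b"...")  # routed
channel.basic_publish(exchange="stocks", routing_key="stock.nasdaq", body=b"...")  # not routed
```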

UI and CLI

Kafka

  • Kafka has a robust and well documented CLI.
  • However, it lacks UI support, as there is no official UI for Kafka.

RabbitMQ

  • The RabbitMQ installation comes with an official UI and CLI.
  • The UI is very handy, as we can create queues and exchanges and publish messages right from it.
  • Most importantly, the RabbitMQ UI shows cluster statistics such as RAM usage, disk space, throughput per queue, and the number of live consumers attached to each queue.
  • These statistics help a lot while debugging, evaluating, and iteratively fine-tuning the cluster based on visual feedback.

Conclusion

To conclude, both Kafka and RabbitMQ can be used as event processing platforms, but they are optimized for different use cases. There is no clear winner, but hopefully this article gives you a head start and points you toward what fits your use case.


Pratik Kale

Full stack software engineer with focus on working across stacks to design and architect end to end distributed systems