Comparing Kafka And RabbitMQ For Event Processing

Pratik Kale
13 min read · Nov 8, 2020

This article will help you if:

  • You have basic knowledge of Kafka and RabbitMQ.
  • You know Kafka and RabbitMQ are leading event processing options but are unsure which one to pick for your use case.
  • You have a general understanding of both and want to quickly see the differences between them.

Stream Processing vs Message Queue

Kafka

  • Kafka is an event streaming platform. Producers publish messages to a topic, and consumers listening to that topic consume them.
  • After a consumer group in Kafka consumes a message, the message still remains in the Kafka log until it expires. Hence it is possible to replay the message until its expiry, even after a consumer has consumed it.
  • This makes it very different from traditional FIFO queues, as a message is never actually dequeued after it is consumed.
  • I like to think of Kafka (just as an analogy) as a database with each row as a message, topics as different columns, and an expiry column containing the timestamp after which the message expires. A minimal sketch of this produce/consume/replay flow follows.
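
A minimal sketch of this flow using the kafka-python client, assuming a local broker on localhost:9092 and a topic named "events" (both assumptions for illustration):

```python
from kafka import KafkaProducer, KafkaConsumer

# Publish a message to the "events" topic (broker address assumed).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", b"order-created")
producer.flush()

# Consuming the message does NOT remove it from the topic; a consumer group
# starting with auto_offset_reset="earliest" can replay everything that is
# still within the retention period.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="billing-group",
    auto_offset_reset="earliest",
)
for record in consumer:
    print(record.value)  # the message stays in the log for other groups to replay
    break
```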

RabbitMQ

  • RabbitMQ is a traditional message queue.
  • Producers produce messages and send them to an exchange, which routes them to one or more queues based on a routing key.
  • Consumers bind to a queue by name and start consuming messages from it.
  • Once a message is consumed (and acknowledged) it is gone from the queue and cannot be replayed. A minimal sketch follows.
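
For contrast, a minimal sketch with the pika client, assuming RabbitMQ on localhost and a queue named "task_queue" (both assumptions):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="task_queue")

# Publish via the default exchange; the routing key is the queue name.
channel.basic_publish(exchange="", routing_key="task_queue", body=b"order-created")

# Once the consumer acknowledges the message it is removed from the queue
# and cannot be replayed.
def handle(ch, method, properties, body):
    print(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue="task_queue", on_message_callback=handle)
channel.start_consuming()
```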

Consuming same message in different ways

Kafka

  • Kafka uses topics to associate producers with consumers.
  • Kafka also provides consumer groups, which allow the same message to be consumed by different sets of consumers.
  • If you have a use case where the same message needs to be processed in different ways, consumer groups make this very easy; see the sketch below.
  • Read here to know more about consumer groups and topics.
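
A short sketch (kafka-python; broker address and group names are assumptions) of two consumer groups receiving the same messages from one topic:

```python
from kafka import KafkaConsumer

# Each consumer group tracks its own offsets, so every message on "events"
# is delivered to both groups; within a group it goes to only one consumer.
logs_consumer = KafkaConsumer(
    "events", bootstrap_servers="localhost:9092", group_id="logs-processors"
)
metrics_consumer = KafkaConsumer(
    "events", bootstrap_servers="localhost:9092", group_id="metrics-processors"
)
```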

RabbitMQ

  • RabbitMQ uses queues to associate producers with consumers.
  • In RabbitMQ every message reaches a queue via an exchange.
  • If the same message needs to be consumed in different ways, we need to create a dedicated queue for each purpose (e.g. two queues hold the same message, but the first is consumed for log processing and the second for metrics processing). We also need a fanout exchange to publish the same message to all of those queues.
  • Each set of consumers can then bind to its dedicated queue and process the message in its own way.
  • So in the RabbitMQ world, what we call a topic is essentially a separate queue.
  • This makes it a little harder to consume the same message in different ways: you have to create multiple queues, publish the same message to each, and bind consumers to each queue, instead of just creating multiple consumer groups bound to the same topic as in Kafka.
  • Producing the same message to different queues via a fanout exchange is explained here, and sketched below.
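
A sketch of the equivalent setup with pika: a fanout exchange copies the same message into two dedicated queues (all names are assumptions):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# One fanout exchange, two purpose-specific queues bound to it.
channel.exchange_declare(exchange="events", exchange_type="fanout")
channel.queue_declare(queue="logs_queue")
channel.queue_declare(queue="metrics_queue")
channel.queue_bind(exchange="events", queue="logs_queue")
channel.queue_bind(exchange="events", queue="metrics_queue")

# A single publish results in a copy of the message in each queue,
# so each set of consumers can process it in its own way.
channel.basic_publish(exchange="events", routing_key="", body=b"order-created")
```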

Scalability (Consumer Processing)

Kafka

  • Kafka uses partitions to scale. Every topic has partitions, and messages are distributed across partitions (round robin by default, or by message key).
  • Every consumer listening to a topic is assigned one or more partitions. If we have 4 partitions and 4 consumers, each listening to one partition, we can process 4 messages in parallel at a time.
  • In Kafka we can scale horizontally by just adding more partitions and consumers; see the sketch below.
  • Read here to know more about partitions and consumers. You can also read this blog to understand partitioning better.
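
A sketch of creating a topic with 4 partitions via the kafka-python admin client (broker address and topic name assumed); 4 consumers in one group would then each own one partition:

```python
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
# 4 partitions -> up to 4 consumers in a group can process messages in parallel.
admin.create_topics([NewTopic(name="events", num_partitions=4, replication_factor=1)])
```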

RabbitMQ

  • In RabbitMQ there is no concept of a partition. Each message goes to a queue, and consumers fetch messages based on the prefetch count set on each consumer.
  • Hence, to increase consumer throughput we can just add more consumers.
  • The more consumers there are, the faster messages are dequeued from the head of the queue.
  • Read here to know more about consuming messages in RabbitMQ using multiple consumers.

Scalability (Broker)

Kafka

  • Kafka brokers scale well horizontally. We can have multiple brokers, and partitions are distributed across them.
  • If throughput increases we can just add more brokers and partitions.
  • This blog explains very well how partitions are divided across brokers.

RabbitMQ

  • Queues can be distributed across nodes. However, the queues (if HA is enabled) are mirrored (mirrored queues) or replicated (quorum queues) to other nodes.
  • Only one node, the designated master for the queue, processes messages; the others are standbys.
  • Although we can try to distribute queues evenly across nodes, this is not guaranteed, as uncertain events (like network failures) can trigger re-election and change the queue's master.
  • Hence I feel RabbitMQ is designed more for vertical scaling: if throughput increases for a queue, we may get better performance by increasing the RAM, disk, and CPU speed of that queue's master node.

Fault Tolerance and HA

Kafka

  • Kafka uses replica partitions, distributed across nodes, for HA.
  • If the leader node for a partition goes down, another node holding a replica of that partition takes over and becomes the new leader.
  • This blog explains the replication process.

RabbitMQ

  • RabbitMQ uses mirrored queues or quorum queues for HA.
  • With mirrored queues, every queue is mirrored (replicated) to another node. If the master node for a queue goes down, a replica node becomes the new master.
  • Quorum queues use Raft to replicate data to the disks of the other nodes serving the queue. If the leader node for a queue goes down, a new leader is elected via a Raft election.
  • Quorum queues were released recently in version 3.8 and are preferred over mirrored queues if data safety is of utmost importance. You can read more here.
  • Quorum queues are built for data safety and durability. As data is replicated across nodes, quorum queues need more RAM and disk to perform efficiently, and sizing a quorum queue requires analyzing the size of the messages you are going to produce. This blog gives a good idea of the replication happening in quorum and classic mirrored queues.

Large Messages

Kafka

  • Kafka has a default message.max.bytes of 1 MB, which can be overridden in the Kafka server configuration. The broker will reject messages larger than message.max.bytes.
  • Although large messages (>10 MB) can be pushed to Kafka brokers and consumed, Kafka is optimized for messages under 1 MB. This document from Cloudera provides some benchmarks where they found Kafka performs best with smaller messages.
  • It is thus advisable to keep large messages in a separate data or object store and only push a reference to the data (usually an id identifying it in the store) onto the topic; a sketch of this pattern follows.
  • Consumers can then extract the id from the message and use it to query or download the large payload.
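
A hedged sketch of this reference-passing pattern (the object-store client is hypothetical; broker and topic names are assumptions):

```python
import json
import uuid
from kafka import KafkaProducer

def publish_large_payload(object_store, payload: bytes) -> None:
    # Store the large payload outside the broker...
    object_id = str(uuid.uuid4())
    object_store.put(object_id, payload)  # hypothetical object-store API
    # ...and publish only a small reference to it.
    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    producer.send("events", json.dumps({"object_id": object_id}).encode())
    producer.flush()
```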

RabbitMQ

  • There is no hard restriction on the size of a message sent to RabbitMQ.
  • However, best practice is to keep queues lightweight, especially when operating in HA mode, as queue data may be replicated to multiple nodes. In quorum queues all data gets replicated to every Raft member of the queue, and large messages can fill up RAM and disk quickly, decreasing the queue's throughput.
  • RabbitMQ has a memory high watermark; if memory use reaches this threshold it blocks publishers until memory is freed.
  • This blog explains how write amplification can cause the disk of a quorum queue to fill up, halting all the nodes in your RabbitMQ cluster.
  • If you cannot scale vertically to store large messages in the queue, and your throughput is high enough to overwhelm the disk and RAM of your node, it is best to store the message in a data or object store and pass only a reference to it (as in the Kafka sketch above).
  • Consumers can then extract the id from the message and use it to query or download the large payload.

Large Processing time

Kafka

  • Messages with long processing times can be a little tricky in the Kafka ecosystem.
  • Since Kafka 0.10.0, processing time is controlled by max.poll.interval.ms, and there is a separate heartbeat thread whose timeout is controlled by session.timeout.ms.
  • If a consumer does not poll within max.poll.interval.ms, a rebalance is triggered: the consumer is marked as down and its partition is assigned to another consumer.
  • However, for a rebalance to go through, all consumers need to have finished processing their current messages.
  • So if a message takes a long time to process, Kafka has to wait for all consumers to respond (which might take a long time if they are processing slow messages) before the rebalance completes.
  • This can make rebalances very slow. If you have consumers taking hours to process a message, a rebalance can theoretically take hours. This Stack Overflow post provides some context on long-running processes and how they affect rebalances.
  • Also, if a message is taking a long time on one consumer, all the other messages in that consumer's partition are starved until the long-running message is processed. This can turn into a livelock (imagine the consumer gets stuck in while(true)), hence the need for max.poll.interval.ms to detect such a situation and trigger a rebalance.
  • Another tricky part is choosing an optimal value for max.poll.interval.ms, which controls how long a message may take to process. If you keep it very large, it might take a long time to detect livelocks or deadlocks in consumers and rebalance intervals can grow; if you keep it very small, you might trigger frequent rebalances whenever certain messages take long to process. It can therefore be hard to come up with a foolproof value for this timeout. A configuration sketch follows.
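
A configuration sketch with kafka-python showing the two timeouts discussed above (values are illustrative, not recommendations):

```python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="slow-processors",
    session_timeout_ms=10_000,       # heartbeat-based liveness check
    max_poll_interval_ms=600_000,    # max time between poll() calls before a rebalance
)
```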

RabbitMQ

  • Here's where RabbitMQ shines. RabbitMQ does not have any processing timeout; a separate heartbeat thread manages the heartbeat timeout. According to the official RabbitMQ tutorials, it is fine even if the consumer takes a very, very long time to process a message:

“There aren’t any message timeouts; RabbitMQ will redeliver the message when the consumer dies. It’s fine even if processing a message takes a very, very long time.”

  • As there is no concept of partitions and rebalances, even if a message is taking a long time on one consumer, other consumers will keep fetching messages from the head of the queue. Hence, as long as we have enough consumers, no messages will be starved (unless all consumers are busy processing messages, in which case you need to add consumers to match the throughput).
  • You also do not need to manage a processing timeout, as no such timeout exists in RabbitMQ.

Producer Confirms

Kafka

  • Kafka uses ZooKeeper to manage its brokers, but message confirmations come from the brokers themselves.
  • When a message is published, the broker acknowledges it back to the producer; with acks=all the acknowledgment is only sent after the message has been replicated to the in-sync replicas, so the producer knows the message is safe.

RabbitMQ

  • In RabbitMQ you can use publisher confirms to make sure a message is safely stored in the queue.
  • This is needed because various network-related incidents can cause a message to be lost even if the producer sends it successfully to the RabbitMQ exchange. Below is an extract on how publisher confirms work with RabbitMQ's quorum queues and why they matter; a code sketch follows the quote. You can read more here.

Publishers should use publisher confirms as this is how clients can interact with the quorum queue consensus system. Publisher confirms will only be issued once a published message has been successfully replicated to a quorum of nodes and is considered “safe” within the context of the system
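
A sketch of publisher confirms with pika (queue name and payload are assumptions): the publish counts as successful only once the broker confirms it, which for a quorum queue means it has been replicated to a quorum of nodes.

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="orders", durable=True)
channel.confirm_delivery()  # enable publisher confirms on this channel

try:
    channel.basic_publish(
        exchange="", routing_key="orders", body=b"order-created", mandatory=True
    )
    print("message confirmed by the broker")
except pika.exceptions.UnroutableError:
    print("message could not be routed to any queue")
except pika.exceptions.NackError:
    print("message was nacked by the broker")
```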

Additional Infrastructure

Kafka

  • Kafka needs ZooKeeper to manage its nodes, which introduces an additional dependency. (The Kafka project has indicated that it will remove the ZooKeeper dependency in future versions. Click here to know more.)

RabbitMQ

  • RabbitMQ is standalone and does not need any additional infrastructure to manage the nodes in its cluster.

Documentation

  • Both Kafka and RabbitMQ have extensive documentation and also a huge community support and following.

Rolling deployments for long running consumers

Kafka

  • Rolling deployments are nowadays the default way of deploying code, where new artifacts (images, manifests, WARs) are pushed to the system as a rolling upgrade. This ensures there is no downtime while deploying. To know more about rolling deployments you can click here.
  • Rolling deployments of consumers may trigger rebalances as new consumers come up and partitions are reassigned one by one.
  • For long-running processes, rolling deployments can therefore be time consuming.
  • As consumers come up one by one, they may pick up messages and start processing them.
  • However, every consumer deployment triggers a rebalance (Kafka assigns partitions as it discovers consumers), and the rebalance has to wait for all other consumers to respond, some of which may still be processing messages.
  • This makes rolling deployments slow for long-running processes, as each rolling update triggers rebalances that can be lengthy.
  • If your processes are not long running, rebalances will be fast and will not noticeably impact deployment time.

RabbitMQ

  • Rolling deployments are not a problem, as there is no concept of partitions or rebalances in RabbitMQ.
  • As new consumers become available, they simply start consuming from the head of the queue.

Memory, Disk in HA mode

Kafka

  • As messages are distributed across nodes and partitions in Kafka, it can scale horizontally if more memory or disk is required.
  • Kafka also has a retention period for every message, after which the message is marked for deletion; this helps manage memory and disk space.

RabbitMQ

  • When quorum queues are used, RabbitMQ keeps all messages in memory and on disk, and all messages are replicated across all the nodes serving the queue.

Quorum queues typically require more resources (disk and RAM) than classic mirrored queues. To enable fast election of a new leader and recovery, data safety as well as good throughput characteristics all members in a quorum queue “cluster” keep all messages in the queue in memory and on disk.

  • This can make it hard to scale horizontally if throughput or message size increases.
  • RabbitMQ also performs best in HA mode when all messages fit in RAM. The emptier the queue, the better its throughput and performance.
  • This architecture makes sense for RabbitMQ, as it is designed to operate as a traditional queue where messages are dequeued constantly.
  • This blog gives you some best practices and explains the importance of keeping queues short. However, queue sizes and cluster sizing vary based on individual use cases.
  • RabbitMQ also has a TTL for every message (currently not supported for quorum queues, but it might be added in the future), after which the message is marked for deletion.

Consumer Acknowledgments & Retries

Kafka

  • We can configure Kafka to not mark a message as consumed (i.e. not advance the offset) by disabling auto commit. This allows consumers to mark a message as consumed manually (using commit sync or commit async).
  • This gives us the ability to retry a message if an infrastructure or network call fails while consuming it. It is a very useful feature: we effectively get automatic retries as long as we do not commit the offset until the message is properly processed.
  • Kafka will redeliver the same message and not advance the committed offset until it is committed.
  • If auto commit is disabled, it is up to the consumer to manage retries and commits. For example, a consumer can commit if an error cannot be fixed by retrying (e.g. a code bug), but skip the commit for errors a retry can fix (e.g. socket or connection timeouts) so that the same message is replayed and the retry resolves the issue. A sketch of this flow follows.
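
A sketch of this manual-commit flow with kafka-python (process and TransientError are hypothetical placeholders for your business logic):

```python
from kafka import KafkaConsumer, TopicPartition

class TransientError(Exception):
    """Hypothetical error type for failures a retry can fix."""

def process(payload: bytes) -> None:
    """Hypothetical business logic."""

consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="order-processors",
    enable_auto_commit=False,
)

for record in consumer:
    try:
        process(record.value)
        consumer.commit()       # mark as consumed only after success
    except TransientError:
        # Do not commit; seek back so the same message is polled and retried.
        consumer.seek(TopicPartition(record.topic, record.partition), record.offset)
    except Exception:
        consumer.commit()       # non-retryable error: skip the poison message
```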

RabbitMQ

  • By default (in auto-ack mode) RabbitMQ works in a fire-and-forget fashion: once a message is delivered it is considered consumed and cannot be processed again.
  • However, we can set auto ack to false and acknowledge the message manually (similar to the commit sync described above). We also have the option to requeue a message if it cannot be acknowledged.
  • RabbitMQ will try to place a re-queued message at, or near, the head of the queue. Because it goes near the head and not to the back, a message re-queued due to a failure does not suffer from starvation (RabbitMQ makes a best effort to keep it near the head).
  • We can also track how many times a message has been delivered via the x-delivery-count header, which counts how many times the message has been retried. Currently this is only available on quorum queues.
  • Auto ack, manual ack, and re-queuing are described here.
  • x-delivery-count can be looked up here; a sketch combining both follows.
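
A sketch combining manual acknowledgments, re-queueing, and x-delivery-count with pika (queue name, handler, and retry limit are assumptions; the header exists only on quorum queues):

```python
import pika

def process(payload: bytes) -> None:
    """Hypothetical business logic."""

def handle(ch, method, properties, body):
    deliveries = (properties.headers or {}).get("x-delivery-count", 0)
    try:
        process(body)
        ch.basic_ack(delivery_tag=method.delivery_tag)   # success: remove from the queue
    except Exception:
        # requeue=True puts the message back near the head of the queue;
        # after too many attempts, drop it (or dead-letter it) instead.
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=deliveries < 5)

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_consume(queue="orders", on_message_callback=handle, auto_ack=False)
channel.start_consuming()
```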

Fair Dispatch

Kafka

  • In Kafka, messages are split into partitions. If the consumer attached to a partition is slow, messages in that partition are processed more slowly than those in other partitions.
  • This can cause messages that arrived early to be processed later simply because they were sent to a slow-moving partition (unfair dispatch).

RabbitMQ

  • RabbitMQ also prefetches messages, so we can have the same unfair-dispatch issue: if a lightweight message is prefetched behind long-running messages, it may be starved until they finish.
  • However, if we keep the prefetch count at 1, we can guarantee fair dispatch; see the sketch below.
  • The "Fair dispatch" section in this RabbitMQ tutorial explains it best.
  • If every consumer picks up one message and only fetches the next after consuming and acknowledging the current one, no message will be unfairly starved.
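
A sketch of fair dispatch with pika: prefetch_count=1 means a consumer is not handed a new message until it has acknowledged the current one (queue name and work function are assumptions):

```python
import pika

def do_work(payload: bytes) -> None:
    """Hypothetical long-running work."""

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_qos(prefetch_count=1)  # deliver at most one unacknowledged message

def handle(ch, method, properties, body):
    do_work(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)   # only now is the next message sent

channel.basic_consume(queue="task_queue", on_message_callback=handle, auto_ack=False)
channel.start_consuming()
```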

Message Order

Kafka

  • Messages within a partition are ordered, but there is no ordering guarantee across partitions.
  • We can guarantee total ordering by using just one partition. Messages are then consumed in arrival order (like a FIFO queue), as there is just one consumer and one partition.
  • We can still run more consumers than partitions in this mode; the extra consumers simply act as standbys in case the consumer serving the only partition fails.

RabbitMQ

  • Ordering cannot be guaranteed unless the prefetch count is 1 and only one consumer is attached to the queue (a very restrictive mode of working, to be used only if strict ordering is required). We also cannot have standbys, as adding another consumer would immediately start dequeuing messages from the queue.
  • Also, if we retry by re-queuing a message using manual acknowledgments, there is no guarantee it will be placed exactly at the head of the queue (although RabbitMQ makes a best effort). Hence a failed, re-queued message may be processed out of order.

Complex Routing

Kafka

  • Routing in Kafka is managed using topics.
  • Producers produce messages to a topic and consumers consume from that topic.
  • Complex, pattern-based routing is not available in Kafka.

RabbitMQ

  • Routing in RabbitMQ is managed using exchanges and routing keys.
  • Hence very complex routing is possible in RabbitMQ.
  • One of RabbitMQ's headline features is flexible routing. You can read more about it here.
  • This documentation also provides a detailed explanation of message routing.
  • A topic exchange can route messages to various queues based on wildcard patterns, as shown here and sketched after the quote below.

The routing pattern follows the same rules as the routing key with the addition that * matches a single word, and # matches zero or more words. Thus the routing pattern *.stock.# matches the routing keys usd.stock and eur.stock.db but not stock.nasdaq.
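
A sketch of this topic-exchange routing with pika, using the pattern from the quote (exchange and queue names are assumptions):

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

channel.exchange_declare(exchange="stocks", exchange_type="topic")
channel.queue_declare(queue="stock_updates")
# "*" matches exactly one word, "#" matches zero or more words.
channel.queue_bind(exchange="stocks", queue="stock_updates", routing_key="*.stock.#")

channel.basic_publish(exchange="stocks", routing_key="usd.stock", body=b"...")     # routed
channel.basic_publish(exchange="stocks", routing_key="eur.stock.db", body=b"...")  # routed
channel.basic_publish(exchange="stocks", routing_key="stock.nasdaq", body=b"...")  # not routed
```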

UI and CLI

Kafka

  • Kafka has a robust and well documented CLI.
  • However, it lacks UI support, as there is no official UI for Kafka.

RabbitMQ

  • The RabbitMQ installation comes with an official UI and CLI.
  • The UI is very handy, as we can create queues and exchanges and publish messages right from it.
  • Most importantly, the RabbitMQ UI shows cluster statistics such as RAM usage, disk space, throughput per queue, and the number of live consumers attached to each queue.
  • These statistics help a lot while debugging, evaluating, and iteratively fine-tuning the cluster based on visual feedback.

Conclusion

To conclude, both Kafka and RabbitMQ can be used as event processing platforms, but they are optimized for different use cases. There is no clear winner, but hopefully this article gives you a head start and points you toward what fits your use case.


Pratik Kale

Full stack software engineer with focus on working across stacks to design and architect end to end distributed systems