Kafka has four core APIs.
- Producer API: Lets applications publish records to Kafka topics.
- Consumer API: Lets applications read records from one or more topics and process them.
- Streams API: Lets applications consume records from one or more topics, process them, and publish the results to one or more topics.
- Connect API: Lets you build and run reusable producers and consumers (connectors) that link Kafka topics to existing applications or data systems.
Topics and Partitions:
In Kafka, a topic is a named category or feed of data. Every record (message) is published to a topic, and consumers can read from one or more topics. A cluster can hold any number of topics, as long as each has a unique name.
Topics are split into partitions, and a topic can have any number of them. New messages are always appended to the end of a partition, and each message in a partition is assigned an offset: an increasing number that starts at 0.
Messages written to a partition are immutable and cannot be changed. Ordering is guaranteed only within a partition, not across the partitions of a topic. When a message has no key, the producer spreads messages across partitions for load balancing (round-robin or sticky batching, depending on the client version). When a key is provided, it determines the partition, so all messages with the same key go to the same partition.
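To make key-based assignment concrete, here is a minimal sketch that hashes a key and takes the remainder by the partition count. It is an illustration only: `choose_partition` is a hypothetical helper, and `zlib.crc32` stands in for the hash (Kafka's Java client uses murmur2), so the partition numbers will not match a real cluster.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition so the same key always lands in the same place."""
    # crc32 is a stand-in hash for this sketch; real clients use a different one.
    return zlib.crc32(key) % num_partitions


# Three partitions, each modelled as an append-only list.
partitions = {p: [] for p in range(3)}

records = [(b"user-1", "login"), (b"user-2", "login"), (b"user-1", "logout")]
for key, value in records:
    partitions[choose_partition(key, 3)].append((key, value))

# Both "user-1" events sit in one partition, in the order they were sent.
user1_partition = choose_partition(b"user-1", 3)
user1_events = [v for k, v in partitions[user1_partition] if k == b"user-1"]
```

Because every event for `user-1` ends up in the same partition, a consumer sees them in the order they were produced, which is why keys are the usual way to get per-entity ordering.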
Data in Kafka is kept for a limited time (the default retention period is one week, and it is configurable). After that, the data is deleted from the server. So roughly a week after it was written, the message at offset 0 in partition 0 becomes eligible for deletion. Remember that offsets only ever increase: even after offset 0 has been deleted, it is never reassigned to new data.
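The relationship between retention and offsets can be sketched with a small in-memory log. This is a simplified model, not how the broker is implemented: the `PartitionLog` class is hypothetical, the 10-second retention is an arbitrary value chosen for the example, and a real broker removes whole log segments rather than individual records.

```python
from collections import deque


class PartitionLog:
    """Toy append-only log with time-based retention."""

    def __init__(self, retention_seconds: float):
        self.retention_seconds = retention_seconds
        self.records = deque()  # entries are (offset, timestamp, value)
        self.next_offset = 0    # only ever increases

    def append(self, value, now: float) -> int:
        offset = self.next_offset
        self.records.append((offset, now, value))
        self.next_offset += 1
        return offset

    def expire(self, now: float) -> None:
        # Drop records from the front while they are older than the retention window.
        while self.records and now - self.records[0][1] > self.retention_seconds:
            self.records.popleft()

    def offsets(self):
        return [offset for offset, _, _ in self.records]


log = PartitionLog(retention_seconds=10)
log.append("a", now=0)                # offset 0
log.append("b", now=5)                # offset 1
log.expire(now=12)                    # "a" is 12 seconds old, so it is removed
new_offset = log.append("c", now=12)  # gets offset 2, never the freed offset 0
```

After the expiry step the log holds offsets 1 and 2: the deleted offset 0 is gone for good and is not reused.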
A Kafka cluster is made up of multiple servers, called brokers. Each broker is identified by its ID. Once you connect to any broker in the cluster, you can reach all of them, because every broker can tell the client about the rest of the cluster. Each broker holds some of the topic partitions. Each partition has one broker that acts as its leader and zero or more brokers that act as followers. By default, all reads and writes for a partition are handled by its leader, and the followers replicate the data. If the leader of a partition goes down, one of that partition's followers automatically becomes the new leader.
Each partition can be replicated across multiple servers; the number of copies is the topic's replication factor. This gives us fault-tolerant storage: even if one of the servers goes down, the data can still be served from a replica on another server.
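Leader failover can be pictured with a short sketch. It assumes a simplified rule, "the first surviving replica becomes leader"; a real cluster chooses the new leader from the replicas that are in sync with the old one. The `elect_leader` helper and the broker IDs are hypothetical.

```python
def elect_leader(replicas, alive):
    """Return the first replica that is still alive, or None if none are left."""
    for broker_id in replicas:
        if broker_id in alive:
            return broker_id
    return None


replicas = [101, 102, 103]  # one partition with replication factor 3
alive = {101, 102, 103}

leader = elect_leader(replicas, alive)           # broker 101 leads initially

alive.discard(101)                               # the leader fails
leader_after_one = elect_leader(replicas, alive)

alive.discard(102)                               # a second broker fails
leader_after_two = elect_leader(replicas, alive)

alive.discard(103)                               # the last copy is gone
leader_after_three = elect_leader(replicas, alive)
```

With three replicas the partition stays available through two failures and only becomes unavailable when the third replica is lost.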
Producers publish data to topics. To publish, a producer needs a topic name and the address of at least one broker to connect to. The producer decides which partition each record is assigned to. Producers can also choose what level of acknowledgment they wait for on each write, using the acks setting:
- acks = 0 : The producer does not wait for any acknowledgment.
- acks = 1 : The producer waits only for the leader's acknowledgment.
- acks = all : The producer waits until the leader and all in-sync replicas have acknowledged the write.
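As a rough illustration of the acknowledgment levels, the sketch below maps each setting to the number of replicas that must hold a record before the producer's write is confirmed. The producer setting is named `acks` and accepts `0`, `1`, and `all` (with `-1` as an alias for `all`). The `replicas_required` helper and the sample replica count are hypothetical and not part of any Kafka client.

```python
def replicas_required(acks: str, in_sync_replicas: int) -> int:
    """How many replicas must have the record before the write is acknowledged.

    `in_sync_replicas` counts the leader plus every follower currently in sync.
    """
    if acks == "0":
        return 0                 # fire and forget: no acknowledgment at all
    if acks == "1":
        return 1                 # the leader alone has written the record
    if acks in ("all", "-1"):
        return in_sync_replicas  # the leader and every in-sync follower
    raise ValueError(f"unsupported acks value: {acks!r}")


# A partition whose leader and two followers are all in sync.
required = {acks: replicas_required(acks, in_sync_replicas=3) for acks in ("0", "1", "all")}
```

The trade-off is latency against durability: waiting for fewer replicas is faster, while `all` protects the record against the loss of the leader.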
Consumers read data from topics. Consumers can be organized into consumer groups. Within a group, each partition is read by exactly one consumer at a time, although one consumer may read from several partitions. If there are 3 consumers in a group and 3 partitions, each consumer reads from one partition in parallel. It is therefore pointless to have more consumers in a group than partitions: the extra consumers will sit idle, because a partition is never shared between two consumers of the same group.
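The effect of group size on partition assignment can be sketched in a few lines. This assumes a plain round-robin split and hypothetical consumer names; Kafka's real assignment strategies are more involved, but the property that a partition belongs to one consumer per group is the same.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2]

balanced = assign_partitions(partitions, ["c1", "c2", "c3"])
# each consumer owns one partition

fewer = assign_partitions(partitions, ["c1", "c2"])
# c1 owns two partitions, c2 owns one

extra = assign_partitions(partitions, ["c1", "c2", "c3", "c4"])
# c4 owns nothing and sits idle
```

Adding a fourth consumer to a three-partition topic adds no throughput; to scale the group further, the topic needs more partitions.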
Once a consumer has received and processed data, it commits its offsets to Kafka. Kafka stores these offsets in an internal topic named ‘__consumer_offsets’. If a consumer dies and comes back online later, it uses the committed offset to resume reading from the point where it left off.
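Committing and resuming can be modelled with a dictionary standing in for the offsets topic. This is a sketch under simplifying assumptions: the `consume` helper, the group name, and the topic name are hypothetical, and the commit happens right after each batch. The one detail taken from Kafka is that the committed value is the offset of the next record to read.

```python
# One partition, modelled as a list whose indexes are the offsets.
partition_log = ["m0", "m1", "m2", "m3", "m4"]

# Stand-in for the offsets topic: (group, topic, partition) -> next offset to read.
committed = {}


def consume(group, topic, partition_id, log, max_records):
    """Read a batch starting at the committed offset, then commit the new position."""
    key = (group, topic, partition_id)
    start = committed.get(key, 0)           # no commit yet: start from the beginning
    batch = log[start:start + max_records]
    # ... the records in `batch` would be processed here ...
    committed[key] = start + len(batch)     # commit the offset of the next record
    return batch


first = consume("billing", "orders", 0, partition_log, max_records=3)

# The consumer crashes and restarts; only the committed offset survives.
second = consume("billing", "orders", 0, partition_log, max_records=3)
```

The restarted consumer picks up at `m3` instead of re-reading the first three messages, because its position was stored outside the consumer itself.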
Kafka gives us the following guarantees:
- Messages sent by a producer to a partition are appended in the order they are sent.
- A consumer reads messages in the order they are stored in a partition.
- For a topic with replication factor N, we can tolerate up to N-1 server failures without losing any committed records.
We have gone through the basics of Apache Kafka in this article. In the next article, we will learn how to install Kafka.
I am passionate about data analytics, machine learning, and artificial intelligence. Recently I have started blogging about my experience while learning these exciting technologies.