What Are Kafka and Kafka Connect, and How Can We Use Them on IBM Cloud?
Last week, I had a conversation with one of my friends about the different solutions for streaming data from one data source to another data store. One of my proposed solutions was using Kafka Connect to stream the data with almost zero code. I was surprised to learn that this was the first time he had heard about Kafka, so I decided to write this blog to give him a quick introduction and share it on my profile to spread the knowledge.
Before talking about Kafka Connect, I want to take you back a few years to remember how we used to store our data and give a quick introduction to Kafka. Then I will show how we can use Kafka Connect to stream data between different systems.
Storing our applications’ data is one of the most important aspects that all of us deal with every day. For years, we have used databases to store this data. Databases store objects, like persons or cars, and even the relationships between these objects, in rows and columns. In other words, a database stores the state of these objects.
In the current data era, everything sends data to everything: airplanes, printers, home appliances, cars, trains, and so on. This increases the need to stream the events occurring in these devices and machines as soon as possible, so that decisions can be made at the right time.
Kafka encourages us to change the way we store data. Instead of storing an object’s state, we can store the data as events, where each event describes what happened at a particular time.
Kafka stores the data in a log structure called a topic. A topic is an ordered list of events.
Applications or microservices produce messages or events, and Kafka stores them in topics so that other applications can consume these messages and act on them.
Apache Kafka is a distributed streaming platform that allows the development of real-time applications. Kafka is designed to be highly reliable: within a partition, the data is delivered in the order in which it was received.
Kafka can run as a cluster of multiple nodes in the same data center or distributed across different data centers. Messages are partitioned and replicated across the nodes to support high availability and high volume without lag.
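To make partitions and replication a bit more concrete, here is a minimal sketch that uses Kafka's Java AdminClient to create a topic whose messages are split over six partitions, each copied to three brokers. The broker address, topic name, and counts are placeholder values for illustration.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 6 partitions spread the load; replication factor 3 keeps copies on 3 brokers.
            NewTopic orders = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(Collections.singleton(orders)).all().get();
        }
    }
}
```

A replication factor of 3 means the cluster can keep serving the topic even if one of the brokers goes down.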
Kafka has 4 core APIs.
Kafka Producer:
The Producer API is used by applications to produce messages to a topic. These messages can be retained for minutes, hours, days, or even forever, depending on the storage available in your cluster.
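As an illustration, here is a minimal Java sketch of the Producer API. The broker address, topic name, and the order event itself are made-up values for this example.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each event describes what happened to an order at a particular time.
            producer.send(new ProducerRecord<>("orders", "order-42",
                    "{\"status\":\"CREATED\",\"at\":\"2021-06-01T10:15:00Z\"}"));
            producer.flush();
        }
    }
}
```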
Kafka Consumer:
The Consumer API subscribes to a topic and consumes the messages once they arrive. It consumes the data in the same format that was used to persist the messages to the topic.
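Here is a matching minimal consumer sketch, again with a placeholder broker address, consumer group, and topic name.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "order-processors");        // assumed consumer group
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // The value arrives in exactly the format the producer wrote it in.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```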
Kafka Streams:
Kafka Streams can be used to consume, transform, and aggregate the data persisted in a topic and produce the transformed messages to another topic.
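For example, here is a minimal Kafka Streams sketch that reads events from one topic, applies a simple transformation, and writes the result to another topic. The topic names and the transformation are placeholders.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class OrderTransformer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-transformer");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Consume from "orders", transform each value, and produce to "orders-transformed".
        KStream<String, String> orders = builder.stream("orders");
        orders.mapValues(value -> value.toUpperCase())
              .to("orders-transformed");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```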
Kafka Connect:
Kafka Connect is our key API for this blog. It is used to integrate external systems with Kafka, such as producing data to Kafka from MongoDB, or consuming data from Kafka and moving it to something else like Salesforce or MQ. I will dig deeper into Kafka Connect in a few minutes.
For more details about Kafka, you can watch these great videos:
What is Apache Kafka? (A Confluent Lightboard by Tim Berglund) + ksqlDB
What is Kafka? #Kafka #EventStreams #IBMCloud
Now let’s talk about the different flavors of Kafka on IBM Cloud.
• The first option is to use Event Streams as a service. IBM Event Streams is a high-throughput messaging service that is built on top of open-source Apache Kafka. It comes with three different plans (Lite/Free, Standard, and Enterprise). It is a managed service, and it is supported and monitored by an expert DevOps team.
To provision Event Streams as a service, you need your own IBM Cloud account; then you can create the service with the plan that meets your needs.
You can learn about the different plans from this link.
I would recommend reviewing the IBM Cloud documentation for more details about Event Streams as a service.
• Also, if you have regulations or rules that mandate storing your data in a hybrid or private cloud, you can run Event Streams on Cloud Pak for Integration on top of Red Hat OpenShift in your private or hybrid cloud. This gives you control over your Event Streams service and lets you manage security in a way that meets your legal regulations or any other mandates. For more details, you can refer to IBM Cloud Pak for Integration.
• As a third option, you can use Confluent Platform as an add-on to Cloud Pak for Integration on top of Red Hat OpenShift. For more details about Confluent on top of IBM Cloud Pak for Integration, please refer to this documentation.
• Or you can build your own Kafka cluster on Kubernetes, OpenShift, or any infrastructure of your choice on IBM Cloud.
Now, let us take a deeper look at Kafka Connect:
Kafka Connect has two types of connectors: source connectors and sink connectors.
A source connector gets data from a non-Kafka system and makes it available in Kafka. A sink connector consumes data from Kafka and makes it available in a non-Kafka system.
As you can see in the diagram above, Kafka Connect allows us to connect to third-party systems, get data from them, and produce that data as events or messages into Kafka topics. This is what we call a source connector.
On the opposite side, a sink connector consumes the messages or events and sends them to these third-party systems.
So, you can think of Kafka Connect as a data integration component that acts as a Kafka client to produce messages to, or consume messages from, Kafka.
Kafka Connect is a separate server-side component that you can deploy on any infrastructure you prefer. It can be deployed as a cluster of workers to support high volume.
Kafka Connect is used to run connectors. Connectors run inside Kafka Connect as runtime components managed by the Kafka Connect cluster.
Connectors abstract the integration details away from developers. All developers need to do is define the integration properties as JSON and send them to the Connect REST API to start the connector.
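For example, here is a minimal sketch of starting a connector by POSTing its JSON configuration to the Connect REST API. The connector name, connector class, Connect worker URL, and topic are hypothetical placeholders; every real connector has its own set of configuration properties.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnector {
    public static void main(String[] args) throws Exception {
        // Hypothetical connector configuration -- replace the class and properties
        // with the ones documented for the connector you are deploying.
        String connectorConfig = """
            {
              "name": "my-source-connector",
              "config": {
                "connector.class": "com.example.MySourceConnector",
                "tasks.max": "1",
                "topic": "orders"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://my-connect-cluster:8083/connectors")) // assumed worker URL
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorConfig))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Once the request succeeds, the Connect cluster starts the connector and distributes its tasks across the workers.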
With Kafka Connect, you do not need to repeat what others have already done. The Kafka Connect community has contributed tens of connectors for many third-party systems. All you need to do is go to Confluent Hub, download the connector JAR, include it in your Connect cluster, and you are ready to go.
Confluent Hub has many free and community-licensed connectors that you can use at no cost, as well as a few others under a commercial license that you need to pay to use.
If you cannot find your connector on Confluent Hub, you may find that someone on the internet with the same business case has already built the connector and open-sourced it somewhere else, like GitHub. And if you still cannot find the connector you need, the Kafka Connect APIs make it easy to implement your own, as the sketch below shows.
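To give you an idea of what that involves, here is a minimal, hypothetical skeleton of the two classes a custom source connector provides: a Connector class that describes the configuration and a Task class that does the actual reading. The class names and the external.url property are made up for this sketch, and the task body is left empty.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public class MySourceConnector extends SourceConnector {
    private Map<String, String> configProps;

    @Override public void start(Map<String, String> props) { this.configProps = props; }

    @Override public Class<? extends Task> taskClass() { return MySourceTask.class; }

    @Override public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Keep it simple: one task that receives the whole configuration.
        return Collections.singletonList(configProps);
    }

    @Override public void stop() { }

    @Override public ConfigDef config() {
        // Declare the configuration keys the connector accepts.
        return new ConfigDef().define("external.url", ConfigDef.Type.STRING,
                ConfigDef.Importance.HIGH, "Address of the external system to read from");
    }

    @Override public String version() { return "0.1.0"; }

    // The task is where the integration logic actually lives.
    public static class MySourceTask extends SourceTask {
        @Override public void start(Map<String, String> props) {
            // Open a connection to the external system here.
        }

        @Override public List<SourceRecord> poll() throws InterruptedException {
            // Read new data from the external system and wrap it in SourceRecords.
            // Returning null tells the framework there is nothing new yet.
            return null;
        }

        @Override public void stop() {
            // Close the connection to the external system here.
        }

        @Override public String version() { return "0.1.0"; }
    }
}
```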
In the end, I would like to share with you a few tutorials that demonstrate two use cases for Kafka Connect on IBM Cloud.
Please review the Medium tutorials below. They provide step-by-step guidance that shows how to set up Event Streams and a Kafka Connect cluster on Cloud Pak for Integration, and how to ingest data from Cloudant DB and make it available in an Event Streams topic.
Setup Events Stream Cluster in Cloud Pak for Integration
Generate Kafka credentials to connect to Events Stream Cluster
Set up a Kafka Connect Cluster in IBM Cloud Pak for Integration
Deploy and configure IBM Cloudant Kafka Connector for IBM Events Stream
Also, you can watch this video, prepared by Rashid A Aljohani, to learn how to read messages from an IBM MQ cluster and write them to an Apache Kafka topic running in a Confluent environment on top of an OpenShift cluster on IBM Cloud.