I want to build a real-time data pipeline with Apache Kafka. I have a database at a remote location that is continuously being updated. Can anybody tell me which Kafka Connect API I should use to pull the data from the database and ingest it into a Kafka broker in real time? Later on I would use Kafka Streams and KSQL to run ad-hoc queries and compute the metrics.
Any help would be highly appreciated!
If you're reading from a MySQL database, use Confluent's JDBC Source connector: https://github.com/confluentinc/kafka-connect-jdbc/ You'll also need to download the MySQL JDBC driver and put it alongside the Kafka Connect JARs: https://dev.mysql.com/downloads/connector/j/5.1.html
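As a rough sketch (the connection URL, table, and column names below are placeholders you would replace with your own schema), a timestamp+incrementing source configuration could look like this:

```json
{
  "name": "jdbc-source-mysql",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://remote-host:3306/mydb?user=myuser&password=mypassword",
    "table.whitelist": "orders",
    "mode": "timestamp+incrementing",
    "timestamp.column.name": "updated_at",
    "incrementing.column.name": "id",
    "topic.prefix": "mysql-",
    "poll.interval.ms": "1000"
  }
}
```

You can register it with a POST to the Kafka Connect REST API, e.g. `curl -X POST -H "Content-Type: application/json" --data @jdbc-source.json http://localhost:8083/connectors`. Note that this connector polls the table, so `poll.interval.ms` controls how quickly new rows show up in the topic.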
If you want to create a real-time data pipeline you need to use a Change Data Capture (CDC) tool that can stream changes from MySQL. I would suggest Debezium, an open-source distributed platform for change data capture.
Capturing Inserts
When a new record is added to a table, a JSON similar to the one below will be produced:
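A sketch of such an event is shown below; the `schema` and `source` sections are omitted for brevity, and the table columns and values are purely illustrative:

```json
{
  "before": null,
  "after": {
    "id": 1004,
    "first_name": "Anne",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  "op": "c",
  "ts_ms": 1465491411815
}
```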
The `before` object is null and the `after` object contains the newly inserted values. Note that the `op` attribute is `c`, indicating that this was a CREATE event.

Capturing Updates
Assuming that the `email` attribute has been updated, a JSON similar to the one below will be produced:
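Again a trimmed, illustrative sketch (same caveats as above):

```json
{
  "before": {
    "id": 1004,
    "first_name": "Anne",
    "last_name": "Kretchmar",
    "email": "annek@noanswer.org"
  },
  "after": {
    "id": 1004,
    "first_name": "Anne",
    "last_name": "Kretchmar",
    "email": "anne.kretchmar@example.com"
  },
  "op": "u",
  "ts_ms": 1465581029523
}
```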
Notice `op`, which is now `u`, indicating that this was an UPDATE event. The `before` object shows the row state before the update and the `after` object captures the current state of the updated row.

Capturing Deletes
Now assume that the row has been deleted:
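An illustrative sketch of the delete event:

```json
{
  "before": {
    "id": 1004,
    "first_name": "Anne",
    "last_name": "Kretchmar",
    "email": "anne.kretchmar@example.com"
  },
  "after": null,
  "op": "d",
  "ts_ms": 1465581902461
}
```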
`op` is now equal to `d`, indicating a DELETE event. The `after` attribute will be null and the `before` object contains the row before it gets deleted.

You can also have a look at the extensive tutorial provided on their website.
EDIT: Example configuration for a MySQL database
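The hostname, credentials, and database names below are placeholders for your own environment (the property names are the ones used by the Debezium MySQL connector; newer Debezium releases rename a few of them, e.g. `database.whitelist` becomes `database.include.list`):

```json
{
  "name": "mysql-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "tasks.max": "1",
    "database.hostname": "remote-host",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz-password",
    "database.server.id": "184054",
    "database.server.name": "dbserver1",
    "database.whitelist": "inventory",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}
```

Once registered with Kafka Connect, each captured table is streamed into a topic named `<server-name>.<database>.<table>`, e.g. `dbserver1.inventory.customers`, which you can then query with Kafka Streams or KSQL.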