The requirement is to migrate the data in a MySQL database to Hadoop/HBase for analytics purposes.
The data should be migrated in real time or near real time. Can Flume support this?
What would be a better approach? Please share your thoughts.
The direct answer to your question is yes. Flume is designed as a distributed data transport and aggregation system for event/log-structured data. Set up correctly, Flume can push data into Hadoop for continuous ingestion: the sources collect data from wherever it is produced (in this case MySQL), and once data is available at the source, the sink will write it to HDFS with very low latency. Once the data lands in HDFS/HBase you can run queries on it and process it, depending on your infrastructure.
So the Flume configuration is what matters most for pushing data to HDFS in near real time; beyond that, performance depends on your MapReduce cluster and on how the queries over that data are written.
I also found the following resource on using Flume with HDFS: http://assets.en.oreilly.com/1/event/61/Real-time%20Streaming%20Analysis%20for%20Hadoop%20and%20Flume%20Presentation.pdf
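As a rough illustration, a minimal Flume agent writing to HDFS looks something like the sketch below. The agent name, paths, and channel sizing are placeholders, and the MySQL-reading source is hypothetical: stock Flume ships no MySQL source, so you would plug in a custom or third-party source there.

# Minimal sketch of a Flume agent with an HDFS sink (names/paths are placeholders).
agent1.sources  = mysql-src
agent1.channels = mem-ch
agent1.sinks    = hdfs-sink

# Hypothetical custom source that reads changes from MySQL (not part of stock Flume)
agent1.sources.mysql-src.type = org.example.flume.MySqlSource
agent1.sources.mysql-src.channels = mem-ch

# In-memory channel buffering events between source and sink
agent1.channels.mem-ch.type = memory
agent1.channels.mem-ch.capacity = 10000

# HDFS sink writing events into date-partitioned directories
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = mem-ch
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/mysql/%Y-%m-%d
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.hdfs.rollInterval = 60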
To my understanding Flume is not meant for that. Flume is basically for reading log-like data sources (flat files) and pushing them in a structured way to a sink. Sqoop can be of use; try it out: http://archive.cloudera.com/cdh/3/sqoop/SqoopUserGuide.html
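For reference, a one-off bulk import with Sqoop looks roughly like this; the connection string, credentials, table name, and target directory are placeholders:

# Bulk-import one MySQL table into HDFS (host/db/table/user are placeholders)
sqoop import \
  --connect jdbc:mysql://dbhost:3306/mydb \
  --username dbuser -P \
  --table orders \
  --target-dir /data/mydb/orders \
  --num-mappers 4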
Indeed, Flume is mostly used to collect log or other time-series data. I have rarely heard of people using Flume to import data from a database into Hadoop.
Sqoop is good for bulk import from an RDBMS to HDFS/Hive/HBase. If it's a one-time import, it is very good and does what it promises on paper. But the problem comes when you want real-time incremental updates. Sqoop supports two types of incremental imports (both are sketched after this list):
Append: this one allows you to re-run the Sqoop job, and every new job starts where the last job ended. E.g. if the first job imported rows 0-100, the next job starts from row 101 based on --last-value=100. But if rows 0-100 were updated afterwards, append mode will not pick up those changes.
last-modified: this one is even worse IMHO; it requires the source table to have a timestamp column indicating when each row was last updated, and it uses that timestamp to do the incremental import. If the source table has nothing like that, this mode is not useful.
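Roughly, the two modes look like this on the command line; the table, column names, and last-value checkpoints are placeholders:

# Incremental append: only rows with id > 100 are imported
sqoop import \
  --connect jdbc:mysql://dbhost:3306/mydb \
  --username dbuser -P \
  --table orders \
  --target-dir /data/mydb/orders \
  --incremental append \
  --check-column id \
  --last-value 100

# Incremental lastmodified: rows touched since the timestamp are re-imported
# and merged on the key column (requires an updated_at-style column)
sqoop import \
  --connect jdbc:mysql://dbhost:3306/mydb \
  --username dbuser -P \
  --table orders \
  --target-dir /data/mydb/orders \
  --incremental lastmodified \
  --check-column updated_at \
  --last-value "2014-01-01 00:00:00" \
  --merge-key id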
AFAIK, there is no perfect solution for this. If you care more about real-time, you can add triggers, or use Flume to scan the binary log (binlog) and keep HBase updated in an RT fashion; this sounds nasty, I know. Or use Sqoop+Oozie to periodically import your source table into HBase, in which case you won't get RT or even near-RT response. So far, given there is nothing else around, I'd vote for a custom Flume source.
You can use Hadoop Applier, a MySQL product, which can be found here: http://labs.mysql.com/
It transfers data from MySQL to HDFS in real time.
I use Flume to store data both on HDFS and in MySQL. For the latter I use the JDBC sink com.stratio.ingestion.sink.jdbc.JDBCsink, but I am not very satisfied with its performance. Since you need to migrate data from MySQL, I would suggest using Sqoop. Another option is to use Flume to send data to both MySQL and HBase/Hadoop, so you won't need to copy from MySQL to Hadoop/HBase at all.
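A rough sketch of that fan-out setup is below, assuming one upstream source replicated into two channels, an HDFS sink on one and the Stratio JDBC sink on the other. The source type, ports, and especially the JDBC sink's property names are assumptions and should be checked against the stratio-ingestion documentation.

# One source replicated into two channels: one feeding HDFS, one feeding MySQL.
agent1.sources  = app-src
agent1.channels = hdfs-ch jdbc-ch
agent1.sinks    = hdfs-sink jdbc-sink

# Upstream Avro source (placeholder); replicate each event to both channels
agent1.sources.app-src.type = avro
agent1.sources.app-src.bind = 0.0.0.0
agent1.sources.app-src.port = 41414
agent1.sources.app-src.channels = hdfs-ch jdbc-ch
agent1.sources.app-src.selector.type = replicating

agent1.channels.hdfs-ch.type = memory
agent1.channels.jdbc-ch.type = memory

# HDFS sink for the Hadoop/HBase side
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.channel = hdfs-ch
agent1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d

# Stratio JDBC sink for the MySQL side; property names below are assumptions,
# verify them against the stratio-ingestion docs
agent1.sinks.jdbc-sink.type = com.stratio.ingestion.sink.jdbc.JDBCsink
agent1.sinks.jdbc-sink.channel = jdbc-ch
agent1.sinks.jdbc-sink.connectionString = jdbc:mysql://dbhost:3306/mydb?user=dbuser&password=secret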