Can I have file watcher on HDFS?
Scenario: Files are landing on HDFS continuously. I want to start a Spark job once the number of files reaches a threshold (the threshold could be a number of files or a total size of files).
Is it possible to implement a file watcher on HDFS to achieve this? If yes, can anyone suggest a way to do it? What are the different options available? Can ZooKeeper or Oozie do it?
Any help will be appreciated. Thanks.
Hadoop 2.6 introduced `DFSInotifyEventInputStream`, which you can use for this. You can get an instance of it from `HdfsAdmin` and then just call `.take()` or `.poll()` to get all the events. Event types include delete, append and create, which should cover what you're looking for.

Here's a basic example. Make sure you run it as the `hdfs` user, as the admin interface requires HDFS root.

Here's a blog post that covers it in more detail:
http://johnjianfang.blogspot.com/2015/03/hdfs-6634-inotify-in-hdfs.html?m=1
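A minimal sketch of such a watcher is below. It assumes Hadoop 2.7+, where `take()` returns an `EventBatch` (in 2.6 it returned a single `Event`); the namenode URI, watched directory, and threshold are illustrative placeholders, and the "launch Spark job" step is just a stub you would replace (e.g. with `SparkLauncher`).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.client.HdfsAdmin;
import org.apache.hadoop.hdfs.inotify.DFSInotifyEventInputStream;
import org.apache.hadoop.hdfs.inotify.Event;
import org.apache.hadoop.hdfs.inotify.EventBatch;

import java.net.URI;

public class HdfsThresholdWatcher {
    public static void main(String[] args) throws Exception {
        // Illustrative values -- adjust for your cluster.
        URI nnUri = URI.create("hdfs://namenode:8020");
        String watchedDir = "/data/incoming/";
        int threshold = 100; // trigger once this many new files have landed

        // Requires HDFS superuser privileges (run as the hdfs user).
        HdfsAdmin admin = new HdfsAdmin(nnUri, new Configuration());
        DFSInotifyEventInputStream stream = admin.getInotifyEventStream();

        int newFiles = 0;
        while (true) {
            EventBatch batch = stream.take(); // blocks until events arrive
            for (Event event : batch.getEvents()) {
                if (event.getEventType() == Event.EventType.CREATE) {
                    Event.CreateEvent ce = (Event.CreateEvent) event;
                    if (ce.getPath().startsWith(watchedDir)) {
                        newFiles++;
                    }
                }
            }
            if (newFiles >= threshold) {
                // Placeholder: kick off your Spark job here.
                System.out.println("Threshold reached: " + newFiles + " new files");
                newFiles = 0;
            }
        }
    }
}
```

Note that the inotify stream reports events for the whole namespace, so filtering on the path prefix (as above) is essential when you only care about one landing directory.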
An Oozie coordinator can do this. Coordinator actions can be triggered based on data availability, so write a data-triggered coordinator. Its actions are triggered based on a done-flag, which is nothing but an empty file. So when your threshold is reached, write an empty file into the directory.
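A sketch of such a coordinator definition, as a hedged example (the app name, paths, dates, and done-flag file name are all illustrative; the key part is `<done-flag>`, which makes the coordinator wait until that file exists before submitting the workflow):

```xml
<coordinator-app name="file-threshold-coord" frequency="${coord:minutes(15)}"
                 start="2016-01-01T00:00Z" end="2099-01-01T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="incoming" frequency="${coord:minutes(15)}"
             initial-instance="2016-01-01T00:00Z" timezone="UTC">
      <uri-template>hdfs://namenode:8020/data/incoming</uri-template>
      <!-- The coordinator polls for this empty file; your own process
           writes it once the file-count/size threshold is reached. -->
      <done-flag>_TRIGGER</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="input" dataset="incoming">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <!-- Workflow that submits the Spark job, e.g. via a spark action. -->
      <app-path>hdfs://namenode:8020/apps/spark-workflow</app-path>
    </workflow>
  </action>
</coordinator-app>
```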
Yes, you can do this with HDFS inotify. You just need to get the details of HDFS transactions through the inotify event stream; to get a better understanding, read this link.