I have files in HDFS as:
drwxrwx--- - root supergroup 0 2016-08-19 06:21 /tmp/logs/root/logs/application_1464962104018_1639064
drwxrwx--- - root supergroup 0 2016-08-19 06:21 /tmp/logs/root/logs/application_1464962104018_1639065
New files keep arriving in the /tmp/logs/root/logs/ directory.
I want to find the files created in the last five minutes, relative to the current time, and then copy them to my local machine.
How about this:
Explanation:
List all the files:
Replace extra spaces:
Get the required columns:
Remove non-required rows:
Processing using awk:
Initialize the DIFF duration and current time:
Create a command to get the epoch value for timestamp of the file on HDFS:
Execute the command to get epoch value for HDFS file:
Get the time difference:
Print the output depending upon the difference:
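The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function names filter_recent and copy_recent are made up for illustration, GNU date is assumed (for date -d), and paths are assumed to contain no spaces. Note that hdfs dfs -ls reports the modification time, so that is what stands in for "created" here.

```shell
#!/usr/bin/env bash
# Sketch: select HDFS entries modified within the last MIN minutes.
# Assumptions: GNU date (for `date -d`), the usual 8-column `hdfs dfs -ls`
# layout, and paths without spaces.

MIN=5   # size of the window in minutes

# Reads `hdfs dfs -ls` output on stdin and prints the matching paths.
# $1 = window in minutes; $2 = optional "now" in epoch seconds (for testing).
filter_recent() {
  local min="$1" now="${2:-$(date +%s)}"
  tr -s ' ' |                                 # replace extra spaces
    cut -d' ' -f6-8 |                         # keep date, time, path
    grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2} ' |  # drop other rows
    awk -v min="$min" -v now="$now" '
      BEGIN { diff = min * 60 }               # window in seconds
      {
        # Build and run a command that converts the timestamp to epoch seconds.
        cmd = "date -d \"" $1 " " $2 "\" +%s"
        cmd | getline ts
        close(cmd)
        age = now - ts
        if (age >= 0 && age <= diff) print $3 # print only recent entries
      }'
}

# Copies every recent entry under $1 (HDFS directory) into $2 (local directory).
copy_recent() {
  local src="$1" dest="$2"
  mkdir -p "$dest"
  hdfs dfs -ls "$src" | filter_recent "$MIN" |
    while read -r path; do
      hdfs dfs -get "$path" "$dest"/
    done
}
```

With these definitions loaded, something like copy_recent /tmp/logs/root/logs/ ./recent_logs would do the copy, where ./recent_logs is just an example local directory.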
You just need to change the value of the MIN variable to suit your requirement (here it is 5 minutes). HTH

I have done it using the command below; it gives me the files that were created within a five-minute window:
It can be adapted to use the current timestamp.
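One way such a window filter could look is sketched below. It is only an illustration and not necessarily the command used above: filter_window is a hypothetical name, and the sketch relies on "YYYY-MM-DD HH:MM" timestamps sorting correctly as plain strings. The commented lines at the end assume GNU date for '5 minutes ago'.

```shell
#!/usr/bin/env bash
# Sketch: keep `hdfs dfs -ls` entries whose timestamp lies in [start, end].
# "YYYY-MM-DD HH:MM" strings compare correctly as text, so no epoch
# conversion is needed here.

# Reads `hdfs dfs -ls` output on stdin; $1 = window start, $2 = window end.
filter_window() {
  local start="$1" end="$2"
  awk -v s="$start" -v e="$end" '
    NF >= 8 {                       # skips the "Found N items" header line
      stamp = $6 " " $7             # date and time columns
      if (stamp >= s && stamp <= e) print $8
    }'
}

# Possible use with the current time (GNU date assumed):
#   start=$(date -d "5 minutes ago" "+%Y-%m-%d %H:%M")
#   end=$(date "+%Y-%m-%d %H:%M")
#   hdfs dfs -ls /tmp/logs/root/logs/ | filter_window "$start" "$end"
```

Because the listing only has minute resolution, entries from the first and last minute of the window are included in full.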