I have a directory (Final Dir) in HDFS into which files (e.g. 10 MB each) are loaded every minute. After some time I want to combine all the small files into one large file (e.g. 100 MB). But the user keeps pushing files to Final Dir; it is a continuous process.
So the first time, I need to combine the first 10 files into a large file (e.g. large.txt) and save it to Final Dir.
Now my question is: how do I get the next 10 files, excluding the first 10?
Can someone please help me?
Here is one more alternative. It is still the legacy approach pointed out by @Andrew in his comments, but with extra steps: use your input folder as a buffer that receives the small files, move them to a tmp directory at regular intervals, merge them, and push the result back to input.
step 1: create a tmp directory
step 2: move all the small files to the tmp directory at a point in time
step 3: merge the small files with the help of the hadoop-streaming jar
step 4: move the output to the input folder
step 5: remove output
step 6: remove all the files from tmp
Create a shell script from steps 2 to 6 and schedule it to run at regular intervals to merge the small files (maybe every minute, depending on your needs).
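As a rough sketch of what such a script could contain, assuming the small files end in .txt and live under /user/abc (the HDFS paths, the hadoop location and the streaming jar location are all placeholders I made up, not values from your cluster):

```shell
# Sketch of mergejob.sh covering steps 2 to 6. Every path is a placeholder.
# Step 1 is a one-off done by hand: /usr/bin/hadoop fs -mkdir -p /user/abc/tmp
HADOOP_BIN="${HADOOP_BIN:-/usr/bin/hadoop}"   # absolute path (assumed location)
STREAMING_JAR="${STREAMING_JAR:-/path/to/hadoop-streaming.jar}"  # version-specific
INPUT_DIR="${INPUT_DIR:-/user/abc/input}"
TMP_DIR="${TMP_DIR:-/user/abc/tmp}"
OUTPUT_DIR="${OUTPUT_DIR:-/user/abc/output}"

merge_small_files() {
  # Name the merged file so that the *.txt glob below never picks it up again.
  merged="$INPUT_DIR/large_$(date +%Y%m%d%H%M%S).merged"

  # step 2: move the small files that exist right now into tmp
  # (the quoted glob is expanded by HDFS, not by the local shell).
  "$HADOOP_BIN" fs -mv "$INPUT_DIR/*.txt" "$TMP_DIR/" || return 1

  # step 3: merge them with a streaming job that uses a single reducer.
  "$HADOOP_BIN" jar "$STREAMING_JAR" \
    -D mapreduce.job.reduces=1 \
    -input "$TMP_DIR" -output "$OUTPUT_DIR" \
    -mapper cat -reducer cat || return 1

  # step 4: move the single reducer output back to the input folder.
  "$HADOOP_BIN" fs -mv "$OUTPUT_DIR/part-00000" "$merged" || return 1

  # step 5: remove the output directory; step 6: empty tmp.
  "$HADOOP_BIN" fs -rm -r "$OUTPUT_DIR" || return 1
  "$HADOOP_BIN" fs -rm "$TMP_DIR/*.txt"
}
```

Calling merge_small_files as the last line of the script runs one merge cycle. The merged file gets a name that does not match *.txt on purpose: otherwise the next run would move it to tmp and merge it again.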
Steps to schedule a cron job for merging small files
step 1: create a shell script /home/abc/mergejob.sh from the steps above (2 to 6)
important note: you need to specify the absolute path of hadoop in the script, because cron runs jobs with a minimal environment and will not find it otherwise
step 2: schedule the script to run every minute using a cron expression
a) edit crontab by choosing an editor
b) add the following line at the end and exit from the editor
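A crontab entry matching that description could look like the sketch below. The five asterisks are the schedule fields (minute, hour, day of month, month, day of week) and together mean "every minute"; the script path is the one from step 1. The variable name CRON_ENTRY is only there for illustration.

```shell
# Crontab entry: run the merge script every minute.
CRON_ENTRY='* * * * * /home/abc/mergejob.sh'

# Print the entry so it can be pasted into the editor opened by "crontab -e".
printf '%s\n' "$CRON_ENTRY"
```

For a less aggressive schedule, change the first field, e.g. */5 in the minute position runs the script every five minutes.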
The merge job will then run every minute.
Hope this was helpful.
@Andrew pointed you to a solution that was appropriate 6 years ago, in a batch-oriented world.
But it's 2016, you have a micro-batch data flow running and require a non-blocking solution.
That's how I would do it:
Set up three directories, e.g. new_data, reorg and history.
Feed the new files into new_data.
Now the batch compaction logic:
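Move the files that are ready for compaction from the new_data directory to reorg.
Merge the content of all the reorg files into a new file in the history dir (feel free to GZip it on the fly, Hive will recognize the .gz extension).
Clean out reorg once the merge is done.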
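To make the flow concrete, here is a minimal sketch of one compaction pass done with plain HDFS shell commands. This is a client-side getmerge, gzip and put, not a Hive job, so it only suits modest volumes. The /data/... paths and the compacted-<timestamp>.gz file name are assumptions, and every file currently in new_data is treated as ready for compaction.

```shell
# Placeholder locations; adjust to your cluster layout.
HDFS_BIN="${HDFS_BIN:-hdfs}"
NEW_DATA="${NEW_DATA:-/data/new_data}"
REORG="${REORG:-/data/reorg}"
HISTORY="${HISTORY:-/data/history}"

compact_once() {
  stamp=$(date +%Y%m%d%H%M%S)
  workdir=$(mktemp -d) || return 1

  # 1. Move the current backlog out of the landing directory. The quoted
  #    glob is expanded by HDFS. A failure here usually means there is
  #    nothing to move, so stop without touching anything else.
  "$HDFS_BIN" dfs -mv "$NEW_DATA/*" "$REORG/" || { rm -rf "$workdir"; return 1; }

  # 2. Concatenate everything in reorg into one local file, gzip it,
  #    and upload the result to history.
  "$HDFS_BIN" dfs -getmerge "$REORG" "$workdir/merged" &&
    gzip "$workdir/merged" &&
    "$HDFS_BIN" dfs -put "$workdir/merged.gz" "$HISTORY/compacted-$stamp.gz" ||
    { rm -rf "$workdir"; return 1; }

  # 3. Only after a successful upload, empty reorg.
  "$HDFS_BIN" dfs -rm "$REORG/*"
  status=$?
  rm -rf "$workdir"
  return "$status"
}
```

Because reorg is emptied only after the upload succeeds, a crash in step 2 leaves the source files sitting in reorg, where they can be inspected or merged again on the next pass.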
So it's basically the old 2010 story, except that your existing data flow can continue dumping new files into new_data while the compaction is safely running in separate directories. And in case the compaction job crashes, you can safely investigate / clean up / resume the compaction without compromising the data flow.
By the way, I am not a big fan of the 2010 solution based on a "Hadoop Streaming" job: on one hand, "streaming" has a very different meaning now; on the other hand, "Hadoop streaming" was useful in the old days but is now off the radar; on the gripping hand [*] you can do it quite simply with a Hive query.
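A query of that kind might look like the sketch below. It assumes two Hive tables with identical schemas, here hypothetically named reorg and history, defined over the directories of the same names; the table names and the run_hive_compaction helper are made up for illustration.

```shell
HIVE_BIN="${HIVE_BIN:-hive}"
# Hypothetical Hive tables mapped onto the reorg and history directories.
SOURCE_TABLE="${SOURCE_TABLE:-reorg}"
TARGET_TABLE="${TARGET_TABLE:-history}"

run_hive_compaction() {
  # Append the content of the small files to the target table; Hive writes
  # the result as new file(s) in the target table's directory.
  "$HIVE_BIN" -e "INSERT INTO TABLE ${TARGET_TABLE} SELECT * FROM ${SOURCE_TABLE};"
}
```

Calling run_hive_compaction between the "move to reorg" and "clean out reorg" steps does the merge without any custom MapReduce or streaming code.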
With a couple of SET some.property = somevalue statements before that query, you can define what compression codec will be applied to the result file(s), how many file(s) you want (or more precisely, how big you want the files to be; Hive will run the merge accordingly), etc.
Look into https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties under hive.merge.mapfiles and hive.merge.mapredfiles (or hive.merge.tezfiles if you use TEZ) and hive.merge.smallfiles.avgsize, and then hive.exec.compress.output and mapreduce.output.fileoutputformat.compress.codec, plus hive.hadoop.supports.splittable.combineinputformat to reduce the number of Map containers since your input files are quite small.
[*] very old SF reference here :-)
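For illustration, the settings named above could be collected in a small helper like this one. The property names are the ones listed; the values (a 100 MB average-size threshold, GZip as the codec) and the helper name compaction_settings are assumptions to adapt, so check the Configuration Properties page for defaults and exact semantics in your Hive version.

```shell
# Prints the SET statements to put in front of the compaction query.
# Values are illustrative assumptions, not recommendations.
compaction_settings() {
  cat <<'EOF'
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.smallfiles.avgsize=104857600;
SET hive.exec.compress.output=true;
SET mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.GzipCodec;
SET hive.hadoop.supports.splittable.combineinputformat=true;
EOF
}
```

One way to use it is hive -e "$(compaction_settings) <your query>". On TEZ, swap the hive.merge.mapredfiles line for hive.merge.tezfiles.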