Running periodic Hadoop jobs (best practice)

Posted 2019-02-19 19:54

Customers can upload URLs to the database at any time, and the application should process those URLs as soon as possible. So I need Hadoop jobs to run periodically, or a Hadoop job to be triggered automatically from another application (a script identifies that new links were added, generates the data for the Hadoop job, and runs the job). For a PHP or Python script I could set up a cron job, but what is the best practice for running Hadoop jobs periodically (prepare the data for Hadoop, upload the data, run the Hadoop job, and move the data back to the database)?
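For context, here is a minimal sketch of the cron-driven pipeline I have in mind; the table schema, HDFS paths, and job jar are placeholders, not a working setup:

```python
#!/usr/bin/env python
"""Cron-driven pipeline sketch: dump new urls, run the job, load results back."""
import sqlite3
import subprocess
import tempfile

# All of these are placeholders for illustration.
DB_PATH = "app.db"
HDFS_IN = "/jobs/urls/input/batch.txt"
HDFS_OUT = "/jobs/urls/output"
JOB_JAR = "url-processor.jar"

def main():
    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute("SELECT id, url FROM urls WHERE processed = 0").fetchall()
    if not rows:
        return  # nothing new yet; cron will call us again

    # 1. prepare data: dump the new urls to a local file
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for url_id, url in rows:
            f.write("%s\t%s\n" % (url_id, url))
        local_path = f.name

    # 2. upload the data to HDFS
    subprocess.run(["hadoop", "fs", "-put", "-f", local_path, HDFS_IN], check=True)

    # 3. run the hadoop job (remove stale output first, or the job will fail)
    subprocess.run(["hadoop", "fs", "-rm", "-r", "-f", HDFS_OUT], check=True)
    subprocess.run(["hadoop", "jar", JOB_JAR, HDFS_IN, HDFS_OUT], check=True)

    # 4. move the results back to the database
    result = subprocess.run(["hadoop", "fs", "-cat", HDFS_OUT + "/part-*"],
                            check=True, capture_output=True, text=True)
    # ... parse result.stdout and insert it into a results table here ...
    conn.executemany("UPDATE urls SET processed = 1 WHERE id = ?",
                     [(url_id,) for url_id, _ in rows])
    conn.commit()

if __name__ == "__main__":
    main()
```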

Tags: hadoop cloud
2 answers
ら.Afraid
#2 · 2019-02-19 20:10

If you want URLs to be processed as soon as possible, you'd have to process them one at a time. My recommendation is instead to wait for some number of links (or some megabytes of links, or a fixed interval such as 10 minutes or a day) and then batch-process them. I do my processing daily, but that job takes a few hours.
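A minimal sketch of that batching check, which a cron script could run before kicking off the job; the thresholds and the `urls` table with its `processed` flag are assumptions for illustration:

```python
import time
import sqlite3  # stand-in for whatever database holds the uploaded urls

# Illustrative thresholds: batch by link count or by elapsed time.
MIN_BATCH_SIZE = 1000     # links
MAX_WAIT_SECONDS = 600    # the "10 min" option

def should_run_batch(conn, last_run_ts):
    # Count links that have not been processed yet (schema is hypothetical).
    pending = conn.execute(
        "SELECT COUNT(*) FROM urls WHERE processed = 0").fetchone()[0]
    waited_long_enough = time.time() - last_run_ts >= MAX_WAIT_SECONDS
    # Run when the batch is big enough, or when anything has waited too long.
    return pending >= MIN_BATCH_SIZE or (pending > 0 and waited_long_enough)
```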

放我归山
#3 · 2019-02-19 20:32

Take a look at Oozie, the new workflow system from Y! (Yahoo!), which can run jobs based on different triggers, such as time or data availability. A good overview is presented by Alejandro here: http://www.slideshare.net/ydn/5-oozie-hadoopsummit2010
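Once you have written a coordinator and uploaded its files to HDFS, you can submit it from a script via the Oozie CLI; the server URL and properties file name below are placeholders:

```python
import subprocess

# Placeholder Oozie server URL; coordinator.xml is assumed to already be on
# HDFS, with its path configured in coordinator.properties.
OOZIE_URL = "http://oozie-host:11000/oozie"

def submit_coordinator(properties_file="coordinator.properties"):
    # `oozie job ... -run` submits and starts the job in one step; the
    # coordinator then fires the workflow on its time or data triggers.
    subprocess.run(
        ["oozie", "job", "-oozie", OOZIE_URL,
         "-config", properties_file, "-run"],
        check=True)

if __name__ == "__main__":
    submit_coordinator()
```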
