For the development of an object recognition algorithm, I need to repeatedly run a detection program on a large set of volumetric image files (MR scans). The detection program is a command line tool. If I run it on my local computer on a single file and single-threaded it takes about 10 seconds. Processing results are written to a text file. A typical run would be:
- 10,000 images at 300 MB each = 3 TB
- 10 seconds per image on a single core = 100,000 seconds ≈ 28 hours
What can I do to get the results faster? I have access to a cluster of 20 servers with 24 (virtual) cores each (Xeon E5, 1 TB disks, CentOS Linux 7.2). Theoretically, the 480 cores should need only about 3.5 minutes for the task. I am considering using Hadoop, but it's not designed for processing binary data, and it splits input files, which is not an option. I probably need some kind of distributed file system. I tested NFS, and the network becomes a serious bottleneck. Each server should only process its locally stored files. The alternative might be to buy a single high-end workstation and forget about distributed processing.
I am not certain if we need data locality, i.e. each node holding part of the data on a local HD and processing only its local data.
I regularly run large-scale distributed calculations on AWS using Spot Instances. You should definitely use the cluster of 20 servers at your disposal.
Your servers run CentOS, so they're Linux-based and your best friend is bash. You're also lucky that it's a command line program. This means you can use ssh to run commands directly on the Slaves from one Master node.
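For example, a single remote invocation might look like this (the `detect` binary, the `user` account, and all paths here are hypothetical placeholders for your own tool and layout):

```bash
# Run the detection tool on one scan stored locally on slave01;
# the result text file lands on that same machine.
ssh user@slave01 '/opt/detect/detect /data/scans/scan_0001.nii > /data/results/scan_0001.txt'
```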
The typical sequence of processing would be:

1. Distribute the image files across the Slaves' local disks (e.g. 500 files per server), so each machine later reads only from its own disk.
2. Run the detection program on every Slave against its local files, several processes in parallel.
3. Copy the result text files back to the Master and merge them.
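A minimal Master-side driver along these lines might look like the sketch below; the hostnames (slave01..slave20), the shared `user` account, and all paths are assumptions to adapt:

```bash
#!/bin/bash
# Master-side driver sketch: distribute, run, collect.
hosts=($(printf 'slave%02d ' $(seq 1 20)))

# 1. Distribute (one-time, slow): spread the scans evenly over the
#    Slaves' local disks in round-robin fashion.
i=0
for f in /pool/scans/*; do
  scp -q "$f" "user@${hosts[i % ${#hosts[@]}]}:/data/scans/"
  i=$((i + 1))
done

# 2. Run: start the per-node worker script (shown further below) on
#    every Slave in parallel and wait for all of them to finish.
for h in "${hosts[@]}"; do
  ssh "user@$h" 'bash /data/run_detection.sh' &
done
wait

# 3. Collect: pull the result text files back to the Master.
for h in "${hosts[@]}"; do
  scp -q "user@$h:/data/results/*.txt" /pool/results/
done
```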
To get started, you'll need ssh access to all the Slaves from the Master. You can then scp files to each Slave, including the worker script itself. If you're running on a private network, you don't have to be too concerned about security, so the simplest route is passwordless key-based ssh, which also lets your scripts run unattended.
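A minimal setup could look like this, assuming the same `user` account exists on every machine (hostnames again placeholders):

```bash
# On the Master: create a key pair once (empty passphrase, so scripted
# ssh/scp calls never prompt), then push the public key to every Slave.
ssh-keygen -t ed25519
for h in $(printf 'slave%02d ' $(seq 1 20)); do
  ssh-copy-id "user@$h"
  # Copy the worker script to each Slave while we're at it.
  scp run_detection.sh "user@$h:/data/"
done
```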
In terms of CPU cores, if the command line program you're using isn't designed for multi-core, you can just run several processes on each Slave. The best thing to do is run a few tests to find the optimal number of processes, given that too many might be slow due to insufficient memory, disk contention or similar. But say you find that 12 simultaneous processes give the fastest average time: then have each Slave run 12 at once, as in the worker sketch below.
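On each Slave, a small worker script along these lines (the hypothetical `run_detection.sh` referenced above) keeps the chosen number of detections running; `xargs -P` does the process pooling:

```bash
#!/bin/bash
# run_detection.sh -- process every locally stored scan, 12 at a time.
# -P 12 is the knob to tune with your benchmarks.
mkdir -p /data/results
ls /data/scans | xargs -P 12 -I{} \
  sh -c '/opt/detect/detect /data/scans/{} > /data/results/{}.txt'
```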
It's not a small job to get it all working, but once it's done, you'll forever be able to process these batches in a fraction of the time.
You can use Hadoop. Yes, the default implementations of FileInputFormat and RecordReader split files into chunks and chunks into lines, but you can write your own implementations of FileInputFormat and RecordReader. I've created a custom FileInputFormat for another purpose (I had the opposite problem: splitting input data more finely than the default), but there are good-looking recipes for exactly your problem: https://gist.github.com/sritchie/808035 plus https://www.timofejew.com/hadoop-streaming-whole-files/
On the other hand, Hadoop is a heavy beast. It has significant overhead for mapper start-up, so the optimal running time for a mapper is a few minutes, and your 10-second tasks are too short for that. Maybe it is possible to create a cleverer FileInputFormat that interprets a bunch of files as a single file and feeds the files as records to the same mapper; I'm not sure.
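One hedged way around both obstacles (binary inputs that must not be split, and tasks that are too short) is Hadoop Streaming fed with a list of file paths instead of the files themselves, in the spirit of the second link above. The sketch below uses the stock NLineInputFormat; the `detect` tool, the HDFS paths, and the assumption that the tool can print results to stdout are all placeholders:

```bash
# Create the mapper script. Streaming hands each mapper its records as
# "key<TAB>value" lines; with NLineInputFormat the key is a byte offset
# and the value is one path from the input list.
cat > detect_mapper.sh <<'EOF'
#!/bin/bash
while IFS=$'\t' read -r offset path; do
  tmp=$(mktemp -d)
  hdfs dfs -get "$path" "$tmp/scan"   # fetch one scan from HDFS
  /opt/detect/detect "$tmp/scan"      # hypothetical tool, prints to stdout
  rm -rf "$tmp"
done
EOF
chmod +x detect_mapper.sh

# The job input is a plain text file listing the HDFS paths of the scans,
# one per line -- NOT the 300 MB scans themselves, so nothing gets split.
hdfs dfs -put scan_list.txt /input/scan_list.txt

# Map-only streaming job; linespermap=20 batches 20 scans per mapper.
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -D mapreduce.input.lineinputformat.linespermap=20 \
  -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
  -input /input/scan_list.txt \
  -output /output/detections \
  -mapper detect_mapper.sh \
  -file detect_mapper.sh \
  -numReduceTasks 0
```

With linespermap=20, each mapper does roughly 200 seconds of detection work, which is in the few-minutes range where the start-up overhead stops dominating. Note the scans still have to travel from HDFS to the mapper, so the data-locality question from the original post doesn't go away.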